Hello,
First of all, thank you for sharing this exciting and thought-provoking work on BriLLM. The brain-inspired SiFu paradigm is a very novel approach to language modeling, and I appreciate you open-sourcing the code.
I've been studying the paper and the accompanying code in this repository, and I have a few questions to help me better understand the implementation, especially regarding some key claims made in the paper. I would be very grateful if you could provide some clarification.
Here are my main questions:
1. Unbounded Context Length vs. Hardcoded Limit
The paper repeatedly claims that the model supports "unbounded context processing" and is "context-independent." This is one of its most compelling features.
However, when reviewing the code, I noticed what appears to be a hardcoded maximum sequence length of 512. Specifically:
- In `BraLM.__init__`, the aggregation weights parameter is initialized with a fixed size: `self.positions = nn.Parameter(torch.ones(1, 512, 1))`
- In `BraLM.forward` and `BraLM.decode`, the positional encoding is generated for a fixed length: `pe = self.get_positional_encoding(512, self.hidden_size)`
Question: Could you clarify whether the current implementation is indeed limited to a 512-token context? Was this a practical constraint for the initial experiments, and are there plans to support arbitrary or longer sequences in a future version?
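For concreteness, here is a minimal, self-contained sketch of what I understand the constraint to be. It is not the repository's code: the class name `BraLMSketch`, the `max_len` argument, and the sinusoidal implementation of `get_positional_encoding` are my own assumptions, kept only to show how a 512 fixed at construction time caps both the aggregation weights and the positional encoding.

```python
import torch
import torch.nn as nn


class BraLMSketch(nn.Module):
    """Illustrative stand-in for BraLM; only the parts relevant to the 512 limit."""

    def __init__(self, hidden_size=32, max_len=512):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_len = max_len
        # Aggregation weights: their length is fixed at construction time.
        self.positions = nn.Parameter(torch.ones(1, max_len, 1))

    @staticmethod
    def get_positional_encoding(length, d_model):
        # Assumed sinusoidal encoding; stands in for the repo's helper of the same name.
        pos = torch.arange(length, dtype=torch.float).unsqueeze(1)
        div = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float)
            * (-torch.log(torch.tensor(10000.0)) / d_model)
        )
        pe = torch.zeros(length, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, seq_len):
        # Both the encoding and the aggregation weights only cover max_len positions,
        # so anything beyond index 511 has no corresponding parameter or encoding.
        pe = self.get_positional_encoding(self.max_len, self.hidden_size)
        if seq_len > self.max_len:
            raise ValueError(f"length {seq_len} exceeds the fixed limit {self.max_len}")
        return pe[:seq_len]


model = BraLMSketch()
print(model(10).shape)   # torch.Size([10, 32])
# model(1024)            # would raise: beyond the 512-slot parameter / encoding
```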
2. Implementation of the Aggregation Weights (α)
The paper is somewhat ambiguous about how the variable-length aggregation weight vector α ∈ ℝ^(L-1) is implemented in a model with a fixed number of parameters, and the code seems to provide the answer.
Based on this line in the forward pass:
`energy_tensor = (energy_cache * self.positions[:, :i, :].softmax(1)).sum(1, keepdim=True)`

It appears that the aggregation weights α are not dynamically generated, but are in fact the learnable `nn.Parameter` named `self.positions` (which has a fixed size of 512).
Question: Can you confirm that this is the correct interpretation? If so, this is a very important implementation detail. It implies that the model's ability to handle context is tied to the predefined size of this parameter.
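To make sure I am reading that line correctly, here is a small standalone snippet (the shapes, the step count `i = 7`, and the random energy cache are purely illustrative) that reproduces the aggregation as a softmax over a slice of a fixed 512-slot parameter:

```python
import torch
import torch.nn as nn

hidden_size = 32
# Fixed-size learnable parameter, as in BraLM.__init__.
positions = nn.Parameter(torch.ones(1, 512, 1))

# Pretend we are at decoding step i = 7 with cached energies for the 7 previous steps.
energy_cache = torch.randn(1, 7, hidden_size)
i = energy_cache.size(1)

# alpha is just the first i entries of the 512-slot parameter, softmax-normalized.
alpha = positions[:, :i, :].softmax(1)                        # shape (1, i, 1)
energy_tensor = (energy_cache * alpha).sum(1, keepdim=True)   # shape (1, 1, hidden_size)

print(alpha.squeeze())        # i weights summing to 1
print(energy_tensor.shape)    # torch.Size([1, 1, 32])
```

If this matches the real code path, then α at step i is simply the first i entries of `self.positions` after softmax, so no context position beyond the 512th can ever receive a weight.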
3. Computational Complexity
The paper discusses the model's advantages in terms of complexity, contrasting its O(1) model size (constant with respect to context length) with the Transformer's O(L^2) computational complexity.
However, looking at the `forward` method's main loop, it seems the total computation for generating a sequence of length L is also O(L^2). The loop runs L times, and inside each step i, the aggregation operation `(energy_cache * ...).sum(1)` takes O(i) time.
Question: Is my understanding of the O(L^2) computational complexity for a forward pass correct? If so, could you elaborate on the model's efficiency advantages, perhaps in comparison to optimized Transformer variants?
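For reference, here is a quick back-of-the-envelope count (purely illustrative, not code from the repository) of the aggregation work under this reading:

```python
# If step i performs an O(i)-sized weighted sum over the energy cache, the total
# work for a length-L sequence is 1 + 2 + ... + L = L * (L + 1) / 2, i.e. O(L^2).
L = 512
total_aggregation_terms = sum(range(1, L + 1))
print(total_aggregation_terms)   # 131328
print(L * (L + 1) // 2)          # 131328, matches the closed form
```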
Thank you again for your pioneering work and for making the code available. Any insights you could provide on these points would be extremely helpful for the community to fully understand BriLLM's architecture and its current capabilities.
Best regards.