Hello,
First of all, thank you for sharing this exciting and thought-provoking work on BriLLM. The brain-inspired SiFu paradigm is a very novel approach to language modeling, and I appreciate you open-sourcing the code.
I've been studying the paper and the accompanying code in this repository, and I have a few questions to help me better understand the implementation, especially regarding some key claims made in the paper. I would be very grateful if you could provide some clarification.
Here are my main questions:
1. Unbounded Context Length vs. Hardcoded Limit
The paper repeatedly claims that the model supports "unbounded context processing" and is "context-independent." This is one of its most compelling features.
However, when reviewing the code, I noticed what appears to be a hardcoded maximum sequence length of 512. Specifically:
- In `BraLM.__init__`, the aggregation weights parameter is initialized with a fixed size: `self.positions = nn.Parameter(torch.ones(1, 512, 1))`
- In `BraLM.forward` and `BraLM.decode`, the positional encoding is generated for a fixed length: `pe = self.get_positional_encoding(512, self.hidden_size)`
Question: Could you clarify whether the current implementation is indeed limited to a 512-token context? Was this a practical constraint for the initial experiments, and are there plans to support arbitrary or longer sequences in a future version?
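For concreteness, here is a minimal, self-contained sketch of what I understand the constraint to be. It is not the repository's code: the class name `BraLMSketch`, the `max_len` argument, and the sinusoidal implementation of `get_positional_encoding` are my own assumptions, kept only to show how a 512 fixed at construction time caps both the aggregation weights and the positional encoding.

```python
import torch
import torch.nn as nn


class BraLMSketch(nn.Module):
    """Illustrative stand-in for BraLM; only the parts relevant to the 512 limit."""

    def __init__(self, hidden_size=32, max_len=512):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_len = max_len
        # Aggregation weights: their length is fixed at construction time.
        self.positions = nn.Parameter(torch.ones(1, max_len, 1))

    @staticmethod
    def get_positional_encoding(length, d_model):
        # Assumed sinusoidal encoding; stands in for the repo's helper of the same name.
        pos = torch.arange(length, dtype=torch.float).unsqueeze(1)
        div = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float)
            * (-torch.log(torch.tensor(10000.0)) / d_model)
        )
        pe = torch.zeros(length, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, seq_len):
        # Both the encoding and the aggregation weights only cover max_len positions,
        # so anything beyond index 511 has no corresponding parameter or encoding.
        pe = self.get_positional_encoding(self.max_len, self.hidden_size)
        if seq_len > self.max_len:
            raise ValueError(f"length {seq_len} exceeds the fixed limit {self.max_len}")
        return pe[:seq_len]


model = BraLMSketch()
print(model(10).shape)   # torch.Size([10, 32])
# model(1024)            # would raise: beyond the 512-slot parameter / encoding
```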
2. Implementation of the Aggregation Weights (α)
The paper is somewhat ambiguous about how the variable-length aggregation weight vector α ∈ ℝ^(L-1) is implemented in a model with a fixed number of parameters, and the code seems to provide the answer.
Based on this line in the forward pass:
`energy_tensor = (energy_cache * self.positions[:, :i, :].softmax(1)).sum(1, keepdim=True)`

It appears that the aggregation weights α are not dynamically generated, but are in fact the learnable `nn.Parameter` named `self.positions` (which has a fixed size of 512).
Question: Can you confirm that this is the correct interpretation? If so, this is a very important implementation detail. It implies that the model's ability to handle context is tied to the predefined size of this parameter.
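To make sure I am reading that line correctly, here is a small standalone snippet (the shapes, the step count `i = 7`, and the random energy cache are purely illustrative) that reproduces the aggregation as a softmax over a slice of a fixed 512-slot parameter:

```python
import torch
import torch.nn as nn

hidden_size = 32
# Fixed-size learnable parameter, as in BraLM.__init__.
positions = nn.Parameter(torch.ones(1, 512, 1))

# Pretend we are at decoding step i = 7 with cached energies for the 7 previous steps.
energy_cache = torch.randn(1, 7, hidden_size)
i = energy_cache.size(1)

# alpha is just the first i entries of the 512-slot parameter, softmax-normalized.
alpha = positions[:, :i, :].softmax(1)                        # shape (1, i, 1)
energy_tensor = (energy_cache * alpha).sum(1, keepdim=True)   # shape (1, 1, hidden_size)

print(alpha.squeeze())        # i weights summing to 1
print(energy_tensor.shape)    # torch.Size([1, 1, 32])
```

If this matches the real code path, then α at step i is simply the first i entries of `self.positions` after softmax, so no context position beyond the 512th can ever receive a weight.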
3. Computational Complexity
The paper discusses the model's advantages in terms of complexity, contrasting its O(1) model size (constant with respect to context length) with the Transformer's O(L^2) computational complexity.
However, looking at the `forward` method's main loop, it seems the total computation for generating a sequence of length L is also O(L^2). The loop runs L times, and inside each step i, the aggregation operation `(energy_cache * ...).sum(1)` takes O(i) time.
Question: Is my understanding of the O(L^2) computational complexity for a forward pass correct? If so, could you elaborate on the model's efficiency advantages, perhaps in comparison to optimized Transformer variants?
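For reference, here is a quick back-of-the-envelope count (purely illustrative, not code from the repository) of the aggregation work under this reading:

```python
# If step i performs an O(i)-sized weighted sum over the energy cache, the total
# work for a length-L sequence is 1 + 2 + ... + L = L * (L + 1) / 2, i.e. O(L^2).
L = 512
total_aggregation_terms = sum(range(1, L + 1))
print(total_aggregation_terms)   # 131328
print(L * (L + 1) // 2)          # 131328, matches the closed form
```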
Thank you again for your pioneering work and for making the code available. Any insights you could provide on these points would be extremely helpful for the community to fully understand BriLLM's architecture and its current capabilities.
Best regards.