Questions and Clarification on Implementation vs. Paper Claims #1

@ykx3

Hello,

First of all, thank you for sharing this exciting and thought-provoking work on BriLLM. The brain-inspired SiFu paradigm is a genuinely novel approach to language modeling, and I appreciate you open-sourcing the code.

I've been studying the paper and the accompanying code in this repository, and I have a few questions to help me better understand the implementation, especially regarding some key claims made in the paper. I would be very grateful if you could provide some clarification.

Here are my main questions:

1. Unbounded Context Length vs. Hardcoded Limit

The paper repeatedly claims that the model supports "unbounded context processing" and is "context-independent." This is one of its most compelling features.

However, when reviewing the code, I noticed what appears to be a hardcoded maximum sequence length of 512. Specifically:

  • In BraLM.__init__, the aggregation weights parameter is initialized with a fixed size: self.positions = nn.Parameter(torch.ones(1, 512, 1))
  • In BraLM.forward and BraLM.decode, the positional encoding is generated for a fixed length: pe = self.get_positional_encoding(512, self.hidden_size)

Question: Could you clarify if the current implementation is indeed limited to a 512-token context? Was this a practical constraint for the initial experiments, and are there plans to support arbitrary or longer sequences in a future version?
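
To make the concern concrete, here is a minimal, self-contained sketch of how such a fixed-size parameter and fixed-length positional encoding cap the usable context. This is not the repository code; ToyBraLM, hidden_size, and max_len are illustrative names I chose for the sketch:

```python
import torch
import torch.nn as nn

# Minimal sketch, NOT the repository code: ToyBraLM, hidden_size and
# max_len are illustrative. It only shows how a fixed-size `positions`
# parameter plus a 512-position encoding cap the usable context.
class ToyBraLM(nn.Module):
    def __init__(self, hidden_size=32, max_len=512):
        super().__init__()
        self.hidden_size = hidden_size
        self.max_len = max_len
        # fixed-size aggregation weights, analogous to self.positions
        self.positions = nn.Parameter(torch.ones(1, max_len, 1))

    def get_positional_encoding(self, length, dim):
        # standard sinusoidal encoding (assumed here), shape (length, dim)
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                        * (-torch.log(torch.tensor(10000.0)) / dim))
        pe = torch.zeros(length, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, seq_len):
        # positions beyond max_len have neither an aggregation weight
        # nor a positional encoding, so longer contexts cannot be used
        assert seq_len <= self.max_len, "context is capped at max_len"
        pe = self.get_positional_encoding(self.max_len, self.hidden_size)
        return pe[:seq_len]

model = ToyBraLM()
print(model(seq_len=512).shape)   # torch.Size([512, 32])
# model(seq_len=1024)             # would trip the assertion
```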

2. Implementation of the Aggregation Weights (α)

The paper is somewhat ambiguous about how the variable-length aggregation weight vector α ∈ ℝ^(L-1) can be implemented in a model with a fixed number of parameters. The code seems to provide the answer.

Based on this line in the forward pass:

```python
energy_tensor = (energy_cache * self.positions[:, :i, :].softmax(1)).sum(1, keepdim=True)
```

It appears that the aggregation weights α are not dynamically generated, but are in fact the learnable nn.Parameter named self.positions (which has a fixed size of 512).

Question: Can you confirm that this is the correct interpretation? If so, this is a very important implementation detail. It implies that the model's ability to handle context is tied to the predefined size of this parameter.
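
For clarity, here is a small standalone sketch of how I am reading that line; batch, hidden, and the step index i are hypothetical values chosen only for illustration:

```python
import torch
import torch.nn as nn

# Standalone sketch of my reading of the aggregation step (hypothetical shapes).
batch, hidden, i = 2, 32, 7

positions = nn.Parameter(torch.ones(1, 512, 1))   # fixed-size learnable weights
energy_cache = torch.randn(batch, i, hidden)      # energies from the i previous steps

# alpha is simply the first i entries of `positions`, normalized by softmax,
# so its effective length can never exceed the predefined 512 slots.
alpha = positions[:, :i, :].softmax(dim=1)                    # shape (1, i, 1)
energy_tensor = (energy_cache * alpha).sum(1, keepdim=True)   # shape (batch, 1, hidden)
print(energy_tensor.shape)  # torch.Size([2, 1, 32])
```

If that reading is right, α is effectively a learned, position-indexed weighting truncated to the current step, rather than a function of the sequence content.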

3. Computational Complexity

The paper discusses the model's advantages in terms of complexity, contrasting its O(1) model size complexity with the Transformer's O(L^2) computational complexity.

However, looking at the forward method's main loop, it seems the total computation for generating a sequence of length L is also O(L^2). The loop runs L times, and at each step i the aggregation operation (energy_cache * ...).sum(1) costs O(i), which sums to O(L^2) over the whole sequence.

Question: Is my understanding of the O(L^2) computational complexity for a forward pass correct? If so, could you elaborate on the model's efficiency advantages, perhaps in comparison to optimized Transformer variants?
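
A quick back-of-the-envelope check of that accumulation, assuming each step i pays O(i) for the weighted sum over the energy cache:

```python
# The per-step costs 1 + 2 + ... + L sum to L(L + 1) / 2, i.e. O(L^2)
# for a single forward pass over a length-L sequence.
def total_aggregation_ops(L):
    return sum(i for i in range(1, L + 1))

for L in (128, 512, 1024):
    print(L, total_aggregation_ops(L), L * (L + 1) // 2)
```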


Thank you again for your pioneering work and for making the code available. Any insights you could provide on these points would be extremely helpful for the community to fully understand BriLLM's architecture and its current capabilities.

Best regards.
