@@ -367,19 +367,34 @@ Lets say we have input tokens embeddings for the sequence "Dan loves ice cream":
367367Imagine this as a smooth graph which would be the continuous space. But we only
368368have access to discrete points (the tokens). We can use Zero Order Hold (ZOH)
369369for this. Zero order here means that there is no change in the value between samples.
370+ The interval between samples is called the time step (delta, Δ). It is
371+ learned by the model, and in Mamba it also depends on the current token.
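To make the "hold" part concrete, here is a minimal sketch of zero-order hold on a handful of samples. The function name `zero_order_hold`, the sample values, and the time step are all made up for illustration; the only assumption is that sample `k` is taken at time `k * delta` and held until the next sample.

```python
import math

def zero_order_hold(samples, delta, t):
    """Value of the held signal at continuous time t.

    Assumes samples[k] was taken at time k * delta and is held constant
    on the interval [k * delta, (k + 1) * delta).
    """
    k = math.floor(t / delta)
    k = max(0, min(k, len(samples) - 1))  # outside the range, hold the first/last sample
    return samples[k]

samples = [1.0, 3.0, 2.0]  # hypothetical sampled values
delta = 0.5                # hypothetical time step
held = [zero_order_hold(samples, delta, t) for t in (0.0, 0.25, 0.5, 0.9, 1.0, 1.4)]
print(held)
```

The result is a staircase: between two sample times the value does not change, which is what "zero order" refers to.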
370372
373+ So we have the token embeddings as points, which is the input to the mamba
374+ layer. These are discrete values, but mamba's state is a continuous system:
375+ like an analog system it expects a continuous signal, so in principle the
376+ input would need to be converted to such a signal.
377+ In practice we don't transform the input tokens into a continuous signal.
378+ Instead we transform the parameters A and B of the state space model into
379+ discrete values and perform the selective scan using them. This is called discretization.
371380
372- So we will first discretize the parameters A, and B of the state space model,
373- which means that we will convert them from continuous values to discrete values.
381+ To clarify this a bit more:
382+ * We have discrete token embeddings (already discrete).
383+ * We have a continuous SSM formulation: dh/dt = Ah(t) + Bx(t)
384+ * We need a way to apply this continuous SSM to discrete token embeddings.
374385
375- I think there are multiple methods/ways to do this but the paper mentions
376- the zero-order hold transform method which is a method for converting a
377- descrite time signal to continous time signal (the inner space).
386+ So we discretize A and B.
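As a rough sketch of what discretizing A and B can look like, the snippet below applies the zero-order hold formulas to a diagonal A, one state dimension at a time. The names `discretize_zoh` and `run_recurrence` are hypothetical, and so are the numbers. It also assumes a single fixed Δ and a scalar input per step, which is a simplification of what Mamba does.

```python
import math

def discretize_zoh(a, b, delta):
    """Zero-order hold discretization, per state dimension.

    a, b: lists with the diagonal of A and the entries of B (a[i] != 0 assumed).
    Returns (a_d, b_d) with a_d[i] = exp(delta * a[i]) and
    b_d[i] = (exp(delta * a[i]) - 1) / a[i] * b[i].
    """
    a_d = [math.exp(delta * ai) for ai in a]
    b_d = [(adi - 1.0) / ai * bi for adi, ai, bi in zip(a_d, a, b)]
    return a_d, b_d

def run_recurrence(a_d, b_d, xs):
    """Apply h_t = A_d * h_{t-1} + B_d * x_t over a scalar input sequence xs."""
    h = [0.0] * len(a_d)
    states = []
    for x in xs:
        h = [adi * hi + bdi * x for adi, hi, bdi in zip(a_d, h, b_d)]
        states.append(h)
    return states

# Hypothetical values: two state dimensions, constant input of 1.0.
a_d, b_d = discretize_zoh(a=[-1.0, -0.5], b=[1.0, 1.0], delta=0.1)
states = run_recurrence(a_d, b_d, xs=[1.0, 1.0, 1.0])
print(a_d, b_d)
print(states[-1])
```

With a constant input, the discrete states should land on the exact solution of dh/dt = Ah(t) + Bx(t) at each multiple of Δ, since zero-order hold treats the input as constant within a step.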
387+
388+
389+ So we will first discretize the parameters A and B of the state space model,
390+ which means that we will convert them from continuous values to discrete
391+ values. Using zero-order hold, A becomes:
392+ ``` console
393+ A_d = exp(A * Δ)
394+ ```
395+
396+ B is discretized together with A, using the same time step Δ.
378397
379- So we have the tokens embeddings as points which is the input we have in to the
380- mamba2 layer. This is in discrete points/values, but mambas state is a
381- continuous, similar to a system that needs a continuous signal, the mamba system
382- operates on an analog signal. So it needs to be converted to such a signal.
383398
384399So instead of using functions as shown above, we use concrete values: we
385400transform A and B into discrete values and the equations become: