@@ -349,6 +349,25 @@ as a continuous space of information. When we process tokens they are in
discrete values. The A (state transition), B (input transition), and C
(output transition) matrices operate in the continuous space.

+ Let's say we have input token embeddings for the sequence "Dan loves ice cream":
+ ```
+    ^
+    |
+    |
+    |       ice *-----+
+    |           |     |
+ Dan*-----+     |cream*-----
+    |     |     |
+    |loves*-----+
+    |
+    +-----|-----|-----|-----|----->
+
+             time (t)
+ ```
+ Imagine this as a smooth curve, which would be the continuous space. But we only
+ have access to discrete points (the tokens). We can use Zero-Order Hold (ZOH) for
+ this. "Zero order" here means that the value does not change between samples.
+
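To make the "hold" idea concrete, here is a minimal sketch of zero-order hold applied to a sampled signal. It is only an illustration: the function name `zero_order_hold`, the one-dimensional values standing in for the token embeddings, and the sampling interval of 1.0 are all assumptions made for this example and do not come from the Mamba code.

```python
def zero_order_hold(samples, dt, t):
    """Return the held value at continuous time t for samples taken every dt."""
    if t < 0:
        raise ValueError("t must be non-negative")
    # Index of the most recent sample at or before time t; the last sample
    # is held indefinitely once we run past the end of the sequence.
    k = min(int(t // dt), len(samples) - 1)
    return samples[k]


# Hypothetical 1-D stand-ins for the embeddings of the four tokens,
# chosen to roughly follow the heights in the diagram.
tokens = ["Dan", "loves", "ice", "cream"]
values = [0.5, 0.2, 0.9, 0.5]

# Query the "continuous" signal at times that fall between the samples.
held = [zero_order_hold(values, 1.0, t) for t in (0.0, 0.5, 1.0, 1.9, 2.5, 3.0)]
```

Between two samples the function keeps returning the earlier value, which produces the staircase shape in the diagram.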
So we will first discretize the parameters A and B of the state space model,
which means that we will convert them from continuous values to discrete values.
@@ -357,6 +376,11 @@ I think there are multiple methods/ways to do this but the paper mentions
the zero-order hold transform method, which is a method for converting a
discrete-time signal to a continuous-time signal (the inner space).

+ So we have the token embeddings as points, which are the input to the
+ mamba2 layer. These are discrete points/values, but Mamba's state is
+ continuous. Similar to a system that needs a continuous signal, the Mamba system
+ operates on an analog signal, so the input needs to be converted to such a signal.
+
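As a rough illustration of what discretizing A and B with zero-order hold involves, the sketch below uses a scalar system h'(t) = a*h(t) + b*x(t) instead of full matrices. It applies the standard ZOH formulas for that scalar case, a_bar = exp(delta*a) and b_bar = (exp(delta*a) - 1)/a * b. The names `discretize_zoh`, `run`, and `delta` (the step size) are hypothetical and chosen only for this example; real implementations work on tensors and may approximate these expressions.

```python
import math


def discretize_zoh(a, b, delta):
    """ZOH discretization of the scalar system h'(t) = a*h(t) + b*x(t)."""
    a_bar = math.exp(delta * a)
    # For a == 0 the expression (exp(delta*a) - 1) / a * b has the limit delta * b.
    b_bar = delta * b if a == 0 else (a_bar - 1.0) / a * b
    return a_bar, b_bar


def run(xs, a, b, c, delta):
    """Run the discretized recurrence h = a_bar*h + b_bar*x, y = c*h over xs."""
    a_bar, b_bar = discretize_zoh(a, b, delta)
    h = 0.0
    ys = []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


# Example: a stable system (a < 0) driven by a constant input of 1.0.
a_bar, b_bar = discretize_zoh(-1.0, 1.0, 0.1)
ys = run([1.0] * 200, -1.0, 1.0, 1.0, 0.1)
```

With a constant input, the discrete recurrence settles at the same steady state as the continuous system (here -b/a = 1), which is a quick sanity check that the discretized parameters describe the same dynamics.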
So instead of using functions as shown above, we use concrete values: we
transform A and B into discrete values, and the equations become:
362386```