Hybrid architecture memory growth rate: linear or quadratic?
#1 by tdb12 - opened
Wow, great work with this, awesome to see.
Since this model mixes attention and Mamba layers, is memory growth with respect to context length still dominated by the quadratic attention terms? Or is there a limit on how large a window the attention layers ingest (the IBM Granite 4.0 article mentions transformers being used to "enable a more nuanced parsing of local context"), so that at super-long context lengths memory grows as O(n) rather than O(n²)?
Any clarification would be appreciated!
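For context, here's the rough back-of-envelope model I have in my head (a Python sketch; the layer counts, head dims, window size, and SSM state size below are made-up placeholders, not the actual Granite 4.0 config). It just illustrates the difference between a KV cache that grows with context, one capped by a sliding window, and a fixed-size Mamba state:

```python
# Back-of-envelope inference-memory model for a hybrid attention/Mamba stack.
# All sizes are hypothetical placeholders, NOT the real Granite 4.0 config.

BYTES = 2  # fp16 / bf16


def attention_kv_cache_bytes(seq_len, n_layers, n_kv_heads=8, head_dim=128,
                             window=None):
    """KV cache grows linearly with context unless a sliding window caps it."""
    effective_len = seq_len if window is None else min(seq_len, window)
    # factor of 2 for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * effective_len * BYTES


def mamba_state_bytes(n_layers, d_inner=4096, d_state=16):
    """SSM recurrent state is fixed-size, independent of context length."""
    return n_layers * d_inner * d_state * BYTES


if __name__ == "__main__":
    for n in (4_096, 65_536, 1_048_576):
        full = attention_kv_cache_bytes(n, n_layers=4)
        windowed = attention_kv_cache_bytes(n, n_layers=4, window=4_096)
        ssm = mamba_state_bytes(n_layers=36)
        print(f"context {n:>9,}: "
              f"full-attn KV {full / 2**20:8.1f} MiB | "
              f"windowed KV {windowed / 2**20:6.1f} MiB | "
              f"Mamba state {ssm / 2**20:6.1f} MiB")
```

If the attention layers are windowed (or there are only a handful of them), the windowed-KV and Mamba terms stay flat, which is what I'm trying to confirm.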