Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths
Abstract
Gecko is a neural architecture that combines an exponential moving average with gated attention and adds further components, including timestep decay normalization and sliding chunk attention, to improve long-range dependency capture. It processes sequential data of arbitrary length efficiently and scales to longer contexts than existing models.
Designing a unified neural network that efficiently and inherently processes sequential data of arbitrary length is a central and challenging problem in sequence modeling. Design choices in the Transformer, including its quadratic complexity and weak length extrapolation, limit its ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and introduces multiple technical components to improve its capability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70) and approaching Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to 4× longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm
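As a rough illustration of two of the primitives named above, the PyTorch sketch below implements (i) a per-dimension damped exponential moving average in the style of Mega and Megalodon, which the abstract says Gecko inherits, and (ii) plain attention restricted to non-overlapping chunks. This is a minimal sketch under stated assumptions: the function names and tensor shapes are illustrative, and the gating, timestep decay normalization, chunk overlap implied by "sliding", and adaptive working memory are all omitted. Gecko's actual implementation is in the linked repository and will differ.

```python
import torch


def damped_ema(x, alpha, delta):
    """Per-dimension damped EMA over the time axis (simplified Mega-style recurrence).

    x:     (batch, seq_len, dim) input sequence
    alpha: (dim,) smoothing factors in (0, 1)
    delta: (dim,) damping factors in (0, 1)
    """
    batch, seq_len, dim = x.shape
    y = torch.zeros(batch, dim, dtype=x.dtype, device=x.device)
    decay = 1.0 - alpha * delta            # fraction of the previous state that is kept
    out = []
    for t in range(seq_len):
        y = alpha * x[:, t] + decay * y    # blend the new input with the running history
        out.append(y)
    return torch.stack(out, dim=1)         # (batch, seq_len, dim)


def chunk_attention(q, k, v, chunk_size):
    """Full attention inside non-overlapping chunks; no attention across chunks.

    q, k, v: (batch, heads, seq_len, head_dim); seq_len is assumed divisible
    by chunk_size for brevity.
    """
    b, h, n, d = q.shape
    c = chunk_size
    # Fold the time axis into (num_chunks, chunk_size) so attention stays local to a chunk.
    q, k, v = (t.reshape(b, h, n // c, c, d) for t in (q, k, v))
    scores = torch.einsum("bhncd,bhnkd->bhnck", q, k) / d ** 0.5
    out = torch.einsum("bhnck,bhnkd->bhncd", scores.softmax(dim=-1), v)
    return out.reshape(b, h, n, d)


if __name__ == "__main__":
    # Toy usage with arbitrary sizes.
    x = torch.randn(2, 16, 8)
    alpha = torch.rand(8) * 0.5 + 0.25
    delta = torch.rand(8) * 0.5 + 0.25
    smoothed = damped_ema(x, alpha, delta)          # (2, 16, 8)

    q = k = v = torch.randn(2, 4, 16, 8)
    local = chunk_attention(q, k, v, chunk_size=4)  # (2, 4, 16, 8)
```

The intuition behind combining the two: the EMA recurrence runs in linear time and biases each position toward its recent history, while restricting attention to chunks caps the attention cost at roughly sequence length times chunk size instead of quadratic in sequence length, which is consistent with the abstract's claim of processing sequences far longer than the attention window.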