---
license: apache-2.0
language:
- en
library_name: transformers
base_model:
- arcee-ai/AFM-4.5B-Base
tags:
- kda
- kimi-delta-attention
- linear-attention
- distillation
- research
---

# AFM-4.5B-Base-KDA-Only

A research variant of [AFM-4.5B-Base](https://huggingface.co/arcee-ai/AFM-4.5B-Base) in which all attention layers have been replaced with Kimi Delta Attention (KDA) through knowledge distillation. This model contains **no full-attention layers**.

> ⚠️ **Research Model**: This is an experimental model released for research purposes. For production use, see [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B).

More details are available in our blog post: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it

## Overview

This model explores whether full attention can be completely replaced with linear attention mechanisms. Using [DistillKit](https://github.com/arcee-ai/DistillKit), we distilled the original AFM-4.5B-Base (teacher) into a pure-KDA architecture (student).

**Key characteristics:**

- All 24 layers use KDA instead of full attention
- Trained up to 32k sequence length
- Linear memory scaling with sequence length
- Smoother long-context degradation compared to hybrid architectures

## Architecture

| Component | Details |
|-----------|---------|
| Parameters | 4.5B |
| Attention Type | Kimi Delta Attention (all layers) |
| Positional Encoding | None (inherent to KDA) |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |

## Benchmark Results

Performance compared to the full-attention teacher model:

| Benchmark | Teacher (Full Attn) | KDA-Only |
|-----------|:-------------------:|:--------:|
| MMLU (Avg) | 63.1% | 55.8% |
| ARC-Challenge | 55.6% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% |
| GSM8K (Math) | 52.1% | 26.8% |

### Key Findings

- **Knowledge benchmarks**: KDA-Only performs within statistical range of hybrid approaches on MMLU, ARC, and HellaSwag
- **Math performance**: Larger drop on GSM8K compared to hybrid configurations, though this may recover with longer training
- **Long-context behavior**: Degrades more smoothly than hybrid models beyond the training length; no cliff at 32k, just gradual falloff

## Long-Context Performance (NIAH)

The pure-KDA model shows interesting long-context characteristics:

- 100% single-needle retrieval up to 65k tokens (beyond the training length)
- Multi-key retrieval begins to degrade at 4k, but smoothly
- No sharp "cliff" past 32k of the kind hybrid models exhibit

This behavior aligns with expectations for state-space-like architectures: a fixed hidden-state size creates inherent tension with growing context, but the degradation is graceful.
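As a rough illustration of the single-needle setup described above, the sketch below hides a short fact in synthetic filler text and asks the model to repeat it. The needle wording, filler sentence, and ~64k-token context length are illustrative assumptions; this is not the harness used to produce the numbers above.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical needle and filler text (not the actual NIAH harness).
needle = "The secret passphrase is 'amber falcon 42'."
filler = "The sky was clear and the market was quiet that day. "

# Build a haystack of roughly 64k tokens with the needle buried in the middle.
target_tokens = 64_000
filler_tokens = len(tokenizer(filler).input_ids)
half = (target_tokens // filler_tokens) // 2
haystack = filler * half + needle + " " + filler * half

prompt = haystack + "\n\nQuestion: What is the secret passphrase?\nAnswer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=20, do_sample=False)

answer = tokenizer.decode(outputs[0, input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
print("Retrieved:", "amber falcon 42" in answer)
```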
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = "The theory of relativity states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

- **Method**: Knowledge distillation from AFM-4.5B-Base using [DistillKit](https://github.com/arcee-ai/DistillKit) (see the loss sketch at the end of this card)
- **Teacher**: AFM-4.5B-Base (full attention)
- **Student Architecture**: All layers converted to KDA
- **Training Length**: 32k sequence length

## Intended Use

This model is intended for:

- Research into linear attention mechanisms
- Studying attention distillation techniques
- Exploring pure state-space-like architectures for language modeling
- Benchmarking KDA vs. full-attention tradeoffs

## Limitations

- Lower math/reasoning performance compared to full attention
- Not instruction-tuned
- Research checkpoint; not optimized for production

## License

AFM-4.5B-Base-KDA-Only is released under the Apache-2.0 license.
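## Distillation Objective (Illustrative Sketch)

For readers curious about the training recipe referenced under Training Details, the following is a minimal sketch of a forward-KL logit-distillation step of the kind used to transfer knowledge from the full-attention teacher into the KDA student. The `temperature`, `alpha` weighting, and the `student`/`teacher`/`batch` objects are assumptions for illustration; the actual procedure is implemented in [DistillKit](https://github.com/arcee-ai/DistillKit).

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, temperature=2.0, alpha=0.5):
    """One hypothetical distillation step: KL to the teacher's logits + CE on labels."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits

    student_logits = student(**batch).logits

    # Forward KL between teacher and student token distributions, softened by temperature.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Standard next-token cross-entropy on the ground-truth tokens.
    labels = batch["input_ids"][:, 1:]
    ce = F.cross_entropy(
        student_logits[:, :-1].reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
    )

    return alpha * kl + (1 - alpha) * ce
```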