--- license: apache-2.0 datasets: - mindchain/wikitext2 - allenai/c4 language: - en metrics: - perplexity base_model: - TinyLlama/TinyLlama-1.1B-Chat-v1.0 pipeline_tag: text-generation library_name: transformers tags: - performer - linear-attention - favor+ - knowledge-distillation - research --- - **Model description**: TinyLlama 1.1B with K/32 softmax attention heads replaced by FAVOR+ linear attention, fine-tuned via knowledge distillation - **Intended use**: Research — evaluating quality/speed/approximation trade-offs of linear attention substitution - **How to load**: code snippet showing how to reconstruct MixedPerformerAttention and load the checkpoint - **Training details**: WikiText-103, 20k samples, SEQ_LEN=256, distillation loss, 4-phase curriculum - **Results table**: same as the README (ppl per phase) - **Limitations**: Phase 4 (32/32 heads) collapsed — not suitable for inference. Phase 2 is the recommended checkpoint.