--- base_model: LiquidAI/LFM2.5-1.2B-Instruct tags: - sdft - self-distillation - continual-learning - conversational language: - en license: apache-2.0 pipeline_tag: text-generation library_name: transformers --- # LFM2.5-1.2B-SDFT: Self-Distillation Fine-Tuned Model This model is a **Self-Distillation Fine-Tuned (SDFT)** version of [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct), trained using the methodology from the paper ["Self-Distillation Enables Continual Learning"](https://arxiv.org/abs/2601.19897). ## Model Description - **Base Model:** LiquidAI/LFM2.5-1.2B-Instruct - **Training Method:** Self-Distillation Fine-Tuning (SDFT) - **Training Data:** ~5K samples from OpenAssistant dataset - **Training Hardware:** Single NVIDIA A100 GPU - **Parameters:** LoRA rank=8, alpha=16, targeting q_proj and v_proj ### What is SDFT? Self-Distillation Fine-Tuning (SDFT) is a continual learning technique that: - Uses the model's **in-context learning** ability to create a demonstration-aware teacher - Generates training data **on-policy** from the student model - Minimizes KL divergence between student and demonstration-conditioned teacher - Enables learning new tasks while **reducing catastrophic forgetting** Key advantages: - ✅ Learns from demonstrations without explicit reward functions - ✅ Maintains prior knowledge while acquiring new skills - ✅ On-policy learning improves generalization - ✅ Efficient training with EMA teacher updates ## Quick Start ### Installation ```bash pip install torch transformers peft accelerate bitsandbytes ``` ### Basic Usage ```python import torch from transformers import AutoTokenizer, AutoModelForCausalLM # Load tokenizer tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct") # Load model with quantization base_model = AutoModelForCausalLM.from_pretrained( "yasserrmd/lfm2.5-1.5b-sdft", torch_dtype=torch.float16, device_map="auto" ) model.eval() # Generate prompt = """<|im_start|>user Explain how photosynthesis works. <|im_end|> <|im_start|>assistant """ inputs = tokenizer(prompt, return_tensors="pt").to("cuda") # Use official LiquidAI parameters outputs = model.generate( **inputs, max_new_tokens=256, do_sample=True, temperature=0.1, top_k=50, top_p=0.1, repetition_penalty=1.05, pad_token_id=tokenizer.pad_token_id ) response = tokenizer.decode(outputs[0], skip_special_tokens=True) print(response) ``` ### With Demonstration (In-Context Learning) ```python prompt = """<|im_start|>user Explain how databases work. Here is an example response to guide you: Example: Databases store data in tables. You can query them to get information back. Now provide your own response following a similar approach: <|im_end|> <|im_start|>assistant """ inputs = tokenizer(prompt, return_tensors="pt").to("cuda") outputs = model.generate( **inputs, max_new_tokens=256, do_sample=True, temperature=0.1, top_k=50, top_p=0.1, repetition_penalty=1.05 ) response = tokenizer.decode(outputs[0], skip_special_tokens=True) print(response) ``` ## Training Details ### Dataset - **Source:** OpenAssistant conversations - **Size:** ~5,000 query-demonstration pairs - **Preprocessing:** - Filtered demonstrations: 20-2048 characters - Train/Val/Test split: 75%/10%/15% ### Training Configuration ```python # Model Architecture - Base: LiquidAI/LFM2.5-1.2B-Instruct - Quantization: 8-bit with bitsandbytes - LoRA: rank=8, alpha=16, dropout=0.05 - Target modules: q_proj, v_proj # Training Parameters - Learning rate: 5e-6 - Optimizer: AdamW (weight_decay=0.01) - Batch size: 1 (with gradient accumulation) - Gradient accumulation steps: 16 - Epochs: 3 - Max sequence length: 512 - Max generation length: 128 # SDFT-Specific - EMA alpha: 0.02 - Temperature: 1.0 - KL divergence: Analytic (full vocabulary) - On-policy generation: Yes ``` ### Prompt Format (Teacher vs Student) **Student Prompt (query only):** ``` <|im_start|>user {query} <|im_end|> <|im_start|>assistant ``` **Teacher Prompt (query + demonstration):** ``` <|im_start|>user {query} Here is an example response to guide you: <|im_start|>assistant {demonstration} <|im_end|> <|im_start|>user Now provide your own response following a similar approach and reasoning: <|im_end|> <|im_start|>assistant ``` ## Evaluation Results ### Tested on Multiple Dimensions: | Category | Description | Performance | |----------|-------------|-------------| | **ICL Adaptation** | Following demonstration style | ✅ Good | | **Task Improvement** | Learning from examples | ✅ Good | | **Retention** | No catastrophic forgetting | ✅ ~80% | | **Polarity Control** | Following demo viewpoint | ⚠️ Moderate | ### Key Findings: 1. ✅ **Maintains Knowledge:** No significant forgetting on general tasks 2. ✅ **Adapts to Demos:** Successfully follows demonstration styles 3. ✅ **Improved Over Training:** Epoch 3 shows stable, coherent outputs 4. ⚠️ **Model Size Limitation:** 1.2B parameters limits complex reasoning ### Comparison to Base Model: - **With Demonstrations:** SDFT shows better style matching and task following - **Without Demonstrations:** Maintains base model capabilities - **Response Quality:** More consistent and focused outputs ## Generation Parameters **⚠️ Important:** Use official LiquidAI parameters for best results: ```python generation_config = { "max_new_tokens": 256, "do_sample": True, "temperature": 0.1, # Official LiquidAI recommendation "top_k": 50, # Official LiquidAI recommendation "top_p": 0.1, # Official LiquidAI recommendation "repetition_penalty": 1.05 # Official LiquidAI recommendation } ``` These parameters are specifically tuned for LFM2.5 and provide: - Focused, factual responses - Minimal hallucinations - Consistent output quality ## Limitations ### Model Constraints: - **Size:** 1.2B parameters (smaller capacity than 7B+ models) - **Training Data:** 5K samples (vs paper's 20K+) - **Hardware:** Single A100 (vs paper's multi-GPU setup) - **Complexity:** Limited reasoning on very complex tasks ### Known Issues: - May require proper ChatML formatting for best results - Performance degrades on tasks requiring deep technical knowledge - Smaller model size limits polarity control effectiveness ### Appropriate Use Cases: - ✅ Conversational AI with example-guided responses - ✅ Task learning from demonstrations - ✅ Style-adaptive text generation - ✅ Educational/research purposes ### Not Recommended For: - ❌ Production systems requiring 100% reliability - ❌ Tasks requiring strong reasoning (use 7B+ models) - ❌ Safety-critical applications - ❌ Tasks outside training distribution without demonstrations ## Bias and Ethical Considerations - Inherits biases from base LFM2.5 model and OpenAssistant dataset - May generate inconsistent responses on controversial topics - Should not be used for medical, legal, or financial advice - Outputs should be reviewed by humans for critical applications ## Citation If you use this model, please cite: **SDFT Paper:** ```bibtex @article{shenfeld2026sdft, title={Self-Distillation Enables Continual Learning}, author={Shenfeld, Idan and Damani, Mehul and H{\"u}botter, Jonas and Agrawal, Pulkit}, journal={arXiv preprint arXiv:2601.19897}, year={2026} } ``` **Base Model:** ```bibtex @misc{lfm25, title={LFM2.5: Liquid Foundation Models}, author={LiquidAI}, year={2024}, url={https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct} } ``` ## Acknowledgments - **Paper:** ["Self-Distillation Enables Continual Learning"](https://arxiv.org/abs/2601.19897) by Shenfeld et al. - **Base Model:** [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) - **Dataset:** OpenAssistant conversations - **Framework:** HuggingFace Transformers, PEFT, bitsandbytes ## License This model is released under the Apache 2.0 license, following the base model's licensing. ## Model Card Authors [Your Name/Organization] ## Contact For questions or issues, please open an issue on the [model repository](https://huggingface.co/YOUR_USERNAME/lfm25-sdft). --- ## Additional Resources - 📄 [SDFT Paper](https://arxiv.org/abs/2601.19897) - 💻 [Training Code](https://github.com/YOUR_USERNAME/sdft-training) - 🤗 [Base Model](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) - 📊 [Evaluation Results](link-to-detailed-results) ## Version History - **v1.0** (2024-XX-XX): Initial release - Trained on 5K OpenAssistant samples - 3 epochs with gradient accumulation - LoRA rank 8, alpha 16