---
base_model: LiquidAI/LFM2.5-1.2B-Instruct
tags:
- sdft
- self-distillation
- continual-learning
- conversational
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---
# LFM2.5-1.2B-SDFT: Self-Distillation Fine-Tuned Model

This model is a Self-Distillation Fine-Tuned (SDFT) version of [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct), trained using the methodology from the paper "Self-Distillation Enables Continual Learning".

## Model Description

- **Base Model:** LiquidAI/LFM2.5-1.2B-Instruct
- **Training Method:** Self-Distillation Fine-Tuning (SDFT)
- **Training Data:** ~5K samples from the OpenAssistant dataset
- **Training Hardware:** Single NVIDIA A100 GPU
- **Parameters:** LoRA rank=8, alpha=16, targeting `q_proj` and `v_proj`
## What is SDFT?

Self-Distillation Fine-Tuning (SDFT) is a continual learning technique that:

- Uses the model's in-context learning ability to create a demonstration-aware teacher
- Generates training data on-policy from the student model
- Minimizes the KL divergence between the student and the demonstration-conditioned teacher
- Enables learning new tasks while reducing catastrophic forgetting

Key advantages:

- ✅ Learns from demonstrations without explicit reward functions
- ✅ Maintains prior knowledge while acquiring new skills
- ✅ On-policy learning improves generalization
- ✅ Efficient training with EMA teacher updates
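The distillation objective can be sketched in a few lines. The sketch below is illustrative rather than the actual training code: it computes the analytic, full-vocabulary KL divergence between a demonstration-conditioned teacher distribution and a student distribution at a single token position (the KL direction and the plain-Python softmax are simplifying assumptions).

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def analytic_kl(teacher_logits, student_logits, temperature=1.0):
    """Full-vocabulary KL(teacher || student) at one token position.
    The teacher sees the demonstration; the student sees only the query."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions give zero loss
print(round(analytic_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]), 6))  # → 0.0
```

Averaging this quantity over all positions of an on-policy generation gives the per-sample training loss.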
## Quick Start

### Installation

```bash
pip install torch transformers peft accelerate bitsandbytes
```

### Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")

# Load model in half precision (8-bit quantization via bitsandbytes is optional)
model = AutoModelForCausalLM.from_pretrained(
    "yasserrmd/lfm2.5-1.5b-sdft",
    torch_dtype=torch.float16,
    device_map="auto"
)
model.eval()

# Generate
prompt = """<|im_start|>user
Explain how photosynthesis works.
<|im_end|>
<|im_start|>assistant
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Use official LiquidAI parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05,
    pad_token_id=tokenizer.pad_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### With Demonstration (In-Context Learning)

```python
prompt = """<|im_start|>user
Explain how databases work.
Here is an example response to guide you:
Example: Databases store data in tables. You can query them to get information back.
Now provide your own response following a similar approach:
<|im_end|>
<|im_start|>assistant
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## Training Details

### Dataset

- **Source:** OpenAssistant conversations
- **Size:** ~5,000 query-demonstration pairs
- **Preprocessing:**
  - Filtered demonstrations: 20-2048 characters
  - Train/Val/Test split: 75%/10%/15%
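The preprocessing above can be sketched as follows. This is an illustrative helper (the function name and the dict keys `query`/`demonstration` are assumptions, not the actual pipeline):

```python
import random

def prepare_splits(pairs, seed=0):
    """Keep demonstrations of 20-2048 characters, then split the
    remainder 75% / 10% / 15% into train / val / test."""
    kept = [p for p in pairs if 20 <= len(p["demonstration"]) <= 2048]
    random.Random(seed).shuffle(kept)
    n_train = int(0.75 * len(kept))
    n_val = int(0.10 * len(kept))
    return kept[:n_train], kept[n_train:n_train + n_val], kept[n_train + n_val:]

pairs = [{"query": f"q{i}", "demonstration": "x" * 100} for i in range(100)]
train, val, test = prepare_splits(pairs)
print(len(train), len(val), len(test))  # → 75 10 15
```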
### Training Configuration
```text
# Model Architecture
- Base: LiquidAI/LFM2.5-1.2B-Instruct
- Quantization: 8-bit with bitsandbytes
- LoRA: rank=8, alpha=16, dropout=0.05
- Target modules: q_proj, v_proj

# Training Parameters
- Learning rate: 5e-6
- Optimizer: AdamW (weight_decay=0.01)
- Batch size: 1 (with gradient accumulation)
- Gradient accumulation steps: 16
- Epochs: 3
- Max sequence length: 512
- Max generation length: 128

# SDFT-Specific
- EMA alpha: 0.02
- Temperature: 1.0
- KL divergence: Analytic (full vocabulary)
- On-policy generation: Yes
```
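The EMA teacher update (alpha = 0.02) can be illustrated on plain parameter lists; a real implementation would iterate over `state_dict` tensors, and this simplified form is an assumption:

```python
def ema_update(teacher, student, alpha=0.02):
    """One EMA step: teacher <- (1 - alpha) * teacher + alpha * student.
    A small alpha keeps the teacher a slowly-moving average of the student,
    which stabilizes the distillation target between training steps."""
    return [(1 - alpha) * t + alpha * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, 2.0]
teacher = ema_update(teacher, student)
print(teacher)  # → [0.02, 0.04]
```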
### Prompt Format (Teacher vs Student)

**Student Prompt** (query only):

```text
<|im_start|>user
{query}
<|im_end|>
<|im_start|>assistant
```

**Teacher Prompt** (query + demonstration):

```text
<|im_start|>user
{query}
Here is an example response to guide you:
<|im_start|>assistant
{demonstration}
<|im_end|>
<|im_start|>user
Now provide your own response following a similar approach and reasoning:
<|im_end|>
<|im_start|>assistant
```
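The two formats can be built with small string helpers. The function names below are illustrative (not from the training code); the templates themselves follow the card:

```python
def student_prompt(query):
    """ChatML prompt the student sees: the query alone."""
    return (
        f"<|im_start|>user\n{query}\n<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def teacher_prompt(query, demonstration):
    """ChatML prompt the teacher sees: the query plus the demonstration,
    followed by a request to answer in the same style."""
    return (
        f"<|im_start|>user\n{query}\n"
        "Here is an example response to guide you:\n"
        f"<|im_start|>assistant\n{demonstration}\n<|im_end|>\n"
        "<|im_start|>user\n"
        "Now provide your own response following a similar approach and reasoning:\n"
        "<|im_end|>\n<|im_start|>assistant\n"
    )

print(student_prompt("What is SDFT?"))
```

During training, the KL loss compares the student's distribution under `student_prompt` against the teacher's distribution under `teacher_prompt` for the same query.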
## Evaluation Results

**Tested on Multiple Dimensions:**

| Category | Description | Performance |
|---|---|---|
| ICL Adaptation | Following demonstration style | ✅ Good |
| Task Improvement | Learning from examples | ✅ Good |
| Retention | No catastrophic forgetting | ✅ ~80% |
| Polarity Control | Following demo viewpoint | ⚠️ Moderate |
**Key Findings:**

- ✅ **Maintains Knowledge:** No significant forgetting on general tasks
- ✅ **Adapts to Demos:** Successfully follows demonstration styles
- ✅ **Improved Over Training:** Epoch 3 shows stable, coherent outputs
- ⚠️ **Model Size Limitation:** 1.2B parameters limits complex reasoning

**Comparison to Base Model:**

- **With Demonstrations:** SDFT shows better style matching and task following
- **Without Demonstrations:** Maintains base model capabilities
- **Response Quality:** More consistent and focused outputs
## Generation Parameters

⚠️ **Important:** Use the official LiquidAI parameters for best results:

```python
generation_config = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.1,         # Official LiquidAI recommendation
    "top_k": 50,                # Official LiquidAI recommendation
    "top_p": 0.1,               # Official LiquidAI recommendation
    "repetition_penalty": 1.05  # Official LiquidAI recommendation
}
```
These parameters are specifically tuned for LFM2.5 and provide:
- Focused, factual responses
- Minimal hallucinations
- Consistent output quality
## Limitations

**Model Constraints:**

- **Size:** 1.2B parameters (smaller capacity than 7B+ models)
- **Training Data:** 5K samples (vs the paper's 20K+)
- **Hardware:** Single A100 (vs the paper's multi-GPU setup)
- **Complexity:** Limited reasoning on very complex tasks

**Known Issues:**

- May require proper ChatML formatting for best results
- Performance degrades on tasks requiring deep technical knowledge
- Smaller model size limits polarity control effectiveness

**Appropriate Use Cases:**

- ✅ Conversational AI with example-guided responses
- ✅ Task learning from demonstrations
- ✅ Style-adaptive text generation
- ✅ Educational/research purposes

**Not Recommended For:**

- ❌ Production systems requiring 100% reliability
- ❌ Tasks requiring strong reasoning (use 7B+ models)
- ❌ Safety-critical applications
- ❌ Tasks outside the training distribution without demonstrations
## Bias and Ethical Considerations
- Inherits biases from base LFM2.5 model and OpenAssistant dataset
- May generate inconsistent responses on controversial topics
- Should not be used for medical, legal, or financial advice
- Outputs should be reviewed by humans for critical applications
## Citation

If you use this model, please cite:

**SDFT Paper:**

```bibtex
@article{shenfeld2026sdft,
  title={Self-Distillation Enables Continual Learning},
  author={Shenfeld, Idan and Damani, Mehul and H{\"u}botter, Jonas and Agrawal, Pulkit},
  journal={arXiv preprint arXiv:2601.19897},
  year={2026}
}
```

**Base Model:**

```bibtex
@misc{lfm25,
  title={LFM2.5: Liquid Foundation Models},
  author={LiquidAI},
  year={2024},
  url={https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct}
}
```
## Acknowledgments

- **Paper:** "Self-Distillation Enables Continual Learning" by Shenfeld et al.
- **Base Model:** LiquidAI/LFM2.5-1.2B-Instruct
- **Dataset:** OpenAssistant conversations
- **Framework:** HuggingFace Transformers, PEFT, bitsandbytes
## License

This model is released under the Apache 2.0 license, following the base model's licensing.
## Model Card Authors

[Your Name/Organization]

## Contact

For questions or issues, please open an issue on the model repository.
## Additional Resources

- 📄 SDFT Paper
- 💻 Training Code
- 🤗 Base Model
- 📊 Evaluation Results
## Version History

- **v1.0** (2024-XX-XX): Initial release
  - Trained on 5K OpenAssistant samples
  - 3 epochs with gradient accumulation
  - LoRA rank 8, alpha 16