---
base_model: LiquidAI/LFM2.5-1.2B-Instruct
tags:
- sdft
- self-distillation
- continual-learning
- conversational
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# LFM2.5-1.2B-SDFT: Self-Distillation Fine-Tuned Model

This model is a **Self-Distillation Fine-Tuned (SDFT)** version of [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct), trained using the methodology from the paper ["Self-Distillation Enables Continual Learning"](https://arxiv.org/abs/2601.19897).

## Model Description

- **Base Model:** LiquidAI/LFM2.5-1.2B-Instruct
- **Training Method:** Self-Distillation Fine-Tuning (SDFT)
- **Training Data:** ~5K samples from OpenAssistant dataset
- **Training Hardware:** Single NVIDIA A100 GPU
- **Parameters:** LoRA rank=8, alpha=16, targeting q_proj and v_proj

### What is SDFT?

Self-Distillation Fine-Tuning (SDFT) is a continual learning technique that:
- Uses the model's **in-context learning** ability to create a demonstration-aware teacher
- Generates training data **on-policy** from the student model
- Minimizes KL divergence between student and demonstration-conditioned teacher
- Enables learning new tasks while **reducing catastrophic forgetting**

Key advantages:
- ✅ Learns from demonstrations without explicit reward functions
- ✅ Maintains prior knowledge while acquiring new skills
- ✅ On-policy learning improves generalization
- ✅ Efficient training with EMA teacher updates

## Quick Start

### Installation

```bash
pip install torch transformers peft accelerate bitsandbytes
```

### Basic Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")

# Load model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    "yasserrmd/lfm2.5-1.5b-sdft",
    torch_dtype=torch.float16,
    device_map="auto"
)


model.eval()

# Generate
prompt = """<|im_start|>user
Explain how photosynthesis works.
<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Use official LiquidAI parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05,
    pad_token_id=tokenizer.pad_token_id
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### With Demonstration (In-Context Learning)

```python
prompt = """<|im_start|>user
Explain how databases work.

Here is an example response to guide you:

Example: Databases store data in tables. You can query them to get information back.

Now provide your own response following a similar approach:
<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Training Details

### Dataset
- **Source:** OpenAssistant conversations
- **Size:** ~5,000 query-demonstration pairs
- **Preprocessing:** 
  - Filtered demonstrations: 20-2048 characters
  - Train/Val/Test split: 75%/10%/15%

### Training Configuration

```python
# Model Architecture
- Base: LiquidAI/LFM2.5-1.2B-Instruct
- Quantization: 8-bit with bitsandbytes
- LoRA: rank=8, alpha=16, dropout=0.05
- Target modules: q_proj, v_proj

# Training Parameters
- Learning rate: 5e-6
- Optimizer: AdamW (weight_decay=0.01)
- Batch size: 1 (with gradient accumulation)
- Gradient accumulation steps: 16
- Epochs: 3
- Max sequence length: 512
- Max generation length: 128

# SDFT-Specific
- EMA alpha: 0.02
- Temperature: 1.0
- KL divergence: Analytic (full vocabulary)
- On-policy generation: Yes
```

### Prompt Format (Teacher vs Student)

**Student Prompt (query only):**
```
<|im_start|>user
{query}
<|im_end|>
<|im_start|>assistant
```

**Teacher Prompt (query + demonstration):**
```
<|im_start|>user
{query}

Here is an example response to guide you:

<|im_start|>assistant
{demonstration}
<|im_end|>

<|im_start|>user
Now provide your own response following a similar approach and reasoning:
<|im_end|>
<|im_start|>assistant
```

## Evaluation Results

### Tested on Multiple Dimensions:

| Category | Description | Performance |
|----------|-------------|-------------|
| **ICL Adaptation** | Following demonstration style | ✅ Good |
| **Task Improvement** | Learning from examples | ✅ Good |
| **Retention** | No catastrophic forgetting | ✅ ~80% |
| **Polarity Control** | Following demo viewpoint | ⚠️ Moderate |

### Key Findings:

1. ✅ **Maintains Knowledge:** No significant forgetting on general tasks
2. ✅ **Adapts to Demos:** Successfully follows demonstration styles
3. ✅ **Improved Over Training:** Epoch 3 shows stable, coherent outputs
4. ⚠️ **Model Size Limitation:** 1.2B parameters limits complex reasoning

### Comparison to Base Model:

- **With Demonstrations:** SDFT shows better style matching and task following
- **Without Demonstrations:** Maintains base model capabilities
- **Response Quality:** More consistent and focused outputs

## Generation Parameters

**⚠️ Important:** Use official LiquidAI parameters for best results:

```python
generation_config = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.1,        # Official LiquidAI recommendation
    "top_k": 50,               # Official LiquidAI recommendation
    "top_p": 0.1,              # Official LiquidAI recommendation
    "repetition_penalty": 1.05 # Official LiquidAI recommendation
}
```

These parameters are specifically tuned for LFM2.5 and provide:
- Focused, factual responses
- Minimal hallucinations
- Consistent output quality

## Limitations

### Model Constraints:
- **Size:** 1.2B parameters (smaller capacity than 7B+ models)
- **Training Data:** 5K samples (vs paper's 20K+)
- **Hardware:** Single A100 (vs paper's multi-GPU setup)
- **Complexity:** Limited reasoning on very complex tasks

### Known Issues:
- May require proper ChatML formatting for best results
- Performance degrades on tasks requiring deep technical knowledge
- Smaller model size limits polarity control effectiveness

### Appropriate Use Cases:
- ✅ Conversational AI with example-guided responses
- ✅ Task learning from demonstrations
- ✅ Style-adaptive text generation
- ✅ Educational/research purposes

### Not Recommended For:
- ❌ Production systems requiring 100% reliability
- ❌ Tasks requiring strong reasoning (use 7B+ models)
- ❌ Safety-critical applications
- ❌ Tasks outside training distribution without demonstrations

## Bias and Ethical Considerations

- Inherits biases from base LFM2.5 model and OpenAssistant dataset
- May generate inconsistent responses on controversial topics
- Should not be used for medical, legal, or financial advice
- Outputs should be reviewed by humans for critical applications

## Citation

If you use this model, please cite:

**SDFT Paper:**
```bibtex
@article{shenfeld2026sdft,
  title={Self-Distillation Enables Continual Learning},
  author={Shenfeld, Idan and Damani, Mehul and H{\"u}botter, Jonas and Agrawal, Pulkit},
  journal={arXiv preprint arXiv:2601.19897},
  year={2026}
}
```

**Base Model:**
```bibtex
@misc{lfm25,
  title={LFM2.5: Liquid Foundation Models},
  author={LiquidAI},
  year={2024},
  url={https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct}
}
```

## Acknowledgments

- **Paper:** ["Self-Distillation Enables Continual Learning"](https://arxiv.org/abs/2601.19897) by Shenfeld et al.
- **Base Model:** [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct)
- **Dataset:** OpenAssistant conversations
- **Framework:** HuggingFace Transformers, PEFT, bitsandbytes

## License

This model is released under the Apache 2.0 license, following the base model's licensing.

## Model Card Authors

[Your Name/Organization]

## Contact

For questions or issues, please open an issue on the [model repository](https://huggingface.co/YOUR_USERNAME/lfm25-sdft).

---

## Additional Resources

- 📄 [SDFT Paper](https://arxiv.org/abs/2601.19897)
- 💻 [Training Code](https://github.com/YOUR_USERNAME/sdft-training)
- 🤗 [Base Model](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct)
- 📊 [Evaluation Results](link-to-detailed-results)

## Version History

- **v1.0** (2024-XX-XX): Initial release
  - Trained on 5K OpenAssistant samples
  - 3 epochs with gradient accumulation
  - LoRA rank 8, alpha 16