LFM2.5-1.2B-Distilled-SFT
A 1.2B hybrid model (SSM + attention) built in two stages: knowledge distillation from a 24B MoE hybrid teacher on STEM chain-of-thought data, then supervised fine-tuning on logical inference. To our knowledge, this is the first proof-weighted distillation + SFT pipeline applied to a non-transformer architecture.
Liquid Foundation Models run at 239 tok/s on AMD CPU and fit under 1GB of RAM. This model adds structured STEM reasoning and formal logical inference to that efficiency substrate.
"Structure beats scale, collaboration beats hierarchy, observation beats theory." — Convergent Intelligence LLC: Research Division
Training Pipeline
Stage 1: Knowledge Distillation (STEM Reasoning Backbone)
LFM2.5-1.2B was distilled from LFM2-24B-A2B — a 24B MoE hybrid (SSM + attention) with only 2B active parameters per token. Teacher and student share the LFM hybrid architecture, so the KL-divergence term transfers reasoning patterns between architecturally compatible models.
Data: 2,802 STEM CoT samples from 5 domains:
| Domain | Samples |
|---|---|
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Classical Mechanics | 343 |
All from 0xZee. Focused subset — core mathematical reasoning domains that share the most structural overlap with logical inference.
Loss function:
- Proof-Weighted Cross-Entropy (55%) — derivation tokens weighted 2.5x, annealed to 1.5x over training
- Knowledge Distillation KL Divergence (45%) — temperature T=2.0, loss scaled by T²
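The combined objective can be sketched as follows. This is an illustrative reconstruction, not the training code: function and argument names (`distill_loss`, `proof_mask`) are assumptions, and only the weightings (55/45), the proof weight, and the T²-scaled KL come from the card.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, proof_mask,
                 proof_weight=2.5, T=2.0, ce_frac=0.55, kd_frac=0.45):
    """Proof-weighted CE (55%) + temperature-scaled KL divergence (45%)."""
    # Per-token cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        reduction="none",
    ).view(labels.shape)
    # Derivation ("proof") tokens get proof_weight x the normal CE weight.
    weights = proof_mask.float() * (proof_weight - 1.0) + 1.0
    ce_loss = (ce * weights).sum() / weights.sum()

    # KL on temperature-softened distributions; the T^2 factor keeps
    # gradient magnitudes comparable as T grows.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return ce_frac * ce_loss + kd_frac * kd_loss
```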
Training format:
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
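A minimal helper that assembles one training sample in the format above. The template text is taken from the card; the function and field names are illustrative, not the pipeline's actual identifiers.

```python
# Stage 1 sample template, as shown in the card.
STAGE1_TEMPLATE = (
    "Solve the following problem carefully and show a rigorous derivation.\n"
    "Problem:\n{question}\n"
    "Proof:\n{cot}\n"
    "Final Answer:\n{response}"
)

def format_stage1_sample(question: str, cot: str, response: str) -> str:
    """Assemble one distillation training sample in the Stage 1 format."""
    return STAGE1_TEMPLATE.format(question=question, cot=cot, response=response)
```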
Stage 1 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
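The two annealed quantities in the table can be sketched as schedules. The cosine endpoints (1.5e-5 → 1e-6) and proof-weight endpoints (2.5 → 1.5) come from the card; the linear ramp for the proof weight is an assumption, since the card gives only the endpoints.

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 1.5e-5, lr_min: float = 1e-6) -> float:
    """Cosine decay from 1.5e-5 to 1e-6, as in the Stage 1 schedule."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos

def proof_weight(step: int, total_steps: int,
                 w_start: float = 2.5, w_end: float = 1.5) -> float:
    """Anneal the derivation-token weight from 2.5 to 1.5 (linear ramp assumed)."""
    return w_start + (w_end - w_start) * (step / total_steps)
```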
Stage 2: Logical Inference SFT
Fine-tuned on KK04/LogicInference_OA — a reproduction of the LogicInference dataset (Santiago Ontañón, Google Research) formatted for instruction following. IID split, inference-first format (the reasoning chain precedes the final answer). 5,491 unique inference problems expanded to ~54,607 instruction-response pairs.
Why logical inference on a hybrid architecture? SSM components excel at sequential state propagation — exactly what formal logical inference requires. Each premise updates a logical state, and the conclusion follows from the final state. The hybrid architecture's inductive bias naturally aligns with propositional logic chains. SFT activates this alignment explicitly.
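The task structure described above — premises updating a logical state until the conclusion falls out — can be made concrete with a toy forward-chaining routine. This illustrates the data's structure only; it is not a claim about the model's internal mechanism, and all names here are illustrative.

```python
def forward_chain(facts, rules):
    """Toy forward chaining over atomic propositions.

    Each applied rule updates the 'logical state' (the set of known facts),
    mirroring the sequential-state framing of propositional inference.
    """
    state = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in state and conclusion not in state:
                state.add(conclusion)  # modus ponens: premise + rule => conclusion
                changed = True
    return state
```

For the chain used in the usage example below ("If p then q. If q then r. p is true."), the final state contains p, q, and r.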
Training format:
### Instruction:
{instruction}
### Response:
{response}
Stage 2 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (conservative to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |
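An effective batch size of 8 with gradient checkpointing enabled typically implies gradient accumulation over micro-batches. A minimal sketch with a stand-in model follows; the 2 x 4 micro-batch/accumulation split is an assumption, since the card states only the effective batch size.

```python
import torch

# Tiny stand-in model; the real pipeline fine-tunes LFM2.5-1.2B.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # Stage 2 LR
accum_steps = 4  # micro-batch 2 x 4 accumulation = effective batch 8 (assumed)

torch.manual_seed(0)
micro_batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]

num_updates = 0
optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()  # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()       # one update per effective batch of 8 samples
        optimizer.zero_grad()
        num_updates += 1
```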
Model Details
| Attribute | Value |
|---|---|
| Architecture | LFM2.5 (hybrid SSM + attention) |
| Parameters | 1.2B |
| Base model | liquid/LFM2.5-1.2B-Instruct |
| Teacher model | liquid/LFM2-24B-A2B |
| Stage 1 data | 2,802 STEM CoT samples (5 datasets) |
| Stage 2 data | KK04/LogicInference_OA (~54,607 pairs) |
| Inference | 239 tok/s AMD CPU, 82 tok/s mobile NPU, sub-1GB RAM |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Logical inference (Stage 2 format)
prompt = """### Instruction:
Consider the following premises: If p then q. If q then r. p is true. What can we infer?
### Response:
"""

# STEM derivation (the Stage 1 format still works; pass `prompt_stem` to the
# tokenizer below instead of `prompt` to use it)
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Solve the system of linear equations: 2x + y = 5, x - y = 1.
Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
GGUF
Quantized versions at reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT-GGUF.
Prompt Formats
STEM derivation (Stage 1):
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your problem]
Proof:
Logical inference (Stage 2):
### Instruction:
[Your question or logical inference problem]
### Response:
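A small routing helper that emits either trained format. The templates are copied from the card; the helper itself and its names are illustrative.

```python
# Stage 1 (STEM derivation) and Stage 2 (logical inference) prompt templates.
STEM_PROMPT = (
    "Solve the following problem carefully and show a rigorous derivation.\n"
    "Problem:\n{problem}\nProof:\n"
)
LOGIC_PROMPT = "### Instruction:\n{instruction}\n### Response:\n"

def build_prompt(text: str, mode: str = "logic") -> str:
    """Wrap a query in the matching trained format."""
    if mode == "stem":
        return STEM_PROMPT.format(problem=text)
    if mode == "logic":
        return LOGIC_PROMPT.format(instruction=text)
    raise ValueError(f"unknown mode: {mode}")
```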
Intended Uses
Good for: On-device logical inference and STEM reasoning, mobile/edge/IoT deployment, formal reasoning tasks, educational tutoring, embedded inference pipelines, anywhere you need structured reasoning under 1GB.
Not for: Formal proof verification, safety-critical systems, complex multi-step proofs beyond model capacity, or long-context tasks beyond 1024 tokens.
Limitations
1.2B hybrid model. The SSM components give excellent inference speed but the model has hard capacity limits. Trained on 2,802 STEM samples (smaller than the 6,122 used for Qwen3 variants). Logical inference strongest on propositional logic patterns in the training data. Complex nested quantifiers may exceed capacity. Always verify.
Related Models
| Model | Description |
|---|---|
| LFM2.5-1.2B-Distilled | Stage 1 only — pure STEM backbone |
| LFM2.5-1.2B-Distilled-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-Coder-Distilled-SFT | Transformer variant, Coder teacher + logical inference |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Transformer variant, Instruct teacher + legal SFT |
Citation
```bibtex
@misc{colca2026lfmsft,
  title={Hybrid SSM/Attention Distillation + Logical Inference: LFM2-24B to LFM2.5-1.2B},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```
References
Santiago Ontañón. "LogicInference: A Large-Scale Dataset for Logical Inference." ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
From the Convergent Intelligence Portfolio
DistilQwen Collection — Our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU. This collection shows what happens when you give the methodology real hardware.
Top model: Qwen3-1.7B-Coder-Distilled-SFT — 508 downloads
Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)
Convergent Intelligence LLC: Research Division
"Where classical analysis fails to see, we begin."
Convergent Intelligence Portfolio
Part of the Liquid Foundation Model Series by Convergent Intelligence LLC: Research Division
Top Models from Our Lab
Total Portfolio: 41 models | 2,781 total downloads
Last updated: 2026-03-28 12:47 UTC