LFM2.5-1.2B-Distilled-SFT

A 1.2B hybrid model (SSM + attention) built in two stages: knowledge distillation from a 24B MoE hybrid teacher on STEM chain-of-thought data, then supervised fine-tuning on logical inference. The first proof-weighted distillation + SFT pipeline on a non-transformer architecture.

Liquid Foundation Models run at 239 tok/s on AMD CPU and fit under 1GB of RAM. This model adds structured STEM reasoning and formal logical inference to that efficiency substrate.

"Structure beats scale, collaboration beats hierarchy, observation beats theory." — Convergent Intelligence LLC: Research Division

Training Pipeline

Stage 1: Knowledge Distillation (STEM Reasoning Backbone)

LFM2.5-1.2B was distilled from LFM2-24B-A2B, a 24B MoE hybrid (SSM + attention) with only 2B active parameters per token. Teacher and student share the LFM hybrid architecture, so the KL-divergence loss transfers reasoning patterns between architecturally compatible models.

Data: 2,802 STEM CoT samples from 5 domains:

| Domain | Samples |
|---|---|
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Classical Mechanics | 343 |

All datasets are from 0xZee. This is a deliberately focused subset: core mathematical reasoning domains that share the most structural overlap with logical inference.

Loss function:

  1. Proof-Weighted Cross-Entropy (55%): token weight annealed from 2.5x to 1.5x on derivation tokens
  2. Knowledge Distillation KL Divergence (45%): temperature T = 2.0, loss scaled by T²
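The two-term objective can be sketched as follows. The 55/45 mix, T = 2.0, the T² scaling, and the 2.5x proof weight come from this card; the tensor shapes, the `proof_mask` input, and the function name are illustrative assumptions, not the actual training code.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, proof_mask,
                 proof_weight=2.5, T=2.0, ce_mix=0.55, kd_mix=0.45):
    """Sketch of the Stage 1 loss: proof-weighted CE + temperature-scaled KD.

    student_logits, teacher_logits: [batch, seq, vocab]
    labels: [batch, seq] token ids; proof_mask: [batch, seq] 1.0 on derivation tokens.
    """
    # Proof-weighted cross-entropy: derivation tokens count proof_weight-x.
    ce = F.cross_entropy(student_logits.transpose(1, 2), labels, reduction="none")
    token_w = 1.0 + (proof_weight - 1.0) * proof_mask  # 1.0 or proof_weight
    ce = (ce * token_w).sum() / token_w.sum()

    # KD term: KL(teacher_T || student_T), scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures (Hinton et al. convention).
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(log_p_s, p_t, reduction="batchmean") * (T ** 2)

    return ce_mix * ce + kd_mix * kd
```

The proof weight is annealed from 2.5 to 1.5 over training; a schedule would simply pass a decaying `proof_weight` per step.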

Training format:

Solve the following problem carefully and show a rigorous derivation.

Problem:
{question}

Proof:
{CoT}

Final Answer:
{response}

Stage 1 hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
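A minimal sketch of that schedule: cosine decay from 1.5e-5 down to a 1e-6 floor over the total step count. Warmup is not listed on the card and is omitted here; the function name is ours.

```python
import math

def cosine_lr(step, total_steps, lr_max=1.5e-5, lr_min=1e-6):
    """Cosine-annealed learning rate with a floor, as used in Stage 1."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```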

Stage 2: Logical Inference SFT

Fine-tuned on KK04/LogicInference_OA, a reproduction of the LogicInference dataset (Santiago Ontañón, Google Research) formatted for instruction following. IID split, LOGICINFERENCEe format (inference steps first, answer at the end). 5,491 unique inference problems, extended to ~54,607 instruction-response pairs.

Why logical inference on a hybrid architecture? SSM components excel at sequential state propagation — exactly what formal logical inference requires. Each premise updates a logical state, and the conclusion follows from the final state. The hybrid architecture's inductive bias naturally aligns with propositional logic chains. SFT activates this alignment explicitly.
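A toy illustration (not model code) of that "each premise updates a logical state" view: forward chaining over propositional implications. The known facts are the state; each pass applies modus ponens until a fixed point.

```python
def forward_chain(facts, rules):
    """facts: set of true atoms; rules: list of (premise, conclusion) pairs."""
    state = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in state and conclusion not in state:
                state.add(conclusion)  # modus ponens: from p and p->q, infer q
                changed = True
    return state

# With p, p->q, q->r the chain yields q and then r: {'p', 'q', 'r'}.
forward_chain({"p"}, [("p", "q"), ("q", "r")])
```

This is exactly the premise chain in the usage example below ("If p then q. If q then r. p is true."), which the model is trained to resolve in text.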

Training format:

### Instruction:
{instruction}

### Response:
{response}

Stage 2 hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (conservative to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |

Model Details

| Attribute | Value |
|---|---|
| Architecture | LFM2.5 (hybrid SSM + attention) |
| Parameters | 1.2B |
| Base model | liquid/LFM2.5-1.2B-Instruct |
| Teacher model | liquid/LFM2-24B-A2B |
| Stage 1 data | 2,802 STEM CoT samples (5 datasets) |
| Stage 2 data | KK04/LogicInference_OA (~54,607 pairs) |
| Inference | 239 tok/s AMD CPU, 82 tok/s mobile NPU, sub-1GB RAM |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Logical inference (Stage 2)
prompt = """### Instruction:
Consider the following premises: If p then q. If q then r. p is true. What can we infer?

### Response:
"""

# STEM derivation (the Stage 1 format still works after SFT)
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.

Problem:
Solve the system of linear equations: 2x + y = 5, x - y = 1.

Proof:
"""

# Swap in prompt_stem to run the STEM format instead.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

GGUF

Quantized versions at reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT-GGUF.

Prompt Formats

STEM derivation (Stage 1):

Solve the following problem carefully and show a rigorous derivation.

Problem:
[Your problem]

Proof:

Logical inference (Stage 2):

### Instruction:
[Your question or logical inference problem]

### Response:
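Both templates are easy to build programmatically. This helper sketch reproduces the formats above verbatim; the templates come from this card, but the function names are ours.

```python
def stem_prompt(problem: str) -> str:
    """Stage 1 format: rigorous-derivation framing, ending at 'Proof:'."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n\n"
        f"Problem:\n{problem}\n\n"
        "Proof:\n"
    )

def logic_prompt(instruction: str) -> str:
    """Stage 2 format: Alpaca-style instruction/response framing."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"
```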

Intended Uses

Good for: On-device logical inference and STEM reasoning, mobile/edge/IoT deployment, formal reasoning tasks, educational tutoring, embedded inference pipelines, anywhere you need structured reasoning under 1GB.

Not for: Formal proof verification, safety-critical systems, complex multi-step proofs beyond model capacity, or long-context tasks beyond 1024 tokens.

Limitations

1.2B hybrid model. The SSM components give excellent inference speed but the model has hard capacity limits. Trained on 2,802 STEM samples (smaller than the 6,122 used for Qwen3 variants). Logical inference strongest on propositional logic patterns in the training data. Complex nested quantifiers may exceed capacity. Always verify.

Related Models

| Model | Description |
|---|---|
| LFM2.5-1.2B-Distilled | Stage 1 only: pure STEM backbone |
| LFM2.5-1.2B-Distilled-SFT-GGUF | This model, quantized for edge deployment |
| Qwen3-1.7B-Coder-Distilled-SFT | Transformer variant, Coder teacher + logical inference |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Transformer variant, Instruct teacher + legal SFT |

Citation

@misc{colca2026lfmsft,
  title={Hybrid SSM/Attention Distillation + Logical Inference: LFM2-24B to LFM2.5-1.2B},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/LFM2.5-1.2B-Distilled-SFT},
  note={Convergent Intelligence LLC: Research Division}
}

References

Santiago Ontañón. "LogicInference: A Large-Scale Dataset for Logical Inference." ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.



From the Convergent Intelligence Portfolio

DistilQwen Collection — Our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU. This collection shows what happens when you give the methodology real hardware.

Top model: Qwen3-1.7B-Coder-Distilled-SFT — 508 downloads

Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)

Convergent Intelligence LLC: Research Division
"Where classical analysis fails to see, we begin."


Part of the Liquid Foundation Model Series by Convergent Intelligence LLC: Research Division.
