Qwen3-1.7B-Coder-Distilled-SFT
A 1.7B model built in two stages: knowledge distillation from a 30B Coder teacher to establish a structured reasoning backbone, then supervised fine-tuning on ~54,600 logical inference problems, where the Coder teacher's decomposition patterns meet formal propositional logic.
The hypothesis: a model that learned STEM derivation from a Coder teacher (Stage 1) already has latent structure for sequential logic, state tracking, and compositional reasoning. Logical inference SFT (Stage 2) activates that structure explicitly — the model doesn't learn logic from scratch, it surfaces what the Coder teacher already gave it.
"Structure beats scale, collaboration beats hierarchy, observation beats theory." — Convergent Intelligence LLC: Research Division
Training Pipeline
Stage 1: Coder Teacher Knowledge Distillation (STEM Reasoning Backbone)
Qwen3-1.7B distilled from Qwen3-Coder-30B-A3B-Instruct — the coding-specialized variant of the 30B MoE architecture. Same STEM training data as the Instruct-teacher variants, but different teacher brain.
Why a Coder teacher? At distillation temperature T=2.0, the KL divergence transfers the teacher's full probability landscape — not just domain knowledge, but how the teacher organizes reasoning. The Coder variant organizes reasoning through precise sequential logic, explicit state tracking, and compositional decomposition. These are the same capabilities that make mathematical derivations rigorous and logical inference sound.
Data: 6,122 STEM chain-of-thought samples across 12 domains from 0xZee:
| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |
Loss function:
- Proof-Weighted Cross-Entropy (55%) — weight on derivation tokens annealed from 2.5x down to 1.5x over training
- Knowledge Distillation KL Divergence (45%) — T=2.0, scaled by T²
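The combined objective can be sketched in plain Python on toy single-token logits. This is an illustrative sketch, not the released training code: the function names are hypothetical, and it follows the standard soft-target KD recipe with the 55/45 mix, T=2.0, and T² scaling stated above.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_kl(student_logits, teacher_logits, T=2.0):
    # Soft-target KL(teacher || student) at temperature T, scaled by T^2
    # so its gradient magnitude stays comparable to the hard-label loss.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (T ** 2) * kl

def proof_weighted_ce(student_logits, target_idx, weight=2.5):
    # Hard-label cross-entropy, up-weighted on derivation ("Proof:") tokens;
    # the weight anneals from 2.5 to 1.5 over training.
    q = softmax(student_logits)
    return -weight * math.log(q[target_idx])

def combined_loss(student_logits, teacher_logits, target_idx,
                  ce_w=0.55, kd_w=0.45, proof_weight=2.5, T=2.0):
    # 55% proof-weighted CE + 45% temperature-scaled distillation KL.
    return (ce_w * proof_weighted_ce(student_logits, target_idx, proof_weight)
            + kd_w * kd_kl(student_logits, teacher_logits, T))
```

When student and teacher logits agree, the KD term vanishes and only the weighted cross-entropy drives the update.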
Stage 1 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
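The learning-rate row above describes a standard cosine decay from 1.5e-5 to 1e-6 over the single epoch; a minimal sketch (the helper name and step indexing are illustrative, not the training code):

```python
import math

def cosine_lr(step, total_steps, lr_max=1.5e-5, lr_min=1e-6):
    # Cosine decay from lr_max at step 0 to lr_min at the final step.
    progress = step / max(total_steps - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```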
Training format:
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
```
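The template above can be assembled with a small helper; at inference time the derivation and answer are left blank so the model continues after "Proof:". The helper is hypothetical, not part of the released code:

```python
def stage1_prompt(question: str, cot: str = "", answer: str = "") -> str:
    # Build the Stage 1 template. With cot/answer empty (inference),
    # the prompt ends at "Proof:" and the model generates the derivation.
    text = (
        "Solve the following problem carefully and show a rigorous derivation.\n"
        f"Problem:\n{question}\nProof:\n"
    )
    if cot:
        text += f"{cot}\nFinal Answer:\n{answer}"
    return text
```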
Stage 2: Logical Inference SFT
The distilled model was fine-tuned on KonstantinDob/logic_inference_dataset — ~54,607 instruction-response pairs covering propositional logic, logical entailment, and formal inference.
About the dataset: Reproduced from the LogicInference paper (Santiago Ontañón, Google Research). Uses the IID split only, in the answer-at-end format: the model performs the logical inference first, then gives the final answer at the end. 5,491 unique inference problems are extended to ~54,607 instruction-response pairs. Three columns: INSTRUCTION, RESPONSE, SOURCE.
Why logical inference after Coder-distilled STEM? The Coder teacher gave the model structured decomposition patterns. The STEM data taught it to apply those patterns to derivations. Logical inference SFT takes the next step: formal propositional logic with explicit premises, inference rules, and conclusions. This is the most natural downstream task for a Coder-distilled reasoner — it's making the implicit structure explicit.
Training format:
```
### Instruction:
{instruction}

### Response:
{response}
```
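The Stage 2 template is the familiar Alpaca-style instruction format and can be built the same way (the helper is illustrative, not the released code):

```python
def stage2_prompt(instruction: str, response: str = "") -> str:
    # Build the Stage 2 template. With response empty (inference),
    # generation starts right after "### Response:".
    return f"### Instruction:\n{instruction}\n### Response:\n{response}"
```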
Stage 2 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |
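The Stage 2 table maps onto a trainer configuration like the following. This is a sketch mirroring the hyperparameters above; the key names follow `transformers.TrainingArguments` conventions, and the split of the effective batch into per-device batch times accumulation steps is an assumption:

```python
# Hypothetical Stage 2 configuration, mirroring the hyperparameter table.
stage2_config = {
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,   # assumed split of the
    "gradient_accumulation_steps": 8,   # effective batch size of 8
    "learning_rate": 5e-6,              # below Stage 1's 1.5e-5 to preserve the backbone
    "gradient_checkpointing": True,
    "bf16": True,
}
```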
Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | ~2B (1.7B advertised) |
| Base model | Qwen/Qwen3-1.7B |
| Teacher model | Qwen/Qwen3-Coder-30B-A3B-Instruct |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | KonstantinDob/logic_inference_dataset (~54,607 pairs) |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Logical inference (Stage 2 format)
prompt = """### Instruction:
Consider the following premises: For all x, if x is a cat then x is a mammal. Whiskers is a cat. What can we infer?
### Response:
"""

# STEM derivation (Stage 1 format still works) — swap this in for `prompt` below
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Prove that the composition of two injective functions is injective.
Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
GGUF
Quantized versions at reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF.
Prompt Formats
STEM derivation (Stage 1):
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your problem]
Proof:
```
Logical inference / instruction-following (Stage 2):
```
### Instruction:
[Your question or logical inference problem]

### Response:
```
Intended Uses
Good for: Logical inference, propositional logic, formal reasoning, STEM derivation, structured argumentation, educational tutoring, component in verification pipelines, edge deployment via GGUF.
Not for: General code generation (the Coder teacher influence is structural, not functional — use a dedicated code model), formal proof verification (use Lean/Coq), safety-critical analysis, or tasks requiring long context beyond 1024 tokens.
Limitations
1.7B model. Produces structured reasoning but can generate fluent incorrect logic. The Coder teacher gives structural decomposition, not code generation capability. Logical inference performance is strongest on propositional logic patterns represented in the training data. Complex multi-step inferences with many quantifiers may exceed the model's capacity. Always verify.
Related Models
| Model | Description |
|---|---|
| Qwen3-1.7B-Coder-Distilled | Stage 1 only — pure STEM backbone with Coder teacher |
| Qwen3-1.7B-Coder-Distilled-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Instruct teacher + legal SFT variant |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT | 0.6B Thinking teacher + legal SFT |
Citation
```bibtex
@misc{colca2026codersft,
  title={Coder-Distilled Logical Inference: Cross-Domain Structure Transfer
         from Code to Formal Reasoning},
  author={Colca, Roy S.},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```
References
Santiago Ontañón. "LogicInference: A Large-Scale Dataset for Logical Inference." ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
Convergent Intelligence LLC: Research Division "Where classical analysis fails to see, we begin."
Convergent Intelligence Portfolio
Part of the Qwen3 Coder Series by Convergent Intelligence LLC: Research Division
Related Models
| Model | Downloads | Format |
|---|---|---|
| Qwen3-1.7B-Coder-Distilled-SFT-GGUF | 194 | GGUF |
Top Models from Our Lab
| Model | Downloads |
|---|---|
| Qwen3-1.7B-Thinking-Distil | 501 |
| LFM2.5-1.2B-Distilled-SFT | 342 |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | 203 |
| Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF | 175 |
| SMOLM2Prover-GGUF | 150 |
Total Portfolio: 41 models | 2,781 total downloads
Last updated: 2026-03-28 12:48 UTC