Qwen3-1.7B-Coder-Distilled-SFT

A 1.7B model built in two stages: knowledge distillation from a 30B Coder teacher to establish a structured reasoning backbone, then supervised fine-tuning on ~54,600 logical inference problems. The result pairs the Coder teacher's decomposition patterns with formal propositional logic.

The hypothesis: a model that learned STEM derivation from a Coder teacher (Stage 1) already has latent structure for sequential logic, state tracking, and compositional reasoning. Logical inference SFT (Stage 2) activates that structure explicitly: the model doesn't learn logic from scratch; it surfaces what the Coder teacher already gave it.

"Structure beats scale, collaboration beats hierarchy, observation beats theory." — Convergent Intelligence LLC: Research Division

Training Pipeline

Stage 1: Coder Teacher Knowledge Distillation (STEM Reasoning Backbone)

Qwen3-1.7B distilled from Qwen3-Coder-30B-A3B-Instruct — the coding-specialized variant of the 30B MoE architecture. Same STEM training data as the Instruct-teacher variants, but different teacher brain.

Why a Coder teacher? At distillation temperature T=2.0, the KL divergence transfers the teacher's full probability landscape — not just domain knowledge, but how the teacher organizes reasoning. The Coder variant organizes reasoning through precise sequential logic, explicit state tracking, and compositional decomposition. These are the same capabilities that make mathematical derivations rigorous and logical inference sound.
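A toy illustration (not the training code) of why the temperature matters: raising T flattens the teacher's softmax, so the student sees the teacher's relative ranking over all tokens rather than a near-one-hot distribution.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution,
    # exposing the teacher's full probability landscape, not just the argmax.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 2.0, 1.0, 0.5]   # hypothetical token logits

hard = softmax(teacher_logits, T=1.0)
soft = softmax(teacher_logits, T=2.0)   # T=2.0, as used in Stage 1

# At T=2.0 the low-probability tokens ("dark knowledge") carry more mass,
# so the KL term transfers structure the hard labels would hide.
```
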

Data: 6,122 STEM chain-of-thought samples across 12 domains from 0xZee:

| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |

Loss function:

  1. Proof-Weighted Cross-Entropy (55%): weight on derivation tokens annealed from 2.5x to 1.5x
  2. Knowledge Distillation KL Divergence (45%): T=2.0, scaled by T²
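A per-token sketch of how these two terms could combine, assuming the standard Hinton-style formulation (the real implementation operates on batched tensors; the 0.55/0.45 split, T=2.0, and the annealed proof weight come from the card, everything else is illustrative):

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, target_idx, proof_weight,
                 T=2.0, ce_frac=0.55, kd_frac=0.45):
    # Proof-weighted cross-entropy: derivation tokens get an elevated
    # weight (2.5 annealed toward 1.5 over training, per the card).
    p_student = softmax(student_logits)
    ce = -proof_weight * math.log(p_student[target_idx])

    # KL(teacher || student) at temperature T, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    pt = softmax(teacher_logits, T=T)
    ps = softmax(student_logits, T=T)
    kl = sum(t * math.log(t / s) for t, s in zip(pt, ps))

    return ce_frac * ce + kd_frac * (T ** 2) * kl

loss = distill_loss([2.0, 0.5, -1.0], [2.5, 0.2, -0.8],
                    target_idx=0, proof_weight=2.5)
```

When student and teacher logits agree, the KL term vanishes and only the weighted CE remains, which is the sanity check to run first.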

Stage 1 hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
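The cosine decay from 1.5e-5 to 1e-6 can be sketched as below (a minimal closed form; whether Stage 1 used warmup is not stated in the card, so none is shown):

```python
import math

def cosine_lr(step, total_steps, lr_max=1.5e-5, lr_min=1e-6):
    # Cosine decay from lr_max to lr_min over the full run.
    progress = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# 5,815 samples at effective batch size 8 -> ~727 optimizer steps in one epoch.
total_steps = 5815 // 8
start_lr = cosine_lr(0, total_steps)            # ~1.5e-5
end_lr = cosine_lr(total_steps, total_steps)    # ~1e-6
```
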

Training format:

```
Solve the following problem carefully and show a rigorous derivation.

Problem:
{question}

Proof:
{CoT}

Final Answer:
{response}
```
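Assembling this template in code (field names follow the card; the exact whitespace is an assumption):

```python
def build_stage1_prompt(question: str, cot: str = "", answer: str = "") -> str:
    # With cot/answer empty, this yields the inference-time prompt that
    # ends at "Proof:"; with both filled, the full training string.
    text = (
        "Solve the following problem carefully and show a rigorous derivation.\n\n"
        f"Problem:\n{question}\n\nProof:\n"
    )
    if cot:
        text += f"{cot}\n\nFinal Answer:\n{answer}"
    return text

prompt = build_stage1_prompt("Prove that sqrt(2) is irrational.")
```
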

Stage 2: Logical Inference SFT

The distilled model was fine-tuned on KonstantinDob/logic_inference_dataset — ~54,607 instruction-response pairs covering propositional logic, logical entailment, and formal inference.

About the dataset: Reproduced from the LogicInference paper (Santiago Ontañón, Google Research). Uses the IID split only, with the LOGICINFERENCE answer format: the model performs logical inference first, then gives the final answer at the end. 5,491 unique inference problems extended to ~54,607 instruction-response pairs. Three columns: INSTRUCTION, RESPONSE, SOURCE.

Why logical inference after Coder-distilled STEM? The Coder teacher gave the model structured decomposition patterns. The STEM data taught it to apply those patterns to derivations. Logical inference SFT takes the next step: formal propositional logic with explicit premises, inference rules, and conclusions. This is the most natural downstream task for a Coder-distilled reasoner — it's making the implicit structure explicit.

Training format:

```
### Instruction:
{instruction}

### Response:
{response}
```
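A sketch of mapping one dataset row (the card's INSTRUCTION/RESPONSE columns) to this Alpaca-style training string:

```python
def format_stage2(example: dict) -> str:
    # Column names INSTRUCTION / RESPONSE come from the dataset card;
    # trailing whitespace handling is an assumption.
    return (
        f"### Instruction:\n{example['INSTRUCTION']}\n\n"
        f"### Response:\n{example['RESPONSE']}"
    )

row = {
    "INSTRUCTION": "All cats are mammals. Whiskers is a cat. What can we infer?",
    "RESPONSE": "Whiskers is a mammal.",
}
text = format_stage2(row)
```
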

Stage 2 hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |

Model Details

| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | ~2B (1.7B advertised) |
| Base model | Qwen/Qwen3-1.7B |
| Teacher model | Qwen/Qwen3-Coder-30B-A3B-Instruct |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | KonstantinDob/logic_inference_dataset (~54,607 pairs) |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Logical inference (Stage 2 format)
prompt = """### Instruction:
Consider the following premises: For all x, if x is a cat then x is a mammal. Whiskers is a cat. What can we infer?

### Response:
"""

# STEM derivation (Stage 1 format still works); swap prompt_stem in
# for `prompt` below to exercise it.
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.

Problem:
Prove that the composition of two injective functions is injective.

Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

GGUF

Quantized versions at reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF.

Prompt Formats

STEM derivation (Stage 1):

```
Solve the following problem carefully and show a rigorous derivation.

Problem:
[Your problem]

Proof:
```

Logical inference / instruction-following (Stage 2):

```
### Instruction:
[Your question or logical inference problem]

### Response:
```

Intended Uses

Good for: Logical inference, propositional logic, formal reasoning, STEM derivation, structured argumentation, educational tutoring, component in verification pipelines, edge deployment via GGUF.

Not for: General code generation (the Coder teacher influence is structural, not functional — use a dedicated code model), formal proof verification (use Lean/Coq), safety-critical analysis, or tasks requiring long context beyond 1024 tokens.

Limitations

1.7B model. Produces structured reasoning but can generate fluent incorrect logic. The Coder teacher gives structural decomposition, not code generation capability. Logical inference performance is strongest on propositional logic patterns represented in the training data. Complex multi-step inferences with many quantifiers may exceed the model's capacity. Always verify.

Related Models

| Model | Description |
|---|---|
| Qwen3-1.7B-Coder-Distilled | Stage 1 only — pure STEM backbone with Coder teacher |
| Qwen3-1.7B-Coder-Distilled-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Instruct teacher + legal SFT variant |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT | 0.6B Thinking teacher + legal SFT |

Citation

```bibtex
@misc{colca2026codersft,
  title={Coder-Distilled Logical Inference: Cross-Domain Structure Transfer
         from Code to Formal Reasoning},
  author={Colca, Roy S.},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```

References

Santiago Ontañón. "LogicInference: A Large-Scale Dataset for Logical Inference." ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.


Convergent Intelligence LLC: Research Division "Where classical analysis fails to see, we begin."


Convergent Intelligence Portfolio

Part of the Qwen3 Coder Series by Convergent Intelligence LLC: Research Division
