Qwen3-1.7B-Coder-Distilled-SFT
A 1.7B model built in two stages: knowledge distillation from a 30B Coder teacher to establish a structured reasoning backbone, then supervised fine-tuning on ~54,600 logical inference problems, where the Coder teacher's decomposition patterns meet formal propositional logic.
The hypothesis: a model that learned STEM derivation from a Coder teacher (Stage 1) already has latent structure for sequential logic, state tracking, and compositional reasoning. Logical inference SFT (Stage 2) activates that structure explicitly — the model doesn't learn logic from scratch, it surfaces what the Coder teacher already gave it.
"Structure beats scale, collaboration beats hierarchy, observation beats theory." — Convergent Intelligence LLC: Research Division
Training Pipeline
Stage 1: Coder Teacher Knowledge Distillation (STEM Reasoning Backbone)
Qwen3-1.7B distilled from Qwen3-Coder-30B-A3B-Instruct — the coding-specialized variant of the 30B MoE architecture. Same STEM training data as the Instruct-teacher variants, but different teacher brain.
Why a Coder teacher? At distillation temperature T=2.0, the KL divergence transfers the teacher's full probability landscape — not just domain knowledge, but how the teacher organizes reasoning. The Coder variant organizes reasoning through precise sequential logic, explicit state tracking, and compositional decomposition. These are the same capabilities that make mathematical derivations rigorous and logical inference sound.
Data: 6,122 STEM chain-of-thought samples across 12 domains from 0xZee:
| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |
Loss function:
- Proof-Weighted Cross-Entropy (55%) — weight on derivation tokens annealed from 2.5x down to 1.5x over training
- Knowledge Distillation KL Divergence (45%) — T=2.0, scaled by T²
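The combined objective can be sketched in plain Python on toy single-token logits. This is an illustrative sketch, not the released training code: the function names are hypothetical, and it follows the standard soft-target KD recipe with the 55/45 mix, T=2.0, and T² scaling stated above.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_kl(student_logits, teacher_logits, T=2.0):
    # Soft-target KL(teacher || student) at temperature T, scaled by T^2
    # so its gradient magnitude stays comparable to the hard-label loss.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (T ** 2) * kl

def proof_weighted_ce(student_logits, target_idx, weight=2.5):
    # Hard-label cross-entropy, up-weighted on derivation ("Proof:") tokens;
    # the weight anneals from 2.5 to 1.5 over training.
    q = softmax(student_logits)
    return -weight * math.log(q[target_idx])

def combined_loss(student_logits, teacher_logits, target_idx,
                  ce_w=0.55, kd_w=0.45, proof_weight=2.5, T=2.0):
    # 55% proof-weighted CE + 45% temperature-scaled distillation KL.
    return (ce_w * proof_weighted_ce(student_logits, target_idx, proof_weight)
            + kd_w * kd_kl(student_logits, teacher_logits, T))
```

When student and teacher logits agree, the KD term vanishes and only the weighted cross-entropy drives the update.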
Stage 1 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
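The learning-rate row above describes a standard cosine decay from 1.5e-5 to 1e-6 over the single epoch; a minimal sketch (the helper name and step indexing are illustrative, not the training code):

```python
import math

def cosine_lr(step, total_steps, lr_max=1.5e-5, lr_min=1e-6):
    # Cosine decay from lr_max at step 0 to lr_min at the final step.
    progress = step / max(total_steps - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```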
Training format:
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
```
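The template above can be assembled with a small helper; at inference time the derivation and answer are left blank so the model continues after "Proof:". The helper is hypothetical, not part of the released code:

```python
def stage1_prompt(question: str, cot: str = "", answer: str = "") -> str:
    # Build the Stage 1 template. With cot/answer empty (inference),
    # the prompt ends at "Proof:" and the model generates the derivation.
    text = (
        "Solve the following problem carefully and show a rigorous derivation.\n"
        f"Problem:\n{question}\nProof:\n"
    )
    if cot:
        text += f"{cot}\nFinal Answer:\n{answer}"
    return text
```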
Stage 2: Logical Inference SFT
The distilled model was fine-tuned on KonstantinDob/logic_inference_dataset — ~54,607 instruction-response pairs covering propositional logic, logical entailment, and formal inference.
About the dataset: Reproduced from the LogicInference paper (Santiago Ontañón, Google Research). Uses the IID split only, in the answer-at-end format: the model performs the logical inference first, then gives the final answer at the end. 5,491 unique inference problems are extended to ~54,607 instruction-response pairs. Three columns: INSTRUCTION, RESPONSE, SOURCE.
Why logical inference after Coder-distilled STEM? The Coder teacher gave the model structured decomposition patterns. The STEM data taught it to apply those patterns to derivations. Logical inference SFT takes the next step: formal propositional logic with explicit premises, inference rules, and conclusions. This is the most natural downstream task for a Coder-distilled reasoner — it's making the implicit structure explicit.
Training format:
```
### Instruction:
{instruction}

### Response:
{response}
```
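The Stage 2 template is the familiar Alpaca-style instruction format and can be built the same way (the helper is illustrative, not the released code):

```python
def stage2_prompt(instruction: str, response: str = "") -> str:
    # Build the Stage 2 template. With response empty (inference),
    # generation starts right after "### Response:".
    return f"### Instruction:\n{instruction}\n### Response:\n{response}"
```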
Stage 2 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |
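The Stage 2 table maps onto a trainer configuration like the following. This is a sketch mirroring the hyperparameters above; the key names follow `transformers.TrainingArguments` conventions, and the split of the effective batch into per-device batch times accumulation steps is an assumption:

```python
# Hypothetical Stage 2 configuration, mirroring the hyperparameter table.
stage2_config = {
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,   # assumed split of the
    "gradient_accumulation_steps": 8,   # effective batch size of 8
    "learning_rate": 5e-6,              # below Stage 1's 1.5e-5 to preserve the backbone
    "gradient_checkpointing": True,
    "bf16": True,
}
```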
Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | ~2B (1.7B advertised) |
| Base model | Qwen/Qwen3-1.7B |
| Teacher model | Qwen/Qwen3-Coder-30B-A3B-Instruct |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | KonstantinDob/logic_inference_dataset (~54,607 pairs) |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Logical inference (Stage 2 format)
prompt = """### Instruction:
Consider the following premises: For all x, if x is a cat then x is a mammal. Whiskers is a cat. What can we infer?
### Response:
"""

# STEM derivation (Stage 1 format still works) — swap this in for `prompt` below
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Prove that the composition of two injective functions is injective.
Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
GGUF
Quantized versions at reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT-GGUF.
Prompt Formats
STEM derivation (Stage 1):
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your problem]
Proof:
```
Logical inference / instruction-following (Stage 2):
```
### Instruction:
[Your question or logical inference problem]

### Response:
```
Intended Uses
Good for: Logical inference, propositional logic, formal reasoning, STEM derivation, structured argumentation, educational tutoring, component in verification pipelines, edge deployment via GGUF.
Not for: General code generation (the Coder teacher influence is structural, not functional — use a dedicated code model), formal proof verification (use Lean/Coq), safety-critical analysis, or tasks requiring long context beyond 1024 tokens.
Limitations
1.7B model. Produces structured reasoning but can generate fluent incorrect logic. The Coder teacher gives structural decomposition, not code generation capability. Logical inference performance is strongest on propositional logic patterns represented in the training data. Complex multi-step inferences with many quantifiers may exceed the model's capacity. Always verify.
Related Models
| Model | Description |
|---|---|
| Qwen3-1.7B-Coder-Distilled | Stage 1 only — pure STEM backbone with Coder teacher |
| Qwen3-1.7B-Coder-Distilled-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Instruct teacher + legal SFT variant |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT | 0.6B Thinking teacher + legal SFT |
Citation
```bibtex
@misc{colca2026codersft,
  title={Coder-Distilled Logical Inference: Cross-Domain Structure Transfer
         from Code to Formal Reasoning},
  author={Colca, Roy S.},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-1.7B-Coder-Distilled-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```
References
Santiago Ontañón. "LogicInference: A Large-Scale Dataset for Logical Inference." ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
Convergent Intelligence LLC: Research Division "Where classical analysis fails to see, we begin."
Convergent Intelligence Portfolio
Part of the Qwen3 Coder Series by Convergent Intelligence LLC: Research Division
Related Models
| Model | Downloads | Format |
|---|---|---|
| Qwen3-1.7B-Coder-Distilled-SFT-GGUF | 194 | GGUF |
Top Models from Our Lab
| Model | Downloads |
|---|---|
| Qwen3-1.7B-Thinking-Distil | 501 |
| LFM2.5-1.2B-Distilled-SFT | 342 |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | 203 |
| Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF | 175 |
| SMOLM2Prover-GGUF | 150 |
Total Portfolio: 41 models | 2,781 total downloads
Last updated: 2026-03-28 12:48 UTC