Ornstein-26B-A4B-it

A reasoning-focused fine-tune of Google Gemma 4 26B-A4B-it, trained on a small, high-quality dataset curated through a custom Drift Diffusion Modeling (DDM) pipeline. 1,229 carefully selected reasoning traces is all you need when the data quality is high enough.

Quantizations available:

GGUF: DJLougen/Ornstein-26B-A4B-it-GGUF

NVIDIA FP4: DJLougen/Ornstein-26B-A4B-it-NVFP4

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, which means balancing GPU costs against a student budget. If my models have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

Support on Ko-fi

What Makes Ornstein Different

Most reasoning fine-tunes throw large volumes of synthetic data at a base model and hope for the best. Ornstein takes the opposite approach: every single training example passed through a multi-stage quality pipeline that measures whether a reasoning trace is actually reasoning or just generating tokens that look like reasoning.

The core insight is that language models frequently produce degenerate reasoning - long chains of text that superficially resemble deep thought (hedging, restating the problem, circling without progress) but carry little actual signal. The DDM pipeline detects and separates these from genuine premium reasoning traces, producing a training mix that teaches the model what good thinking actually looks like.

Why Gemma 4 26B-A4B

Gemma 4's A4B mixture-of-experts architecture activates only a subset of its 26B parameters per token, giving you strong reasoning capability at a fraction of the inference cost of a dense 26B model. Combined with DDM-curated data, you get a model that reasons well without burning through compute on every forward pass.

Training Data at a Glance

Drift Score Distribution: The DDM pipeline assigns each reasoning trace a drift score. Premium traces (blue) cluster low, degenerate traces (red) cluster high, with the fitted threshold at 1.463 cleanly separating the two pools.
Category Mix: The dataset is math-heavy (1,016 examples), with code (124), science (45), and logic (44) rounding it out.
Reasoning Depth: Premium traces average ~1,263 words of thinking - substantially deeper than the degenerate traces (~281 words), which tend to be shallow repetition that inflates token count without substance.
Difficulty x Pool: Breakdown across difficulty tiers. The degenerate pool skews toward hard problems where models are most likely to loop or stall.

DDM Curation Pipeline

The dataset was built using a hybrid curation approach that combines two complementary methods, each doing what it does best:

Classification (AUC): ROC/AUC analysis with Youden's J threshold performs the binary premium/degenerate separation. Each reasoning trace is decomposed into uniform segments and scored across multiple quality dimensions - whether the model is introducing new ideas, self-correcting, verifying intermediate results, or just padding. A cumulative drift score tracks how quality evolves across the trace, and traces that cross the deficit threshold get flagged as degenerate.

Weighting (Ornstein-Uhlenbeck process): Rather than treating curation as binary keep/discard, the pipeline models quality evolution as a mean-reverting stochastic process. This is the key insight: a trace that dips in quality but self-corrects should not be penalized the same as one that sustains a collapse. The OU process generates per-segment loss weights that tell the trainer how much each part of the trace should contribute to learning. Premium segments get high weight; degenerate segments get suppressed. The process parameters were fitted via maximum likelihood estimation over 1,152 hyperparameter configurations.
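To make the weighting idea concrete, here is a minimal sketch, not the pipeline's actual formulation. It assumes a noise-free, discretized mean-reverting update in which the accumulated deficit decays by `exp(-theta)` per segment. The function name `ou_segment_weights`, the per-segment `deficits` input, and the exponential mapping from state to weight are all hypothetical choices made for illustration.

```python
import math

def ou_segment_weights(deficits, theta=1.0, mu=0.0):
    """Map per-segment quality deficits to loss weights in (0, 1].

    Hypothetical illustration: `deficits` are non-negative numbers where a
    larger value means a lower-quality segment.
    """
    state = mu
    weights = []
    for d in deficits:
        # Mean reversion: the previous excursion from mu decays by exp(-theta),
        # then the current segment's deficit is added on top.
        state = mu + (state - mu) * math.exp(-theta) + d
        # Assumed mapping: a larger accumulated deficit gives a smaller weight.
        weights.append(math.exp(-max(state - mu, 0.0)))
    return weights

# A trace that dips once and recovers vs. one that sustains a collapse.
recovering = ou_segment_weights([0.0, 1.0, 0.0, 0.0, 0.0])
collapsing = ou_segment_weights([0.0, 1.0, 1.0, 1.0, 1.0])
print([round(w, 3) for w in recovering])
print([round(w, 3) for w in collapsing])
```

Under these assumptions the recovering trace climbs back toward full weight after the dip, while the collapsing trace stays suppressed, which is the behavior the paragraph above describes.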

Pipeline performance:

AUC: 0.9705 - near-perfect separation between premium and degenerate traces
99.49% sensitivity - catches virtually all degenerate reasoning
~5% false positive rate - rarely misclassifies genuine reasoning
Only 11 hard negatives in the entire evaluation set
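For readers unfamiliar with these metrics, the sketch below shows how AUC, a Youden's J threshold, sensitivity, and false positive rate can be computed from drift scores. It uses NumPy only and made-up toy scores; the helper names are hypothetical and this is not the pipeline's own code. It assumes, as in the list above, that "degenerate" is the positive class and that higher drift scores indicate degenerate traces.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (thresholds, tpr, fpr) for the rule 'score >= threshold => degenerate'."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)      # True = degenerate (positive class)
    thresholds = np.unique(scores)[::-1]         # candidate cut-offs, descending
    pos, neg = labels.sum(), (~labels).sum()
    tpr = np.array([(scores[labels] >= t).sum() / pos for t in thresholds])
    fpr = np.array([(scores[~labels] >= t).sum() / neg for t in thresholds])
    return thresholds, tpr, fpr

def auc_trapezoid(tpr, fpr):
    """Area under the ROC curve via the trapezoid rule, starting from (0, 0)."""
    f = np.concatenate(([0.0], fpr))
    t = np.concatenate(([0.0], tpr))
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))

def youden_threshold(thresholds, tpr, fpr):
    """Pick the threshold maximizing J = sensitivity + specificity - 1 = TPR - FPR."""
    best = int(np.argmax(tpr - fpr))
    return float(thresholds[best]), float(tpr[best]), float(fpr[best])

# Toy drift scores (illustrative only, not from the real dataset).
premium_scores = [0.2, 0.5, 0.9, 1.1, 1.6]
degenerate_scores = [1.5, 1.8, 2.2, 2.9]
scores = premium_scores + degenerate_scores
labels = [False] * len(premium_scores) + [True] * len(degenerate_scores)

thresholds, tpr, fpr = roc_points(scores, labels)
auc = auc_trapezoid(tpr, fpr)
thr, sens, fpr_at_thr = youden_threshold(thresholds, tpr, fpr)
print(f"AUC={auc:.3f} threshold={thr} sensitivity={sens:.2f} FPR={fpr_at_thr:.2f}")
```

On the toy data, one premium trace scores above the lowest degenerate trace, so the threshold that catches every degenerate trace also flags that one premium trace as a false positive.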

The final training mix is ratio-clamped to prevent degenerate patterns from destabilizing convergence: 799 premium traces + 430 selected degenerate traces = 1,229 total examples.
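One way to picture ratio clamping is a cap on how many degenerate traces are admitted per premium trace. The sketch below is an assumption about the mechanism, not the actual selection code: `clamp_mix` and `max_ratio` are hypothetical names, and the degenerate pool is assumed to be pre-sorted by whatever selection criterion is used. The only figures taken from the text are the final counts, 430 and 799.

```python
import math

def clamp_mix(premium, degenerate, max_ratio):
    """Keep every premium trace and at most max_ratio * len(premium) degenerate ones.

    Assumes `degenerate` is already ordered so the preferred traces come first.
    """
    cap = math.floor(len(premium) * max_ratio + 1e-9)  # epsilon guards float rounding
    return premium + degenerate[:cap]

# Toy pools (illustrative identifiers only).
premium = [f"p{i}" for i in range(10)]
degenerate = [f"d{i}" for i in range(20)]
mix = clamp_mix(premium, degenerate, max_ratio=0.5)
print(len(mix))

# Ratio implied by the final mix reported above: 430 degenerate to 799 premium.
observed_ratio = 430 / 799
print(round(observed_ratio, 3))
```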

What the Training Data Carries

Each trace in the curated dataset includes per-segment loss weights, a curriculum score for training order, OU trajectory data, and truncation points for degenerate traces. This means the fine-tune isn't just trained on filtered data - it's trained with segment-level precision about which parts of each trace matter most.
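As a rough illustration of what segment-level weighting means at training time, the sketch below defines a hypothetical record layout and expands per-segment weights into per-token weights for a weighted negative log-likelihood. The field names, the `CuratedTrace` class, and both helper functions are assumptions made for illustration; the real dataset schema and trainer integration may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np

@dataclass
class CuratedTrace:
    """Hypothetical schema mirroring the fields described above."""
    segments: List[str]
    segment_weights: List[float]        # per-segment loss weights
    curriculum_score: float             # used to order training examples
    ou_trajectory: List[float]          # OU state per segment
    truncate_at: Optional[int] = None   # segment index; None if not truncated

def token_weights(segment_lengths, segment_weights):
    """Repeat each segment's weight once per token in that segment."""
    return np.repeat(np.asarray(segment_weights, dtype=float), segment_lengths)

def weighted_nll(token_nll, weights):
    """Weighted mean of per-token negative log-likelihoods."""
    token_nll = np.asarray(token_nll, dtype=float)
    return float(np.sum(weights * token_nll) / np.sum(weights))

# Two segments of 2 and 3 tokens; the second segment is fully suppressed.
w = token_weights([2, 3], [1.0, 0.0])
loss = weighted_nll([1.0, 3.0, 10.0, 10.0, 10.0], w)
print(w, loss)
```

In this toy case the suppressed segment contributes nothing to the loss, so only the first segment's tokens are averaged.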

Training Details

Parameter	Value
Base model	`unsloth/gemma-4-26B-A4B-it`
Architecture	Gemma 4 (MoE, 26B total parameters, ~4B active per token)
Method	LoRA (rank 32, alpha 32)
Dropout	0.05
Epochs	1
Learning rate	1e-4 (cosine schedule, 10% warmup)
Max sequence length	8192
Micro batch size	1
Gradient accumulation	4 steps
Weight decay	0.01
LoRA targets	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Packing	Off
Training framework	Unsloth
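The table implies a short run. The arithmetic below estimates the number of optimizer steps and sketches a warmup-plus-cosine learning rate curve from the listed values. It assumes a single device, no dropped final batch, linear warmup, and decay to zero; the actual trainer may count steps or shape the schedule slightly differently.

```python
import math

NUM_EXAMPLES = 1229   # total training examples
MICRO_BATCH = 1
GRAD_ACCUM = 4
PEAK_LR = 1e-4
WARMUP_FRAC = 0.10

effective_batch = MICRO_BATCH * GRAD_ACCUM               # assumes one device
total_steps = math.ceil(NUM_EXAMPLES / effective_batch)  # one epoch
warmup_steps = int(WARMUP_FRAC * total_steps)

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay toward zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

print(effective_batch, total_steps, warmup_steps)
```

With roughly three hundred optimizer steps in total, the run is consistent with the Limitations note that the fine-tune mainly shapes reasoning style.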

Data Quality Metrics

Metric	Value
Total examples	1,229
Mean thinking depth	~1,667 words
Self-correction present	100% of traces
Verification present	100% of traces
Exploration present	100% of traces
Quality gate pass rate	100%

Usage

With Transformers

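from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DJLougen/Ornstein-26B-A4B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Your question here"}]
# add_generation_prompt=True appends the assistant turn header so the model
# starts a reply instead of continuing the user message.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))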

With Unsloth (Recommended for Inference)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="DJLougen/Ornstein-26B-A4B-it",
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

Reasoning Format

Ornstein uses extended thinking before providing final answers. The model produces multi-phase reasoning with self-correction and verification, then delivers a concise response.

Intended Use

Ornstein-26B-A4B-it is designed for tasks that benefit from structured, multi-step reasoning - math, logic, code analysis, scientific problems, and complex question answering. The DDM curation specifically optimizes for traces that demonstrate genuine cognitive progress rather than verbose restating.

Limitations

Single epoch on 1,229 examples means the model retains most of the base Gemma 4 behavior; the fine-tune primarily shapes reasoning style rather than injecting new knowledge
The DDM pipeline optimizes for English reasoning traces; performance on other languages reflects the base model
Extended thinking can still occasionally loop on adversarial or highly ambiguous prompts

License

Apache 2.0

Citation

If you use Ornstein-26B-A4B-it or the DDM curation methodology in your work:

@misc{ornstein26b,
  author = {DJLougen},
  title = {Ornstein-26B-A4B-it: DDM-Curated Reasoning Fine-Tune of Gemma 4 26B-A4B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DJLougen/Ornstein-26B-A4B-it}
}
