Ornstein-122-A10B

A reasoning-focused fine-tune of Qwen 3.5 122B-A10B, trained on a small, high-quality dataset curated through a custom Drift Diffusion Modeling (DDM) pipeline. This is the Mixture-of-Experts variant of the Ornstein series — 122B total parameters with ~10B active per token.

GGUF quantizations are available at DJLougen/Ornstein-122-A10B-gguf

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

Support on Ko-fi

What Makes Ornstein Different

Most reasoning fine-tunes throw large volumes of synthetic data at a base model and hope for the best. Ornstein takes the opposite approach: every single training example passed through a multi-stage quality pipeline that measures whether a reasoning trace is actually reasoning or just generating tokens that look like reasoning.

The core insight is that language models frequently produce degenerate reasoning — long chains of text that superficially resemble deep thought (hedging, restating the problem, circling without progress) but carry little actual signal. The DDM pipeline detects and separates these from genuine premium reasoning traces, producing a training mix that teaches the model what good thinking actually looks like.

Why MoE?

The Qwen 3.5 122B-A10B architecture activates only ~10B parameters per token through its Mixture-of-Experts routing, giving you 122B-class reasoning quality at a fraction of the inference cost. Combined with DDM-curated training data, this produces a model that reasons deeply without burning through compute on every token.
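As a rough illustration of the compute argument, the sketch below applies the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. The rule ignores attention and routing overhead, and the parameter counts are the rounded figures quoted above, so treat the result as an order-of-magnitude estimate rather than a measurement.

```python
# Rule-of-thumb estimate (an assumption, not a benchmark):
# forward-pass FLOPs per token ~= 2 * number of active parameters.
TOTAL_PARAMS = 122e9   # rounded total parameter count
ACTIVE_PARAMS = 10e9   # rounded active parameters per token


def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token for `active_params` parameters."""
    return 2.0 * active_params


moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)  # hypothetical dense 122B model
compute_ratio = dense_flops / moe_flops

print(f"MoE:   {moe_flops:.2e} FLOPs/token")
print(f"Dense: {dense_flops:.2e} FLOPs/token")
print(f"Dense/MoE compute ratio: {compute_ratio:.1f}x")
```

The saving applies to compute only. Every expert still has to be held in memory (or offloaded), so the weight footprint remains that of a 122B-parameter model.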

DDM Curation Pipeline

Drift Diffusion Modeling works by decomposing each reasoning trace into uniform segments and tracking how "reasoning quality" evolves across the trace. Each segment is scored on multiple dimensions that capture whether the model is making genuine cognitive progress: introducing new ideas, self-correcting, verifying intermediate results, and exploring alternative approaches.
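To make the segment-and-score idea concrete, here is a minimal sketch. It is not the actual DDM scorer, whose dimensions and weights are not published here. The helper names `split_into_segments` and `novelty_scores`, the whitespace tokenization, and the single "novelty" dimension are all assumptions chosen for illustration.

```python
def split_into_segments(trace: str, segment_size: int = 50) -> list[list[str]]:
    """Split a trace into consecutive chunks of `segment_size` whitespace tokens.

    The final chunk may be shorter when the trace length is not a multiple
    of `segment_size`.
    """
    tokens = trace.split()
    return [tokens[i:i + segment_size] for i in range(0, len(tokens), segment_size)]


def novelty_scores(segments: list[list[str]]) -> list[float]:
    """Score each segment by the share of its distinct tokens not seen earlier.

    This is a toy stand-in for "introducing new ideas"; a real scorer would
    combine several dimensions.
    """
    seen: set[str] = set()
    scores = []
    for segment in segments:
        vocab = {token.lower() for token in segment}
        fresh = vocab - seen
        scores.append(len(fresh) / len(vocab) if vocab else 0.0)
        seen |= vocab
    return scores


# A trace whose middle segment only repeats the first one.
trace = "a b c d a b c d e f g h"
segments = split_into_segments(trace, segment_size=4)
print(novelty_scores(segments))  # [1.0, 0.0, 1.0]
```

The repeated middle segment scores 0.0 because it contributes no new tokens, which is the kind of signal a looping trace would produce under this toy metric.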

These per-segment scores are accumulated into a drift trajectory. Premium traces maintain healthy trajectories throughout. Degenerate traces accumulate a deficit as the model loops, repeats itself, or pads without substance, until the drift score crosses a threshold fitted via ROC analysis.
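The accumulation and thresholding steps can be sketched as follows. This is an illustrative reading of the description, not the pipeline's code: the `baseline` of 0.5, the cumulative-sum form of the trajectory, and the use of Youden's J statistic (TPR minus FPR) as the ROC criterion are all assumptions, and the function names are hypothetical.

```python
from itertools import accumulate


def drift_trajectory(scores, baseline=0.5):
    """Running sum of (score - baseline); it falls while segments underperform."""
    return list(accumulate(score - baseline for score in scores))


def first_crossing(trajectory, threshold):
    """Index of the first segment at or below `threshold`, or None if never crossed."""
    for index, value in enumerate(trajectory):
        if value <= threshold:
            return index
    return None


def youden_threshold(min_drifts, is_degenerate):
    """Pick the cut-off maximising TPR - FPR for the rule `min drift <= cut-off`."""
    positives = sum(is_degenerate)
    negatives = len(is_degenerate) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one trace of each class")
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(min_drifts)):
        flagged = [value <= cutoff for value in min_drifts]
        tpr = sum(f and d for f, d in zip(flagged, is_degenerate)) / positives
        fpr = sum(f and not d for f, d in zip(flagged, is_degenerate)) / negatives
        if tpr - fpr > best_j:
            best_cutoff, best_j = cutoff, tpr - fpr
    return best_cutoff


# Made-up per-segment scores for one healthy and one looping trace.
premium = drift_trajectory([1.0, 0.75, 0.75, 1.0])
degenerate = drift_trajectory([0.75, 0.25, 0.0, 0.0, 0.25])
print(first_crossing(premium, threshold=-1.0))     # None
print(first_crossing(degenerate, threshold=-1.0))  # 3

# Made-up labelled data: lowest drift value reached by each trace.
cutoff = youden_threshold(
    [-1.25, -1.0, -0.75, 0.0, 0.25, 0.5],
    [True, True, True, False, False, False],
)
print(cutoff)  # -0.75
```

In this toy data the two classes are perfectly separable, so the selected cut-off flags every degenerate trace and no premium one. Real traces would overlap, and the ROC-fitted threshold would trade missed degenerate traces against discarded good ones.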

Training Details

Parameter	Value
Base model	`unsloth/Qwen3.5-122B-A10B`
Architecture	Mixture-of-Experts (122B total, ~10B active)
Method	LoRA (rank 32, alpha 32)
Dropout	0.0
Epochs	1
Learning rate	1e-4 (cosine schedule, 10% warmup)
Max sequence length	8192
Micro batch size	1
Gradient accumulation	4 steps
Weight decay	0.01
LoRA targets	q_proj, k_proj, v_proj, o_proj
Packing	Off
Training framework	Unsloth
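For readers who want to see what the learning-rate row implies, the sketch below computes a cosine schedule with 10% warmup and the effective batch size per device. The linear warmup shape, the decay to zero, and the `total_steps` value of 1000 are assumptions for illustration; the actual step count depends on the dataset size, which is not stated here.

```python
import math

PEAK_LR = 1e-4
WARMUP_RATIO = 0.10
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 4

# Sequences contributing to each optimizer step on a single device.
EFFECTIVE_BATCH_PER_DEVICE = MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; not taken from the training run
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {lr_at_step(step, total_steps):.2e}")
print("effective batch per device:", EFFECTIVE_BATCH_PER_DEVICE)
```

With packing off and a micro batch of 1, each optimizer step sees 4 sequences per device, so the total number of optimizer steps in the single epoch is roughly the dataset size divided by 4 times the device count.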

Usage

With Transformers

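from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DJLougen/Ornstein-122-A10B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Your question here"}]
# add_generation_prompt=True appends the assistant turn header so the model
# starts a reply instead of continuing the user turn.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192)
# Decode only the newly generated tokens, not the echoed prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))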

With Unsloth (Recommended for Inference)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="DJLougen/Ornstein-122-A10B",
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

Reasoning Format

Ornstein uses <think>...</think> blocks for extended reasoning:

<think>
Let me work through this step by step...
[multi-phase reasoning with self-correction and verification]
</think>

[Final answer]
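When post-processing completions, it is often useful to separate the reasoning from the final answer. The helper below is a minimal sketch; `split_reasoning` is a hypothetical name, not part of any library. It also handles the case where only the closing tag appears, on the assumption that some chat templates open the block in the prompt so the decoded completion starts mid-reasoning.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completion that uses <think> tags.

    If no closing </think> tag is present, the whole text is treated as the
    answer and the reasoning is empty.
    """
    if "</think>" not in text:
        return "", text.strip()
    head, _, tail = text.partition("</think>")
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


completion = "<think>\nLet me work through this step by step...\n</think>\n\n42"
reasoning, answer = split_reasoning(completion)
print("reasoning:", reasoning)
print("answer:", answer)
```

A completion that hits `max_new_tokens` before closing the block comes back with empty reasoning and the raw text as the answer, so callers may want to check for a dangling `<think>` tag separately.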

Intended Use

Ornstein-122-A10B is designed for tasks that benefit from structured, multi-step reasoning — math, logic, code analysis, scientific problems, and complex question answering. The MoE architecture makes it practical to run 122B-class reasoning on hardware that couldn't handle a dense model of the same size.

Limitations

Single epoch training means the model retains most of the base Qwen 3.5 122B-A10B behavior; the fine-tune primarily shapes reasoning style rather than injecting new knowledge
The DDM pipeline optimizes for English reasoning traces; performance on other languages reflects the base model
Extended thinking can still occasionally loop on adversarial or highly ambiguous prompts
Unquantized (BF16) weights require ~233 GB; see the quantized variants for more accessible options
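The weight footprint can be sanity-checked with simple arithmetic. The sketch below assumes roughly 125B stored parameters (slightly above the 122B headline figure) and counts weights only. KV cache, activations, and the mixed precisions and metadata used by real quantization formats are ignored, so the lower-bit rows are rough lower bounds rather than actual file sizes.

```python
GIB = 2**30


def weight_footprint(num_params: float, bits_per_weight: float) -> tuple[float, float]:
    """Return (decimal GB, binary GiB) needed to store the weights alone."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9, total_bytes / GIB


NUM_PARAMS = 125e9  # assumed stored parameter count
for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb, gib = weight_footprint(NUM_PARAMS, bits)
    print(f"{label:>5}: {gb:6.1f} GB  ({gib:6.1f} GiB)")
```

Under these assumptions the BF16 row comes out near 250 GB in decimal units and near 233 in binary units, which matches the figure quoted above if it is read as GiB.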

License

Apache 2.0

Citation

If you use Ornstein-122-A10B or the DDM curation methodology in your work:

@misc{ornstein122a10b,
  author = {DJLougen},
  title = {Ornstein-122-A10B: DDM-Curated Reasoning Fine-Tune of Qwen 3.5 122B-A10B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DJLougen/Ornstein-122-A10B}
}
