DJLougen committed on
Commit 0bd59d1 · verified · 1 Parent(s): 011a759

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md (+162 −153)

README.md CHANGED
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- reasoning
- qwen3.5
- unsloth
- safetensors
- ddm
- lora
- sft
base_model:
- unsloth/Qwen3.5-27B
pipeline_tag: text-generation
---

# Ornstein-27B

A reasoning-focused fine-tune of [Qwen 3.5 27B](https://huggingface.co/unsloth/Qwen3.5-27B), trained on a small, high-quality dataset curated through a custom **Drift Diffusion Modeling (DDM)** pipeline. When the data quality is high enough, 1,229 carefully selected reasoning traces are all you need.

> **GGUF quantizations are available at [DJLougen/Ornstein-27B-GGUF](https://huggingface.co/DJLougen/Ornstein-27B-GGUF)**

## Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also spends far too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, balanced against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee - it goes a long way toward keeping these experiments running.

**[Support on Ko-fi](https://ko-fi.com/djlougen)**

---

## What Makes Ornstein Different

Most reasoning fine-tunes throw large volumes of synthetic data at a base model and hope for the best. Ornstein takes the opposite approach: every single training example passed through a multi-stage quality pipeline that measures whether a reasoning trace is actually *reasoning* or just generating tokens that look like reasoning.

The core insight is that language models frequently produce **degenerate reasoning** - long chains of text that superficially resemble deep thought (hedging, restating the problem, circling without progress) but carry little actual signal. The DDM pipeline detects and separates these from genuine premium reasoning traces, producing a training mix that teaches the model what good thinking actually looks like.

### Training Data at a Glance

![Training Data Profile](ornstein_training_profile.png)

- **Top left - Drift Score Distribution:** The DDM pipeline assigns each reasoning trace a drift score. Premium traces (blue) cluster low, degenerate traces (red) cluster high, with the fitted threshold at 1.463 cleanly separating the two pools.
- **Top right - Category Mix:** The dataset is math-heavy (1,016 examples), with code (124), science (45), and logic (44) rounding it out.
- **Bottom left - Reasoning Depth:** Premium traces average ~1,263 words of thinking - substantially deeper than the degenerate traces (~281 words), which tend to be shallow repetition that inflates token count without substance.
- **Bottom right - Difficulty × Pool:** Breakdown across difficulty tiers. The degenerate pool skews toward hard problems, where models are most likely to loop or stall.

## DDM Curation Pipeline

Drift Diffusion Modeling decomposes each reasoning trace into uniform segments and tracks how "reasoning quality" evolves across the trace. Each segment is scored on multiple dimensions that capture whether the model is making genuine cognitive progress - things like introducing new ideas, self-correcting, verifying intermediate results, and exploring alternative approaches.

These per-segment scores are accumulated into a drift trajectory. Premium traces maintain healthy trajectories throughout. Degenerate traces accumulate deficit as the model loops, repeats itself, or pads without substance, until the drift score crosses a threshold fitted via ROC analysis.
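
The accumulation idea can be sketched as follows. This is an illustrative reconstruction, not the released pipeline: the segment-quality scores and the deficit rule are hypothetical, and only the 1.463 threshold comes from this model card.

```python
from typing import List

DRIFT_THRESHOLD = 1.463  # fitted via ROC analysis (value from the model card)

def drift_score(segment_scores: List[float]) -> float:
    """Accumulate per-segment quality deficit into a single drift score.

    Each entry in segment_scores is an illustrative quality score in [0, 1]
    (new ideas, self-correction, verification, exploration combined).
    Segments below 0.5 quality add deficit; healthy segments add none.
    """
    drift = 0.0
    for q in segment_scores:
        drift += max(0.0, 1.0 - 2.0 * q)
    return drift

def classify(segment_scores: List[float]) -> str:
    """Label a trace by whether its accumulated drift crosses the threshold."""
    return "degenerate" if drift_score(segment_scores) > DRIFT_THRESHOLD else "premium"
```

A trace that keeps making progress accumulates little or no deficit and stays premium; a trace that loops or pads crosses the threshold quickly.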

**Pipeline performance on the validation set:**

- **AUC: 0.9705** - near-perfect separation between premium and degenerate traces
- **99.49% sensitivity** - catches virtually all degenerate reasoning
- **~5% false positive rate** - rarely misclassifies genuine reasoning
- **Only 11 hard negatives** in the entire evaluation set

The final training mix is ratio-clamped to prevent degenerate patterns from destabilizing convergence, yielding **799 premium traces + 430 selected degenerate traces = 1,229 total examples**.
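
Ratio-clamping can be sketched as a simple cap on the degenerate share of the mix. The function name and the 35% cap are ours, chosen so the example reproduces the published 799 + 430 split; the card does not state the actual cap.

```python
def clamp_mix(premium, degenerate, max_degenerate_ratio=0.35):
    """Return all premium traces plus at most enough degenerate traces to keep
    their share of the final mix at or below max_degenerate_ratio."""
    # Solve cap / (len(premium) + cap) <= ratio for the degenerate count cap
    cap = int(max_degenerate_ratio / (1.0 - max_degenerate_ratio) * len(premium))
    return premium + degenerate[:cap]
```

With 799 premium traces and a 35% cap, the clamp admits exactly 430 degenerate traces for a 1,229-example mix.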

## Training Details

| Parameter | Value |
|---|---|
| Base model | `unsloth/Qwen3.5-27B` |
| Method | LoRA (rank 32, alpha 32) |
| Dropout | 0.05 |
| Epochs | 1 |
| Learning rate | 1e-4 (cosine schedule, 10% warmup) |
| Max sequence length | 8192 |
| Micro batch size | 1 |
| Gradient accumulation | 4 steps |
| Weight decay | 0.01 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Packing | Off |
| Training framework | Unsloth |
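
For convenience, the table above can be collected into plain config dicts ready to drop into an Unsloth or PEFT training script. The key names follow common PEFT/TRL argument conventions and are illustrative; this is not the actual training script.

```python
# LoRA adapter settings from the Training Details table
lora_config = {
    "r": 32,                # LoRA rank
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

# Optimizer / schedule settings from the same table
train_config = {
    "num_train_epochs": 1,
    "learning_rate": 1e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,
    "max_seq_length": 8192,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,  # effective batch size of 4
    "weight_decay": 0.01,
    "packing": False,
}
```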

## Data Quality Metrics

The curated training set has the following characteristics:

| Metric | Value |
|---|---|
| Total examples | 1,229 |
| Mean thinking depth | ~1,667 words |
| Self-correction present | 100% of traces |
| Verification present | 100% of traces |
| Exploration present | 100% of traces |
| Quality gate pass rate | 100% |
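
The three presence flags above could be approximated with simple phrase heuristics. The marker lists below are entirely hypothetical; the real quality gate is part of the DDM pipeline's per-segment scoring.

```python
import re

# Hypothetical surface markers for the three quality-gate signals
MARKERS = {
    "self_correction": r"\b(wait|actually|I made an error|let me reconsider)\b",
    "verification": r"\b(let me (check|verify)|double-check|sanity check)\b",
    "exploration": r"\b(alternatively|another approach|what if)\b",
}

def gate(trace: str) -> dict:
    """Return which quality signals appear anywhere in a reasoning trace."""
    return {name: bool(re.search(pattern, trace, re.IGNORECASE))
            for name, pattern in MARKERS.items()}
```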

## Usage

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DJLougen/Ornstein-27B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Your question here"}]
# add_generation_prompt=True appends the assistant turn header so the model
# starts answering rather than continuing the user message
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=8192)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With Unsloth (Recommended for Inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="DJLougen/Ornstein-27B",
    max_seq_length=8192,
    load_in_4bit=True,  # 4-bit load to fit the 27B model in less VRAM
)
FastLanguageModel.for_inference(model)
```

## Reasoning Format

Ornstein uses `<think>...</think>` blocks for extended reasoning:

```
<think>
Let me work through this step by step...
[multi-phase reasoning with self-correction and verification]
</think>

[Final answer]
```
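
If you want to separate the thinking block from the final answer programmatically, a minimal split looks like this (the helper name is ours, not part of the model's API):

```python
import re

def split_reasoning(text: str):
    """Split a response into (thinking, answer) around a <think>...</think> block.

    Returns an empty thinking string if no block is present.
    """
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if m is None:
        return "", text.strip()
    thinking = m.group(1).strip()
    answer = text[m.end():].strip()
    return thinking, answer
```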

## Intended Use

Ornstein-27B is designed for tasks that benefit from structured, multi-step reasoning - math, logic, code analysis, scientific problems, and complex question answering. The DDM curation specifically optimizes for traces that demonstrate genuine cognitive progress rather than verbose restating.

## Limitations

- A single epoch on 1,229 examples means the model retains most of the base Qwen 3.5 27B behavior; the fine-tune primarily shapes reasoning style rather than injecting new knowledge
- The DDM pipeline optimizes for English reasoning traces; performance on other languages reflects the base model
- Extended thinking can still occasionally loop on adversarial or highly ambiguous prompts

## License

Apache 2.0

## Citation

If you use Ornstein-27B or the DDM curation methodology in your work:

```bibtex
@misc{ornstein27b,
  author = {DJLougen},
  title = {Ornstein-27B: DDM-Curated Reasoning Fine-Tune of Qwen 3.5 27B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DJLougen/Ornstein-27B}
}
```