<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    max-width: 800px;
    margin: 0 auto;
    padding: 20px 40px;
    line-height: 1.7;
    color: #1a1a2e;
    background: #fafafa;
}
h1 { font-size: 2em; margin-top: 1em; color: #0f0f23; }
h2 { font-size: 1.5em; margin-top: 1.5em; color: #16213e; border-bottom: 2px solid #e2e8f0; padding-bottom: 0.3em; }
h3 { font-size: 1.2em; color: #1a1a4e; }
code { background: #e8ecf1; padding: 2px 6px; border-radius: 3px; font-size: 0.9em; }
pre { background: #1e1e2e; color: #cdd6f4; padding: 16px; border-radius: 8px; overflow-x: auto; }
pre code { background: none; color: inherit; padding: 0; }
table { border-collapse: collapse; width: 100%; margin: 1em 0; }
th, td { border: 1px solid #d1d5db; padding: 8px 12px; text-align: left; }
th { background: #e8ecf1; font-weight: 600; }
tr:nth-child(even) { background: #f3f4f6; }
blockquote { border-left: 4px solid #6366f1; margin: 1em 0; padding: 0.5em 1em; background: #eef2ff; color: #312e81; }
a { color: #4f46e5; text-decoration: none; }
a:hover { text-decoration: underline; }
strong { color: #0f172a; }
hr { border: none; border-top: 2px solid #e2e8f0; margin: 2em 0; }
</style>
</head>
<body>
<h1>I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked</h1>
<p><strong>Florinel Chis</strong> · March 2026</p>
<hr />
<p>Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.</p>
<p><strong>The end result:</strong> A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.</p>
<p>Here's how I actually got there.</p>
<h2>The Hardware</h2>
<ul>
<li>Apple M2 Pro, 16GB unified memory</li>
<li>Qwen2.5-Coder-7B-Instruct, 4-bit quantized</li>
<li>MLX framework with LoRA</li>
<li>Target: Laravel 13.x PHP code generation</li>
</ul>
<p>The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with <code>max_seq_length=4096</code>. You close LM Studio before training or your machine crashes.</p>
<h2>Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)</h2>
<p>I started with 90 training examples and grew to 261 across 6 sprints. <code>val_loss</code> kept dropping. By Sprint 6, it hit <strong>0.000</strong>. Perfect.</p>
<p>Except the generated code wasn't getting better. At all.</p>
<h3>The Root Cause</h3>
<p>The system prompt (guidelines for the model) had grown organically across sprints to <strong>2,380 tokens</strong>. My <code>max_seq_length</code> was <strong>1,500</strong>.</p>
<p>MLX truncates training examples silently at <code>max_seq_length</code>. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).</p>
<p><strong>Six sprints. Hundreds of examples. Zero code learning.</strong></p>
<h3>The Fix</h3>
<div class="codehilite"><pre><span></span><code><span class="c1"># BEFORE: 2380 tokens of verbose guidelines</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">&quot;&quot;&quot;You are an expert Laravel developer. When writing models,</span>
<span class="s2">always use the HasFactory trait. The HasFactory trait enables...</span>
<span class="s2">[2380 tokens of examples and explanations]&quot;&quot;&quot;</span>

<span class="c1"># AFTER: 843 tokens, compressed</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">&quot;&quot;&quot;Laravel 13.x code generator. Output ONLY PHP.</span>
<span class="s2">- model: use HasFactory, add relationships from spec</span>
<span class="s2">- controller: import Controller, destroy() returns noContent()</span>
<span class="s2">...&quot;&quot;&quot;</span>
</code></pre></div>

<p>And the verification I should have done from the start:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># Check that completions aren&#39;t truncated</span>
<span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">dataset</span><span class="p">:</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="s2">&quot;text&quot;</span><span class="p">])</span>
    <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_seq_length</span><span class="p">,</span> <span class="sa">f</span><span class="s2">&quot;Truncated at </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span><span class="si">}</span><span class="s2"> tokens&quot;</span>
</code></pre></div>

<p><strong>Lesson: <code>val_loss=0.000</code> means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.</strong></p>
<h2>Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)</h2>
<p>After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!).</p>
<p>I discovered that <strong>every systematic bug can be fixed with 10-15 targeted examples</strong>:</p>
<table>
<thead>
<tr>
<th>Bug</th>
<th style="text-align: center;">Examples needed</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>'optional'</code> validation rule (not a Laravel rule)</td>
<td style="text-align: center;">10</td>
<td>Fixed — generates <code>'nullable'</code></td>
</tr>
<tr>
<td><code>wasRecentlyCreated</code> in resources</td>
<td style="text-align: center;">5</td>
<td>Fixed — uses correct timestamps</td>
</tr>
<tr>
<td>Cross-resource missing imports</td>
<td style="text-align: center;">13</td>
<td>Fixed — 12 bugs → 0</td>
</tr>
<tr>
<td>Missing <code>HasFactory</code> trait</td>
<td style="text-align: center;">20 (existing examples corrected)</td>
<td>Fixed — 5 bugs → 0</td>
</tr>
</tbody>
</table>
<p>The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern are enough.</p>
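<p>What a "targeted example" looks like in practice: a prompt that invites the mistake, paired with a completion showing the correct pattern. This is an illustrative sketch in chat-JSONL form — the field names, phrasing, and file name are made up, not the actual dataset entries:</p>

```python
import json

# One illustrative targeted example for the 'optional' -> 'nullable' bug.
# Prompt wording and file name are hypothetical.
example = {
    "messages": [
        {"role": "user",
         "content": "FormRequest for Book: title required, subtitle optional"},
        {"role": "assistant",
         "content": "public function rules(): array\n"
                    "{\n"
                    "    return [\n"
                    "        'title' => ['required', 'string'],\n"
                    "        'subtitle' => ['nullable', 'string'],\n"
                    "    ];\n"
                    "}"},
    ]
}

with open("targeted_examples.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```

<p>Ten to fifteen variations of this — different classes, different field names, different orderings — and the model stops emitting <code>'optional'</code>.</p>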
<h3>The Eval Script Trap</h3>
<p>I built an automated bug checker. It flagged <code>StoreBookRequest $request</code> as "missing <code>Illuminate\Http\Request</code> import" because the regex <code>'Request $request'</code> matched as a substring.</p>
<p><strong>Test your eval script on correct code before trusting it.</strong></p>
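<p>The fix for this particular false positive is small: require that <code>Request</code> is not the tail of a longer class name. A sketch of the before-and-after check (the helper names are illustrative):</p>

```python
import re

# Naive check: flags any line containing the substring 'Request $request'.
naive = lambda line: "Request $request" in line

# Fixed check: a negative lookbehind rejects matches where 'Request' is
# preceded by a letter, i.e. the tail of StoreBookRequest etc.
fixed = re.compile(r"(?<![A-Za-z])Request \$request")

form_request = "public function store(StoreBookRequest $request)"
base_request = "public function index(Request $request)"

assert naive(form_request)             # false positive on the FormRequest
assert not fixed.search(form_request)  # fixed: no longer flagged
assert fixed.search(base_request)      # real base-Request usage still flagged
```
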
<h3>Where I Hit the Wall</h3>
<p>After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were <strong>semantic hallucinations</strong>:</p>
<ul>
<li>Model invents a <code>user()</code> relationship that doesn't exist</li>
<li>Controller uses closure-based eager loading when array format is correct</li>
<li>Model generates <code>-&gt;withHttpStatus()</code> — a method that doesn't exist</li>
</ul>
<p>Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.</p>
<h2>Phase 3: The Spec Pivot (The Real Breakthrough)</h2>
<p>Instead of natural language:</p>
<blockquote>
<p>"Create a Post model with author relationship, fillable title and body, soft deletes"</p>
</blockquote>
<p>I switched to structured JSON specs:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w">  </span><span class="nt">&quot;artifact&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model&quot;</span><span class="p">,</span>
<span class="w">  </span><span class="nt">&quot;class&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;Post&quot;</span><span class="p">,</span>
<span class="w">  </span><span class="nt">&quot;table&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;posts&quot;</span><span class="p">,</span>
<span class="w">  </span><span class="nt">&quot;has_factory&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w">  </span><span class="nt">&quot;soft_deletes&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w">  </span><span class="nt">&quot;fillable&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;title&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;body&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;user_id&quot;</span><span class="p">],</span>
<span class="w">  </span><span class="nt">&quot;relationships&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w">    </span><span class="p">{</span><span class="nt">&quot;type&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;BelongsTo&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;model&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;User&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;method&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;author&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;foreign_key&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;user_id&quot;</span><span class="p">}</span>
<span class="w">  </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>

<h3>First test: 28 examples, 100 iterations</h3>
<p>Result: <strong>26/26 eval perfect. Zero semantic hallucinations.</strong> (Compare: 308 NL examples still had 5 hallucinations.)</p>
<p>The model can't invent a <code>user()</code> relationship if <code>relationships[]</code> explicitly lists only <code>author</code>. The spec removes the model's ability to hallucinate about <em>what</em> to generate. It only decides <em>how</em>.</p>
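<p>Concretely, the generation prompt can be little more than the compact system prompt plus the serialized spec — a sketch; the exact template the pipeline uses may differ:</p>

```python
import json

SYSTEM = "Laravel 13.x code generator. Output ONLY PHP."

spec = {
    "artifact": "model",
    "class": "Post",
    "relationships": [
        {"type": "BelongsTo", "model": "User", "method": "author"}
    ],
}

# The spec is the entire user message: every relationship the model is
# allowed to emit is listed explicitly, so there is nothing left to invent.
prompt = SYSTEM + "\n\n" + json.dumps(spec, indent=2)
```
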
<h3>The Spec Compiler</h3>
<p>I built a compiler that validates specs before generation:</p>
<div class="codehilite"><pre><span></span><code>$<span class="w"> </span>python3<span class="w"> </span>spec_compiler.py<span class="w"> </span>bad_spec.json

SpecCompileError:<span class="w"> </span>rules<span class="o">[</span><span class="s1">&#39;venue_id&#39;</span><span class="o">]</span><span class="w"> </span>contains<span class="w"> </span>conditional<span class="w"> </span>token
<span class="s1">&#39;required_on_post&#39;</span>.<span class="w"> </span>Use<span class="w"> </span><span class="s1">&#39;conditional_rules&#39;</span><span class="w"> </span>dict<span class="w"> </span>instead.
</code></pre></div>

<p>Validation: &lt;1ms. Generation: ~30s per file. Catch errors early.</p>
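<p>A minimal version of the kind of check the compiler runs — the rule vocabulary here is a tiny illustrative subset and the error wording is approximate, not the real compiler's:</p>

```python
# Illustrative subset of allowed validation-rule tokens, not the full set.
KNOWN_RULES = {"required", "nullable", "string", "integer", "exists"}

class SpecCompileError(ValueError):
    pass

def check_rules(spec):
    """Reject specs whose rules contain tokens outside the vocabulary."""
    for field, rules in spec.get("rules", {}).items():
        for rule in rules:
            token = rule.split(":")[0]  # strip args, e.g. exists:users,id
            if token not in KNOWN_RULES:
                raise SpecCompileError(
                    f"rules[{field!r}] contains unknown token {token!r}"
                )

check_rules({"rules": {"title": ["required", "string"]}})  # passes silently
try:
    check_rules({"rules": {"venue_id": ["required_on_post"]}})
except SpecCompileError as e:
    print(e)
```

<p>Because the check is pure dictionary lookups, it costs microseconds — which is why running it before a ~30s generation pass is always worth it.</p>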
<h3>Final Results: adapters_spec_v4</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th style="text-align: center;">NL Pipeline (308 ex)</th>
<th style="text-align: center;">Spec Pipeline (49 ex)</th>
</tr>
</thead>
<tbody>
<tr>
<td>PHP valid</td>
<td style="text-align: center;">26/26</td>
<td style="text-align: center;">26/26</td>
</tr>
<tr>
<td>Pest pass</td>
<td style="text-align: center;">52/58</td>
<td style="text-align: center;"><strong>20/20</strong></td>
</tr>
<tr>
<td>Manual fixes</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;">4</td>
</tr>
<tr>
<td>Semantic hallucinations</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;"><strong>0</strong></td>
</tr>
<tr>
<td>Training time</td>
<td style="text-align: center;">~30 min</td>
<td style="text-align: center;">~15 min</td>
</tr>
</tbody>
</table>
<h2>The Debugging Checklist</h2>
<p>Distilled from 34 steps of hitting walls:</p>
<p><strong>Before training:</strong></p>
<ol>
<li>Tokenize ALL examples. Check <code>max(total_tokens) &lt; max_seq_length</code>.</li>
<li>Check <code>min(completion_tokens) &gt; 0</code>. If zero, the system prompt is too long.</li>
<li>Close all GPU-using processes. Check memory with <code>vm_stat</code>.</li>
<li>Use <code>--num-layers 8</code> (not <code>--lora-layers 8</code>) on 16GB machines.</li>
</ol>
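<p>Items 1 and 2 amount to a few lines run over the dataset before every training run. A sketch, assuming chat-style examples with separate prompt and completion fields (your dataset keys may differ):</p>

```python
# Pre-flight check: tokenize prompt and completion separately so a
# too-long system prompt is caught before any training happens.
def preflight(dataset, tokenizer, max_seq_length):
    for i, ex in enumerate(dataset):
        prompt_tokens = tokenizer.encode(ex["prompt"])
        completion_tokens = tokenizer.encode(ex["completion"])
        total = len(prompt_tokens) + len(completion_tokens)
        assert max_seq_length > total, f"example {i}: truncated ({total} tokens)"
        assert len(completion_tokens) > 0, f"example {i}: empty completion"
```

<p>Had this run before Sprint 1, the silent-truncation bug would have died in seconds instead of six sprints.</p>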
<p><strong>After training:</strong></p>
<ol start="5">
<li>If <code>val_loss = 0.000</code>: training is broken, not perfect.</li>
<li>Generate 3-5 test files and inspect manually before a full benchmark.</li>
<li>Run <code>php -l</code> on all output (syntax check).</li>
</ol>
<p><strong>When bugs persist:</strong></p>
<ol start="8">
<li>Classify: is it a training data gap or a model capability limit?</li>
<li>If data gap: write 10-15 targeted examples with diverse contexts.</li>
<li>If capability limit: change the input format (structured specs).</li>
<li>If hallucinations persist after targeted training: the problem is <strong>ontological</strong> — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec), don't fight with more NL examples.</li>
</ol>
<h2>What 7B Models Do Well vs Poorly</h2>
<p><strong>Does well:</strong></p>
<ul>
<li>Individual class generation with clear patterns</li>
<li>PHP syntax (very rare errors after basic fine-tuning)</li>
<li>Following explicit rules in the system prompt</li>
<li>CRUD operations with a single model</li>
</ul>
<p><strong>Does poorly:</strong></p>
<ul>
<li>Multi-file consistency (imports across files)</li>
<li>Knowing what NOT to add (hallucinated relationships)</li>
<li>Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)</li>
<li>Complex relationship traversal</li>
</ul>
<p><strong>The key insight:</strong> 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.</p>
<h2>Try It Yourself</h2>
<p>Everything is open source:</p>
<ul>
<li><strong>Spec-trained model</strong>: <a href="https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec">fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec</a></li>
<li><strong>Training data</strong>: <a href="https://huggingface.co/datasets/fchis/laravel-buildspec-training">fchis/laravel-buildspec-training</a> (49 examples)</li>
<li><strong>Full pipeline</strong>: <a href="https://github.com/florinel-chis/laravel-ai-gen">github.com/florinel-chis/laravel-ai-gen</a></li>
</ul>
<div class="codehilite"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>mlx-lm

<span class="c1"># Full pipeline: NL → specs → compile → PHP files</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span><span class="s2">&quot;Create a REST API for managing blog posts with tags&quot;</span>

<span class="c1"># Or use a spec directly</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span>--spec<span class="w"> </span>my_specs.json<span class="w"> </span>--output<span class="w"> </span>./generated
</code></pre></div>

<p>Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.</p>
<hr />
<p><em>This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.</em></p>
</body>
</html>