<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked</title>
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
max-width: 800px;
margin: 0 auto;
padding: 20px 40px;
line-height: 1.7;
color: #1a1a2e;
background: #fafafa;
}
h1 { font-size: 2em; margin-top: 1em; color: #0f0f23; }
h2 { font-size: 1.5em; margin-top: 1.5em; color: #16213e; border-bottom: 2px solid #e2e8f0; padding-bottom: 0.3em; }
h3 { font-size: 1.2em; color: #1a1a4e; }
code { background: #e8ecf1; padding: 2px 6px; border-radius: 3px; font-size: 0.9em; }
pre { background: #1e1e2e; color: #cdd6f4; padding: 16px; border-radius: 8px; overflow-x: auto; }
pre code { background: none; color: inherit; padding: 0; }
table { border-collapse: collapse; width: 100%; margin: 1em 0; }
th, td { border: 1px solid #d1d5db; padding: 8px 12px; text-align: left; }
th { background: #e8ecf1; font-weight: 600; }
tr:nth-child(even) { background: #f3f4f6; }
blockquote { border-left: 4px solid #6366f1; margin: 1em 0; padding: 0.5em 1em; background: #eef2ff; color: #312e81; }
a { color: #4f46e5; text-decoration: none; }
a:hover { text-decoration: underline; }
strong { color: #0f172a; }
hr { border: none; border-top: 2px solid #e2e8f0; margin: 2em 0; }
</style>
</head>
<body>
<h1>I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked</h1>
<p><strong>Florinel Chis</strong> · March 2026</p>
<hr />
<p>Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.</p>
<p><strong>The end result:</strong> A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.</p>
<p>Here's how I actually got there.</p>
<h2>The Hardware</h2>
<ul>
<li>Apple M2 Pro, 16GB unified memory</li>
<li>Qwen2.5-Coder-7B-Instruct, 4-bit quantized</li>
<li>MLX framework with LoRA</li>
<li>Target: Laravel 13.x PHP code generation</li>
</ul>
<p>The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with <code>max_seq_length=4096</code>. You close LM Studio before training or your machine crashes.</p>
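<p>A back-of-envelope calculation shows why. The figures below are illustrative arithmetic, not measured MLX allocations:</p>

```python
# Why two 7B models don't fit in 16 GB of unified memory.
# Illustrative arithmetic only, not measured MLX usage.
GB = 2**30
params = 7e9                  # Qwen2.5-Coder-7B parameter count

fp16_gb = params * 2 / GB     # one fp16 copy: ~13 GB, nearly all of RAM
q4_gb = params * 0.5 / GB     # 4-bit quantized weights: ~3.3 GB

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
# Training adds LoRA gradients, optimizer state, and activations on top,
# which is why long sequence lengths OOM on this machine.
```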
<h2>Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)</h2>
<p>I started with 90 training examples and grew to 261 across 6 sprints. <code>val_loss</code> kept dropping. By Sprint 6, it hit <strong>0.000</strong>. Perfect.</p>
<p>Except the generated code wasn't getting better. At all.</p>
<h3>The Root Cause</h3>
<p>The system prompt (guidelines for the model) had grown organically across sprints to <strong>2,380 tokens</strong>. My <code>max_seq_length</code> was <strong>1,500</strong>.</p>
<p>MLX truncates training examples silently at <code>max_seq_length</code>. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).</p>
<p><strong>Six sprints. Hundreds of examples. Zero code learning.</strong></p>
<h3>The Fix</h3>
<div class="codehilite"><pre><span></span><code><span class="c1"># BEFORE: 2380 tokens of verbose guidelines</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">"""You are an expert Laravel developer. When writing models,</span>
<span class="s2">always use the HasFactory trait. The HasFactory trait enables...</span>
<span class="s2">[2380 tokens of examples and explanations]"""</span>
<span class="c1"># AFTER: 843 tokens, compressed</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">"""Laravel 13.x code generator. Output ONLY PHP.</span>
<span class="s2">- model: use HasFactory, add relationships from spec</span>
<span class="s2">- controller: import Controller, destroy() returns noContent()</span>
<span class="s2">..."""</span>
</code></pre></div>
<p>And the verification I should have done from the start:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># Check that completions aren't truncated</span>
<span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">dataset</span><span class="p">:</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="s2">"text"</span><span class="p">])</span>
<span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_seq_length</span><span class="p">,</span> <span class="sa">f</span><span class="s2">"Truncated at </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span><span class="si">}</span><span class="s2"> tokens"</span>
</code></pre></div>
<p><strong>Lesson: <code>val_loss=0.000</code> means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.</strong></p>
<h2>Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)</h2>
<p>After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!).</p>
<p>I discovered that <strong>every systematic bug can be fixed with 10-15 targeted examples</strong>:</p>
<table>
<thead>
<tr>
<th>Bug</th>
<th style="text-align: center;">Examples needed</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>'optional'</code> validation rule (not a Laravel rule)</td>
<td style="text-align: center;">10</td>
<td>Fixed — generates <code>'nullable'</code></td>
</tr>
<tr>
<td><code>wasRecentlyCreated</code> in resources</td>
<td style="text-align: center;">5</td>
<td>Fixed — uses correct timestamps</td>
</tr>
<tr>
<td>Cross-resource missing imports</td>
<td style="text-align: center;">13</td>
<td>Fixed — 12 bugs → 0</td>
</tr>
<tr>
<td>Missing <code>HasFactory</code> trait</td>
<td style="text-align: center;">20 (fixed existing)</td>
<td>Fixed — 5 bugs → 0</td>
</tr>
</tbody>
</table>
<p>The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern are enough.</p>
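<p>The <code>'optional'</code> rule fix, for instance, is scriptable. Here's a minimal sketch of producing ten diverse contexts for the one correct pattern; the field list and JSONL schema are illustrative placeholders, not the repo's actual training format:</p>

```python
import json

# Ten different optional fields, one correct pattern: 'nullable', never 'optional'.
# Field names and the prompt/completion schema are illustrative placeholders.
FIELDS = ["bio", "avatar", "middle_name", "phone", "website",
          "birthday", "company", "twitter", "notes", "nickname"]

examples = [
    {
        "prompt": f"Validation rule for an optional string field '{field}'",
        "completion": f"'{field}' => ['nullable', 'string'],",
    }
    for field in FIELDS
]

with open("fix_nullable.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```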
<h3>The Eval Script Trap</h3>
<p>I built an automated bug checker. It flagged <code>StoreBookRequest $request</code> as "missing <code>Illuminate\Http\Request</code> import" because the regex <code>'Request $request'</code> matched as a substring.</p>
<p><strong>Test your eval script on correct code before trusting it.</strong></p>
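<p>The trap is easy to reproduce with plain <code>re</code>, and a word boundary is one possible fix (a sketch; the actual eval script's patterns may differ):</p>

```python
import re

form_request = "public function store(StoreBookRequest $request)"  # correct code
bare_request = "public function update(Request $request)"          # genuinely bare

# Substring check: false positive on the FormRequest line.
naive = r"Request \$request"
assert re.search(naive, form_request) and re.search(naive, bare_request)

# Word boundary: 'StoreBookRequest' no longer matches, bare 'Request' still does.
fixed = r"\bRequest \$request"
assert not re.search(fixed, form_request)
assert re.search(fixed, bare_request)
```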
<h3>Where I Hit the Wall</h3>
<p>After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were <strong>semantic hallucinations</strong>:</p>
<ul>
<li>Model invents a <code>user()</code> relationship that doesn't exist</li>
<li>Controller uses closure-based eager loading when array format is correct</li>
<li>Model generates <code>->withHttpStatus()</code> — a method that doesn't exist</li>
</ul>
<p>Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.</p>
<h2>Phase 3: The Spec Pivot (The Real Breakthrough)</h2>
<p>Instead of natural language:</p>
<blockquote>
<p>"Create a Post model with author relationship, fillable title and body, soft deletes"</p>
</blockquote>
<p>I switched to structured JSON specs:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"artifact"</span><span class="p">:</span><span class="w"> </span><span class="s2">"model"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"class"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Post"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"table"</span><span class="p">:</span><span class="w"> </span><span class="s2">"posts"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"has_factory"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"soft_deletes"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"fillable"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"title"</span><span class="p">,</span><span class="w"> </span><span class="s2">"body"</span><span class="p">,</span><span class="w"> </span><span class="s2">"user_id"</span><span class="p">],</span>
<span class="w"> </span><span class="nt">"relationships"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"BelongsTo"</span><span class="p">,</span><span class="w"> </span><span class="nt">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"User"</span><span class="p">,</span><span class="w"> </span><span class="nt">"method"</span><span class="p">:</span><span class="w"> </span><span class="s2">"author"</span><span class="p">,</span><span class="w"> </span><span class="nt">"foreign_key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user_id"</span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<h3>First test: 28 examples, 100 iterations</h3>
<p>Result: <strong>26/26 eval perfect. Zero semantic hallucinations.</strong> (Compare: 308 NL examples still had 5 hallucinations.)</p>
<p>The model can't invent a <code>user()</code> relationship if <code>relationships[]</code> explicitly lists only <code>author</code>. The spec removes the model's ability to hallucinate about <em>what</em> to generate. It only decides <em>how</em>.</p>
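<p>That constraint is also checkable after generation. A sketch of the idea (the regex is simplified and the variable names are mine, not the pipeline's):</p>

```python
import re

# Methods defined in the output must be exactly the ones the spec lists.
spec_methods = {"author"}   # from relationships[] in the spec above

generated = """
public function author(): BelongsTo
{
    return $this->belongsTo(User::class, 'user_id');
}

public function user(): BelongsTo
{
    return $this->belongsTo(User::class);
}
"""

defined = set(re.findall(r"function (\w+)\(\): \w+", generated))
hallucinated = defined - spec_methods
print(hallucinated)   # {'user'}: caught before it ever reaches Pest
```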
<h3>The Spec Compiler</h3>
<p>I built a compiler that validates specs before generation:</p>
<div class="codehilite"><pre><span></span><code>$<span class="w"> </span>python3<span class="w"> </span>spec_compiler.py<span class="w"> </span>bad_spec.json
SpecCompileError:<span class="w"> </span>rules<span class="o">[</span><span class="s1">'venue_id'</span><span class="o">]</span><span class="w"> </span>contains<span class="w"> </span>conditional<span class="w"> </span>token
<span class="s1">'required_on_post'</span>.<span class="w"> </span>Use<span class="w"> </span><span class="s1">'conditional_rules'</span><span class="w"> </span>dict<span class="w"> </span>instead.
</code></pre></div>
<p>Validation: &lt;1ms. Generation: ~30s per file. Catch errors early.</p>
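<p>A minimal sketch of the kinds of checks such a front-end can make. The field names follow the example spec above; the error messages and the foreign-key rule are illustrative, not <code>spec_compiler.py</code>'s actual logic:</p>

```python
class SpecCompileError(ValueError):
    """Raised when a spec fails validation, before any generation runs."""

REL_TYPES = {"BelongsTo", "HasOne", "HasMany", "BelongsToMany"}

def validate_spec(spec: dict) -> None:
    if spec.get("artifact") != "model":
        raise SpecCompileError("this sketch only handles 'model' specs")
    for rel in spec.get("relationships", []):
        if rel["type"] not in REL_TYPES:
            raise SpecCompileError(f"unknown relationship type {rel['type']!r}")
        fk = rel.get("foreign_key")
        # A relationship's foreign key must also be mass-assignable.
        if fk and fk not in spec.get("fillable", []):
            raise SpecCompileError(f"foreign_key {fk!r} missing from fillable")

post = {
    "artifact": "model",
    "class": "Post",
    "fillable": ["title", "body", "user_id"],
    "relationships": [
        {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
    ],
}
validate_spec(post)   # passes silently; a bad spec raises SpecCompileError
```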
<h3>Final Results: adapters_spec_v4</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th style="text-align: center;">NL Pipeline (308 ex)</th>
<th style="text-align: center;">Spec Pipeline (49 ex)</th>
</tr>
</thead>
<tbody>
<tr>
<td>PHP valid</td>
<td style="text-align: center;">26/26</td>
<td style="text-align: center;">26/26</td>
</tr>
<tr>
<td>Pest pass</td>
<td style="text-align: center;">52/58</td>
<td style="text-align: center;"><strong>20/20</strong></td>
</tr>
<tr>
<td>Manual fixes</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;">4</td>
</tr>
<tr>
<td>Semantic hallucinations</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;"><strong>0</strong></td>
</tr>
<tr>
<td>Training time</td>
<td style="text-align: center;">~30 min</td>
<td style="text-align: center;">~15 min</td>
</tr>
</tbody>
</table>
<h2>The Debugging Checklist</h2>
<p>Distilled from 34 steps of hitting walls:</p>
<p><strong>Before training:</strong></p>
<ol>
<li>Tokenize ALL examples. Check <code>max(total_tokens) &lt; max_seq_length</code>.</li>
<li>Check <code>min(completion_tokens) &gt; 0</code>. If it's zero, the system prompt is too long.</li>
<li>Close all GPU-using processes. Check memory with <code>vm_stat</code>.</li>
<li>Use <code>--num-layers 8</code> (not <code>--lora-layers 8</code>) on 16GB machines.</li>
</ol>
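<p>Items 1 and 2 are cheap to automate. A sketch, where the <code>prompt</code>/<code>completion</code> field names and the whitespace tokenizer stand in for the real dataset and tokenizer:</p>

```python
def audit(dataset, encode, max_seq_length):
    """Fail fast if any example would be truncated before its completion."""
    totals, completions = [], []
    for ex in dataset:
        p = len(encode(ex["prompt"]))
        c = len(encode(ex["completion"]))
        totals.append(p + c)
        # Completion tokens that survive truncation at max_seq_length.
        completions.append(max(0, min(p + c, max_seq_length) - p))
    assert max(totals) < max_seq_length, f"truncated: {max(totals)} tokens"
    assert min(completions) > 0, "system prompt consumes the whole window"

data = [{"prompt": "generate a Post model", "completion": "<?php class Post {}"}]
audit(data, str.split, 1500)   # passes: 4 + 4 tokens, well under the limit
```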
<p><strong>After training:</strong></p>
<ol start="5">
<li>If <code>val_loss = 0.000</code>: training is broken, not perfect.</li>
<li>Generate 3-5 test files and inspect them manually before running the full benchmark.</li>
<li>Run <code>php -l</code> on all output (syntax check).</li>
</ol>
<p><strong>When bugs persist:</strong></p>
<ol start="8">
<li>Classify: is it a training-data gap or a model capability limit?</li>
<li>If a data gap: write 10-15 targeted examples with diverse contexts.</li>
<li>If a capability limit: change the input format (structured specs).</li>
<li>If hallucinations persist after targeted training: the problem is <strong>ontological</strong> — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec); don't fight it with more NL examples.</li>
</ol>
<h2>What 7B Models Do Well vs Poorly</h2>
<p><strong>Does well:</strong></p>
<ul>
<li>Individual class generation with clear patterns</li>
<li>PHP syntax (very rare errors after basic fine-tuning)</li>
<li>Following explicit rules in the system prompt</li>
<li>CRUD operations with a single model</li>
</ul>
<p><strong>Does poorly:</strong></p>
<ul>
<li>Multi-file consistency (imports across files)</li>
<li>Knowing what NOT to add (hallucinated relationships)</li>
<li>Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)</li>
<li>Complex relationship traversal</li>
</ul>
<p><strong>The key insight:</strong> 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.</p>
<h2>Try It Yourself</h2>
<p>Everything is open source:</p>
<ul>
<li><strong>Spec-trained model</strong>: <a href="https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec">fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec</a></li>
<li><strong>Training data</strong>: <a href="https://huggingface.co/datasets/fchis/laravel-buildspec-training">fchis/laravel-buildspec-training</a> (49 examples)</li>
<li><strong>Full pipeline</strong>: <a href="https://github.com/florinel-chis/laravel-ai-gen">github.com/florinel-chis/laravel-ai-gen</a></li>
</ul>
<div class="codehilite"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>mlx-lm
<span class="c1"># Full pipeline: NL → specs → compile → PHP files</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span><span class="s2">"Create a REST API for managing blog posts with tags"</span>
<span class="c1"># Or use a spec directly</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span>--spec<span class="w"> </span>my_specs.json<span class="w"> </span>--output<span class="w"> </span>./generated
</code></pre></div>
<p>Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.</p>
<hr />
<p><em>This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.</em></p>
</body>
</html> |