--- title: "I Spent 34 Steps Building a Code Generator on My MacBook โ€” Here's What Actually Worked" emoji: "๐Ÿ› ๏ธ" colorFrom: green colorTo: yellow sdk: static pinned: false license: mit tags: - code-generation - fine-tuning - mlx - lora - laravel - php - apple-silicon - experience-report --- # I Spent 34 Steps Building a Code Generator on My MacBook โ€” Here's What Actually Worked **Florinel Chis** ยท March 2026 --- Most fine-tuning tutorials show you the happy path. This is the full path โ€” including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model. **The end result:** A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0. Here's how I actually got there. ## The Hardware - Apple M2 Pro, 16GB unified memory - Qwen2.5-Coder-7B-Instruct, 4-bit quantized - MLX framework with LoRA - Target: Laravel 13.x PHP code generation The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with `max_seq_length=4096`. You close LM Studio before training or your machine crashes. ## Phase 1: Six Sprints of Nothing (The Silent Truncation Bug) I started with 90 training examples and grew to 261 across 6 sprints. `val_loss` kept dropping. By Sprint 6, it hit **0.000**. Perfect. Except the generated code wasn't getting better. At all. ### The Root Cause The system prompt (guidelines for the model) had grown organically across sprints to **2,380 tokens**. My `max_seq_length` was **1,500**. MLX truncates training examples silently at `max_seq_length`. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt โ€” and it got really good at that (hence val_loss=0.000). **Six sprints. Hundreds of examples. Zero code learning.** ### The Fix ```python # BEFORE: 2380 tokens of verbose guidelines SYSTEM = """You are an expert Laravel developer. When writing models, always use the HasFactory trait. The HasFactory trait enables... [2380 tokens of examples and explanations]""" # AFTER: 843 tokens, compressed SYSTEM = """Laravel 13.x code generator. Output ONLY PHP. - model: use HasFactory, add relationships from spec - controller: import Controller, destroy() returns noContent() ...""" ``` And the verification I should have done from the start: ```python # Check that completions aren't truncated for example in dataset: tokens = tokenizer.encode(example["text"]) assert len(tokens) < max_seq_length, f"Truncated at {len(tokens)} tokens" ``` **Lesson: `val_loss=0.000` means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.** ## Phase 2: Targeted Bug Fixing (The 10-15 Example Rule) After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!). I discovered that **every systematic bug can be fixed with 10-15 targeted examples**: | Bug | Examples needed | Result | |-----|:-:|---| | `'optional'` validation rule (not a Laravel rule) | 10 | Fixed โ€” generates `'nullable'` | | `wasRecentlyCreated` in resources | 5 | Fixed โ€” uses correct timestamps | | Cross-resource missing imports | 13 | Fixed โ€” 12 bugs โ†’ 0 | | Missing `HasFactory` trait | 20 (fixed existing) | Fixed โ€” 5 bugs โ†’ 0 | The model already knows PHP. 
### The Eval Script Trap

I built an automated bug checker. It flagged `StoreBookRequest $request` as "missing `Illuminate\Http\Request` import" because the regex `'Request $request'` matched as a substring of the longer class name.

**Test your eval script on correct code before trusting it.**

### Where I Hit the Wall

After Sprint 9: 52/58 Pest tests passing. The 6 remaining failures were all **semantic hallucinations**:

- The model invents a `user()` relationship that doesn't exist
- The controller uses closure-based eager loading where the array format is correct
- The model generates `->withHttpStatus()` — a method that doesn't exist

Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.

## Phase 3: The Spec Pivot (The Real Breakthrough)

Instead of natural language:

> "Create a Post model with author relationship, fillable title and body, soft deletes"

I switched to structured JSON specs:

```json
{
  "artifact": "model",
  "class": "Post",
  "table": "posts",
  "has_factory": true,
  "soft_deletes": true,
  "fillable": ["title", "body", "user_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
  ]
}
```

### First test: 28 examples, 100 iterations

Result: **26/26 eval perfect. Zero semantic hallucinations.** (Compare: 308 NL examples still had 5 hallucinations.)

The model can't invent a `user()` relationship if `relationships[]` explicitly lists only `author`. The spec removes the model's ability to hallucinate about *what* to generate. It only decides *how*.

### The Spec Compiler

I built a compiler that validates specs before generation:

```
$ python3 spec_compiler.py bad_spec.json
SpecCompileError: rules['venue_id'] contains conditional token 'required_on_post'. Use 'conditional_rules' dict instead.
```

Validation: <1ms. Generation: ~30s per file. Catch errors early.

### Final Results: adapters_spec_v4

| Metric | NL Pipeline (308 ex) | Spec Pipeline (49 ex) |
|--------|:---:|:---:|
| PHP valid | 26/26 | 26/26 |
| Pest pass | 52/58 | **20/20** |
| Manual fixes | 5 | 4 |
| Semantic hallucinations | 5 | **0** |
| Training time | ~30 min | ~15 min |

## The Debugging Checklist

Distilled from 34 steps of hitting walls:

**Before training:**

1. Tokenize ALL examples. Check `max(total_tokens) < max_seq_length`.
2. Check `min(completion_tokens) > 0`. If zero, your system prompt is too long. (A combined preflight sketch for items 1-2 follows this list.)
3. Close all GPU-using processes. Check memory with `vm_stat`.
4. Use `--num-layers 8` (not `--lora-layers 8`) on 16GB machines.

**After training:**

5. If `val_loss = 0.000`: training is broken, not perfect.
6. Generate 3-5 test files and inspect them manually before running the full benchmark.
7. Run `php -l` on all output (syntax check).

**When bugs persist:**

8. Classify: is it a training data gap or a model capability limit?
9. If a data gap: write 10-15 targeted examples with diverse contexts.
10. If a capability limit: change the input format (structured specs).
11. If hallucinations persist after targeted training, the problem is **ontological** — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec); don't fight with more NL examples.
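Items 1 and 2 are cheap enough to run before every training job. Here is a minimal preflight sketch, assuming the `{"text": ...}` JSONL shape used earlier and a plain Hugging Face tokenizer for counting; the path `train.jsonl` and the heuristic that every completion starts at the first `<?php` are assumptions, not the actual pipeline:

```python
import json
from transformers import AutoTokenizer

# Preflight for checklist items 1-2. Only the tokenizer is needed for
# counting tokens, not the model weights.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
MAX_SEQ_LENGTH = 1500  # must match the value passed to training

with open("train.jsonl") as f:  # hypothetical path
    for i, line in enumerate(f):
        text = json.loads(line)["text"]
        total_tokens = len(tokenizer.encode(text))
        # Assumed heuristic: the completion begins at the first '<?php'
        prompt_tokens = len(tokenizer.encode(text.split("<?php")[0]))
        completion_tokens = total_tokens - prompt_tokens
        assert total_tokens < MAX_SEQ_LENGTH, (
            f"example {i}: {total_tokens} tokens, will be silently truncated"
        )
        assert completion_tokens > 0, (
            f"example {i}: no completion left after the prompt"
        )
print("all examples fit and every completion is intact")
```

Run against the Phase 1 data, this would have failed on every single example, six sprints earlier.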
## What 7B Models Do Well vs Poorly

**Does well:**

- Individual class generation with clear patterns
- PHP syntax (very rare errors after basic fine-tuning)
- Following explicit rules in the system prompt
- CRUD operations with a single model

**Does poorly:**

- Multi-file consistency (imports across files)
- Knowing what NOT to add (hallucinated relationships)
- Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)
- Complex relationship traversal

**The key insight:** 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.

## Try It Yourself

Everything is open source:

- **Spec-trained model**: [fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec](https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec)
- **Training data**: [fchis/laravel-buildspec-training](https://huggingface.co/datasets/fchis/laravel-buildspec-training) (49 examples)
- **Full pipeline**: [github.com/florinel-chis/laravel-ai-gen](https://github.com/florinel-chis/laravel-ai-gen)

```bash
pip install mlx-lm

# Full pipeline: NL → specs → compile → PHP files
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"

# Or use a spec directly
python3 pipeline_spec.py --spec my_specs.json --output ./generated
```

Runs entirely on Apple Silicon: M1/M2/M3/M4 with 16GB+ RAM.

---

*This post is an abbreviated version of "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper, with detailed results, a bug taxonomy, and infrastructure lessons, is available as a preprint.*