--- title: "I Spent 34 Steps Building a Code Generator on My MacBook โ€” Here's What Actually Worked" emoji: "๐Ÿ› ๏ธ" colorFrom: green colorTo: yellow sdk: static pinned: false license: mit tags: - code-generation - fine-tuning - mlx - lora - laravel - php - apple-silicon - experience-report --- # I Spent 34 Steps Building a Code Generator on My MacBook โ€” Here's What Actually Worked **Florinel Chis** ยท March 2026 --- Most fine-tuning tutorials show you the happy path. This is the full path โ€” including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model. **The end result:** A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0. Here's how I actually got there. ## The Hardware - Apple M2 Pro, 16GB unified memory - Qwen2.5-Coder-7B-Instruct, 4-bit quantized - MLX framework with LoRA - Target: Laravel 13.x PHP code generation The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with `max_seq_length=4096`. You close LM Studio before training or your machine crashes. ## Phase 1: Six Sprints of Nothing (The Silent Truncation Bug) I started with 90 training examples and grew to 261 across 6 sprints. `val_loss` kept dropping. By Sprint 6, it hit **0.000**. Perfect. Except the generated code wasn't getting better. At all. ### The Root Cause The system prompt (guidelines for the model) had grown organically across sprints to **2,380 tokens**. My `max_seq_length` was **1,500**. MLX truncates training examples silently at `max_seq_length`. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt โ€” and it got really good at that (hence val_loss=0.000). **Six sprints. Hundreds of examples. Zero code learning.** ### The Fix ```python # BEFORE: 2380 tokens of verbose guidelines SYSTEM = """You are an expert Laravel developer. When writing models, always use the HasFactory trait. The HasFactory trait enables... [2380 tokens of examples and explanations]""" # AFTER: 843 tokens, compressed SYSTEM = """Laravel 13.x code generator. Output ONLY PHP. - model: use HasFactory, add relationships from spec - controller: import Controller, destroy() returns noContent() ...""" ``` And the verification I should have done from the start: ```python # Check that completions aren't truncated for example in dataset: tokens = tokenizer.encode(example["text"]) assert len(tokens) < max_seq_length, f"Truncated at {len(tokens)} tokens" ``` **Lesson: `val_loss=0.000` means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.** ## Phase 2: Targeted Bug Fixing (The 10-15 Example Rule) After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!). I discovered that **every systematic bug can be fixed with 10-15 targeted examples**: | Bug | Examples needed | Result | |-----|:-:|---| | `'optional'` validation rule (not a Laravel rule) | 10 | Fixed โ€” generates `'nullable'` | | `wasRecentlyCreated` in resources | 5 | Fixed โ€” uses correct timestamps | | Cross-resource missing imports | 13 | Fixed โ€” 12 bugs โ†’ 0 | | Missing `HasFactory` trait | 20 (fixed existing) | Fixed โ€” 5 bugs โ†’ 0 | The model already knows PHP. 
### The Eval Script Trap

I built an automated bug checker. It flagged `StoreBookRequest $request` as "missing `Illuminate\Http\Request` import" because the regex `'Request $request'` matched as a substring of the longer class name.

**Test your eval script on correct code before trusting it.**

### Where I Hit the Wall

After Sprint 9: 52/58 Pest tests passing. The 6 remaining failures were all **semantic hallucinations**:

- The model invents a `user()` relationship that doesn't exist
- The controller uses closure-based eager loading where the array format is correct
- The model generates `->withHttpStatus()` — a method that doesn't exist

Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.

## Phase 3: The Spec Pivot (The Real Breakthrough)

Instead of natural language:

> "Create a Post model with author relationship, fillable title and body, soft deletes"

I switched to structured JSON specs:

```json
{
  "artifact": "model",
  "class": "Post",
  "table": "posts",
  "has_factory": true,
  "soft_deletes": true,
  "fillable": ["title", "body", "user_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
  ]
}
```

### First test: 28 examples, 100 iterations

Result: **26/26 eval perfect. Zero semantic hallucinations.** (Compare: 308 NL examples still had 5 hallucinations.)

The model can't invent a `user()` relationship if `relationships[]` explicitly lists only `author`. The spec removes the model's ability to hallucinate about *what* to generate. It only decides *how*.

### The Spec Compiler

I built a compiler that validates specs before generation:

```
$ python3 spec_compiler.py bad_spec.json
SpecCompileError: rules['venue_id'] contains conditional token 'required_on_post'. Use 'conditional_rules' dict instead.
```

Validation: <1ms. Generation: ~30s per file. Catch errors early.

### Final Results: adapters_spec_v4

| Metric | NL Pipeline (308 ex) | Spec Pipeline (49 ex) |
|--------|:---:|:---:|
| PHP valid | 26/26 | 26/26 |
| Pest pass | 52/58 | **20/20** |
| Manual fixes | 5 | 4 |
| Semantic hallucinations | 5 | **0** |
| Training time | ~30 min | ~15 min |

## The Debugging Checklist

Distilled from 34 steps of hitting walls:

**Before training:**

1. Tokenize ALL examples. Check `max(total_tokens) < max_seq_length`.
2. Check `min(completion_tokens) > 0`. If zero, your system prompt is too long. (A combined preflight sketch for items 1-2 follows this list.)
3. Close all GPU-using processes. Check memory with `vm_stat`.
4. Use `--num-layers 8` (not `--lora-layers 8`) on 16GB machines.

**After training:**

5. If `val_loss = 0.000`: training is broken, not perfect.
6. Generate 3-5 test files and inspect them manually before running the full benchmark.
7. Run `php -l` on all output (syntax check).

**When bugs persist:**

8. Classify: is it a training data gap or a model capability limit?
9. If a data gap: write 10-15 targeted examples with diverse contexts.
10. If a capability limit: change the input format (structured specs).
11. If hallucinations persist after targeted training, the problem is **ontological** — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec); don't fight with more NL examples.
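Items 1 and 2 are cheap enough to run before every training job. Here is a minimal preflight sketch, assuming the `{"text": ...}` JSONL shape used earlier and a plain Hugging Face tokenizer for counting; the path `train.jsonl` and the heuristic that every completion starts at the first `<?php` are assumptions, not the actual pipeline:

```python
import json
from transformers import AutoTokenizer

# Preflight for checklist items 1-2. Only the tokenizer is needed for
# counting tokens, not the model weights.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
MAX_SEQ_LENGTH = 1500  # must match the value passed to training

with open("train.jsonl") as f:  # hypothetical path
    for i, line in enumerate(f):
        text = json.loads(line)["text"]
        total_tokens = len(tokenizer.encode(text))
        # Assumed heuristic: the completion begins at the first '<?php'
        prompt_tokens = len(tokenizer.encode(text.split("<?php")[0]))
        completion_tokens = total_tokens - prompt_tokens
        assert total_tokens < MAX_SEQ_LENGTH, (
            f"example {i}: {total_tokens} tokens, will be silently truncated"
        )
        assert completion_tokens > 0, (
            f"example {i}: no completion left after the prompt"
        )
print("all examples fit and every completion is intact")
```

Run against the Phase 1 data, this would have failed on every single example, six sprints earlier.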
## What 7B Models Do Well vs Poorly

**Does well:**

- Individual class generation with clear patterns
- PHP syntax (very rare errors after basic fine-tuning)
- Following explicit rules in the system prompt
- CRUD operations with a single model

**Does poorly:**

- Multi-file consistency (imports across files)
- Knowing what NOT to add (hallucinated relationships)
- Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)
- Complex relationship traversal

**The key insight:** 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.

## Try It Yourself

Everything is open source:

- **Spec-trained model**: [fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec](https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec)
- **Training data**: [fchis/laravel-buildspec-training](https://huggingface.co/datasets/fchis/laravel-buildspec-training) (49 examples)
- **Full pipeline**: [github.com/florinel-chis/laravel-ai-gen](https://github.com/florinel-chis/laravel-ai-gen)

```bash
pip install mlx-lm

# Full pipeline: NL → specs → compile → PHP files
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"

# Or use a spec directly
python3 pipeline_spec.py --spec my_specs.json --output ./generated
```

Runs entirely on Apple Silicon: M1/M2/M3/M4 with 16GB+ RAM.

---

*This post is an abbreviated version of "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper, with detailed results, a bug taxonomy, and infrastructure lessons, is available as a preprint.*