---
title: "I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked"
emoji: "🛠️"
colorFrom: green
colorTo: yellow
sdk: static
pinned: false
license: mit
tags:
  - code-generation
  - fine-tuning
  - mlx
  - lora
  - laravel
  - php
  - apple-silicon
  - experience-report
---
# I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked

**Florinel Chis** · March 2026

---
Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.

**The end result:** A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.

Here's how I actually got there.
## The Hardware

- Apple M2 Pro, 16GB unified memory
- Qwen2.5-Coder-7B-Instruct, 4-bit quantized
- MLX framework with LoRA
- Target: Laravel 13.x PHP code generation

The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with `max_seq_length=4096`. You close LM Studio before training or your machine crashes.
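For reference, this is roughly the shape of invocation those constraints force. The model path and values here are illustrative, not my exact command; `--num-layers` and `--max-seq-length` are the same flags discussed in the checklist at the end:

```bash
# Memory-conscious LoRA run on 16GB: 4-bit base model, batch of 1,
# few adapted layers, and a context length the data actually fits in.
mlx_lm.lora \
  --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit \
  --train --data ./data \
  --batch-size 1 \
  --num-layers 8 \
  --max-seq-length 1500
```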
## Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)

I started with 90 training examples and grew to 261 across 6 sprints. `val_loss` kept dropping. By Sprint 6, it hit **0.000**. Perfect.

Except the generated code wasn't getting better. At all.

### The Root Cause

The system prompt (guidelines for the model) had grown organically across sprints to **2,380 tokens**. My `max_seq_length` was **1,500**.

MLX truncates training examples silently at `max_seq_length`. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).

**Six sprints. Hundreds of examples. Zero code learning.**
### The Fix

```python
# BEFORE: 2380 tokens of verbose guidelines
SYSTEM = """You are an expert Laravel developer. When writing models,
always use the HasFactory trait. The HasFactory trait enables...
[2380 tokens of examples and explanations]"""

# AFTER: 843 tokens, compressed
SYSTEM = """Laravel 13.x code generator. Output ONLY PHP.
- model: use HasFactory, add relationships from spec
- controller: import Controller, destroy() returns noContent()
..."""
```
And the verification I should have done from the start:

```python
# Check that no example exceeds the context window (MLX truncates silently)
for example in dataset:
    tokens = tokenizer.encode(example["text"])
    assert len(tokens) < max_seq_length, (
        f"Example is {len(tokens)} tokens; everything past {max_seq_length} is silently dropped"
    )
```
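Total length alone doesn't prove the completion survives. If your dataset keeps prompt and completion in separate fields (adjust the keys to your format), check the completion explicitly:

```python
# Stricter check: the completion itself must fit after the prompt.
# Assumes "prompt"/"completion" fields; rename for your data format.
for example in dataset:
    prompt_len = len(tokenizer.encode(example["prompt"]))
    completion_len = len(tokenizer.encode(example["completion"]))
    kept = max(0, max_seq_length - prompt_len)
    assert kept >= completion_len, (
        f"Completion loses {completion_len - kept} of {completion_len} tokens to truncation"
    )
```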
**Lesson: `val_loss=0.000` means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.**
## Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)

After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!).

I discovered that **every systematic bug can be fixed with 10-15 targeted examples**:
| Bug | Examples needed | Result |
|-----|:---------------:|--------|
| `'optional'` validation rule (not a Laravel rule) | 10 | Fixed — generates `'nullable'` |
| `wasRecentlyCreated` in resources | 5 | Fixed — uses correct timestamps |
| Cross-resource missing imports | 13 | Fixed — 12 bugs → 0 |
| Missing `HasFactory` trait | 20 (fixed existing) | Fixed — 5 bugs → 0 |
The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern are enough.
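For a concrete sense of scale, one such targeted record might look like this, paraphrased into a minimal prompt/completion pair. My real examples are full files in JSONL (linked at the end); only the pattern being corrected matters here:

```json
{
  "prompt": "spec: StorePostRequest; title: required string; summary: optional string",
  "completion": "'title' => ['required', 'string'], 'summary' => ['nullable', 'string']"
}
```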
### The Eval Script Trap

I built an automated bug checker. It flagged `StoreBookRequest $request` as "missing `Illuminate\Http\Request` import" because the regex `'Request $request'` matched as a substring.
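A minimal reproduction of that false positive (hypothetical checker code, not my actual script):

```python
import re

line = "public function store(StoreBookRequest $request)"

# Buggy check: plain substring match also fires inside StoreBookRequest
print(bool(re.search(r"Request \$request", line)))    # True  (false positive)

# Fixed check: a word boundary stops matches inside larger identifiers
print(bool(re.search(r"\bRequest \$request", line)))  # False (correct)
```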
**Test your eval script on correct code before trusting it.**
### Where I Hit the Wall

After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were **semantic hallucinations**:

- Model invents a `user()` relationship that doesn't exist
- Controller uses closure-based eager loading when the array format is correct
- Model generates `->withHttpStatus()` — a method that doesn't exist

Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.
## Phase 3: The Spec Pivot (The Real Breakthrough)

Instead of natural language:

> "Create a Post model with author relationship, fillable title and body, soft deletes"

I switched to structured JSON specs:

```json
{
  "artifact": "model",
  "class": "Post",
  "table": "posts",
  "has_factory": true,
  "soft_deletes": true,
  "fillable": ["title", "body", "user_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
  ]
}
```
### First test: 28 examples, 100 iterations

Result: **26/26 perfect on the eval. Zero semantic hallucinations.** (Compare: 308 NL examples still had 5 hallucinations.)

The model can't invent a `user()` relationship if `relationships[]` explicitly lists only `author`. The spec removes the model's ability to hallucinate about *what* to generate. It only decides *how*.
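To make that concrete, the spec above pins the output down to essentially one shape. This is hand-written here for illustration; the model's actual output lives in the repo:

```php
<?php

namespace App\Models;

use Illuminate\Database\Eloquent\Factories\HasFactory;
use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Relations\BelongsTo;
use Illuminate\Database\Eloquent\SoftDeletes;

class Post extends Model
{
    use HasFactory, SoftDeletes;

    protected $table = 'posts';

    protected $fillable = ['title', 'body', 'user_id'];

    // Exactly one relationship method, because the spec lists exactly one
    public function author(): BelongsTo
    {
        return $this->belongsTo(User::class, 'user_id');
    }
}
```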
### The Spec Compiler

I built a compiler that validates specs before generation:

```
$ python3 spec_compiler.py bad_spec.json
SpecCompileError: rules['venue_id'] contains conditional token
'required_on_post'. Use 'conditional_rules' dict instead.
```

Validation: <1ms. Generation: ~30s per file. Catch errors early.
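The check behind that error message can be as simple as a dictionary walk. A minimal sketch, with the field and token names assumed from the error output above rather than the real `spec_compiler.py` internals:

```python
# Illustrative token list; the real compiler's set is larger
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put", "required_on_patch"}

class SpecCompileError(ValueError):
    pass

def validate_rules(spec: dict) -> None:
    # Reject conditional tokens inside plain rules; they belong in conditional_rules
    for field, rules in spec.get("rules", {}).items():
        for token in rules:
            if token in CONDITIONAL_TOKENS:
                raise SpecCompileError(
                    f"rules[{field!r}] contains conditional token {token!r}. "
                    "Use 'conditional_rules' dict instead."
                )
```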
### Final Results: adapters_spec_v4

| Metric | NL Pipeline (308 examples) | Spec Pipeline (49 examples) |
|--------|:--------------------------:|:---------------------------:|
| PHP valid | 26/26 | 26/26 |
| Pest pass | 52/58 | **20/20** |
| Manual fixes | 5 | 4 |
| Semantic hallucinations | 5 | **0** |
| Training time | ~30 min | ~15 min |
## The Debugging Checklist

Distilled from 34 steps of hitting walls:

**Before training:**

1. Tokenize ALL examples. Check `max(total_tokens) < max_seq_length`.
2. Check `min(completion_tokens) > 0`. If zero, your system prompt is too long.
3. Close all GPU-using processes. Check memory with `vm_stat`.
4. Use `--num-layers 8` (not `--lora-layers 8`) on 16GB machines.

**After training:**

5. If `val_loss = 0.000`: training is broken, not perfect.
6. Generate 3-5 test files and inspect them manually before running the full benchmark.
7. Run `php -l` on all output (syntax check; see the loop at the end of this checklist).
**When bugs persist:**

8. Classify: is it a training data gap or a model capability limit?
9. If data gap: write 10-15 targeted examples with diverse contexts.
10. If capability limit: change the input format (structured specs).
11. If hallucinations persist after targeted training: the problem is **ontological** — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec), don't fight with more NL examples.
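For item 7, the whole syntax gate is one shell loop (assuming generated files land in `./generated`):

```bash
# php -l exits non-zero on a parse error, so failures are easy to collect
for f in generated/*.php; do
  php -l "$f" > /dev/null || echo "SYNTAX FAIL: $f"
done
```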
## What 7B Models Do Well vs Poorly

**Does well:**

- Individual class generation with clear patterns
- PHP syntax (very rare errors after basic fine-tuning)
- Following explicit rules in the system prompt
- CRUD operations with a single model

**Does poorly:**

- Multi-file consistency (imports across files)
- Knowing what NOT to add (hallucinated relationships)
- Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)
- Complex relationship traversal

**The key insight:** 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.
## Try It Yourself

Everything is open source:

- **Spec-trained model**: [fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec](https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec)
- **Training data**: [fchis/laravel-buildspec-training](https://huggingface.co/datasets/fchis/laravel-buildspec-training) (49 examples)
- **Full pipeline**: [github.com/florinel-chis/laravel-ai-gen](https://github.com/florinel-chis/laravel-ai-gen)

```bash
pip install mlx-lm

# Full pipeline: NL → specs → compile → PHP files
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"

# Or use a spec directly
python3 pipeline_spec.py --spec my_specs.json --output ./generated
```

Runs entirely on Apple Silicon (M1/M2/M3/M4 with 16GB+ RAM).
---

*This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.*