Spaces:

fchis
/

34-steps-code-generator

Running

App Files Files Community

fchis commited on Mar 28

Commit

22b85f7

verified ·

1 Parent(s): d30942d

Upload README.md with huggingface_hub

Browse files

Files changed (1) hide show

README.md +216 -4

README.md CHANGED Viewed

@@ -1,10 +1,222 @@
 ---
-title: 34 Steps Code Generator
-emoji: 👀
-colorFrom: gray
 colorTo: yellow
 sdk: static
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: "I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked"
+emoji: "🛠️"
+colorFrom: green
 colorTo: yellow
 sdk: static
 pinned: false
+license: mit
+tags:
+- code-generation
+- fine-tuning
+- mlx
+- lora
+- laravel
+- php
+- apple-silicon
+- experience-report
 ---
+# I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked
+**Florinel Chis** · March 2026
+---
+Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.
+**The end result:** A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.
+Here's how I actually got there.
+## The Hardware
+- Apple M2 Pro, 16GB unified memory
+- Qwen2.5-Coder-7B-Instruct, 4-bit quantized
+- MLX framework with LoRA
+- Target: Laravel 13.x PHP code generation
+The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with `max_seq_length=4096`. You close LM Studio before training or your machine crashes.
+## Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)
+I started with 90 training examples and grew to 261 across 6 sprints. `val_loss` kept dropping. By Sprint 6, it hit **0.000**. Perfect.
+Except the generated code wasn't getting better. At all.
+### The Root Cause
+The system prompt (guidelines for the model) had grown organically across sprints to **2,380 tokens**. My `max_seq_length` was **1,500**.
+MLX truncates training examples silently at `max_seq_length`. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).
+**Six sprints. Hundreds of examples. Zero code learning.**
+### The Fix
+```python
+# BEFORE: 2380 tokens of verbose guidelines
+SYSTEM = """You are an expert Laravel developer. When writing models,
+always use the HasFactory trait. The HasFactory trait enables...
+[2380 tokens of examples and explanations]"""
+# AFTER: 843 tokens, compressed
+SYSTEM = """Laravel 13.x code generator. Output ONLY PHP.
+- model: use HasFactory, add relationships from spec
+- controller: import Controller, destroy() returns noContent()
+..."""
+```
+And the verification I should have done from the start:
+```python
+# Check that completions aren't truncated
+for example in dataset:
+    tokens = tokenizer.encode(example["text"])
+    assert len(tokens) < max_seq_length, f"Truncated at {len(tokens)} tokens"
+```
+**Lesson: `val_loss=0.000` means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.**
+## Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)
+After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!).
+I discovered that **every systematic bug can be fixed with 10-15 targeted examples**:
+| Bug | Examples needed | Result |
+|-----|:-:|---|
+| `'optional'` validation rule (not a Laravel rule) | 10 | Fixed — generates `'nullable'` |
+| `wasRecentlyCreated` in resources | 5 | Fixed — uses correct timestamps |
+| Cross-resource missing imports | 13 | Fixed — 12 bugs → 0 |
+| Missing `HasFactory` trait | 20 (fixed existing) | Fixed — 5 bugs → 0 |
+The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern is enough.
+### The Eval Script Trap
+I built an automated bug checker. It flagged `StoreBookRequest $request` as "missing `Illuminate\Http\Request` import" because the regex `'Request $request'` matched as a substring.
+**Test your eval script on correct code before trusting it.**
+### Where I Hit the Wall
+After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were **semantic hallucinations**:
+- Model invents a `user()` relationship that doesn't exist
+- Controller uses closure-based eager loading when array format is correct
+- Model generates `->withHttpStatus()` — a method that doesn't exist
+Adding more NL training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.
+## Phase 3: The Spec Pivot (The Real Breakthrough)
+Instead of natural language:
+> "Create a Post model with author relationship, fillable title and body, soft deletes"
+I switched to structured JSON specs:
+```json
+{
+  "artifact": "model",
+  "class": "Post",
+  "table": "posts",
+  "has_factory": true,
+  "soft_deletes": true,
+  "fillable": ["title", "body", "user_id"],
+  "relationships": [
+    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
+  ]
+}
+```
+### First test: 28 examples, 100 iterations
+Result: **26/26 eval perfect. Zero semantic hallucinations.** (Compare: 308 NL examples still had 5 hallucinations.)
+The model can't invent a `user()` relationship if `relationships[]` explicitly lists only `author`. The spec removes the model's ability to hallucinate about *what* to generate. It only decides *how*.
+### The Spec Compiler
+I built a compiler that validates specs before generation:
+```
+$ python3 spec_compiler.py bad_spec.json
+SpecCompileError: rules['venue_id'] contains conditional token
+'required_on_post'. Use 'conditional_rules' dict instead.
+```
+Validation: <1ms. Generation: ~30s per file. Catch errors early.
+### Final Results: adapters_spec_v4
+| Metric | NL Pipeline (308 ex) | Spec Pipeline (49 ex) |
+|--------|:---:|:---:|
+| PHP valid | 26/26 | 26/26 |
+| Pest pass | 52/58 | **20/20** |
+| Manual fixes | 5 | 4 |
+| Semantic hallucinations | 5 | **0** |
+| Training time | ~30 min | ~15 min |
+## The Debugging Checklist
+Distilled from 34 steps of hitting walls:
+**Before training:**
+1. Tokenize ALL examples. Check `max(total_tokens) < max_seq_length`
+2. Check `min(completion_tokens) > 0`. If zero, system prompt is too long.
+3. Close all GPU-using processes. Check memory with `vm_stat`.
+4. Use `--num-layers 8` (not `--lora-layers 8`) on 16GB machines.
+**After training:**
+5. If `val_loss = 0.000`: training is broken, not perfect.
+6. Generate 3-5 test files and inspect manually before full benchmark.
+7. Run `php -l` on all output (syntax check).
+**When bugs persist:**
+8. Classify: is it a training data gap or a model capability limit?
+9. If data gap: write 10-15 targeted examples with diverse contexts.
+10. If capability limit: change the input format (structured specs).
+11. If hallucinations persist after targeted training: the problem is **ontological** — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec), don't fight with more NL examples.
+## What 7B Models Do Well vs Poorly
+**Does well:**
+- Individual class generation with clear patterns
+- PHP syntax (very rare errors after basic fine-tuning)
+- Following explicit rules in the system prompt
+- CRUD operations with a single model
+**Does poorly:**
+- Multi-file consistency (imports across files)
+- Knowing what NOT to add (hallucinated relationships)
+- Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)
+- Complex relationship traversal
+**The key insight:** 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.
+## Try It Yourself
+Everything is open source:
+- **Spec-trained model**: [fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec](https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec)
+- **Training data**: [fchis/laravel-buildspec-training](https://huggingface.co/datasets/fchis/laravel-buildspec-training) (49 examples)
+- **Full pipeline**: [github.com/florinel-chis/laravel-ai-gen](https://github.com/florinel-chis/laravel-ai-gen)
+```bash
+pip install mlx-lm
+# Full pipeline: NL → specs → compile → PHP files
+python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"
+# Or use a spec directly
+python3 pipeline_spec.py --spec my_specs.json --output ./generated
+```
+Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.
+---
+*This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.*