<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
max-width: 800px;
margin: 0 auto;
padding: 20px 40px;
line-height: 1.7;
color: #1a1a2e;
background: #fafafa;
}
h1 { font-size: 2em; margin-top: 1em; color: #0f0f23; }
h2 { font-size: 1.5em; margin-top: 1.5em; color: #16213e; border-bottom: 2px solid #e2e8f0; padding-bottom: 0.3em; }
h3 { font-size: 1.2em; color: #1a1a4e; }
code { background: #e8ecf1; padding: 2px 6px; border-radius: 3px; font-size: 0.9em; }
pre { background: #1e1e2e; color: #cdd6f4; padding: 16px; border-radius: 8px; overflow-x: auto; }
pre code { background: none; color: inherit; padding: 0; }
table { border-collapse: collapse; width: 100%; margin: 1em 0; }
th, td { border: 1px solid #d1d5db; padding: 8px 12px; text-align: left; }
th { background: #e8ecf1; font-weight: 600; }
tr:nth-child(even) { background: #f3f4f6; }
blockquote { border-left: 4px solid #6366f1; margin: 1em 0; padding: 0.5em 1em; background: #eef2ff; color: #312e81; }
a { color: #4f46e5; text-decoration: none; }
a:hover { text-decoration: underline; }
strong { color: #0f172a; }
hr { border: none; border-top: 2px solid #e2e8f0; margin: 2em 0; }
</style>
</head>
<body>
<h1>The Ontological Gap: Why Error Type Matters More Than Error Count in AI Code Generation</h1>
<p><strong>Florinel Chis</strong> · March 2026</p>
<hr />
<p>Code generation evaluation obsesses over <strong>how often</strong> models fail — pass@k, syntax validity, test pass rates. But there's a dimension nobody measures: <strong>how</strong> they fail.</p>
<p>We found that shifting from natural language prompts to structured JSON specifications didn't reduce our error count (5 → 4). But it <strong>fundamentally changed the error type</strong> — from semantic hallucinations that require runtime debugging to mechanical gaps caught by a compiler in under 1 millisecond.</p>
<p>That shift matters more than the count.</p>
<h2>The Setup</h2>
<p>We fine-tuned <a href="https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct">Qwen2.5-Coder-7B-Instruct</a> (4-bit quantized) with LoRA on an Apple M2 Pro with 16GB RAM to generate Laravel 13.x PHP code. Two pipelines, same model, same 3-app benchmark (26 PHP files):</p>
<table>
<thead>
<tr>
<th></th>
<th>Natural Language → PHP</th>
<th>BuildSpec JSON → PHP</th>
</tr>
</thead>
<tbody>
<tr>
<td>Training examples</td>
<td>308</td>
<td><strong>49</strong></td>
</tr>
<tr>
<td>PHP syntax valid</td>
<td>26/26</td>
<td>26/26</td>
</tr>
<tr>
<td>Manual fixes needed</td>
<td>5</td>
<td>4</td>
</tr>
<tr>
<td><strong>Error type</strong></td>
<td><strong>Semantic hallucination</strong></td>
<td><strong>Mechanical/spec gap</strong></td>
</tr>
</tbody>
</table>
<p>The numbers look similar. The errors are completely different.</p>
<h2>What Went Wrong with Natural Language</h2>
<p>With NL prompts like <em>"Create a Post model with an author relationship and soft deletes"</em>, the model produced 5 bugs:</p>
<ol>
<li><strong>Invented a closure-based eager loading pattern</strong> in EventController that doesn't exist in the codebase</li>
<li><strong>Dropped a BelongsTo relationship</strong> on Book model despite being explicitly asked for it</li>
<li><strong>Used <code>-&gt;load(['user'])</code> on a model with no user relationship</strong> — hallucinated a relationship from pretraining</li>
<li><strong>Generated <code>-&gt;withHttpStatus()</code></strong> β€” a method that doesn't exist in Laravel</li>
<li><strong>Missing JsonResource import</strong> in SubscriberResource</li>
</ol>
<p>Every one of these is a <strong>semantic hallucination</strong>: the model generated something that doesn't match the developer's intent, and the only way to catch it is to run the code and debug the failure.</p>
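Catching these means running the code, though a crude static pass can at least surface calls to methods that don't exist anywhere. A minimal sketch of such a check — the allowlist and helper name are illustrative, not part of the authors' actual tooling:

```python
import re

# Illustrative allowlist of method names known to exist in the codebase;
# a real check would derive this from the framework's actual API surface.
KNOWN_METHODS = {"load", "json", "create", "update", "save", "header"}

def find_unknown_calls(php_source: str) -> list[str]:
    """Collect names invoked via '->name(' that are not in the allowlist."""
    calls = re.findall(r"->(\w+)\(", php_source)
    return sorted({c for c in calls if c not in KNOWN_METHODS})

# ->withHttpStatus() is the hallucinated method from bug 4 above.
snippet = "return response()->json($event)->withHttpStatus(201);"
print(find_unknown_calls(snippet))  # ['withHttpStatus']
```

Such a pass only flags nonexistent methods like bug 4; the deeper semantic mismatches — a dropped or invented relationship — still require runtime tests to surface.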
<h2>What Went Wrong with BuildSpec</h2>
<p>With structured specs like this:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">&quot;artifact&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model&quot;</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;class&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;Book&quot;</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;table&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;books&quot;</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;has_factory&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;fillable&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;title&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;isbn&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;year&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;author_id&quot;</span><span class="p">],</span>
<span class="w"> </span><span class="nt">&quot;relationships&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span><span class="nt">&quot;type&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;BelongsTo&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;model&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;Author&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;method&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;author&quot;</span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p>The model produced 4 bugs:</p>
<ol>
<li><strong>Used string-based <code>unique:books,isbn,...</code></strong> instead of <code>Rule::unique()-&gt;ignore()</code> — wrong PHP pattern for a correct concept</li>
<li><strong>Excluded <code>author_id</code> from <code>Book::create()</code></strong> unnecessarily — wrong code pattern</li>
<li><strong>Migration spec said <code>published_year</code>, model spec said <code>year</code></strong> — our spec was inconsistent</li>
<li><strong>Factory included <code>max_attendees</code></strong> but migration didn't have that column — our test harness was wrong</li>
</ol>
<p><strong>Zero semantic hallucinations.</strong> The model never invented a relationship, never used a nonexistent method, never hallucinated a pattern. Bugs 1-2 are wrong <em>code patterns</em> for correct <em>concepts</em>. Bugs 3-4 are our own spec inconsistencies.</p>
<h2>Why? The Ontological Gap</h2>
<p>Here's our hypothesis:</p>
<blockquote>
<p><strong>Semantic hallucinations are caused by ontological misalignment.</strong> The model has its own implicit "ontology" — a domain model learned from pretraining on millions of PHP files. When prompted with natural language, gaps in the prompt are filled from this implicit ontology. Where it diverges from the developer's intent, hallucinations occur.</p>
</blockquote>
<p>The model's pretraining ontology says:</p>
<ul>
<li>"Models usually have a <code>user()</code> relationship"</li>
<li>"Validation includes <code>'optional'</code>" (not a real Laravel rule)</li>
<li>"Controllers use closure-based eager loading"</li>
</ul>
<p>The developer's ontology says:</p>
<ul>
<li>"Posts belong to Authors, not Users"</li>
<li>"Validation uses <code>'nullable'</code>"</li>
<li>"Simple array-based eager loading suffices"</li>
</ul>
<p><strong>BuildSpec closes this gap.</strong> When you write <code>"relationships": [{"type": "BelongsTo", "model": "Author"}]</code>, there's no room for the model to substitute its own prior about what relationships a Post "should" have.</p>
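One way to see why: a structured spec can be rendered into the generation prompt verbatim, so the set of relationships the model is asked to emit is fixed by the input rather than inferred. A sketch of such a rendering step, assuming a spec shape matching the JSON above (the actual pipeline's prompt template may differ):

```python
def render_relationships(spec: dict) -> str:
    """Render each declared relationship as an explicit prompt line."""
    lines = [
        f"- {rel['method']}(): {rel['type']} {rel['model']}"
        for rel in spec.get("relationships", [])
    ]
    return "\n".join(lines)

spec = {
    "class": "Book",
    "relationships": [
        {"type": "BelongsTo", "model": "Author", "method": "author"}
    ],
}
print(render_relationships(spec))  # - author(): BelongsTo Author
```

Because every relationship in the output traces back to a line the developer wrote, a hallucinated <code>user()</code> relationship has nowhere to come from.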
<h2>The Spec Compiler: An Ontological Reasoner</h2>
<p>Before any code is generated, the spec compiler validates every spec in &lt;1ms:</p>
<div class="codehilite"><pre><span></span><code>$<span class="w"> </span>python3<span class="w"> </span>spec_compiler.py<span class="w"> </span>event_request.json
SpecCompileError:<span class="w"> </span>rules<span class="o">[</span><span class="s1">&#39;venue_id&#39;</span><span class="o">]</span><span class="w"> </span>contains<span class="w"> </span>conditional<span class="w"> </span>token
<span class="s1">&#39;required_on_post&#39;</span>.<span class="w"> </span>Use<span class="w"> </span><span class="s1">&#39;conditional_rules&#39;</span><span class="w"> </span>dict<span class="w"> </span>instead.
Example:<span class="w"> </span><span class="o">{</span><span class="s2">&quot;conditional_rules&quot;</span>:<span class="w"> </span><span class="o">{</span><span class="s2">&quot;venue_id&quot;</span>:<span class="w"> </span><span class="o">{</span><span class="s2">&quot;POST&quot;</span>:<span class="w"> </span><span class="o">[</span><span class="s2">&quot;required&quot;</span><span class="o">]</span>,
<span class="s2">&quot;PUT&quot;</span>:<span class="w"> </span><span class="o">[</span><span class="s2">&quot;sometimes&quot;</span><span class="o">]}}}</span>
</code></pre></div>
<p>The compiler catches ontological violations — wrong field names, missing required properties, invalid constraint expressions — before the model ever sees the spec. Generation takes ~30 seconds per file. Validation takes &lt;1ms. <strong>Validate aggressively, generate only validated specs.</strong></p>
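The real <code>spec_compiler.py</code> lives in the repo linked at the end of this post; the following is a minimal sketch of only the conditional-token check behind the error message above (the token set and function names are illustrative):

```python
class SpecCompileError(Exception):
    """Raised when a spec violates a structural rule before generation."""

# Illustrative set of legacy tokens the compiler rejects in flat rule lists.
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put"}

def check_rules(spec: dict) -> None:
    """Reject conditional tokens embedded in flat rules; they belong in
    a 'conditional_rules' dict keyed by HTTP method."""
    for field, tokens in spec.get("rules", {}).items():
        for token in tokens:
            if token in CONDITIONAL_TOKENS:
                raise SpecCompileError(
                    f"rules[{field!r}] contains conditional token {token!r}. "
                    "Use 'conditional_rules' dict instead."
                )

bad = {"rules": {"venue_id": ["required_on_post", "integer"]}}
try:
    check_rules(bad)
except SpecCompileError as err:
    print(err)
```

A failed check costs microseconds and produces an actionable message, which is what makes the "validate before generating" ordering pay off.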
<h2>Data Efficiency: 49 vs 308 Examples</h2>
<p>The spec pipeline needed <strong>6x fewer training examples</strong> for equivalent results. Why?</p>
<p>With natural language, the model must learn two things:</p>
<ol>
<li><strong>What to generate</strong> (the domain ontology — which entities, relationships, rules)</li>
<li><strong>How to generate it</strong> (the code mapping — PHP syntax, Laravel patterns)</li>
</ol>
<p>With BuildSpec, the ontology is <em>given</em>. The model only learns the mapping. Half the learning problem is eliminated by making the input explicit.</p>
<h2>The Takeaway</h2>
<p>If you're building domain-specific code generation:</p>
<ol>
<li><strong>Measure error type, not just error count.</strong> 4 mechanical bugs are better than 4 semantic hallucinations.</li>
<li><strong>Make the developer's ontology explicit.</strong> Structured specs remove the model's ability to hallucinate about <em>what</em> to generate.</li>
<li><strong>Validate inputs, not just outputs.</strong> A spec compiler catches errors 30,000x faster than generating code and running tests.</li>
<li><strong>You need fewer examples than you think.</strong> Structured input is more data-efficient because the model isn't learning domain concepts — just code patterns.</li>
</ol>
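The 30,000x figure in point 3 is just the ratio of the two latencies reported earlier:

```python
generation_s = 30     # ~30 s to generate one file (from the measurements above)
validation_s = 0.001  # <1 ms to validate one spec
print(f"{generation_s / validation_s:,.0f}x")  # 30,000x
```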
<h2>Try It</h2>
<ul>
<li><strong>Model</strong>: <a href="https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec">fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec</a></li>
<li><strong>Dataset</strong>: <a href="https://huggingface.co/datasets/fchis/laravel-buildspec-training">fchis/laravel-buildspec-training</a></li>
<li><strong>Code</strong>: <a href="https://github.com/florinel-chis/laravel-ai-gen">github.com/florinel-chis/laravel-ai-gen</a></li>
</ul>
<div class="codehilite"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>mlx-lm
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span><span class="s2">&quot;Create a REST API for managing blog posts with tags&quot;</span>
</code></pre></div>
<p>Runs entirely on Apple Silicon. No cloud GPU needed.</p>
<hr />
<p><em>This post summarizes findings from: "The Ontological Gap: Why Error Type Matters More Than Error Count in AI Code Generation" (Chis, 2026).</em></p>
</body>
</html>