
The Overeager Intern Problem: Fine-Tuning Google's New AI Into a Vulnerability Specialist

A brilliant, brand-new model that won't stop talking — and the genuinely interesting work of teaching it to give one precise answer instead. Real methods, real numbers, and the part where we nearly published a lie.


The problem: a flood of bugs, and nobody to file them

Every year the world logs tens of thousands of new software vulnerabilities. Each one gets a CVE — a public record stating that one specific thing is broken in one specific product — and the count climbs relentlessly. In 2024 it climbed straight into a wall: the U.S. National Vulnerability Database, the clearinghouse that enriches those raw records with structured metadata, fell dramatically behind, leaving a mountain of CVEs with a description and little else attached.

One of the most valuable pieces of that "little else" is the CWE — the weakness category, the answer to what kind of mistake was this? (Full definition in a moment.) Without it, a vulnerability is an isolated anecdote. With it, you can do the work that actually matters: notice a vendor shipping the same class of bug over and over, prioritize the weaknesses that genuinely threaten you, measure whether your defenses are improving, and tie a brand-new CVE to everything you already know about that failure mode.

But assigning a CWE is slow, manual, and inconsistent. Different analysts reach for different categories; many CVEs never receive a dependable mapping at all; and the boundaries between categories are subtle enough to start arguments among experts. Meanwhile the backlog only grows.

This is exactly our problem at exploit-intel.com. We take in the entire CVE firehose and turn it into usable intelligence — and that depends on every incoming vulnerability being tagged with its weakness category consistently, immediately, and without waiting in anyone else's queue. At this volume, by hand, that's hopeless. We needed a classifier of our own: something that reads a raw CVE description and reliably names the CWE(s), at scale, on our own infrastructure.

The obvious move in 2026 is to point a capable language model at the job. So we did — and immediately met an old friend.


Everyone has met this intern

You know the type. Razor-sharp, eager, genuinely knows the material — and constitutionally incapable of answering a yes-or-no question without a TED talk. Ask where the printer is and you get a history of toner.

Google's newest open model, Gemma 4, is that intern. Released June 3rd, 2026, free for anyone to use, and clever enough to run on a good laptop. We handed it a short report describing a software security bug and asked one narrow thing — which category of flaw is this? It replied with a warm, fluent, three-paragraph essay, complete with a remediation plan we didn't ask for.

It wasn't wrong. It was just useless for our actual problem, which is sorting ten thousand of these reports a night. We wanted a label. We got literature.

This is the story of how we fixed that: how a single overnight training run on one graphics card turned the rambling generalist into a specialist that answers in eleven tokens, and — this is the surprising part — answers more accurately than it did while monologuing. Everything below is measured on 10,514 examples the model had never seen. Where it loses, I'll show you exactly where.


The job: a triage nurse for software

To make the rest land, picture a hospital triage nurse. Patients describe their problems in their own words — "chest hurts, feeling dizzy." The nurse's job is to convert that messy, human description into a standard code so the right specialist is paged and the hospital can track patterns. The patient's story is unique; the diagnosis comes from a fixed list.

Vulnerability triage is the same shape:

  • A CVE (Common Vulnerabilities and Exposures) is one specific, publicly disclosed bug in one specific product — "version 2.3 of this login page can be tricked into dumping its database." Each is written by a different human, in different prose.
  • A CWE (Common Weakness Enumeration) is the standardized category of the underlying flaw — SQL injection, cleartext transmission, use-after-free. We're working within MITRE's View-1003, a curated set of about 117 weakness classes. And it's multi-label: one bug can legitimately map to several CWEs at once.

So the task is: read the CVE description, output the CWE ID(s) — CWE-79, or CWE-89, CWE-352 when more than one applies. Do it well at scale and you can finally see your own patterns, route fixes to the right teams, and stop drowning in a backlog.

One rule we never broke: description only. No CVE ID in the prompt, no metadata, no hints about how anyone else classified it — just the prose. A nurse working from the patient's words alone. That constraint is the whole point; anything else is cheating.
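As a rough illustration of the task's input and output contract, here is a minimal Python sketch. The prompt wording and the function names `build_prompt` and `parse_cwes` are assumptions made for this example, not the prompt or code actually used in the project. The only things taken from the text are that the model sees the description alone and that the answer is one or more CWE IDs.

```python
import re

# Matches IDs such as "CWE-79" or "cwe-089" anywhere in the model's reply.
CWE_PATTERN = re.compile(r"\bCWE-(\d+)\b", re.IGNORECASE)


def build_prompt(description: str) -> str:
    """Build a description-only prompt (hypothetical wording).

    Deliberately takes no CVE ID, vendor metadata, or prior classification.
    """
    return (
        "Classify the weakness in this vulnerability description. "
        "Answer with CWE IDs only, comma-separated.\n\n" + description.strip()
    )


def parse_cwes(model_output: str) -> list[str]:
    """Extract CWE IDs from a reply, normalized and de-duplicated in order."""
    seen: list[str] = []
    for number in CWE_PATTERN.findall(model_output):
        cwe = f"CWE-{int(number)}"  # int() strips leading zeros
        if cwe not in seen:
            seen.append(cwe)
    return seen
```

A parser like this tolerates a chatty reply as well as a terse one, which is handy when scoring the untrained model on the same footing as the fine-tune.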

Three things make this genuinely hard, and they'll matter later:

  1. It's brutally long-tailed. A few CWEs (XSS, SQL injection, buffer overflow) appear constantly; most of the ~117 are rare, with a few dozen examples or fewer. A model that learns only the head looks great on average and is useless precisely on the unusual cases you most need help with.
  2. Naming vs. inferring. Some descriptions say the flaw outright ("a cross-site scripting issue…"); many just describe a behavior and leave the category to be deduced. We split every result into easy (weakness named in the text) and hard (must be inferred) so a single average can't hide which one the model is actually good at.
  3. The categories are devious siblings. Is a heap out-of-bounds write CWE-787 or CWE-123? Is an authorization bug CWE-863 or CWE-285? These are distinctions that start arguments among human experts.

How you actually teach an old model a new trick

Our raw material, Gemma 4 12B, already knows an enormous amount — "12B" means 12 billion internal parameters, the tunable dials that encode everything it learned in pre-training. The problem isn't knowledge. It's discipline and house style for this one job.

The fix is fine-tuning — continued training on task-specific examples — and we used a particularly elegant flavor of it called QLoRA. Here's why it's clever, because this is the genuinely interesting part. Retraining all 12 billion dials would be slow and need far more memory than one card has. So instead you freeze the entire original model and bolt on a small set of new dials — a LoRA adapter — and train only those. In our case the adapter is 65 million parameters, just 0.55% of the model. On top of that, the frozen model is loaded in 4-bit precision (each dial squeezed from 16 bits down to 4) to fit in memory — the "Q" in QLoRA. The result: a 12-billion-parameter model meaningfully reshaped by a postage-stamp-sized, dirt-cheap patch.
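To make the arithmetic behind that paragraph concrete, here is a small sketch. It relies on two standard facts: a rank-r LoRA on a weight matrix of shape d_out × d_in adds r × (d_in + d_out) trainable parameters, and weight storage is roughly parameters × bits / 8 bytes. The layer size and rank in the usage lines are hypothetical, since the post does not state the adapter's rank or which layers it targets.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters a rank-`rank` LoRA adds to one d_out x d_in weight.

    The adapter is two thin matrices, (rank x d_in) and (d_out x rank),
    while the original d_out x d_in weight stays frozen.
    """
    return rank * (d_in + d_out)


def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB, ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection with a rank-16 adapter.
    full = 4096 * 4096
    extra = lora_param_count(4096, 4096, 16)
    print(f"frozen: {full:,}  adapter: {extra:,}  ratio: {extra / full:.2%}")

    # Nominal 12B parameters at 16-bit versus 4-bit.
    print(weight_footprint_gb(12e9, 16), "GB at 16 bits")
    print(weight_footprint_gb(12e9, 4), "GB at 4 bits")
```

With a nominal 12 billion parameters this gives 24 GB at 16 bits and 6 GB at 4 bits, which is why the frozen base fits on one card with room left for the adapter, its optimizer state, and activations.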

Concretely: 3 epochs (three full passes over the data), context length capped at 512 tokens, on a single NVIDIA RTX 5090 (32 GB). It ran for about 7 hours and peaked at 17.3 GB of memory, just over half the card. Training loss fell smoothly from ~5.5 to ~0.70.

The teaching set was 50,074 real CVE→CWE pairs, and its construction matters: the common classes were capped at 1,500 examples each so they couldn't dominate, and the training split was deliberately weighted toward the inference-hard cases. We were explicitly trying to force the model to learn the tail, not coast on the head.
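One way to cap the common classes is sketched below. This is an assumption about the mechanics, not the project's actual pipeline. In particular, the rule for multi-label examples (keep an example while any of its labels is still under the cap) is a choice made for this illustration, and it means a common class can end up slightly above the cap through co-occurrence with rarer ones.

```python
import random
from collections import Counter


def cap_common_classes(examples, cap=1500, seed=0):
    """Downsample so that no CWE dominates the training set.

    `examples` is a list of (description, set_of_cwe_ids) pairs.
    An example is kept while at least one of its labels is under `cap`
    (an illustrative rule that favors keeping rare labels).
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)  # avoid keeping only the oldest examples

    counts = Counter()
    kept = []
    for description, labels in shuffled:
        if any(counts[label] < cap for label in labels):
            kept.append((description, labels))
            counts.update(labels)
    return kept
```

The cap of 1,500 comes from the text. The seed and the shuffling step are added here only so the sketch is reproducible.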

Why so deliberate? Because our first attempt taught us a lesson in the most pointed way possible. An earlier, smaller model trained for a single epoch learned the popular flaws beautifully and the rare ones not at all. Its score on the rare-class metric (more on that metric in a second) was 0.067 out of 1.0 — a number that means "this model has, for all practical purposes, given up on the entire long tail." It could shout "XSS!" with total confidence and had essentially nothing to say about the other hundred categories. Fixing that specific failure was the entire mission of version two.


Did it work? The clean experiment

To prove the training did something real — and that we weren't flattering ourselves — we ran the fairest test available. Same Gemma 4 model, same hardware, same prompt, same everything, changing exactly one variable: whether our adapter was attached. Untrained model versus trained model, head to head.

First, the three ways we graded, because the distinctions carry the whole story:

  • Exact-match — did it produce the complete, correct set of CWEs, nothing missing and nothing extra? The harshest grade.
  • Micro-F1 — partial credit, pooled across every individual prediction. Forgiving of "got 2 of 3 right."
  • Macro-F1 — grade each of the ~117 classes separately, then average, so a rare class counts exactly as much as a common one. This is the one that exposes a dead tail. You can post a respectable exact-match by mastering the ten common flaws while macro-F1 quietly craters because you ignored the other hundred.
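For readers who want the three grades pinned down, here is a minimal pure-Python sketch over sets of CWE IDs. It assumes macro-F1 is averaged over every class that appears in either the ground truth or the predictions. The post does not say exactly which class set was used, so treat that detail as an assumption.

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def score(gold: list[set[str]], pred: list[set[str]]) -> dict[str, float]:
    """Exact-match, micro-F1, and macro-F1 for multi-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    exact = 0
    for g, p in zip(gold, pred):
        exact += g == p            # whole set must match, nothing extra
        for c in g & p:
            tp[c] += 1
        for c in p - g:
            fp[c] += 1             # over-prediction
        for c in g - p:
            fn[c] += 1             # missed label

    classes = set(tp) | set(fp) | set(fn)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = (
        sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
        if classes else 0.0
    )
    return {"exact_match": exact / len(gold), "micro_f1": micro, "macro_f1": macro}


if __name__ == "__main__":
    gold = [{"CWE-79"}, {"CWE-79"}, {"CWE-416"}]
    pred = [{"CWE-79"}, {"CWE-79"}, {"CWE-415", "CWE-416"}]
    print(score(gold, pred))
```

In the toy example, one stray CWE-415 guess costs a third of exact-match and a third of macro-F1, while micro-F1 only falls to 6/7. That asymmetry is why the three numbers can tell different stories about the same model.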

The results on all 10,514 held-out examples:

| metric | our fine-tune | stock Gemma 4 |
| --- | --- | --- |
| exact-match | 0.697 | 0.440 |
| micro-F1 | 0.732 | 0.507 |
| macro-F1 (the rare-class score) | 0.500 | 0.129 |
| easy (weakness named) | 0.808 | 0.675 |
| hard (must be inferred) | 0.611 | 0.257 |
| distinct CWEs predicted across the set | 131 | 295 |

Three things jump out of that table.

The long tail is the whole ballgame, and the training won it. Macro-F1 went from 0.129 to 0.500 — nearly 4×. The bottom row is the mechanism: unsure of itself, stock Gemma sprays guesses across 295 distinct categories, while ours concentrates on 131 — strikingly close to the ~128 that actually occur in the ground truth. The generalist hedges across the entire rulebook; the specialist learned where the real boundaries are. (And for scale: our failed first attempt scored that infamous 0.067 here. From 0.067 to 0.500 is the difference between a toy and a tool.)

The base model recognizes; ours reasons. Look at the easy/hard split — this is my favorite number in the whole project. On easy cases, where the description names the flaw, even untrained Gemma does fine (0.675) because that's mostly reading comprehension. On hard cases, where the category must be inferred from described behavior, untrained Gemma collapses to 0.257, while our fine-tune holds at 0.611 — a 2.4× gap. That divergence is the entire value of fine-tuning expressed in one figure: we didn't teach it new facts, we taught it to diagnose rather than pattern-match.

It's a clean win, not a measurement artifact. Because we held quantization, runtime, and prompt fixed, none of the gap can be blamed on tooling. It's the adapter, full stop. (At full uncompressed precision our model edges a little higher still — 0.714 / 0.756 / 0.538 — but more on compression below.)

One honest caveat in the other direction: the jump from our v1 to v2 moved several levers at once — bigger model, more epochs, a tail-weighted dataset — so we can't attribute the gain to any single one. The trained-vs-untrained comparison above, though, is airtight.


Where the difference actually lives

Averages conceal the texture. Here are real test cases (with the flaw translated into English):

A heap out-of-bounds write inside Google's own TensorFlow.

Truth: CWE-787 (out-of-bounds write). Stock Gemma: CWE-123 (write-what-where — a plausible cousin). Ours: CWE-787.

The generalist grabbed a neighboring category; the specialist learned the consensus convention.

An authorization flaw in OpenStack Barbican, a cloud secrets vault, where an admin could add data to another project's container.

Truth: CWE-863 (incorrect authorization). Stock Gemma: CWE-285 (improper authorization — the near-synonym sibling). Ours: CWE-863.

These two are close enough that humans litigate them. The fine-tune learned which side of the line this dataset draws.

A use-after-free in PJSIP, a widely embedded calling library.

Truth: CWE-416 (use-after-free). Stock Gemma: CWE-415, CWE-416 (it tacks on double-free for good measure). Ours: CWE-416.

There's the over-prediction tic again — when in doubt, list two and hope. It feels safe and it wrecks precision. That habit, scaled up, is the 295-vs-131 spread made concrete: a surprising fraction of being right is the discipline to stop adding guesses.
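The cost of that habit is easy to see in numbers. The sketch below scores the PJSIP case from the text and shows how a "distinct CWEs predicted" figure can be counted. The function names are illustrative only.

```python
def precision_recall(gold: set[str], pred: set[str]) -> tuple[float, float]:
    """Per-example precision and recall over sets of CWE IDs."""
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def distinct_labels(predictions: list[set[str]]) -> int:
    """How many different CWE IDs appear anywhere in the predictions."""
    return len(set().union(*predictions)) if predictions else 0


if __name__ == "__main__":
    truth = {"CWE-416"}
    print(precision_recall(truth, {"CWE-415", "CWE-416"}))  # hedged answer
    print(precision_recall(truth, {"CWE-416"}))             # committed answer
```

The hedged answer keeps full recall but halves precision, and it fails exact-match outright. Every extra guess also adds to the distinct-label count.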

And the efficiency angle, which is easy to undersell. On an easy case where both models reach CWE-319, stock Gemma with its reasoning mode on burns about 231 tokens (tokens are the sub-word chunks a model reads and writes) thinking out loud before committing. Ours spends 11. Same answer, one-twentieth the work. Across ten thousand reports a night, that's the line between a tool you run and a tool you abandon.

Where neither model shines, and I won't pretend otherwise: richly multi-label CVEs. When the consensus answer is four categories deep (say CWE-20, CWE-22, CWE-610, CWE-668), both models tend to confidently return just CWE-22 and drop the rest. Predicting long weakness chains from a paragraph is genuinely unsolved here. Fine-tuning clearly helped on single- and double-label cases; it did not crack the gnarliest ones. That's the honest top of the v3 to-do list.


Shrinking it to fit on a laptop — and what that costs

The full-precision model is ~24 GB, which is fine on a server and irritating on a laptop. So we quantized it — compressed the numbers to fewer bits per dial — and, instead of hand-waving about quality, measured the damage on the same test set:

| metric | full precision (24 GB) | Q8 (12 GB) | Q4 (7 GB) |
| --- | --- | --- | --- |
| exact-match | 0.714 | 0.697 | 0.682 |
| micro-F1 | 0.756 | 0.732 | 0.718 |
| macro-F1 | 0.538 | 0.500 | 0.429 |

Q8 (8 bits per dial) is effectively free. The small gap from full precision is mostly the runtime, not the compression — you'd be hard-pressed to notice.

Q4 (4 bits per dial) is where it gets interesting, and instructive. Relative to Q8, exact-match and micro-F1 slip only ~1.5 points, barely anything. But macro-F1 drops about 7 points (0.500 → 0.429), and the model starts smearing guesses across rare classes again. Read that carefully: the metrics that survive compression are the common-case ones; the rare-CWE tail, the exact thing we worked hardest to win, is what 4-bit quantization erodes first. It's the MP3 effect: compress a pop track and it's fine, compress a sparse piano recording and the quiet detail is the first to go.

The generalizable lesson, worth more than this one project: when you quantize a specialist, check whether you broke its specialty. Headline accuracy will reassure you while the thing you actually built quietly degrades. Use Q8 if the tail matters (for a security tool, it does); reach for Q4 only when memory is tight and you mostly see common weaknesses.
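A check along those lines can be a few lines of Python. The sketch below uses the figures from the quantization table and reports, for any pair of variants, how many points each metric lost and which metric lost the most. The function names are made up for this example.

```python
# Scores copied from the quantization table (same held-out test set).
RESULTS = {
    "full": {"exact_match": 0.714, "micro_f1": 0.756, "macro_f1": 0.538},
    "q8":   {"exact_match": 0.697, "micro_f1": 0.732, "macro_f1": 0.500},
    "q4":   {"exact_match": 0.682, "micro_f1": 0.718, "macro_f1": 0.429},
}


def point_drops(baseline: str, candidate: str, results=RESULTS) -> dict[str, float]:
    """Per-metric loss in percentage points going from baseline to candidate."""
    return {
        metric: (results[baseline][metric] - results[candidate][metric]) * 100
        for metric in results[baseline]
    }


def worst_hit_metric(baseline: str, candidate: str, results=RESULTS) -> str:
    """Name of the metric that degraded the most."""
    drops = point_drops(baseline, candidate, results)
    return max(drops, key=drops.get)


if __name__ == "__main__":
    for metric, drop in point_drops("q8", "q4").items():
        print(f"{metric}: -{drop:.1f} points")
    print("worst hit:", worst_hit_metric("q8", "q4"))
```

Running the comparison per metric, rather than glancing at one headline number, is what surfaces the macro-F1 drop.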


Running it yourself

It's all public under Apache-2.0: the full model for server-side use, and quantized GGUF files for Ollama and llama.cpp — the standard tools for running models locally. The fastest path is a one-line pull in Ollama.

But here's the punchline that closes the loop. Gemma 4 is a reasoning model, and the runtime turns that "think out loud" mode on by default — so even our terse specialist will start monologuing again unless you switch it off (/set nothink, or "think": false over the API). One toggle is, quite literally, the difference between an essay and an answer:

You: An app sends user credentials over plain, unencrypted HTTP.
It:  CWE-319
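For the API route, a minimal sketch using only the Python standard library might look like this. It assumes a local Ollama server on its default port and a model tag of `cwe-classifier`, which is a placeholder: substitute whatever name the model was pulled under. Sending the bare description as the user message is also an assumption, since the exact prompt format is not given here.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint
MODEL_TAG = "cwe-classifier"  # placeholder tag, not the published model name


def build_payload(description: str, model: str = MODEL_TAG) -> dict:
    """Request body for a single, non-streaming, no-reasoning classification."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": description}],
        "stream": False,  # one JSON object back instead of a token stream
        "think": False,   # the toggle described above: answer, not essay
    }


def classify(description: str, url: str = OLLAMA_CHAT_URL) -> str:
    """Send one CVE description to the local server and return the raw reply."""
    body = json.dumps(build_payload(description)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        reply = json.loads(response.read())
    return reply["message"]["content"].strip()


if __name__ == "__main__":
    print(classify("An app sends user credentials over plain, unencrypted HTTP."))
```

Keeping the payload builder separate from the network call makes it easy to confirm that the reasoning toggle is actually being sent.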

The "this model is two days old" tax

Building on an architecture the world received 48 hours earlier is its own adventure, and these are the kind of details that quietly eat an afternoon — so, in the spirit of saving the next person:

  • The tooling hadn't caught up. llama.cpp — the open-source engine most of this runs on — was a few days too old to know Gemma 4 existed. The current build supports it, but recent versions also refactored the conversion logic into a separate package, so the main converter script looks like it supports nothing at all until you find where the Gemma code actually moved.
  • Gemma 4 quietly changed its special tokens. Models use reserved tokens to mark whose turn it is to speak. Gemma 2 and 3 used <start_of_turn> / <end_of_turn>; Gemma 4 switched schemes and wraps the answer in a "reasoning channel." A stock chat template looks correct and silently misformats every prompt — the kind of bug that yields a confident, fluent, completely wrong model. We only caught it by dumping the raw token IDs and matching them against what the model was actually trained on.
  • Asking for "the Q8" handed us a 640 MB file. Benchmarking stock Gemma as our baseline, the convenient one-line download grabbed a 640 MB multi-token-prediction head (a small auxiliary component that happened to share the Q8_0 tag) instead of the 12 GB model. It loaded, it ran, and it produced nothing, with total confidence. That "baseline" would have been a polished, published lie if we hadn't questioned why the download was suspiciously small. Always confirm your baseline is the model you think it is.
  • Reasoning-on-by-default, again — a feature for an open-ended assistant, pure overhead for a single-label classifier, and the last 10× of latency we left on the table until we found the switch.
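The wrong-file trap in particular lends itself to an automated guard. The sketch below compares a downloaded file's size against a back-of-envelope expectation of parameters × bits / 8 bytes. The 50% tolerance and the function names are assumptions chosen for illustration. The check for the `GGUF` magic bytes only confirms the container format, and an auxiliary component would pass it too, so the size test does the real work.

```python
import os


def expected_weight_bytes(n_params: float, bits_per_param: float) -> float:
    """Rough weight size; real files add metadata and per-block scales."""
    return n_params * bits_per_param / 8


def size_is_plausible(actual_bytes: float, n_params: float,
                      bits_per_param: float, tolerance: float = 0.5) -> bool:
    """True if the size is within `tolerance` (as a fraction) of the estimate."""
    expected = expected_weight_bytes(n_params, bits_per_param)
    return abs(actual_bytes - expected) <= tolerance * expected


def check_gguf(path: str, n_params: float, bits_per_param: float) -> bool:
    """Confirm the file is GGUF and roughly the size the full model should be."""
    with open(path, "rb") as handle:
        has_magic = handle.read(4) == b"GGUF"
    return has_magic and size_is_plausible(
        os.path.getsize(path), n_params, bits_per_param
    )


if __name__ == "__main__":
    # The near-miss from the text: 640 MB where roughly 12 GB was expected.
    print(size_is_plausible(640e6, 12e9, 8))
    print(size_is_plausible(12e9, 12e9, 8))
```

A guard like this is crude, but it would have flagged the 640 MB download before any benchmark ran.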

None of these were hard once spotted. Every one of them was invisible until we tripped over it — and every one would have quietly corrupted a number we were about to publish.


What we actually learned

  • Fine-tuning buys judgment, not facts. Stock Gemma already knew every one of these categories. What it lacked was the discipline to answer with only the right one and the calibration to choose between look-alikes. Three epochs of QLoRA — 0.55% of the model, one night, one card — bought both.
  • Macro-F1 is the honest metric for long-tail problems. It's where v1 failed (0.067), where v2 won (0.500), and where quantization quietly bills you. Report only accuracy and you'd miss all three stories. Pick the metric that can embarrass you.
  • Verify the unglamorous things. Our scariest near-miss wasn't a subtle modeling error; it was a 640 MB file wearing a 12 GB model's name tag. Half of doing this honestly is making sure the thing under test is the thing you think it is — and that one flattering average isn't hiding a dead tail.

The result is a small, fast, refreshingly tight-lipped tool: feed it a messy CVE description, get back a clean set of CWE IDs, in eleven tokens, more reliably than the far chattier model it was carved from. It's not a replacement for a human analyst: it still fumbles the deepest multi-label chains, and it's a triage aid, not an oracle. But for us at exploit-intel.com it closes the exact gap we started with: every CVE that lands in the pipeline gets its weakness category the moment it arrives, on our own hardware, with no backlog and nobody else's queue to wait on. And for anyone else with more bug reports to map than hours to map them, it's a real extra pair of hands. A tireless triage nurse who, at long last, has learned to stop explaining.

Grab it: full model · quantized for laptops · the dataset we trained on