exploitintel committed 1 day ago

Commit

9904a9d

verified ·

1 Parent(s): 83f96e7

Upload blog.md with huggingface_hub

Files changed (1)

blog.md +192 -0

blog.md ADDED

	@@ -0,0 +1,192 @@

+# From Essays to `CWE-319`: Fine-Tuning Gemma 4 12B to Classify Vulnerabilities
+*How a weekend QLoRA run turned Google's brand-new 12B model into a precise CVE→CWE classifier that beats both its own base model and our previous attempt — and what the numbers actually say.*
+---
+## The one-token answer problem
+Ask a stock large language model to classify a vulnerability and it will be *delighted* to help. Here is what `gemma-4-12b-it` — Google's freshly released open model — does when you hand it a CVE description and ask which weakness class it belongs to:
+> **You:** *The update handler transmits user credentials over an unencrypted HTTP channel.*
+>
+> **Gemma 4:** *Great question! This vulnerability maps to **CWE-319: Cleartext Transmission of Sensitive Information**. Here's why: when credentials travel over plain HTTP, an attacker positioned on the network path can intercept them… (continues for three more paragraphs, with a CVSS primer and a bulleted remediation plan).*
+It's not wrong. It's just not *useful* if you're triaging 10,000 CVEs a night and need exactly one thing: the CWE ID (comma-separated when several apply), nothing else. You wanted `CWE-319`. You got an essay.
+This post is about closing that gap. We took the new Gemma 4 12B, fine-tuned it with QLoRA on a curated CVE→CWE dataset, and turned a chatty generalist into a terse specialist that answers in **eleven tokens** and gets the answer right far more often than the model it started from. Along the way we'll look at where fine-tuning genuinely helps (the rare, must-infer cases), where it doesn't (richly multi-label vulnerabilities are hard for *everyone*), and what quantization costs you when you ship it to a laptop.
+All the numbers below are real, measured on a held-out test set of 10,514 examples. Where the model loses, we'll say so.
+---
+## The task: CVE in, CWE out
+A quick framing for readers who don't live in vulnerability databases.
+- A **CVE** (Common Vulnerabilities and Exposures) is a *specific* publicly disclosed vulnerability — "the thing that's broken in version 2.3 of this product." Each one comes with a free-text description.
+- A **CWE** (Common Weakness Enumeration) is the *category* of underlying flaw — "SQL injection," "use-after-free," "cleartext transmission." MITRE's [View-1003](https://cwe.mitre.org/data/definitions/1003.html) narrows the universe to ~117 weakness classes commonly used for mapping CVEs.
+The job: read a CVE description and emit the CWE ID(s) it maps to — `CWE-79`, or `CWE-89, CWE-352` when several apply. It's **multi-label classification over ~117 classes**, and it's harder than it sounds for three reasons:
+1. **It's long-tailed.** A handful of CWEs (XSS, SQL injection, buffer overflow) dominate, with hundreds of examples each. Most CWEs are rare: a few dozen examples, sometimes fewer. A model that only learns the head looks decent on average and is useless on the tail.
+2. **Half the work is inference, not keyword-matching.** Some descriptions literally name the weakness ("a cross-site scripting flaw…"). Many don't — they describe a behavior and leave you to infer the class. We split our evaluation into **easy** (weakness named in the text) and **hard** (must be inferred) precisely so one number can't flatter the other.
+3. **The boundaries are subtle.** Is a heap out-of-bounds *write* CWE-787 or CWE-123? Is an authorization flaw CWE-863 or CWE-285? These are sibling classes a generalist routinely confuses.
+One rule we held throughout: **description-only**. No CVE ID in the prompt, no metadata, no hints. The model has to reason from the prose, because that's the entire point.
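To make the output contract concrete, here is a minimal sketch of how a reply could be turned into a label set and scored for exact match. The names `parse_cwe_labels` and `exact_match` are hypothetical, and the post does not describe its actual scorer, so the normalization rules below (case-insensitive IDs, order ignored) are assumptions.

```python
from __future__ import annotations

import re

# Matches IDs such as "CWE-79" or "cwe-319" anywhere in a string.
CWE_PATTERN = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def parse_cwe_labels(text: str) -> frozenset[str]:
    """Extract the set of CWE IDs mentioned in a reply or a label string."""
    return frozenset(f"CWE-{int(number)}" for number in CWE_PATTERN.findall(text))


def exact_match(predicted: str, truth: str) -> bool:
    """True only when the predicted label set equals the truth set exactly."""
    return parse_cwe_labels(predicted) == parse_cwe_labels(truth)
```

Under this definition an extra sibling label is as costly as a missing one: `exact_match("CWE-415, CWE-416", "CWE-416")` is false even though the right class is present.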
+---
+## The ingredients
+**The model — Gemma 4 12B.** Released by Google on June 3, 2026, Gemma 4 is an encoder-free multimodal model (text, image, audio flow into a single decoder) under an Apache-2.0 license. We use it as a pure text model — the vision and audio paths are irrelevant to reading CVE prose, and we strip them out entirely when we ship. Being two days old, it brought some tooling adventures (more on that later), but the underlying model is excellent.
+**The data — `exploitintel/cve-cwe-consensus` (v3).** A chat-formatted dataset of CVE descriptions paired with consensus CWE labels: **50,074 train / 11,052 validation / 10,514 test**. The v3 refresh does two things that matter: it **caps the head at 1,500 examples per class** (so the common CWEs can't drown everything else), and it **weights the training split toward inference-hard cases** (so the model is forced to learn to infer, not just pattern-match). The assistant turn is the bare answer — `CWE-319` — which is exactly the terseness we want to teach.
+**The hardware.** A single NVIDIA RTX 5090 (32 GB, Blackwell). One GPU, one overnight run.
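The per-class head cap mentioned for the dataset can be sketched roughly as follows. This is not the dataset's actual build script: `cap_head_classes` is a hypothetical helper, and the rule for multi-label examples (drop an example as soon as any of its labels has reached the cap) is an assumption chosen so that no class can exceed the limit.

```python
import random
from collections import Counter


def cap_head_classes(examples, cap=1500, seed=0):
    """Keep at most `cap` examples per CWE label.

    `examples` is a list of (description, labels) pairs, where `labels`
    is a tuple of distinct CWE IDs. Order is shuffled first so the kept
    subset is not biased toward the start of the input.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    counts = Counter()
    kept = []
    for description, labels in shuffled:
        # Assumed rule: skip the example if any of its labels is already full.
        if any(counts[label] >= cap for label in labels):
            continue
        kept.append((description, labels))
        counts.update(labels)
    return kept
```

A stricter or looser rule for multi-label rows would change how many rare co-occurring labels survive, which is worth checking against macro-F1 before settling on one.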
+---
+## The training run
+We fine-tuned with **QLoRA** via [Unsloth](https://github.com/unslothai/unsloth): the base model loaded in 4-bit (bitsandbytes nf4), a LoRA adapter of rank 16 trained on top, **3 epochs**, context length 512, full-sequence supervised fine-tuning. Only 0.55% of the parameters (65M of 12B) are trainable — QLoRA's whole appeal is that you nudge a giant model with a tiny, cheap delta.
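For a sense of where a figure like 65M comes from, the sketch below counts the parameters a LoRA adapter adds to one weight matrix and checks the trainable fraction quoted above. The function names are hypothetical, and the 4096 x 4096 shape in the usage note is an illustrative assumption, not Gemma 4's real layer size.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(trainable_params: float, total_params: float) -> float:
    """Share of the model's parameters that receive gradient updates."""
    return trainable_params / total_params
```

With rank 16, a hypothetical 4096 x 4096 projection gains `lora_param_count(4096, 4096, 16)` = 131,072 parameters, and `trainable_fraction(65e6, 12e9)` comes out at roughly half a percent, in line with the rounded figures in the text.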
+The run took **~7.1 hours** and peaked at **17.3 GB of VRAM** — comfortably inside the 5090's 32 GB, with headroom we used to push the batch size. Training loss fell from ~5.5 to ~0.70, smooth and stable the whole way.
+Why three epochs? Because our *first* attempt — call it v1, a smaller 4B-class Gemma variant trained for a single epoch — had a very specific, very instructive failure. It learned the common CWEs and **completely missed the tail**:
+| v1 (4B-class, 1 epoch) | score |
+|---|---|
+| exact-match | 0.29 |
+| micro-F1 | 0.32 |
+| **macro-F1** | **0.067** |
+That macro-F1 of **0.067** is the tell. Micro-F1 weights every prediction equally, so it's dominated by the common classes; macro-F1 averages the per-class F1 *unweighted*, so every rare CWE counts as much as XSS. A macro-F1 near zero means the long tail was essentially unlearned. v1 was a model that could shout "XSS!" confidently and had nothing to say about the other hundred-odd weaknesses. The whole point of v2 — bigger model, more epochs, a tail-weighted dataset — was to fix that.
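The difference between the two averages is easy to see in code. Below is a minimal pure-Python sketch for multi-label predictions; `micro_macro_f1` is a hypothetical name, and averaging macro-F1 over every label seen in either the truth or the predictions is an assumption, since the post does not say which label set its scorer averages over.

```python
from collections import Counter


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(truths, preds):
    """Return (micro_f1, macro_f1) for parallel lists of label sets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(truths, preds):
        for label in pred & truth:
            tp[label] += 1
        for label in pred - truth:
            fp[label] += 1
        for label in truth - pred:
            fn[label] += 1

    # Micro: pool every decision, so frequent classes dominate.
    micro = _f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

    # Macro: score each class separately, then take an unweighted mean.
    labels = set(tp) | set(fp) | set(fn)
    macro = sum(_f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    return micro, macro
```

A model that answers `CWE-79` for ten examples, eight of which really are `CWE-79` and two of which are rare classes, scores a micro-F1 of 0.8 but a macro-F1 below 0.3, which is the v1 failure pattern in miniature.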
+---
+## Results: did it work?
+The cleanest way to measure what fine-tuning bought is a controlled ablation: take the **same base model, the same Q8_0 quantization, the same Ollama runtime, the same system prompt** — and change only whether our QLoRA adapter is applied. Here's that head-to-head on the held-out test set (10,514 examples, description-only, direct-answer mode):
+| metric | **our fine-tune** | stock gemma-4-12b |
+|---|---|---|
+| exact-match | **0.697** | 0.440 |
+| micro-F1 | **0.732** | 0.507 |
+| **macro-F1** | **0.500** | 0.129 |
+| easy exact (weakness named) | **0.808** | 0.675 |
+| hard exact (must infer) | **0.611** | 0.257 |
+| distinct CWEs predicted across the set | 131 | 295 |
+*Same model, same quant, same pipeline — the only difference is our fine-tuning.*
+Three things jump out.
+**The rare tail is the whole ballgame — and fine-tuning wins it.** Macro-F1, which weights every CWE class equally, goes from stock's **0.129 to 0.500** — nearly 4×. The bottom row explains why: stock gemma scatters its guesses across **295 distinct CWE classes**, a long smear of spurious labels; our model concentrates on **131**. The generalist hedges across the whole taxonomy; the specialist learned where the real boundaries are. (For reference, our *first* attempt — a 4B-class model, one epoch — managed a macro-F1 of just **0.067**. That tail-collapse was the failure we set out to fix, and 0.500 is how it went.)
+**The base model keyword-matches; ours infers.** Look at the easy/hard split. On **easy** cases — the weakness is named in the text — stock does respectably at **0.675**, because it can pattern-match. On **hard** cases, where the class must be inferred from described behavior, stock collapses to **0.257**, while our fine-tune holds at **0.611** — a 2.4× gap. That single divergence *is* the value of fine-tuning: it taught the model to reason about weaknesses it isn't simply handed.
+**It's a clean win, not a pipeline artifact.** Because the comparison fixes the quantization and runtime, none of the gap comes from tooling. It's the adapter. (For completeness, our full-precision bf16 model scores a hair higher than its Q8 form — **0.714 / 0.756 / 0.538** exact/micro/macro — and since Q8_0 is near-lossless, the ablation above is just as fair at full precision.)
+One honesty note in the other direction: the jump from our v1 (4B, one epoch) to v2 (12B, three epochs, tail-weighted data) moves several variables at once, so we can't credit any single one. The stock-gemma ablation is the controlled comparison — and it's decisive.
+---
+## Where the fine-tune actually wins
+Averages hide the story. Here are real test cases (truth label from the dataset, both models in direct-answer mode):
+**Heap out-of-bounds write in TensorFlow's Grappler.**
+> truth: `CWE-787` · stock gemma: `CWE-123` · **ours: `CWE-787`** ✓
+CWE-123 ("write-what-where") is *plausibly* related, but the dataset's consensus label is the more specific CWE-787 (out-of-bounds write). The generalist reached for a neighbor; the specialist learned the convention.
+**Authorization flaw in OpenStack Barbican** (an admin could add secrets to another project's container).
+> truth: `CWE-863` · stock gemma: `CWE-285` · **ours: `CWE-863`** ✓
+CWE-285 ("improper authorization") and CWE-863 ("incorrect authorization") are siblings a human reviewer might debate. The fine-tune learned which side of the line this dataset draws.
+**Use-after-free in PJSIP.**
+> truth: `CWE-416` · stock gemma: `CWE-415, CWE-416` · **ours: `CWE-416`** ✓
+Here the failure mode is *over-prediction*: stock gemma tacks on CWE-415 (double-free) for good measure, tanking precision. Our model says exactly what's there. This pattern — the base model hedging by emitting extra sibling CWEs — showed up again and again, and it's a big part of why its exact-match suffers.
+How often? Across the test set, the base model's predictions touched **295 distinct CWE classes**; ours touched **131** — much closer to the ~128 that actually appear in the ground truth. That gap is the over-prediction habit made visible: stock gemma reaches for any plausible-sounding weakness it half-remembers, while the fine-tune has learned the dataset's working vocabulary and stays inside it. A permissions misconfiguration that's purely `CWE-276`? Stock adds `CWE-732` for symmetry; ours doesn't. An authorization bug labeled `CWE-863`? Stock offers the near-synonym `CWE-285`; ours commits to the right one. A surprising amount of precision turns out to be knowing when *not* to add another guess.
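The "distinct classes touched" comparison is straightforward to reproduce on any set of predictions. The helper below, a hypothetical `label_spread`, is a sketch of that count and also lists the classes a model predicted that never occur in the ground truth.

```python
def label_spread(truths, preds):
    """Summarize how widely predictions scatter relative to the truth labels.

    `truths` and `preds` are parallel lists of label sets.
    """
    truth_vocab = set().union(*truths)
    pred_vocab = set().union(*preds)
    return {
        "truth_classes": len(truth_vocab),
        "predicted_classes": len(pred_vocab),
        # Labels the model emitted that no test example actually carries.
        "spurious_classes": sorted(pred_vocab - truth_vocab),
    }
```

On the two examples from the paragraph above (truth `CWE-276` and `CWE-863`, predictions `CWE-276, CWE-732` and `CWE-285`), this reports three predicted classes against two true ones, with `CWE-285` and `CWE-732` flagged as spurious.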
+And the efficiency angle, which is easy to overlook: on an easy case where both models get `CWE-319`, stock gemma with reasoning enabled spends **231 tokens** thinking its way there. Ours spends **11**. At scale, that's a ~20× difference in latency and cost for the same answer.
+**Where neither model shines:** richly multi-label CVEs. When the consensus label is `CWE-20, CWE-22, CWE-610, CWE-668`, *both* models tend to confidently return just `CWE-22` and miss the rest. Predicting four-deep weakness chains from a paragraph is genuinely hard, and it's the most honest "future work" item we have. Fine-tuning narrowed the gap on single- and double-label cases; it did not solve the long multi-label sets.
+---
+## Shipping it: quantization and the laptop question
+A 12B model in bf16 is ~24 GB: fine for a server, awkward for a laptop. So we converted the merged model to GGUF and quantized it. The question every quantization raises is: *what did we lose?* We measured it, running the quantized builds through the same Ollama pipeline used in the ablation and keeping the bf16 score as the reference:
+| metric | bf16 (full) | Q8_0 (12 GB) | Q4_K_M (7 GB) |
+|---|---|---|---|
+| exact-match | 0.714 | 0.697 | 0.682 |
+| micro-F1 | 0.756 | 0.732 | 0.718 |
+| macro-F1 | 0.538 | 0.500 | 0.429 |
+**Q8_0 is effectively lossless.** We attribute most of the gap from bf16 (1.7 points of exact-match, 3.8 of macro-F1) to the runtime pipeline rather than the quantization: call it the cost of moving from `transformers` to a GGUF runtime. If you want the full-quality model that fits in 12 GB, this is it.
+**Q4_K_M is where the tradeoff bites — and it bites the tail.** Exact-match and micro-F1 slip only ~1.5 points, which looks fine. But macro-F1 drops ~7 points (0.500 → 0.429), and the model starts predicting a wider spread of spurious rare classes. In other words, the metric that's robust to quantization is the *common-case* metric; the rare-CWE tail — the thing we worked hardest to win — is exactly what 4-bit quantization erodes first. The lesson generalizes: **when you quantize a long-tail classifier, watch macro-F1, not accuracy.**
+Practical guidance: use **Q8_0** if rare/long-tail CWEs matter to you (and for a security tool, they probably do). Reach for **Q4_K_M** only when you're tight on memory and mostly seeing common weaknesses.
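The "watch macro-F1" advice amounts to comparing per-metric drops rather than a single headline number. A small sketch using the scores from the table above (the `SCORES` dictionary and `drop_in_points` helper are hypothetical names):

```python
# Scores copied from the quantization table in this post.
SCORES = {
    "bf16": {"exact_match": 0.714, "micro_f1": 0.756, "macro_f1": 0.538},
    "Q8_0": {"exact_match": 0.697, "micro_f1": 0.732, "macro_f1": 0.500},
    "Q4_K_M": {"exact_match": 0.682, "micro_f1": 0.718, "macro_f1": 0.429},
}


def drop_in_points(baseline: str, candidate: str) -> dict:
    """Per-metric loss, in percentage points, when moving from baseline to candidate."""
    return {
        metric: round((SCORES[baseline][metric] - SCORES[candidate][metric]) * 100, 1)
        for metric in SCORES[baseline]
    }
```

Going from Q8_0 to Q4_K_M, the exact-match and micro-F1 drops are about a point and a half each, while the macro-F1 drop is several times larger, which is the asymmetry the section describes.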
+---
+## Running it yourself
+Everything is published under Apache-2.0:
+- **`exploitintel/cve-cwe-gemma4-12b`** — the merged bf16 model for `transformers` / vLLM / TGI.
+- **`exploitintel/cve-cwe-gemma4-12b-GGUF`** — Q8_0 and Q4_K_M for Ollama and llama.cpp.
+The fastest path, straight from the Hub:
+```bash
+ollama run hf.co/exploitintel/cve-cwe-gemma4-12b-GGUF:Q8_0
+```
+But here's the gotcha that bit us, and it ties right back to where we started. Gemma 4 is a *reasoning* model, and Ollama runs it with **thinking ON by default** — so out of the box, even our terse specialist will think for a few hundred tokens before answering, and without our Modelfile it won't even have the analyst system prompt. The quick-pull gives you a chatty generalist again. To get the eleven-token classifier, disable thinking and use the published Modelfile:
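+```bash
+# Assumes the published Modelfile has been downloaded to the current directory;
+# this registers it under the local name cve-cwe-gemma4.
+ollama create cve-cwe-gemma4 -f Modelfile
+ollama run cve-cwe-gemma4
+>>> /set nothink
+>>> The update handler transmits credentials over cleartext HTTP.
+CWE-319
+```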
+Or via the API, pass `"think": false` and read the clean `response` field. It's a one-line change, and it's the difference between an essay and a label.
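As a rough illustration of the API route, the sketch below posts a description to a local Ollama server with thinking disabled. It assumes Ollama is listening on its default port (11434), that the model was registered locally as `cve-cwe-gemma4`, and that the installed Ollama version accepts the `think` field; `build_payload` and `classify` are hypothetical helper names.

```python
import json
import urllib.request

# Default local endpoint for Ollama's generate API (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(description: str, model: str = "cve-cwe-gemma4") -> dict:
    """Request body asking for a single, non-streamed, non-thinking reply."""
    return {
        "model": model,
        "prompt": description,
        "stream": False,  # return one JSON object instead of a token stream
        "think": False,   # skip the reasoning channel, as recommended above
    }


def classify(description: str, model: str = "cve-cwe-gemma4", url: str = OLLAMA_URL) -> str:
    """Send the description to Ollama and return the text of the `response` field."""
    body = json.dumps(build_payload(description, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.loads(reply.read())["response"].strip()
```

Calling `classify("The update handler transmits credentials over cleartext HTTP.")` against a running server should return just the label string, provided the Modelfile's system prompt is in place.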
+---
+## The two-day-old-model tax
+Running an architecture that shipped 48 hours earlier is its own small adventure, and the war stories are worth telling because they'll bite the next person too.
+**The conversion toolchain hadn't caught up.** Our local llama.cpp predated Gemma 4 and simply didn't know the architecture. The current build does — but recent llama.cpp also *refactored* conversion into a package, so the main converter script looks like it supports nothing until you realize the Gemma logic now lives in a submodule. When a model is days old, update the toolchain first and check the actual model registry, not the top-level script.
+**Gemma 4 quietly changed its turn tokens.** Where Gemma 2 and 3 used `<start_of_turn>` / `<end_of_turn>`, Gemma 4 uses a different scheme and wraps the answer in a reasoning channel. A stock chat template *looks* right and silently misformats the prompt — exactly the kind of bug that yields a confident, fluent, completely wrong model. We caught it only by dumping raw token IDs and matching them to what the model was trained on.
+**Pulling "the Q8" got us a 640 MB file.** When we went to benchmark stock Gemma as a baseline, the convenient one-line pull grabbed a 640 MB *multi-token-prediction head* that happened to be tagged `Q8_0`, not the 12 GB model. It loaded, it ran, and it produced nothing. That "baseline" would have been silently, embarrassingly wrong if we hadn't questioned why the download was so small. Always confirm your baseline is actually the model you think it is.
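A cheap guard against this kind of mix-up is to compare the downloaded file size with what the parameter count implies. The sketch below assumes a nominal 12 billion parameters and 8.5 bits per weight for Q8_0 (each block stores 32 int8 weights plus one fp16 scale); real GGUF files also carry metadata and may mix tensor types, so the tolerance is deliberately loose. Both function names are hypothetical.

```python
def expected_gguf_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB for a model quantized at a uniform bit width."""
    return n_params * bits_per_weight / 8 / 1e9


def looks_like_full_model(file_bytes: int, n_params: float,
                          bits_per_weight: float, tolerance: float = 0.5) -> bool:
    """True if the file is at least (1 - tolerance) of the expected size."""
    expected_bytes = n_params * bits_per_weight / 8
    return file_bytes >= expected_bytes * (1 - tolerance)
```

For a 12B model at Q8_0 the estimate is about 12.75 GB, so a 640 MB download fails the check immediately while a 12 GB file passes.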
+**Reasoning is on by default.** Gemma 4 thinks before it answers, and the runtime enables that automatically. For an open-ended assistant that's a feature; for a single-label classifier it's pure overhead — the last 10× of latency we left on the table until we switched it off.
+None of these were hard once found. All of them were invisible until we hit them — and every one would have quietly corrupted a number we were about to publish.
+---
+## What we learned
+A few things worth carrying forward:
+- **Fine-tuning buys precision, not just knowledge.** Stock Gemma 4 already *knows* what CWE-319 is. What it lacked was the discipline to answer with exactly that and nothing more, and the calibration to pick the right sibling class. Three epochs of QLoRA on 50K examples bought both — for about seven hours on one GPU.
+- **Macro-F1 is the honest metric for long-tail problems.** It's where v1 failed (0.067), where v2 won (0.500 at Q8_0, 0.538 in bf16), and where quantization quietly costs you. If we'd only reported accuracy, we'd have missed all three stories.
+- **Measure the thing you're actually claiming.** Every number here ran through one scorer, and the baseline was the *same model with the adapter removed* — not a vibes comparison against a chat transcript. Half the work of an honest benchmark is making sure the model under test is the model you think it is (see: the 640 MB "baseline"), and that one flattering average isn't hiding a dead tail (see: macro-F1).
+The result is a small, fast, honest tool: hand it a CVE description, get back a CWE ID, in eleven tokens, more accurately than the 12B generalist it was carved from. Not a replacement for a human analyst — it still over- and under-predicts on the hardest multi-label cases, and it's a triage aid, not an oracle — but a genuine multiplier for anyone who has more CVEs to map than hours to map them.
+*Models: [`exploitintel/cve-cwe-gemma4-12b`](https://huggingface.co/exploitintel/cve-cwe-gemma4-12b) · [`exploitintel/cve-cwe-gemma4-12b-GGUF`](https://huggingface.co/exploitintel/cve-cwe-gemma4-12b-GGUF) · Dataset: [`exploitintel/cve-cwe-consensus`](https://huggingface.co/datasets/exploitintel/cve-cwe-consensus)*