- scar-gnn-defect-detector
- Output example:
```
- GNN Vulnerability Context
Suspicion score : 91.4% (SUSPICIOUS)
Sinks : memcpy — copies n bytes from src to dest — no overlap or bounds check
Input channels : function_argument
Guard status : NO icmp in slice — sink appears UNGUARDED
Harness target : fuzz n relative to dest buffer size; n=0, n=SIZE_MAX, n=dest_size+1
Slice : 31 nodes, 1 sink(s)
- → ranked list of functions by vulnerability score
- → feed top-N to LLM for semantic triage
scar-gnn-defect-detector
A collection of lightweight graph neural networks that classify C functions as vulnerable or safe using their LLVM IR graphs. Designed as a zero-cost pre-filter for LLM-based vulnerability triage pipelines.
Recommended model
model_slice_pdg.pt (§12 PDG slice GNN) and model_slice_pdg_v8.pt (§28 RankNet) — co-best.
Both rank 11 of 13 known-vulnerable scarnet functions in the top 13 of 19 (84.6% P@13).
§28 is preferred when ranking quality matters more than calibration; §12 has better score spread.
python scan_ir.py fn.ll --all-functions --threshold 0.5
python scan_ir.py fn.ll --context # include PDG slice vulnerability context
Models
| File | Section | Architecture | Devign |
|---|---|---|---|
model.pt |
§4d | DefectGNN block-level (60ep h=128, 2 rel) | 55.52% |
model_instr.pt |
§7 | InstrGNN instruction-level baseline (30ep h=64) | 56.53% |
model_instr_v2.pt |
§13 | Perfograph + call categories | 58.75% |
model_instr_v3.pt |
§14 | VSDG memory ordering edges | 57.47% |
model_instr_v4.pt |
§15 | Register name embedding | 57.47% |
model_instr_v5.pt |
§16 | Static analysis flags | 57.15% |
model_instr_v6.pt |
§17 | Taint propagation | 58.00% |
model_slice.pt |
§11 | DFG backward slice | 55.60% |
model_slice_pdg.pt |
§12 | PDG slice — recommended | 56.48% |
model_slice_pdg_v2.pt |
§22 | PDG + taint flags | — |
model_slice_pdg_v3.pt |
§23 | PDG sink-node readout + residual/LN | 55.40% |
model_slice_pdg_v4.pt |
§24 | PDG + intrinsic-aware sinks (retrain) | 55.00% |
model_slice_pdg_v5.pt |
§25 | PDG slice trained on PrimeVul | 55.56% |
model_slice_pdg_v6.pt |
§26 | PDG slice trained on Joern PrimeVul (VOCAB_SIZE=16) — dropped, see note | — |
model_slice_pdg_v7.pt |
§27 | PDG slice, Juliet pretrain + Devign BCE fine-tune; multi-feature x(N,3) | 56.12% |
model_slice_pdg_v8.pt |
§28 | PDG slice, Juliet pretrain + Devign RankNet fine-tune; pairwise ranking loss | TBD |
model_juliet_pretrain.pt |
§29 | Juliet-only — no Devign fine-tune; pure structural prior (no training on commit history) | — |
model_slice_pdg_v9.pt |
§30 | Juliet positives + clean real-C negatives (zlib/musl/lua/lz4/cjson/libuv); no Devign | — |
model_slice_pdg_v10.pt |
§31 | Juliet positives + domain-matched clean negatives (libcurl replaces lua); no Devign | — |
model_slice_pdg_v11.pt |
§32 | Juliet pretrain (with atoi sinks) + Devign RankNet FT; correct eval preprocessor | — |
model_bigvul_cls.pt and model_bigvul_combined.pt (§21) are trained on BigVul only
and have no Devign score. See scarnet table below.
Real-world validation: scarnet
Applied to johwes/scarnet (19 functions, 13 known-vulnerable). Evaluated all
checkpoints at top-13-of-19:
| Checkpoint | Section | Devign | Hits | P@13 | R@13 |
|---|---|---|---|---|---|
| model.pt | §4d block-level DefectGNN | 55.52% | 9/13 | 69.2% | 69.2% |
| model_instr.pt | §7 instr baseline | 56.53% | 9/13 | 69.2% | 69.2% |
| model_instr_v2.pt | §13 Perfograph + call categories | 58.75% | 10/13 | 76.9% | 76.9% |
| model_instr_v3.pt | §14 VSDG memory ordering edges | 57.47% | 10/13 | 76.9% | 76.9% |
| model_instr_v4.pt | §15 register name embedding | 57.47% | 8/13 | 61.5% | 61.5% |
| model_instr_v5.pt | §16 static analysis flags | 57.15% | 8/13 | 61.5% | 61.5% |
| model_instr_v6.pt | §17 taint propagation | 58.00% | 10/13 | 76.9% | 76.9% |
| model_slice.pt | §11 DFG slice | 55.60% | 10/13 | 76.9% | 76.9% |
| model_slice_pdg.pt | §12 PDG slice | 56.48% | 11/13 | 84.6% | 84.6% |
| model_slice_pdg_v2.pt | §22 PDG + taint flags | — | 9/13 | 69.2% | 69.2% |
| model_slice_pdg_v3.pt | §23 PDG sink-node readout | 55.40% | 9/13 | 69.2% | 69.2% |
| model_slice_pdg_v4.pt | §24 PDG + intrinsic-aware sinks | 55.00% | 10/13 | 76.9% | 76.9% |
| model_slice_pdg_v5.pt | §25 PDG slice (PrimeVul training) | 55.56% | 9/13 | 69.2% | 69.2% |
| model_slice_pdg_v7.pt | §27 Juliet pretrain + Devign BCE FT | 56.12% | 10/13§ | 76.9% | 76.9% |
| model_slice_pdg_v8.pt | §28 Juliet pretrain + Devign RankNet | 44.52% | 11/13 | 84.6% | 84.6% |
| model_juliet_pretrain.pt | §29 Juliet-only, no Devign FT | — | 9/13† | 69.2% | 69.2% |
| model_slice_pdg_v9.pt | §30 Juliet + clean real-C neg (lua domain shift) | — | 9/13‡ | 69.2% | 69.2% |
| model_slice_pdg_v10.pt | §31 Juliet + libcurl neg (libcurl too similar to Juliet bad) | — | 8/13‡ | 61.5% | 61.5% |
| model_slice_pdg_v11.pt | §32 Juliet pretrain (atoi sinks) + Devign RankNet FT | — | 11/13§§ | 84.6% | 84.6% |
| model_bigvul_cls.pt | §21 BigVul classifier | — | 9/13 | 69.2% | 69.2% |
| model_bigvul_combined.pt | §21 BigVul+Devign combined | — | 9/13 | 69.2% | 69.2% |
| ENSEMBLE (max) | all models | — | 9/13 | 69.2% | 69.2% |
| ENSEMBLE (mean) | all models | — | 9/13 | 69.2% | 69.2% |
§All §27–§31 results were initially collected with a preprocessor bug: ir_to_graph_slice_pdg (x shape N×1) was used instead of ir_to_graph_slice_pdg_v7 (x shape N×3), silently zeroing guard_class and is_external_input in all eval runs. Additionally atoi/strtol family was missing from DANGEROUS_SINKS. Results above reflect corrected evaluation. §27 BCE regressed from reported 11/13 to 10/13 once guard features are active; §28 RankNet held at 11/13.
†§29 saturates: scores 87–100% on all 19 functions including benign ones — no discrimination possible. Confirmed with corrected preprocessor. Devign fine-tune provides necessary calibration; Juliet-only model has no reference for what real production C looks like.
‡§30 and §31 both failed but for opposite reasons. §30 (lua): negatives too structurally distant, model suppresses server-C as "interpreter-like = clean" (9/13). §31 (libcurl): negatives too structurally similar to Juliet bad, model saturates server-C as "network-like = vulnerable" (8/13). Clean-negatives approach exhausted — no curated open-source corpus sits in the correct structural domain with confirmed-clean labels.
§§§32 reaches 11/13 with scar_atoi at 71.9% (rank 6) — nearly identical to §28's 71.1%, confirming atoi coverage was not the limiting factor. The two misses (handle_stats rank 14, scar_alloc_copy rank 15) are unchanged vs §28. Score spread compresses to 18 points (60.8–79.1%) vs §28's wider distribution — correct rankings, reduced calibration margin. §28 is preferred for operational use. §32 confirms the ceiling is structural, not addressable through sink coverage or training data variants.
§12 is the uniquely best checkpoint. Every other model and the full ensemble top out at 10/13. The ensemble scoring 9/13 — worse than the best individual model — indicates correlated errors: models trained on the same Devign distribution share the same failure modes, so ensembling amplifies rather than cancels noise.
Two irreducible misses (2/13): no model catches all 13. The two remaining misses are semantic bugs with no distinct structural IR signature — wrong comparison operand, or an off-by-one in a constant. These are in the LLM triage domain, not the GNN domain.
False positives (common): main, dispatch, handle_get, session_new — structurally
complex functions that score high but are safe. Dismissible in a one-sentence LLM triage step.
Model Description
model.pt — DefectGNN (block-level)
- Architecture: two
RGCNConvlayers (2 relation types: CFG, DFG), AttentionalAggregation readout, two-layer MLP classifier - Input: LLVM IR compiled with
clang -O0 -fno-inline -S -emit-llvm; each basic block becomes a node with 45 semantic features (opcode distribution, branch density, memory op ratio, call density, phi count, block size) - Output: probability ∈ [0, 1] that the function is vulnerable
- Parameters: ~293 KB (
hidden=128)
Instruction-level GNNs (§7–§17)
- Architecture: Embedding lookup (vocab=110 opcodes) → two
RGCNConvlayers (3 relation types: CFG, DFG, context) → AttentionalAggregation → binary classifier - Input: same IR; each instruction becomes a node (opcode ID)
- Parameters: ~256 KB (
hidden=64)
PDG slice GNNs (§11–§23)
- Architecture: same embedding + RGCN backbone; input is a PDG backward slice from dangerous sinks (DFG + control dependence), not the full function graph
- §12 (recommended): AttentionalAggregation global readout
- §23: Sink-node readout (scatter-max over identified sink embeddings) + residual connections + LayerNorm — correct architecture for node-level supervision; 55.40% on Devign, 9/13 scarnet under graph-level noisy labels
Training
| Dataset | Split | Functions |
|---|---|---|
| Devign (FFmpeg, QEMU, Linux, LibreSSL) | train | ~10,125 |
| Devign | validation | ~1,254 |
| Devign | test | ~1,251 |
Training: Adam lr=1e-3, StepLR decay (γ=0.5, step=10), 30 epochs, hidden=64.
Experiment log summary (§1–§23)
| Section | Change | Devign | Scarnet |
|---|---|---|---|
| §4d | Block-level, 45 features, 2 relations | 55.52% | 9/13 |
| §7 | Instruction-level baseline (opcode only) | 56.53% | 9/13 |
| §11 | DFG backward slice from sinks | 55.60% | 10/13 |
| §12 | PDG slice (DFG + control dependence) | 56.48% | 11/13 |
| §13 | Perfograph encoding + call categories | 58.75%† | 10/13 |
| §14 | VSDG memory ordering edges | 57.47% | 10/13 |
| §15 | Register name embedding | 57.47% | 8/13 |
| §16 | Static analysis flags (cppcheck) | 57.15% | 8/13 |
| §17 | Taint propagation edges | 58.00% | 10/13 |
| §21 | BigVul binary classifier | — | 9/13 |
| §22 | PDG + taint flags combined | — | 9/13 |
| §23 | Sink-node readout + CD cap + residual/LN | 55.40% | 9/13 |
| §24 | PDG + intrinsic-aware sinks (retrain) | 55.00% | 10/13 |
| §25 | PDG slice trained on PrimeVul | 55.56% | 9/13 |
| §26 | Joern PrimeVul — dropped (IKOS pipeline conflict) | — | — |
| §27 | Juliet pretrain (99%+) → Devign BCE FT; multi-feature x(N,3) | 56.12% | 10/13§ |
| §28 | Juliet pretrain → Devign RankNet FT; pairwise ranking loss | 44.52% | 11/13 |
| §29 | Juliet-only, no Devign FT — saturates (87–100% on everything) | — | 9/13† |
| §30 | Juliet pos + clean real-C neg (zlib/musl/lua/lz4/cjson/libuv); no Devign | — | 9/13‡ |
| §31 | Juliet pos + domain-matched neg (libcurl); libcurl too similar to Juliet bad | — | 8/13‡ |
| §32 | Juliet pretrain (atoi sinks) → Devign RankNet FT; scar_atoi 71.9% (≈§28); score compression | — | 11/13§§ |
†High cross-run variance (~54–59%) at the ~1,250-sample split scale.
Full experiment notes: docs/ir-embed.md
Context enrichment for harness generation
slice_context.py converts the PDG slice graph into a structured vulnerability
specification for injection into LLM harness generation prompts:
from preprocess_slice_pdg import ir_to_graph_slice_pdg
from slice_context import summarize_slice, format_for_llm
g = ir_to_graph_slice_pdg(open("function.ll").read())
ctx = format_for_llm(summarize_slice(g, fn_name="process_packet"), score=0.91)
# → inject ctx into your LLM harness generation prompt
Output example: ```
GNN Vulnerability Context Suspicion score : 91.4% (SUSPICIOUS) Sinks : memcpy — copies n bytes from src to dest — no overlap or bounds check Input channels : function_argument Guard status : NO icmp in slice — sink appears UNGUARDED Harness target : fuzz n relative to dest buffer size; n=0, n=SIZE_MAX, n=dest_size+1 Slice : 31 nodes, 1 sink(s)
See [`docs/oss-fuzz-gen-integration.md`](https://github.com/johwes/llvm-ir-vuln-gnn/blob/main/docs/oss-fuzz-gen-integration.md)
for the full integration guide with oss-fuzz-gen.
## Intended Use
This model is a **zero-cost ranker**, not a hard gate. Recommended pipeline:
clang -O0 -fno-inline -S -emit-llvm src/*.c -I include/ -o fn.ll python scan_ir.py fn.ll --all-functions --threshold 0.5
→ ranked list of functions by vulnerability score
→ feed top-N to LLM for semantic triage
Use it to decide *which functions to show an LLM*, not to make final vulnerability
decisions.
## Limitations
- **Topology-only:** node features are opcode categories; identifier names, string
literals, and type tokens are discarded. Semantic bugs (wrong comparison operator,
wrong format string, off-by-one in a constant) produce identical IR topology to
correct code and are undetectable.
- **GNN ceiling ~55–58% on Devign:** 30 experiments across block-level, instruction-level,
slice variants, Perfograph encoding, VSDG edges, taint propagation, sink-node readout,
intrinsic-aware sinks, PrimeVul training, Juliet pretraining, RankNet loss, and clean-negative
corpus substitution all converge in this range.
§28 (RankNet) trades classification accuracy for ranking quality — Devign acc may read ~50%
while scarnet ranking improves; evaluate with eval_all_models.py --scarnet. The ceiling is
Devign's commit-level label noise (~10–20%), not the architecture or dataset quality.
CodeBERT on source text reaches 63.43% partly because it reads identifier names.
See `docs/ir-embed.md` for the full analysis.
- **§26 (Joern/PrimeVul) dropped:** Joern achieved ~95% compilation coverage vs ~34%
for clang, fixing the PrimeVul attrition problem. Dropped because Joern requires a
Java runtime and a separate analysis pass that conflicts with the SCAR pipeline
(IKOS static analysis already uses LLVM bitcode; all experiments stay within that
toolchain).
- **Real-world ranking is more reliable than Devign accuracy:** §12 scores 56.48% on
Devign but 84.6% recall on scarnet. The ranking signal is real; the Devign binary
accuracy is noise-floored.
- **Devign distribution:** trained on C from four large open-source projects; may not
generalise well to embedded, kernel, or heavily macro-expanded code.
## Repository & Reproducibility
**[johwes/llvm-ir-vuln-gnn](https://github.com/johwes/llvm-ir-vuln-gnn)**
Key files:
- `train_gnn/train_slice_pdg.py` — §12 training (recommended)
- `train_gnn/train_slice_pdg_v5.py` — §25 PrimeVul training
- `train_gnn/train_slice_pdg_v7.py` — §27 Juliet pretrain → Devign BCE fine-tune
- `train_gnn/train_slice_pdg_v8.py` — §28 Juliet pretrain → Devign RankNet fine-tune
- `train_gnn/preprocess_juliet.py` — Juliet Test Suite preprocessor (§27/§28)
- `train_gnn/preprocess_primevul.py` — PrimeVul dataset preprocessor
- `train_gnn/preprocess_slice_pdg.py` — PDG slice extractor
- `train_gnn/preprocess_slice_pdg_v3.py` — v3 extractor (sink_mask + CD cap)
- `train_gnn/scan_ir.py` — inference CLI (`--context` flag for LLM context)
- `train_gnn/slice_context.py` — PDG slice → LLM prompt context
- `docs/ir-embed.md` — full experiment log §1–§28 (§26 Joern dropped)
- `docs/applications.md` — market gap analysis
- `docs/oss-fuzz-gen-integration.md` — oss-fuzz-gen integration guide
## Citation
@misc{scar-ir-gnn-2026, title = {SCAR GNN Defect Detector}, author = {johnnywesterlund}, year = {2026}, url = {https://huggingface.co/johnnywesterlund/scar-gnn-defect-detector}, source = {https://github.com/johwes/llvm-ir-vuln-gnn} }