LCSAJdump Hybrid ML Gadget Ranker
Model Overview
This repository contains the pre-trained Machine Learning model used by the LCSAJdump ROP/JOP/COP gadget finder.
The model is a LightGBM LambdaRank engine designed to score and sort Return-Oriented Programming (ROP) gadgets extracted from binary executables. It was trained to differentiate between useless instruction sequences and highly valuable, exploit-ready gadgets by combining structural static analysis with Deep Semantic Features (extracted via angr symbolic execution).
How it works
Traditional gadget finders (like ROPgadget or ropper) rely on syntactic heuristics (e.g., "does it end with ret?" or "does it pop rdi?"). This approach often yields hundreds of false positives, especially in obfuscated binaries or complex architectures like ARM64 and RISC-V.
This LambdaRank model receives a set of 29 features for each gadget, including:
- Structural Features: Extracted by LCSAJdump's RainbowBFS algorithm (e.g., instruction count, presence of internal calls, clobbered registers).
- Semantic Features: Extracted by running the gadget through the angr symbolic execution engine. These features tell the model whether a gadget actually performs a stack pivot (sm_stack_pivot_size), controls argument registers (sm_controls_arg_reg), or performs memory writes (sm_writes_memory).
Trained on a ground truth of real-world CTF exploit scripts, the model learns to prioritize gadgets that are genuinely useful for building exploit chains, reaching an NDCG@5 of about 0.93 in K-fold cross-validation (see Training Data & Performance).
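To make the re-ranking idea concrete, here is a minimal sketch of scoring a group of candidate gadgets and sorting them by score. It is not LCSAJdump's actual code: the `Gadget` class, the `toy_score` function, and its weights are hypothetical stand-ins for the trained LambdaRank model, and only a handful of the 29 features are shown. The feature names are the ones mentioned in this README.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Gadget:
    """Hypothetical container for a gadget and a few of its features."""
    address: int
    text: str
    is_ret_terminated: bool      # structural feature
    heuristic_score: float       # base syntactic score
    frame_size_bytes: int        # stack space consumed by the gadget
    sm_stack_pivot_size: int     # semantic feature
    sm_controls_arg_reg: bool    # semantic feature
    sm_writes_memory: bool       # semantic feature


def toy_score(g: Gadget) -> float:
    """Arbitrary hand-written scorer standing in for the model's prediction."""
    score = g.heuristic_score
    score += 2.0 if g.is_ret_terminated else 0.0
    score += 3.0 if g.sm_controls_arg_reg else 0.0
    score -= 0.01 * g.frame_size_bytes  # penalize large stack frames
    return score


def rerank(gadgets: List[Gadget], scorer: Callable[[Gadget], float]) -> List[Gadget]:
    """Sort gadgets from highest to lowest score."""
    return sorted(gadgets, key=scorer, reverse=True)


candidates = [
    Gadget(0x401020, "mov eax, ebx; jmp rcx", False, 0.5, 0, 0, False, False),
    Gadget(0x401010, "add rsp, 0x88; ret", True, 1.0, 144, 0x88, False, False),
    Gadget(0x401000, "pop rdi; ret", True, 1.0, 16, 0, True, False),
]

ranked = rerank(candidates, toy_score)
for g in ranked:
    print(f"{g.address:#x}  {toy_score(g):6.2f}  {g.text}")
```

In the real tool the scorer would be the trained model's prediction over the full feature vector; the sorting step is the same regardless of where the scores come from.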
Architectures Supported
The model is fully architecture-aware and currently supports:
- x86_64
- x86_32
- ARM64 (AArch64)
- RISC-V (64-bit)
Usage in LCSAJdump
This model is deeply integrated into the LCSAJdump CLI tool.
You do not need to download or run this model manually. When you install LCSAJdump, the .pkl file is bundled in the lcsajdump/ml/models/ directory.
Simply run the tool against a binary:
python3 -m lcsajdump.cli /path/to/binary
If the model is present, LCSAJdump will automatically activate the ML re-ranking engine and output:
[+] ML re-ranking active (gadget_model.pkl)
(To disable the ML engine and fall back to pure static heuristics, use the --algo flag).
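The "use the model if it is present" behavior described above can be sketched as follows. This is an illustrative assumption about how such a check might look, not the actual LCSAJdump implementation: the function names and the fallback message are hypothetical, while the ml/models/ location and the gadget_model.pkl file name come from this README.

```python
from pathlib import Path
from typing import Optional

MODEL_NAME = "gadget_model.pkl"  # file name shown in the tool's log line


def find_bundled_model(package_root: Path) -> Optional[Path]:
    """Return the model path under <package_root>/ml/models/, or None if absent."""
    candidate = package_root / "ml" / "models" / MODEL_NAME
    return candidate if candidate.is_file() else None


def describe_ranking_mode(package_root: Path) -> str:
    """Report which ranking mode would be used for this installation."""
    model = find_bundled_model(package_root)
    if model is not None:
        return f"[+] ML re-ranking active ({model.name})"
    # Hypothetical fallback message; the real tool's wording may differ.
    return "[-] ML model not found, using static heuristics only"


if __name__ == "__main__":
    # Assumes the script is run from the directory containing the package.
    print(describe_ranking_mode(Path("lcsajdump")))
```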
Training Data & Performance
The model was trained on a custom dataset (gadget_dataset.csv, 7847 samples) built by automatically extracting and labeling gadgets used in published exploit scripts from major CTF competitions (e.g., DEF CON, LACTF, DiceCTF, ROP Emporium).
Performance (K-Fold Cross Validation):
- NDCG@1: 0.9333 (the top-ranked gadget is the best available choice in the large majority of cases)
- NDCG@3: 0.9333
- NDCG@5: 0.9349
- NDCG@10: 0.8592
(For comparison, pure static heuristics score ~0.81 on NDCG@10.)
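For readers unfamiliar with the metric, the sketch below computes NDCG@k for one ranked list of gadgets. It assumes the common exponential-gain formulation (gain 2^rel - 1, discount log2(position + 1)); the README does not state which variant or relevance labels were used in the evaluation, so treat the function names and the example labels as illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)  # position is 0-based
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order the ranker returned the gadgets
# (1 = used in a real exploit, 0 = not used). Example values only.
perfect = [1, 0, 0]
missed_top = [0, 1, 0]

print(ndcg_at_k(perfect, 1))     # the useful gadget is ranked first
print(ndcg_at_k(missed_top, 1))  # the useful gadget is not in the top 1
print(ndcg_at_k(missed_top, 3))  # partial credit for ranking it second
```

A score of 1.0 means the ranker produced the ideal ordering within the top k; lower values mean useful gadgets were pushed down the list.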
Feature Importances (SHAP)
The most impactful features learned by the model are:
- is_ret_terminated (Clean execution flow is paramount)
- heuristic_score (Base syntactic score)
- frame_size_bytes (Stack damage minimization)
- sm_stack_pivot_size (Semantic stack control via angr)
- stack_slots
Author
Created by Chris1sFlaggin for the LCSAJdump project.