mba-simplifier / README.md
qox's picture
Upload README.md with huggingface_hub
4c0c6c9 verified
|
Raw
History Blame Contribute Delete
4.22 kB
---
language:
- en
tags:
- code
- symbolic-math
- deobfuscation
- mixed-boolean-arithmetic
license: mit
pipeline_tag: other
---
# MBA Simplifier β€” Pipeline v5
Simplifies **Mixed Boolean-Arithmetic (MBA)** expressions used in code obfuscation.
Combines BFS lookup over a pre-computed expression table, symbolic rewriting (40+ rules),
and optional neural-guided search. Every output is oracle-verified correct.
## Quick Start
```bash
git clone https://huggingface.co/<your-username>/mba-simplifier
cd mba-simplifier
pip install -r requirements.txt
python inference.py "(x & y) + (x | y)"
```
Expected output:
```
BFS: 18,189 entries (182ms)
Input: (x & y) + (x | y)
Simplified: x + y
Reduction: 57% (bfs)
Latency: 186 ms (cold), ~2 ms (warm)
```
## Python API
```python
from inference import simplify
result = simplify("(x & y) + (x | y)")
print(result["simplified"]) # "x + y"
print(result["reduction"]) # 0.571
print(result["strategy"]) # "bfs"
print(result["latency_ms"]) # ~2 ms warm
# 32-bit mode (symbolic only β€” BFS table is 8-bit)
result = simplify("(x & y) + (x | y)", bits=32)
# β†’ "x + y" (strategy: symbolic)
# With neural guidance
result = simplify("complex_expr", model_path="mba_classifier_v2.pt")
```
### Return fields
| Field | Type | Description |
|---|---|---|
| `simplified` | str | Simplified infix expression |
| `original` | str | Input expression |
| `reduction` | float | Tree-size reduction (0.0–1.0) |
| `strategy` | str | `bfs` / `symbolic` / `neural` / `none` |
| `latency_ms` | float | Wall-clock time in milliseconds |
## Input / Output Examples
| Input | Output | Strategy |
|---|---|---|
| `(x & y) + (x \| y)` | `x + y` | bfs |
| `(x \| y) - (x & y)` | `x ^ y` | bfs |
| `(x + y) - 2 * (x & y)` | `x ^ y` | bfs |
| `~(~x)` | `x` | bfs |
| `-(~x) - 1` | `x` | symbolic |
| `~(~x & ~y)` | `x \| y` | symbolic |
| `x & (x \| y)` | `x` | symbolic |
| `x ^ x` | `0` | bfs |
| `x * 1 + y * 0` | `x` | bfs |
| `(x & ~y) \| (x & y)` | `x` | bfs |
## Architecture
**Three-layer pipeline** (first hit wins):
1. **BFS lookup** β€” pre-computed table of 18,189 minimal 8-bit expressions
(loaded from `bfs_table_6.json`, ~180ms first call, ~2ms warm).
Coverage: all 2-variable expressions up to AST size 6.
2. **Symbolic rewriting** β€” 40+ algebraic rules covering MBA identities,
De Morgan, absorption, two's complement, XOR-NOT, cancellation,
constant folding. Runs to fixpoint (up to 30 passes). Works for 8-bit
and 32-bit.
3. **Neural-guided beam search** (optional, requires `mba_classifier_v2.pt`) β€”
Transformer classifier predicts rewrite rules to try. Symbolic executor
applies them; oracle verifies each candidate. Requires `torch`.
**Oracle verification:** every candidate result is verified by random sampling
(512 points). The pipeline never returns a wrong answer β€” it falls back to the
original if verification fails.
## Benchmark Results
| Method | Correct% | Reduced% | Size reduction | ms/expr |
|---|---|---|---|---|
| **Pipeline v5 (ours)** | **100%** | **100%** | **77%** | **6.5ms** |
| MBA-Blast (USENIX '21) | 100% | 100% | 77% | 15.8ms |
| SSPAM (Quarkslab) | 100% | 0% | 0% | 298.6ms |
| LLaMA-3.3-70B | 93% | 12% | 16% | 34.4ms |
*Dataset: MBA-Blast paper dataset, 62 expressions, 8-bit 2-var.*
## Files
| File | Description |
|---|---|
| `inference.py` | Standalone inference β€” the only file you need |
| `bfs_table_6.json` | Pre-computed BFS table, 18K entries (1.7MB) |
| `mba_classifier_v2.pt` | Neural model, 2.2M params (8.6MB, optional) |
## Limitations
- **BFS table:** 8-bit only, variables `x` and `y` only. Expressions with 3+
variables or 32-bit semantics fall through to symbolic rewriting.
- **Variables:** up to 2 variables for BFS; symbolic handles more.
- **Bit width:** 8 and 32 only.
- **Cold start:** BFS table loads in ~180ms on first call; subsequent calls
are fast (~2ms).
- **Neural model trained on 8-bit** β€” may not generalize perfectly to 32-bit.
## Requirements
```
torch>=1.13.0 # required for neural layer (optional otherwise)
numpy>=1.21.0 # required for oracle exhaustive check (8-bit)
```
Python 3.10+ required (structural pattern matching).