| --- |
| language: |
| - en |
| tags: |
| - code |
| - symbolic-math |
| - deobfuscation |
| - mixed-boolean-arithmetic |
| license: mit |
| pipeline_tag: other |
| --- |
| |
| # MBA Simplifier β Pipeline v5 |
|
|
| Simplifies **Mixed Boolean-Arithmetic (MBA)** expressions used in code obfuscation. |
| Combines BFS lookup over a pre-computed expression table, symbolic rewriting (40+ rules), |
| and optional neural-guided search. Every output is oracle-verified correct. |
|
|
| ## Quick Start |
|
|
| ```bash |
| git clone https://huggingface.co/<your-username>/mba-simplifier |
| cd mba-simplifier |
| pip install -r requirements.txt |
| python inference.py "(x & y) + (x | y)" |
| ``` |
|
|
| Expected output: |
| ``` |
| BFS: 18,189 entries (182ms) |
| Input: (x & y) + (x | y) |
| Simplified: x + y |
| Reduction: 57% (bfs) |
| Latency: 186 ms (cold), ~2 ms (warm) |
| ``` |
|
|
| ## Python API |
|
|
| ```python |
| from inference import simplify |
| |
| result = simplify("(x & y) + (x | y)") |
| print(result["simplified"]) # "x + y" |
| print(result["reduction"]) # 0.571 |
| print(result["strategy"]) # "bfs" |
| print(result["latency_ms"]) # ~2 ms warm |
| |
| # 32-bit mode (symbolic only β BFS table is 8-bit) |
| result = simplify("(x & y) + (x | y)", bits=32) |
| # β "x + y" (strategy: symbolic) |
| |
| # With neural guidance |
| result = simplify("complex_expr", model_path="mba_classifier_v2.pt") |
| ``` |
|
|
| ### Return fields |
|
|
| | Field | Type | Description | |
| |---|---|---| |
| | `simplified` | str | Simplified infix expression | |
| | `original` | str | Input expression | |
| | `reduction` | float | Tree-size reduction (0.0β1.0) | |
| | `strategy` | str | `bfs` / `symbolic` / `neural` / `none` | |
| | `latency_ms` | float | Wall-clock time in milliseconds | |
|
|
| ## Input / Output Examples |
|
|
| | Input | Output | Strategy | |
| |---|---|---| |
| | `(x & y) + (x \| y)` | `x + y` | bfs | |
| | `(x \| y) - (x & y)` | `x ^ y` | bfs | |
| | `(x + y) - 2 * (x & y)` | `x ^ y` | bfs | |
| | `~(~x)` | `x` | bfs | |
| | `-(~x) - 1` | `x` | symbolic | |
| | `~(~x & ~y)` | `x \| y` | symbolic | |
| | `x & (x \| y)` | `x` | symbolic | |
| | `x ^ x` | `0` | bfs | |
| | `x * 1 + y * 0` | `x` | bfs | |
| | `(x & ~y) \| (x & y)` | `x` | bfs | |
|
|
| ## Architecture |
|
|
| **Three-layer pipeline** (first hit wins): |
|
|
| 1. **BFS lookup** β pre-computed table of 18,189 minimal 8-bit expressions |
| (loaded from `bfs_table_6.json`, ~180ms first call, ~2ms warm). |
| Coverage: all 2-variable expressions up to AST size 6. |
|
|
| 2. **Symbolic rewriting** β 40+ algebraic rules covering MBA identities, |
| De Morgan, absorption, two's complement, XOR-NOT, cancellation, |
| constant folding. Runs to fixpoint (up to 30 passes). Works for 8-bit |
| and 32-bit. |
|
|
| 3. **Neural-guided beam search** (optional, requires `mba_classifier_v2.pt`) β |
| Transformer classifier predicts rewrite rules to try. Symbolic executor |
| applies them; oracle verifies each candidate. Requires `torch`. |
|
|
| **Oracle verification:** every candidate result is verified by random sampling |
| (512 points). The pipeline never returns a wrong answer β it falls back to the |
| original if verification fails. |
|
|
| ## Benchmark Results |
|
|
| | Method | Correct% | Reduced% | Size reduction | ms/expr | |
| |---|---|---|---|---| |
| | **Pipeline v5 (ours)** | **100%** | **100%** | **77%** | **6.5ms** | |
| | MBA-Blast (USENIX '21) | 100% | 100% | 77% | 15.8ms | |
| | SSPAM (Quarkslab) | 100% | 0% | 0% | 298.6ms | |
| | LLaMA-3.3-70B | 93% | 12% | 16% | 34.4ms | |
|
|
| *Dataset: MBA-Blast paper dataset, 62 expressions, 8-bit 2-var.* |
|
|
| ## Files |
|
|
| | File | Description | |
| |---|---| |
| | `inference.py` | Standalone inference β the only file you need | |
| | `bfs_table_6.json` | Pre-computed BFS table, 18K entries (1.7MB) | |
| | `mba_classifier_v2.pt` | Neural model, 2.2M params (8.6MB, optional) | |
|
|
| ## Limitations |
|
|
| - **BFS table:** 8-bit only, variables `x` and `y` only. Expressions with 3+ |
| variables or 32-bit semantics fall through to symbolic rewriting. |
| - **Variables:** up to 2 variables for BFS; symbolic handles more. |
| - **Bit width:** 8 and 32 only. |
| - **Cold start:** BFS table loads in ~180ms on first call; subsequent calls |
| are fast (~2ms). |
| - **Neural model trained on 8-bit** β may not generalize perfectly to 32-bit. |
|
|
| ## Requirements |
|
|
| ``` |
| torch>=1.13.0 # required for neural layer (optional otherwise) |
| numpy>=1.21.0 # required for oracle exhaustive check (8-bit) |
| ``` |
|
|
| Python 3.10+ required (structural pattern matching). |
|
|