--- language: - en tags: - code - symbolic-math - deobfuscation - mixed-boolean-arithmetic license: mit pipeline_tag: other --- # MBA Simplifier — Pipeline v5 Simplifies **Mixed Boolean-Arithmetic (MBA)** expressions used in code obfuscation. Combines BFS lookup over a pre-computed expression table, symbolic rewriting (40+ rules), and optional neural-guided search. Every output is oracle-verified correct. ## Quick Start ```bash git clone https://huggingface.co//mba-simplifier cd mba-simplifier pip install -r requirements.txt python inference.py "(x & y) + (x | y)" ``` Expected output: ``` BFS: 18,189 entries (182ms) Input: (x & y) + (x | y) Simplified: x + y Reduction: 57% (bfs) Latency: 186 ms (cold), ~2 ms (warm) ``` ## Python API ```python from inference import simplify result = simplify("(x & y) + (x | y)") print(result["simplified"]) # "x + y" print(result["reduction"]) # 0.571 print(result["strategy"]) # "bfs" print(result["latency_ms"]) # ~2 ms warm # 32-bit mode (symbolic only — BFS table is 8-bit) result = simplify("(x & y) + (x | y)", bits=32) # → "x + y" (strategy: symbolic) # With neural guidance result = simplify("complex_expr", model_path="mba_classifier_v2.pt") ``` ### Return fields | Field | Type | Description | |---|---|---| | `simplified` | str | Simplified infix expression | | `original` | str | Input expression | | `reduction` | float | Tree-size reduction (0.0–1.0) | | `strategy` | str | `bfs` / `symbolic` / `neural` / `none` | | `latency_ms` | float | Wall-clock time in milliseconds | ## Input / Output Examples | Input | Output | Strategy | |---|---|---| | `(x & y) + (x \| y)` | `x + y` | bfs | | `(x \| y) - (x & y)` | `x ^ y` | bfs | | `(x + y) - 2 * (x & y)` | `x ^ y` | bfs | | `~(~x)` | `x` | bfs | | `-(~x) - 1` | `x` | symbolic | | `~(~x & ~y)` | `x \| y` | symbolic | | `x & (x \| y)` | `x` | symbolic | | `x ^ x` | `0` | bfs | | `x * 1 + y * 0` | `x` | bfs | | `(x & ~y) \| (x & y)` | `x` | bfs | ## Architecture **Three-layer pipeline** (first hit wins): 1. **BFS lookup** — pre-computed table of 18,189 minimal 8-bit expressions (loaded from `bfs_table_6.json`, ~180ms first call, ~2ms warm). Coverage: all 2-variable expressions up to AST size 6. 2. **Symbolic rewriting** — 40+ algebraic rules covering MBA identities, De Morgan, absorption, two's complement, XOR-NOT, cancellation, constant folding. Runs to fixpoint (up to 30 passes). Works for 8-bit and 32-bit. 3. **Neural-guided beam search** (optional, requires `mba_classifier_v2.pt`) — Transformer classifier predicts rewrite rules to try. Symbolic executor applies them; oracle verifies each candidate. Requires `torch`. **Oracle verification:** every candidate result is verified by random sampling (512 points). The pipeline never returns a wrong answer — it falls back to the original if verification fails. ## Benchmark Results | Method | Correct% | Reduced% | Size reduction | ms/expr | |---|---|---|---|---| | **Pipeline v5 (ours)** | **100%** | **100%** | **77%** | **6.5ms** | | MBA-Blast (USENIX '21) | 100% | 100% | 77% | 15.8ms | | SSPAM (Quarkslab) | 100% | 0% | 0% | 298.6ms | | LLaMA-3.3-70B | 93% | 12% | 16% | 34.4ms | *Dataset: MBA-Blast paper dataset, 62 expressions, 8-bit 2-var.* ## Files | File | Description | |---|---| | `inference.py` | Standalone inference — the only file you need | | `bfs_table_6.json` | Pre-computed BFS table, 18K entries (1.7MB) | | `mba_classifier_v2.pt` | Neural model, 2.2M params (8.6MB, optional) | ## Limitations - **BFS table:** 8-bit only, variables `x` and `y` only. Expressions with 3+ variables or 32-bit semantics fall through to symbolic rewriting. - **Variables:** up to 2 variables for BFS; symbolic handles more. - **Bit width:** 8 and 32 only. - **Cold start:** BFS table loads in ~180ms on first call; subsequent calls are fast (~2ms). - **Neural model trained on 8-bit** — may not generalize perfectly to 32-bit. ## Requirements ``` torch>=1.13.0 # required for neural layer (optional otherwise) numpy>=1.21.0 # required for oracle exhaustive check (8-bit) ``` Python 3.10+ required (structural pattern matching).