Integer Overflow in Proxy Buffer Allocation Leading to Heap Buffer Overflow
Target
- Project: h5py/h5py
- URL: https://github.com/h5py/h5py
- Component: create_buffer() function in h5py/h5py/_proxy.templ.pyx
- CWE: CWE-190 (Integer Overflow or Wraparound), CWE-122 (Heap-based Buffer Overflow)
- Severity: HIGH
Summary
The create_buffer() function in h5py/h5py/_proxy.templ.pyx computes the buffer size as size * npoints (where both are size_t) without checking for integer overflow. When a crafted HDF5 file specifies a dataset with a very large type size or number of points, the multiplication can wrap around, causing malloc() to allocate a much smaller buffer than expected. Subsequent H5Dread() or H5Tconvert() operations then write past the end of this undersized buffer, resulting in a heap buffer overflow.
Vulnerable Code
File: h5py/h5py/_proxy.templ.pyx (lines 294-308)
cdef void* create_buffer(size_t ipt_size, size_t opt_size, size_t nl) except NULL:
    cdef size_t final_size
    cdef void* buf

    if ipt_size >= opt_size:
        final_size = ipt_size*nl    # <-- INTEGER OVERFLOW: no bounds check
    else:
        final_size = opt_size*nl    # <-- INTEGER OVERFLOW: no bounds check

    buf = malloc(final_size)
    if buf == NULL:
        raise MemoryError("Failed to allocate conversion buffer")

    return buf
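As a quick illustration, here is a standalone Python model (hypothetical, not h5py code) of this size computation; it shows how the allocated size and the bytes a subsequent read would write can diverge:

```python
def buffer_size_model(ipt_size, opt_size, nl, size_max=2**64 - 1):
    """Model of the unchecked size computation in create_buffer().

    Python integers are unbounded, so the C wraparound of an N-bit
    size_t is simulated by masking the product with size_max.
    """
    elem_size = ipt_size if ipt_size >= opt_size else opt_size
    return (elem_size * nl) & size_max

# 32-bit size_t: a 65536-byte element type with 65537 selected points
allocated = buffer_size_model(65536, 8, 65537, size_max=2**32 - 1)
needed = 65536 * 65537  # bytes the subsequent read would actually write
print(allocated, needed)  # 65536 4295032832
```

The allocator sees only the wrapped value, while the HDF5 read path sizes its writes from the un-wrapped element count and type size.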
Callers (all in the same file):
Line 53 (attribute read/write proxy):
conv_buf = create_buffer(asize, msize, npoints)
Line 60:
back_buf = create_buffer(msize, asize, npoints)
Line 137 (dataset read/write proxy):
conv_buf = create_buffer(H5Tget_size(dstype), H5Tget_size(mtype), npoints)
Line 146:
back_buf = create_buffer(H5Tget_size(dstype), H5Tget_size(mtype), npoints)
Line 210 (vlen string read/write):
conv_buf = create_buffer(H5Tget_size(dstype), H5Tget_size(h5_vlen_string), npoints)
The overflow scenario:
The values ipt_size / opt_size come from H5Tget_size() which returns the size of the HDF5 datatype as stored in the file. The value nl (npoints) comes from H5Sget_select_npoints() which returns the number of selected elements in the dataspace. Both are controlled by the contents of the HDF5 file.
On a 64-bit system, if ipt_size = 0x100000001 (4 GiB + 1, achievable with a compound type containing many members) and nl = 0x100000000 (2^32 points), then:
final_size = 0x100000001 * 0x100000000 = 0x10000000100000000
This exceeds the range of a 64-bit size_t and wraps to 0x100000000 (4 GiB), so malloc() returns a buffer far smaller than the data requires. The subsequent H5Dread() writes the full expected amount of data, overflowing the heap.
Even on 64-bit, more practical cases exist with compound types. For example, a compound type of 65536 bytes with 281474976710656 (2^48) points overflows exactly: 65536 * 2^48 = 2^64, wrapping to 0 (where malloc(0) is implementation-defined and may or may not return NULL), but values slightly above that wrap to small positive numbers.
On 32-bit systems this is trivially exploitable:
- Type size: 65536 (compound type)
- npoints: 65537
- Product: 65536 * 65537 = 4,295,032,832, which wraps to 65536 in a 32-bit size_t
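Each of these wraparound values is easy to verify: because Python integers are unbounded, reducing the product modulo 2^64 or 2^32 reproduces what a C size_t would hold.

```python
M64 = 2**64  # wraparound modulus for a 64-bit size_t
M32 = 2**32  # wraparound modulus for a 32-bit size_t

# 64-bit case: (2^32 + 1) * 2^32 = 2^64 + 2^32 wraps to 2^32 (4 GiB)
print(hex((0x100000001 * 0x100000000) % M64))  # 0x100000000

# 65536-byte compound type with 2^48 points wraps exactly to 0
print((65536 * 2**48) % M64)  # 0

# 32-bit case: 65536 * 65537 = 2^32 + 2^16 wraps to 65536
print((65536 * 65537) % M32)  # 65536
```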
Exploitation
- Craft an HDF5 file with:
  - A compound datatype with many fields to inflate H5Tget_size()
  - A dataspace with dimensions chosen so npoints * type_size overflows size_t
- Open the file with h5py and read the dataset
- The proxy code path is triggered when the dataset type requires conversion (compound types, vlen types, references)
- create_buffer() allocates a small buffer due to the overflow
- H5Dread() writes past the end of the buffer, corrupting the heap
Proof of Concept (conceptual):
import h5py
import numpy as np

# On a 32-bit system or with carefully chosen values on 64-bit:
# Create a file with a large compound type + many elements
# such that type_size * npoints overflows size_t.
# The proxy buffer is used when compound type conversion is needed,
# so the file's compound type must differ from the memory type
# (e.g., different field ordering or extra fields).

with h5py.File('overflow.h5', 'w') as f:
    # Create compound dtype with large total size
    fields = [(f'field_{i}', 'f8') for i in range(8192)]  # 65536 bytes
    dt = np.dtype(fields)
    # On 32-bit: 65536 * 65537 wraps to 65536
    # Create dataset with 65537 elements
    f.create_dataset('data', shape=(65537,), dtype=dt)

# Reading back with a different compound dtype triggers proxy buffering
with h5py.File('overflow.h5', 'r') as f:
    # Read triggers create_buffer with overflowing size
    data = f['data'][:]  # Heap overflow occurs here
Impact
- Heap buffer overflow: Can corrupt heap metadata, potentially leading to arbitrary code execution
- Denial of Service: Crash via heap corruption or segfault
- Triggered by simply reading a crafted HDF5 file with h5py
- The compound type conversion proxy path is commonly used when reading HDF5 files created by different software versions or with different field orderings
Remediation
Add overflow checking to create_buffer():
cdef void* create_buffer(size_t ipt_size, size_t opt_size, size_t nl) except NULL:
    cdef size_t final_size
    cdef size_t elem_size
    cdef void* buf

    elem_size = ipt_size if ipt_size >= opt_size else opt_size

    # Check for overflow before multiplication
    if nl != 0 and elem_size > (<size_t>-1) / nl:
        raise OverflowError("Buffer size calculation would overflow")

    final_size = elem_size * nl
    buf = malloc(final_size)
    if buf == NULL:
        raise MemoryError("Failed to allocate conversion buffer")

    return buf
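The guard translates directly to plain Python. The sketch below (assuming a 64-bit SIZE_MAX; not part of the proposed patch) uses the same division-based pre-check and can be exercised against the overflow case from above:

```python
SIZE_MAX = 2**64 - 1  # assumed 64-bit size_t

def checked_buffer_size(ipt_size, opt_size, nl):
    """Python model of the fixed create_buffer() size computation."""
    elem_size = ipt_size if ipt_size >= opt_size else opt_size
    # Division-based check: elem_size * nl > SIZE_MAX
    # iff elem_size > SIZE_MAX // nl (for nl != 0)
    if nl != 0 and elem_size > SIZE_MAX // nl:
        raise OverflowError("Buffer size calculation would overflow")
    return elem_size * nl

print(checked_buffer_size(8, 8, 1000))  # 8000 -- benign case passes
try:
    checked_buffer_size(0x100000001, 8, 0x100000000)  # would wrap
except OverflowError as err:
    print("rejected:", err)
```

Checking with division before multiplying avoids ever computing the wrapped product, which is why the same pattern is safe in C/Cython where the multiplication itself would silently truncate.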
References
- HDF5 compound types: https://docs.hdfgroup.org/hdf5/develop/group___h5_t.html
- Similar CVEs in HDF5 processing: CVE-2021-46243, CVE-2021-46244