Transformers model package prediction flip via tokenizer.json vocab binding while safetensors weights remain unchanged
Summary
A crafted HuggingFace Transformers model package can silently invert model inference outcomes
by mutating the tokenizer.json model.vocab token-to-ID binding table, while leaving
model.safetensors and config.json completely unchanged.
This is not a .safetensors parser bug. It is a SafeTensors-based Transformers model
package vulnerability where tokenizer.json controls which embedding rows in model.safetensors
are addressed for each input token. No modification to model weights is required to flip predictions.
Affected Product
- Package: HuggingFace transformers
- Format: SafeTensors-based Transformers model package (model.safetensors + tokenizer.json + config.json)
- Relevant file: tokenizer.json → model.vocab field
- Runtime path: AutoTokenizer.from_pretrained() → AutoModelForSequenceClassification.from_pretrained() → model.forward()
Vulnerability Details
tokenizer.json contains the model.vocab dictionary, which maps each token string to an
integer ID. These IDs are consumed at inference time by the tokenizers Rust backend
(token_to_id()) and used to construct input_ids. The input_ids are then passed to
model.forward(), where they index into the word-embedding matrix stored in model.safetensors.
There is no cross-validation between the tokenizer.json vocab binding and the
model.safetensors weight matrix at load time. A model package distributor can swap any two
token ID assignments in tokenizer.json, causing those tokens to address different embedding
rows without altering the weight file.
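The swap described above can be sketched with the standard library alone. This is a minimal illustration, not the PoC script from this repository: the helper name `swap_vocab_ids` and the toy document are hypothetical, and it assumes `model.vocab` is a token-to-ID dictionary, as described in this section.

```python
import json

def swap_vocab_ids(tokenizer_doc: dict, token_a: str, token_b: str) -> dict:
    """Return a copy of a parsed tokenizer.json with two token IDs exchanged."""
    # Round-trip through JSON to get an independent deep copy.
    mutated = json.loads(json.dumps(tokenizer_doc))
    vocab = mutated["model"]["vocab"]
    vocab[token_a], vocab[token_b] = vocab[token_b], vocab[token_a]
    return mutated

# Toy stand-in for a parsed tokenizer.json (only the fields needed here).
clean_doc = {
    "model": {
        "type": "WordPiece",
        "vocab": {"[CLS]": 101, "[SEP]": 102, "good": 2204, "bad": 2919},
    }
}

mutant_doc = swap_vocab_ids(clean_doc, "good", "bad")
print(clean_doc["model"]["vocab"]["good"], mutant_doc["model"]["vocab"]["good"])
```

The vocabulary size and the set of IDs are the same before and after the swap, so a check that only compares the vocabulary size against config.json would not notice the change.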
Mechanism (step by step)
- In the clean package: "good" → ID 2204, "bad" → ID 2919
- In the mutant package: "good" → ID 2919, "bad" → ID 2204
- model.safetensors hash: identical in both packages
- config.json hash: identical in both packages
- Raw text "the movie is good" is tokenized differently:
  - clean: input_ids = [101, 1996, 3185, 2003, **2204**, 102] → prediction: POSITIVE
  - mutant: input_ids = [101, 1996, 3185, 2003, **2919**, 102] → prediction: NEGATIVE
- The model.forward() invocation is identical in both cases; only the index into the embedding matrix differs.
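The steps above can be mimicked with a toy lookup table in place of a real embedding matrix. Everything numeric here except the token IDs is an assumption made for illustration: each "embedding row" is a single invented scalar, and the "classifier" just sums the rows. It is not the crafted BERT model used in the PoC.

```python
# Fixed "weights": one invented scalar per embedding row. Never modified below.
EMBEDDING_ROWS = {101: 0.0, 102: 0.0, 1996: 0.0, 3185: 0.0, 2003: 0.0,
                  2204: 2.0, 2919: -2.0}

CLEAN_VOCAB = {"the": 1996, "movie": 3185, "is": 2003, "good": 2204, "bad": 2919}
# Same tokens, same set of IDs, but "good" and "bad" exchanged.
MUTANT_VOCAB = {**CLEAN_VOCAB, "good": 2919, "bad": 2204}

def encode(text: str, vocab: dict) -> list:
    """Whitespace tokenization plus [CLS]=101 and [SEP]=102, for illustration only."""
    return [101] + [vocab[word] for word in text.split()] + [102]

def predict(input_ids: list) -> str:
    """Identical 'forward pass' for both packages: sum the addressed rows."""
    logit = sum(EMBEDDING_ROWS[i] for i in input_ids)
    return "POSITIVE" if logit > 0 else "NEGATIVE"

text = "the movie is good"
clean_ids = encode(text, CLEAN_VOCAB)
mutant_ids = encode(text, MUTANT_VOCAB)
print(clean_ids, predict(clean_ids))
print(mutant_ids, predict(mutant_ids))
```

The table of weights and the `predict` function are shared by both runs; only the vocabulary passed to `encode` differs, which is enough to change the result.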
Proof of Concept
This repository contains a minimal PoC using a tiny deterministic
BertForSequenceClassification model (hidden_size=16, 1 layer, vocab_size=30522) with
crafted weights to isolate the mechanism. The affected runtime path is the standard
Transformers package path: AutoTokenizer.from_pretrained(),
AutoModelForSequenceClassification.from_pretrained(), and model.forward().
Reproduce
pip install -r requirements.txt
python reproduce_transformers_tokenizer_vocab_binding_flip.py
Expected terminal output:
TRANSFORMERS_TOKENIZER_VOCAB_BINDING_FLIP_CONFIRMED
Inspect (hash matrix + vocab analysis)
python inspect_transformers_tokenizer_vocab_binding_hash_matrix.py
Expected terminal output:
TRANSFORMERS_TOKENIZER_VOCAB_BINDING_HASH_MATRIX_PASS
Runtime Evidence
| Text | clean input_ids | mutant input_ids | clean pred | mutant pred | flipped |
|---|---|---|---|---|---|
| "the movie is good" | [101,1996,3185,2003,2204,102] | [101,1996,3185,2003,2919,102] | POSITIVE | NEGATIVE | yes |
| "this film is bad" | [101,2023,2143,2003,2919,102] | [101,2023,2143,2003,2204,102] | NEGATIVE | POSITIVE | yes |
| "an absolutely good film" | [101,2019,7078,2204,2143,102] | [101,2019,7078,2919,2143,102] | POSITIVE | NEGATIVE | yes |
Flip count: 3/3. Logit delta: 3.9965.
Hash Matrix
| File | clean SHA3-256 (16 hex) | mutant SHA3-256 (16 hex) | Status |
|---|---|---|---|
| model.safetensors | ae27a5f4ca265c51 | ae27a5f4ca265c51 | IDENTICAL |
| config.json | 0340a71886d99eac | 0340a71886d99eac | IDENTICAL |
| tokenizer.json | 2bd5fd376f6e2aa0 | e68711129c55c39b | DIFFERS |
tokenizer.json is the only modified file.
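A comparison like the one above could be produced along the following lines. This is a sketch rather than the repository's inspect script: the function names are hypothetical, and it assumes the 16-hex values shown are the leading 16 characters of the SHA3-256 hex digest.

```python
import hashlib
from pathlib import Path

PACKAGE_FILES = ("model.safetensors", "config.json", "tokenizer.json")

def sha3_hex(path: Path) -> str:
    """Full SHA3-256 hex digest of a file's bytes."""
    return hashlib.sha3_256(path.read_bytes()).hexdigest()

def hash_matrix(clean_dir, mutant_dir, hex_chars: int = 16) -> list:
    """Return (file, clean prefix, mutant prefix, status) for each package file."""
    rows = []
    for name in PACKAGE_FILES:
        clean_digest = sha3_hex(Path(clean_dir) / name)
        mutant_digest = sha3_hex(Path(mutant_dir) / name)
        # Compare full digests; only the displayed value is truncated.
        status = "IDENTICAL" if clean_digest == mutant_digest else "DIFFERS"
        rows.append((name, clean_digest[:hex_chars], mutant_digest[:hex_chars], status))
    return rows
```

Calling `hash_matrix("clean_package", "mutant_package")` on two package directories (the directory names are placeholders) returns one row per file, which can then be printed in table form.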
Route Framing
tokenizer.json is model package state. It is loaded by the Transformers runtime at
inference time to convert raw text into token IDs. The binding it establishes between token
strings and embedding rows in model.safetensors determines model output. Distributing a
package with a mutated tokenizer.json is distributing a semantically modified model, even
though the weight file bytes are unchanged.
Distinctness
This finding is distinct from all prior submissions:
| Dimension | This finding | Prior ONNX/Joblib/TFLite submissions |
|---|---|---|
| Format | SafeTensors + tokenizer.json | .onnx, .joblib, .tflite |
| Mutated component | tokenizer.json model.vocab | ONNX op attribute, sklearn vocab, TFLite metadata |
| Loader | HuggingFace tokenizers Rust backend + safetensors Rust loader | onnxruntime / scikit-learn / Task Library |
| Inference API | AutoTokenizer + AutoModelForSequenceClassification + model.forward() | different runtimes |
Non-Claims
- This is not a .safetensors binary format parser vulnerability.
- This is not arbitrary code execution.
- A scanner bypass is not the primary claim.
- Model weights are unchanged. tokenizer.json is part of the model package state; the claim is not that no model state changed.
- This PoC uses a tiny deterministic BertForSequenceClassification to isolate the mechanism. The issue applies to any Transformers model package that uses tokenizer.json for vocabulary binding.
Recommendation
Transformers model loading should validate that the tokenizer.json vocabulary mapping is
consistent with the intended model configuration. At minimum, downstream consumers should
treat tokenizer.json with the same integrity requirements as model.safetensors when
verifying model package authenticity.
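One way a downstream consumer might act on this, assuming a trusted reference copy of tokenizer.json is available (for example from the original publisher), is to diff the token-to-ID bindings directly. The sketch below is illustrative only; `load_vocab` and `diff_vocab_bindings` are hypothetical helpers, not part of transformers or tokenizers, and the JSON strings are toy stand-ins for real files.

```python
import json

def load_vocab(tokenizer_json_text: str) -> dict:
    """Extract model.vocab from the text of a tokenizer.json file."""
    return json.loads(tokenizer_json_text)["model"]["vocab"]

def diff_vocab_bindings(trusted: dict, candidate: dict) -> dict:
    """Map each token whose ID differs (or is missing) to (trusted_id, candidate_id)."""
    changed = {}
    for token in trusted.keys() | candidate.keys():
        trusted_id, candidate_id = trusted.get(token), candidate.get(token)
        if trusted_id != candidate_id:
            changed[token] = (trusted_id, candidate_id)
    return changed

trusted_text = '{"model": {"vocab": {"the": 1996, "good": 2204, "bad": 2919}}}'
candidate_text = '{"model": {"vocab": {"the": 1996, "good": 2919, "bad": 2204}}}'

changes = diff_vocab_bindings(load_vocab(trusted_text), load_vocab(candidate_text))
for token in sorted(changes):
    print(token, changes[token])
```

An empty result means the two vocabularies bind every token to the same ID. A plain digest comparison of the whole file is simpler and stricter; the diff is mainly useful for reporting which bindings changed.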
References
- HuggingFace tokenizers library: https://github.com/huggingface/tokenizers
- HuggingFace transformers library: https://github.com/huggingface/transformers
- SafeTensors format: https://github.com/huggingface/safetensors