---
license: apache-2.0
tags:
- tokenizer
- parity
- ferrotorch
- real-artifact
---
# `ferrotorch/tokenizer-parity-v1`
HuggingFace tokenizer parity fixtures pinned for the ferrotorch
real-artifact harness (Phase G.2, #1168).
## Provenance
* Upstream tokenizers: see the `upstream_repo` column below.
* Generator script:
[`scripts/pin_pretrained_tokenizer_fixtures.py`](https://github.com/dollspace-gay/ferrotorch/blob/main/scripts/pin_pretrained_tokenizer_fixtures.py).
* SHA-256 of `bundle.tar` (pinned in
`ferrotorch-hub/src/registry.rs`): `8d949235bb5cfaaea8916dcce001d17fd4b4383c2d5e033272397cf9545d1ef6`.
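
If you mirror the bundle outside the registry, the pin can be re-checked locally before unpacking. A minimal sketch, assuming the `sha2` and `hex` crates; the function and constant names are illustrative, not part of the harness:

```rust
use sha2::{Digest, Sha256};

/// SHA-256 pinned in `ferrotorch-hub/src/registry.rs` (copied from above).
const PINNED_SHA256: &str =
    "8d949235bb5cfaaea8916dcce001d17fd4b4383c2d5e033272397cf9545d1ef6";

fn bundle_matches_pin(path: &str) -> std::io::Result<bool> {
    // Hash the downloaded bundle.tar and compare against the pinned digest.
    let bytes = std::fs::read(path)?;
    Ok(hex::encode(Sha256::digest(&bytes)) == PINNED_SHA256)
}
```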
## Families
| family   | `upstream_repo`                        | kind                            | vocab  | chat template |
|----------|----------------------------------------|---------------------------------|--------|---------------|
| `llama3` | `meta-llama/Meta-Llama-3-8B-Instruct`  | BPE (Llama 3, tiktoken-derived) | 128256 | yes           |
| `clip`   | `openai/clip-vit-large-patch14`        | BPE (CLIP, lowercased)          | 49408  | no            |
| `bert`   | `bert-base-uncased`                    | WordPiece (BERT, lowercased)    | 30522  | no            |
| `gpt2`   | `gpt2`                                 | BPE (GPT-2)                     | 50257  | no            |
| `smollm` | `HuggingFaceTB/SmolLM-135M-Instruct`   | BPE (SmolLM, GPT-2 family)      | 49152  | yes           |
## Layout
Each `<family>/` subfolder ships:
* `tokenizer.json` – upstream tokenizer config (fast tokenizers format).
* `tokenizer_config.json` – full config with chat template (when upstream ships one).
* `special_tokens_map.json` – special-token mapping (when upstream ships one).
* Additional family-specific files (`vocab.json`, `merges.txt`, `vocab.txt`) when
  upstream ships them – these let Rust tooling that bypasses `tokenizer.json`
  still round-trip.
* `strings.json` – the 20-element fixed test corpus (the same list for every family).
* `token_ids.json` – `{ encode_with_special[20], encode_no_special[20] }` –
  the Python reference encodings (see the loading sketch after this list).
* `decoded.json` – `{ decode_with_special_keep[20], decode_with_special_skip[20],
  decode_no_special[20] }` – the Python reference decodes.
* `chat_template.json` – for families with a chat template: the rendered
  system+user+assistant conversation with and without `add_generation_prompt`.
* `meta.json` – versions and provenance.
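
The reference files are plain JSON, so any tooling can consume them directly. A minimal loading sketch in Rust, assuming `serde`/`serde_json` and that `strings.json` is a flat JSON array; the struct and function names here are illustrative, not part of the harness:

```rust
use serde::Deserialize;
use std::{error::Error, fs};

/// Assumed shape of `token_ids.json`: one id sequence per corpus string.
#[derive(Deserialize)]
struct TokenIds {
    encode_with_special: Vec<Vec<u32>>,
    encode_no_special: Vec<Vec<u32>>,
}

/// Load the shared corpus and the Python reference encodings for one family.
fn load_family(dir: &str) -> Result<(Vec<String>, TokenIds), Box<dyn Error>> {
    let corpus: Vec<String> =
        serde_json::from_str(&fs::read_to_string(format!("{dir}/strings.json"))?)?;
    let ids: TokenIds =
        serde_json::from_str(&fs::read_to_string(format!("{dir}/token_ids.json"))?)?;
    assert_eq!(corpus.len(), ids.encode_with_special.len()); // both hold 20 entries
    Ok((corpus, ids))
}
```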
## How the Rust side consumes this
The Rust dump example
[`ferrotorch-tokenize/examples/tokenizer_parity_dump.rs`](https://github.com/dollspace-gay/ferrotorch/blob/main/ferrotorch-tokenize/examples/tokenizer_parity_dump.rs)
loads `tokenizer.json` (and optionally `tokenizer_config.json`)
from the local family folder, then re-runs encode/decode/chat-template
against the corpus and writes the Rust-side outputs
next to the references. The Python harness
[`scripts/verify_tokenizer_inference.py`](https://github.com/dollspace-gay/ferrotorch/blob/main/scripts/verify_tokenizer_inference.py)
compares every output with **exact integer / string equality**:
there is no tolerance, so a divergence on any string surfaces a real bug.
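
For orientation, the core of that round-trip fits in a few lines. A hedged sketch using recent versions of the `tokenizers` crate; the authoritative logic lives in the dump example and verification script linked above, and `check_family` here is only illustrative:

```rust
use tokenizers::Tokenizer;

/// Re-encode the corpus with the pinned `tokenizer.json` and compare against the
/// Python reference ids with exact integer equality; any mismatch is a real bug.
fn check_family(dir: &str, corpus: &[String], reference: &[Vec<u32>]) -> Result<(), String> {
    let tok = Tokenizer::from_file(format!("{dir}/tokenizer.json"))
        .map_err(|e| e.to_string())?;
    for (text, expected) in corpus.iter().zip(reference) {
        // add_special_tokens = true mirrors the `encode_with_special` reference column.
        let enc = tok.encode(text.as_str(), true).map_err(|e| e.to_string())?;
        if enc.get_ids() != expected.as_slice() {
            return Err(format!("encode mismatch on {text:?}"));
        }
        // Decodes are checked the same way, toggling skip_special_tokens per reference file.
        let _decoded = tok.decode(enc.get_ids(), false).map_err(|e| e.to_string())?;
    }
    Ok(())
}
```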
## Upstream licenses
Each upstream tokenizer carries its own license. See:
* `meta-llama/Meta-Llama-3-8B-Instruct`: Meta Llama 3 Community License
* `openai/clip-vit-large-patch14`: MIT
* `bert-base-uncased`: Apache 2.0
* `gpt2`: MIT
* `HuggingFaceTB/SmolLM-135M-Instruct`: Apache 2.0
Only the tokenizer config and vocabulary metadata are mirrored;
none of these files contains model weights.