
    ---
    license: apache-2.0
    tags:
      - tokenizer
      - parity
      - ferrotorch
      - real-artifact
    ---

    # `ferrotorch/tokenizer-parity-v1`

    HuggingFace tokenizer parity fixtures pinned for the ferrotorch
    real-artifact harness (Phase G.2, #1168).

    ## Provenance

    * Upstream tokenizers: see the `upstream_repo` column below.
    * Generator script:
      [`scripts/pin_pretrained_tokenizer_fixtures.py`](https://github.com/dollspace-gay/ferrotorch/blob/main/scripts/pin_pretrained_tokenizer_fixtures.py).
    * SHA-256 of `bundle.tar` (pinned in
      `ferrotorch-hub/src/registry.rs`): `8d949235bb5cfaaea8916dcce001d17fd4b4383c2d5e033272397cf9545d1ef6`.
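To re-check the pinned digest against a downloaded copy of `bundle.tar`, a minimal sketch (the helper names here are illustrative, not part of the harness):

```python
import hashlib

# Expected digest, as pinned in ferrotorch-hub/src/registry.rs.
PINNED_SHA256 = "8d949235bb5cfaaea8916dcce001d17fd4b4383c2d5e033272397cf9545d1ef6"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large archives stay cheap on memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_bundle(path: str) -> bool:
    """True only if the local bundle.tar matches the pinned digest exactly."""
    return sha256_of(path) == PINNED_SHA256
```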

    ## Families

    | family   | upstream                              | kind                            | vocab  | chat tpl |
    |----------|---------------------------------------|---------------------------------|--------|----------|
    | `llama3` | `meta-llama/Meta-Llama-3-8B-Instruct` | BPE (Llama 3, tiktoken-derived) | 128256 | yes      |
    | `clip`   | `openai/clip-vit-large-patch14`       | BPE (CLIP, lowercased)          | 49408  | no       |
    | `bert`   | `bert-base-uncased`                   | WordPiece (BERT, lowercased)    | 30522  | no       |
    | `gpt2`   | `gpt2`                                | BPE (GPT-2)                     | 50257  | no       |
    | `smollm` | `HuggingFaceTB/SmolLM-135M-Instruct`  | BPE (SmolLM, GPT-2 family)      | 49152  | yes      |
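For harness code that needs the table above programmatically, it can be mirrored as a small lookup; the `FAMILIES` constant is illustrative, not something the artifact ships:

```python
# family -> (upstream repo, expected vocab size, has chat template),
# mirroring the Families table above.
FAMILIES = {
    "llama3": ("meta-llama/Meta-Llama-3-8B-Instruct", 128256, True),
    "clip":   ("openai/clip-vit-large-patch14", 49408, False),
    "bert":   ("bert-base-uncased", 30522, False),
    "gpt2":   ("gpt2", 50257, False),
    "smollm": ("HuggingFaceTB/SmolLM-135M-Instruct", 49152, True),
}

def expected_vocab(family: str) -> int:
    """Expected vocab size for a family; KeyError for unknown families."""
    _repo, vocab, _chat = FAMILIES[family]
    return vocab
```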

    ## Layout

    Each `<family>/` subfolder ships:

    * `tokenizer.json`            — upstream tokenizer config (fast
                                    tokenizers format).
    * `tokenizer_config.json`     — full config with chat template
                                    (when upstream ships one).
    * `special_tokens_map.json`   — special-token mapping (when
                                    upstream ships one).
    * Additional family-specific files (`vocab.json`, `merges.txt`,
      `vocab.txt`) when upstream ships them — these let rust
      tooling that bypasses `tokenizer.json` still round-trip.
    * `strings.json`              — the 20-element fixed test
                                    corpus (same list for every
                                    family).
    * `token_ids.json`            — `{ encode_with_special[20],
                                        encode_no_special[20] }`
                                    — Python reference encodings.
    * `decoded.json`              — `{ decode_with_special_keep[20],
                                        decode_with_special_skip[20],
                                        decode_no_special[20] }`
                                    — Python reference decodes.
    * `chat_template.json`        — for families with a chat
                                    template: rendered system+user
                                    +assistant conversation with
                                    and without
                                    `add_generation_prompt`.
    * `meta.json`                 — versions and provenance.
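Before running the full parity pass, a consumer can sanity-check a family folder against this layout; a minimal sketch in the spirit of the Python harness (the function name and error messages are illustrative):

```python
import json
from pathlib import Path

CORPUS_LEN = 20  # strings.json is a fixed 20-element corpus for every family

def check_family_dir(family_dir: str) -> None:
    """Raise if the per-family reference files are missing or mis-shaped."""
    root = Path(family_dir)

    strings = json.loads((root / "strings.json").read_text())
    assert len(strings) == CORPUS_LEN, "corpus must have exactly 20 entries"

    token_ids = json.loads((root / "token_ids.json").read_text())
    for key in ("encode_with_special", "encode_no_special"):
        assert len(token_ids[key]) == CORPUS_LEN, f"{key} must align with the corpus"

    decoded = json.loads((root / "decoded.json").read_text())
    for key in ("decode_with_special_keep", "decode_with_special_skip",
                "decode_no_special"):
        assert len(decoded[key]) == CORPUS_LEN, f"{key} must align with the corpus"
```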

    ## How the rust side consumes this

    The rust dump example
    [`ferrotorch-tokenize/examples/tokenizer_parity_dump.rs`](https://github.com/dollspace-gay/ferrotorch/blob/main/ferrotorch-tokenize/examples/tokenizer_parity_dump.rs)
    loads `tokenizer.json` (and optionally `tokenizer_config.json`)
    from the local family folder, then re-runs encode/decode/chat
    template against the corpus and writes the rust-side outputs
    next to the references. The python harness
    [`scripts/verify_tokenizer_inference.py`](https://github.com/dollspace-gay/ferrotorch/blob/main/scripts/verify_tokenizer_inference.py)
    compares every output with **exact integer / string equality**;
    there is no tolerance, and divergence on any output surfaces a
    real bug.
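The exact-equality rule is simple to restate in code; a sketch of the kind of comparison the harness performs (`diff_outputs` and its report format are illustrative, not the script's actual API):

```python
def diff_outputs(reference: dict, candidate: dict) -> list:
    """Return a mismatch report; an empty list means exact parity.

    Comparison is plain == on token-id lists and decoded strings --
    no tolerance, no normalization -- so any divergence is surfaced
    verbatim.
    """
    mismatches = []
    for key, ref_list in reference.items():
        cand_list = candidate.get(key)
        if cand_list is None:
            mismatches.append(f"{key}: missing on the candidate side")
            continue
        for i, (ref, cand) in enumerate(zip(ref_list, cand_list)):
            if ref != cand:
                mismatches.append(f"{key}[{i}]: {ref!r} != {cand!r}")
    return mismatches
```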

    ## Upstream licenses

    Each upstream tokenizer carries its own license. See:

    * `meta-llama/Meta-Llama-3-8B-Instruct` : Meta Llama 3 Community License
    * `openai/clip-vit-large-patch14`       : MIT
    * `bert-base-uncased`                   : Apache 2.0
    * `gpt2`                                : MIT
    * `HuggingFaceTB/SmolLM-135M-Instruct`  : Apache 2.0

    Only the tokenizer config and vocabulary metadata are mirrored;
    none of these files contains model weights.