Update README.md
This tokenizer is a byte-level BPE tokenizer (GPT-2 style) retrained on the Code…
## Tokenizer details / configuration
- **Tokenizer type:** Byte-level BPE (GPT-2–style / `tokenizers` fast API).
- **Vocabulary size:** 50,257 (GPT-2 default).
- **Special tokens:** standard GPT-2 tokens (e.g., `<|endoftext|>`) or custom tokens if you added any. Ensure `tokenizer_config.json` in the repo lists them.
- **Normalization:** Byte-level normalization (works with arbitrary byte sequences / UTF-8).
- **Files included:** `tokenizer.json` (preferred `tokenizers` fast format) or `vocab.json` + `merges.txt` (legacy), and `tokenizer_config.json`.
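The "works with arbitrary byte sequences / UTF-8" point can be made concrete: GPT-2-style byte-level BPE first maps every byte 0–255 to a printable unicode character, so any input (including emoji or binary-ish text) round-trips without an `<unk>` token. The sketch below reimplements that mapping in plain Python for illustration; it mirrors the mapping in the original GPT-2 encoder and is not code from this repo.

```python
# Sketch of the byte<->unicode mapping behind GPT-2-style byte-level BPE:
# every byte 0-255 gets a printable unicode character, so arbitrary UTF-8
# input can be tokenized with no <unk> token needed.
def bytes_to_unicode():
    # Bytes that are already printable keep their own codepoint...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ...and the remaining (unprintable) bytes are shifted up
            # past 255 so they render as visible characters.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}

# Round-trip an arbitrary UTF-8 string through the mapping.
raw = "def add(a, b): return a + b  # héllo".encode("utf-8")
mapped = "".join(byte_to_char[b] for b in raw)
restored = bytes(char_to_byte[c] for c in mapped).decode("utf-8")
assert restored == "def add(a, b): return a + b  # héllo"
```

BPE merges are then learned over these mapped characters, which is why `vocab.json` / `merges.txt` entries for non-ASCII text look scrambled but decode cleanly.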