Update README.md
This tokenizer is a byte-level BPE tokenizer (GPT-2 style) retrained on the Code…
## Tokenizer details / configuration
- **Tokenizer type:** Byte-level BPE (GPT-2–style / `tokenizers` fast API).
- **Vocabulary size:** 50,257 (GPT-2 default).
- **Special tokens:** standard GPT-2 tokens (e.g., `<|endoftext|>`) or custom tokens if you added any. Ensure `tokenizer_config.json` in the repo lists them.
- **Normalization:** Byte-level normalization (works with arbitrary byte sequences / UTF-8).
- **Files included:** `tokenizer.json` (preferred `tokenizers` fast format) or `vocab.json` + `merges.txt` (legacy), and `tokenizer_config.json`.
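The "works with arbitrary byte sequences / UTF-8" point can be made concrete: GPT-2-style byte-level BPE first maps every byte 0–255 to a printable unicode character, so any input (including emoji or binary-ish text) round-trips without an `<unk>` token. The sketch below reimplements that mapping in plain Python for illustration; it mirrors the mapping in the original GPT-2 encoder and is not code from this repo.

```python
# Sketch of the byte<->unicode mapping behind GPT-2-style byte-level BPE:
# every byte 0-255 gets a printable unicode character, so arbitrary UTF-8
# input can be tokenized with no <unk> token needed.
def bytes_to_unicode():
    # Bytes that are already printable keep their own codepoint...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # ...and the remaining (unprintable) bytes are shifted up
            # past 255 so they render as visible characters.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}

# Round-trip an arbitrary UTF-8 string through the mapping.
raw = "def add(a, b): return a + b  # héllo".encode("utf-8")
mapped = "".join(byte_to_char[b] for b in raw)
restored = bytes(char_to_byte[c] for c in mapped).decode("utf-8")
assert restored == "def add(a, b): return a + b  # héllo"
```

BPE merges are then learned over these mapped characters, which is why `vocab.json` / `merges.txt` entries for non-ASCII text look scrambled but decode cleanly.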