---
license: apache-2.0
language:
- en
- code
tags:
- tokenizer
- bpe
- code
- fim
- tiktoken
library_name: transformers
pipeline_tag: text-generation
---
# ezellm-lite-tokenizer
A **24,600-vocab** byte-level BPE tokenizer trained on a **142 GB code-heavy corpus**, with a Qwen2.5-Coder-style special-token layout (FIM + repo metadata + reserved slots). Designed for small-to-mid scale code language models where embedding-table size matters.
This is **v2** of the tokenizer.
## Quick start
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")
ids = tok.encode("def hello(name):\n print(f'Hello, {name}!')\n")
print(len(ids), tok.decode(ids))
```
## Vocabulary layout
Total vocab: **24,600**. The first 24,576 IDs are learned BPE merges; the top 24 IDs are reserved for control tokens.
| ID range | Tokens | Purpose |
|---------------|-----------------------------------------------------------|-------------------------------|
| 0 – 24,575 | Learned BPE pieces | Text/code |
| 24,576 | `<\|endoftext\|>` | EOS + PAD |
| 24,577 – 24,580 | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`, `<\|fim_pad\|>` | Fill-in-the-Middle training |
| 24,581 – 24,583 | `<\|file_sep\|>`, `<\|repo_name\|>`, `<\|filename\|>` | Repo-level packing |
| 24,584 – 24,599 | `<\|reserved_0\|>` … `<\|reserved_15\|>` | Reserved for downstream use |
Only `<|endoftext|>` is registered as `eos_token` / `pad_token` in `special_tokens_map`. The FIM and repo markers are added tokens but are **not** flagged "special", so `tok.decode(ids, skip_special_tokens=True)` will only strip `<|endoftext|>`. Register the others explicitly if you want them stripped on decode.
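A minimal sketch of promoting those markers to special tokens after loading (token strings are the ones from the layout table above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")
print(tok.convert_tokens_to_ids("<|endoftext|>"))  # 24576, matching the layout table

# Promote the FIM / repo markers to "special" so skip_special_tokens=True strips them.
# They already exist in the vocab, so this does not grow the embedding table.
markers = ["<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|fim_pad|>",
           "<|file_sep|>", "<|repo_name|>", "<|filename|>"]
tok.add_special_tokens({"additional_special_tokens": markers})

ids = tok.encode("<|fim_prefix|>def f(x):<|fim_suffix|>\n<|fim_middle|>")
print(tok.decode(ids, skip_special_tokens=True))  # markers are now stripped
```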
## Training data
- **Size:** ~142 GB of text and source code
- **Mix (heavily code-leaning):** Python, JavaScript/TypeScript, Java, C/C++, web (HTML/CSS), Markdown, math/scientific Python, technical prose
- **Algorithm:** Byte-level BPE (tiktoken-compatible; `tiktoken.bpe` and `tiktoken.json` are bundled alongside the standard `tokenizer.json`)
The tokenizer files in this repo can be loaded both via 🤗 `transformers` (`tokenizer.json`) and via `tiktoken` directly (`tiktoken.bpe`).
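A sketch of the tiktoken path. The exact schema of `tiktoken.json` is not pinned down here, so the key names below (`pat_str`, `special_tokens`) are assumptions to check against the file; parsing `tiktoken.bpe` assumes the standard tiktoken rank format (base64 token, space, rank per line):

```python
import base64
import json
import tiktoken

# Parse the merge table; assumes the standard tiktoken rank format.
mergeable_ranks = {}
with open("tiktoken.bpe", "rb") as f:
    for line in f.read().splitlines():
        if line:
            token_b64, rank = line.split()
            mergeable_ranks[base64.b64decode(token_b64)] = int(rank)

with open("tiktoken.json") as f:
    meta = json.load(f)  # assumed to hold the split pattern and special-token map

enc = tiktoken.Encoding(
    name="ezellm-lite",
    pat_str=meta["pat_str"],                # assumed key
    mergeable_ranks=mergeable_ranks,
    special_tokens=meta["special_tokens"],  # assumed key, e.g. {"<|endoftext|>": 24576, ...}
)
print(enc.encode("def hello(): pass"))
```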
## Benchmarks
Compared against **size-matched, code-trained tokenizers** in the 24K–50K vocab range. Measured on a held-out multi-domain corpus (~618 KB across 10 categories: C, C++, Java, JavaScript, Markdown, math/Python, prose, general Python, HTML/CSS, raw outputs).
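The benchmark harness itself is not bundled, but the headline metrics are straightforward to reproduce on your own files; a sketch (the shard file names are placeholders):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

def compression_stats(paths):
    """chars/token and tokens per 1k chars over a set of UTF-8 text files."""
    chars = toks = 0
    for p in paths:
        text = open(p, encoding="utf-8").read()
        chars += len(text)
        toks += len(tok.encode(text, add_special_tokens=False))
    return chars / toks, 1000 * toks / chars

cpt, per_1k = compression_stats(["shard_python.txt", "shard_java.txt"])  # placeholder shards
print(f"{cpt:.3f} chars/token, {per_1k:.1f} tokens / 1k chars")
```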
### Aggregate compression
Higher chars/token = better compression = shorter context for the same input.
| Tokenizer | Vocab | chars/token | tokens / 1k chars |
|----------------------|-------:|------------:|------------------:|
| **StarCoder2** | 49,152 | **3.238** | 308.9 |
| **ezellm-lite** | 24,600 | **3.081** | 324.6 |
| CodeGen-mono | 50,295 | 3.017 | 331.5 |
| DeepSeek-Coder | 32,022 | 2.836 | 352.6 |
| GPT-2 | 50,257 | 2.487 | 402.1 |
**ezellm-lite is the second-best tokenizer in this group**: within ~5% of StarCoder2 despite having **half the vocabulary**, ahead of CodeGen-mono and DeepSeek-Coder, which both have 30–100% more vocab slots, and ~24% more compressive than GPT-2.
### Compression by category (characters per token)
| Category | ezellm-lite (24.6K) | StarCoder2 (49K) | CodeGen-mono (50K) | DeepSeek-Coder (32K) | GPT-2 (50K) |
|----------------|--------------------:|-----------------:|-------------------:|---------------------:|------------:|
| c | 2.839 | 2.900 | 2.630 | 2.534 | 2.470 |
| cpp | 3.157 | 3.303 | 2.914 | 2.793 | 2.289 |
| java | 3.996 | 4.517 | 3.606 | 3.605 | 2.329 |
| javascript | 3.142 | 3.423 | 2.988 | 2.898 | 2.357 |
| markdown_docs | 3.117 | 3.272 | 3.125 | 2.906 | 2.986 |
| math_python | 2.630 | 2.695 | 2.482 | 2.387 | 2.136 |
| prose | 3.661 | 3.731 | 4.356 | 3.855 | 4.356 |
| python_general | 3.680 | 3.747 | 3.249 | 3.169 | 2.586 |
| web_html_css | 2.673 | 2.897 | 2.831 | 2.543 | 2.289 |
**Reading the table.** ezellm-lite tracks the leader closely on every code category and beats CodeGen-mono and DeepSeek-Coder consistently on Python, JavaScript, Java, and C++. StarCoder2 edges it out on most categories (expected: 2× the vocab, trained on The Stack v2). The one place ezellm-lite clearly trails is **prose**, where the prose-heavy GPT-2 vocabulary still wins; that is a deliberate trade for a code-focused tokenizer.
### Efficiency per vocabulary slot
A 24K vocab tokenizer is at a structural disadvantage on raw chars/token. A fairer cross-vocab metric is **chars/token × log₂(vocab)**, roughly "input bits carried per token."
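The figures in the table below come straight from that formula:

```python
import math

def bits_per_token(chars_per_token, vocab_size):
    # chars/token scaled by log2(vocab): the "input bits carried per token" proxy
    return chars_per_token * math.log2(vocab_size)

print(round(bits_per_token(3.238, 49_152), 2))  # StarCoder2  -> 50.46
print(round(bits_per_token(3.081, 24_600), 2))  # ezellm-lite -> 44.94
```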
| Tokenizer | Vocab | chars/tok | chars/tok × log₂(V) |
|------------------|-------:|----------:|--------------------:|
| StarCoder2 | 49,152 | 3.238 | 50.46 |
| CodeGen-mono | 50,295 | 3.017 | 47.12 |
| **ezellm-lite** | 24,600 | 3.081 | **44.94** |
| DeepSeek-Coder | 32,022 | 2.836 | 42.45 |
| GPT-2 | 50,257 | 2.487 | 38.84 |
By this measure ezellm-lite lands just behind CodeGen-mono, despite beating it on raw compression with less than half the embedding parameters, and sits clearly above DeepSeek-Coder and GPT-2.
At `d_model=1024`, the embedding/output table sizes are: ezellm-lite **25M**, DeepSeek-Coder **33M**, StarCoder2 / CodeGen-mono / GPT-2 **~50M**. For a small code LM (≤2B params) those tens of millions of softmax parameters are real budget.
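Those table sizes are simply `vocab × d_model` per (untied) embedding or output matrix:

```python
d_model = 1024
for name, vocab in [("ezellm-lite", 24_600), ("DeepSeek-Coder", 32_022),
                    ("StarCoder2", 49_152), ("CodeGen-mono", 50_295), ("GPT-2", 50_257)]:
    print(f"{name:>14}: {vocab * d_model / 1e6:.1f}M parameters per table")
```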
### Encoding speed
All five tokenizers are sub-millisecond on the 60–90 KB sample shards in this benchmark; encoding speed is not the bottleneck for any of them.
## Files
| File | Purpose |
|-----------------------|-----------------------------------------------|
| `tokenizer.json` | 🤗 Tokenizers / `transformers`-loadable model |
| `tokenizer_config.json` | Special-token metadata for `transformers` |
| `tiktoken.bpe` | tiktoken-format merge table |
| `tiktoken.json` | tiktoken metadata (pattern, special tokens) |
## Intended use
- Pretraining / fine-tuning small-to-mid code LMs (≤2B params) where vocabulary size dominates parameter count
- FIM-style training out of the box (FIM specials are pre-allocated)
- Repo-aware packing using `<|repo_name|>`, `<|filename|>`, `<|file_sep|>` (a sketch of both formats follows this list)
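A sketch of both formats. The token layout mirrors the Qwen2.5-Coder convention, but the exact packing template used for training is not fixed by this repo, so treat the arrangement below as an assumption rather than a spec (repo name and file contents are made up):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

# FIM sample in prefix-suffix-middle order (one common arrangement; pick one and stay consistent)
prefix, middle, suffix = "def greet(name):\n", "    msg = f'Hello, {name}!'", "\n    return msg\n"
fim_doc = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>"

# Repo-aware packing: repo header, then one <|file_sep|><|filename|> block per file
repo_doc = (
    "<|repo_name|>acme/toolbox\n"                 # hypothetical repo
    "<|file_sep|><|filename|>src/util.py\n"
    "def add(a, b):\n    return a + b\n"
    "<|file_sep|><|filename|>README.md\n"
    "# toolbox\nTiny example package.\n"
    "<|endoftext|>"
)

ids = tok.encode(fim_doc, add_special_tokens=False)
print(len(ids))
```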
## Limitations
- Not optimized for non-English natural language; the corpus is English + code.
- Compression on dense-punctuation languages (C, HTML/CSS) is noticeably lower than for Python or prose; budget context length accordingly.
- The 16 reserved slots are *unused*: they never appear in the training data and need to be registered explicitly if you want to repurpose them (e.g. as chat-template tags); a sketch follows below.
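A minimal sketch of repurposing two reserved slots; the role you assign them is entirely up to your pipeline, and the turn-marker convention below is an example, not something baked into the tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

# Mark two reserved slots as special; their IDs (24584 and 24585 per the layout table) stay fixed.
tok.add_special_tokens({"additional_special_tokens": ["<|reserved_0|>", "<|reserved_1|>"]})

# Use them as literal markers in your data, e.g. start / end of a chat turn.
sample = "<|reserved_0|>user: refactor this function<|reserved_1|>"
print(tok.encode(sample, add_special_tokens=False))
```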
## License
Apache-2.0.