---
license: apache-2.0
language:
- en
- code
tags:
- tokenizer
- bpe
- code
- fim
- tiktoken
library_name: transformers
pipeline_tag: text-generation
---
# ezellm-lite-tokenizer
A **24,600-vocab** byte-level BPE tokenizer trained on a **142 GB code-heavy corpus**, with a Qwen2.5-Coder-style special-token layout (FIM + repo metadata + reserved slots). Designed for small-to-mid scale code language models where embedding-table size matters.
This is **v2** of the tokenizer.
## Quick start
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")
ids = tok.encode("def hello(name):\n print(f'Hello, {name}!')\n")
print(len(ids), tok.decode(ids))
```
## Vocabulary layout
Total vocab: **24,600**. The first 24,576 IDs are learned BPE merges; the final 24 IDs are reserved for control tokens.
| ID range | Tokens | Purpose |
|---------------|-----------------------------------------------------------|-------------------------------|
| 0 – 24,575 | Learned BPE pieces | Text/code |
| 24,576 | `<\|endoftext\|>` | EOS + PAD |
| 24,577 – 24,580 | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`, `<\|fim_pad\|>` | Fill-in-the-Middle training |
| 24,581 – 24,583 | `<\|file_sep\|>`, `<\|repo_name\|>`, `<\|filename\|>` | Repo-level packing |
| 24,584 – 24,599 | `<\|reserved_0\|>` … `<\|reserved_15\|>` | Reserved for downstream use |
Only `<|endoftext|>` is registered as `eos_token` / `pad_token` in `special_tokens_map`. The FIM and repo markers are added tokens but are **not** flagged "special"; `tok.decode(ids, skip_special_tokens=True)` will only strip `<|endoftext|>`. Register the others explicitly if you want them stripped on decode.
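If you do want them stripped, re-register the markers as special tokens. A minimal sketch using the standard `transformers` API (the tokens already exist in the vocab, so no new IDs are allocated):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

# Flip the "special" flag on the FIM/repo markers. They are already in the
# vocab, so add_special_tokens reuses their existing IDs.
markers = ["<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|fim_pad|>",
           "<|file_sep|>", "<|repo_name|>", "<|filename|>"]
tok.add_special_tokens({"additional_special_tokens": markers})

ids = tok.encode("<|repo_name|>demo<|file_sep|>x = 1\n")
print(tok.decode(ids, skip_special_tokens=True))  # markers are now stripped
```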
## Training data
- **Size:** ~142 GB of text and source code
- **Mix (heavily code-leaning):** Python, JavaScript/TypeScript, Java, C/C++, web (HTML/CSS), Markdown, math/scientific Python, technical prose
- **Algorithm:** Byte-level BPE (tiktoken-compatible; `tiktoken.bpe` and `tiktoken.json` are bundled alongside the standard `tokenizer.json`)
The tokenizer files in this repo can be loaded both via 🤗 `transformers` (`tokenizer.json`) and via `tiktoken` directly (`tiktoken.bpe`).
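For the `tiktoken` path, a sketch of direct loading (`load_tiktoken_bpe` and `tiktoken.Encoding` are real tiktoken APIs; the JSON field names `pattern` and `special_tokens` are assumptions here, so check the bundled `tiktoken.json` for the actual keys):

```python
import json

import tiktoken
from tiktoken.load import load_tiktoken_bpe

ranks = load_tiktoken_bpe("tiktoken.bpe")  # dict[bytes, int] merge table
with open("tiktoken.json") as f:
    meta = json.load(f)

enc = tiktoken.Encoding(
    name="ezellm-lite",
    pat_str=meta["pattern"],                # pre-tokenization regex (assumed key)
    mergeable_ranks=ranks,
    special_tokens=meta["special_tokens"],  # token string -> id (assumed key)
)
ids = enc.encode("def hello(name):\n    print(name)\n")
```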
## Benchmarks
Compared against **size-matched, code-trained tokenizers** in the 24K–50K vocab range. Measured on a held-out multi-domain corpus (~618 KB across 10 categories: C, C++, Java, JavaScript, Markdown, math/Python, prose, general Python, HTML/CSS, raw outputs).
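The per-shard measurement is plain arithmetic; a minimal sketch of the assumed methodology (not the exact benchmark script):

```python
# Assumed methodology: compression stats for one tokenizer on one text shard.
def compression_stats(tok, text: str) -> tuple[float, float]:
    n_tokens = len(tok.encode(text))
    chars_per_token = len(text) / n_tokens
    tokens_per_1k_chars = 1000 * n_tokens / len(text)
    return chars_per_token, tokens_per_1k_chars
```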
### Aggregate compression
Higher chars/token = better compression = shorter context for the same input.
| Tokenizer | Vocab | chars/token | tokens / 1k chars |
|----------------------|-------:|------------:|------------------:|
| **StarCoder2** | 49,152 | **3.238** | 308.9 |
| **ezellm-lite** | 24,600 | **3.081** | 324.6 |
| CodeGen-mono | 50,295 | 3.017 | 331.5 |
| DeepSeek-Coder | 32,022 | 2.836 | 352.6 |
| GPT-2 | 50,257 | 2.487 | 402.1 |
**ezellm-lite is the second-best tokenizer in this group**: within ~5% of StarCoder2 despite having **half the vocabulary**, ahead of CodeGen-mono and DeepSeek-Coder (which carry roughly 2× and 1.3× its vocab slots, respectively), and ~24% more compressive than GPT-2.
### Compression by category: characters per token
| Category | ezellm-lite (24.6K) | StarCoder2 (49K) | CodeGen-mono (50K) | DeepSeek-Coder (32K) | GPT-2 (50K) |
|----------------|--------------------:|-----------------:|-------------------:|---------------------:|------------:|
| c | 2.839 | 2.900 | 2.630 | 2.534 | 2.470 |
| cpp | 3.157 | 3.303 | 2.914 | 2.793 | 2.289 |
| java | 3.996 | 4.517 | 3.606 | 3.605 | 2.329 |
| javascript | 3.142 | 3.423 | 2.988 | 2.898 | 2.357 |
| markdown_docs | 3.117 | 3.272 | 3.125 | 2.906 | 2.986 |
| math_python | 2.630 | 2.695 | 2.482 | 2.387 | 2.136 |
| prose | 3.661 | 3.731 | 4.356 | 3.855 | 4.356 |
| python_general | 3.680 | 3.747 | 3.249 | 3.169 | 2.586 |
| web_html_css | 2.673 | 2.897 | 2.831 | 2.543 | 2.289 |
**Reading the table.** ezellm-lite either wins or comes within a few percent of the leader on every code category, consistently beating CodeGen-mono and DeepSeek-Coder on Python, JavaScript, Java, and C++. StarCoder2 edges it out on most categories, as expected from a tokenizer with 2× the vocab trained on The Stack v2. The one place ezellm-lite clearly trails is **prose**, where the prose-heavy GPT-2 and CodeGen-mono vocabularies still win: a deliberate trade for a code-focused tokenizer.
### Efficiency per vocabulary slot
A 24K-vocab tokenizer is at a structural disadvantage on raw chars/token. A complementary cross-vocab view is **chars/token × log₂(vocab)**: compression weighted by the log₂(V) bits of information a single token ID carries.
| Tokenizer        | Vocab  | chars/tok | chars/tok × log₂(V) |
|------------------|-------:|----------:|--------------------:|
| StarCoder2 | 49,152 | 3.238 | 50.46 |
| CodeGen-mono | 50,295 | 3.017 | 47.12 |
| **ezellm-lite** | 24,600 | 3.081 | **44.94** |
| DeepSeek-Coder | 32,022 | 2.836 | 42.45 |
| GPT-2 | 50,257 | 2.487 | 38.84 |
On this weighted measure ezellm-lite trails only StarCoder2 and CodeGen-mono; it still beats CodeGen-mono on raw compression **while using less than half the vocabulary slots**, and it sits clearly above DeepSeek-Coder and GPT-2 on both views.
At `d_model=1024`, the embedding/output table sizes are: ezellm-lite **25M**, DeepSeek-Coder **33M**, StarCoder2 / CodeGen-mono / GPT-2 **~50M**. For a small code LM (≤2B params) those tens of millions of softmax parameters are real budget.
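The arithmetic behind those figures is just `vocab × d_model` per table; a quick sanity check:

```python
# Reproduce the embedding/output table sizes quoted above (params = vocab * d_model).
d_model = 1024
vocabs = {"ezellm-lite": 24_600, "DeepSeek-Coder": 32_022,
          "StarCoder2": 49_152, "CodeGen-mono": 50_295, "GPT-2": 50_257}
for name, v in vocabs.items():
    print(f"{name:15s} {v * d_model / 1e6:5.1f}M params per table")
# ezellm-lite ~25.2M, DeepSeek-Coder ~32.8M, the ~50K-vocab tokenizers ~50.3-51.5M
```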
### Encoding speed
All five tokenizers are sub-millisecond on the 60–90 KB sample shards in this benchmark; encoding speed is not the bottleneck for any of them.
## Files
| File | Purpose |
|-----------------------|-----------------------------------------------|
| `tokenizer.json`       | 🤗 Tokenizers / `transformers`-loadable model |
| `tokenizer_config.json` | Special-token metadata for `transformers` |
| `tiktoken.bpe` | tiktoken-format merge table |
| `tiktoken.json` | tiktoken metadata (pattern, special tokens) |
## Intended use
- Pretraining / fine-tuning small-to-mid code LMs (≤2B params) where vocabulary size dominates parameter count
- FIM-style training out of the box (FIM specials are pre-allocated; see the sketch after this list)
- Repo-aware packing using `<|repo_name|>`, `<|filename|>`, `<|file_sep|>`
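As an illustration of the FIM layout, here is one way to assemble a training example in the common PSM (prefix-suffix-middle) ordering. This is a sketch only; which ordering the model expects depends entirely on how you train it:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

prefix = "def add(a, b):\n    "
middle = "return a + b"
suffix = "\n\nprint(add(2, 3))\n"

# PSM layout: prefix and suffix as context, middle as the completion target.
fim_text = (
    "<|fim_prefix|>" + prefix
    + "<|fim_suffix|>" + suffix
    + "<|fim_middle|>" + middle
    + "<|endoftext|>"
)
ids = tok.encode(fim_text)
```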
## Limitations
- Not optimized for non-English natural language; the corpus is English + code.
- Compression on punctuation-dense inputs (C, HTML/CSS) is noticeably lower than on Python or prose (roughly 2.7–2.8 vs. ~3.7 chars/token); budget context length accordingly.
- The 16 reserved slots are *unused*: they will never appear in trained text, and they need to be registered explicitly if you want to repurpose them (e.g. as chat-template tags); a sketch follows below.
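A sketch of repurposing reserved slots; the chat-turn aliases below are hypothetical, and nothing in the tokenizer defines them:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

# Hypothetical repurposing: use two reserved slots as chat-turn delimiters.
tok.add_special_tokens({"additional_special_tokens": ["<|reserved_0|>", "<|reserved_1|>"]})
TURN_START, TURN_END = "<|reserved_0|>", "<|reserved_1|>"  # hypothetical aliases

chat = f"{TURN_START}user\nHi!{TURN_END}{TURN_START}assistant\n"
ids = tok.encode(chat)  # the reserved slots encode to IDs 24,584 and 24,585
```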
## License
Apache-2.0.