---
license: apache-2.0
language:
- en
- code
tags:
- tokenizer
- bpe
- code
- fim
- tiktoken
library_name: transformers
pipeline_tag: text-generation
---
# ezellm-lite-tokenizer

A **24,600-vocab** byte-level BPE tokenizer trained on a **142 GB code-heavy corpus**, with a Qwen2.5-Coder-style special-token layout (FIM + repo metadata + reserved slots). Designed for small-to-mid scale code language models where embedding-table size matters.

This is **v2** of the tokenizer.
## Quick start

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")
ids = tok.encode("def hello(name):\n print(f'Hello, {name}!')\n")
print(len(ids), tok.decode(ids))
```
## Vocabulary layout

Total vocab: **24,600**. The first 24,576 IDs are learned BPE merges; the top 24 IDs are reserved for control tokens.

| ID range | Tokens | Purpose |
|---------------|-----------------------------------------------------------|-------------------------------|
| 0 – 24,575 | Learned BPE pieces | Text/code |
| 24,576 | `<\|endoftext\|>` | EOS + PAD |
| 24,577 – 24,580 | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`, `<\|fim_pad\|>` | Fill-in-the-Middle training |
| 24,581 – 24,583 | `<\|file_sep\|>`, `<\|repo_name\|>`, `<\|filename\|>` | Repo-level packing |
| 24,584 – 24,599 | `<\|reserved_0\|>` … `<\|reserved_15\|>` | Reserved for downstream use |

Only `<|endoftext|>` is registered as `eos_token` / `pad_token` in `special_tokens_map`. The FIM and repo markers are added tokens but are **not** flagged "special": `tok.decode(ids, skip_special_tokens=True)` will only strip `<|endoftext|>`. Register the others explicitly if you want them stripped on decode.
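As an illustration, a PSM-style (prefix-suffix-middle) FIM sample can be assembled from the markers above. The marker ordering is an assumption modeled on the Qwen2.5-Coder-style layout this tokenizer mirrors; the token strings themselves come from the vocabulary table.

```python
# Assemble a hypothetical PSM-style FIM training sample from the control
# tokens listed above. The ordering (prefix, suffix, middle) is an assumption
# based on the Qwen2.5-Coder-style layout this card references.
prefix = "def add(a, b):\n    "
suffix = "\n"
middle = "return a + b"

fim_sample = (
    "<|fim_prefix|>" + prefix
    + "<|fim_suffix|>" + suffix
    + "<|fim_middle|>" + middle
    + "<|endoftext|>"
)
print(fim_sample)
```

At training time the model sees the middle span last, so it learns to infill between the given prefix and suffix.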
## Training data

- **Size:** ~142 GB of text and source code
- **Mix (heavily code-leaning):** Python, JavaScript/TypeScript, Java, C/C++, web (HTML/CSS), Markdown, math/scientific Python, technical prose
- **Algorithm:** Byte-level BPE (tiktoken-compatible; `tiktoken.bpe` and `tiktoken.json` are bundled alongside the standard `tokenizer.json`)

The tokenizer files in this repo can be loaded both via 🤗 `transformers` (`tokenizer.json`) and via `tiktoken` directly (`tiktoken.bpe`).
## Benchmarks

Compared against **size-matched, code-trained tokenizers** in the 24K–50K vocab range. Measured on a held-out multi-domain corpus (~618 KB across 10 categories: C, C++, Java, JavaScript, Markdown, math/Python, prose, general Python, HTML/CSS, raw outputs).

### Aggregate compression

Higher chars/token = better compression = shorter context for the same input.
| Tokenizer | Vocab | chars/token | tokens / 1k chars |
|----------------------|-------:|------------:|------------------:|
| **StarCoder2** | 49,152 | **3.238** | 308.9 |
| **ezellm-lite** | 24,600 | **3.081** | 324.6 |
| CodeGen-mono | 50,295 | 3.017 | 331.5 |
| DeepSeek-Coder | 32,022 | 2.836 | 352.6 |
| GPT-2 | 50,257 | 2.487 | 402.1 |

**ezellm-lite is the second-best tokenizer in this group**: within ~5% of StarCoder2 despite having **half the vocabulary**, ahead of CodeGen-mono and DeepSeek-Coder (which have 30–100% more vocab slots), and ~24% more compressive than GPT-2.
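The chars/token figures above are straightforward to reproduce with a helper like the sketch below (`encode` stands in for any tokenizer's encode function; the whitespace splitter is only an illustrative stand-in, not this tokenizer's BPE):

```python
def chars_per_token(text: str, encode) -> float:
    """Compression ratio: input characters divided by token count."""
    ids = encode(text)
    return len(text) / len(ids)

# Illustrative stand-in for a real tokenizer: whitespace splitting.
sample = "def hello(name):\n    print(f'Hello, {name}!')\n"
ratio = chars_per_token(sample, lambda t: t.split())
print(f"{ratio:.3f} chars/token, {1000 / ratio:.1f} tokens per 1k chars")
```

Swapping the lambda for `tok.encode` from the quick start reproduces the ezellm-lite row; the tokens-per-1k-chars column is simply `1000 / (chars/token)`.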
### Compression by category – characters per token

| Category | ezellm-lite (24.6K) | StarCoder2 (49K) | CodeGen-mono (50K) | DeepSeek-Coder (32K) | GPT-2 (50K) |
|----------------|--------------------:|-----------------:|-------------------:|---------------------:|------------:|
| c | 2.839 | 2.900 | 2.630 | 2.534 | 2.470 |
| cpp | 3.157 | 3.303 | 2.914 | 2.793 | 2.289 |
| java | 3.996 | 4.517 | 3.606 | 3.605 | 2.329 |
| javascript | 3.142 | 3.423 | 2.988 | 2.898 | 2.357 |
| markdown_docs | 3.117 | 3.272 | 3.125 | 2.906 | 2.986 |
| math_python | 2.630 | 2.695 | 2.482 | 2.387 | 2.136 |
| prose | 3.661 | 3.731 | 4.356 | 3.855 | 4.356 |
| python_general | 3.680 | 3.747 | 3.249 | 3.169 | 2.586 |
| web_html_css | 2.673 | 2.897 | 2.831 | 2.543 | 2.289 |

**Reading the table.** ezellm-lite wins or comes within a few percent of the leader on nearly every code category (Java is the exception, where StarCoder2 leads by ~13%), and it beats CodeGen-mono and DeepSeek-Coder consistently on Python, JavaScript, Java, and C++. StarCoder2 edges it out on most categories, which is expected given 2× the vocab and training on The Stack v2. The one place ezellm-lite clearly trails is **prose**, where the general-text-heavy vocabularies (CodeGen-mono and GPT-2 at 4.356, DeepSeek-Coder at 3.855) still win: a deliberate trade for a code-focused tokenizer.
### Efficiency per vocabulary slot

A 24K vocab tokenizer is at a structural disadvantage on raw chars/token. A fairer cross-vocab metric is **chars/token × log₂(vocab)**, roughly "input bits carried per token."

| Tokenizer | Vocab | chars/tok | chars/tok × log₂(V) |
|------------------|-------:|----------:|--------------------:|
| StarCoder2 | 49,152 | 3.238 | 50.46 |
| CodeGen-mono | 50,295 | 3.017 | 47.12 |
| **ezellm-lite** | 24,600 | 3.081 | **44.94** |
| DeepSeek-Coder | 32,022 | 2.836 | 42.45 |
| GPT-2 | 50,257 | 2.487 | 38.84 |

On this metric ezellm-lite clearly leads DeepSeek-Coder and GPT-2, and it **beats CodeGen-mono on raw chars/token outright while using less than half the embedding parameters**.

At `d_model=1024`, the embedding/output table sizes are: ezellm-lite **25M**, DeepSeek-Coder **33M**, StarCoder2 / CodeGen-mono / GPT-2 **~50M**. For a small code LM (≤2B params) those tens of millions of softmax parameters are real budget.
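Both the per-slot metric and the embedding-table sizes are easy to recompute; a quick sketch:

```python
import math

def bits_per_token(chars_per_tok: float, vocab: int) -> float:
    # The card's per-slot metric: chars/token * log2(vocab).
    return chars_per_tok * math.log2(vocab)

def embedding_params(vocab: int, d_model: int) -> int:
    # One vocab x d_model table (doubled if input/output embeddings are untied).
    return vocab * d_model

print(round(bits_per_token(3.081, 24_600), 2))  # per-slot score for ezellm-lite
print(embedding_params(24_600, 1024))           # ~25.2M parameters at d_model=1024
```

Plugging each row's chars/token and vocab into `bits_per_token` reproduces the last column of the table above.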
### Encoding speed

All five tokenizers are sub-millisecond on the 60–90 KB sample shards in this benchmark; encoding speed is not the bottleneck for any of them.
## Files

| File | Purpose |
|-----------------------|-----------------------------------------------|
| `tokenizer.json` | 🤗 Tokenizers / `transformers`-loadable model |
| `tokenizer_config.json` | Special-token metadata for `transformers` |
| `tiktoken.bpe` | tiktoken-format merge table |
| `tiktoken.json` | tiktoken metadata (pattern, special tokens) |
## Intended use

- Pretraining / fine-tuning small-to-mid code LMs (≤2B params) where vocabulary size dominates parameter count
- FIM-style training out of the box (FIM specials are pre-allocated)
- Repo-aware packing using `<|repo_name|>`, `<|filename|>`, `<|file_sep|>`
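A hypothetical packing sketch using those repo markers. The exact field order and separators are not documented in this card, so treat the layout below as illustrative only:

```python
# Illustrative repo-level packing with the card's metadata tokens. The exact
# layout (field order, separators) is an assumption, not a documented format.
files = {
    "src/a.py": "print('a')\n",
    "src/b.py": "print('b')\n",
}

packed = "<|repo_name|>" + "example/repo" + "\n"
for path, text in files.items():
    packed += "<|file_sep|>" + "<|filename|>" + path + "\n" + text
packed += "<|endoftext|>"
print(packed)
```

Because the repo markers are not flagged "special" (see the vocabulary layout section), remember to register them if you want them stripped on decode.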
## Limitations

- Not optimized for non-English natural language; the corpus is English + code.
- Compression on punctuation-dense languages (C, HTML/CSS) is noticeably lower (fewer chars/token) than on Python or prose; budget context length accordingly.
- The 16 reserved slots are *unused*: they will never appear in trained text and need to be registered explicitly if you want to repurpose them (e.g. chat-template tags).
## License

Apache-2.0.