---
license: apache-2.0
language:
- en
- code
tags:
- tokenizer
- bpe
- code
- fim
- tiktoken
library_name: transformers
pipeline_tag: text-generation
---

# ezellm-lite-tokenizer

A **24,600-vocab** byte-level BPE tokenizer trained on a **142 GB code-heavy corpus**, with a Qwen2.5-Coder-style special-token layout (FIM + repo metadata + reserved slots). Designed for small-to-mid-scale code language models where embedding-table size matters.

This is **v2** of the tokenizer.

## Quick start

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")
ids = tok.encode("def hello(name):\n    print(f'Hello, {name}!')\n")
print(len(ids), tok.decode(ids))
```

## Vocabulary layout

Total vocab: **24,600**. The first 24,576 IDs are learned BPE pieces; the last 24 IDs are reserved for control tokens.

| ID range | Tokens | Purpose |
|---------------|-----------------------------------------------------------|-------------------------------|
| 0 – 24,575 | Learned BPE pieces | Text/code |
| 24,576 | `<\|endoftext\|>` | EOS + PAD |
| 24,577 – 24,580 | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`, `<\|fim_pad\|>` | Fill-in-the-Middle training |
| 24,581 – 24,583 | `<\|file_sep\|>`, `<\|repo_name\|>`, `<\|filename\|>` | Repo-level packing |
| 24,584 – 24,599 | `<\|reserved_0\|>` … `<\|reserved_15\|>` | Reserved for downstream use |

Only `<|endoftext|>` is registered as `eos_token` / `pad_token` in `special_tokens_map`. The FIM and repo markers are added tokens but are **not** flagged "special" — `tok.decode(ids, skip_special_tokens=True)` will only strip `<|endoftext|>`. Register the others explicitly if you want them stripped on decode.
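Since the FIM specials are pre-allocated but not wired into any template, packing a training example is plain string assembly. A minimal sketch follows; the PSM (prefix-suffix-middle) ordering and the `make_fim_example` helper are illustrative assumptions borrowed from common FIM recipes, not something this repo mandates:

```python
# Sketch: packing one fill-in-the-middle (FIM) example with the pre-allocated
# specials. PSM ordering is an assumption; the card allocates the token IDs
# but does not prescribe an ordering.

FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"
EOT = "<|endoftext|>"

def make_fim_example(prefix: str, middle: str, suffix: str) -> str:
    """Pack a (prefix, middle, suffix) split into one PSM-ordered string."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}{EOT}"

# Split a snippet so the model learns to infill the function body.
example = make_fim_example("def add(a, b):\n", "    return a + b\n", "")
print(example)
```

If you want these markers stripped on decode, `tok.add_special_tokens({"additional_special_tokens": [...]})` should flag the existing IDs as special without growing the vocabulary, since the tokens are already in the vocab.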
## Training data

- **Size:** ~142 GB of text and source code
- **Mix (heavily code-leaning):** Python, JavaScript/TypeScript, Java, C/C++, web (HTML/CSS), Markdown, math/scientific Python, technical prose
- **Algorithm:** byte-level BPE (tiktoken-compatible; `tiktoken.bpe` and `tiktoken.json` are bundled alongside the standard `tokenizer.json`)

The tokenizer files in this repo can be loaded both via 🤗 `transformers` (`tokenizer.json`) and via `tiktoken` directly (`tiktoken.bpe`).

## Benchmarks

Compared against **size-matched, code-trained tokenizers** in the 24K–50K vocab range. Measured on a held-out multi-domain corpus (~618 KB across 10 categories: C, C++, Java, JavaScript, Markdown, math/Python, prose, general Python, HTML/CSS, raw outputs).

### Aggregate compression

Higher chars/token = better compression = shorter context for the same input.

| Tokenizer | Vocab | chars/token | tokens / 1k chars |
|----------------------|-------:|------------:|------------------:|
| **StarCoder2** | 49,152 | **3.238** | 308.9 |
| **ezellm-lite** | 24,600 | **3.081** | 324.6 |
| CodeGen-mono | 50,295 | 3.017 | 331.5 |
| DeepSeek-Coder | 32,022 | 2.836 | 352.6 |
| GPT-2 | 50,257 | 2.487 | 402.1 |

**ezellm-lite is the second-best tokenizer in this group** — within ~5% of StarCoder2 despite having **half the vocabulary**, ahead of CodeGen-mono and DeepSeek-Coder (which carry 1.3–2× as many vocab slots), and ~24% more compressive than GPT-2.
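The two columns in these tables reduce to a pair of ratios, so the numbers can be reproduced for any tokenizer with a small helper (the function name is illustrative; pass it any text and the token IDs your tokenizer returns for it):

```python
def compression_stats(text: str, token_ids: list[int]) -> tuple[float, float]:
    """Return (chars per token, tokens per 1k chars) for one encoded sample,
    matching the two metric columns in the benchmark tables."""
    n_chars, n_tokens = len(text), len(token_ids)
    return n_chars / n_tokens, 1000 * n_tokens / n_chars

# Dummy encoding: 30 characters split into 10 tokens -> 3.0 chars/token.
text = "x" * 30
ids = list(range(10))
cpt, per_1k = compression_stats(text, ids)
print(f"{cpt:.3f} chars/token, {per_1k:.1f} tokens/1k chars")
```

Note the two columns are reciprocals up to a factor of 1000 — e.g. 1000 / 3.081 ≈ 324.6 tokens per 1k chars for the ezellm-lite row.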
### Compression by category — characters per token

| Category | ezellm-lite (24.6K) | StarCoder2 (49K) | CodeGen-mono (50K) | DeepSeek-Coder (32K) | GPT-2 (50K) |
|----------------|--------------------:|-----------------:|-------------------:|---------------------:|------------:|
| c | 2.839 | 2.900 | 2.630 | 2.534 | 2.470 |
| cpp | 3.157 | 3.303 | 2.914 | 2.793 | 2.289 |
| java | 3.996 | 4.517 | 3.606 | 3.605 | 2.329 |
| javascript | 3.142 | 3.423 | 2.988 | 2.898 | 2.357 |
| markdown_docs | 3.117 | 3.272 | 3.125 | 2.906 | 2.986 |
| math_python | 2.630 | 2.695 | 2.482 | 2.387 | 2.136 |
| prose | 3.661 | 3.731 | 4.356 | 3.855 | 4.356 |
| python_general | 3.680 | 3.747 | 3.249 | 3.169 | 2.586 |
| web_html_css | 2.673 | 2.897 | 2.831 | 2.543 | 2.289 |

**Reading the table.** ezellm-lite either wins or comes within a few percent of the leader on every code category, and it beats CodeGen-mono and DeepSeek-Coder consistently on Python, JavaScript, Java, and C++. StarCoder2 edges it out on most categories (expected — 2× the vocab, trained on The Stack v2). The one place ezellm-lite clearly trails is **prose**, where the prose-heavier GPT-2 and CodeGen-mono vocabularies (tied at 4.356) still win — a deliberate trade for a code-focused tokenizer.

### Efficiency per vocabulary slot

A 24K-vocab tokenizer is at a structural disadvantage on raw chars/token. A fairer cross-vocab metric is **chars/token × log₂(vocab)** — roughly "input bits carried per token."

| Tokenizer | Vocab | chars/tok | chars/tok × log₂(V) |
|------------------|-------:|----------:|--------------------:|
| StarCoder2 | 49,152 | 3.238 | 50.46 |
| CodeGen-mono | 50,295 | 3.017 | 47.12 |
| **ezellm-lite** | 24,600 | 3.081 | **44.94** |
| DeepSeek-Coder | 32,022 | 2.836 | 42.45 |
| GPT-2 | 50,257 | 2.487 | 38.84 |

Even on this adjusted metric, ezellm-lite stays within ~5% of CodeGen-mono (while beating it on raw compression with less than half the embedding parameters) and sits clearly above DeepSeek-Coder and GPT-2.
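Both the adjusted metric and the embedding-table arithmetic quoted below are one-line computations; this sketch (helper names are mine, not part of the repo) reproduces two table rows and the ezellm-lite table size at `d_model=1024`:

```python
import math

def bits_per_token(chars_per_token: float, vocab: int) -> float:
    """chars/token x log2(vocab): roughly the input bits carried per token."""
    return chars_per_token * math.log2(vocab)

def embedding_params(vocab: int, d_model: int) -> int:
    """Parameter count of one (untied) embedding or output-projection table."""
    return vocab * d_model

print(round(bits_per_token(3.081, 24_600), 2))  # ezellm-lite row: 44.94
print(round(bits_per_token(3.238, 49_152), 2))  # StarCoder2 row: 50.46
print(embedding_params(24_600, 1024))           # ~25M parameters
```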
At `d_model=1024`, the embedding/output table sizes are: ezellm-lite **25M**, DeepSeek-Coder **33M**, StarCoder2 / CodeGen-mono / GPT-2 **~50M**. For a small code LM (≤2B params), those tens of millions of softmax parameters are a real slice of the budget.

### Encoding speed

All five tokenizers are sub-millisecond on the 60–90 KB sample shards in this benchmark; encoding speed is not the bottleneck for any of them.

## Files

| File | Purpose |
|-----------------------|-----------------------------------------------|
| `tokenizer.json` | 🤗 Tokenizers / `transformers`-loadable model |
| `tokenizer_config.json` | Special-token metadata for `transformers` |
| `tiktoken.bpe` | tiktoken-format merge table |
| `tiktoken.json` | tiktoken metadata (pattern, special tokens) |

## Intended use

- Pretraining / fine-tuning small-to-mid code LMs (≤2B params) where vocabulary size dominates parameter count
- FIM-style training out of the box (FIM specials are pre-allocated)
- Repo-aware packing using `<|repo_name|>`, `<|filename|>`, `<|file_sep|>`

## Limitations

- Not optimized for non-English natural language; the corpus is English + code.
- Compression on punctuation-dense languages (C, HTML/CSS) is noticeably lower (fewer chars per token) than for Python or prose — budget context length accordingly.
- The 16 reserved slots are *unused* — they never appear in trained text, and you must register them explicitly if you want to repurpose them (e.g. as chat-template tags).

## License

Apache-2.0.