---
license: apache-2.0
language:
- en
- code
tags:
- tokenizer
- bpe
- code
- fim
- tiktoken
library_name: transformers
pipeline_tag: text-generation
---

# ezellm-lite-tokenizer

A **24,600-vocab** byte-level BPE tokenizer trained on a **142 GB code-heavy corpus**, with a Qwen2.5-Coder-style special-token layout (FIM + repo metadata + reserved slots). Designed for small-to-mid-scale code language models where embedding-table size matters.

This is **v2** of the tokenizer.

## Quick start

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")
ids = tok.encode("def hello(name):\n    print(f'Hello, {name}!')\n")
print(len(ids), tok.decode(ids))
```

## Vocabulary layout

Total vocab: **24,600**. The first 24,576 IDs are learned BPE pieces; the last 24 IDs are reserved for control tokens.

| ID range | Tokens | Purpose |
|---------------|-----------------------------------------------------------|-------------------------------|
| 0 – 24,575 | Learned BPE pieces | Text/code |
| 24,576 | `<\|endoftext\|>` | EOS + PAD |
| 24,577 – 24,580 | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`, `<\|fim_pad\|>` | Fill-in-the-Middle training |
| 24,581 – 24,583 | `<\|file_sep\|>`, `<\|repo_name\|>`, `<\|filename\|>` | Repo-level packing |
| 24,584 – 24,599 | `<\|reserved_0\|>` … `<\|reserved_15\|>` | Reserved for downstream use |

Only `<|endoftext|>` is registered as `eos_token` / `pad_token` in `special_tokens_map`. The FIM and repo markers are added tokens but are **not** flagged "special" — `tok.decode(ids, skip_special_tokens=True)` will only strip `<|endoftext|>`. Register the others explicitly if you want them stripped on decode.
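Since the FIM specials are pre-allocated but not wired into any template, packing a training example is plain string assembly. A minimal sketch follows; the PSM (prefix-suffix-middle) ordering and the `make_fim_example` helper are illustrative assumptions borrowed from common FIM recipes, not something this repo mandates:

```python
# Sketch: packing one fill-in-the-middle (FIM) example with the pre-allocated
# specials. PSM ordering is an assumption; the card allocates the token IDs
# but does not prescribe an ordering.

FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"
EOT = "<|endoftext|>"

def make_fim_example(prefix: str, middle: str, suffix: str) -> str:
    """Pack a (prefix, middle, suffix) split into one PSM-ordered string."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}{EOT}"

# Split a snippet so the model learns to infill the function body.
example = make_fim_example("def add(a, b):\n", "    return a + b\n", "")
print(example)
```

If you want these markers stripped on decode, `tok.add_special_tokens({"additional_special_tokens": [...]})` should flag the existing IDs as special without growing the vocabulary, since the tokens are already in the vocab.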
## Training data

- **Size:** ~142 GB of text and source code
- **Mix (heavily code-leaning):** Python, JavaScript/TypeScript, Java, C/C++, web (HTML/CSS), Markdown, math/scientific Python, technical prose
- **Algorithm:** byte-level BPE (tiktoken-compatible; `tiktoken.bpe` and `tiktoken.json` are bundled alongside the standard `tokenizer.json`)

The tokenizer files in this repo can be loaded both via 🤗 `transformers` (`tokenizer.json`) and via `tiktoken` directly (`tiktoken.bpe`).

## Benchmarks

Compared against **size-matched, code-trained tokenizers** in the 24K–50K vocab range. Measured on a held-out multi-domain corpus (~618 KB across 10 categories: C, C++, Java, JavaScript, Markdown, math/Python, prose, general Python, HTML/CSS, raw outputs).

### Aggregate compression

Higher chars/token = better compression = shorter context for the same input.

| Tokenizer | Vocab | chars/token | tokens / 1k chars |
|----------------------|-------:|------------:|------------------:|
| **StarCoder2** | 49,152 | **3.238** | 308.9 |
| **ezellm-lite** | 24,600 | **3.081** | 324.6 |
| CodeGen-mono | 50,295 | 3.017 | 331.5 |
| DeepSeek-Coder | 32,022 | 2.836 | 352.6 |
| GPT-2 | 50,257 | 2.487 | 402.1 |

**ezellm-lite is the second-best tokenizer in this group** — within ~5% of StarCoder2 despite having **half the vocabulary**, ahead of CodeGen-mono and DeepSeek-Coder (which carry 1.3–2× as many vocab slots), and ~24% more compressive than GPT-2.
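The two columns in these tables reduce to a pair of ratios, so the numbers can be reproduced for any tokenizer with a small helper (the function name is illustrative; pass it any text and the token IDs your tokenizer returns for it):

```python
def compression_stats(text: str, token_ids: list[int]) -> tuple[float, float]:
    """Return (chars per token, tokens per 1k chars) for one encoded sample,
    matching the two metric columns in the benchmark tables."""
    n_chars, n_tokens = len(text), len(token_ids)
    return n_chars / n_tokens, 1000 * n_tokens / n_chars

# Dummy encoding: 30 characters split into 10 tokens -> 3.0 chars/token.
text = "x" * 30
ids = list(range(10))
cpt, per_1k = compression_stats(text, ids)
print(f"{cpt:.3f} chars/token, {per_1k:.1f} tokens/1k chars")
```

Note the two columns are reciprocals up to a factor of 1000 — e.g. 1000 / 3.081 ≈ 324.6 tokens per 1k chars for the ezellm-lite row.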
### Compression by category — characters per token

| Category | ezellm-lite (24.6K) | StarCoder2 (49K) | CodeGen-mono (50K) | DeepSeek-Coder (32K) | GPT-2 (50K) |
|----------------|--------------------:|-----------------:|-------------------:|---------------------:|------------:|
| c | 2.839 | 2.900 | 2.630 | 2.534 | 2.470 |
| cpp | 3.157 | 3.303 | 2.914 | 2.793 | 2.289 |
| java | 3.996 | 4.517 | 3.606 | 3.605 | 2.329 |
| javascript | 3.142 | 3.423 | 2.988 | 2.898 | 2.357 |
| markdown_docs | 3.117 | 3.272 | 3.125 | 2.906 | 2.986 |
| math_python | 2.630 | 2.695 | 2.482 | 2.387 | 2.136 |
| prose | 3.661 | 3.731 | 4.356 | 3.855 | 4.356 |
| python_general | 3.680 | 3.747 | 3.249 | 3.169 | 2.586 |
| web_html_css | 2.673 | 2.897 | 2.831 | 2.543 | 2.289 |

**Reading the table.** ezellm-lite either wins or comes within a few percent of the leader on every code category, and it beats CodeGen-mono and DeepSeek-Coder consistently on Python, JavaScript, Java, and C++. StarCoder2 edges it out on most categories (expected — 2× the vocab, trained on The Stack v2). The one place ezellm-lite clearly trails is **prose**, where the prose-heavier GPT-2 and CodeGen-mono vocabularies (tied at 4.356) still win — a deliberate trade for a code-focused tokenizer.

### Efficiency per vocabulary slot

A 24K-vocab tokenizer is at a structural disadvantage on raw chars/token. A fairer cross-vocab metric is **chars/token × log₂(vocab)** — roughly "input bits carried per token."

| Tokenizer | Vocab | chars/tok | chars/tok × log₂(V) |
|------------------|-------:|----------:|--------------------:|
| StarCoder2 | 49,152 | 3.238 | 50.46 |
| CodeGen-mono | 50,295 | 3.017 | 47.12 |
| **ezellm-lite** | 24,600 | 3.081 | **44.94** |
| DeepSeek-Coder | 32,022 | 2.836 | 42.45 |
| GPT-2 | 50,257 | 2.487 | 38.84 |

Even on this adjusted metric, ezellm-lite stays within ~5% of CodeGen-mono (while beating it on raw compression with less than half the embedding parameters) and sits clearly above DeepSeek-Coder and GPT-2.
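Both the adjusted metric and the embedding-table arithmetic quoted below are one-line computations; this sketch (helper names are mine, not part of the repo) reproduces two table rows and the ezellm-lite table size at `d_model=1024`:

```python
import math

def bits_per_token(chars_per_token: float, vocab: int) -> float:
    """chars/token x log2(vocab): roughly the input bits carried per token."""
    return chars_per_token * math.log2(vocab)

def embedding_params(vocab: int, d_model: int) -> int:
    """Parameter count of one (untied) embedding or output-projection table."""
    return vocab * d_model

print(round(bits_per_token(3.081, 24_600), 2))  # ezellm-lite row: 44.94
print(round(bits_per_token(3.238, 49_152), 2))  # StarCoder2 row: 50.46
print(embedding_params(24_600, 1024))           # ~25M parameters
```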
At `d_model=1024`, the embedding/output table sizes are: ezellm-lite **25M**, DeepSeek-Coder **33M**, StarCoder2 / CodeGen-mono / GPT-2 **~50M**. For a small code LM (≤2B params), those tens of millions of softmax parameters are a real slice of the budget.

### Encoding speed

All five tokenizers are sub-millisecond on the 60–90 KB sample shards in this benchmark; encoding speed is not the bottleneck for any of them.

## Files

| File | Purpose |
|-----------------------|-----------------------------------------------|
| `tokenizer.json` | 🤗 Tokenizers / `transformers`-loadable model |
| `tokenizer_config.json` | Special-token metadata for `transformers` |
| `tiktoken.bpe` | tiktoken-format merge table |
| `tiktoken.json` | tiktoken metadata (pattern, special tokens) |

## Intended use

- Pretraining / fine-tuning small-to-mid code LMs (≤2B params) where vocabulary size dominates parameter count
- FIM-style training out of the box (FIM specials are pre-allocated)
- Repo-aware packing using `<|repo_name|>`, `<|filename|>`, `<|file_sep|>`

## Limitations

- Not optimized for non-English natural language; the corpus is English + code.
- Compression on punctuation-dense languages (C, HTML/CSS) is noticeably lower (fewer chars per token) than for Python or prose — budget context length accordingly.
- The 16 reserved slots are *unused* — they never appear in trained text, and you must register them explicitly if you want to repurpose them (e.g. as chat-template tags).

## License

Apache-2.0.