---
license: apache-2.0
language:
- en
- code
tags:
- tokenizer
- bpe
- code
- fim
- tiktoken
library_name: transformers
pipeline_tag: text-generation
---

# ezellm-lite-tokenizer

A **24,600-vocab** byte-level BPE tokenizer trained on a **142 GB code-heavy corpus**, with a Qwen2.5-Coder-style special-token layout (FIM + repo metadata + reserved slots). Designed for small-to-mid scale code language models where embedding-table size matters.

This is **v2** of the tokenizer.

## Quick start

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

ids = tok.encode("def hello(name):\n    print(f'Hello, {name}!')\n")
print(len(ids), tok.decode(ids))
```

## Vocabulary layout

Total vocab: **24,600**. The first 24,576 IDs are learned BPE merges; the top 24 IDs are reserved for control tokens.

| ID range      | Tokens                                                    | Purpose                       |
|---------------|-----------------------------------------------------------|-------------------------------|
| 0 – 24,575    | Learned BPE pieces                                        | Text/code                     |
| 24,576        | `<\|endoftext\|>`                                         | EOS + PAD                     |
| 24,577 – 24,580 | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`, `<\|fim_pad\|>` | Fill-in-the-Middle training |
| 24,581 – 24,583 | `<\|file_sep\|>`, `<\|repo_name\|>`, `<\|filename\|>`     | Repo-level packing            |
| 24,584 – 24,599 | `<\|reserved_0\|>` … `<\|reserved_15\|>`                 | Reserved for downstream use   |

Only `<|endoftext|>` is registered as `eos_token` / `pad_token` in `special_tokens_map`. The FIM and repo markers are added tokens but are **not** flagged "special", so `tok.decode(ids, skip_special_tokens=True)` will only strip `<|endoftext|>`. Register the others explicitly if you want them stripped on decode.
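
If you do want the FIM and repo markers stripped on decode, a minimal sketch, assuming a recent `transformers` version; older versions may require editing `tokenizer_config.json` instead:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

# Promote the FIM / repo markers to "special" status. They already exist in the
# added-token table, so the vocabulary size does not change.
tok.add_special_tokens({
    "additional_special_tokens": [
        "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|fim_pad|>",
        "<|file_sep|>", "<|repo_name|>", "<|filename|>",
    ]
})

ids = tok.encode("<|fim_prefix|>def add(a, b):<|fim_suffix|><|fim_middle|>")
print(tok.decode(ids, skip_special_tokens=True))  # markers are now stripped
```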

## Training data

- **Size:** ~142 GB of text and source code
- **Mix (heavily code-leaning):** Python, JavaScript/TypeScript, Java, C/C++, web (HTML/CSS), Markdown, math/scientific Python, technical prose
- **Algorithm:** Byte-level BPE (tiktoken-compatible; `tiktoken.bpe` and `tiktoken.json` are bundled alongside the standard `tokenizer.json`)

The tokenizer files in this repo can be loaded both via 🤗 `transformers` (`tokenizer.json`) and via `tiktoken` directly (`tiktoken.bpe`).
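
A sketch of the `tiktoken` path. The exact schema of `tiktoken.json` is an assumption here (keys `pat_str` and `special_tokens`); check the bundled file before relying on it:

```python
import json

import tiktoken
from tiktoken.load import load_tiktoken_bpe

# Assumes both files have been downloaded locally, e.g. with huggingface_hub.
mergeable_ranks = load_tiktoken_bpe("tiktoken.bpe")

with open("tiktoken.json") as f:
    meta = json.load(f)  # assumed keys: "pat_str", "special_tokens"

enc = tiktoken.Encoding(
    name="ezellm-lite",
    pat_str=meta["pat_str"],
    mergeable_ranks=mergeable_ranks,
    special_tokens=meta["special_tokens"],  # e.g. {"<|endoftext|>": 24576, ...}
)

ids = enc.encode("def hello(name):\n    print(name)\n")
print(len(ids), enc.decode(ids))
```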

## Benchmarks

Compared against **size-matched, code-trained tokenizers** in the 24K–50K vocab range. Measured on a held-out multi-domain corpus (~618 KB across 10 categories: C, C++, Java, JavaScript, Markdown, math/Python, prose, general Python, HTML/CSS, raw outputs).
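
The numbers that follow are plain ratios over that corpus; a sketch of how they can be reproduced for any tokenizer (the shard path below is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

def compression(text: str) -> tuple[float, float]:
    """Return (chars per token, tokens per 1k chars) for one sample."""
    n_tokens = len(tok.encode(text))
    return len(text) / n_tokens, 1000 * n_tokens / len(text)

sample = open("eval_shards/python_general.txt").read()  # placeholder shard path
chars_per_tok, toks_per_1k = compression(sample)
print(f"{chars_per_tok:.3f} chars/token, {toks_per_1k:.1f} tokens/1k chars")
```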

### Aggregate compression

Higher chars/token = better compression = shorter context for the same input.

| Tokenizer            |  Vocab | chars/token | tokens / 1k chars |
|----------------------|-------:|------------:|------------------:|
| **StarCoder2**       | 49,152 |   **3.238** |             308.9 |
| **ezellm-lite**      | 24,600 |   **3.081** |             324.6 |
| CodeGen-mono         | 50,295 |       3.017 |             331.5 |
| DeepSeek-Coder       | 32,022 |       2.836 |             352.6 |
| GPT-2                | 50,257 |       2.487 |             402.1 |

**ezellm-lite is the second-best tokenizer in this group**: within ~5% of StarCoder2 despite having **half the vocabulary**, ahead of CodeGen-mono and DeepSeek-Coder (which both have 30–100% more vocab slots), and ~24% more compressive than GPT-2.

### Compression by category β€” characters per token

| Category       | ezellm-lite (24.6K) | StarCoder2 (49K) | CodeGen-mono (50K) | DeepSeek-Coder (32K) | GPT-2 (50K) |
|----------------|--------------------:|-----------------:|-------------------:|---------------------:|------------:|
| c              |               2.839 |            2.900 |              2.630 |                2.534 |       2.470 |
| cpp            |               3.157 |            3.303 |              2.914 |                2.793 |       2.289 |
| java           |               3.996 |            4.517 |              3.606 |                3.605 |       2.329 |
| javascript     |               3.142 |            3.423 |              2.988 |                2.898 |       2.357 |
| markdown_docs  |               3.117 |            3.272 |              3.125 |                2.906 |       2.986 |
| math_python    |               2.630 |            2.695 |              2.482 |                2.387 |       2.136 |
| prose          |               3.661 |            3.731 |              4.356 |                3.855 |       4.356 |
| python_general |               3.680 |            3.747 |              3.249 |                3.169 |       2.586 |
| web_html_css   |               2.673 |            2.897 |              2.831 |                2.543 |       2.289 |

**Reading the table.** ezellm-lite either wins or comes within a few percent of the leader on every code category, beating CodeGen-mono and DeepSeek-Coder consistently on Python, JavaScript, Java, and C++. StarCoder2 edges it out on most categories (expected: 2× the vocab, trained on The Stack v2). The one place ezellm-lite clearly trails is **prose**, where the prose-heavy GPT-2 vocabulary still wins; a deliberate trade for a code-focused tokenizer.

### Efficiency per vocabulary slot

A 24K-vocab tokenizer is at a structural disadvantage on raw chars/token. A fairer cross-vocab metric is **chars/token × log₂(vocab)**, roughly "input bits carried per token."

| Tokenizer        |  Vocab | chars/tok | chars/tok × log₂(V) |
|------------------|-------:|----------:|--------------------:|
| StarCoder2       | 49,152 |     3.238 |               50.46 |
| CodeGen-mono     | 50,295 |     3.017 |               47.12 |
| **ezellm-lite**  | 24,600 |     3.081 |           **44.94** |
| DeepSeek-Coder   | 32,022 |     2.836 |               42.45 |
| GPT-2            | 50,257 |     2.487 |               38.84 |

On raw compression ezellm-lite edges out CodeGen-mono while using less than half the embedding parameters, and on this per-slot measure it sits clearly above DeepSeek-Coder and GPT-2.

At `d_model=1024`, the embedding/output table sizes are: ezellm-lite **25M**, DeepSeek-Coder **33M**, StarCoder2 / CodeGen-mono / GPT-2 **~50M**. For a small code LM (≤2B params) those tens of millions of softmax parameters are a real chunk of the parameter budget.
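
The per-slot metric and the embedding-table sizes above are pure arithmetic; a quick sanity check:

```python
import math

d_model = 1024
tokenizers = {
    # name: (vocab size, measured chars/token from the aggregate table)
    "ezellm-lite":    (24_600, 3.081),
    "DeepSeek-Coder": (32_022, 2.836),
    "StarCoder2":     (49_152, 3.238),
}

for name, (vocab, cpt) in tokenizers.items():
    bits_per_token = cpt * math.log2(vocab)   # "input bits carried per token"
    embedding_params = vocab * d_model        # one embedding (or output) matrix
    print(f"{name:15s} {bits_per_token:6.2f}   {embedding_params / 1e6:5.1f}M params")
```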

### Encoding speed

All five tokenizers are sub-millisecond on the 60–90 KB sample shards in this benchmark; encoding speed is not the bottleneck for any of them.

## Files

| File                  | Purpose                                       |
|-----------------------|-----------------------------------------------|
| `tokenizer.json`      | 🤗 Tokenizers / `transformers`-loadable model |
| `tokenizer_config.json` | Special-token metadata for `transformers`   |
| `tiktoken.bpe`        | tiktoken-format merge table                   |
| `tiktoken.json`       | tiktoken metadata (pattern, special tokens)   |

## Intended use

- Pretraining / fine-tuning small-to-mid code LMs (≤2B params) where vocabulary size dominates parameter count
- FIM-style training out of the box (FIM specials are pre-allocated)
- Repo-aware packing using `<|repo_name|>`, `<|filename|>`, `<|file_sep|>`
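
For illustration, two hypothetical helpers (not part of this repo) showing one plausible way to assemble a repo-packed document and a PSM-style FIM sample from the pre-allocated markers; the exact layout is a training-recipe choice, not something the tokenizer enforces:

```python
def pack_repo(repo_name: str, files: dict[str, str]) -> str:
    # Hypothetical layout following the Qwen2.5-Coder-style convention this
    # tokenizer copies; adjust to your own training recipe.
    parts = [f"<|repo_name|>{repo_name}"]
    for path, source in files.items():
        parts.append(f"<|file_sep|>{path}\n{source}")
    return "\n".join(parts) + "<|endoftext|>"

def pack_fim(prefix: str, middle: str, suffix: str) -> str:
    # Prefix-Suffix-Middle ordering; some recipes use Suffix-Prefix-Middle instead.
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"

doc = pack_repo("acme/webapp", {"src/app.py": "print('hi')\n"})
fim = pack_fim("def add(a, b):\n    return ", "a + b", "\n")
```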

## Limitations

- Not optimized for non-English natural language; the corpus is English + code.
- Compression on punctuation-dense languages (C, HTML/CSS) is noticeably weaker (fewer chars per token) than for Python or prose; budget context length accordingly.
- The 16 reserved slots are *unused*: they never appear in the training text, and you must register them explicitly if you want to repurpose them (e.g. as chat-template tags).

## License

Apache-2.0.