---
license: apache-2.0
language:
- en
- code
tags:
- tokenizer
- bpe
- code
- fim
- tiktoken
library_name: transformers
pipeline_tag: text-generation
---
# ezellm-lite-tokenizer

A **24,600-vocab** byte-level BPE tokenizer trained on a **142 GB code-heavy corpus**, with a Qwen2.5-Coder-style special-token layout (FIM + repo metadata + reserved slots). Designed for small-to-mid scale code language models where embedding-table size matters.

This is **v2** of the tokenizer.
## Quick start

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")
ids = tok.encode("def hello(name):\n print(f'Hello, {name}!')\n")
print(len(ids), tok.decode(ids))
```
## Vocabulary layout

Total vocab: **24,600**. The first 24,576 IDs are learned BPE merges; the top 24 IDs are reserved for control tokens.

| ID range | Tokens | Purpose |
|---------------|-----------------------------------------------------------|-------------------------------|
| 0 – 24,575 | Learned BPE pieces | Text/code |
| 24,576 | `<\|endoftext\|>` | EOS + PAD |
| 24,577 – 24,580 | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`, `<\|fim_pad\|>` | Fill-in-the-Middle training |
| 24,581 – 24,583 | `<\|file_sep\|>`, `<\|repo_name\|>`, `<\|filename\|>` | Repo-level packing |
| 24,584 – 24,599 | `<\|reserved_0\|>` … `<\|reserved_15\|>` | Reserved for downstream use |

Only `<|endoftext|>` is registered as `eos_token` / `pad_token` in `special_tokens_map`. The FIM and repo markers are added tokens but are **not** flagged "special": `tok.decode(ids, skip_special_tokens=True)` will only strip `<|endoftext|>`. Register the others explicitly if you want them stripped on decode.
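As an illustration, a PSM-style (prefix-suffix-middle) FIM sample can be assembled from the markers above. The marker ordering is an assumption modeled on the Qwen2.5-Coder-style layout this tokenizer mirrors; the token strings themselves come from the vocabulary table.

```python
# Assemble a hypothetical PSM-style FIM training sample from the control
# tokens listed above. The ordering (prefix, suffix, middle) is an assumption
# based on the Qwen2.5-Coder-style layout this card references.
prefix = "def add(a, b):\n    "
suffix = "\n"
middle = "return a + b"

fim_sample = (
    "<|fim_prefix|>" + prefix
    + "<|fim_suffix|>" + suffix
    + "<|fim_middle|>" + middle
    + "<|endoftext|>"
)
print(fim_sample)
```

At training time the model sees the middle span last, so it learns to infill between the given prefix and suffix.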
## Training data

- **Size:** ~142 GB of text and source code
- **Mix (heavily code-leaning):** Python, JavaScript/TypeScript, Java, C/C++, web (HTML/CSS), Markdown, math/scientific Python, technical prose
- **Algorithm:** Byte-level BPE (tiktoken-compatible; `tiktoken.bpe` and `tiktoken.json` are bundled alongside the standard `tokenizer.json`)

The tokenizer files in this repo can be loaded both via 🤗 `transformers` (`tokenizer.json`) and via `tiktoken` directly (`tiktoken.bpe`).
## Benchmarks

Compared against **size-matched, code-trained tokenizers** in the 24K–50K vocab range. Measured on a held-out multi-domain corpus (~618 KB across 10 categories: C, C++, Java, JavaScript, Markdown, math/Python, prose, general Python, HTML/CSS, raw outputs).

### Aggregate compression

Higher chars/token = better compression = shorter context for the same input.
| Tokenizer | Vocab | chars/token | tokens / 1k chars |
|----------------------|-------:|------------:|------------------:|
| **StarCoder2** | 49,152 | **3.238** | 308.9 |
| **ezellm-lite** | 24,600 | **3.081** | 324.6 |
| CodeGen-mono | 50,295 | 3.017 | 331.5 |
| DeepSeek-Coder | 32,022 | 2.836 | 352.6 |
| GPT-2 | 50,257 | 2.487 | 402.1 |

**ezellm-lite is the second-best tokenizer in this group**: within ~5% of StarCoder2 despite having **half the vocabulary**, ahead of CodeGen-mono and DeepSeek-Coder (which have 30–100% more vocab slots), and ~24% more compressive than GPT-2.
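The chars/token figures above are straightforward to reproduce with a helper like the sketch below (`encode` stands in for any tokenizer's encode function; the whitespace splitter is only an illustrative stand-in, not this tokenizer's BPE):

```python
def chars_per_token(text: str, encode) -> float:
    """Compression ratio: input characters divided by token count."""
    ids = encode(text)
    return len(text) / len(ids)

# Illustrative stand-in for a real tokenizer: whitespace splitting.
sample = "def hello(name):\n    print(f'Hello, {name}!')\n"
ratio = chars_per_token(sample, lambda t: t.split())
print(f"{ratio:.3f} chars/token, {1000 / ratio:.1f} tokens per 1k chars")
```

Swapping the lambda for `tok.encode` from the quick start reproduces the ezellm-lite row; the tokens-per-1k-chars column is simply `1000 / (chars/token)`.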
### Compression by category – characters per token

| Category | ezellm-lite (24.6K) | StarCoder2 (49K) | CodeGen-mono (50K) | DeepSeek-Coder (32K) | GPT-2 (50K) |
|----------------|--------------------:|-----------------:|-------------------:|---------------------:|------------:|
| c | 2.839 | 2.900 | 2.630 | 2.534 | 2.470 |
| cpp | 3.157 | 3.303 | 2.914 | 2.793 | 2.289 |
| java | 3.996 | 4.517 | 3.606 | 3.605 | 2.329 |
| javascript | 3.142 | 3.423 | 2.988 | 2.898 | 2.357 |
| markdown_docs | 3.117 | 3.272 | 3.125 | 2.906 | 2.986 |
| math_python | 2.630 | 2.695 | 2.482 | 2.387 | 2.136 |
| prose | 3.661 | 3.731 | 4.356 | 3.855 | 4.356 |
| python_general | 3.680 | 3.747 | 3.249 | 3.169 | 2.586 |
| web_html_css | 2.673 | 2.897 | 2.831 | 2.543 | 2.289 |

**Reading the table.** ezellm-lite wins or comes within a few percent of the leader on nearly every code category (Java is the exception, where StarCoder2 leads by ~13%), and it beats CodeGen-mono and DeepSeek-Coder consistently on Python, JavaScript, Java, and C++. StarCoder2 edges it out on most categories, which is expected given 2× the vocab and training on The Stack v2. The one place ezellm-lite clearly trails is **prose**, where the general-text-heavy vocabularies (CodeGen-mono and GPT-2 at 4.356, DeepSeek-Coder at 3.855) still win: a deliberate trade for a code-focused tokenizer.
### Efficiency per vocabulary slot

A 24K vocab tokenizer is at a structural disadvantage on raw chars/token. A fairer cross-vocab metric is **chars/token × log₂(vocab)**, roughly "input bits carried per token."

| Tokenizer | Vocab | chars/tok | chars/tok × log₂(V) |
|------------------|-------:|----------:|--------------------:|
| StarCoder2 | 49,152 | 3.238 | 50.46 |
| CodeGen-mono | 50,295 | 3.017 | 47.12 |
| **ezellm-lite** | 24,600 | 3.081 | **44.94** |
| DeepSeek-Coder | 32,022 | 2.836 | 42.45 |
| GPT-2 | 50,257 | 2.487 | 38.84 |

On this metric ezellm-lite clearly leads DeepSeek-Coder and GPT-2, and it **beats CodeGen-mono on raw chars/token outright while using less than half the embedding parameters**.

At `d_model=1024`, the embedding/output table sizes are: ezellm-lite **25M**, DeepSeek-Coder **33M**, StarCoder2 / CodeGen-mono / GPT-2 **~50M**. For a small code LM (≤2B params) those tens of millions of softmax parameters are real budget.
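Both the per-slot metric and the embedding-table sizes are easy to recompute; a quick sketch:

```python
import math

def bits_per_token(chars_per_tok: float, vocab: int) -> float:
    # The card's per-slot metric: chars/token * log2(vocab).
    return chars_per_tok * math.log2(vocab)

def embedding_params(vocab: int, d_model: int) -> int:
    # One vocab x d_model table (doubled if input/output embeddings are untied).
    return vocab * d_model

print(round(bits_per_token(3.081, 24_600), 2))  # per-slot score for ezellm-lite
print(embedding_params(24_600, 1024))           # ~25.2M parameters at d_model=1024
```

Plugging each row's chars/token and vocab into `bits_per_token` reproduces the last column of the table above.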
### Encoding speed

All five tokenizers are sub-millisecond on the 60–90 KB sample shards in this benchmark; encoding speed is not the bottleneck for any of them.
## Files

| File | Purpose |
|-----------------------|-----------------------------------------------|
| `tokenizer.json` | 🤗 Tokenizers / `transformers`-loadable model |
| `tokenizer_config.json` | Special-token metadata for `transformers` |
| `tiktoken.bpe` | tiktoken-format merge table |
| `tiktoken.json` | tiktoken metadata (pattern, special tokens) |
## Intended use

- Pretraining / fine-tuning small-to-mid code LMs (≤2B params) where vocabulary size dominates parameter count
- FIM-style training out of the box (FIM specials are pre-allocated)
- Repo-aware packing using `<|repo_name|>`, `<|filename|>`, `<|file_sep|>`
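A hypothetical packing sketch using those repo markers. The exact field order and separators are not documented in this card, so treat the layout below as illustrative only:

```python
# Illustrative repo-level packing with the card's metadata tokens. The exact
# layout (field order, separators) is an assumption, not a documented format.
files = {
    "src/a.py": "print('a')\n",
    "src/b.py": "print('b')\n",
}

packed = "<|repo_name|>" + "example/repo" + "\n"
for path, text in files.items():
    packed += "<|file_sep|>" + "<|filename|>" + path + "\n" + text
packed += "<|endoftext|>"
print(packed)
```

Because the repo markers are not flagged "special" (see the vocabulary layout section), remember to register them if you want them stripped on decode.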
## Limitations

- Not optimized for non-English natural language; the corpus is English + code.
- Compression on punctuation-dense languages (C, HTML/CSS) is noticeably lower (fewer chars/token) than on Python or prose; budget context length accordingly.
- The 16 reserved slots are *unused*: they will never appear in trained text and need to be registered explicitly if you want to repurpose them (e.g. chat-template tags).
## License

Apache-2.0.