---
language:
- en
- de
- fr
- es
- pt
- it
- nl
- pl
- ro
- cs
- sv
- da
- "no"
- fi
- hu
- hr
- bg
- tr
- ca
- ru
- uk
- sr
- zh
- ja
- ko
- ar
- fa
- he
- hi
- bn
- th
- vi
- ka
- hy
- el
- yi
- ur
- ta
- te
- gu
- pa
- ml
- kn
- am
- si
- my
- km
- mr
- ne
- or
- bo
- dv
- eu
- gl
- gd
- et
- sk
- lt
- sl
- lv
- af
- sq
- sw
- is
- tl
- cy
- ga
- br
- la
- mk
- id
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- bpe
- multilingual
- quartz
- aenea
- flores
pipeline_tag: text-generation
---
# QT_V.2 64K — Multilingual BPE Tokenizer
**The most equitable 64K tokenizer available.** 71 natural languages across 26 script families, at half the vocabulary of Llama 3, Tekken, and Qwen 2.5, yet with fewer total tokens on both FLORES-200 (204 languages) and our 66-test field benchmark.
Part of the **QT_V.2 tokenizer family** by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).
## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)
| Metric | QT 64K | QT 96K | QT Code 114K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| **Total tokens** | 13,592,357 | **12,961,617** | 13,007,924 | 16,764,198 | 14,421,539 | 15,425,680 |
| **Equity ratio** | **41.0×** | **31.6×** | 43.3× | 118.6× | 127.9× | 77.7× |
| Mean fertility (tok/word) | 4.18 | 3.94 | 4.03 | 5.72 | 5.34 | 4.91 |
The equity ratio divides the worst-served language's tokenization cost by the best-served language's (lower is fairer). At 41.0×, QT 64K is **2.9× more equitable than Llama 3** (118.6×) and **3.1× more equitable than Tekken** (127.9×), at half the vocabulary.
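For concreteness, here is a minimal sketch of how both metrics can be computed from per-language counts. The helper names are ours, and the word-counting convention for non-spaced scripts is an assumption, not the exact FLORES methodology:

```python
from statistics import mean

def fertility(n_tokens: int, n_words: int) -> float:
    """Tokens per word for one language (lower is better)."""
    return n_tokens / n_words

def equity_ratio(fertilities: dict) -> float:
    """Worst-served over best-served language (lower is fairer)."""
    return max(fertilities.values()) / min(fertilities.values())

# Illustrative numbers only, not benchmark data.
per_lang = {"en": 1.3, "de": 1.6, "bo": 42.5}
print(round(mean(per_lang.values()), 2))  # mean fertility: 15.13
print(round(equity_ratio(per_lang), 1))   # equity ratio: 32.7
```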
### Where QT 64K Dominates (FLORES-200 tok/word)
| Language | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Tibetan | **42.5** | 149.8 | 168.4 | 98.0 |
| Odia | **4.16** | 16.90 | 18.30 | 13.65 |
| Khmer | **17.1** | 40.9 | 70.5 | 30.7 |
| Georgian | **3.83** | 15.47 | 3.93 | 8.33 |
| Sinhala | **3.84** | 11.37 | 16.60 | 9.17 |
| Amharic | **3.90** | 11.95 | 11.98 | 6.45 |
## Field Benchmark (66 Tests)
| Metric | Value |
|---|---|
| **Total tokens** | **3,593** |
| vs Llama 3 (128K) | 36.3% fewer tokens |
| vs Tekken (131K) | 17.3% fewer tokens |
| vs Qwen 2.5 (152K) | 30.7% fewer tokens |
## When to Use This Variant
**QT_V.2 64K** is ideal when you need the smallest possible embedding table — for parameter-constrained small models, edge deployment, or when every MB of VRAM matters.
Also available: [QT_V.2 96K](https://huggingface.co/QuartzOpen/QT_V.2_96K) (best all-round) · [QT_V.2 Code 114K](https://huggingface.co/QuartzOpen/QT_V.2_Code_114K) (multilingual coding)
## Usage
```python
from tokenizers import Tokenizer

# Load the tokenizer from the tokenizer.json shipped in this repo.
tok = Tokenizer.from_file("tokenizer.json")

encoded = tok.encode("The quick brown fox jumps over the lazy dog")
print(encoded.tokens)  # byte-level token strings
print(encoded.ids)     # corresponding vocabulary ids
```
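If you are pulling from the Hub rather than a local clone, the tokenizer can also be loaded by repo id. This is a sketch assuming the repo id follows the QuartzOpen naming used for the sibling variants linked below:

```python
from tokenizers import Tokenizer

# Repo id assumed from the QuartzOpen/QT_V.2_* naming of the sibling variants.
tok = Tokenizer.from_pretrained("QuartzOpen/QT_V.2_64K")
print(tok.get_vocab_size())  # expected: 64000
```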
## Specifications
| Spec | Value |
|---|---|
| Vocabulary | 64,000 |
| Languages | 71 natural + 14 code |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 5.7 chars |
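The single-digit arithmetic split means multi-digit numbers never fuse into one token. A quick check, assuming `tokenizer.json` is available locally:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
# With single-digit splitting, each digit should surface as its own token.
print(tok.encode("In 2026 we bought 365 apples").tokens)
# Expected: '2', '0', '2', '6' and '3', '6', '5' appear separately.
```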
## Training
Byte-level BPE with the Llama 3 regex pretokenizer. Corpus: 58.5% Wikipedia (71 languages via wiki_ultra_clean v7.3), 18.0% code (14 languages), 23.5% Stack Exchange (49 sites via se_ultra_clean v1).
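This description is enough to reproduce the setup approximately with the `tokenizers` library. In the sketch below, `LLAMA3_SPLIT_PATTERN` is a placeholder for the actual Llama 3 split regex (with its multi-digit group reduced to single digits), and the training file list is a stand-in for the corpus mix described above:

```python
from tokenizers import Tokenizer, Regex, models, pre_tokenizers, decoders, trainers

# Placeholder: substitute the real Llama 3 split regex here, with the
# digit group narrowed so numbers split one digit at a time.
LLAMA3_SPLIT_PATTERN = r"..."

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(Regex(LLAMA3_SPLIT_PATTERN), behavior="isolated"),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=64_000,
    max_token_length=15,  # matches the spec table above
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(["corpus.txt"], trainer)  # stand-in for the real corpus files
tokenizer.save("tokenizer.json")
```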
## Files
`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`
## Contact
Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com
## License
Apache 2.0. Copyright 2025-2026 AENEA Global Ltd.

## Citation

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```