	---
	language:
	- en
	- de
	- fr
	- es
	- pt
	- it
	- nl
	- pl
	- ro
	- cs
	- sv
	- da
	- "no"
	- fi
	- hu
	- hr
	- bg
	- tr
	- ca
	- ru
	- uk
	- sr
	- zh
	- ja
	- ko
	- ar
	- fa
	- he
	- hi
	- bn
	- th
	- vi
	- ka
	- hy
	- el
	- yi
	- ur
	- ta
	- te
	- gu
	- pa
	- ml
	- kn
	- am
	- si
	- my
	- km
	- mr
	- ne
	- or
	- bo
	- dv
	- eu
	- gl
	- gd
	- et
	- sk
	- lt
	- sl
	- lv
	- af
	- sq
	- sw
	- is
	- tl
	- cy
	- ga
	- br
	- la
	- mk
	- id
	license: apache-2.0
	library_name: tokenizers
	tags:
	- tokenizer
	- bpe
	- multilingual
	- quartz
	- aenea
	- flores
	pipeline_tag: text-generation
	---

	# QT_V.2 96K — Best All-Round Multilingual Tokenizer

	Fewest total tokens on FLORES-200 of any tokenizer tested. A 96,000-token vocabulary covers 71 languages and 26 script families. By equity ratio it is the most equitable tokenizer tested, roughly 4× fairer than both Llama 3 and Tekken, with a vocabulary 25–37% smaller than those of Llama 3, Tekken, and Qwen 2.5.

	Part of the QT_V.2 tokenizer family by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).

	## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

	| Metric | QT 96K | QT Code 114K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
	|---|---|---|---|---|---|---|
	| Total tokens | 12,961,617 ✓ | 13,007,924 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
	| Equity ratio | 31.6× ✓ | 43.3× | 41.0× | 118.6× | 127.9× | 77.7× |
	| Mean fertility (tok/word) | 3.94 ✓ | 4.03 | 4.18 | 5.72 | 5.34 | 4.91 |
	| Worst language (tok/word) | lao (43.0) | lao (58.0) | lao (58.0) | bod (149.8) | bod (168.4) | bod (98.0) |

	QT 96K wins on total tokens, equity, and mean fertility. The 31.6× equity ratio means the worst-served language costs 31.6× as many tokens per word as the best-served, compared with 118.6× for Llama 3 and 127.9× for Tekken. Llama 3's worst language (Tibetan at 149.8 tok/word) is 3.6× more expensive than the same language under QT 96K (41.1 tok/word).
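
The arithmetic behind these metrics is short enough to sketch. The code below assumes that fertility is tokens per whitespace-separated word and that the equity ratio is the highest per-language fertility divided by the lowest, as the paragraph above describes it. The helper names `fertility` and `equity_ratio` are illustrative and not part of any released evaluation script; the `example` dictionary is hypothetical data.

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Tokens per word (assumed definition: whitespace-separated words)."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return num_tokens / num_words


def equity_ratio(fertility_by_lang: dict[str, float]) -> float:
    """Worst-served fertility divided by best-served fertility."""
    values = fertility_by_lang.values()
    return max(values) / min(values)


# Tibetan tok/word figures quoted in the text above.
llama3_bod = 149.8
qt96k_bod = 41.1
tibetan_cost_ratio = round(llama3_bod / qt96k_bod, 1)
print(tibetan_cost_ratio)  # 3.6

# Hypothetical per-language fertilities, for illustration only.
example = {"lang_a": 1.5, "lang_b": 3.0, "lang_c": 6.0}
print(equity_ratio(example))  # 4.0
```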

	### Script Family Averages (FLORES-200 tok/word)

	| Script Family | QT 96K | Llama 3 | Tekken | Qwen 2.5 |
	|---|---|---|---|---|
	| Latin (37 langs) | 2.20 | 2.39 | 2.20 | 2.41 |
	| Cyrillic (5) | 2.23 | 2.59 | 2.27 | 2.99 |
	| CJK (4) | 17.17 | 19.75 | 21.36 | 17.26 |
	| Indic Other (9) | 4.21 | 12.42 | 6.77 | 10.37 |
	| SE Asian (4) | 20.70 | 31.08 | 38.22 | 24.04 |
	| Unique Scripts (6) | 9.35 | 32.96 | 32.05 | 21.39 |

	QT 96K is about 3× more efficient than Llama 3 on the Indic Other group (4.21 vs 12.42 tok/word) and about 3.5× more efficient on unique scripts (9.35 vs 32.96 tok/word; Georgian, Armenian, Tibetan, Amharic, Hebrew, Greek).
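
To check these multipliers against the table, one can divide each baseline's tok/word by QT 96K's for the same script family. The snippet below is a small sketch of that arithmetic; the values are copied from the table above, and `relative_cost` is an illustrative helper name, not part of any published tooling.

```python
# tok/word values copied from the script-family table above.
table = {
    "Latin":          {"QT 96K": 2.20,  "Llama 3": 2.39,  "Tekken": 2.20,  "Qwen 2.5": 2.41},
    "Cyrillic":       {"QT 96K": 2.23,  "Llama 3": 2.59,  "Tekken": 2.27,  "Qwen 2.5": 2.99},
    "CJK":            {"QT 96K": 17.17, "Llama 3": 19.75, "Tekken": 21.36, "Qwen 2.5": 17.26},
    "Indic Other":    {"QT 96K": 4.21,  "Llama 3": 12.42, "Tekken": 6.77,  "Qwen 2.5": 10.37},
    "SE Asian":       {"QT 96K": 20.70, "Llama 3": 31.08, "Tekken": 38.22, "Qwen 2.5": 24.04},
    "Unique Scripts": {"QT 96K": 9.35,  "Llama 3": 32.96, "Tekken": 32.05, "Qwen 2.5": 21.39},
}


def relative_cost(family: str, baseline: str, reference: str = "QT 96K") -> float:
    """How many times more tokens per word `baseline` needs than `reference`."""
    row = table[family]
    return row[baseline] / row[reference]


# Print one row per script family, one column per baseline tokenizer.
for family in table:
    ratios = ", ".join(
        f"{name}: {relative_cost(family, name):.2f}x"
        for name in ("Llama 3", "Tekken", "Qwen 2.5")
    )
    print(f"{family:15s} {ratios}")
```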

	## Field Benchmark (66 Tests)

	| Metric | Value |
	|---|---|
	| Total tokens | 3,339 |
	| vs Llama 3 (128K) | 40.8% fewer tokens |
	| vs Tekken (131K) | 23.2% fewer tokens |
	| vs Qwen 2.5 (152K) | 35.6% fewer tokens |

	QT 96K wins 6 of 9 benchmark categories: V1 Expansion, V2 New Scripts, V2 Gap-closers, V2 Latin Wikis, Celtic/Brythonic, and Natural Languages (within 1% of Tekken).

	## When to Use This Variant

	QT_V.2 96K is the recommended general-purpose tokenizer. It offers the best balance between vocabulary size and token compression across all language families in the QT_V.2 family, and is intended for production multilingual models serving diverse user populations.

	Also available: [QT_V.2 64K](https://huggingface.co/QuartzOpen/QT_V.2_64K) (smallest embedding) · [QT_V.2 Code 114K](https://huggingface.co/QuartzOpen/QT_V.2_Code_114K) (multilingual coding)

	## Usage

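	```python
	from tokenizers import Tokenizer

	# Load the tokenizer definition shipped in this repository.
	tok = Tokenizer.from_file("tokenizer.json")
	encoded = tok.encode("The quick brown fox jumps over the lazy dog")
	print(encoded.tokens)  # token strings
	print(encoded.ids)     # matching vocabulary ids
	```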
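
To gauge compression on your own text, a characters-per-token ratio can be computed from any encoder. The sketch below is an assumption about how such a figure might be measured, not the procedure used for the numbers on this card. `chars_per_token` and `whitespace_encode` are illustrative names, and the whitespace encoder is only a stand-in so the example runs without `tokenizer.json`.

```python
from typing import Callable, Iterable, List


def chars_per_token(texts: Iterable[str], encode: Callable[[str], List[str]]) -> float:
    """Total characters divided by total tokens over a set of texts."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("no tokens produced")
    return total_chars / total_tokens


# Stand-in encoder so the sketch runs without tokenizer.json;
# it is NOT the QT_V.2 tokenizer.
def whitespace_encode(text: str) -> List[str]:
    return text.split()


sample = ["The quick brown fox", "jumps over the lazy dog"]
ratio = chars_per_token(sample, whitespace_encode)
print(f"{ratio:.2f} chars/token")
```

With the real tokenizer loaded as `tok`, pass `lambda s: tok.encode(s).tokens` as the `encode` argument instead of the stand-in.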

	## Specifications

	| Spec | Value |
	|---|---|
	| Vocabulary | 96,000 |
	| Languages | 71 natural + 14 code |
	| Script families | 26 |
	| Pretokenizer | Llama 3 regex |
	| Arithmetic | Single-digit splitting |
	| Max token length | 15 chars |
	| Avg token length | 6.1 chars |
	| Compression | 3.17 chars/token |
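
As an illustration of what single-digit splitting means in practice, the sketch below isolates every digit as its own piece before any merges could apply. The pattern is a deliberately simplified assumption written for this example. It is not the Llama 3 regex and not the pretokenizer shipped in `tokenizer.json`.

```python
import re

# Illustrative pattern only: each digit becomes its own piece and everything
# else is kept as runs of non-digit characters.
DIGIT_SPLIT = re.compile(r"\d|\D+")


def split_digits(text: str) -> list[str]:
    """Split text so that no piece contains more than one digit."""
    return DIGIT_SPLIT.findall(text)


pieces = split_digits("Order 2025 costs 19.99")
print(pieces)
# ['Order ', '2', '0', '2', '5', ' costs ', '1', '9', '.', '9', '9']
```

Because digits are never merged into multi-digit tokens, a number such as 2025 always costs one token per digit, which keeps arithmetic inputs uniform at the price of longer sequences.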

	## Training

	Byte-level BPE with Llama 3 regex pretokenizer. Corpus: 57.1% Wikipedia (71 languages via wiki_ultra_clean v7.3), 21.0% code (14 languages, boosted +25%), 21.9% Stack Exchange (49 sites). Top-10 European languages boosted +10%, Hindi/Bengali +15%.
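
For readers unfamiliar with byte-level BPE, the toy sketch below shows the core loop: start from UTF-8 bytes, repeatedly find the most frequent adjacent pair, and merge it into a new symbol. It is a teaching example under simplifying assumptions (input is already split into words, no regex pretokenizer, no corpus weighting) and is not the pipeline used to train QT_V.2. All function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent adjacent pair across all sequences, or None."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` with `new_symbol`."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `words`."""
    sequences = [[bytes([b]) for b in w.encode("utf-8")] for w in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        new_symbol = pair[0] + pair[1]  # concatenate the two byte strings
        merges.append(pair)
        sequences = [merge_pair(s, pair, new_symbol) for s in sequences]
    return merges, sequences


merges, encoded = train_bpe(["low", "lower", "lowest", "low"], num_merges=2)
print(merges)
print(encoded)
```

After two merges the shared prefix `low` has become a single symbol in every word, which is the mechanism by which frequent substrings in the training corpus end up as vocabulary entries.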

	## Files

	`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`

	## Contact

	Open-source: quartzopensource@gmail.com
	Commercial licensing & enterprise: commercial@aeneaglobal.com

	## License

	Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd

	```bibtex
	@misc{qt_v2_2026,
	title={QT_V.2: A Multilingual BPE Tokenizer Family},
	author={AENEA Global Ltd},
	year={2026},
	url={https://quartz.host},
	}
	```