	---
	language:
	- en
	- de
	- fr
	- es
	- pt
	- it
	- nl
	- pl
	- ro
	- cs
	- sv
	- da
	- "no"
	- fi
	- hu
	- hr
	- bg
	- tr
	- ca
	- ru
	- uk
	- sr
	- zh
	- ja
	- ko
	- ar
	- fa
	- he
	- hi
	- bn
	- th
	- vi
	- ka
	- hy
	- el
	- yi
	- ur
	- ta
	- te
	- gu
	- pa
	- ml
	- kn
	- am
	- si
	- my
	- km
	- mr
	- ne
	- or
	- bo
	- dv
	- eu
	- gl
	- gd
	- et
	- sk
	- lt
	- sl
	- lv
	- af
	- sq
	- sw
	- is
	- tl
	- cy
	- ga
	- br
	- la
	- mk
	- id
	- code
	license: apache-2.0
	library_name: tokenizers
	tags:
	- tokenizer
	- bpe
	- multilingual
	- code
	- quartz
	- aenea
	- coding
	- python
	- flores
	pipeline_tag: text-generation
	---

	# QT_V.2 Code 114K — Multilingual Coding Tokenizer

	QT_V.2 Code 114K produces the lowest total token count on our 66-test field benchmark of any tokenizer tested, at any vocabulary size. Its 114,688-entry vocabulary is optimised for multilingual coding models and was trained with doubled code weight (37% of the corpus), including 450K high-quality Python functions from CodeSearchNet. It beats Llama 3, Tekken, and Qwen 2.5 on total tokens while using roughly 10–25% less vocabulary, and it is validated on FLORES-200 across 204 languages.

	Part of the QT_V.2 tokenizer family by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).

	## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

	| Metric | QT Code 114K | QT 96K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
	|---|---|---|---|---|---|---|
	| Total tokens | 13,007,924 | 12,961,617 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
	| Equity ratio | 43.3× | 31.6× | 41.0× | 118.6× | 127.9× | 77.7× |
	| Mean fertility | 4.03 | 3.94 | 4.18 | 5.72 | 5.34 | 4.91 |

	Across all 204 FLORES languages, QT Code 114K uses 22.4% fewer tokens than Llama 3, 15.7% fewer than Qwen 2.5, and 9.8% fewer than Tekken, with a vocabulary roughly 10–25% smaller.
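
The percentages quoted for FLORES-200 follow directly from the "Total tokens" row. A minimal sketch of that arithmetic, using only the totals in the table (`percent_fewer` is an illustrative helper name, not part of any QT_V.2 tooling):

```python
# Total FLORES-200 tokens, copied from the "Total tokens" row of the table.
totals = {
    "QT Code 114K": 13_007_924,
    "Llama 3": 16_764_198,
    "Tekken": 14_421_539,
    "Qwen 2.5": 15_425_680,
}


def percent_fewer(ours: int, theirs: int) -> float:
    """Relative saving of `ours` versus `theirs`, in percent."""
    return 100.0 * (1.0 - ours / theirs)


baseline = totals["QT Code 114K"]
savings = {
    name: round(percent_fewer(baseline, count), 1)
    for name, count in totals.items()
    if name != "QT Code 114K"
}

for name, saving in savings.items():
    print(f"vs {name}: {saving}% fewer tokens")
```

The same helper can be applied to the field-benchmark totals further down to reproduce the figures quoted there.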

	### Key FLORES Languages (tok/word)

	| Language | QT Code | Llama 3 | Tekken | Qwen 2.5 |
	|---|---|---|---|---|
	| Japanese | 32.1 | 38.9 | 41.3 | 35.8 |
	| Tibetan | 46.5 | 149.8 | 168.4 | 98.0 |
	| Sinhala | 3.58 | 11.37 | 16.60 | 9.17 |
	| Amharic | 3.40 | 11.95 | 11.98 | 6.45 |
	| Georgian | 3.46 | 15.47 | 3.93 | 8.33 |
	| Odia | 4.10 | 16.90 | 18.30 | 13.65 |
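
The tok/word figures are fertility values: tokens produced per word. A rough sketch of how such a number could be computed is shown here, assuming words are whitespace-delimited. The card does not state its word-segmentation rule, so this is an assumption; whitespace splitting could explain the very high values for scripts written largely without spaces, such as Japanese and Tibetan. `fertility` and `toy_tokenize` are illustrative names, not library functions.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], sentences: Iterable[str]) -> float:
    """Tokens per whitespace-delimited word over a set of sentences."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_tokens += len(tokenize(sentence))
        n_words += len(sentence.split())
    if n_words == 0:
        raise ValueError("no words found")
    return n_tokens / n_words


# Stand-in tokenizer for illustration only: cuts every word into 2-character pieces.
def toy_tokenize(text: str) -> List[str]:
    return [word[i:i + 2] for word in text.split() for i in range(0, len(word), 2)]


print(fertility(toy_tokenize, ["tokenizers split words"]))
```

With the real tokenizer loaded as in the Usage section, `lambda s: tok.encode(s).tokens` can be passed as `tokenize`.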

	## Field Benchmark (66 Tests)

	| Metric | Value |
	|---|---|
	| Total tokens | 3,314 (lowest of any tokenizer) |
	| vs Llama 3 (128K) | 41.2% fewer tokens |
	| vs Tekken (131K) | 23.8% fewer tokens |
	| vs Qwen 2.5 (152K) | 36.1% fewer tokens |

	### Code Performance

	| Language | QT Code | QT 96K | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
	|---|---|---|---|---|---|---|
	| Python | 110 | 115 | 125 | 97 | 112 | 105 |
	| JavaScript | 67 | 71 | 71 | 65 | 69 | 64 |
	| Rust | 111 | 113 | 117 | 108 | 111 | 107 |

	Python compression improved from 125 (64K) to 115 (96K) to 110 (Code 114K) — closing the gap versus Llama 3's 97 from 28.9% to 13.4%.

	### Category Totals (lower is better)

	| Category | QT Code | Llama 3 | Tekken | Qwen 2.5 |
	|---|---|---|---|---|
	| Natural Languages (20) | 1,033 | 1,599 | 1,038 | 1,535 |
	| V1 Expansion (14) | 662 | 1,758 | 1,092 | 1,509 |
	| V2 New Scripts (3) | 188 | 692 | 740 | 523 |
	| Celtic / Brythonic (8) | 312 | 391 | 341 | 384 |
	| Code (3) | 288 | 270 | 292 | 276 |
	| TOTAL (66 tests) | 3,314 | 5,639 | 4,347 | 5,183 |

	The five categories listed cover 48 of the 66 tests, so their rows do not add up to the TOTAL row, which counts all 66.

	## When to Use This Variant

	QT_V.2 Code 114K is designed for multilingual coding assistants and code generation models. It wins Natural Languages outright (1,033 — beating Tekken's 1,038) while offering competitive code compression. Ideal for models that must serve both code and diverse natural language users.

	Also available: [QT_V.2 64K](https://huggingface.co/QuartzOpen/QT_V.2_64K) (smallest embedding) · [QT_V.2 96K](https://huggingface.co/QuartzOpen/QT_V.2_96K) (best all-round)

	## Usage

	```python
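	from tokenizers import Tokenizer

	# Load the tokenizer definition shipped in this repository.
	tok = Tokenizer.from_file("tokenizer.json")
	# Encode a short Python function and inspect the resulting tokens.
	encoded = tok.encode("def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)")
	print(encoded.tokens)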
	```
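
To check compression on your own text, the same `encode` call can be wrapped in a small helper. This sketch is not part of the QT_V.2 release: `chars_per_token` is a hypothetical name, and it assumes only an object whose `encode(text)` result exposes `.ids`, as the `Encoding` returned by the `tokenizers` library does.

```python
def chars_per_token(tokenizer, texts):
    """Average number of characters covered by one token across `texts`."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(tokenizer.encode(text).ids)
    if total_tokens == 0:
        raise ValueError("no tokens produced")
    return total_chars / total_tokens
```

For example, `chars_per_token(tok, ["def add(a, b):\n    return a + b"])` with `tok` loaded as above. The text behind the 3.60 chars/token figure in the Specifications table is not stated, so results on other data may differ.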

	## Specifications

	| Spec | Value |
	|---|---|
	| Vocabulary | 114,688 |
	| Languages | 71 natural + 15 code (incl. CodeSearchNet) |
	| Script families | 26 |
	| Pretokenizer | Llama 3 regex |
	| Arithmetic | Single-digit splitting |
	| Max token length | 15 chars |
	| Avg token length | 6.24 chars |
	| Compression | 3.60 chars/token |
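
"Single-digit splitting" means each digit is isolated as its own piece before BPE merges are applied, so merges presumably cannot join digits into multi-digit tokens. A simplified illustration with Python's `re` module follows. It is not the tokenizer's actual pattern, which the card describes only as a Llama 3 regex, and `split_digits` is a hypothetical helper.

```python
import re
from typing import List

# One digit per match, or a maximal run of non-digit characters.
_DIGIT_OR_TEXT = re.compile(r"\d|\D+")


def split_digits(text: str) -> List[str]:
    """Split text so that every digit stands alone as its own piece."""
    return _DIGIT_OR_TEXT.findall(text)


print(split_digits("x = 2025 + 17"))
# ['x = ', '2', '0', '2', '5', ' + ', '1', '7']
```

The pieces always concatenate back to the input, so the split loses no information.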

	## Training

	The tokenizer is a byte-level BPE model with the Llama 3 regex pretokenizer, trained on a code-heavy corpus:

	| Category | Share | Sources |
	|---|---|---|
	| Wikipedia | 37.3% | 71 languages (wiki_ultra_clean v7.3) |
	| Code | 37.4% | 14 languages + CodeSearchNet Python (450K functions) |
	| Stack Exchange | 25.3% | 49 sites (se_ultra_clean v1) |
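
One way to realise a mix like this when assembling training data is to draw each document's source category with probability proportional to its share. The sketch below is an assumption about how such a mix could be sampled, not a description of the QT_V.2 pipeline; `sample_categories` is a hypothetical name.

```python
import random
from collections import Counter

# Corpus shares from the table above, in percent.
SHARES = {"Wikipedia": 37.3, "Code": 37.4, "Stack Exchange": 25.3}


def sample_categories(n: int, seed: int = 0) -> Counter:
    """Draw n source categories with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(SHARES)
    weights = [SHARES[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))


counts = sample_categories(100_000)
for name in SHARES:
    print(f"{name}: {100 * counts[name] / 100_000:.1f}%")
```

With a large enough sample the observed proportions settle close to the target shares.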

	## Files

	`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`

	## Contact

	Open-source: quartzopensource@gmail.com
	Commercial licensing & enterprise: commercial@aeneaglobal.com

	## License

	Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd

	```bibtex
	@misc{qt_v2_2026,
	title={QT_V.2: A Multilingual BPE Tokenizer Family},
	author={AENEA Global Ltd},
	year={2026},
	url={https://quartz.host},
	}
	```