---
language:
- en
- de
- fr
- es
- pt
- it
- nl
- pl
- ro
- cs
- sv
- da
- "no"
- fi
- hu
- hr
- bg
- tr
- ca
- ru
- uk
- sr
- zh
- ja
- ko
- ar
- fa
- he
- hi
- bn
- th
- vi
- ka
- hy
- el
- yi
- ur
- ta
- te
- gu
- pa
- ml
- kn
- am
- si
- my
- km
- mr
- ne
- or
- bo
- dv
- eu
- gl
- gd
- et
- sk
- lt
- sl
- lv
- af
- sq
- sw
- is
- tl
- cy
- ga
- br
- la
- mk
- id
- code
license: apache-2.0
library_name: tokenizers
tags:
- tokenizer
- bpe
- multilingual
- code
- quartz
- aenea
- coding
- python
- flores
pipeline_tag: text-generation
---

# QT_V.2 Code 114K: Multilingual Coding Tokenizer

**Lowest total tokens on our 66-test field benchmark of any tokenizer we compared, at any vocabulary size.** A 114,688-token vocabulary optimised for multilingual coding models. Trained with doubled code weight (37% of the corpus), including 450K high-quality Python functions from CodeSearchNet. Beats Llama 3, Tekken, and Qwen 2.5 on total tokens while using roughly 10–25% less vocabulary. Validated on FLORES-200 across 204 languages.

Part of the **QT_V.2 tokenizer family** by [Quartz Data Infrastructure](https://quartz.host), the open data layer behind [AENEA](https://aenea.app).

## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT Code 114K | QT 96K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| **Total tokens** | 13,007,924 | **12,961,617** | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| **Equity ratio** | 43.3× | **31.6×** | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.03 | **3.94** | 4.18 | 5.72 | 5.34 | 4.91 |

QT Code 114K uses **22.4% fewer tokens than Llama 3** and **9.8% fewer than Tekken** across all 204 FLORES languages, with roughly 10–25% less vocabulary than the three baselines.

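The percentage savings can be re-derived from the total-token row of the table. The sketch below is only a worked check of that arithmetic; the helper name `pct_fewer` is hypothetical and the figures are copied from the table.

```python
def pct_fewer(ours: int, theirs: int) -> float:
    """Percentage by which `ours` is smaller than `theirs`, to one decimal."""
    return round(100 * (1 - ours / theirs), 1)


# FLORES-200 total tokens, copied from the table above.
QT_CODE_114K = 13_007_924
BASELINES = {
    "Llama 3 (128K)": 16_764_198,
    "Tekken (131K)": 14_421_539,
    "Qwen 2.5 (152K)": 15_425_680,
}

for name, total in BASELINES.items():
    print(f"vs {name}: {pct_fewer(QT_CODE_114K, total)}% fewer tokens")
```

The same helper applies to the field-benchmark totals, since both sets of comparisons are plain ratios of token counts.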
### Key FLORES Languages (tok/word)

| Language | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Japanese | **32.1** | 38.9 | 41.3 | 35.8 |
| Tibetan | **46.5** | 149.8 | 168.4 | 98.0 |
| Sinhala | **3.58** | 11.37 | 16.60 | 9.17 |
| Amharic | **3.40** | 11.95 | 11.98 | 6.45 |
| Georgian | **3.46** | 15.47 | 3.93 | 8.33 |
| Odia | **4.10** | 16.90 | 18.30 | 13.65 |

## Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| **Total tokens** | **3,314** (lowest of the tokenizers compared) |
| vs Llama 3 (128K) | 41.2% fewer tokens |
| vs Tekken (131K) | 23.8% fewer tokens |
| vs Qwen 2.5 (152K) | 36.1% fewer tokens |

### Code Performance

Token counts per code test (lower is better); the lowest count in each row is in bold.

| Language | QT Code | QT 96K | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|---|---|
| Python | 110 | 115 | 125 | **97** | 112 | 105 |
| JavaScript | 67 | 71 | 71 | 65 | 69 | **64** |
| Rust | 111 | 113 | 117 | 108 | 111 | **107** |

The Python token count fell from 125 (64K) to 115 (96K) to **110** (Code 114K), narrowing the gap to Llama 3's 97 from 28.9% to 13.4%.

### Category Totals (lower is better)

| Category | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Natural Languages (20) | **1,033** | 1,599 | 1,038 | 1,535 |
| V1 Expansion (14) | **662** | 1,758 | 1,092 | 1,509 |
| V2 New Scripts (3) | **188** | 692 | 740 | 523 |
| Celtic / Brythonic (8) | **312** | 391 | 341 | 384 |
| Code (3) | 288 | **270** | 292 | 276 |
| **TOTAL (66 tests)** | **3,314** | 5,639 | 4,347 | 5,183 |

The five categories listed cover 48 of the 66 tests, so their rows do not sum to the TOTAL row.

## When to Use This Variant

**QT_V.2 Code 114K** is designed for multilingual coding assistants and code generation models. It wins the Natural Languages category outright (1,033 tokens versus Tekken's 1,038) while offering competitive code compression. It is a good fit for models that must serve both code and diverse natural-language users.

Also available: [QT_V.2 64K](https://huggingface.co/QuartzOpen/QT_V.2_64K) (smallest embedding) · [QT_V.2 96K](https://huggingface.co/QuartzOpen/QT_V.2_96K) (best all-round)

## Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer definition shipped in this repository
tok = Tokenizer.from_file("tokenizer.json")

# Encode a short Python function and inspect the resulting tokens
code = "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)"
encoded = tok.encode(code)
print(encoded.tokens)
```

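To compare an encoding against the figures in this card, two simple ratios are useful: tokens per word and characters per token. The sketch below assumes that words are whitespace-separated, which is an assumption rather than the documented methodology; scripts written without spaces, such as Japanese, would then show very large tok/word values. The helper names and the example token split are hypothetical, not actual tokenizer output.

```python
def tokens_per_word(n_tokens: int, text: str) -> float:
    """Tokens per whitespace-separated word (assumed definition of tok/word)."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return n_tokens / len(words)


def chars_per_token(n_tokens: int, text: str) -> float:
    """Average number of characters covered by one token."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return len(text) / n_tokens


# Hypothetical split of a short snippet, for illustration only.
text = "return fibonacci(n-1)"
example_tokens = ["return", " fibonacci", "(", "n", "-", "1", ")"]

print(tokens_per_word(len(example_tokens), text))  # 3.5
print(chars_per_token(len(example_tokens), text))  # 3.0
```

With the real tokenizer loaded as in the snippet above, `len(tok.encode(text).ids)` would supply `n_tokens`.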
## Specifications

| Spec | Value |
|---|---|
| Vocabulary | 114,688 |
| Languages | 71 natural + 15 code (incl. CodeSearchNet) |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.24 chars |
| Compression | 3.60 chars/token |

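Single-digit splitting means every digit becomes its own pre-token, so a number is always segmented the same way regardless of its length. The sketch below isolates just that rule with a simplified regex; it is an assumption for illustration and not the pretokenizer pattern shipped in `tokenizer.json`. The name `split_digits` is hypothetical.

```python
import re

# Each digit is emitted alone; any run of non-digit characters stays together.
_DIGIT_SPLIT = re.compile(r"\d|\D+")


def split_digits(text: str) -> list[str]:
    """Split text so that every digit is a separate piece."""
    return _DIGIT_SPLIT.findall(text)


print(split_digits("x = 2025 + 17"))
# ['x = ', '2', '0', '2', '5', ' + ', '1', '7']
```

Because digits never merge with their neighbours, BPE cannot learn multi-digit tokens, which keeps the representation of numbers uniform at the cost of more tokens per number.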
## Training

Byte-level BPE with Llama 3 regex pretokenizer. Code-heavy corpus:

| Category | Share | Sources |
|---|---|---|
| Wikipedia | 37.3% | 71 languages (wiki_ultra_clean v7.3) |
| Code | 37.4% | 14 languages + CodeSearchNet Python (450K functions) |
| Stack Exchange | 25.3% | 49 sites (se_ultra_clean v1) |

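For readers unfamiliar with byte-level BPE, the toy sketch below shows the core training loop: start from single UTF-8 bytes and repeatedly merge the most frequent adjacent pair. It is a minimal illustration under simplifying assumptions (no regex pretokenization, a tiny in-memory corpus, a hypothetical `train_bpe` helper) and is not the pipeline used to train this tokenizer.

```python
from collections import Counter


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` byte-pair merges from a list of strings."""
    # Represent each string as a tuple of single-byte symbols, with its frequency.
    words = Counter(tuple(bytes([b]) for b in w.encode("utf-8")) for w in corpus)
    merges: list[tuple[bytes, bytes]] = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by string frequency.
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge

        # Pick the most frequent pair; ties are broken by byte order.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]

        # Rewrite every string, replacing occurrences of the chosen pair.
        new_words: Counter = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words

    return merges


print(train_bpe(["abab", "abab", "ab"], 2))
# [(b'a', b'b'), (b'ab', b'ab')]
```

Working on bytes rather than characters means any input can be encoded without an unknown token, which matters for a vocabulary spanning many scripts.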
## Files

`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`

## Contact

Open-source: quartzopensource@gmail.com
Commercial licensing & enterprise: commercial@aeneaglobal.com

## License

Apache 2.0. Copyright 2025-2026 AENEA Global Ltd

```bibtex
@misc{qt_v2_2026,
title={QT_V.2: A Multilingual BPE Tokenizer Family},
author={AENEA Global Ltd},
year={2026},
url={https://quartz.host},
}
```
