Create README.md
Browse files
README.md
ADDED
|
@@ -0,0 +1,64 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
language:
|
| 3 |
+
- it
|
| 4 |
+
- en
|
| 5 |
+
tags:
|
| 6 |
+
- tokenizer
|
| 7 |
+
- bpe
|
| 8 |
+
- stem
|
| 9 |
+
- physics
|
| 10 |
+
- quark
|
| 11 |
+
license: mit
|
| 12 |
+
pipeline_tag: text-generation
|
| 13 |
+
---
|
| 14 |
+
|
| 15 |
+
# Quark2Tokenizer
|
| 16 |
+
|
| 17 |
+
Quark2Tokenizer is a highly optimized, bilingual (English/Italian) Byte-Level Byte-Pair Encoding (BPE) tokenizer engineered specifically for the **Quark** Small Language Model (SLM) series. It is designed to maximize context window efficiency across technical domains, including specialized STEM literature, theoretical physics, source code, and structured multi-turn reasoning paths.
|
| 18 |
+
|
| 19 |
+
## Architectural Design & Specifications
|
| 20 |
+
|
| 21 |
+
- **Model Type:** ByteLevelBPE (Rust-backed core for highly parallelized inference and training)
|
| 22 |
+
- **Vocabulary Size:** 65,536 (Fixed entry constraints)
|
| 23 |
+
- **Primary Languages:** Italian (IT), English (EN)
|
| 24 |
+
- **Domain Optimization:** Advanced Mathematics (LaTeX), Deep Theoretical Physics, and Structural Source Code (Python, Shell)
|
| 25 |
+
|
| 26 |
+
### Token Allocation & Special Tokens
|
| 27 |
+
The first 10 token slots (`0-9`) are structurally reserved as non-fragmentable atomic identifiers. This architecture prevents standard ByteLevel segmentation from breaking conversational control markers or degrading the model's available context window during dense logical inference:
|
| 28 |
+
|
| 29 |
+
| Token | ID | Functional Scope |
|
| 30 |
+
| :--- | :--- | :--- |
|
| 31 |
+
| `<\|system\|>` | `4` | System prompt configuration boundary |
|
| 32 |
+
| `<\|user\|>` | `5` | User prompt interface marker |
|
| 33 |
+
| `<\|assistant\|>` | `6` | Assistant response generation boundary |
|
| 34 |
+
| `<\|end\|>` | `7` | Complete sequence/turn termination delimiter |
|
| 35 |
+
| `<\|thinking\|>` | `8` | **Chain-of-Thought (CoT) sequence initialization** |
|
| 36 |
+
| `<\|/thinking\|>` | `9` | **Chain-of-Thought (CoT) sequence termination** |
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## Tokenization Efficiency & Compression Metrics
|
| 41 |
+
|
| 42 |
+
The tokenizer was trained on a meticulously balanced 5-billion-token streaming matrix comprising Wikipedia (EN/IT), FineWeb-2 (IT), CulturaX, OpenWebMath, The Stack-Dedup, and high-quality Claude CoT distillation data.
|
| 43 |
+
|
| 44 |
+
Evaluated across varied modalities, the vocabulary yields a highly competitive **Characters-per-Token (char/tok)** compression ratio:
|
| 45 |
+
|
| 46 |
+
* **Natural Language Prosa (Bilingual EN/IT):** **5.31 char/tok** *Consolidates morphological roots and spaces seamlessly (e.g., merging `路modello` or `路architecture` into single vocab entries), significantly reducing context consumption during prompt parsing.*
|
| 47 |
+
* **Source Code (Python):** **2.72 char/tok** *Maintains exact syntactic integrity of control structures, indentation spaces, and logical keywords without granular sub-word fragmentation.*
|
| 48 |
+
* **Mathematical Notation (LaTeX & Tensor Calculus):** **2.19 char/tok** *Successfully captures highly frequent structural clusters (e.g., immediate grouping of subscript operators and syntax macros like `_{\`), avoiding the typical cascade of single-byte token explosions found in generic tokenizers.*
|
| 49 |
+
|
| 50 |
+
---
|
| 51 |
+
|
| 52 |
+
## Quickstart
|
| 53 |
+
|
| 54 |
+
```python
|
| 55 |
+
from transformers import AutoTokenizer
|
| 56 |
+
|
| 57 |
+
# Initialize the Quark2Tokenizer
|
| 58 |
+
tokenizer = AutoTokenizer.from_pretrained("ThingAI/QuarkTokenizer-v2")
|
| 59 |
+
|
| 60 |
+
sequence = "<|thinking|>\nExecuting field equations...\n<|/thinking|>G_{\mu\nu} = \\frac{8\\pi G}{c^4} T_{\mu\nu}"
|
| 61 |
+
tokens = tokenizer.encode(sequence)
|
| 62 |
+
|
| 63 |
+
print(f"Token IDs: {tokens}")
|
| 64 |
+
print(f"Decoded Sequence: {tokenizer.decode(tokens)}")
|