ThingsAI commited on
Commit
372588f
verified
1 Parent(s): 4f051f1

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +64 -0
README.md ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - it
4
+ - en
5
+ tags:
6
+ - tokenizer
7
+ - bpe
8
+ - stem
9
+ - physics
10
+ - quark
11
+ license: mit
12
+ pipeline_tag: text-generation
13
+ ---
14
+
15
+ # Quark2Tokenizer
16
+
17
+ Quark2Tokenizer is a highly optimized, bilingual (English/Italian) Byte-Level Byte-Pair Encoding (BPE) tokenizer engineered specifically for the **Quark** Small Language Model (SLM) series. It is designed to maximize context window efficiency across technical domains, including specialized STEM literature, theoretical physics, source code, and structured multi-turn reasoning paths.
18
+
19
+ ## Architectural Design & Specifications
20
+
21
+ - **Model Type:** ByteLevelBPE (Rust-backed core for highly parallelized inference and training)
22
+ - **Vocabulary Size:** 65,536 (Fixed entry constraints)
23
+ - **Primary Languages:** Italian (IT), English (EN)
24
+ - **Domain Optimization:** Advanced Mathematics (LaTeX), Deep Theoretical Physics, and Structural Source Code (Python, Shell)
25
+
26
+ ### Token Allocation & Special Tokens
27
+ The first 10 token slots (`0-9`) are structurally reserved as non-fragmentable atomic identifiers. This architecture prevents standard ByteLevel segmentation from breaking conversational control markers or degrading the model's available context window during dense logical inference:
28
+
29
+ | Token | ID | Functional Scope |
30
+ | :--- | :--- | :--- |
31
+ | `<\|system\|>` | `4` | System prompt configuration boundary |
32
+ | `<\|user\|>` | `5` | User prompt interface marker |
33
+ | `<\|assistant\|>` | `6` | Assistant response generation boundary |
34
+ | `<\|end\|>` | `7` | Complete sequence/turn termination delimiter |
35
+ | `<\|thinking\|>` | `8` | **Chain-of-Thought (CoT) sequence initialization** |
36
+ | `<\|/thinking\|>` | `9` | **Chain-of-Thought (CoT) sequence termination** |
37
+
38
+ ---
39
+
40
+ ## Tokenization Efficiency & Compression Metrics
41
+
42
+ The tokenizer was trained on a meticulously balanced 5-billion-token streaming matrix comprising Wikipedia (EN/IT), FineWeb-2 (IT), CulturaX, OpenWebMath, The Stack-Dedup, and high-quality Claude CoT distillation data.
43
+
44
+ Evaluated across varied modalities, the vocabulary yields a highly competitive **Characters-per-Token (char/tok)** compression ratio:
45
+
46
+ * **Natural Language Prosa (Bilingual EN/IT):** **5.31 char/tok** *Consolidates morphological roots and spaces seamlessly (e.g., merging `路modello` or `路architecture` into single vocab entries), significantly reducing context consumption during prompt parsing.*
47
+ * **Source Code (Python):** **2.72 char/tok** *Maintains exact syntactic integrity of control structures, indentation spaces, and logical keywords without granular sub-word fragmentation.*
48
+ * **Mathematical Notation (LaTeX & Tensor Calculus):** **2.19 char/tok** *Successfully captures highly frequent structural clusters (e.g., immediate grouping of subscript operators and syntax macros like `_{\`), avoiding the typical cascade of single-byte token explosions found in generic tokenizers.*
49
+
50
+ ---
51
+
52
+ ## Quickstart
53
+
54
+ ```python
55
+ from transformers import AutoTokenizer
56
+
57
+ # Initialize the Quark2Tokenizer
58
+ tokenizer = AutoTokenizer.from_pretrained("ThingAI/QuarkTokenizer-v2")
59
+
60
+ sequence = "<|thinking|>\nExecuting field equations...\n<|/thinking|>G_{\mu\nu} = \\frac{8\\pi G}{c^4} T_{\mu\nu}"
61
+ tokens = tokenizer.encode(sequence)
62
+
63
+ print(f"Token IDs: {tokens}")
64
+ print(f"Decoded Sequence: {tokenizer.decode(tokens)}")