assemsabry committed · Commit 65480a1 · verified · 1 Parent(s): 99b583f

Upload README.md with huggingface_hub

Files changed (1): README.md (+72 −3)

README.md CHANGED
@@ -1,3 +1,72 @@
- ---
- license: mit
- ---
# Traum Tokenizer

Traum Tokenizer is a high-performance tokenizer designed for next-generation Large Language Models (LLMs) and optimized specifically for the Flash - SLM project. Developed after extensive research into existing tokenizers such as GPT-2's and BERT's, Traum Tokenizer balances compression efficiency, training speed, and linguistic understanding.

## Overview

A tokenizer's efficiency is paramount to a model's performance. Traum Tokenizer uses a Byte-Level BPE (Byte-Pair Encoding) algorithm, which ensures that no unknown or encoding-error tokens are produced, making it robust across diverse text types.

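Because the base alphabet of a byte-level BPE is the 256 possible byte values, any input string reduces to in-vocabulary symbols before merging even begins. A minimal sketch of this fallback idea (illustrative only, not Traum's actual merge rules):

```python
# Illustrative sketch of the byte-level fallback behind Byte-Level BPE.
# Every string maps to UTF-8 byte values (0-255), so emoji, rare scripts,
# and control characters are always representable -- no <unk> token needed.
def to_byte_units(text: str) -> list[int]:
    """Map text to its UTF-8 byte values, the base alphabet of a byte-level BPE."""
    return list(text.encode("utf-8"))

print(to_byte_units("héllo 🤖"))  # every value falls in 0-255
```

A real BPE tokenizer then merges frequent byte pairs into larger vocabulary units; the byte layer only guarantees coverage.
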
### Key Features

- Massive Training Scale: Trained on a diverse dataset of 20 billion tokens.
- Expanded Vocabulary: A vocabulary more than 15,000 tokens larger than GPT-2's, allowing better representation of complex and modern terminology.
- Precision Engineering: Optimized for reasoning, mathematical symbols, and structured code.
- Optimized for Efficiency: Designed to maximize training throughput and inference quality for Small Language Models (SLMs).

## Performance Benchmarks

Traum Tokenizer has been benchmarked against the GPT-2 and LLaMA tokenizers across multiple domains. The performance metric is the compression ratio (Characters per Token), where higher values indicate more efficient tokenization.

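The Characters-per-Token figures in the table below come down to a simple ratio: total characters in a text divided by the number of tokens the tokenizer produces for it. A minimal helper (hypothetical, not part of the released tokenizer):

```python
# Compression ratio: total characters divided by token count.
# Higher values mean each token covers more text, i.e. better compression.
def chars_per_token(text: str, num_tokens: int) -> float:
    return len(text) / num_tokens

# e.g. a 44-character sentence split into 16 tokens
print(chars_per_token("The quick brown fox jumps over the lazy dog.", 16))
```
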
| Benchmark Category | Traum Tokenizer | GPT-2 Tokenizer | LLaMA Tokenizer |
| :--- | :--- | :--- | :--- |
| English Text | 2.80 | 2.80 | 2.33 |
| Mathematical Logic | 1.00 | 1.00 | 0.83 |
| Code Syntax | 2.57 | 2.57 | 2.57 |
| Chain-of-Thought (CoT) | 7.00 | 3.50 | 3.11 |

### Benchmark Analysis

- English: Traum outperforms the LLaMA tokenizer and matches the industry-standard GPT-2 (2.80 chars/token for both).
- Mathematics: Traum matches GPT-2 and outperforms LLaMA, capturing mathematical structures with high precision.
- Code: Performance is identical across all three tokenizers (2.57 chars/token).
- Reasoning (CoT): The current version exhibits extremely high compression on reasoning tasks (7.00 chars/token), double that of GPT-2. While highly efficient, future iterations (Traum v2) will focus on tuning this compression to better preserve linguistic nuance in dense reasoning chains.

### Visual Comparison

The chart below visualizes the comparative efficiency of Traum Tokenizer across the different test sets.

![Tokenizer Comparison](./Traum_Chart.png)

## Future Development

Traum Tokenizer is the foundational component for a series of upcoming open-source AI models designed for high-efficiency reasoning. These models will be released on the official account. Based on community interest and feedback, the tokenizer architecture may be fully open-sourced for broad use in the future.

## Usage

Load the tokenizer via the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

# Download and load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("assemsabry/traum-tokenizer")

# Encode a sample sentence into token IDs, then decode it back
text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Encoded tokens: {tokens}")
print(f"Decoded text: {tokenizer.decode(tokens)}")
```

## Repository Structure

- `tokenizer.json`: Core BPE tokenizer configuration and vocabulary.
- `tokenizer_config.json`: Metadata and configuration for the Transformers/Tokenizers libraries.
- `Traum_Chart.png`: Benchmark visualization.
- `README.md`: System documentation and benchmarks.

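For readers inspecting the files directly: `tokenizer.json` follows the Hugging Face Tokenizers serialization format and can be read with the standard `json` module. A hypothetical miniature of its `model` section (field names follow that format; the values here are made up, and the real file is far larger):

```python
import json

# Hypothetical miniature of a tokenizer.json "model" section: a BPE model
# with a vocabulary and merge rules. The values are made up for illustration.
sample = json.dumps({
    "model": {
        "type": "BPE",
        "vocab": {"h": 0, "e": 1, "he": 2},
        "merges": ["h e"],
    }
})

config = json.loads(sample)
print(config["model"]["type"], "vocab size:", len(config["model"]["vocab"]))
```
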
## Developer

**Assem Sabry** is an Egyptian AI Engineer & Researcher and the founder of Token AI (founded in 2025).

- Website: https://assem.cloud/
- LinkedIn: https://www.linkedin.com/in/assem7/