# Traum Tokenizer

Traum Tokenizer is a high-performance, specialized tokenizer designed for next-generation Large Language Models (LLMs) and specifically optimized for the Flash-SLM project. Developed after extensive research into existing tokenizers such as GPT-2's and BERT's, Traum Tokenizer balances compression efficiency, training speed, and linguistic understanding.

## Overview

A tokenizer's efficiency is paramount to a model's performance. Traum Tokenizer uses a byte-level BPE (Byte-Pair Encoding) algorithm: because every input is first decomposed into UTF-8 bytes, no unknown or encoding-error tokens are ever produced, making it robust across diverse text types.

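The no-unknown-token guarantee follows from the byte-level pre-tokenization step. A minimal sketch of the idea (not Traum's actual implementation): any string is first mapped to UTF-8 bytes, and all 256 byte values sit in the base vocabulary, so every input is representable before any BPE merges are applied.

```python
def byte_level_pretokenize(text: str) -> list[int]:
    """Map a string to its UTF-8 byte values (0-255).

    Byte-level BPE starts from this representation, so no input can
    ever fall outside the base vocabulary -- there is no <unk> token.
    """
    return list(text.encode("utf-8"))

# Works for ASCII, accented text, math symbols, and emoji alike.
for sample in ["hello", "naïve", "∑ x_i", "😀"]:
    byte_ids = byte_level_pretokenize(sample)
    assert all(0 <= b <= 255 for b in byte_ids)

print(byte_level_pretokenize("😀"))  # [240, 159, 152, 128]
```

The real tokenizer then merges frequent byte pairs into larger vocabulary entries; this sketch only shows why coverage is total.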
### Key Features

- Massive Training Scale: Trained on a diverse dataset of 20 billion tokens.
- Expanded Vocabulary: A vocabulary over 15,000 tokens larger than GPT-2's, allowing better representation of complex and modern terminology.
- Precision Engineering: Optimized for reasoning, mathematical symbols, and structured code.
- Optimized for Efficiency: Designed to maximize training throughput and inference quality for Small Language Models (SLMs).

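For scale: GPT-2's standard vocabulary has 50,257 entries, so "larger by over 15,000 tokens" implies a Traum vocabulary of at least roughly 65,000 entries. This is back-of-the-envelope arithmetic from the claim above, not a published figure:

```python
GPT2_VOCAB_SIZE = 50_257  # GPT-2's published vocabulary size
EXTRA_TOKENS = 15_000     # lower bound on the difference stated above

# Implied lower bound on Traum Tokenizer's vocabulary size.
traum_vocab_lower_bound = GPT2_VOCAB_SIZE + EXTRA_TOKENS
print(traum_vocab_lower_bound)  # 65257
```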
## Performance Benchmarks

Traum Tokenizer has been benchmarked against the GPT-2 and LLaMA tokenizers across multiple domains. The performance metric is the compression ratio (characters per token), where higher values indicate more efficient tokenization.

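The characters-per-token metric used in the table below is straightforward to compute. The token ids in this sketch are made-up placeholders, not actual Traum Tokenizer output:

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """Characters per token: higher means the tokenizer packs more
    text into each token, i.e. more efficient compression."""
    return len(text) / len(token_ids)

# Toy example: 19 characters tokenized into 4 (hypothetical) tokens.
text = "The quick brown fox"
token_ids = [17, 521, 9, 204]
print(compression_ratio(text, token_ids))  # 19 / 4 = 4.75
```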
| Benchmark Category | Traum Tokenizer | GPT-2 Tokenizer | LLaMA Tokenizer |
| :--- | :--- | :--- | :--- |
| English Text | 2.80 | 2.80 | 2.33 |
| Mathematical Logic | 1.00 | 1.00 | 0.83 |
| Code Syntax | 2.57 | 2.57 | 2.57 |
| Chain-of-Thought (CoT) | 7.00 | 3.50 | 3.11 |

### Benchmark Analysis

- English: Traum matches the industry-standard GPT-2 tokenizer and outperforms the LLaMA tokenizer.
- Mathematics: Traum matches GPT-2 and shows superior tokenization efficiency compared to LLaMA, capturing mathematical structure with high precision.
- Code: Performance is consistent with, and equal to, current state-of-the-art tokenizers.
- Reasoning (CoT): The current version exhibits extremely high compression on reasoning text (7.00 chars/token). While highly efficient, future iterations (Traum v2) will focus on tuning this compression to better preserve linguistic nuance in dense reasoning chains.

### Visual Comparison

The chart below visualizes the comparative efficiency of Traum Tokenizer across different test sets.


|
| 39 |
+
|
| 40 |
+
## Future Development

Traum Tokenizer is the foundational component for a series of upcoming open-source AI models designed for high-efficiency reasoning. These models will be released on the official account. Based on community interest and feedback, the tokenizer architecture may be fully open-sourced for broad use in the future.

## Usage

Load the tokenizer via the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("assemsabry/traum-tokenizer")

# Example usage
text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Encoded tokens: {tokens}")
print(f"Decoded text: {tokenizer.decode(tokens)}")
```

## Repository Structure

- `tokenizer.json`: Core BPE tokenizer configuration and vocabulary.
- `tokenizer_config.json`: Metadata and configuration for the Transformers/Tokenizers libraries.
- `Traum_Chart.png`: Benchmark visualization.
- `README.md`: Documentation and benchmarks.

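As a rough illustration of what `tokenizer.json` contains, here is a minimal, hypothetical fragment following the serialization format of the Hugging Face `tokenizers` library (a `model` section with `type`, `vocab`, and `merges`); the real file holds the full vocabulary and merge list and would be read with `json.load`:

```python
# Hypothetical miniature of a tokenizer.json "model" section; the actual
# file in this repository is far larger and is loaded from disk.
tokenizer_json = {
    "model": {
        "type": "BPE",
        "vocab": {"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5},
        "merges": ["h e", "l l"],
    }
}

vocab = tokenizer_json["model"]["vocab"]
print(f"Model type: {tokenizer_json['model']['type']}")  # Model type: BPE
print(f"Vocab size: {len(vocab)}")                       # Vocab size: 6
```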
## Developer

**Assem Sabry** is an Egyptian AI engineer and researcher, and the founder of Token AI (founded in 2025).

- Website: https://assem.cloud/
- LinkedIn: https://www.linkedin.com/in/assem7/