---
license: mit
language:
- en
pipeline_tag: token-classification
tags:
- tokenizer
---

# Traum Tokenizer

Traum Tokenizer is a high-performance, specialized tokenizer designed for next-generation Large Language Models (LLMs) and optimized specifically for the Flash-SLM project. Developed after extensive research into existing tokenizers such as GPT-2 and BERT, it addresses the need to balance compression efficiency, training speed, and linguistic understanding.

## Overview

A tokenizer's efficiency is paramount to a model's performance. Traum Tokenizer uses a byte-level BPE (Byte-Pair Encoding) algorithm, which guarantees that no unknown or encoding-error tokens are ever produced, making it robust across diverse text types.
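
To illustrate the idea, here is a minimal, self-contained sketch of the BPE merge loop (an illustration of the general algorithm, not Traum's actual training code; the toy corpus and the number of merge steps are arbitrary):

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the most common."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Byte-level start: every word begins as a sequence of single units (bytes in
# a real byte-level BPE), so any input is representable with no UNK fallback.
corpus = [list("lowest"), list("lower"), list("low")]
for _ in range(3):  # three merge steps, for illustration only
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = [merge_pair(seq, pair) for seq in corpus]

print(corpus)  # -> [['lowe', 's', 't'], ['lowe', 'r'], ['low']]
```

Each merge promotes the currently most frequent adjacent pair into a single vocabulary entry, which is how frequent substrings end up as single tokens.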

### Key Features

- Massive Training Scale: Trained on a diverse dataset of 20 billion tokens.
- Expanded Vocabulary: A vocabulary more than 15,000 tokens larger than GPT-2's, allowing better representation of complex and modern terminology.
- Precision Engineering: Optimized for reasoning, mathematical symbols, and structured code.
- Optimized for Efficiency: Designed to maximize training throughput and inference quality for Small Language Models (SLMs).


## Performance Benchmarks

Traum Tokenizer has been benchmarked against the GPT-2 and LLaMA tokenizers across multiple domains. The metrics focus on compression ratio (characters per token), where higher values indicate more efficient tokenization.

| Benchmark Category | Traum Tokenizer | GPT-2 Tokenizer | LLaMA Tokenizer |
| :--- | :--- | :--- | :--- |
| English Text | 2.80 | 2.80 | 2.33 |
| Mathematical Logic | 1.00 | 1.00 | 0.83 |
| Code Syntax | 2.57 | 2.57 | 2.57 |
| Chain-of-Thought (CoT) | 7.00 | 3.50 | 3.11 |
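
The characters-per-token metric in the table can be reproduced for any tokenizer. A minimal sketch (the whitespace split below is only a stand-in so the example runs without downloading a tokenizer; with a real tokenizer you would pass its token list, e.g. `tokenizer.tokenize(text)`):

```python
def chars_per_token(text: str, tokens: list) -> float:
    """Compression ratio: total characters divided by token count (higher = denser)."""
    return len(text) / len(tokens)

# Whitespace split stands in for a real tokenizer here.
text = "The quick brown fox"
tokens = text.split()
print(f"{chars_per_token(text, tokens):.2f}")  # 19 chars / 4 tokens -> 4.75
```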

### Benchmark Analysis

- English: Traum outperforms the LLaMA tokenizer and matches the industry-standard GPT-2.
- Mathematics: Traum matches GPT-2 and clearly outperforms LLaMA, capturing mathematical structures with high precision.
- Code: Performance is on par with current state-of-the-art tokenizers.
- Reasoning (CoT): The current version achieves extremely high compression on reasoning text (7.00 chars/token). While highly efficient, future iterations (Traum v2) will focus on tuning this compression to better preserve linguistic nuance in dense reasoning chains.

### Visual Comparison

The chart below visualizes the comparative efficiency of Traum Tokenizer across the different test sets.

![Traum Tokenizer Benchmark Chart](traum_chart.png)

## Future Development

Traum Tokenizer is the foundational component for a series of upcoming open-source AI models designed for high-efficiency reasoning, which will be released through the developer's official account. Depending on community interest and feedback, the tokenizer architecture may be fully open-sourced for broader use.

## Usage

Load the tokenizer via the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("assemsabry/traum-tokenizer")

# Example usage
text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Encoded tokens: {tokens}")
print(f"Decoded text: {tokenizer.decode(tokens)}")
```

## Repository Structure

- `tokenizer.json`: Core BPE tokenizer configuration and vocabulary.
- `tokenizer_config.json`: Metadata and configuration for the Transformers/Tokenizers library.
- `traum_chart.png`: Benchmark visualization.
- `README.md`: System documentation and benchmarks.


## Developer

**Assem Sabry** is an Egyptian AI Engineer & Researcher and the founder of Token AI (founded in 2025).

- Website: https://assem.cloud/
- LinkedIn: https://www.linkedin.com/in/assem7/