---
license: mit
language:
- en
pipeline_tag: token-classification
tags:
- tokenizer
---

# Traum Tokenizer

Traum Tokenizer is a high-performance, specialized tokenizer designed for next-generation Large Language Models (LLMs) and optimized specifically for the Flash-SLM project. Developed after extensive research into existing tokenizers such as GPT-2 and BERT, it addresses the need to balance compression efficiency, training speed, and linguistic understanding.

## Overview

A tokenizer's efficiency is paramount to a model's performance. Traum Tokenizer uses a byte-level BPE (Byte-Pair Encoding) algorithm, which guarantees that no unknown or encoding-error tokens are ever produced, making it robust across diverse text types.
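
To illustrate the idea, here is a minimal, self-contained sketch of the BPE merge loop (an illustration of the general algorithm, not Traum's actual training code; the toy corpus and the number of merge steps are arbitrary):

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the most common."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Byte-level start: every word begins as a sequence of single units (bytes in
# a real byte-level BPE), so any input is representable with no UNK fallback.
corpus = [list("lowest"), list("lower"), list("low")]
for _ in range(3):  # three merge steps, for illustration only
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = [merge_pair(seq, pair) for seq in corpus]

print(corpus)  # -> [['lowe', 's', 't'], ['lowe', 'r'], ['low']]
```

Each merge promotes the currently most frequent adjacent pair into a single vocabulary entry, which is how frequent substrings end up as single tokens.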

### Key Features

- Massive Training Scale: Trained on a diverse dataset of 20 billion tokens.
- Expanded Vocabulary: A vocabulary more than 15,000 tokens larger than GPT-2's, allowing better representation of complex and modern terminology.
- Precision Engineering: Optimized for reasoning, mathematical symbols, and structured code.
- Optimized for Efficiency: Designed to maximize training throughput and inference quality for Small Language Models (SLMs).


## Performance Benchmarks

Traum Tokenizer has been benchmarked against the GPT-2 and LLaMA tokenizers across multiple domains. The metrics focus on compression ratio (characters per token), where higher values indicate more efficient tokenization.

| Benchmark Category | Traum Tokenizer | GPT-2 Tokenizer | LLaMA Tokenizer |
| :--- | :--- | :--- | :--- |
| English Text | 2.80 | 2.80 | 2.33 |
| Mathematical Logic | 1.00 | 1.00 | 0.83 |
| Code Syntax | 2.57 | 2.57 | 2.57 |
| Chain-of-Thought (CoT) | 7.00 | 3.50 | 3.11 |
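
The characters-per-token metric in the table can be reproduced for any tokenizer. A minimal sketch (the whitespace split below is only a stand-in so the example runs without downloading a tokenizer; with a real tokenizer you would pass its token list, e.g. `tokenizer.tokenize(text)`):

```python
def chars_per_token(text: str, tokens: list) -> float:
    """Compression ratio: total characters divided by token count (higher = denser)."""
    return len(text) / len(tokens)

# Whitespace split stands in for a real tokenizer here.
text = "The quick brown fox"
tokens = text.split()
print(f"{chars_per_token(text, tokens):.2f}")  # 19 chars / 4 tokens -> 4.75
```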

### Benchmark Analysis

- English: Traum outperforms the LLaMA tokenizer and matches the industry-standard GPT-2.
- Mathematics: Traum matches GPT-2 and clearly outperforms LLaMA, capturing mathematical structures with high precision.
- Code: Performance is on par with current state-of-the-art tokenizers.
- Reasoning (CoT): The current version achieves extremely high compression on reasoning text (7.00 chars/token). While highly efficient, future iterations (Traum v2) will focus on tuning this compression to better preserve linguistic nuance in dense reasoning chains.

### Visual Comparison

The chart below visualizes the comparative efficiency of Traum Tokenizer across the different test sets.

![Traum Tokenizer Benchmark Chart](traum_chart.png)

## Future Development

Traum Tokenizer is the foundational component for a series of upcoming open-source AI models designed for high-efficiency reasoning, which will be released through the developer's official account. Depending on community interest and feedback, the tokenizer architecture may be fully open-sourced for broader use.

## Usage

Load the tokenizer via the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("assemsabry/traum-tokenizer")

# Example usage
text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Encoded tokens: {tokens}")
print(f"Decoded text: {tokenizer.decode(tokens)}")
```

## Repository Structure

- `tokenizer.json`: Core BPE tokenizer configuration and vocabulary.
- `tokenizer_config.json`: Metadata and configuration for the Transformers/Tokenizers library.
- `traum_chart.png`: Benchmark visualization.
- `README.md`: System documentation and benchmarks.


## Developer

**Assem Sabry** is an Egyptian AI Engineer & Researcher and the founder of Token AI (founded in 2025).

- Website: https://assem.cloud/
- LinkedIn: https://www.linkedin.com/in/assem7/