rxmha125/Txa1-4B-Tokenizer

+---
+language:
+- en
+- code
+license: apache-2.0
+library_name: transformers
+tags:
+- tokenizer
+- txa-1
+- axtrio-ai
+- mistral-based
+- chatml
+- moe-optimized
+pipeline_tag: text-generation
+---
+<div align="center">
+  <img src="https://img.rxcodexai.com/img/huggingface%20assets/logo/rx-codex-logo.png" width="40%" alt="Axtrio AI Logo" />
+</div>
+<h3 align="center">
+  <b>
+    <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
+    <br/>
+    Txa 1 Tokenizer: The Foundation of Axtrio AI
+    <br/>
+    <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
+    <br/>
+  </b>
+</h3>
+<br/>
+<div align="center" style="line-height: 1;">
+  |
+  <a href="https://huggingface.co/AxtrioAI" target="_blank">🤗 Axtrio AI</a>
+  &nbsp;|
+  <a href="https://rxcodexai.com" target="_blank">🌐 Website</a>
+  &nbsp;|
+  <a href="mailto:contact@rxcodexai.com" target="_blank">📧 Contact</a>
+  &nbsp;|
+  <br/>
+</div>
+<br/>
+## ⚡ Overview
+The **Txa 1 Tokenizer** is the production tokenizer for the **Txa 1 (4B MoE)** model family. It builds on the **Mistral v1** byte-pair-encoding vocabulary, extends it with three chat-control tokens, and is tuned to balance compression against tokenization speed on H100/H200 systems.
+It supports **ChatML** formatting out of the box, so prompts built with it work with inference engines that accept ChatML, such as vLLM, Ollama, and LM Studio.
+**Developed by Rx, Founder & CEO of Axtrio AI.**
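For readers unfamiliar with ChatML, the sketch below renders a conversation in the commonly used ChatML layout. It is an illustration only: `render_chatml` is a hypothetical helper, and the chat template shipped with this tokenizer may differ in whitespace or in how it uses `<|eot|>`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render {"role", "content"} dicts in the common ChatML layout.

    Assumption: each turn is wrapped as
    <|im_start|>{role}\n{content}<|im_end|>\n
    The template bundled with the tokenizer may not match this exactly.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so a model would continue from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml([{"role": "user", "content": "Hello Txa"}],
                        add_generation_prompt=True))
```

In practice, prefer `tokenizer.apply_chat_template`, which uses the template stored with the tokenizer instead of a hand-written one.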
+---
+## 📊 Benchmark Arena
+We pitted the Txa 1 Tokenizer against industry heavyweights in our **Tokenizer Arena**.
+### 1. Speed Analysis (Throughput)
+*Higher is better. Measures raw tokenization speed on H100 hardware.*
+![Speed Comparison](https://img.rxcodexai.com/img/huggingface%20assets/Tokenizer%20Benchmark/benchmark_speed_16_9.png)
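The benchmark harness behind this chart is not described in the card. As a rough sketch of how a throughput figure of this kind could be measured, the hypothetical helper below times any `tokenize` callable and reports tokens per second. The function name, the `repeats` parameter, and the best-of-N policy are all assumptions.

```python
import time


def tokens_per_second(tokenize, texts, repeats=3):
    """Return the best observed throughput (tokens/second) over several passes.

    `tokenize` is any callable mapping a string to a sequence of tokens,
    for example `tokenizer.tokenize` from a loaded Hugging Face tokenizer.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens = sum(len(tokenize(text)) for text in texts)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, total_tokens / elapsed)
    return best


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer here.
    sample = ["def add(a, b): return a + b"] * 1000
    print(f"{tokens_per_second(str.split, sample):,.0f} tokens/s")
```

Taking the best of several passes reduces noise from warm-up and other processes. Comparisons are only meaningful when every tokenizer is timed on the same texts and the same machine.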
+### 2. Compression Efficiency
+*Lower is better. Measures how many tokens are needed to represent complex Code & English.*
+![Compression Comparison](https://img.rxcodexai.com/img/huggingface%20assets/Tokenizer%20Benchmark/benchmark_compression_16_9.png)
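The exact metric plotted above is not spelled out. One common way to express "fewer tokens is better" is tokens per character over a fixed corpus. The sketch below computes that ratio; `tokens_per_character` is a hypothetical helper, not the script used for the chart.

```python
def tokens_per_character(tokenize, texts):
    """Average number of tokens emitted per input character (lower is better).

    `tokenize` is any callable mapping a string to a sequence of tokens.
    """
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_chars = sum(len(text) for text in texts)
    if total_chars == 0:
        raise ValueError("texts must contain at least one character")
    return total_tokens / total_chars


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer here.
    corpus = ["for i in range(10): print(i)", "The quick brown fox."]
    print(round(tokens_per_character(str.split, corpus), 3))
```

Using the same corpus for every tokenizer matters more than the choice of unit, since code-heavy and prose-heavy texts compress very differently.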
+### 3. Vocabulary Architecture
+*Comparison of vocabulary sizes. Txa 1 stays lean at roughly 32k tokens, which keeps the embedding tables small and saves VRAM for the 4B MoE architecture.*
+![Vocab Comparison](https://img.rxcodexai.com/img/huggingface%20assets/Tokenizer%20Benchmark/benchmark_vocab_16_9.png)
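To make the VRAM argument concrete, the sketch below estimates the memory taken by the embedding tables for a given vocabulary size. The hidden size of 2048, the 2-byte parameters, the untied input/output matrices, and the 128,000-token comparison vocabulary are all illustrative assumptions, since the card does not state the model's dimensions.

```python
def embedding_bytes(vocab_size, hidden_size, bytes_per_param=2, tied=False):
    """Bytes used by the input embedding plus the output projection.

    With tied=True a single matrix is shared between the two.
    """
    matrices = 1 if tied else 2
    return vocab_size * hidden_size * bytes_per_param * matrices


if __name__ == "__main__":
    HIDDEN_SIZE = 2048  # hypothetical width, not taken from the model card
    lean = embedding_bytes(32_003, HIDDEN_SIZE)
    large = embedding_bytes(128_000, HIDDEN_SIZE)  # hypothetical larger vocabulary
    print(f"32,003 tokens:  {lean / 2**20:.1f} MiB")
    print(f"128,000 tokens: {large / 2**20:.1f} MiB")
```

The cost grows linearly with vocabulary size, so a smaller vocabulary frees a fixed share of memory regardless of the rest of the architecture.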
+---
+## 🔧 Technical Specifications
+| Feature | Specification |
+| :--- | :--- |
+| **Base Architecture** | Byte-Pair Encoding (Mistral v1 Foundation) |
+| **Vocabulary Size** | **32,003 tokens** (32,000 base + 3 added special tokens) |
+| **Added Special Tokens** | `<\|im_start\|>`, `<\|im_end\|>`, `<\|eot\|>` |
+| **Optimization** | Code & Logic Compression |
+| **Compatibility** | Fully Compatible with `LlamaTokenizerFast` |
+## 💻 Usage
+### Quick Start
+```python
+from transformers import AutoTokenizer
+# Load the tokenizer
+tokenizer = AutoTokenizer.from_pretrained("AxtrioAI/Txa1-4B-Tokenizer")
+# Build a short conversation to render with the ChatML chat template
+chat = [
+    {"role": "user", "content": "Hello Txa, can you help me debug Python?"},
+    {"role": "assistant", "content": "Certainly! Please paste your code below."}
+]
+# Render the conversation to a string (tokenize=False returns text, not token IDs)
+formatted_prompt = tokenizer.apply_chat_template(chat, tokenize=False)
+print(formatted_prompt)