rxmha125
/

Rx_Codex_Tokenizer

+---
+language:
+- en
+license: apache-2.0
+library_name: transformers
+tags:
+- tokenizer
+- rx-codex
+- medical-ai
+- code-tokenizer
+- chat-ai
+pipeline_tag: text-generation
+---
+<div align="center">
+  <img src="./rx_codex_logo.png" width="60%" alt="Rx-Codex-AI" />
+</div>
+<h3 align="center">
+  <b>
+    <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
+    <br/>
+    Rx Codex Tokenizer: Professional Tokenizer for Modern AI
+    <br/>
+    <span>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>
+    <br/>
+  </b>
+</h3>
+<br/>
+<div align="center" style="line-height: 1;">
+  |
+  <a href="https://huggingface.co/rxmha125" target="_blank">🤗 HuggingFace</a>
+  &nbsp;|
+  <a href="https://rxcodex.ai" target="_blank">🌐 Website</a>
+  &nbsp;|
+  <a href="mailto:contact@rxcodex.ai" target="_blank">📧 Contact</a>
+  &nbsp;|
+  <br/>
+</div>
+<br/>
+## Overview
+Rx Codex Tokenizer is a state-of-the-art BPE tokenizer designed for modern AI applications. With 128K vocabulary optimized for English, code, and medical text, it outperforms established tokenizers in comprehensive benchmarks.
+**Developed by Rx Founder & CEO of Rx Codex AI**
+## Benchmark Results
+### Tokenizer Battle Royale - Final Scores
+| Tokenizer | Final Score | Speed | Compression | Special Tokens | Chat Support |
+|-----------|-------------|-------|-------------|----------------|--------------|
+| 🥇 **Rx Codex** | **84.51/100** | 24.84/25 | 35.0/35 | 16.67/20 | 15/15 |
+| 🥈 GPT-2 | 67.89/100 | 24.89/25 | 35.0/35 | 0.0/20 | 15/15 |
+| 🥉 DeepSeek | 67.77/100 | 24.77/25 | 35.0/35 | 0.0/20 | 15/15 |
+### Final Scores Comparison
+![Final Scores](./final_scores_comparison.png)
+### Performance Breakdown
+![Performance Breakdown](./performance_breakdown.png)
+### Speed Analysis
+![Speed Comparison](./speed_comparison.png)
+### Compression Efficiency
+![Compression Comparison](./compression_comparison.png)
+### Multi-dimensional Analysis
+![Capabilities Radar](./capabilities_radar.png)
+### Token Count Efficiency
+![Token Count Comparison](./token_count_comparison.png)
+## Key Features
+- **128K Vocabulary** - Optimal balance of coverage and efficiency
+- **Byte-Level BPE** - No UNK tokens, handles any text
+- **Medical Text Optimized** - Perfect for healthcare AI applications
+- **Code-Aware** - Excellent programming language support
+- **Chat-Ready Tokens** - Built-in support for conversation formats
+## Technical Specifications
+- **Vocabulary Size**: 128,256 tokens
+- **Special Tokens**: 9 custom tokens
+- **Model Type**: BPE with byte fallback
+- **Training Data**: OpenOrca 5GB English dataset
+- **Average Speed**: 0.63ms per tokenization
+- **Compression Ratio**: 4.18 characters per token
+## Use Cases
+- **Chat AI Systems** - Built-in chat token support
+- **Medical AI** - Optimized for healthcare terminology
+- **Code Generation** - Excellent programming language handling
+- **Academic Research** - Efficient with complex text
+## License
+Apache 2.0
+## Author
+**Rx Founder & CEO**
+Rx Codex AI