---
title: BPE DNA Tokenizer
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---
# 🧬 BPE DNA Tokenizer
An interactive demo of a Byte Pair Encoding (BPE) tokenizer trained on the *E. coli* K-12 genome.
## 🎯 Key Results
- **Vocabulary Size**: 5,000 tokens
- **Compression Ratio**: 5.208x (62.8% above requirement)
- **Dataset**: *E. coli* K-12 genome (4.6M base pairs)
- **Lossless**: Decoding reconstructs the original sequence exactly
## ✨ Features
- 🧬 **DNA-Optimized**: Specifically designed for genomic sequences
- 🚀 **High Compression**: Achieves 5.2x compression
- 🔬 **Biological Discovery**: Automatically finds codons, TATA boxes, and more
- ✅ **Lossless**: Perfect encode-decode reconstruction
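
To make the encode-decode guarantee concrete, here is a minimal sketch of character-level BPE on a toy DNA string. It is not the code behind this demo: the function names (`train_bpe`, `apply_merge`, `encode`, `decode`) and the toy sequence are assumptions chosen for illustration, and a production tokenizer would use a far more efficient merge loop.

```python
from collections import Counter


def apply_merge(tokens, left, right):
    """Replace every adjacent (left, right) pair with the joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequence, num_merges):
    """Learn up to `num_merges` merge rules from a single DNA string."""
    tokens = list(sequence)  # start from single bases
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress
        merges.append((left, right))
        tokens = apply_merge(tokens, left, right)
    return merges


def encode(sequence, merges):
    """Apply the learned merges, in training order, to a new sequence."""
    tokens = list(sequence)
    for left, right in merges:
        tokens = apply_merge(tokens, left, right)
    return tokens


def decode(tokens):
    """Tokens are plain base strings, so decoding is concatenation."""
    return "".join(tokens)


toy_sequence = "ATGATGATGTAA"  # hypothetical input, not from the training data
toy_merges = train_bpe(toy_sequence, num_merges=3)
toy_tokens = encode(toy_sequence, toy_merges)
assert decode(toy_tokens) == toy_sequence  # lossless round trip
```

Because every token is just a substring of bases, decoding cannot lose information regardless of which merges were learned.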
## 🔬 Discovered Patterns
The tokenizer learned biologically meaningful patterns without supervision:
- **Start Codon**: ATG
- **Stop Codons**: TAA, TAG
- **TATA Box**: TATAA
- **Shine-Dalgarno**: AGGAGG
- **CpG-containing motif**: GCGC
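
One way to check a learned vocabulary for the motifs listed above is a simple membership scan. The sketch below assumes the vocabulary is available as a list of base strings; the `find_motifs` helper, the `KNOWN_MOTIFS` table, and the example vocabulary are hypothetical and not part of this project's API.

```python
# Motifs taken from the list above; the labels are illustrative.
KNOWN_MOTIFS = {
    "Start codon": ["ATG"],
    "Stop codons": ["TAA", "TAG"],
    "TATA box": ["TATAA"],
    "Shine-Dalgarno": ["AGGAGG"],
    "GCGC motif": ["GCGC"],
}


def find_motifs(vocab):
    """Return, per motif class, the motifs that appear as whole tokens."""
    vocab_set = set(vocab)
    return {
        name: [motif for motif in motifs if motif in vocab_set]
        for name, motifs in KNOWN_MOTIFS.items()
    }


# Hypothetical vocabulary fragment, for illustration only.
example_vocab = ["A", "C", "G", "T", "ATG", "TAA", "GCGC", "TT"]
for name, hits in find_motifs(example_vocab).items():
    print(f"{name}: {hits if hits else 'not found'}")
```

An exact-match scan only reports motifs that became tokens in their own right; a motif embedded inside a longer token would need a substring search instead.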
## 🚀 Try It Out
1. Enter any DNA sequence (A, C, G, T, N)
2. Click "Tokenize Sequence"
3. See the compression statistics and token breakdown
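
If you prepare sequences programmatically before pasting them in, a small check against the accepted alphabet can catch stray characters early. This is a sketch under assumptions: the `validate_dna` name is hypothetical, and stripping whitespace and uppercasing are choices made here, not documented behavior of the demo.

```python
VALID_BASES = set("ACGTN")  # alphabet accepted in step 1 above


def validate_dna(raw):
    """Strip whitespace, uppercase, and reject characters outside A/C/G/T/N."""
    cleaned = "".join(raw.split()).upper()
    invalid = sorted(set(cleaned) - VALID_BASES)
    if invalid:
        raise ValueError(f"Unsupported characters: {', '.join(invalid)}")
    return cleaned


sequence = validate_dna("atg cgt\nnna")
print(sequence, len(sequence))
```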
## 📊 Model Details
- **Training Data**: 4,641,652 base pairs
- **Compressed Size**: 891,316 tokens
- **Training Time**: 88 minutes
- **Longest Token**: 26 bases
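
The reported compression ratio follows directly from the two counts above, assuming it is defined as base pairs divided by tokens. The variable names below are illustrative.

```python
training_bases = 4_641_652    # training data size, from the list above
compressed_tokens = 891_316   # token count after encoding, from the list above

# Compression ratio = input length in bases / output length in tokens.
# The same number is the average token length in bases.
compression_ratio = training_bases / compressed_tokens
print(f"{compression_ratio:.3f}x")  # prints 5.208x
```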
## 🔗 Links
- [GitHub Repository](https://github.com/abi2024/bpe-dna-tokenizer)
- [Full Documentation](https://github.com/abi2024/bpe-dna-tokenizer#readme)
---
**Built for genomics and machine learning** 🧬🤖