---
title: BPE DNA Tokenizer
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---
# 🧬 BPE DNA Tokenizer
An interactive demo of a Byte Pair Encoding (BPE) tokenizer trained on the *E. coli* K-12 genome.
## 🎯 Key Results
- **Vocabulary Size**: 5,000 tokens
- **Compression Ratio**: 5.208x, measured as bases per token (62.8% above the required minimum)
- **Dataset**: *E. coli* K-12 genome (4.6M base pairs)
- **Lossless**: Decoding reproduces the input sequence exactly
## ✨ Features
- 🧬 **DNA-Optimized**: Designed specifically for genomic sequences
- **High Compression**: Achieves 5.2x compression on the training genome
- 🔬 **Biological Discovery**: Learns tokens that match start and stop codons, the Shine-Dalgarno sequence, and other known motifs
- **Lossless**: Encoding followed by decoding returns the original sequence
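
To make the compression and lossless claims concrete, here is a minimal sketch of BPE on a DNA string. It is a toy illustration and not the code behind this Space: the function names (`train_bpe`, `merge_pair`, `encode`, `decode`) and the example sequence are assumptions made for this example.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace each non-overlapping (left, right) pair, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(sequence, num_merges):
    """Learn up to `num_merges` merge rules, most frequent adjacent pair first."""
    tokens = list(sequence)  # start from single bases
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not help
        merges.append((left, right))
        tokens = merge_pair(tokens, left, right)
    return merges


def encode(sequence, merges):
    """Apply the learned merges in the order they were learned."""
    tokens = list(sequence)
    for left, right in merges:
        tokens = merge_pair(tokens, left, right)
    return tokens


def decode(tokens):
    """Tokens are plain substrings, so decoding is concatenation."""
    return "".join(tokens)


seq = "ATGATGATGCCCATGTAA"  # hypothetical example sequence
merges = train_bpe(seq, num_merges=5)
tokens = encode(seq, merges)
assert decode(tokens) == seq  # lossless round trip
print(tokens, f"{len(seq) / len(tokens):.2f} bases per token")
```

Because every token is a substring of the input and merging only joins adjacent tokens, concatenating the tokens always gives back the original sequence.
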
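## 🔬 Discovered Patterns
Without supervision, the tokenizer learned tokens that match known sequence motifs:
- **Start Codon**: ATG
- **Stop Codons**: TAA, TAG
- **TATA-like Motif**: TATAA (resembles the bacterial Pribnow box, consensus TATAAT; the TATA box proper is a eukaryotic promoter element)
- **Shine-Dalgarno**: AGGAGG
- **CpG-containing Motif**: GCGC (CpG islands proper are a feature of vertebrate genomes, not *E. coli*)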
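
One way to check a vocabulary for the sequences listed above is a plain membership test. The sketch below assumes the vocabulary is available as a collection of token strings; `motifs_in_vocab` and `toy_vocab` are hypothetical names for illustration, not part of this project's code.

```python
# Motif sequences taken from the list above, keyed by a short label.
MOTIFS = {
    "start codon": ["ATG"],
    "stop codons": ["TAA", "TAG"],
    "TATAA": ["TATAA"],
    "Shine-Dalgarno": ["AGGAGG"],
    "GCGC": ["GCGC"],
}


def motifs_in_vocab(vocab):
    """Return, per label, the motif sequences that appear as whole tokens."""
    vocab = set(vocab)
    return {
        label: [seq for seq in seqs if seq in vocab]
        for label, seqs in MOTIFS.items()
    }


# Hypothetical toy vocabulary; the real one has 5,000 tokens.
toy_vocab = {"A", "C", "G", "T", "ATG", "TAA", "GCGC"}
found = motifs_in_vocab(toy_vocab)
for label, seqs in found.items():
    print(f"{label}: {seqs if seqs else 'not in vocabulary'}")
```

A match only shows that the string was frequent enough to earn its own token. Short motifs such as ATG can also enter a BPE vocabulary through frequency alone, so membership is not by itself evidence of biological function.
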
## Try It Out
1. Enter any DNA sequence (A, C, G, T, N)
2. Click "Tokenize Sequence"
3. See the compression statistics and token breakdown
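
The steps above could be backed by something like the following sketch, which validates the input alphabet and computes basic statistics. The helper names `clean_sequence` and `compression_stats` are hypothetical, and uppercasing the input and stripping whitespace are assumptions about how the demo might treat pasted text, not documented behavior of `app.py`.

```python
import re

# Alphabet accepted by the demo according to step 1: A, C, G, T, N.
VALID_DNA = re.compile(r"[ACGTN]+")


def clean_sequence(raw):
    """Strip whitespace, uppercase, and reject characters outside A/C/G/T/N."""
    seq = "".join(raw.split()).upper()
    if not VALID_DNA.fullmatch(seq):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T, N")
    return seq


def compression_stats(seq, tokens):
    """Summarize how many bases each token covers on average."""
    return {
        "bases": len(seq),
        "tokens": len(tokens),
        "ratio": len(seq) / len(tokens),
    }


seq = clean_sequence(" atgaggagg\nTAA ")
# Hypothetical tokenization, written by hand for the example.
stats = compression_stats(seq, ["ATG", "AGGAGG", "TAA"])
print(seq, stats)
```
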
## Model Details
- **Training Data**: 4,641,652 base pairs
- **Compressed Size**: 891,316 tokens
- **Training Time**: 88 minutes
- **Longest Token**: 26 bases
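
The reported compression ratio follows directly from the first two figures above. The short check below reproduces it. The last step is a derived value rather than something this README states: it backs out the minimum ratio implied by "62.8% above requirement".

```python
training_bases = 4_641_652    # Training Data, from the list above
compressed_tokens = 891_316   # Compressed Size, from the list above

# Compression ratio = bases in the genome / tokens after encoding.
ratio = training_bases / compressed_tokens
print(f"compression ratio: {ratio:.3f}x")  # compression ratio: 5.208x

# Derived, not stated in this README: the requirement implied by
# a result that is 62.8% above it.
implied_requirement = ratio / 1.628
print(f"implied requirement: {implied_requirement:.2f}x")
```
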
## Links
- [GitHub Repository](https://github.com/abi2024/bpe-dna-tokenizer)
- [Full Documentation](https://github.com/abi2024/bpe-dna-tokenizer#readme)
---
**Built for genomics and machine learning** 🧬🤖