metadata
title: BPE DNA Tokenizer
emoji: 🧬
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
🧬 BPE DNA Tokenizer
An interactive demo of a Byte Pair Encoding (BPE) tokenizer trained on the E. coli K-12 genome.
🎯 Key Results
- Vocabulary Size: 5,000 tokens
- Compression Ratio: 5.208x, measured as bases per token on the training genome (62.8% above requirement)
- Dataset: E. coli K-12 genome (4.6M base pairs)
- Lossless: decoding reproduces the input sequence exactly
✨ Features
- 🧬 DNA-Optimized: Specifically designed for genomic sequences
- High Compression: Achieves 5.2x compression on the training genome
- 🔬 Biological Discovery: Learns tokens that match codons, TATA-like motifs, and more
- Lossless: Perfect encode-decode reconstruction
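For readers new to BPE, the idea behind the features above can be sketched in a few lines of Python. This is a minimal illustration, not the code in app.py: the function names `train_bpe`, `apply_merge`, `encode`, and `decode` are hypothetical, and the naive pair counting here would be far too slow for a 4.6M-base genome.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequence: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules, starting from single bases."""
    tokens = list(sequence)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, so further merges gain nothing
        merges.append(best_pair)
        tokens = apply_merge(tokens, best_pair)
    return merges


def encode(sequence: str, merges) -> list[str]:
    """Tokenize by replaying the learned merges in training order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


def decode(tokens) -> str:
    """Tokens are plain substrings, so decoding is concatenation."""
    return "".join(tokens)


# Toy example (made-up sequence, not E. coli data).
demo_seq = "ATGGCGATGGCGTAAATGGCG"
demo_merges = train_bpe(demo_seq, num_merges=10)
demo_tokens = encode(demo_seq, demo_merges)
print(demo_tokens, len(demo_seq) / len(demo_tokens))
```

Because every token is a literal substring of the input, concatenating the tokens always gives back the original sequence, which is why the round trip is lossless.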
🔬 Discovered Patterns
Without supervision, the tokenizer learned tokens that match known biological motifs:
- Start Codon: ATG
- Stop Codons: TAA, TAG
- TATA-like motif: TATAA (compare the bacterial Pribnow box consensus, TATAAT)
- Shine-Dalgarno: AGGAGG
- CpG-containing motif: GCGC
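One way to check a list like this is to test whether each motif appears as a token in the learned vocabulary. The sketch below assumes the vocabulary is available as a collection of strings. The `toy_vocab` set is invented for illustration and is not the real 5,000-token vocabulary, and `split_by_vocab` is a hypothetical helper.

```python
# Motif strings taken from the list above.
MOTIFS = ["ATG", "TAA", "TAG", "TATAA", "AGGAGG", "GCGC"]


def split_by_vocab(motifs, vocab):
    """Return (present, missing) motif lists, preserving input order."""
    vocab = set(vocab)
    present = [m for m in motifs if m in vocab]
    missing = [m for m in motifs if m not in vocab]
    return present, missing


# Hypothetical vocabulary, for demonstration only.
toy_vocab = {"A", "C", "G", "T", "ATG", "TAA", "GCGC", "AGGAGG"}
present, missing = split_by_vocab(MOTIFS, toy_vocab)
print("present:", present)
print("missing:", missing)
```

Short motifs such as three-base codons are likely to appear in any large vocabulary over a four-letter alphabet, so the longer matches are the more informative ones.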
Try It Out
- Enter any DNA sequence (A, C, G, T, N)
- Click "Tokenize Sequence"
- See the compression statistics and token breakdown
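The steps above imply some input checking and a small statistics summary. The following sketch shows one possible shape for both. The helpers `clean_dna` and `compression_stats` are hypothetical names, and the choice to strip whitespace and upper-case the input is an assumption, not necessarily what the demo does.

```python
ALLOWED_BASES = set("ACGTN")


def clean_dna(raw: str) -> str:
    """Strip whitespace, upper-case, and reject characters outside A, C, G, T, N."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    invalid = sorted(set(seq) - ALLOWED_BASES)
    if invalid:
        raise ValueError(f"unsupported characters: {''.join(invalid)}")
    return seq


def compression_stats(seq: str, tokens: list[str]) -> dict:
    """Summarize a tokenization as base count, token count, and bases per token."""
    return {
        "bases": len(seq),
        "tokens": len(tokens),
        "ratio": len(seq) / len(tokens),
    }


seq = clean_dna("atg cgn\nTTA")
print(seq)
print(compression_stats("ATGATG", ["ATG", "ATG"]))
```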
Model Details
- Training Data: 4,641,652 base pairs
- Compressed Size: 891,316 tokens
- Training Time: 88 minutes
- Longest Token: 26 bases
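The reported compression ratio follows directly from the two sizes above, as this short check shows. The "implied target" line is an assumption: it reads "62.8% above requirement" as relative to a target ratio, which this README does not state explicitly.

```python
training_bases = 4_641_652
compressed_tokens = 891_316

# Compression ratio = bases per token on the training data.
ratio = training_bases / compressed_tokens
print(f"compression ratio: {ratio:.3f}x")  # 5.208x

# Assumption: "62.8% above requirement" means ratio = target * 1.628.
implied_target = ratio / 1.628
print(f"implied target ratio: about {implied_target:.1f}x")
```

The ratio is also the average token length in bases, roughly 5.2, compared with a longest token of 26 bases.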
Links
Built for genomics and machine learning 🧬