# Tokenization

This document covers 6-mer vocabulary construction, the circular windowing mechanic, the heteroplasmy channel, and vocabulary statistics.

---

## Why K-mer Tokenization?

DNA sequence models face a fundamental tokenization choice: character-level (4 tokens), k-mer (4^k tokens), or BPE (learned, corpus-dependent vocabulary).

**Character-level** keeps the vocabulary tiny but forces the model to learn multi-base patterns entirely through self-attention. At one token per base, a 512-token context window spans only 512 bp of the 16,569-bp genome, so long-range dependencies are hard to capture.

**BPE** learns which substrings are statistically frequent in the training corpus. This is excellent for natural language (where word-level units have meaning) but problematic for DNA: the learned vocabulary is not reproducible across projects, depends on corpus composition, and discards the biological fact that every possible 6-mer is equally valid.

**K-mer tokenization** gives every possible 6-mer a stable, deterministic token ID. The vocabulary is 4^6 = 4,096 k-mer tokens plus 6 special tokens = 4,102 total. Any project using `KmerVocabulary.build(k=6)` on any machine will produce the same mapping. This is the right choice for a pre-trained model that needs to generalize across datasets without re-tokenizing.

---

## Vocabulary Construction

```python
from mtdna_fm.tokenizer.vocabulary import KmerVocabulary

vocab = KmerVocabulary.build(k=6)
len(vocab)  # 4102
```

### Special tokens

The first 6 token IDs are reserved for special tokens:

| ID | Token | Purpose |
|---|---|---|
| 0 | `[PAD]` | Padding to fixed sequence length |
| 1 | `[CLS]` | Classification token prepended to every sequence window |
| 2 | `[MASK]` | Masked position during MLM pre-training |
| 3 | `[UNK]` | Any k-mer containing N (ambiguous base) |
| 4 | `[SEP]` | Separator (for future paired-sequence tasks) |
| 5 | `[HET]` | Heteroplasmic position marker (reserved for future use) |

### K-mer enumeration

K-mers are enumerated in lexicographic order over the alphabet ACGT. The index of a 6-mer is its position in `sorted(all_4096_kmers)`, plus 6 (to leave room for special tokens). This ordering is deterministic and reproducible.

```python
vocab.encode("AAAAAA")  # 6 (first k-mer after special tokens)
vocab.encode("TTTTTT")  # 4101 (last k-mer)
vocab.decode(6)         # "AAAAAA"
```

N-containing k-mers (e.g., "ACGTAN") map to `[UNK]` (ID 3). This handles sequencing gaps without crashing.
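
The enumeration rule can be reproduced in a few lines of standard-library Python. This is a minimal sketch, not the `KmerVocabulary` implementation: `build_kmer_vocab` and `encode_kmer` are hypothetical names, and the special-token order is taken from the table above.

```python
from itertools import product

# Special tokens in the order given by the table (IDs 0-5).
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[MASK]", "[UNK]", "[SEP]", "[HET]"]


def build_kmer_vocab(k: int = 6) -> dict[str, int]:
    """Map special tokens to IDs 0-5 and k-mers to IDs 6 onward."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # product() over the sorted alphabet "ACGT" yields tuples in
    # lexicographic order, so no explicit sort is needed.
    for kmer in product("ACGT", repeat=k):
        vocab["".join(kmer)] = len(vocab)
    return vocab


def encode_kmer(vocab: dict[str, int], kmer: str) -> int:
    """Return the k-mer's ID, falling back to [UNK] for anything unknown (e.g. N)."""
    return vocab.get(kmer, vocab["[UNK]"])


vocab_sketch = build_kmer_vocab(k=6)
print(len(vocab_sketch))                    # 4102
print(encode_kmer(vocab_sketch, "AAAAAA"))  # 6
print(encode_kmer(vocab_sketch, "TTTTTT"))  # 4101
print(encode_kmer(vocab_sketch, "ACGTAN"))  # 3
```

Because the mapping depends only on `k` and the fixed alphabet, two independent runs of this sketch always agree, which is the reproducibility property the section relies on.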

### Save and load

The vocabulary follows HuggingFace `PretrainedConfig` conventions so it can be stored alongside model weights:

```python
vocab.save_pretrained("models/vocabulary/")
# writes: models/vocabulary/vocab_config.json

loaded = KmerVocabulary.from_pretrained("models/vocabulary/")
assert len(loaded) == 4102
```

---

## Sequence Tokenization

```python
from mtdna_fm.tokenizer.tokenize import tokenize_sequence

tokens = tokenize_sequence(
    seq="ATCG...",       # 16,569-bp mtDNA genome
    vocabulary=vocab,
    k=6,
    stride=1,
    max_seq_len=512,
    circular=True,
    het_levels=None,     # optional: np.ndarray of float in [0, 1]
)
# tokens: dict with keys input_ids, attention_mask, position_ids, het_values
```

### Output fields

| Field | Shape | Description |
|---|---|---|
| `input_ids` | `(seq_len,)` | K-mer token IDs |
| `attention_mask` | `(seq_len,)` | 1 for real tokens, 0 for padding |
| `position_ids` | `(seq_len,)` | Absolute genomic coordinates (0-indexed) |
| `het_values` | `(seq_len,)` | Heteroplasmy levels, 0.0 if not provided |

### Circular windowing

The full genome (16,569 bp) is too long for a 512-token context window. Instead, `MtDNADataset` tiles the genome with overlapping windows:

- Window size: 512 tokens
- Stride: 256 tokens (50% overlap)
- Windows per genome: ceil(16569 / 256) = 65

Each window receives a `[CLS]` token prepended, so the actual context is 513 tokens. The `position_ids` in each window are **absolute genomic coordinates** (not window-relative), so the circular positional encoding maps each token to the correct angular position on the genome.

### Junction handling

With `circular=True`, tokenization wraps around the genome junction at position 16568/0. Before k-merizing, the last k-1 = 5 bases are prepended to the sequence:

```
seq_circular = seq[-5:] + seq  # 16,574 bp
```

This ensures that the k-mers at positions 16564-16568 (which overlap the junction) are computed correctly.
Without this step, those positions would yield partial k-mers that don't appear in the vocabulary. The wrapped junction tokens are assigned `position_ids` 16564-16568, not positions that exceed `genome_length`. The circular positional encoding (PE) handles the topology.

---

## Heteroplasmy Channel

Heteroplasmy is the presence of two or more mitochondrial DNA variants within a single cell (e.g., 80% wild-type copies, 20% mutant copies). Standard sequence models expect one definitive base at each position; the heteroplasmy channel extends the model to handle continuous mixtures.

### Input format

`het_levels` is an optional `np.ndarray` of shape `(genome_length,)` with float values in `[0, 1]`. Each value is the fraction of mtDNA copies carrying an alternate allele at that position.

- `0.0`: all copies are wild-type at this position
- `0.5`: 50/50 mixture (maximum heteroplasmy)
- `1.0`: all copies carry the alternate allele (homoplasmic variant)

In most sequences, `het_levels` is all zeros. The model handles this gracefully: the het projection then receives the same input at every position, so it adds one constant vector to every token embedding and carries no position-specific signal. That vector is exactly zero only if the projection has no bias and the `LayerNorm` shift is zero.

### How it feeds into the model

In `MtDNAEmbeddings`, the heteroplasmy scalar is projected into the embedding space:

```python
het_proj = self.het_norm(self.het_projection(het_values.unsqueeze(-1)))
emb = kmer_emb + circular_pe + het_proj
```

The projection is a `Linear(1, hidden_size)` layer followed by `LayerNorm`. This learned transformation allows the model to modulate the k-mer representation based on the heteroplasmy level at that position.

**Why a linear projection instead of discretization?** Discretizing (e.g., high/medium/low) introduces an arbitrary threshold and discards information. The continuous projection is learned end-to-end and preserves the full signal.
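
To make the projection concrete, here is a dependency-free sketch of a `Linear(1, hidden_size)` layer followed by layer normalization, applied to a single heteroplasmy value. The function name `het_project`, the tiny hidden size of 4, and all parameter values are illustrative assumptions; the real layer lives in `MtDNAEmbeddings`, its parameters are learned, and whether it uses a bias term is not stated here. The sketch shows what a het level of 0.0 contributes with and without a bias.

```python
import math


def het_project(h, weight, bias, gamma, beta, eps=1e-5):
    """Linear(1 -> hidden) followed by LayerNorm over the hidden dimension."""
    # Linear layer with a single input feature: x_j = h * w_j + b_j
    x = [h * w + b for w, b in zip(weight, bias)]
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance, as LayerNorm uses
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + bt for v, g, bt in zip(x, gamma, beta)]


# Made-up parameters for hidden_size = 4 (illustrative only).
weight = [0.5, -0.2, 0.1, 0.3]
zero_bias = [0.0] * 4
some_bias = [0.0, 0.1, -0.1, 0.2]
gamma = [1.0] * 4  # LayerNorm scale at its usual initial value
beta = [0.0] * 4   # LayerNorm shift at its usual initial value

# No bias and zero shift: a het level of 0.0 contributes exactly nothing.
silent = het_project(0.0, weight, zero_bias, gamma, beta)
print(all(v == 0.0 for v in silent))  # True

# With a bias, het level 0.0 yields one fixed non-zero vector,
# identical at every position that has het level 0.0.
offset = het_project(0.0, weight, some_bias, gamma, beta)
print(any(v != 0.0 for v in offset))  # True

# A non-zero het level produces a different vector from the wild-type one.
mixed = het_project(0.2, weight, some_bias, gamma, beta)
print(mixed != offset)  # True
```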

---

## Vocabulary Statistics

For the human mtDNA corpus (HmtDB, ~47k sequences):

| Statistic | Value |
|---|---|
| Total unique k-mers observed | 4,068 of 4,096 (99.3%) |
| K-mers never observed | 28 (all contain unusual base combinations) |
| Mean tokens per genome | 16,564 (≈ genome_length − k + 1) |
| Most frequent k-mer | varies by GC content; poly-C tracts (D-loop) dominate |
| `[UNK]` token rate | < 0.1% of tokens (N bases are rare in curated HmtDB) |

Because the vocabulary enumerates all 4,096 possible 6-mers, an ACGT-only k-mer can never be out of vocabulary. The near-complete coverage adds that almost every k-mer embedding receives training signal, so the model is unlikely to encounter untrained tokens even on divergent sequences (Neanderthal, Denisovan) not seen during training.
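
Statistics like those in the table can be gathered with a simple counting pass. The sketch below is an assumption about how such a tally might look, not the script used for HmtDB: `circular_kmers` and `kmer_coverage` are hypothetical helpers, the wrap-around follows the prepend step from the junction-handling section, and two toy sequences with `k=2` stand in for real genomes.

```python
from collections import Counter


def circular_kmers(seq: str, k: int = 6) -> list[str]:
    """Return one k-mer per base of a circular sequence (assumes len(seq) >= k)."""
    # Prepend the last k-1 bases so that k-mers spanning the junction are complete.
    extended = seq[len(seq) - (k - 1):] + seq
    return [extended[i:i + k] for i in range(len(seq))]


def kmer_coverage(genomes: list[str], k: int = 6) -> dict[str, float]:
    """Count distinct ACGT-only k-mers and the share of N-containing ([UNK]) tokens."""
    counts = Counter()
    unk = 0
    total = 0
    for seq in genomes:
        for kmer in circular_kmers(seq.upper(), k):
            total += 1
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
            else:
                unk += 1  # would be encoded as [UNK]
    return {
        "unique_kmers": len(counts),
        "possible_kmers": 4 ** k,
        "coverage": len(counts) / 4 ** k,
        "unk_rate": unk / total if total else 0.0,
    }


# Toy corpus: two 6-base "genomes", one with an ambiguous base.
stats = kmer_coverage(["ACGTAC", "ACGNAC"], k=2)
print(stats["unique_kmers"])    # 5
print(stats["possible_kmers"])  # 16
print(stats["coverage"])        # 0.3125
```

With circular k-merization each genome yields exactly `len(seq)` tokens, so the token counts from this sketch differ by k-1 per genome from a linear pass over the same sequences.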