---
license: mit
datasets:
- DNA-LLM/vae_trainset
- Hack90/virus_dna_dataset
- dnagpt/human_genome_GCF_009914755.1
language:
- en
tags:
- biology
- alphafold
- bio-compute
---

# Biosaic Tokenizer

## Overview

Biosaic (Bio-Mosaic) is a tokenizer library built for [Enigma2](https://github.com/shivendrra/enigma2). It contains a tokenizer and an embedder for DNA and amino-acid (protein) sequences, with VQ-VAE and Evoformer based encoders that can convert sequences into embeddings and back for model-training use cases.

## Features

- **Tokenization:** converts sequences into k-mers *(DNA only; a minimal sketch appears at the end of this README)*.
- **Encoding:** converts sequences into embeddings for classification and training purposes.
- **Easy to use:** a small, straightforward library.
- **SoTA encoders:** the Evoformer and VQ-VAE models are inspired by [AlphaFold-2](https://www.biorxiv.org/content/10.1101/2024.12.02.626366v1.full).

## Models

It has two different models:

- **VQ-VAE** for DNA tokenization and encoding
- **EvoFormer** for protein encodings

**VQ-VAE** is targeted at around 160M parameters (for now it is only around 40M, for test runs). **EvoFormer** is around 136M parameters (still in training).

### Config:

#### For VQ-VAE:

```python
class ModelConfig:
    d_model: int = 768
    in_dim: int = 4        # one channel per nucleotide (A/C/G/T)
    beta: float = 0.15     # VQ commitment weight (see the sketch at the end of this README)
    dropout: float = 0.25
    n_heads: int = 16
    n_layers: int = 12
```

#### For EvoFormer:

```python
import torch

class ModelConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = 4          # DNA alphabet size
    C = 21         # 21 letters for amino acids, 4 for DNA
    d_msa = 768
    d_pair = 256
    n_heads = 32
    n_blocks = 28
```

## Training:

For training the `VQ-VAE` and `EvoFormer` models, batch training is preferred. Each model has its own separate `Dataset` class that takes strings as input, one-hot encodes the DNA sequences, and then fills them into batches according to the `train` and `val` splits, with the validation split being around 20% of the full dataset.

#### For VQ-VAE:

```python
import torch

class TrainConfig:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    learning_rate = 1e-4   # bumped from 1e-5
    weight_decay = 1e-4
    amsgrad = True
    warmup_epochs = 50     # linear warm-up
    epochs = 2000
    eval_interval = 100
    eval_iters = 30
    batch_size = 6
    block_size = 256
```

#### For EvoFormer:

```python
import torch

class TrainConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    LR = 1e-4
    WD = 1e-4
    AMS = True
    WARMUP = 50
    EPOCHS = 500
    BATCH = 8
    MSA_SEQ = 32   # number of sequences in each MSA
    L_SEQ = 256    # length of each sequence
    EVAL_ITERS = 5
    EVAL_INTV = 50
```
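As a rough picture of the batching pipeline described above, here is a minimal sketch of such a dataset class. The class name, vocabulary mapping, and split helper are illustrative assumptions rather than Biosaic's actual API, and it assumes sequences contain only A/C/G/T:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, random_split

DNA_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}  # hypothetical mapping

class DNADataset(Dataset):
    """One-hot encodes raw DNA strings and serves fixed-length blocks."""
    def __init__(self, sequences, block_size=256):
        self.block_size = block_size
        # keep only sequences long enough to fill one block
        self.sequences = [s.upper() for s in sequences if len(s) >= block_size]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx][: self.block_size]
        indices = torch.tensor([DNA_VOCAB[ch] for ch in seq], dtype=torch.long)
        # (block_size, 4): one channel per nucleotide, matching in_dim = 4 above
        return F.one_hot(indices, num_classes=4).float()

# ~80/20 train/val split, as described above (helper name is an assumption)
def make_splits(dataset, val_frac=0.2):
    n_val = int(len(dataset) * val_frac)
    return random_split(dataset, [len(dataset) - n_val, n_val])
```

A `DataLoader` with `batch_size = 6` over the train split would then yield `(6, 256, 4)` tensors for the VQ-VAE.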
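Both configs also specify a 50-epoch linear warm-up. One way to realize that in PyTorch follows; the optimizer and `LinearLR` scheduler choices here are assumptions, and the actual training loop may implement warm-up differently:

```python
import torch

model = torch.nn.Linear(4, 768)  # placeholder model, for illustration only
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, weight_decay=1e-4, amsgrad=True
)
# ramp the learning rate linearly over the first 50 epochs, then hold at 1e-4
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, end_factor=1.0, total_iters=50
)
for epoch in range(2000):
    # ... forward pass, loss, backward, optimizer.step() ...
    scheduler.step()
```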
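For completeness, the k-mer tokenization listed under Features amounts to sliding a fixed-width window along the sequence. A minimal sketch, where the function name and the `k` and `stride` parameters are illustrative rather than Biosaic's actual API:

```python
def kmer_tokenize(sequence: str, k: int = 4, stride: int = 1) -> list[str]:
    """Slide a window of width k along the sequence, emitting one k-mer per step."""
    sequence = sequence.upper()
    return [sequence[i : i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ATGCGT", k=3))  # ['ATG', 'TGC', 'GCG', 'CGT']
```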
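Finally, the `beta` field in the VQ-VAE `ModelConfig` is, in the standard VQ-VAE formulation (van den Oord et al., 2017), the weight on the commitment term of the quantizer loss. A sketch of that term under the usual formulation, not necessarily Biosaic's exact code:

```python
import torch
import torch.nn.functional as F

def vq_loss(z_e: torch.Tensor, z_q: torch.Tensor, beta: float = 0.15) -> torch.Tensor:
    # codebook loss: pull the selected codebook vectors toward the (frozen) encoder outputs
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    # commitment loss: keep encoder outputs close to their (frozen) codes, weighted by beta
    commitment_loss = F.mse_loss(z_e, z_q.detach())
    return codebook_loss + beta * commitment_loss
```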