|
|
---
license: mit
datasets:
- DNA-LLM/vae_trainset
- Hack90/virus_dna_dataset
- dnagpt/human_genome_GCF_009914755.1
language:
- en
tags:
- biology
- alphafold
- bio-compute
---
|
|
|
|
|
# Biosaic Tokenizer |
|
|
|
|
|
## Overview |
|
|
Biosaic (Bio-Mosaic) is a tokenizer library built for [Enigma2](https://github.com/shivendrra/enigma2). It provides a tokenizer and an embedder for DNA and amino-acid (protein) sequences, with VQ-VAE and Evoformer based encoders that convert sequences into embeddings and back for model-training use cases.
|
|
|
|
|
## Features |
|
|
- **Tokenization:** converts sequences into k-mers *(DNA only; see the sketch after this list)*
- **Encoding:** converts sequences into embeddings for classification and training purposes.
- **Easy use:** a simple, easy-to-use library.
- **SoTA encoders:** the Evoformer & VQ-VAE models are inspired by [AlphaFold-2](https://www.biorxiv.org/content/10.1101/2024.12.02.626366v1.full).
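
As a rough illustration of the k-mer tokenization step (a sketch only, not the actual Biosaic API), a DNA sequence can be split into overlapping k-mers and looked up in a fixed 4^k vocabulary:

```python
# Minimal k-mer tokenization sketch (illustration only, not the Biosaic API).
from itertools import product

def kmer_tokenize(seq: str, k: int = 4) -> list[str]:
    """Split a sequence into all overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

K = 4
# Fixed vocabulary of all 4**K possible DNA k-mers.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

tokens = kmer_tokenize("ACGTACGTAC", k=K)
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT', 'CGTA', 'GTAC']
```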
|
|
|
|
|
## Models |
|
|
|
|
|
It has two different models:

- for DNA tokenization & encoding: **VQ-VAE**
- for protein encodings: **EvoFormer**
|
|
|
|
|
**VQ-VAE** targets roughly 160M parameters (the current test-run checkpoint is only around 40M).

**EvoFormer** is roughly 136M parameters (still in training).
|
|
|
|
|
|
|
|
### Configs:
|
|
|
|
|
```python
# VQ-VAE config (DNA)
class ModelConfig:
    d_model: int = 768
    in_dim: int = 4        # one-hot DNA input (A, C, G, T)
    beta: float = 0.15     # commitment-loss weight
    dropout: float = 0.25
    n_heads: int = 16
    n_layers: int = 12
```
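
To make the role of `beta` concrete, here is a minimal vector-quantization sketch (not the Biosaic implementation; the codebook size of 512 is an assumption) showing how `beta` weights the commitment term of the VQ-VAE loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes: int = 512, d_model: int = 768, beta: float = 0.15):
        super().__init__()
        self.beta = beta
        self.codebook = nn.Embedding(n_codes, d_model)

    def forward(self, z):                      # z: (batch, seq, d_model)
        flat = z.reshape(-1, z.size(-1))       # (batch*seq, d_model)
        dist = torch.cdist(flat, self.codebook.weight)  # distance to every code
        idx = dist.argmin(dim=-1)              # nearest code per position
        q = self.codebook(idx).view_as(z)
        # Codebook term pulls codes toward encoder outputs; the commitment
        # term (weighted by beta) keeps encoder outputs near their codes.
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()               # straight-through gradient
        return q, idx, loss
```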
|
|
|
|
|
```python
import torch

# EvoFormer config (proteins)
class ModelConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = 4        # DNA alphabet size
    C = 21       # amino-acid alphabet size (4 for DNA)
    d_msa = 768
    d_pair = 256
    n_heads = 32
    n_blocks = 28
```
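
As in AlphaFold-2's Evoformer, the config above implies two representation streams: a per-sequence MSA representation of width `d_msa` and a pairwise residue representation of width `d_pair`. A quick shape sketch (illustrative values only):

```python
import torch

n_seq, L = 32, 256                       # sequences per MSA, residues per sequence
msa_repr = torch.zeros(n_seq, L, 768)    # MSA stream, feature width d_msa
pair_repr = torch.zeros(L, L, 256)       # pair stream, feature width d_pair
print(msa_repr.shape, pair_repr.shape)   # (32, 256, 768) (256, 256, 256)
```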
|
|
|
|
|
## Training: |
|
|
|
|
|
For training the ``VQ-VAE`` & ``EvoFormer`` models, batch training is preferred. Each model has its own separate ``Dataset`` class that takes raw sequence strings, one-hot encodes them, and fills them into batches according to the ``train`` & ``val`` splits, where validation is around 20% of the full dataset.
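
A minimal sketch of such a dataset (class and helper names here are assumptions, not the actual Biosaic code): raw DNA strings are one-hot encoded and split roughly 80/20 into train and validation:

```python
import torch
from torch.utils.data import Dataset

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

class DNADataset(Dataset):
    """One-hot encodes raw DNA strings, truncated to block_size."""
    def __init__(self, sequences, block_size=256):
        self.seqs = [s[:block_size] for s in sequences]

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, i):
        idx = torch.tensor([BASES[b] for b in self.seqs[i]])
        return torch.nn.functional.one_hot(idx, num_classes=4).float()

data = ["ACGT" * 64] * 100            # dummy corpus
split = int(0.8 * len(data))          # hold out ~20% for validation
train_ds, val_ds = DNADataset(data[:split]), DNADataset(data[split:])
```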
|
|
|
|
|
#### For VQ-VAE: |
|
|
```python
import torch

class TrainConfig:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    learning_rate = 1e-4   # bumped from 1e-5
    weight_decay = 1e-4
    amsgrad = True
    warmup_epochs = 50     # linear warm-up
    epochs = 2000
    eval_interval = 100
    eval_iters = 30
    batch_size = 6
    block_size = 256
```
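
One way to wire these fields up (a sketch under the assumption of AdamW plus a per-epoch linear warm-up; the training-loop body is elided):

```python
import torch

cfg = TrainConfig()
model = torch.nn.Linear(4, 768)    # stand-in for the actual VQ-VAE

optim = torch.optim.AdamW(
    model.parameters(),
    lr=cfg.learning_rate,
    weight_decay=cfg.weight_decay,
    amsgrad=cfg.amsgrad,
)
# Scale the LR linearly from ~0 to full over warmup_epochs, then hold it.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optim, lambda e: min(1.0, (e + 1) / cfg.warmup_epochs)
)

for epoch in range(cfg.epochs):
    ...                            # forward / loss / backward / optim.step()
    scheduler.step()
```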
|
|
|
|
|
#### For EvoFormer: |
|
|
```python
import torch

class TrainConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    LR = 1e-4
    WD = 1e-4
    AMS = True
    WARMUP = 50
    EPOCHS = 500
    BATCH = 8
    MSA_SEQ = 32    # number of sequences in each MSA
    L_SEQ = 256     # length of each sequence
    EVAL_ITERS = 5
    EVAL_INTV = 50
```
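
A quick smoke test of the batch shapes these fields imply (illustration only):

```python
import torch

C = 21                                        # amino-acid alphabet size
batch = torch.randint(0, C, (8, 32, 256))     # (BATCH, MSA_SEQ, L_SEQ)
one_hot = torch.nn.functional.one_hot(batch, C).float()
print(one_hot.shape)                          # torch.Size([8, 32, 256, 21])
```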