---
license: mit
datasets:
- DNA-LLM/vae_trainset
- Hack90/virus_dna_dataset
- dnagpt/human_genome_GCF_009914755.1
language:
- en
tags:
- biology
- alphafold
- bio-compute
---
# Biosaic Tokenizer
## Overview
Biosaic (Bio-Mosaic) is a tokenizer library built for [Enigma2](https://github.com/shivendrra/enigma2). It provides a tokenizer and an embedder for DNA and amino-acid (protein) sequences, along with VQ-VAE- and Evoformer-based encoders that convert sequences into embeddings and back for model-training use cases.
## Features
- **Tokenization:** converts sequences into k-mers *(DNA only; see the sketch after this list)*.
- **Encoding:** converts sequences into embeddings for classification and training purposes.
- **Easy to use:** a small, simple API.
- **SoTA encoders:** the Evoformer & VQ-VAE models are inspired by [AlphaFold-2](https://www.biorxiv.org/content/10.1101/2024.12.02.626366v1.full).
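To illustrate what k-mer tokenization means here, a minimal sketch of the idea (this is an illustrative helper, not Biosaic's actual API):
```python
def kmer_tokenize(seq: str, k: int = 4) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (illustrative, not Biosaic's API)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATGCGTAC", k=4))
# ['ATGC', 'TGCG', 'GCGT', 'CGTA', 'GTAC']
```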
## Models
Biosaic ships two models:
- **VQ-VAE** for DNA tokenization & encoding
- **EvoFormer** for protein encodings

The **VQ-VAE** targets roughly 160M parameters (the current checkpoint is only ~40M, for test runs).
The **EvoFormer** is around 136M parameters (still in training).
### Config:
For the **VQ-VAE**:
```python
class ModelConfig:
    d_model: int = 768    # embedding width
    in_dim: int = 4       # one-hot DNA input (A, C, G, T)
    beta: float = 0.15    # VQ commitment-loss weight
    dropout: float = 0.25
    n_heads: int = 16
    n_layers: int = 12
```
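Here `beta` weights the commitment term of the standard VQ-VAE objective. A minimal sketch of the quantization step under that assumption (illustrative, not the repository's implementation; `vector_quantize` is a hypothetical helper):
```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.15):
    """Nearest-codebook-entry quantization with the standard VQ-VAE loss.
    z_e: encoder output (N, d_model); codebook: (K, d_model). Hypothetical sketch."""
    dists = torch.cdist(z_e, codebook)   # squared-distance proxy, shape (N, K)
    idx = dists.argmin(dim=-1)           # nearest code per latent
    z_q = codebook[idx]                  # quantized latents (N, d_model)
    # codebook loss pulls codes toward encoder outputs;
    # beta-weighted commitment loss pulls encoder outputs toward codes
    loss = F.mse_loss(z_q, z_e.detach()) + beta * F.mse_loss(z_e, z_q.detach())
    # straight-through estimator: copy gradients from z_q back to z_e
    z_q = z_e + (z_q - z_e).detach()
    return z_q, idx, loss
```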
For the **EvoFormer**:
```python
import torch

class ModelConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = 4         # DNA alphabet size
    C = 21        # amino-acid alphabet size (vs. 4 for DNA)
    d_msa = 768   # MSA-representation width
    d_pair = 256  # pair-representation width
    n_heads = 32
    n_blocks = 28
```
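Like AlphaFold-2's Evoformer, the encoder presumably carries a per-residue MSA track and a residue-pair track, with `d_msa` and `d_pair` setting their widths. A shape sketch under that assumption (the batch and length values are illustrative):
```python
import torch

B, S, L = 2, 32, 256              # batch, sequences per MSA, residues (assumed values)
msa  = torch.zeros(B, S, L, 768)  # MSA representation, width d_msa
pair = torch.zeros(B, L, L, 256)  # pair representation, width d_pair
```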
## Training:
To train the ``VQ-VAE`` & ``EvoFormer`` models, batch training is preferred. Each model has its own separate ``Dataset`` class that takes raw sequence strings, one-hot encodes the DNA sequences, and fills batches according to the ``train`` & ``val`` splits (the validation split is around 20% of the full dataset); a sketch of such a class follows.
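A hypothetical sketch of that ``Dataset`` class (names like `DNADataset` and `DNA_VOCAB` are assumptions, not the repository's code):
```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset

DNA_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

class DNADataset(Dataset):
    """One-hot encodes raw DNA strings, as described above (hypothetical sketch)."""
    def __init__(self, sequences, block_size=256):
        self.sequences = sequences
        self.block_size = block_size

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, i):
        seq = self.sequences[i][:self.block_size].upper()
        idx = torch.tensor([DNA_VOCAB[b] for b in seq])
        return F.one_hot(idx, num_classes=4).float()  # (block_size, 4)

# ~80/20 train/val split, per the description above
seqs = ["ATGCGTAC" * 32, "GGCATTGA" * 32, "TTACGGAT" * 32]  # toy data
n_val = max(1, int(0.2 * len(seqs)))
train_ds, val_ds = DNADataset(seqs[:-n_val]), DNADataset(seqs[-n_val:])
```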
#### For VQ-VAE:
```python
import torch

class TrainConfig:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    learning_rate = 1e-4  # bumped from 1e-5
    weight_decay = 1e-4
    amsgrad = True
    warmup_epochs = 50    # linear warm-up
    epochs = 2000
    eval_interval = 100   # evaluate every 100 epochs
    eval_iters = 30       # batches averaged per evaluation
    batch_size = 6
    block_size = 256      # sequence length per sample
```
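How these settings might be wired together: `learning_rate`, `weight_decay`, and `amsgrad` map directly onto ``torch.optim.AdamW``, while `warmup_epochs` drives a linear warm-up schedule. A sketch under those assumptions (`model` and `get_batch` are hypothetical names, not the repository's code):
```python
import torch

cfg = TrainConfig()
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate,
                              weight_decay=cfg.weight_decay, amsgrad=cfg.amsgrad)
# linear warm-up for the first warmup_epochs, then a constant learning rate
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda e: min(1.0, (e + 1) / cfg.warmup_epochs))

for epoch in range(cfg.epochs):
    x = get_batch("train")   # hypothetical helper -> (batch_size, block_size, 4)
    loss = model(x)          # hypothetical forward returning the total loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    if epoch % cfg.eval_interval == 0:
        with torch.no_grad():
            val_loss = sum(model(get_batch("val")).item()
                           for _ in range(cfg.eval_iters)) / cfg.eval_iters
```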
#### For EvoFormer:
```python
import torch

class TrainConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    LR = 1e-4       # learning rate
    WD = 1e-4       # weight decay
    AMS = True      # AdamW amsgrad
    WARMUP = 50     # linear warm-up epochs
    EPOCHS = 500
    BATCH = 8
    MSA_SEQ = 32    # number of sequences in each MSA
    L_SEQ = 256     # length of each sequence
    EVAL_ITERS = 5
    EVAL_INTV = 50  # evaluate every 50 epochs
```