---
license: mit
datasets:
- DNA-LLM/vae_trainset
- Hack90/virus_dna_dataset
- dnagpt/human_genome_GCF_009914755.1
language:
- en
tags:
- biology
- alphafold
- bio-compute
---

# Biosaic Tokenizer

## Overview
Biosaic (Bio-Mosaic) is a tokenizer library built for [Enigma2](https://github.com/shivendrra/enigma2). It contains a tokenizer and an embedder for DNA and amino-acid (protein) sequences, with VQ-VAE and Evoformer based encoders that convert sequences into embeddings and back for model-training use cases.

## Features
- **Tokenization:** converts sequences into k-mers *(DNA only)*; a sketch follows this list.
- **Encoding:** converts sequences into embeddings for classification and training.
- **Easy to use:** a minimal, straightforward API.
- **SoTA encoders:** the Evoformer & VQ-VAE models are inspired by [AlphaFold-2](https://www.biorxiv.org/content/10.1101/2024.12.02.626366v1.full).
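
The library's exact tokenizer API isn't documented in this card, so the snippet below is only a minimal sketch of what k-mer tokenization over the DNA alphabet looks like; `build_kmer_vocab` and `tokenize` are hypothetical names, not Biosaic functions.

```python
from itertools import product

def build_kmer_vocab(k: int) -> dict[str, int]:
  # all 4^k possible k-mers over the DNA alphabet
  return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}

def tokenize(seq: str, k: int, vocab: dict[str, int]) -> list[int]:
  # sliding window of size k with stride 1
  return [vocab[seq[i:i + k]] for i in range(len(seq) - k + 1)]

vocab = build_kmer_vocab(k=3)
print(tokenize("ACGTAC", k=3, vocab=vocab))   # [6, 27, 44, 49]
```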

## Models

Biosaic ships two different models:
  - for DNA tokenization & encoding: **VQ-VAE**
  - for protein encodings: **EvoFormer**

The **VQ-VAE** is around 160M parameters (the current checkpoint is only ~40M, used for test runs).
The **EvoFormer** is around 136M parameters (still in training).


### Config:

#### For VQ-VAE:
```python
class ModelConfig:
  d_model: int = 768     # hidden / embedding width
  in_dim: int = 4        # one-hot DNA input (A, C, G, T)
  beta: float = 0.15     # VQ commitment-loss weight
  dropout: float = 0.25
  n_heads: int = 16
  n_layers: int = 12
```
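
Here `beta` is the standard VQ-VAE commitment-loss weight. As a hedged illustration of how that hyperparameter is typically used (a generic vector-quantization layer, not Biosaic's actual implementation):

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook, beta=0.15):
  # z: (B, T, d_model) encoder outputs; codebook: (K, d_model) code vectors
  dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
  idx = dists.argmin(dim=-1)        # index of nearest code per position
  z_q = codebook[idx]               # quantized latents, same shape as z
  # codebook term pulls codes toward encoder outputs; the beta-weighted
  # commitment term keeps the encoder close to its chosen codes
  loss = F.mse_loss(z_q, z.detach()) + beta * F.mse_loss(z, z_q.detach())
  z_q = z + (z_q - z).detach()      # straight-through gradient estimator
  return z_q, idx, loss
```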

#### For EvoFormer:
```python
import torch

class ModelConfig:
  DEVICE   = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  A        = 4      # DNA alphabet size
  C        = 21     # amino-acid alphabet size (21 letters vs. 4 for DNA)
  d_msa    = 768    # MSA representation width
  d_pair   = 256    # pair representation width
  n_heads  = 32
  n_blocks = 28
```
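
For orientation, an AlphaFold-style Evoformer keeps two tensors whose widths these fields set; the batch layout below is an assumption for illustration, not a documented interface:

```python
import torch

B, N_SEQ, L = 2, 32, 256               # batch, MSA depth, sequence length
msa  = torch.zeros(B, N_SEQ, L, 768)   # MSA representation, width d_msa
pair = torch.zeros(B, L, L, 256)       # pairwise representation, width d_pair
# each of the n_blocks Evoformer blocks updates both tensors, with row/column
# attention over `msa` and triangle-style updates over `pair`
```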

## Training:

For training the ``VQ-VAE`` & ``EvoFormer`` models, batch training is preferred. Each model has its own separate ``Dataset`` class that takes raw sequence strings, one-hot encodes the DNA sequences, and then fills them into batches according to the ``train`` & ``val`` splits, with validation taking around 20% of the full dataset; a sketch of this pipeline follows.
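
The actual ``Dataset`` class isn't reproduced in this card; the following is a minimal sketch of the pipeline just described (one-hot DNA encoding plus an ~80/20 train/val split), with hypothetical names:

```python
import torch
from torch.utils.data import Dataset

DNA = {c: i for i, c in enumerate("ACGT")}

def one_hot(seq: str) -> torch.Tensor:
  # (len(seq), 4) one-hot encoding of a DNA string
  idx = torch.tensor([DNA[c] for c in seq])
  return torch.nn.functional.one_hot(idx, num_classes=4).float()

class SequenceDataset(Dataset):
  def __init__(self, seqs, block_size=256):
    # keep only sequences long enough, truncated to block_size
    self.data = [one_hot(s[:block_size]) for s in seqs if len(s) >= block_size]

  def __len__(self):
    return len(self.data)

  def __getitem__(self, i):
    return self.data[i]

# ~80/20 train/val split, as described above
seqs = ["ACGT" * 64] * 10          # placeholder sequences, 256 bases each
n = int(0.8 * len(seqs))
train_ds, val_ds = SequenceDataset(seqs[:n]), SequenceDataset(seqs[n:])
```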

#### For VQ-VAE:
```python
import torch

class TrainConfig:
  device        = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  learning_rate = 1e-4         # bumped from 1e-5
  weight_decay  = 1e-4
  amsgrad       = True
  warmup_epochs = 50           # linear warm-up
  epochs        = 2000
  eval_interval = 100
  eval_iters    = 30
  batch_size    = 6
  block_size    = 256          # sequence length per sample
```
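
A sketch of how these fields could drive the optimizer and the linear warm-up (a common pattern; the real training loop may differ):

```python
import torch

cfg = TrainConfig()
model = torch.nn.Linear(4, 4)      # stand-in for the actual VQ-VAE
optim = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate,
                          weight_decay=cfg.weight_decay, amsgrad=cfg.amsgrad)
# linear warm-up: scale LR from ~0 up to learning_rate over warmup_epochs
sched = torch.optim.lr_scheduler.LambdaLR(
    optim, lambda e: min(1.0, (e + 1) / cfg.warmup_epochs))

for epoch in range(cfg.epochs):
  # ... forward/backward over batches of shape (batch_size, block_size, 4) ...
  optim.step()
  optim.zero_grad()
  sched.step()
```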

#### For EvoFormer:
```python
import torch

class TrainConfig:
  DEVICE       = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  LR           = 1e-4
  WD           = 1e-4
  AMS          = True     # AMSGrad variant of Adam
  WARMUP       = 50       # linear warm-up epochs
  EPOCHS       = 500
  BATCH        = 8
  MSA_SEQ      = 32       # number of sequences in each MSA
  L_SEQ        = 256      # length of each sequence
  EVAL_ITERS   = 5
  EVAL_INTV    = 50       # evaluate every 50 epochs
```
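
`EVAL_INTV` and `EVAL_ITERS` suggest the usual periodic-evaluation pattern; the helper below is a hypothetical sketch that assumes the model returns its loss directly:

```python
import torch

@torch.no_grad()
def estimate_loss(model, get_batch, iters):
  # average validation loss over a few randomly drawn batches
  model.eval()
  losses = [model(get_batch("val")).item() for _ in range(iters)]
  model.train()
  return sum(losses) / len(losses)

# inside the training loop, evaluated every EVAL_INTV epochs:
# if epoch % TrainConfig.EVAL_INTV == 0:
#   val_loss = estimate_loss(model, get_batch, TrainConfig.EVAL_ITERS)
```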