|
|
---
license: mit
datasets:
- DNA-LLM/vae_trainset
- Hack90/virus_dna_dataset
- dnagpt/human_genome_GCF_009914755.1
language:
- en
tags:
- biology
- alphafold
- bio-compute
---
|
|
|
|
|
# Biosaic Tokenizer |
|
|
|
|
|
## Overview |
|
|
Biosaic (Bio-Mosaic) is a tokenizer library built for [Enigma2](https://github.com/shivendrra/enigma2). It provides a tokenizer and an embedder for DNA and amino-acid (protein) sequences, with VQ-VAE and Evoformer based encoders that convert sequences into embeddings and back for model-training use cases.
|
|
|
|
|
## Features |
|
|
- **Tokenization:** converts sequences into k-mers *(DNA only; see the sketch after this list)*
- **Encoding:** converts sequences into embeddings for classification and training purposes.
- **Easy use:** a simple, easy-to-use library.
- **SoTA encoders:** the Evoformer & VQ-VAE models are inspired by [AlphaFold-2](https://www.biorxiv.org/content/10.1101/2024.12.02.626366v1.full).
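
As a rough illustration of the k-mer tokenization step (a sketch only, not the actual Biosaic API), a DNA sequence can be split into overlapping k-mers and looked up in a fixed 4^k vocabulary:

```python
# Minimal k-mer tokenization sketch (illustration only, not the Biosaic API).
from itertools import product

def kmer_tokenize(seq: str, k: int = 4) -> list[str]:
    """Split a sequence into all overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

K = 4
# Fixed vocabulary of all 4**K possible DNA k-mers.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

tokens = kmer_tokenize("ACGTACGTAC", k=K)
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT', 'CGTA', 'GTAC']
```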
|
|
|
|
|
## Models |
|
|
|
|
|
It has two different models:

- for DNA tokenization & encoding: **VQ-VAE**
- for protein encodings: **EvoFormer**
|
|
|
|
|
**VQ-VAE** targets roughly 160M parameters (the current test-run checkpoint is only around 40M).

**EvoFormer** is roughly 136M parameters (still in training).
|
|
|
|
|
|
|
|
### Configs:
|
|
|
|
|
```python
# VQ-VAE config (DNA)
class ModelConfig:
    d_model: int = 768
    in_dim: int = 4        # one-hot DNA input (A, C, G, T)
    beta: float = 0.15     # commitment-loss weight
    dropout: float = 0.25
    n_heads: int = 16
    n_layers: int = 12
```
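
To make the role of `beta` concrete, here is a minimal vector-quantization sketch (not the Biosaic implementation; the codebook size of 512 is an assumption) showing how `beta` weights the commitment term of the VQ-VAE loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes: int = 512, d_model: int = 768, beta: float = 0.15):
        super().__init__()
        self.beta = beta
        self.codebook = nn.Embedding(n_codes, d_model)

    def forward(self, z):                      # z: (batch, seq, d_model)
        flat = z.reshape(-1, z.size(-1))       # (batch*seq, d_model)
        dist = torch.cdist(flat, self.codebook.weight)  # distance to every code
        idx = dist.argmin(dim=-1)              # nearest code per position
        q = self.codebook(idx).view_as(z)
        # Codebook term pulls codes toward encoder outputs; the commitment
        # term (weighted by beta) keeps encoder outputs near their codes.
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()               # straight-through gradient
        return q, idx, loss
```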
|
|
|
|
|
```python
import torch

# EvoFormer config (proteins)
class ModelConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = 4        # DNA alphabet size
    C = 21       # amino-acid alphabet size (4 for DNA)
    d_msa = 768
    d_pair = 256
    n_heads = 32
    n_blocks = 28
```
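
As in AlphaFold-2's Evoformer, the config above implies two representation streams: a per-sequence MSA representation of width `d_msa` and a pairwise residue representation of width `d_pair`. A quick shape sketch (illustrative values only):

```python
import torch

n_seq, L = 32, 256                       # sequences per MSA, residues per sequence
msa_repr = torch.zeros(n_seq, L, 768)    # MSA stream, feature width d_msa
pair_repr = torch.zeros(L, L, 256)       # pair stream, feature width d_pair
print(msa_repr.shape, pair_repr.shape)   # (32, 256, 768) (256, 256, 256)
```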
|
|
|
|
|
## Training: |
|
|
|
|
|
For training the ``VQ-VAE`` & ``EvoFormer`` models, batch training is preferred. Each model has its own separate ``Dataset`` class that takes raw sequence strings, one-hot encodes them, and fills them into batches according to the ``train`` & ``val`` splits, where validation is around 20% of the full dataset.
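
A minimal sketch of such a dataset (class and helper names here are assumptions, not the actual Biosaic code): raw DNA strings are one-hot encoded and split roughly 80/20 into train and validation:

```python
import torch
from torch.utils.data import Dataset

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

class DNADataset(Dataset):
    """One-hot encodes raw DNA strings, truncated to block_size."""
    def __init__(self, sequences, block_size=256):
        self.seqs = [s[:block_size] for s in sequences]

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, i):
        idx = torch.tensor([BASES[b] for b in self.seqs[i]])
        return torch.nn.functional.one_hot(idx, num_classes=4).float()

data = ["ACGT" * 64] * 100            # dummy corpus
split = int(0.8 * len(data))          # hold out ~20% for validation
train_ds, val_ds = DNADataset(data[:split]), DNADataset(data[split:])
```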
|
|
|
|
|
#### For VQ-VAE: |
|
|
```python
import torch

class TrainConfig:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    learning_rate = 1e-4   # bumped from 1e-5
    weight_decay = 1e-4
    amsgrad = True
    warmup_epochs = 50     # linear warm-up
    epochs = 2000
    eval_interval = 100
    eval_iters = 30
    batch_size = 6
    block_size = 256
```
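
One way to wire these fields up (a sketch under the assumption of AdamW plus a per-epoch linear warm-up; the training-loop body is elided):

```python
import torch

cfg = TrainConfig()
model = torch.nn.Linear(4, 768)    # stand-in for the actual VQ-VAE

optim = torch.optim.AdamW(
    model.parameters(),
    lr=cfg.learning_rate,
    weight_decay=cfg.weight_decay,
    amsgrad=cfg.amsgrad,
)
# Scale the LR linearly from ~0 to full over warmup_epochs, then hold it.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optim, lambda e: min(1.0, (e + 1) / cfg.warmup_epochs)
)

for epoch in range(cfg.epochs):
    ...                            # forward / loss / backward / optim.step()
    scheduler.step()
```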
|
|
|
|
|
#### For EvoFormer: |
|
|
```python
import torch

class TrainConfig:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    LR = 1e-4
    WD = 1e-4
    AMS = True
    WARMUP = 50
    EPOCHS = 500
    BATCH = 8
    MSA_SEQ = 32    # number of sequences in each MSA
    L_SEQ = 256     # length of each sequence
    EVAL_ITERS = 5
    EVAL_INTV = 50
```
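
A quick smoke test of the batch shapes these fields imply (illustration only):

```python
import torch

C = 21                                        # amino-acid alphabet size
batch = torch.randint(0, C, (8, 32, 256))     # (BATCH, MSA_SEQ, L_SEQ)
one_hot = torch.nn.functional.one_hot(batch, C).float()
print(one_hot.shape)                          # torch.Size([8, 32, 256, 21])
```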