---
license: mit
library_name: pytorch
tags:
- biology
- dna
- codon-optimization
- protein-conditioned-generation
- fsdp
datasets:
- alegendaryfish/CodonTranslator-data
---
# CodonTranslator
CodonTranslator is a protein-conditioned codon sequence generation model trained on the representative-only `data_v3` release.
This repository is the public model and training-code release. It contains:
- `final_model/`: inference-ready weights
- `src/`, `train.py`, `sampling.py`: training and inference code
- `resplit_data_v3.py`: the `data_v3` reconstruction pipeline
- `slurm/`: the single-node H200 training and data rebuild submission scripts
- `CodonTranslator/` and `pyproject.toml`: a lightweight packaged inference wrapper
## Training configuration
- Architecture: `hidden=750`, `layers=20`, `heads=15`, `mlp_ratio=3.2`
- Attention: `mha`
- Precision: `bf16`
- Parallelism: FSDP full shard
- Effective global batch: `1536`
- Weight decay: `1e-4`
- Dataset: `alegendaryfish/CodonTranslator-data`
## Dataset release
The corresponding public dataset and species embedding release is:
- `alegendaryfish/CodonTranslator-data`
That dataset repo contains:
- final representative-only `train/`, `val/`, `test/` parquet shards
- `embeddings_v2/`
- split audit files and reconstruction metadata
## Quick start
### Install
```bash
git clone https://huggingface.co/alegendaryfish/CodonTranslator
cd CodonTranslator
conda env create -f environment.yml
conda activate codontranslator
pip install -r requirements.txt
pip install -e .
```
Both import styles are supported:
```python
from CodonTranslator import CodonTranslator
```
```python
from codontranslator import CodonTranslator
```
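As a quick post-install sanity check, both paths should import cleanly. A minimal sketch (whether the two paths resolve to the same class object is an assumption about how the wrapper is packaged):
```python
# Post-install sanity check: both import paths listed above should resolve.
# Whether they point at the same class object is an assumption about the
# packaging, so this only prints what each path exposes.
from CodonTranslator import CodonTranslator as CamelImport
from codontranslator import CodonTranslator as LowerImport

print(CamelImport)
print(LowerImport)
```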
### Train
```bash
python train.py \
    --train_data /path/to/train \
    --val_data /path/to/val \
    --embeddings_dir /path/to/embeddings_v2 \
    --output_dir outputs \
    --fsdp \
    --bf16 \
    --attn mha \
    --hidden 750 \
    --layers 20 \
    --heads 15 \
    --mlp_ratio 3.2 \
    --batch_size 48 \
    --grad_accum 4 \
    --epochs 3 \
    --lr 7e-5 \
    --weight_decay 1e-4
```
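With these flags on the single-node 8×H200 setup used by the Slurm launchers below, the effective global batch works out to 48 × 4 × 8 = 1536, matching the training configuration listed above.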
The included Slurm launchers use the same training flags as the local single-node H200 workflow:
- `slurm/train_v3_h200_8x_single.sbatch`
- `slurm/submit_train_v3_h200_8x_chain.sh`
### Sample
```bash
python sampling.py \
    --model_path final_model \
    --embeddings_dir /path/to/embeddings_v2 \
    --species "Panicum hallii" \
    --protein_sequence "MSEQUENCE" \
    --strict_species_lookup
```
## Notes
- Training uses precomputed `embeddings_v2` for species conditioning.
- The data split is built in protein space with MMseqs clustering and a binomial-species test holdout.
- `final_model/` is the published inference entrypoint.
- For compatibility, released model directories contain both `trainer_config.json` and `config.json`.
## Sampling arguments
- `enforce_mapping`: when `True`, each generated codon is constrained to encode the provided amino acid at that position.
- `temperature`: softmax temperature. Lower values are more deterministic; `0` selects argmax greedily.
- `top_k`: keep only the `k` highest-logit codon candidates before sampling.
- `top_p`: nucleus sampling threshold; keep the smallest set of highest-probability codons whose cumulative probability reaches `p`.
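These knobs compose in the usual way for categorical decoding. Below is a minimal sketch of a single decoding step in plain PyTorch; it is illustrative only, not the code in `sampling.py`, and the function name and the `allowed_codon_ids` mask (standing in for `enforce_mapping`) are hypothetical.
```python
# Illustrative single-step decoder showing how the arguments above interact.
# This is a sketch, not the implementation in sampling.py; the function name
# and the `allowed_codon_ids` mask (standing in for enforce_mapping) are
# hypothetical.
import torch
import torch.nn.functional as F

def sample_codon(logits, allowed_codon_ids=None, temperature=1.0, top_k=0, top_p=1.0):
    """Pick one codon id from a 1-D logits tensor."""
    logits = logits.clone()

    # enforce_mapping: restrict to codons that encode the required amino acid.
    if allowed_codon_ids is not None:
        mask = torch.full_like(logits, float("-inf"))
        mask[allowed_codon_ids] = 0.0
        logits = logits + mask

    # temperature == 0 falls back to greedy argmax.
    if temperature == 0:
        return int(torch.argmax(logits))
    logits = logits / temperature

    # top_k: keep only the k highest-logit candidates.
    if top_k > 0:
        kth_best = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits[logits < kth_best] = float("-inf")

    # top_p: keep the smallest set of candidates whose cumulative probability reaches p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        drop = cumulative > top_p
        drop[1:] = drop[:-1].clone()  # always keep the token that crosses the threshold
        drop[0] = False
        logits[sorted_idx[drop]] = float("-inf")

    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```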