---
license: mit
library_name: pytorch
tags:
- biology
- dna
- codon-optimization
- protein-conditioned-generation
- fsdp
datasets:
- alegendaryfish/CodonTranslator-data
---

# CodonTranslator

CodonTranslator is a protein-conditioned codon sequence generation model trained on the representative-only `data_v3` release.

This repository is the public model and training-code release. It contains:

- `final_model/`: inference-ready weights
- `src/`, `train.py`, `sampling.py`: training and inference code
- `resplit_data_v3.py`: the `data_v3` reconstruction pipeline
- `slurm/`: the single-node H200 training and data-rebuild submission scripts
- `CodonTranslator/` and `pyproject.toml`: a lightweight packaged inference wrapper

## Training configuration

- Architecture: `hidden=750`, `layers=20`, `heads=15`, `mlp_ratio=3.2`
- Attention: `mha`
- Precision: `bf16`
- Parallelism: FSDP full shard
- Effective global batch: `1536`
- Weight decay: `1e-4`
- Dataset: `alegendaryfish/CodonTranslator-data`
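
The effective global batch follows from the per-device batch size, the gradient-accumulation steps, and the number of data-parallel ranks. A quick sanity check, assuming the 8-GPU single-node setup suggested by the `h200_8x` Slurm script names (the GPU count is an assumption, not stated in this card):

```python
# Effective global batch = per-device batch x grad-accumulation steps x data-parallel ranks.
per_device_batch = 48   # --batch_size in the train.py command below
grad_accum = 4          # --grad_accum
num_gpus = 8            # assumed FSDP world size (single node, 8x H200)
print(per_device_batch * grad_accum * num_gpus)  # 1536
```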

## Dataset release

The corresponding public dataset and species-embedding release is:

- `alegendaryfish/CodonTranslator-data`

That dataset repo contains:

- final representative-only `train/`, `val/`, `test/` parquet shards
- `embeddings_v2/`
- split audit files and reconstruction metadata

## Quick start

### Install

```bash
git clone https://huggingface.co/alegendaryfish/CodonTranslator
cd CodonTranslator
conda env create -f environment.yml
conda activate codontranslator
pip install -r requirements.txt
pip install -e .
```

Both import styles are supported:

```python
from CodonTranslator import CodonTranslator
```

```python
from codontranslator import CodonTranslator
```

### Train

```bash
python train.py \
    --train_data /path/to/train \
    --val_data /path/to/val \
    --embeddings_dir /path/to/embeddings_v2 \
    --output_dir outputs \
    --fsdp \
    --bf16 \
    --attn mha \
    --hidden 750 \
    --layers 20 \
    --heads 15 \
    --mlp_ratio 3.2 \
    --batch_size 48 \
    --grad_accum 4 \
    --epochs 3 \
    --lr 7e-5 \
    --weight_decay 1e-4
```

The included Slurm launchers use the same training flags as the local single-node H200 workflow:

- `slurm/train_v3_h200_8x_single.sbatch`
- `slurm/submit_train_v3_h200_8x_chain.sh`

### Sample

```bash
python sampling.py \
    --model_path final_model \
    --embeddings_dir /path/to/embeddings_v2 \
    --species "Panicum hallii" \
    --protein_sequence "MSEQUENCE" \
    --strict_species_lookup
```

## Notes

- Training uses precomputed `embeddings_v2` for species conditioning.
- The data split is built in protein space with MMseqs clustering and a binomial-species test holdout.
- `final_model/` is the published inference entrypoint.
- For compatibility, released model directories contain both `trainer_config.json` and `config.json`.

## Sampling arguments

- `enforce_mapping`: when `True`, each generated codon is constrained to encode the provided amino acid at that position.
- `temperature`: softmax temperature. Lower values are more deterministic; `0` selects the argmax greedily.
- `top_k`: keep only the `k` highest-logit codon candidates before sampling.
- `top_p`: nucleus sampling threshold; keep the smallest set of highest-probability candidates whose cumulative probability reaches `p`.
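
These knobs compose in the conventional order: scale by temperature, then restrict to the top-k candidates, then apply the nucleus cutoff. A minimal generic sketch of that pipeline over a single logit vector, using NumPy; this illustrates the standard technique and is not the repository's actual `sampling.py` implementation:

```python
import numpy as np

def sample_codon(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature / top-k / top-p sampling over one codon-logit vector."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    if temperature == 0:          # temperature 0 -> greedy argmax
        return int(np.argmax(logits))
    logits = logits / temperature

    if top_k > 0:                 # keep only the k highest-logit candidates
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    # softmax (max-subtracted for numerical stability)
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()

    if top_p < 1.0:               # nucleus: smallest set reaching cumulative p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1  # include the crossing token
        keep = order[:cutoff]
        nucleus = np.zeros_like(probs)
        nucleus[keep] = probs[keep]
        probs = nucleus / nucleus.sum()

    return int(rng.choice(len(probs), p=probs))
```

An `enforce_mapping`-style constraint would add one more masking step before the softmax, setting the logits of codons that do not encode the target amino acid to `-inf`.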