---
license: mit
library_name: pytorch
tags:
- biology
- dna
- codon-optimization
- protein-conditioned-generation
- fsdp
datasets:
- alegendaryfish/CodonTranslator-data
---

# CodonTranslator

CodonTranslator is a protein-conditioned codon sequence generation model trained on the representative-only `data_v3` release.

This repository is the public model and training-code release. It contains:

- `final_model/`: inference-ready weights
- `src/`, `train.py`, `sampling.py`: training and inference code
- `resplit_data_v3.py`: the `data_v3` reconstruction pipeline
- `slurm/`: the single-node H200 training and data rebuild submission scripts
- `CodonTranslator/` and `pyproject.toml`: a lightweight packaged inference wrapper

## Training configuration

- Architecture: `hidden=750`, `layers=20`, `heads=15`, `mlp_ratio=3.2`
- Attention: `mha`
- Precision: `bf16`
- Parallelism: FSDP full shard
- Effective global batch: `1536` (consistent with `batch_size 48` × `grad_accum 4` × 8 GPUs)
- Weight decay: `1e-4`
- Dataset: `alegendaryfish/CodonTranslator-data`

## Dataset release

The corresponding public dataset and species-embedding release is:

- `alegendaryfish/CodonTranslator-data`

That dataset repo contains:

- the final representative-only `train/`, `val/`, `test/` parquet shards
- `embeddings_v2/`
- split audit files and reconstruction metadata

## Quick start

### Install

```bash
git clone https://huggingface.co/alegendaryfish/CodonTranslator
cd CodonTranslator
conda env create -f environment.yml
conda activate codontranslator
pip install -r requirements.txt
pip install -e .
```

Both import styles are supported:

```python
from CodonTranslator import CodonTranslator
```

```python
from codontranslator import CodonTranslator
```

### Train

```bash
python train.py \
  --train_data /path/to/train \
  --val_data /path/to/val \
  --embeddings_dir /path/to/embeddings_v2 \
  --output_dir outputs \
  --fsdp \
  --bf16 \
  --attn mha \
  --hidden 750 \
  --layers 20 \
  --heads 15 \
  --mlp_ratio 3.2 \
  --batch_size 48 \
  --grad_accum 4 \
  --epochs 3 \
  --lr 7e-5 \
  --weight_decay 1e-4
```

The included Slurm launchers use the same training flags as the local single-node H200 workflow:

- `slurm/train_v3_h200_8x_single.sbatch`
- `slurm/submit_train_v3_h200_8x_chain.sh`

### Sample

```bash
python sampling.py \
  --model_path final_model \
  --embeddings_dir /path/to/embeddings_v2 \
  --species "Panicum hallii" \
  --protein_sequence "MSEQUENCE" \
  --strict_species_lookup
```

## Notes

- Training uses precomputed `embeddings_v2` for species conditioning.
- The data split is built in protein space with MMseqs2 clustering and a binomial-species test holdout.
- `final_model/` is the published inference entrypoint.
- For compatibility, released model directories contain both `trainer_config.json` and `config.json`.

## Sampling arguments

- `enforce_mapping`: when `True`, each generated codon is constrained to encode the provided amino acid at that position.
- `temperature`: softmax temperature. Lower values are more deterministic; `0` selects the argmax greedily.
- `top_k`: keep only the `k` highest-logit codon candidates before sampling.
- `top_p`: nucleus sampling threshold; keep the smallest set of highest-probability candidates whose cumulative probability reaches `p`.
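To make the interplay of these three filters concrete, here is a minimal PyTorch sketch of the standard temperature/top-k/nucleus filtering recipe applied to one decoding step's codon logits. It illustrates the semantics described above and is not the implementation in `sampling.py`; the function name `filter_logits` is hypothetical.

```python
import torch

def filter_logits(logits: torch.Tensor, temperature: float = 1.0,
                  top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    """Apply temperature, top-k, and nucleus (top-p) filtering to 1-D logits."""
    if temperature == 0:
        # temperature == 0: greedy decoding, keep only the argmax candidate.
        out = torch.full_like(logits, float("-inf"))
        out[logits.argmax()] = logits.max()
        return out
    logits = logits / temperature
    if top_k > 0:
        # Mask everything below the k-th largest logit.
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        # Nucleus filtering: keep the smallest prefix of sorted candidates
        # whose cumulative probability reaches p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum > top_p
        remove[1:] = remove[:-1].clone()  # shift right: the crossing token stays
        remove[0] = False                 # always keep the top candidate
        logits[sorted_idx[remove]] = float("-inf")
    return logits

# One decoding step: sample a codon id from the filtered distribution.
# step_logits has shape [codon_vocab_size].
# probs = torch.softmax(filter_logits(step_logits, 0.8, top_k=16, top_p=0.9), -1)
# codon_id = torch.multinomial(probs, num_samples=1).item()
```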
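`enforce_mapping` amounts to constrained decoding: at each position, logits for codons that do not translate to the target amino acid are masked out before any sampling. The sketch below shows that idea using the standard genetic code; `codon_to_id` is a hypothetical codon-vocabulary mapping, and the released `sampling.py` may organize its vocabulary and masking differently.

```python
from collections import defaultdict
import torch

# Standard genetic code in TCAG order: codon CODONS[i] encodes AMINOS[i].
BASES = "TCAG"
AMINOS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

SYNONYMS = defaultdict(list)  # amino acid -> its synonymous codons
for codon, aa in zip(CODONS, AMINOS):
    SYNONYMS[aa].append(codon)

def mask_to_amino_acid(logits: torch.Tensor, amino_acid: str,
                       codon_to_id: dict[str, int]) -> torch.Tensor:
    """Keep only logits of codons encoding `amino_acid`; mask all others."""
    allowed = [codon_to_id[c] for c in SYNONYMS[amino_acid]]
    masked = torch.full_like(logits, float("-inf"))
    masked[allowed] = logits[allowed]
    return masked

# At generation step t, constrain to the amino acid at protein position t:
# step_logits = mask_to_amino_acid(step_logits, protein_sequence[t], codon_to_id)
```

Applying this mask before the temperature/top-k/top-p filters guarantees that every sampled codon sequence translates back to the input protein.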