---
license: mit
library_name: pytorch
tags:
- biology
- dna
- codon-optimization
- protein-conditioned-generation
- fsdp
datasets:
- alegendaryfish/CodonTranslator-data
---

# CodonTranslator

CodonTranslator is a protein-conditioned codon sequence generation model trained on the representative-only `data_v3` release.

This repository is the public model and training-code release. It contains:

- `final_model/`: inference-ready weights
- `src/`, `train.py`, `sampling.py`: training and inference code
- `resplit_data_v3.py`: the `data_v3` reconstruction pipeline
- `slurm/`: the single-node H200 training and data-rebuild submission scripts
- `CodonTranslator/` and `pyproject.toml`: a lightweight packaged inference wrapper

## Training configuration

- Architecture: `hidden=750`, `layers=20`, `heads=15`, `mlp_ratio=3.2`
- Attention: `mha`
- Precision: `bf16`
- Parallelism: FSDP full shard
- Effective global batch: `1536`
- Weight decay: `1e-4`
- Dataset: `alegendaryfish/CodonTranslator-data`
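
The effective global batch follows from the per-device batch size, the gradient-accumulation steps, and the number of data-parallel ranks. A quick sanity check, assuming the 8-GPU single-node setup suggested by the `h200_8x` Slurm script names (the GPU count is an assumption, not stated in this card):

```python
# Effective global batch = per-device batch x grad-accumulation steps x data-parallel ranks.
per_device_batch = 48   # --batch_size in the train.py command below
grad_accum = 4          # --grad_accum
num_gpus = 8            # assumed FSDP world size (single node, 8x H200)
print(per_device_batch * grad_accum * num_gpus)  # 1536
```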

## Dataset release

The corresponding public dataset and species-embedding release is:

- `alegendaryfish/CodonTranslator-data`

That dataset repo contains:

- final representative-only `train/`, `val/`, `test/` parquet shards
- `embeddings_v2/`
- split audit files and reconstruction metadata

## Quick start

### Install

```bash
git clone https://huggingface.co/alegendaryfish/CodonTranslator
cd CodonTranslator
conda env create -f environment.yml
conda activate codontranslator
pip install -r requirements.txt
pip install -e .
```

Both import styles are supported:

```python
from CodonTranslator import CodonTranslator
```

```python
from codontranslator import CodonTranslator
```

### Train

```bash
python train.py \
    --train_data /path/to/train \
    --val_data /path/to/val \
    --embeddings_dir /path/to/embeddings_v2 \
    --output_dir outputs \
    --fsdp \
    --bf16 \
    --attn mha \
    --hidden 750 \
    --layers 20 \
    --heads 15 \
    --mlp_ratio 3.2 \
    --batch_size 48 \
    --grad_accum 4 \
    --epochs 3 \
    --lr 7e-5 \
    --weight_decay 1e-4
```

The included Slurm launchers use the same training flags as the local single-node H200 workflow:

- `slurm/train_v3_h200_8x_single.sbatch`
- `slurm/submit_train_v3_h200_8x_chain.sh`

### Sample

```bash
python sampling.py \
    --model_path final_model \
    --embeddings_dir /path/to/embeddings_v2 \
    --species "Panicum hallii" \
    --protein_sequence "MSEQUENCE" \
    --strict_species_lookup
```

## Notes

- Training uses precomputed `embeddings_v2` for species conditioning.
- The data split is built in protein space with MMseqs clustering and a binomial-species test holdout.
- `final_model/` is the published inference entrypoint.
- For compatibility, released model directories contain both `trainer_config.json` and `config.json`.

## Sampling arguments

- `enforce_mapping`: when `True`, each generated codon is constrained to encode the provided amino acid at that position.
- `temperature`: softmax temperature. Lower values are more deterministic; `0` selects the argmax greedily.
- `top_k`: keep only the `k` highest-logit codon candidates before sampling.
- `top_p`: nucleus sampling threshold; keep the smallest set of highest-probability candidates whose cumulative probability reaches `p`.
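
These knobs compose in the conventional order: scale by temperature, then restrict to the top-k candidates, then apply the nucleus cutoff. A minimal generic sketch of that pipeline over a single logit vector, using NumPy; this illustrates the standard technique and is not the repository's actual `sampling.py` implementation:

```python
import numpy as np

def sample_codon(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature / top-k / top-p sampling over one codon-logit vector."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    if temperature == 0:          # temperature 0 -> greedy argmax
        return int(np.argmax(logits))
    logits = logits / temperature

    if top_k > 0:                 # keep only the k highest-logit candidates
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    # softmax (max-subtracted for numerical stability)
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()

    if top_p < 1.0:               # nucleus: smallest set reaching cumulative p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1  # include the crossing token
        keep = order[:cutoff]
        nucleus = np.zeros_like(probs)
        nucleus[keep] = probs[keep]
        probs = nucleus / nucleus.sum()

    return int(rng.choice(len(probs), p=probs))
```

An `enforce_mapping`-style constraint would add one more masking step before the softmax, setting the logits of codons that do not encode the target amino acid to `-inf`.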