---
license: mit
library_name: pytorch
tags:
  - biology
  - dna
  - codon-optimization
  - protein-conditioned-generation
  - fsdp
datasets:
  - alegendaryfish/CodonTranslator-data
---

# CodonTranslator

CodonTranslator is a protein-conditioned codon sequence generation model trained on the representative-only `data_v3` release.

This repository is the public model and training-code release. It contains:

- `final_model/`: inference-ready weights
- `src/`, `train.py`, `sampling.py`: training and inference code
- `resplit_data_v3.py`: the `data_v3` reconstruction pipeline
- `slurm/`: the single-node H200 training and data rebuild submission scripts
- `CodonTranslator/` and `pyproject.toml`: a lightweight packaged inference wrapper

## Training configuration

- Architecture: `hidden=750`, `layers=20`, `heads=15`, `mlp_ratio=3.2`
- Attention: `mha`
- Precision: `bf16`
- Parallelism: FSDP full shard
- Effective global batch: `1536`
- Weight decay: `1e-4`
- Dataset: `alegendaryfish/CodonTranslator-data`
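
The effective global batch follows directly from the launch parameters: a per-GPU micro-batch of 48 with 4 gradient-accumulation steps across the single node of eight H200s targeted by the `_8x_` Slurm launchers:

```python
# Effective global batch = per-GPU micro-batch x grad-accum steps x GPU count.
per_gpu_batch = 48  # --batch_size
grad_accum = 4      # --grad_accum
num_gpus = 8        # one node of 8x H200 (see the slurm/ launchers)

effective_batch = per_gpu_batch * grad_accum * num_gpus
print(effective_batch)  # 1536
```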

## Dataset release

The corresponding public dataset and species embedding release is:

- `alegendaryfish/CodonTranslator-data`

That dataset repo contains:

- final representative-only `train/`, `val/`, `test/` parquet shards
- `embeddings_v2/`
- split audit files and reconstruction metadata

## Quick start

### Install

```bash
git clone https://huggingface.co/alegendaryfish/CodonTranslator
cd CodonTranslator
conda env create -f environment.yml
conda activate codontranslator
pip install -r requirements.txt
pip install -e .
```

Both import styles are supported:

```python
from CodonTranslator import CodonTranslator
```

```python
from codontranslator import CodonTranslator
```

### Train

```bash
python train.py \
  --train_data /path/to/train \
  --val_data /path/to/val \
  --embeddings_dir /path/to/embeddings_v2 \
  --output_dir outputs \
  --fsdp \
  --bf16 \
  --attn mha \
  --hidden 750 \
  --layers 20 \
  --heads 15 \
  --mlp_ratio 3.2 \
  --batch_size 48 \
  --grad_accum 4 \
  --epochs 3 \
  --lr 7e-5 \
  --weight_decay 1e-4
```

The included Slurm launchers use the same training flags as the local single-node H200 workflow:

- `slurm/train_v3_h200_8x_single.sbatch`
- `slurm/submit_train_v3_h200_8x_chain.sh`

### Sample

```bash
python sampling.py \
  --model_path final_model \
  --embeddings_dir /path/to/embeddings_v2 \
  --species "Panicum hallii" \
  --protein_sequence "MSEQUENCE" \
  --strict_species_lookup
```

## Notes

- Training uses precomputed `embeddings_v2` for species conditioning.
- The data split is built in protein space with MMseqs clustering and binomial-species test holdout.
- `final_model/` is the published inference entrypoint.
- For compatibility, released model directories contain both `trainer_config.json` and `config.json`.

## Sampling arguments

- `enforce_mapping`: when `True`, each generated codon is constrained to encode the provided amino acid at that position.
- `temperature`: softmax temperature; lower values make sampling more deterministic, and `0` falls back to greedy argmax decoding.
- `top_k`: keep only the `k` highest-logit codon candidates before sampling.
- `top_p`: nucleus sampling threshold; keep the smallest set of highest-probability codons whose cumulative probability reaches `p`.
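
These knobs compose as a standard constrained-decoding step. The sketch below is an illustrative NumPy implementation, not the repo's `sampling.py`: the codon table is a small subset of the standard genetic code, and the released model uses its own codon vocabulary and logits.

```python
import numpy as np

# Subset of the standard genetic code, for illustration only.
SYNONYMOUS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}
VOCAB = sorted({c for codons in SYNONYMOUS.values() for c in codons})
INDEX = {c: i for i, c in enumerate(VOCAB)}

def sample_codon(logits, amino_acid, temperature=1.0, top_k=0, top_p=1.0,
                 enforce_mapping=True, rng=None):
    """Pick one codon index from `logits` under the documented knobs."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()

    # enforce_mapping: mask every codon that does not encode `amino_acid`.
    if enforce_mapping:
        allowed = {INDEX[c] for c in SYNONYMOUS[amino_acid]}
        mask = np.full_like(logits, -np.inf)
        for i in allowed:
            mask[i] = 0.0
        logits = logits + mask

    # temperature 0 falls back to greedy argmax decoding.
    if temperature == 0:
        return int(np.argmax(logits))
    logits = logits / temperature

    # top_k: keep only the k highest-logit candidates.
    if top_k > 0:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # top_p: keep the smallest head of the sorted distribution whose
    # cumulative probability reaches p, then renormalize.
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        filtered = np.zeros_like(probs)
        filtered[order[:cutoff]] = probs[order[:cutoff]]
        probs = filtered / filtered.sum()

    return int(rng.choice(len(probs), p=probs))
```

For example, `sample_codon(np.zeros(len(VOCAB)), "M")` can only return the index of `ATG`, because methionine has a single codon and `enforce_mapping` masks everything else.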