---
language:
- en
license: mit
tags:
- genomics
- dna
- chicken
- poultry
- caduceus
- mamba
- biology
- bioinformatics
- gallus-gallus
datasets:
- custom
pipeline_tag: feature-extraction
base_model: kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-4
---

# PoultryCaduceus

**A Bidirectional DNA Language Model for the Chicken Genome**

PoultryCaduceus is the first DNA foundation model pre-trained specifically on the chicken (*Gallus gallus*) genome. It is based on the [Caduceus](https://github.com/kuleshov-group/caduceus) architecture.

## Model Description

| Feature | Value |
|---------|-------|
| **Base Model** | caduceus-ph (4-layer) |
| **Pre-training Genome** | GRCg6a (galGal6) |
| **Sequence Length** | 65,536 bp |
| **Hidden Dimension** | 256 |
| **Layers** | 4 |
| **Vocab Size** | 16 |
| **Training Steps** | 10,000 |

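Chromosome-scale inputs are far longer than the 65,536 bp sequence length listed above. The sketch below shows one way to cut a long sequence into windows of at most that length before tokenization. The helper name `iter_windows` and its `stride` parameter are illustrative and not part of this repository, and treating the training length as an upper bound for inference is an assumption.

```python
from typing import Iterator, Optional, Tuple

MAX_LEN = 65_536  # sequence length from the Model Description table


def iter_windows(
    sequence: str,
    window: int = MAX_LEN,
    stride: Optional[int] = None,
) -> Iterator[Tuple[int, str]]:
    """Yield (start, chunk) pairs that cover `sequence` from left to right.

    Each chunk holds at most `window` bases. `stride` defaults to `window`
    (non-overlapping chunks); a smaller stride produces overlapping chunks.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    stride = window if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")

    start = 0
    while start < len(sequence):
        yield start, sequence[start:start + window]
        if start + window >= len(sequence):
            break  # this chunk already reached the end of the sequence
        start += stride
```

The last chunk is usually shorter than `window`; whether to pad it, drop it, or embed it as-is depends on the downstream task.
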
## Usage

```python
from transformers import AutoModelForMaskedLM

# Load the model weights from the checkpoint-10000 subfolder of the repository.
# trust_remote_code=True allows transformers to run the custom model code
# that ships with the checkpoint.
model = AutoModelForMaskedLM.from_pretrained(
    "jamie0315/PoultryCaduceus",
    subfolder="checkpoint-10000",
    trust_remote_code=True,
)
```

### Get Sequence Embeddings

```python
import torch

# DNA vocabulary: maps each character to its token ID
DNA_VOCAB = {'A': 7, 'C': 8, 'G': 9, 'T': 10, 'N': 5, '[MASK]': 4}

# Tokenize the sequence; characters outside the vocabulary fall back to N (ID 5)
sequence = "ATGCGATCGATCGATCG"
input_ids = torch.tensor([[DNA_VOCAB.get(c, 5) for c in sequence]])

# Get embeddings (`model` is the object loaded in the Usage snippet)
model.eval()
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
    embeddings = outputs.hidden_states[-1]  # (batch, seq_len, 256)
```

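The mapping in the embedding snippet is case-sensitive, so lowercase (soft-masked) bases from a FASTA file would all be encoded as `N`. The sketch below wraps tokenization in a helper that uppercases its input first, and adds a reverse-complement helper for embedding the opposite strand as well. The token IDs are copied from the `DNA_VOCAB` mapping in this card; the function names `encode` and `reverse_complement` are illustrative and not part of this repository.

```python
# Token IDs copied from the DNA_VOCAB mapping in this model card.
DNA_VOCAB = {"A": 7, "C": 8, "G": 9, "T": 10, "N": 5, "[MASK]": 4}

# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def encode(sequence: str) -> list:
    """Map a DNA string to token IDs.

    Lowercase bases are uppercased first; any symbol outside A/C/G/T/N
    is encoded as N.
    """
    n_id = DNA_VOCAB["N"]
    return [DNA_VOCAB.get(base, n_id) for base in sequence.upper()]


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA string (uppercased)."""
    return sequence.upper().translate(COMPLEMENT)[::-1]
```

For example, `torch.tensor([encode(sequence)])` produces the same `input_ids` as the snippet above for an uppercase sequence. If embeddings from both strands are combined per position, the reverse-complement embeddings need to be flipped along the sequence axis first so that positions line up.
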
## Training Data

Pre-training data is available in `chicken_pretrain_data_GRCg6a/`:

- `train_65k.h5`: training set (~58,000 sequences)
- `val_65k.h5`: validation set (~1,200 sequences)

## Repository Structure

```
PoultryCaduceus/
├── checkpoint-10000/                 # Model weights
│   ├── config.json
│   └── model.safetensors
└── chicken_pretrain_data_GRCg6a/     # Pre-training data
    ├── train_65k.h5
    └── val_65k.h5
```

## Links

- 📦 **GitHub**: [chengzhimin/PoultryCaduceus](https://github.com/chengzhimin/PoultryCaduceus)

## License

MIT License