---
language:
- en
license: mit
tags:
- genomics
- dna
- chicken
- poultry
- caduceus
- mamba
- biology
- bioinformatics
- gallus-gallus
datasets:
- custom
pipeline_tag: feature-extraction
base_model: kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-4
---
# πŸ” PoultryCaduceus
**A Bidirectional DNA Language Model for the Chicken Genome**
PoultryCaduceus is the first DNA foundation model specifically pre-trained on the chicken (*Gallus gallus*) genome, based on the [Caduceus](https://github.com/kuleshov-group/caduceus) architecture.
## Model Description
| Feature | Value |
|---------|-------|
| **Base Model** | caduceus-ph (4-layer) |
| **Pre-training Genome** | GRCg6a (galGal6) |
| **Sequence Length** | 65,536 bp |
| **Hidden Dimension** | 256 |
| **Layers** | 4 |
| **Vocab Size** | 16 |
| **Training Steps** | 10,000 |
## Usage
```python
from transformers import AutoModelForMaskedLM

# Load model
model = AutoModelForMaskedLM.from_pretrained(
    "jamie0315/PoultryCaduceus",
    subfolder="checkpoint-10000",
    trust_remote_code=True,
)
```
### Get Sequence Embeddings
```python
import torch

# DNA vocabulary (character -> token id)
DNA_VOCAB = {'A': 7, 'C': 8, 'G': 9, 'T': 10, 'N': 5, '[MASK]': 4}

# Tokenize sequence; characters missing from the vocabulary fall back to N (id 5)
sequence = "ATGCGATCGATCGATCG"
input_ids = torch.tensor([[DNA_VOCAB.get(c, 5) for c in sequence]])

# Get embeddings from the last hidden layer
model.eval()
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
    embeddings = outputs.hidden_states[-1]  # (batch, seq_len, 256)
```
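For inputs longer than a short test string, the tokenization step above can be wrapped in a small helper. The following is a minimal sketch, not part of the released code: `encode_dna` and `chunk_ids` are hypothetical names, the token ids are copied from the `DNA_VOCAB` mapping above, and two behaviors are assumptions: lowercase (soft-masked) bases are treated like their uppercase form, and inputs are split into windows of at most 65,536 tokens to match the pre-training sequence length.

```python
from typing import Iterator, List

# Token ids copied from the DNA_VOCAB mapping in this README.
DNA_VOCAB = {'A': 7, 'C': 8, 'G': 9, 'T': 10, 'N': 5, '[MASK]': 4}

# Assumption: windows are capped at the pre-training sequence length.
MAX_LEN = 65_536


def encode_dna(sequence: str) -> List[int]:
    """Map a DNA string to token ids.

    Assumption: lowercase bases are uppercased first, and any character
    other than A/C/G/T/N is encoded as N.
    """
    n_id = DNA_VOCAB['N']
    base_ids = {base: DNA_VOCAB[base] for base in "ACGTN"}
    return [base_ids.get(base, n_id) for base in sequence.upper()]


def chunk_ids(ids: List[int], max_len: int = MAX_LEN) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping windows of at most max_len tokens."""
    for start in range(0, len(ids), max_len):
        yield ids[start:start + max_len]
```

Each window can then be wrapped as `torch.tensor([window])` and passed to the model exactly as in the embedding example. Non-overlapping windows are the simplest choice; overlapping windows may be preferable when features near window edges matter.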
## Training Data
Pre-training data is available in `chicken_pretrain_data_GRCg6a/`:
- `train_65k.h5` - Training set (~58,000 sequences)
- `val_65k.h5` - Validation set (~1,200 sequences)
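The dataset names inside these HDF5 files are not documented here, so a reasonable first step is to list them. The sketch below assumes the `h5py` package is installed and the files have been downloaded locally; `summarize_datasets` and `inspect_h5` are hypothetical helper names, not part of this repository.

```python
from typing import Any, Dict, Tuple


def summarize_datasets(node: Any, prefix: str = "") -> Dict[str, Tuple[tuple, str]]:
    """Recursively collect {path: (shape, dtype)} for every dataset under node.

    Works on any mapping-like tree whose leaves expose .shape and .dtype,
    which includes h5py File and Group objects.
    """
    summary: Dict[str, Tuple[tuple, str]] = {}
    for name, child in node.items():
        path = f"{prefix}/{name}"
        if hasattr(child, "shape"):
            # Dataset-like leaf: record its shape and dtype.
            summary[path] = (tuple(child.shape), str(child.dtype))
        else:
            # Group-like node: descend into its children.
            summary.update(summarize_datasets(child, path))
    return summary


def inspect_h5(path: str) -> Dict[str, Tuple[tuple, str]]:
    """Open an HDF5 file read-only and summarize the datasets it contains."""
    import h5py  # imported lazily so the walker above has no hard dependency

    with h5py.File(path, "r") as handle:
        return summarize_datasets(handle)
```

For example, `inspect_h5("chicken_pretrain_data_GRCg6a/val_65k.h5")` would return the dataset paths with their shapes and dtypes, assuming the file sits at that relative path.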
## Repository Structure
```
PoultryCaduceus/
├── checkpoint-10000/                # Model weights
│   ├── config.json
│   └── model.safetensors
└── chicken_pretrain_data_GRCg6a/    # Pre-training data
    ├── train_65k.h5
    └── val_65k.h5
```
## Links
- 📦 **GitHub**: [chengzhimin/PoultryCaduceus](https://github.com/chengzhimin/PoultryCaduceus)
## License
MIT License