jamie0315/PoultryCaduceus

Feature Extraction

jamie0315 committed on Jan 17

Commit b8ea27b · verified · 1 Parent(s): 21ca99c

Upload README.md

Files changed (1)

README.md +95 -0

README.md ADDED

	@@ -0,0 +1,95 @@

+---
+language:
+- en
+license: mit
+tags:
+- genomics
+- dna
+- chicken
+- poultry
+- caduceus
+- mamba
+- biology
+- bioinformatics
+- gallus-gallus
+datasets:
+- custom
+pipeline_tag: feature-extraction
+base_model: kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-4
+---
+# 🐔 PoultryCaduceus
+**A Bidirectional DNA Language Model for the Chicken Genome**
+PoultryCaduceus is the first DNA foundation model specifically pre-trained on the chicken (*Gallus gallus*) genome, based on the [Caduceus](https://github.com/kuleshov-group/caduceus) architecture.
+## Model Description
+| Feature | Value |
+|---------|-------|
+| **Base Model** | caduceus-ph (4-layer) |
+| **Pre-training Genome** | GRCg6a (galGal6) |
+| **Sequence Length** | 65,536 bp |
+| **Hidden Dimension** | 256 |
+| **Layers** | 4 |
+| **Vocab Size** | 16 |
+| **Training Steps** | 10,000 |
+## Usage
+```python
+from transformers import AutoModelForMaskedLM
+# Load model
+model = AutoModelForMaskedLM.from_pretrained(
+    "jamie0315/PoultryCaduceus",
+    subfolder="checkpoint-10000",
+    trust_remote_code=True
+)
+```
+### Get Sequence Embeddings
+```python
+import torch
+# DNA vocabulary
+DNA_VOCAB = {'A': 7, 'C': 8, 'G': 9, 'T': 10, 'N': 5, '[MASK]': 4}
+# Tokenize sequence (upper-case first; characters outside the vocabulary map to N)
+sequence = "ATGCGATCGATCGATCG"
+input_ids = torch.tensor([[DNA_VOCAB.get(c, DNA_VOCAB['N']) for c in sequence.upper()]])
+# Get embeddings
+model.eval()
+with torch.no_grad():
+    outputs = model(input_ids, output_hidden_states=True)
+    embeddings = outputs.hidden_states[-1]  # (batch, seq_len, 256)
+```
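The tokenization step above can be factored into a small helper. The following is a minimal sketch, not part of the repository: `encode` and `reverse_complement` are hypothetical names, and the token IDs are copied from the `DNA_VOCAB` mapping shown in this README rather than checked against the model's own tokenizer. The reverse-complement helper is included because a DNA model may be queried on either strand. Whether averaging the embeddings of both strands helps for this checkpoint is not stated here.

```python
# Token IDs copied from the README's DNA_VOCAB (assumed to match the checkpoint).
DNA_VOCAB = {"A": 7, "C": 8, "G": 9, "T": 10, "N": 5, "[MASK]": 4}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def encode(sequence):
    """Map a DNA string to token IDs, one ID per character.

    Lowercase (soft-masked) bases are upper-cased; any other symbol becomes N.
    """
    unknown = DNA_VOCAB["N"]
    return [DNA_VOCAB.get(base, unknown) for base in sequence.upper()]


def reverse_complement(sequence):
    """Return the reverse complement; unrecognized symbols become N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(sequence.upper()))


sequence = "ATGCGATCGATCGATCG"
forward_ids = encode(sequence)
reverse_ids = encode(reverse_complement(sequence))
```

Either list can be wrapped as `torch.tensor([forward_ids])` and passed to the model in place of `input_ids` in the example above.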
+## Training Data
+Pre-training data is available in `chicken_pretrain_data_GRCg6a/`:
+- `train_65k.h5` - Training set (~58,000 sequences)
+- `val_65k.h5` - Validation set (~1,200 sequences)
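For a rough sense of scale, the approximate sequence counts above can be converted into base pairs. This sketch assumes every stored sequence is exactly 65,536 bp, as the Sequence Length row and the `65k` file names suggest. The layout of the HDF5 files is not documented here, so the figures are estimates only.

```python
SEQ_LEN = 65_536   # bp per sequence, from the Model Description table
N_TRAIN = 58_000   # approximate count of training sequences
N_VAL = 1_200      # approximate count of validation sequences

train_bp = N_TRAIN * SEQ_LEN
val_bp = N_VAL * SEQ_LEN

print(f"train: ~{train_bp / 1e9:.2f} Gbp")  # about 3.80 Gbp
print(f"val:   ~{val_bp / 1e6:.1f} Mbp")    # about 78.6 Mbp
```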
+## Repository Structure
+```
+PoultryCaduceus/
+├── checkpoint-10000/              # Model weights
+│   ├── config.json
+│   └── model.safetensors
+└── chicken_pretrain_data_GRCg6a/  # Pre-training data
+    ├── train_65k.h5
+    └── val_65k.h5
+```
+## Links
+- 📦 **GitHub**: [chengzhimin/PoultryCaduceus](https://github.com/chengzhimin/PoultryCaduceus)
+## License
+MIT License