---
language:
- en
license: mit
tags:
- genomics
- dna
- chicken
- poultry
- caduceus
- mamba
- biology
- bioinformatics
- gallus-gallus
datasets:
- custom
pipeline_tag: feature-extraction
base_model: kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-4
---

# 🐔 PoultryCaduceus

**A Bidirectional DNA Language Model for the Chicken Genome**

PoultryCaduceus is the first DNA foundation model specifically pre-trained on the chicken (*Gallus gallus*) genome, based on the [Caduceus](https://github.com/kuleshov-group/caduceus) architecture.

## Model Description

| Feature | Value |
|---------|-------|
| **Base Model** | caduceus-ph (4-layer) |
| **Pre-training Genome** | GRCg6a (galGal6) |
| **Sequence Length** | 65,536 bp |
| **Hidden Dimension** | 256 |
| **Layers** | 4 |
| **Vocab Size** | 16 |
| **Training Steps** | 10,000 |

## Usage

```python
from transformers import AutoModelForMaskedLM

# Load model
model = AutoModelForMaskedLM.from_pretrained(
    "jamie0315/PoultryCaduceus",
    subfolder="checkpoint-10000",
    trust_remote_code=True
)
```

### Get Sequence Embeddings

```python
import torch

# DNA vocabulary
DNA_VOCAB = {'A': 7, 'C': 8, 'G': 9, 'T': 10, 'N': 5, '[MASK]': 4}

# Tokenize sequence
sequence = "ATGCGATCGATCGATCG"
input_ids = torch.tensor([[DNA_VOCAB.get(c, 5) for c in sequence]])

# Get embeddings
model.eval()
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
    embeddings = outputs.hidden_states[-1]  # (batch, seq_len, 256)
```

A masked-nucleotide prediction sketch is included at the end of this card.

## Training Data

Pre-training data is available in `chicken_pretrain_data_GRCg6a/`:

- `train_65k.h5` - Training set (~58,000 sequences)
- `val_65k.h5` - Validation set (~1,200 sequences)

A minimal loading sketch for these files is included at the end of this card.

## Repository Structure

```
PoultryCaduceus/
├── checkpoint-10000/                 # Model weights
│   ├── config.json
│   └── model.safetensors
└── chicken_pretrain_data_GRCg6a/     # Pre-training data
    ├── train_65k.h5
    └── val_65k.h5
```

## Links

- 📦 **GitHub**: [chengzhimin/PoultryCaduceus](https://github.com/chengzhimin/PoultryCaduceus)

## License

MIT License
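
## Example: Masked Nucleotide Prediction

Because the checkpoint loads as a masked language model, its output logits can also be used to predict a masked base directly. The snippet below is a minimal sketch, not part of the official usage: it reuses `model` and `DNA_VOCAB` from the snippets above and assumes the `[MASK]` and base ids shown on this card match the tokenizer actually shipped with the checkpoint, so verify them before relying on the predictions.

```python
import torch

# Sketch: predict the base at one masked position.
# Assumes `model` and `DNA_VOCAB` from the Usage section above; the token ids
# follow the vocabulary shown on this card and should be verified against the
# checkpoint's own tokenizer.
sequence = "ATGCGATCGATCGATCG"
input_ids = torch.tensor([[DNA_VOCAB.get(c, DNA_VOCAB['N']) for c in sequence]])

mask_pos = 5
input_ids[0, mask_pos] = DNA_VOCAB['[MASK]']  # mask a single position

model.eval()
with torch.no_grad():
    logits = model(input_ids).logits          # (batch, seq_len, vocab_size)

id_to_token = {v: k for k, v in DNA_VOCAB.items()}
pred_id = logits[0, mask_pos].argmax().item()
print(f"Predicted id {pred_id} -> {id_to_token.get(pred_id, '?')}")
```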
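
## Example: Loading the Pre-training Data

The `.h5` files can be inspected with `h5py`. The dataset key names inside the files are not documented on this card, so the sketch below only lists them; the commented lines show how a dataset would be read once the key is known (`"sequences"` is a hypothetical name).

```python
import h5py

# Sketch: inspect the pre-training HDF5 files.
# The internal dataset keys are assumptions; check f.keys() first.
with h5py.File("chicken_pretrain_data_GRCg6a/train_65k.h5", "r") as f:
    print(list(f.keys()))            # datasets stored in the file
    # seqs = f["sequences"]          # hypothetical key name
    # print(seqs.shape, seqs.dtype)
```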