jamie0315 committed
Commit b8ea27b · verified · 1 Parent(s): 21ca99c

Upload README.md

  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - genomics
+ - dna
+ - chicken
+ - poultry
+ - caduceus
+ - mamba
+ - biology
+ - bioinformatics
+ - gallus-gallus
+ datasets:
+ - custom
+ pipeline_tag: feature-extraction
+ base_model: kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-4
+ ---
+
+ # 🐔 PoultryCaduceus
+
+ **A Bidirectional DNA Language Model for the Chicken Genome**
+
+ PoultryCaduceus is the first DNA foundation model pre-trained specifically on the chicken (*Gallus gallus*) genome. It is built on the [Caduceus](https://github.com/kuleshov-group/caduceus) architecture.
+
+ ## Model Description
+
+ | Feature | Value |
+ |---------|-------|
+ | **Base Model** | caduceus-ph (4-layer) |
+ | **Pre-training Genome** | GRCg6a (galGal6) |
+ | **Sequence Length** | 65,536 bp |
+ | **Hidden Dimension** | 256 |
+ | **Layers** | 4 |
+ | **Vocab Size** | 16 |
+ | **Training Steps** | 10,000 |
+
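The 65,536 bp sequence length in the table suggests that longer regions would typically be split into windows before being passed to the model. Below is a minimal sketch of one way to do that in plain Python. The helper name `window_sequence`, the non-overlapping stride, and padding the final window with `N` are all assumptions made for illustration; they are not part of this repository and may not match how the pre-training data was built.

```python
SEQ_LEN = 65_536  # sequence length from the Model Description table


def window_sequence(seq, window=SEQ_LEN, pad_char="N"):
    """Split seq into non-overlapping windows of equal length.

    Hypothetical helper: the last window is right-padded with pad_char
    so that every window has exactly `window` characters.
    """
    windows = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        if len(chunk) < window:
            chunk = chunk + pad_char * (window - len(chunk))
        windows.append(chunk)
    return windows


# Toy example with a small window so the result is easy to inspect
print(window_sequence("ACGTACGTAC", window=4))  # ['ACGT', 'ACGT', 'ACNN']
```

Overlapping windows or dropping a short final window are equally plausible choices; which one fits depends on the downstream task.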
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForMaskedLM
+
+ # Load the model weights from the checkpoint-10000 subfolder of the repo.
+ # trust_remote_code=True lets transformers run the custom model code
+ # that the checkpoint depends on.
+ model = AutoModelForMaskedLM.from_pretrained(
+     "jamie0315/PoultryCaduceus",
+     subfolder="checkpoint-10000",
+     trust_remote_code=True,
+ )
+ ```
+
+ ### Get Sequence Embeddings
+
+ ```python
+ import torch
+
+ # DNA vocabulary (character -> token id)
+ DNA_VOCAB = {'A': 7, 'C': 8, 'G': 9, 'T': 10, 'N': 5, '[MASK]': 4}
+
+ # Tokenize sequence; characters not in the vocabulary fall back to N (id 5)
+ sequence = "ATGCGATCGATCGATCG"
+ input_ids = torch.tensor([[DNA_VOCAB.get(c, 5) for c in sequence]])
+
+ # Get embeddings from the last hidden layer, with gradients disabled
+ model.eval()
+ with torch.no_grad():
+     outputs = model(input_ids, output_hidden_states=True)
+     embeddings = outputs.hidden_states[-1]  # (batch, seq_len, 256)
+ ```
+
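The tokenization step in this section maps every character outside `DNA_VOCAB` to `N`, and that includes lowercase bases. Soft-masked genome FASTA files mark repeats in lowercase, so uppercasing before lookup may be worth doing. The sketch below is an illustration only: `encode_dna` is a hypothetical helper, and it reuses the token ids listed in this README without checking them against the checkpoint's own tokenizer.

```python
# Token ids copied from the README's DNA_VOCAB
DNA_VOCAB = {'A': 7, 'C': 8, 'G': 9, 'T': 10, 'N': 5, '[MASK]': 4}


def encode_dna(sequence, vocab=DNA_VOCAB):
    """Map a DNA string to token ids (hypothetical helper).

    The sequence is uppercased first so soft-masked bases keep their
    identity; any other character falls back to the id for N.
    """
    unknown_id = vocab['N']
    return [vocab.get(base, unknown_id) for base in sequence.upper()]


ids = encode_dna("atgcNx")
print(ids)  # [7, 10, 9, 8, 5, 5]
```

The resulting list can be wrapped as `torch.tensor([ids])` to get the `(batch, seq_len)` input used in the embedding example.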
+ ## Training Data
+
+ Pre-training data is available in `chicken_pretrain_data_GRCg6a/`:
+ - `train_65k.h5`: training set (~58,000 sequences)
+ - `val_65k.h5`: validation set (~1,200 sequences)
+
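For a rough sense of scale, the sequence counts can be converted to base pairs. This back-of-the-envelope calculation assumes every sequence is exactly 65,536 bp, as the `_65k` file names suggest, and uses the approximate counts listed above, so the results are estimates only.

```python
SEQ_LEN = 65_536  # assumed length of every stored sequence, in bp

# Approximate sequence counts from the list above
counts = {"train": 58_000, "val": 1_200}

# Total base pairs per split under the fixed-length assumption
total_bp = {split: n * SEQ_LEN for split, n in counts.items()}

# Share of all sequences that belong to the validation split
val_fraction = counts["val"] / sum(counts.values())

for split, bp in total_bp.items():
    print(f"{split}: ~{bp / 1e9:.2f} Gbp")
print(f"validation share: ~{val_fraction:.1%}")
```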
+ ## Repository Structure
+
+ ```
+ PoultryCaduceus/
+ ├── checkpoint-10000/                # Model weights
+ │   ├── config.json
+ │   └── model.safetensors
+ └── chicken_pretrain_data_GRCg6a/    # Pre-training data
+     ├── train_65k.h5
+     └── val_65k.h5
+ ```
+
+ ## Links
+
+ - 📦 **GitHub**: [chengzhimin/PoultryCaduceus](https://github.com/chengzhimin/PoultryCaduceus)
+
+ ## License
+
+ MIT License