vojtam committed on
Commit cc28f01 · verified · 1 Parent(s): 4d4b072

Update README.md

Files changed (1):
  1. README.md +100 -5

README.md CHANGED
@@ -1,9 +1,104 @@
  ---
  tags:
- - model_hub_mixin
- - pytorch_model_hub_mixin
  ---

- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
- - Library: [More Information Needed]
- - Docs: [More Information Needed]
---
tags:
- biology
- genomics
- dna-compression
- causal-language-modeling
- gpt2
license: apache-2.0
datasets:
- dnabert-2
library_name: transformers
pipeline_tag: text-generation
---

# DNAGPT2: Genomic Large Language Model for Compression and Analysis

**DNAGPT2** is a family of autoregressive (decoder-only) transformer models trained on genomic DNA sequences.

The models follow the GPT-2 architecture and are trained from scratch on a multi-species genome dataset.

## Model Details

- **Model Type:** Causal Language Model (Decoder-only Transformer)
- **Architecture:** GPT-2 Small
- **Parameters:** ~86 million
- **Layers:** 12
- **Heads:** 12
- **Embedding Dimension:** 768
- **Context Window:** 1,024 tokens
- **Vocabulary Sizes:** Models are available with vocabulary sizes of 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, and 8192; the 32-token variant (`DNAGPT2_32`) is the one reported under Performance below.
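
As a minimal sketch of the hyperparameters above (not the released training code), the equivalent `transformers` configuration for the 32-token-vocabulary variant would look roughly like this; the special-token setup of the published checkpoints is an assumption:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# GPT-2 Small shape as listed in Model Details, instantiated for the
# 32-token vocabulary. Special-token ids and any custom tweaks of the
# released checkpoints are not specified in this card.
config = GPT2Config(
    vocab_size=32,     # 16, 32, 64, ..., 8192 depending on the variant
    n_positions=1024,  # context window
    n_embd=768,        # embedding dimension
    n_layer=12,        # transformer blocks
    n_head=12,         # attention heads per block
)
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # ~86M
```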

## Intended Use

These models are designed for:
1. **DNA Compression:** Used in conjunction with Arithmetic Encoding (AE) to compress genomic sequences (a minimal rate estimate is sketched below).
2. **Sequence Modeling:** Next-token prediction for DNA sequences.

**Input:** Raw DNA sequences containing the characters `A`, `C`, `G`, `T`.
**Output:** Logits/probabilities for the next token in the sequence.
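
An arithmetic coder driven by the model's next-token probabilities approaches the average negative log2-probability per symbol, so the achievable compression rate can be estimated directly from the logits. The sketch below illustrates that estimate only; it is not the repository's actual coder, and the repository id and tokenizer behaviour are assumptions:

```python
import math

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative bits-per-nucleotide estimate: an arithmetic coder guided by the
# model approaches the average -log2 p(token | prefix). Repo id is an assumption.
repo = "vojtam/DNAGPT2_32"
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

sequence = "ACGT" * 64
ids = tokenizer.encode(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(ids).logits  # (1, seq_len, vocab_size)

# Negative log-likelihood of each token given its prefix (first token excluded).
nll_nats = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="sum").item()
bits_total = nll_nats / math.log(2)
print(f"~{bits_total / len(sequence):.3f} bits per nucleotide")
```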

## Training Data

The models were pretrained on the dataset provided by the authors of **DNABERT-2**.
- **Composition:** 135 genomes covering Vertebrata, Fungi, Protozoa, Invertebrata, and Bacteria.
- **Size:** Approximately 32.5 billion nucleotides.
- **Preprocessing:** The alphabet was restricted to **A, C, G, T**. The letter **N** (unknown/ambiguous nucleotide) was omitted from the training data.
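
The exact preprocessing pipeline used for the DNABERT-2 corpus is not described here; the snippet below only illustrates the alphabet restriction mentioned above by splitting sequences on any character outside A/C/G/T (such as N):

```python
import re

def acgt_runs(sequence: str) -> list[str]:
    """Return maximal substrings containing only A, C, G, or T (illustrative)."""
    return [run for run in re.split(r"[^ACGT]+", sequence.upper()) if run]

print(acgt_runs("acgtNNNNggat"))  # ['ACGT', 'GGAT']
```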

## Training Procedure

The models were trained using the PyTorch framework and the `nanoGPT` recipe.

- **Tokenizer:** Byte-Pair Encoding (BPE) trained via SentencePiece.
- **Epochs:** 1
- **Optimization:** AdamW (betas: 0.9, 0.95; weight decay: 0.1)
- **Learning Rate:** Cosine decay (max: 8e-4, min: 8e-5) with linear warmup.
- **Batch Size:** $2^{19}$ tokens per step.
- **Hardware:** Single NVIDIA A40 GPU.
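
For reference, the optimizer and schedule above correspond to the usual nanoGPT-style setup. The sketch below uses the stated AdamW hyperparameters and learning-rate bounds; the warmup length and total number of decay steps are not given in this card and are placeholders:

```python
import math

import torch

max_lr, min_lr = 8e-4, 8e-5  # from the card
warmup_iters = 2_000         # assumption: not stated in the card
lr_decay_iters = 60_000      # assumption: roughly one epoch of steps

def get_lr(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr (nanoGPT-style)."""
    if step < warmup_iters:
        return max_lr * (step + 1) / warmup_iters
    if step > lr_decay_iters:
        return min_lr
    progress = (step - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

# AdamW exactly as described: betas (0.9, 0.95), weight decay 0.1.
model = torch.nn.Linear(4, 4)  # stand-in; the real model is the GPT-2 above
optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)
for group in optimizer.param_groups:  # per-step LR update, as in nanoGPT
    group["lr"] = get_lr(step=0)
```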

## Performance

The models were evaluated on their ability to compress DNA sequences (measured in **bits per symbol**, bps) using Arithmetic Encoding. Lower is better.

| Dataset | Metric | DNAGPT2_32 | Benchmark (gzip -9) | Benchmark (Jarvis3) |
| :--- | :--- | :--- | :--- | :--- |
| **Homo sapiens** (T2T-CHM13v2.0) | bits/symbol | 1.470 | 2.022 | 1.384 |
| **M. llanfair...** (Bacteria) | bits/symbol | 1.783 | 2.142 | 1.713 |
| **A. thaliana** (Plant - Chr1) | bits/symbol | 1.876 | 2.161 | 1.702 |

The `DNAGPT2_32` model outperforms general-purpose compressors (gzip) and competitive deep learning models like `hyenaDNA` and `megaDNA` on the evaluated datasets.

## How to Use

The model is compatible with the Hugging Face `transformers` library.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Select the model variant (e.g., vocab size 128 or 32)
# Replace with the specific repository path if hosted on HF Hub
hf_model_repository = "vojtam/DNAGPT2_128"

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    hf_model_repository,
    trust_remote_code=True
).to(device)

tokenizer = AutoTokenizer.from_pretrained(
    hf_model_repository,
    trust_remote_code=True
)

# Inference example
dna_sequence = "ACGTTGCAAACG"
token_ids = tokenizer.encode(dna_sequence, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(token_ids).logits

print(f"Input: {dna_sequence}")
print(f"Logits shape: {logits.shape}")
```