adriennehoarfrost committed on
Commit 2b36589 · verified · 1 Parent(s): cb2997e

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +215 -44
README.md CHANGED
@@ -1,44 +1,215 @@
- {
- "name": "Introduction",
- "objective": (
- "Establish the scientific background, identify the gap in existing knowledge, "
- "and state the paper's aims/hypotheses and main findings and contribution to the field. Structure as a narrative funnel: "
- "broad context → specific gap → this paper's claim and contribution."
- ),
- },
- {
- "name": "Methods",
- "objective": (
- "Describe the experimental design, materials, procedures, and "
- "analysis pipeline with sufficient detail for reproducibility."
- ),
- },
- {
- "name": "Results",
- "objective": (
- "Present findings objectively, referencing figures and statistics. "
- "State results as claims supported by evidence; do not interpret here. "
- ),
- },
- {
- "name": "Discussion",
- "objective": (
- "Interpret results in the context of prior work, address alternative explanations "
- "and limitations, state conclusions clearly, and identify future directions. "
- "Open with the central finding restated as a claim. Close with the central contribution to the field and potential for the future."
- ),
- },
- {
- "name": "Abstract",
- "objective": (
- "Concise summary (typically 150–300 words) of motivation/objective, methods, key results, "
- "and conclusions. Written last. Should stand alone and answer 'So what?'."
- ),
- },
- {
- "name": "References",
- "objective": (
- "Complete, correctly formatted bibliography of all works cited in the text. Only real scientific publications. "
- "Every in-text citation must appear here; every entry here must be cited in text."
- ),
- },
+ ---
+ language:
+ - en
+ tags:
+ - biology
+ - dna
+ - genomics
+ - metagenomics
+ - language-model
+ - awd-lstm
+ - transfer-learning
+ license: mit
+ pipeline_tag: feature-extraction
+ library_name: pytorch
+ ---
+
+ # LookingGlass
+
+ LookingGlass is a general-purpose "universal language of life" deep learning model for read-length biological sequences. It generates contextually aware, meaningful representations of short DNA reads, enabling transfer learning for a range of downstream tasks.
+
+ This is a **pure PyTorch implementation** with no fastai dependencies.
+
+ ## Links
+
+ - **Paper**: [Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter](https://doi.org/10.1038/s41467-022-30070-8) (Nature Communications, 2022)
+ - **GitHub**: [ahoarfrost/LookingGlass](https://github.com/ahoarfrost/LookingGlass)
+
+ ## Citation
+
+ If you use LookingGlass, please cite:
+
+ ```bibtex
+ @article{hoarfrost2022deep,
+   title={Deep learning of a bacterial and archaeal universal language of life
+          enables transfer learning and illuminates microbial dark matter},
+   author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana},
+   journal={Nature Communications},
+   volume={13},
+   number={1},
+   pages={2606},
+   year={2022},
+   publisher={Nature Publishing Group}
+ }
+ ```
+
+ ## Model
+
+ | Property | Value |
+ |---|---|
+ | Architecture | AWD-LSTM (3-layer, unidirectional) |
+ | Hidden size | 1152 |
+ | Embedding size | 104 |
+ | Parameters | ~17M |
+ | Vocabulary | 8 tokens (G, A, C, T + special tokens) |
+ | Training data | Metagenomic sequences |
+
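As a rough sanity check on the parameter count in the table above, the sketch below tallies the weights of a 3-layer LSTM stack. It assumes the usual AWD-LSTM layout (embedding size → hidden → hidden → embedding size), the PyTorch `nn.LSTM` parameterization with two bias vectors per layer, and a decoder tied to the embedding matrix. None of these details are confirmed by this README, and the helper name `lstm_layer_params` is illustrative only.

```python
# Back-of-the-envelope parameter count.
# All layer sizes below are assumptions, not values read from the checkpoint.
VOCAB_SIZE = 8
EMBEDDING_SIZE = 104
HIDDEN_SIZE = 1152


def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and two bias vectors (PyTorch convention)."""
    return 4 * hidden_size * (input_size + hidden_size + 2)


# Assumed layer sizes: 104 -> 1152 -> 1152 -> 104
layer_shapes = [
    (EMBEDDING_SIZE, HIDDEN_SIZE),
    (HIDDEN_SIZE, HIDDEN_SIZE),
    (HIDDEN_SIZE, EMBEDDING_SIZE),
]

embedding_params = VOCAB_SIZE * EMBEDDING_SIZE
lstm_params = sum(lstm_layer_params(i, h) for i, h in layer_shapes)
total_params = embedding_params + lstm_params  # tied decoder adds no weights

print(f"{total_params:,} parameters (~{total_params / 1e6:.1f}M)")
```

Under these assumptions the total comes to 16,947,072, which is consistent with the ~17M listed in the table.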
+ ## Vocabulary
+
+ | Token | ID | Description |
+ |-------|-----|-------------|
+ | `xxunk` | 0 | Unknown |
+ | `xxpad` | 1 | Padding |
+ | `xxbos` | 2 | Beginning of sequence |
+ | `xxeos` | 3 | End of sequence |
+ | `G` | 4 | Guanine |
+ | `A` | 5 | Adenine |
+ | `C` | 6 | Cytosine |
+ | `T` | 7 | Thymine |
+
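To make the table concrete, here is a minimal sketch of how a read might be mapped to token IDs. It is an illustrative re-implementation built only from the table above, not the `LookingGlassTokenizer` code. The function name `encode`, the upper-casing of input, and the mapping of any non-GACT character (such as `N`) to `xxunk` are assumptions.

```python
from typing import List

# Vocabulary in ID order, copied from the table above.
VOCAB = ["xxunk", "xxpad", "xxbos", "xxeos", "G", "A", "C", "T"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def encode(sequence: str, add_bos: bool = True, add_eos: bool = False) -> List[int]:
    """Map each nucleotide to its ID; anything outside G/A/C/T becomes xxunk.

    Hypothetical helper: upper-casing and the xxunk fallback are assumptions.
    """
    ids = [TOKEN_TO_ID["xxbos"]] if add_bos else []
    for base in sequence.upper():
        ids.append(TOKEN_TO_ID.get(base, TOKEN_TO_ID["xxunk"]))
    if add_eos:
        ids.append(TOKEN_TO_ID["xxeos"])
    return ids


print(encode("GATTACA"))  # [2, 4, 5, 7, 7, 5, 6, 5]
```

With the beginning-of-sequence token prepended, a 7-base read becomes 8 token IDs.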
+ ## Installation
+
+ ```bash
+ pip install torch
+ git clone https://huggingface.co/HoarfrostLab/lookingglass-v1
+ ```
+
+ ## Usage
+
+ ### Quick Start
+
+ ```python
+ from lookingglass import LookingGlass, LookingGlassTokenizer
+
+ model = LookingGlass.from_pretrained('./lookingglass-v1')
+ tokenizer = LookingGlassTokenizer()
+
+ inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True)
+ embeddings = model.get_embeddings(inputs['input_ids'])
+ print(embeddings.shape) # torch.Size([2, 104])
+ ```
+
+ ### Getting Embeddings
+
+ The primary use case is extracting sequence embeddings for downstream tasks:
+
+ ```python
+ from lookingglass import LookingGlass, LookingGlassTokenizer
+ import torch
+
+ model = LookingGlass.from_pretrained('./lookingglass-v1')
+ tokenizer = LookingGlassTokenizer()
+ model.eval()
+
+ # Your DNA sequences
+ sequences = [
+     "ATCGATCGATCG",
+     "GATTACAGATTACA",
+     "GCGCGCGCGCGC"
+ ]
+
+ # Tokenize
+ inputs = tokenizer(sequences, return_tensors=True)
+
+ # Extract embeddings
+ with torch.no_grad():
+     embeddings = model.get_embeddings(inputs['input_ids'])
+
+ # embeddings: (3, 104) - one 104-dimensional vector per sequence
+ print(f"Embedding shape: {embeddings.shape}")
+ ```
+
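For read sets too large to embed in one call, one option is to process them in mini-batches. The sketch below shows the chunking logic only. `iter_batches`, `embed_in_batches`, and the `embed_fn` callable are hypothetical names that are not part of the LookingGlass API, and `fake_embed` is a stand-in so the example runs without the model.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(sequences: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size sequences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sequences), batch_size):
        yield list(sequences[start:start + batch_size])


def embed_in_batches(
    sequences: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 256,
) -> List[List[float]]:
    """Apply embed_fn to each chunk and collect one vector per sequence, in order."""
    vectors: List[List[float]] = []
    for batch in iter_batches(sequences, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for the real model: returns [length, GC count] per read."""
    return [[float(len(s)), float(s.count("G") + s.count("C"))] for s in batch]


reads = ["ATCGATCGATCG", "GATTACAGATTACA", "GCGCGCGCGCGC"]
vectors = embed_in_batches(reads, fake_embed, batch_size=2)
print(len(vectors))  # 3
```

With the real model, `embed_fn` would presumably tokenize the batch, call `model.get_embeddings` under `torch.no_grad()`, and convert the result with `.tolist()`.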
+ ### Language Modeling
+
+ To access the full language model with prediction head:
+
+ ```python
+ from lookingglass import LookingGlassLM, LookingGlassTokenizer
+
+ model = LookingGlassLM.from_pretrained('./lookingglass-v1')
+ tokenizer = LookingGlassTokenizer()
+
+ inputs = tokenizer("GATTACA", return_tensors=True)
+
+ # Get next-token prediction logits
+ logits = model(inputs['input_ids'])
+ print(logits.shape) # torch.Size([1, 8, 8]) - (batch, seq_len, vocab_size)
+
+ # Embeddings also available
+ embeddings = model.get_embeddings(inputs['input_ids'])
+ ```
+
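The logits are unnormalized scores over the 8 vocabulary tokens. As a hedged illustration of how they could be turned into a next-token distribution, the sketch below applies a softmax to one position's logits in plain Python. The logit values are made up, and `next_token_distribution` is a hypothetical helper rather than part of the library. With real tensors, `torch.softmax(logits[:, -1, :], dim=-1)` should give the equivalent result for the last position.

```python
import math
from typing import Dict, List

# Vocabulary in ID order, as listed in the Vocabulary table.
VOCAB = ["xxunk", "xxpad", "xxbos", "xxeos", "G", "A", "C", "T"]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(position_logits: List[float]) -> Dict[str, float]:
    """Pair each vocabulary token with its predicted probability."""
    return dict(zip(VOCAB, softmax(position_logits)))


# Made-up logits for a single position (not real model output).
example_logits = [-4.0, -6.0, -5.0, -3.0, 0.5, 2.0, 0.1, 1.2]
probs = next_token_distribution(example_logits)
best = max(probs, key=probs.get)
print(best)  # A
```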
+ ### GPU Usage
+
+ ```python
+ import torch
+ from lookingglass import LookingGlass, LookingGlassTokenizer
+
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+ model = LookingGlass.from_pretrained('./lookingglass-v1')
+ model = model.to(device)
+ model.eval()
+
+ tokenizer = LookingGlassTokenizer()
+
+ inputs = tokenizer(["GATTACA"], return_tensors=True)
+ input_ids = inputs['input_ids'].to(device)
+
+ with torch.no_grad():
+     embeddings = model.get_embeddings(input_ids)
+ ```
+
+ ## API Reference
+
+ ### LookingGlassTokenizer
+
+ ```python
+ tokenizer = LookingGlassTokenizer(
+     add_bos_token=True,   # Add xxbos at start (default: True)
+     add_eos_token=False,  # Add xxeos at end (default: False)
+ )
+
+ # Tokenize
+ inputs = tokenizer(
+     sequences,            # str or List[str]
+     return_tensors=True,  # Return PyTorch tensors
+     padding=True,         # Pad to longest sequence
+     max_length=None,      # Optional max length
+     truncation=False,     # Truncate to max_length
+ )
+
+ # Decode
+ tokenizer.decode(token_ids, skip_special_tokens=True)
+ ```
+
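To illustrate what `padding=True` and `skip_special_tokens=True` plausibly do, here is a small sketch using the IDs from the Vocabulary table. It is not the tokenizer's actual code. The helper names are hypothetical, and both right-side padding and treating `xxunk` as a special token to skip are assumptions.

```python
from typing import List

# Vocabulary in ID order, as listed in the Vocabulary table.
VOCAB = ["xxunk", "xxpad", "xxbos", "xxeos", "G", "A", "C", "T"]
PAD_ID = 1
SPECIAL_IDS = {0, 1, 2, 3}  # xxunk, xxpad, xxbos, xxeos


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Pad every ID list to the longest length (right-padding is an assumption)."""
    if not batch:
        return []
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]


def decode(ids: List[int], skip_special_tokens: bool = True) -> str:
    """Turn IDs back into a string, optionally dropping special tokens."""
    kept = [i for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)]
    return "".join(VOCAB[i] for i in kept)


batch = [[2, 4, 5, 7, 7, 5, 6, 5], [2, 5, 7, 6, 4]]  # xxbos+GATTACA, xxbos+ATCG
padded = pad_batch(batch)
print(padded[1])          # [2, 5, 7, 6, 4, 1, 1, 1]
print(decode(padded[1]))  # ATCG
```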
+ ### LookingGlass
+
+ ```python
+ model = LookingGlass.from_pretrained(path)
+
+ # Get sequence embeddings (recommended)
+ embeddings = model.get_embeddings(input_ids) # (batch, 104)
+
+ # Get hidden states for all positions
+ hidden = model.get_hidden_states(input_ids) # (batch, seq_len, 104)
+
+ # Forward pass (same as get_embeddings)
+ embeddings = model(input_ids) # (batch, 104)
+ ```
+
+ ### LookingGlassLM
+
+ ```python
+ model = LookingGlassLM.from_pretrained(path)
+
+ # Get logits for next-token prediction
+ logits = model(input_ids) # (batch, seq_len, 8)
+
+ # Get embeddings
+ embeddings = model.get_embeddings(input_ids) # (batch, 104)
+ ```
+
+ ## License
+
+ MIT License