adriennehoarfrost committed
Commit cb2997e · verified · 1 Parent(s): 205c1b5

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +44 -218
README.md CHANGED
@@ -1,218 +1,44 @@
- ---
- language:
- - en
- tags:
- - biology
- - dna
- - genomics
- - metagenomics
- - language-model
- - awd-lstm
- - transfer-learning
- license: mit
- pipeline_tag: feature-extraction
- library_name: pytorch
- ---
-
- # LookingGlass
-
- LookingGlass is a general-purpose "universal language of life" deep learning model for read-length biological sequences. LookingGlass generates contextually-aware, meaningful representations of short DNA reads, enabling transfer learning for a range of downstream tasks.
-
- This is a **pure PyTorch implementation** with no fastai dependencies.
-
- ## Links
-
- - **Paper**: [Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter](https://doi.org/10.1038/s41467-022-30070-8) (Nature Communications, 2022)
- - **GitHub**: [ahoarfrost/LookingGlass](https://github.com/ahoarfrost/LookingGlass)
-
- ## Citation
-
- If you use LookingGlass, please cite:
-
- ```bibtex
- @article{hoarfrost2022deep,
-   title={Deep learning of a bacterial and archaeal universal language of life
-          enables transfer learning and illuminates microbial dark matter},
-   author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana},
-   journal={Nature Communications},
-   volume={13},
-   number={1},
-   pages={2606},
-   year={2022},
-   publisher={Nature Publishing Group}
- }
- ```
-
- ## Model
-
- | | |
- |---|---|
- | Architecture | AWD-LSTM (3-layer, unidirectional) |
- | Hidden size | 1152 |
- | Embedding size | 104 |
- | Parameters | ~17M |
- | Vocabulary | 8 tokens (G, A, C, T + special tokens) |
- | Training data | Metagenomic sequences |
-
- ## Vocabulary
-
- | Token | ID | Description |
- |-------|-----|-------------|
- | `xxunk` | 0 | Unknown |
- | `xxpad` | 1 | Padding |
- | `xxbos` | 2 | Beginning of sequence |
- | `xxeos` | 3 | End of sequence |
- | `G` | 4 | Guanine |
- | `A` | 5 | Adenine |
- | `C` | 6 | Cytosine |
- | `T` | 7 | Thymine |
-
- ## Installation
-
- ```bash
- pip install torch huggingface_hub
- ```
-
- ## Usage
-
- ### Quick Start
-
- ```python
- from lookingglass import LookingGlass, LookingGlassTokenizer
-
- # Load directly from HuggingFace Hub
- model = LookingGlass.from_pretrained('HoarfrostLab/lookingglass-v1')
- tokenizer = LookingGlassTokenizer()
-
- # Tokenize DNA sequences
- inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True)
-
- # Get embeddings
- embeddings = model.get_embeddings(inputs['input_ids'])
- print(embeddings.shape)  # torch.Size([2, 104])
- ```
-
- ### Getting Embeddings
-
- The primary use case is extracting sequence embeddings for downstream tasks:
-
- ```python
- from lookingglass import LookingGlass, LookingGlassTokenizer
- import torch
-
- model = LookingGlass.from_pretrained('./lookingglass-v1')
- tokenizer = LookingGlassTokenizer()
- model.eval()
-
- # Your DNA sequences
- sequences = [
-     "ATCGATCGATCG",
-     "GATTACAGATTACA",
-     "GCGCGCGCGCGC"
- ]
-
- # Tokenize
- inputs = tokenizer(sequences, return_tensors=True)
-
- # Extract embeddings
- with torch.no_grad():
-     embeddings = model.get_embeddings(inputs['input_ids'])
-
- # embeddings: (3, 104) - one 104-dimensional vector per sequence
- print(f"Embedding shape: {embeddings.shape}")
- ```
-
- ### Language Modeling
-
- To access the full language model with prediction head:
-
- ```python
- from lookingglass import LookingGlassLM, LookingGlassTokenizer
-
- model = LookingGlassLM.from_pretrained('./lookingglass-v1')
- tokenizer = LookingGlassTokenizer()
-
- inputs = tokenizer("GATTACA", return_tensors=True)
-
- # Get next-token prediction logits
- logits = model(inputs['input_ids'])
- print(logits.shape)  # torch.Size([1, 8, 8]) - (batch, seq_len, vocab_size)
-
- # Embeddings also available
- embeddings = model.get_embeddings(inputs['input_ids'])
- ```
-
- ### GPU Usage
-
- ```python
- import torch
- from lookingglass import LookingGlass, LookingGlassTokenizer
-
- device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
-
- model = LookingGlass.from_pretrained('./lookingglass-v1')
- model = model.to(device)
- model.eval()
-
- tokenizer = LookingGlassTokenizer()
-
- inputs = tokenizer(["GATTACA"], return_tensors=True)
- input_ids = inputs['input_ids'].to(device)
-
- with torch.no_grad():
-     embeddings = model.get_embeddings(input_ids)
- ```
-
- ## API Reference
-
- ### LookingGlassTokenizer
-
- ```python
- tokenizer = LookingGlassTokenizer(
-     add_bos_token=True,   # Add xxbos at start (default: True)
-     add_eos_token=False,  # Add xxeos at end (default: False)
- )
-
- # Tokenize
- inputs = tokenizer(
-     sequences,            # str or List[str]
-     return_tensors=True,  # Return PyTorch tensors
-     padding=True,         # Pad to longest sequence
-     max_length=None,      # Optional max length
-     truncation=False,     # Truncate to max_length
- )
-
- # Decode
- tokenizer.decode(token_ids, skip_special_tokens=True)
- ```
-
- ### LookingGlass
-
- ```python
- model = LookingGlass.from_pretrained(path)
-
- # Get sequence embeddings (recommended)
- embeddings = model.get_embeddings(input_ids)  # (batch, 104)
-
- # Get hidden states for all positions
- hidden = model.get_hidden_states(input_ids)  # (batch, seq_len, 104)
-
- # Forward pass (same as get_embeddings)
- embeddings = model(input_ids)  # (batch, 104)
- ```
-
- ### LookingGlassLM
-
- ```python
- model = LookingGlassLM.from_pretrained(path)
-
- # Get logits for next-token prediction
- logits = model(input_ids)  # (batch, seq_len, 8)
-
- # Get embeddings
- embeddings = model.get_embeddings(input_ids)  # (batch, 104)
- ```
-
- ## License
-
- MIT License
 
+ {
+     "name": "Introduction",
+     "objective": (
+         "Establish the scientific background, identify the gap in existing knowledge, "
+         "and state the paper's aims/hypotheses and main findings and contribution to the field. Structure as a narrative funnel: "
+         "broad context → specific gap → this paper's claim and contribution."
+     ),
+ },
+ {
+     "name": "Methods",
+     "objective": (
+         "Describe the experimental design, materials, procedures, and "
+         "analysis pipeline with sufficient detail for reproducibility."
+     ),
+ },
+ {
+     "name": "Results",
+     "objective": (
+         "Present findings objectively, referencing figures and statistics. "
+         "State results as claims supported by evidence; do not interpret here. "
+     ),
+ },
+ {
+     "name": "Discussion",
+     "objective": (
+         "Interpret results in the context of prior work, address alternative explanations "
+         "and limitations, state conclusions clearly, and identify future directions. "
+         "Open with the central finding restated as a claim. Close with the central contribution to the field and potential for the future."
+     ),
+ },
+ {
+     "name": "Abstract",
+     "objective": (
+         "Concise summary (typically 150–300 words) of motivation/objective, methods, key results, "
+         "and conclusions. Written last. Should stand alone and answer 'So what?'."
+     ),
+ },
+ {
+     "name": "References",
+     "objective": (
+         "Complete, correctly formatted bibliography of all works cited in the text. Only real scientific publications. "
+         "Every in-text citation must appear here; every entry here must be cited in text."
+     ),
+ },
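
The vocabulary table in the removed README fully specifies the token-to-ID mapping, so the encoding step can be illustrated without loading the model. The following is a minimal sketch, not the `LookingGlassTokenizer` implementation: `VOCAB` and `encode` are hypothetical names introduced here, the defaults (prepend `xxbos`, no `xxeos`) are taken from the README's API reference, and the handling of lowercase input and of characters outside G/A/C/T is an assumption.

```python
# Hypothetical sketch of the token-to-ID mapping from the README's vocabulary table.
# This is NOT the LookingGlassTokenizer source; all names here are illustrative.
VOCAB = {
    "xxunk": 0, "xxpad": 1, "xxbos": 2, "xxeos": 3,
    "G": 4, "A": 5, "C": 6, "T": 7,
}


def encode(sequence, add_bos_token=True, add_eos_token=False):
    """Map a DNA string to a list of token IDs, one ID per nucleotide."""
    ids = [VOCAB["xxbos"]] if add_bos_token else []
    # Assumption: input is upper-cased, and any character other than
    # G/A/C/T (for example N) maps to xxunk.
    ids.extend(VOCAB.get(base, VOCAB["xxunk"]) for base in sequence.upper())
    if add_eos_token:
        ids.append(VOCAB["xxeos"])
    return ids


print(encode("GATTACA"))  # [2, 4, 5, 7, 7, 5, 6, 5]
```

With the default settings, `"GATTACA"` becomes eight IDs (seven bases plus `xxbos`), which is consistent with the sequence length of 8 in the README's `torch.Size([1, 8, 8])` logits shape.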