HoarfrostLab/lookingglass-v1

@@ -1,44 +1,215 @@
- {
-         "name": "Introduction",
-         "objective": (
-             "Establish the scientific background, identify the gap in existing knowledge, "
-             "and state the paper's aims/hypotheses and main findings and contribution to the field. Structure as a narrative funnel: "
-             "broad context → specific gap → this paper's claim and contribution."
-         ),
-     },
-     {
-         "name": "Methods",
-         "objective": (
-             "Describe the experimental design, materials, procedures, and "
-             "analysis pipeline with sufficient detail for reproducibility."
-         ),
-     },
-     {
-         "name": "Results",
-         "objective": (
-             "Present findings objectively, referencing figures and statistics. "
-             "State results as claims supported by evidence; do not interpret here. "
-         ),
-     },
-     {
-         "name": "Discussion",
-         "objective": (
-             "Interpret results in the context of prior work, address alternative explanations "
-             "and limitations, state conclusions clearly, and identify future directions. "
-             "Open with the central finding restated as a claim. Close with the central contribution to the field and potential for the future."
-         ),
-     },
-     {
-         "name": "Abstract",
-         "objective": (
-             "Concise summary (typically 150–300 words) of motivation/objective, methods, key results, "
-             "and conclusions. Written last. Should stand alone and answer 'So what?'."
-         ),
-     },
-     {
-         "name": "References",
-         "objective": (
-             "Complete, correctly formatted bibliography of all works cited in the text. Only real scientific publications. "
-             "Every in-text citation must appear here; every entry here must be cited in text."
-         ),
-     },

+---
+language:
+- en
+tags:
+- biology
+- dna
+- genomics
+- metagenomics
+- language-model
+- awd-lstm
+- transfer-learning
+license: mit
+pipeline_tag: feature-extraction
+library_name: pytorch
+---
+# LookingGlass
+LookingGlass is a general-purpose "universal language of life" deep learning model for read-length biological sequences. It generates contextually aware representations of short DNA reads, enabling transfer learning across a range of downstream tasks.
+This is a **pure PyTorch implementation** with no fastai dependencies.
+## Links
+- **Paper**: [Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter](https://doi.org/10.1038/s41467-022-30070-8) (Nature Communications, 2022)
+- **GitHub**: [ahoarfrost/LookingGlass](https://github.com/ahoarfrost/LookingGlass)
+## Citation
+If you use LookingGlass, please cite:
+```bibtex
+@article{hoarfrost2022deep,
+  title={Deep learning of a bacterial and archaeal universal language of life
+         enables transfer learning and illuminates microbial dark matter},
+  author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana},
+  journal={Nature Communications},
+  volume={13},
+  number={1},
+  pages={2606},
+  year={2022},
+  publisher={Nature Publishing Group}
+}
+```
+## Model
+| Property | Value |
+|---|---|
+| Architecture | AWD-LSTM (3-layer, unidirectional) |
+| Hidden size | 1152 |
+| Embedding size | 104 |
+| Parameters | ~17M |
+| Vocabulary | 8 tokens (G, A, C, T + special tokens) |
+| Training data | Metagenomic sequences |
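As a rough sanity check on the ~17M figure, the parameter count can be reproduced from the sizes in the table. This is a sketch under assumptions the table does not state: the usual AWD-LSTM layer widths (104 → 1152 → 1152 → 104), PyTorch-style LSTM layers with two bias vectors each, and a decoder that shares the embedding matrix. The helper name `lstm_params` is illustrative only.

```python
def lstm_params(input_size: int, hidden_size: int) -> int:
    """Parameter count of one PyTorch-style LSTM layer (4 gates, 2 bias vectors)."""
    weights = 4 * hidden_size * (input_size + hidden_size)
    biases = 2 * 4 * hidden_size
    return weights + biases

VOCAB, EMB, HIDDEN = 8, 104, 1152  # values from the table above

# Assumed layer widths: embedding -> hidden -> hidden -> embedding
layer_sizes = [(EMB, HIDDEN), (HIDDEN, HIDDEN), (HIDDEN, EMB)]
lstm_total = sum(lstm_params(i, h) for i, h in layer_sizes)

# Assumed to be shared with the decoder (weight tying), so counted once
embedding = VOCAB * EMB

total = lstm_total + embedding
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the total comes to about 16.9M, which is consistent with the ~17M in the table. Almost all of it sits in the first two LSTM layers.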
+## Vocabulary
+| Token | ID | Description |
+|-------|-----|-------------|
+| `xxunk` | 0 | Unknown |
+| `xxpad` | 1 | Padding |
+| `xxbos` | 2 | Beginning of sequence |
+| `xxeos` | 3 | End of sequence |
+| `G` | 4 | Guanine |
+| `A` | 5 | Adenine |
+| `C` | 6 | Cytosine |
+| `T` | 7 | Thymine |
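To make the mapping concrete, here is a minimal pure-Python sketch of how a read could be turned into token IDs using the table above. `encode_read` is a hypothetical helper, not the repository's tokenizer, and mapping any character outside G/A/C/T (such as `N`) to `xxunk` is an assumption.

```python
# Token IDs copied from the vocabulary table
SPECIAL = {"xxunk": 0, "xxpad": 1, "xxbos": 2, "xxeos": 3}
BASES = {"G": 4, "A": 5, "C": 6, "T": 7}

def encode_read(read, add_bos=True, add_eos=False):
    """Map a DNA string to a list of token IDs (illustrative only)."""
    ids = [SPECIAL["xxbos"]] if add_bos else []
    # Assumption: characters outside G/A/C/T fall back to xxunk
    ids.extend(BASES.get(base, SPECIAL["xxunk"]) for base in read)
    if add_eos:
        ids.append(SPECIAL["xxeos"])
    return ids

print(encode_read("GATTACA"))  # [2, 4, 5, 7, 7, 5, 6, 5]
```

With `xxbos` prepended, a 7-base read becomes 8 tokens.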
+## Installation
+```bash
+pip install torch
+git clone https://huggingface.co/HoarfrostLab/lookingglass-v1
+```
+## Usage
+### Quick Start
+```python
+from lookingglass import LookingGlass, LookingGlassTokenizer
+model = LookingGlass.from_pretrained('./lookingglass-v1')
+tokenizer = LookingGlassTokenizer()
+inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True)
+embeddings = model.get_embeddings(inputs['input_ids'])
+print(embeddings.shape)  # torch.Size([2, 104])
+```
+### Getting Embeddings
+The primary use case is extracting sequence embeddings for downstream tasks:
+```python
+from lookingglass import LookingGlass, LookingGlassTokenizer
+import torch
+model = LookingGlass.from_pretrained('./lookingglass-v1')
+tokenizer = LookingGlassTokenizer()
+model.eval()
+# Your DNA sequences
+sequences = [
+    "ATCGATCGATCG",
+    "GATTACAGATTACA",
+    "GCGCGCGCGCGC"
+]
+# Tokenize
+inputs = tokenizer(sequences, return_tensors=True)
+# Extract embeddings
+with torch.no_grad():
+    embeddings = model.get_embeddings(inputs['input_ids'])
+# embeddings: (3, 104) - one 104-dimensional vector per sequence
+print(f"Embedding shape: {embeddings.shape}")
+```
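One simple downstream use of such vectors is comparing reads by cosine similarity. The sketch below uses short made-up vectors in place of real 104-dimensional embeddings so that it runs without the model. The values and the names `toy_embeddings` and `cosine_similarity` are illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 4-dimensional stand-ins for real embeddings
toy_embeddings = {
    "read_a": [0.9, 0.1, 0.3, 0.0],
    "read_b": [0.8, 0.2, 0.4, 0.1],
    "read_c": [-0.5, 0.9, -0.2, 0.7],
}

query = toy_embeddings["read_a"]
for name, vector in toy_embeddings.items():
    print(name, round(cosine_similarity(query, vector), 3))
```

With real embeddings as a PyTorch tensor, `torch.nn.functional.cosine_similarity` computes the same quantity on batches.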
+### Language Modeling
+To use the full language model, including its next-token prediction head:
+```python
+from lookingglass import LookingGlassLM, LookingGlassTokenizer
+model = LookingGlassLM.from_pretrained('./lookingglass-v1')
+tokenizer = LookingGlassTokenizer()
+inputs = tokenizer("GATTACA", return_tensors=True)
+# Get next-token prediction logits
+logits = model(inputs['input_ids'])
+print(logits.shape)  # torch.Size([1, 8, 8]) - (batch, seq_len, vocab_size)
+# Embeddings also available
+embeddings = model.get_embeddings(inputs['input_ids'])
+```
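The logits are unnormalized scores over the 8-token vocabulary. A softmax turns the scores at one position into next-token probabilities. The sketch below does this in pure Python on made-up logits, so the numbers are illustrative and are not model output. With the real tensor, `torch.softmax(logits[0, -1], dim=-1)` would be the usual equivalent, assuming the last position holds the prediction for the token that follows the input.

```python
import math

# Index i holds the token with ID i (see the Vocabulary table)
ID_TO_TOKEN = ["xxunk", "xxpad", "xxbos", "xxeos", "G", "A", "C", "T"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a single position (not real model output)
last_position_logits = [-4.0, -6.0, -5.0, -3.0, 0.5, 2.0, 0.1, 1.2]

probs = softmax(last_position_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print("most likely next token:", ID_TO_TOKEN[best])
```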
+### GPU Usage
+```python
+import torch
+from lookingglass import LookingGlass, LookingGlassTokenizer
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+model = LookingGlass.from_pretrained('./lookingglass-v1')
+model = model.to(device)
+model.eval()
+tokenizer = LookingGlassTokenizer()
+inputs = tokenizer(["GATTACA"], return_tensors=True)
+input_ids = inputs['input_ids'].to(device)
+with torch.no_grad():
+    embeddings = model.get_embeddings(input_ids)
+```
+## API Reference
+### LookingGlassTokenizer
+```python
+tokenizer = LookingGlassTokenizer(
+    add_bos_token=True,   # Add xxbos at start (default: True)
+    add_eos_token=False,  # Add xxeos at end (default: False)
+)
+# Tokenize
+inputs = tokenizer(
+    sequences,            # str or List[str]
+    return_tensors=True,  # Return PyTorch tensors
+    padding=True,         # Pad to longest sequence
+    max_length=None,      # Optional max length
+    truncation=False,     # Truncate to max_length
+)
+# Decode
+tokenizer.decode(token_ids, skip_special_tokens=True)
+```
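The `padding`, `max_length`, and `truncation` options interact in a way that is easy to sketch in plain Python. The helper `pad_batch` below is hypothetical, and padding on the right is an assumption. The tokenizer in the repository may pad on the other side.

```python
PAD_ID = 1  # xxpad in the vocabulary table

def pad_batch(batch, max_length=None, truncation=False):
    """Optionally truncate, then pad every ID list to the longest one."""
    if truncation and max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in batch)
    # Assumption: padding is appended on the right
    return [ids + [PAD_ID] * (width - len(ids)) for ids in batch]

# xxbos + "GATTACA" and xxbos + "ATCG", as token IDs
batch = [[2, 4, 5, 7, 7, 5, 6, 5], [2, 5, 7, 6, 4]]

padded = pad_batch(batch)
truncated = pad_batch(batch, max_length=4, truncation=True)
print(padded)
print(truncated)
```

After padding, every row has the same length, which is what lets the batch be stacked into a single `(batch, seq_len)` tensor.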
+### LookingGlass
+```python
+model = LookingGlass.from_pretrained(path)
+# Get sequence embeddings (recommended)
+embeddings = model.get_embeddings(input_ids)  # (batch, 104)
+# Get hidden states for all positions
+hidden = model.get_hidden_states(input_ids)   # (batch, seq_len, 104)
+# Forward pass (same as get_embeddings)
+embeddings = model(input_ids)                 # (batch, 104)
+```
+### LookingGlassLM
+```python
+model = LookingGlassLM.from_pretrained(path)
+# Get logits for next-token prediction
+logits = model(input_ids)                     # (batch, seq_len, 8)
+# Get embeddings
+embeddings = model.get_embeddings(input_ids)  # (batch, 104)
+```
+## License
+MIT License