HoarfrostLab/lookingglass-v1

@@ -1,218 +1,44 @@
----
-language:
-- en
-tags:
-- biology
-- dna
-- genomics
-- metagenomics
-- language-model
-- awd-lstm
-- transfer-learning
-license: mit
-pipeline_tag: feature-extraction
-library_name: pytorch
----
-# LookingGlass
-LookingGlass is a general-purpose "universal language of life" deep learning model for read-length biological sequences. LookingGlass generates contextually-aware, meaningful representations of short DNA reads, enabling transfer learning for a range of downstream tasks.
-This is a **pure PyTorch implementation** with no fastai dependencies.
-## Links
-- **Paper**: [Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter](https://doi.org/10.1038/s41467-022-30070-8) (Nature Communications, 2022)
-- **GitHub**: [ahoarfrost/LookingGlass](https://github.com/ahoarfrost/LookingGlass)
-## Citation
-If you use LookingGlass, please cite:
-```bibtex
-@article{hoarfrost2022deep,
-  title={Deep learning of a bacterial and archaeal universal language of life
-         enables transfer learning and illuminates microbial dark matter},
-  author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana},
-  journal={Nature Communications},
-  volume={13},
-  number={1},
-  pages={2606},
-  year={2022},
-  publisher={Nature Publishing Group}
-}
-```
-## Model
-| | |
-|---|---|
-| Architecture | AWD-LSTM (3-layer, unidirectional) |
-| Hidden size | 1152 |
-| Embedding size | 104 |
-| Parameters | ~17M |
-| Vocabulary | 8 tokens (G, A, C, T + special tokens) |
-| Training data | Metagenomic sequences |
-## Vocabulary
-| Token | ID | Description |
-|-------|-----|-------------|
-| `xxunk` | 0 | Unknown |
-| `xxpad` | 1 | Padding |
-| `xxbos` | 2 | Beginning of sequence |
-| `xxeos` | 3 | End of sequence |
-| `G` | 4 | Guanine |
-| `A` | 5 | Adenine |
-| `C` | 6 | Cytosine |
-| `T` | 7 | Thymine |
-## Installation
-```bash
-pip install torch huggingface_hub
-```
-## Usage
-### Quick Start
-```python
-from lookingglass import LookingGlass, LookingGlassTokenizer
-# Load directly from HuggingFace Hub
-model = LookingGlass.from_pretrained('HoarfrostLab/lookingglass-v1')
-tokenizer = LookingGlassTokenizer()
-# Tokenize DNA sequences
-inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True)
-# Get embeddings
-embeddings = model.get_embeddings(inputs['input_ids'])
-print(embeddings.shape)  # torch.Size([2, 104])
-```
-### Getting Embeddings
-The primary use case is extracting sequence embeddings for downstream tasks:
-```python
-from lookingglass import LookingGlass, LookingGlassTokenizer
-import torch
-model = LookingGlass.from_pretrained('./lookingglass-v1')
-tokenizer = LookingGlassTokenizer()
-model.eval()
-# Your DNA sequences
-sequences = [
-    "ATCGATCGATCG",
-    "GATTACAGATTACA",
-    "GCGCGCGCGCGC"
-]
-# Tokenize
-inputs = tokenizer(sequences, return_tensors=True)
-# Extract embeddings
-with torch.no_grad():
-    embeddings = model.get_embeddings(inputs['input_ids'])
-# embeddings: (3, 104) - one 104-dimensional vector per sequence
-print(f"Embedding shape: {embeddings.shape}")
-```
-### Language Modeling
-To access the full language model with prediction head:
-```python
-from lookingglass import LookingGlassLM, LookingGlassTokenizer
-model = LookingGlassLM.from_pretrained('./lookingglass-v1')
-tokenizer = LookingGlassTokenizer()
-inputs = tokenizer("GATTACA", return_tensors=True)
-# Get next-token prediction logits
-logits = model(inputs['input_ids'])
-print(logits.shape)  # torch.Size([1, 8, 8]) - (batch, seq_len, vocab_size)
-# Embeddings also available
-embeddings = model.get_embeddings(inputs['input_ids'])
-```
-### GPU Usage
-```python
-import torch
-from lookingglass import LookingGlass, LookingGlassTokenizer
-device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
-model = LookingGlass.from_pretrained('./lookingglass-v1')
-model = model.to(device)
-model.eval()
-tokenizer = LookingGlassTokenizer()
-inputs = tokenizer(["GATTACA"], return_tensors=True)
-input_ids = inputs['input_ids'].to(device)
-with torch.no_grad():
-    embeddings = model.get_embeddings(input_ids)
-```
-## API Reference
-### LookingGlassTokenizer
-```python
-tokenizer = LookingGlassTokenizer(
-    add_bos_token=True,   # Add xxbos at start (default: True)
-    add_eos_token=False,  # Add xxeos at end (default: False)
-)
-# Tokenize
-inputs = tokenizer(
-    sequences,            # str or List[str]
-    return_tensors=True,  # Return PyTorch tensors
-    padding=True,         # Pad to longest sequence
-    max_length=None,      # Optional max length
-    truncation=False,     # Truncate to max_length
-)
-# Decode
-tokenizer.decode(token_ids, skip_special_tokens=True)
-```
-### LookingGlass
-```python
-model = LookingGlass.from_pretrained(path)
-# Get sequence embeddings (recommended)
-embeddings = model.get_embeddings(input_ids)  # (batch, 104)
-# Get hidden states for all positions
-hidden = model.get_hidden_states(input_ids)   # (batch, seq_len, 104)
-# Forward pass (same as get_embeddings)
-embeddings = model(input_ids)                 # (batch, 104)
-```
-### LookingGlassLM
-```python
-model = LookingGlassLM.from_pretrained(path)
-# Get logits for next-token prediction
-logits = model(input_ids)                     # (batch, seq_len, 8)
-# Get embeddings
-embeddings = model.get_embeddings(input_ids)  # (batch, 104)
-```
-## License
-MIT License

+ {
+         "name": "Introduction",
+         "objective": (
+             "Establish the scientific background, identify the gap in existing knowledge, "
+             "and state the paper's aims or hypotheses, main findings, and contribution to the field. Structure as a narrative funnel: "
+             "broad context → specific gap → this paper's claim and contribution."
+         ),
+     },
+     {
+         "name": "Methods",
+         "objective": (
+             "Describe the experimental design, materials, procedures, and "
+             "analysis pipeline with sufficient detail for reproducibility."
+         ),
+     },
+     {
+         "name": "Results",
+         "objective": (
+             "Present findings objectively, referencing figures and statistics. "
+             "State results as claims supported by evidence; do not interpret here."
+         ),
+     },
+     {
+         "name": "Discussion",
+         "objective": (
+             "Interpret results in the context of prior work, address alternative explanations "
+             "and limitations, state conclusions clearly, and identify future directions. "
+             "Open with the central finding restated as a claim. Close with the paper's central contribution to the field and its potential for future work."
+         ),
+     },
+     {
+         "name": "Abstract",
+         "objective": (
+             "Concise summary (typically 150–300 words) of motivation/objective, methods, key results, "
+             "and conclusions. Written last. Should stand alone and answer 'So what?'."
+         ),
+     },
+     {
+         "name": "References",
+         "objective": (
+             "Complete, correctly formatted bibliography of all works cited in the text. Only real scientific publications. "
+             "Every in-text citation must appear here; every entry here must be cited in text."
+         ),
+     },
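
The added lines begin mid-structure: each entry is a dict with a `name` and an `objective`, but whatever encloses them is not part of this hunk. Below is a minimal sketch of how such entries might be checked and rendered, assuming they sit inside a Python list. The names `SECTIONS`, `validate_sections`, and `build_section_prompt`, and the shortened objective strings, are hypothetical and not taken from the repository.

```python
from typing import Dict, List

# Hypothetical container: the diff shows only the dict entries, not what encloses them.
SECTIONS: List[Dict[str, str]] = [
    {
        "name": "Introduction",
        "objective": (
            "Establish the scientific background, identify the gap in existing knowledge, "
            "and state the paper's aims."
        ),
    },
    {
        "name": "Abstract",
        "objective": "Concise summary of motivation, methods, key results, and conclusions.",
    },
]


def validate_sections(sections: List[Dict[str, str]]) -> None:
    """Raise ValueError if an entry lacks a required key or a section name repeats."""
    seen = set()
    for entry in sections:
        missing = {"name", "objective"} - entry.keys()
        if missing:
            raise ValueError(f"section entry missing keys: {sorted(missing)}")
        if entry["name"] in seen:
            raise ValueError(f"duplicate section name: {entry['name']}")
        seen.add(entry["name"])


def build_section_prompt(section: Dict[str, str]) -> str:
    """Render one section spec as a plain-text instruction (illustrative only)."""
    return f"## {section['name']}\n{section['objective']}"


validate_sections(SECTIONS)
prompts = [build_section_prompt(s) for s in SECTIONS]
```

Validating before rendering would catch a truncated or duplicated entry early, which matters here because the fragment as committed is not a complete Python expression on its own.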