# LookingGlass
LookingGlass is a general-purpose "universal language of life" deep learning model for read-length biological sequences. LookingGlass generates contextually-aware, meaningful representations of short DNA reads, enabling transfer learning for a range of downstream tasks.
This is a pure PyTorch implementation with no fastai dependencies.
## Links
- Paper: Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter (Nature Communications, 2022)
- GitHub: ahoarfrost/LookingGlass
## Citation
If you use LookingGlass, please cite:
```bibtex
@article{hoarfrost2022deep,
  title={Deep learning of a bacterial and archaeal universal language of life
         enables transfer learning and illuminates microbial dark matter},
  author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana},
  journal={Nature Communications},
  volume={13},
  number={1},
  pages={2606},
  year={2022},
  publisher={Nature Publishing Group}
}
```
## Model

| Property | Value |
|---|---|
| Architecture | AWD-LSTM (3-layer, unidirectional) |
| Hidden size | 1152 |
| Embedding size | 104 |
| Parameters | ~17M |
| Vocabulary | 8 tokens (G, A, C, T + 4 special tokens) |
| Training data | Metagenomic sequences |
### Vocabulary
| Token | ID | Description |
|---|---|---|
| xxunk | 0 | Unknown |
| xxpad | 1 | Padding |
| xxbos | 2 | Beginning of sequence |
| xxeos | 3 | End of sequence |
| G | 4 | Guanine |
| A | 5 | Adenine |
| C | 6 | Cytosine |
| T | 7 | Thymine |
## Installation

```bash
pip install torch
git clone https://huggingface.co/HoarfrostLab/lookingglass-v1
cd lookingglass-v1
```
## Usage

### Quick Start

```python
from lookingglass import LookingGlass, LookingGlassTokenizer

model = LookingGlass.from_pretrained('.')
tokenizer = LookingGlassTokenizer()

inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True)
embeddings = model.get_embeddings(inputs['input_ids'])
print(embeddings.shape)  # torch.Size([2, 104])
```
### Getting Embeddings
The primary use case is extracting sequence embeddings for downstream tasks:
```python
from lookingglass import LookingGlass, LookingGlassTokenizer
import torch

model = LookingGlass.from_pretrained('./lookingglass-v1')
tokenizer = LookingGlassTokenizer()
model.eval()

# Your DNA sequences
sequences = [
    "ATCGATCGATCG",
    "GATTACAGATTACA",
    "GCGCGCGCGCGC",
]

# Tokenize
inputs = tokenizer(sequences, return_tensors=True)

# Extract embeddings
with torch.no_grad():
    embeddings = model.get_embeddings(inputs['input_ids'])

# embeddings: (3, 104) - one 104-dimensional vector per sequence
print(f"Embedding shape: {embeddings.shape}")
```
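A common downstream step once fixed-size embeddings are extracted is comparing sequences by cosine similarity. The sketch below uses plain Python on toy vectors to keep it dependency-free; with real LookingGlass embeddings you would apply the same formula to the 104-dimensional rows (in PyTorch, `torch.nn.functional.cosine_similarity` does this directly):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d vectors standing in for 104-d LookingGlass embeddings
print(cosine_similarity([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]))  # 1.0
```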
### Language Modeling
To access the full language model with prediction head:
```python
from lookingglass import LookingGlassLM, LookingGlassTokenizer

model = LookingGlassLM.from_pretrained('./lookingglass-v1')
tokenizer = LookingGlassTokenizer()

inputs = tokenizer("GATTACA", return_tensors=True)

# Get next-token prediction logits
logits = model(inputs['input_ids'])
print(logits.shape)  # torch.Size([1, 8, 8]) - (batch, seq_len, vocab_size)

# Embeddings are also available
embeddings = model.get_embeddings(inputs['input_ids'])
```
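To turn the per-position logits into next-token probabilities, apply a softmax over the vocabulary dimension. A dependency-free sketch on toy logits (with real outputs you would use `torch.softmax(logits, dim=-1)` on the last position):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert one position's vocabulary logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over the 8-token vocabulary for the final position
last_position_logits = [0.1, 0.0, 0.0, 0.2, 1.5, 0.8, 0.9, 1.1]
probs = softmax(last_position_logits)

vocab = ["xxunk", "xxpad", "xxbos", "xxeos", "G", "A", "C", "T"]
print(vocab[probs.index(max(probs))])  # G (the highest toy logit)
```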
### GPU Usage
```python
import torch
from lookingglass import LookingGlass, LookingGlassTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LookingGlass.from_pretrained('./lookingglass-v1')
model = model.to(device)
model.eval()

tokenizer = LookingGlassTokenizer()
inputs = tokenizer(["GATTACA"], return_tensors=True)
input_ids = inputs['input_ids'].to(device)

with torch.no_grad():
    embeddings = model.get_embeddings(input_ids)
```
## API Reference

### LookingGlassTokenizer
```python
tokenizer = LookingGlassTokenizer(
    add_bos_token=True,   # Add xxbos at start (default: True)
    add_eos_token=False,  # Add xxeos at end (default: False)
)

# Tokenize
inputs = tokenizer(
    sequences,            # str or List[str]
    return_tensors=True,  # Return PyTorch tensors
    padding=True,         # Pad to longest sequence
    max_length=None,      # Optional max length
    truncation=False,     # Truncate to max_length
)

# Decode
tokenizer.decode(token_ids, skip_special_tokens=True)
```
### LookingGlass
```python
model = LookingGlass.from_pretrained(path)

# Get sequence embeddings (recommended)
embeddings = model.get_embeddings(input_ids)  # (batch, 104)

# Get hidden states for all positions
hidden = model.get_hidden_states(input_ids)  # (batch, seq_len, 104)

# Forward pass (same as get_embeddings)
embeddings = model(input_ids)  # (batch, 104)
```
### LookingGlassLM
```python
model = LookingGlassLM.from_pretrained(path)

# Get logits for next-token prediction
logits = model(input_ids)  # (batch, seq_len, 8)

# Get embeddings
embeddings = model.get_embeddings(input_ids)  # (batch, 104)
```
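The relationship between `get_hidden_states` (one 104-d vector per position) and `get_embeddings` (one 104-d vector per sequence) can be pictured as pooling over positions. Below is a toy mean-pooling sketch in plain Python, included only to illustrate the shapes; this is an assumption for illustration, and the actual reduction LookingGlass performs may differ (e.g. it may take the final hidden state instead):

```python
def mean_pool(hidden_states: list[list[float]]) -> list[float]:
    """Average per-position hidden vectors (seq_len, dim) into one (dim,) vector."""
    seq_len = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / seq_len for d in range(dim)]

# Toy (seq_len=2, dim=3) hidden states standing in for (seq_len, 104)
print(mean_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))  # [2.0, 3.0, 4.0]
```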
## License
MIT License