LookingGlass

LookingGlass is a general-purpose "universal language of life" deep learning model for read-length biological sequences. It generates contextually aware, meaningful representations of short DNA reads, enabling transfer learning for a range of downstream tasks.

This is a pure PyTorch implementation with no fastai dependencies.

Citation

If you use LookingGlass, please cite:

@article{hoarfrost2022deep,
  title={Deep learning of a bacterial and archaeal universal language of life
         enables transfer learning and illuminates microbial dark matter},
  author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana},
  journal={Nature Communications},
  volume={13},
  number={1},
  pages={2606},
  year={2022},
  publisher={Nature Publishing Group}
}

Model

Architecture    AWD-LSTM (3-layer, unidirectional)
Hidden size     1152
Embedding size  104
Parameters      ~17M
Vocabulary      8 tokens (G, A, C, T + 4 special tokens)
Training data   Metagenomic sequences

Vocabulary

Token   ID   Description
xxunk   0    Unknown
xxpad   1    Padding
xxbos   2    Beginning of sequence
xxeos   3    End of sequence
G       4    Guanine
A       5    Adenine
C       6    Cytosine
T       7    Thymine
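
As a quick sanity check, the table above maps a read directly to integer IDs. The sketch below assumes the tokenizer's default settings (add_bos_token=True, add_eos_token=False; see the API reference):

from lookingglass import LookingGlassTokenizer

tokenizer = LookingGlassTokenizer()
inputs = tokenizer("GATTACA", return_tensors=True)

# With the default xxbos prefix, "GATTACA" -> [xxbos, G, A, T, T, A, C, A]
print(inputs['input_ids'])  # expected: tensor([[2, 4, 5, 7, 7, 5, 6, 5]])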

Installation

pip install torch
git clone https://huggingface.co/HoarfrostLab/lookingglass-v1
cd lookingglass-v1
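
If the pretrained checkpoint in this repository is stored with Git LFS (typical for Hugging Face model repos; this is an assumption, not stated above), make sure Git LFS is set up so the weights download in full:

git lfs install
git lfs pull  # run inside lookingglass-v1 if the clone only fetched pointer files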

Usage

Quick Start

from lookingglass import LookingGlass, LookingGlassTokenizer

model = LookingGlass.from_pretrained('.')
tokenizer = LookingGlassTokenizer()

inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True)
embeddings = model.get_embeddings(inputs['input_ids'])
print(embeddings.shape)  # torch.Size([2, 104])

Getting Embeddings

The primary use case is extracting sequence embeddings for downstream tasks:

from lookingglass import LookingGlass, LookingGlassTokenizer
import torch

model = LookingGlass.from_pretrained('./lookingglass-v1')
tokenizer = LookingGlassTokenizer()
model.eval()

# Your DNA sequences
sequences = [
    "ATCGATCGATCG",
    "GATTACAGATTACA",
    "GCGCGCGCGCGC"
]

# Tokenize
inputs = tokenizer(sequences, return_tensors=True)

# Extract embeddings
with torch.no_grad():
    embeddings = model.get_embeddings(inputs['input_ids'])

# embeddings: (3, 104) - one 104-dimensional vector per sequence
print(f"Embedding shape: {embeddings.shape}")

Language Modeling

To access the full language model with its next-token prediction head:

from lookingglass import LookingGlassLM, LookingGlassTokenizer

model = LookingGlassLM.from_pretrained('./lookingglass-v1')
tokenizer = LookingGlassTokenizer()

inputs = tokenizer("GATTACA", return_tensors=True)

# Get next-token prediction logits
logits = model(inputs['input_ids'])
print(logits.shape)  # torch.Size([1, 8, 8]) - (batch, seq_len, vocab_size)

# Embeddings also available
embeddings = model.get_embeddings(inputs['input_ids'])
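
To turn the logits into next-base probabilities, apply a softmax over the vocabulary dimension. The sketch below (continuing from the snippet above) looks at the prediction for the position after the last input token; IDs follow the vocabulary table:

import torch

# Probability distribution over the 8-token vocabulary at the final position
probs = torch.softmax(logits[0, -1], dim=-1)

next_id = int(probs.argmax())
print(next_id)  # index into the vocabulary table (e.g. 4 = G)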

GPU Usage

import torch
from lookingglass import LookingGlass, LookingGlassTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LookingGlass.from_pretrained('./lookingglass-v1')
model = model.to(device)
model.eval()

tokenizer = LookingGlassTokenizer()

inputs = tokenizer(["GATTACA"], return_tensors=True)
input_ids = inputs['input_ids'].to(device)

with torch.no_grad():
    embeddings = model.get_embeddings(input_ids)

API Reference

LookingGlassTokenizer

tokenizer = LookingGlassTokenizer(
    add_bos_token=True,   # Add xxbos at start (default: True)
    add_eos_token=False,  # Add xxeos at end (default: False)
)

# Tokenize
inputs = tokenizer(
    sequences,            # str or List[str]
    return_tensors=True,  # Return PyTorch tensors
    padding=True,         # Pad to longest sequence
    max_length=None,      # Optional max length
    truncation=False,     # Truncate to max_length
)

# Decode
tokenizer.decode(token_ids, skip_special_tokens=True)
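
For example, encoding a read and decoding it back should round-trip to the original bases once special tokens are stripped (a small sketch using only the calls above):

from lookingglass import LookingGlassTokenizer

tokenizer = LookingGlassTokenizer()

inputs = tokenizer("GATTACA", return_tensors=True)
decoded = tokenizer.decode(inputs['input_ids'][0].tolist(), skip_special_tokens=True)
print(decoded)  # should recover the original read, e.g. "GATTACA"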

LookingGlass

model = LookingGlass.from_pretrained(path)

# Get sequence embeddings (recommended)
embeddings = model.get_embeddings(input_ids)  # (batch, 104)

# Get hidden states for all positions
hidden = model.get_hidden_states(input_ids)   # (batch, seq_len, 104)

# Forward pass (same as get_embeddings)
embeddings = model(input_ids)                 # (batch, 104)
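
Because each sequence maps to a single 104-dimensional vector, embeddings can be compared directly, for example with cosine similarity (a sketch; the two reads are arbitrary):

import torch
import torch.nn.functional as F
from lookingglass import LookingGlass, LookingGlassTokenizer

model = LookingGlass.from_pretrained('./lookingglass-v1')
model.eval()
tokenizer = LookingGlassTokenizer()

inputs = tokenizer(["GATTACAGATTACA", "GCGCGCGCGCGC"], return_tensors=True)
with torch.no_grad():
    emb = model.get_embeddings(inputs['input_ids'])  # (2, 104)

similarity = F.cosine_similarity(emb[0:1], emb[1:2]).item()
print(f"Cosine similarity: {similarity:.3f}")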

LookingGlassLM

model = LookingGlassLM.from_pretrained(path)

# Get logits for next-token prediction
logits = model(input_ids)                     # (batch, seq_len, 8)

# Get embeddings
embeddings = model.get_embeddings(input_ids)  # (batch, 104)
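
If the logits at each position predict the following token (standard left-to-right language modeling; an assumption about this implementation, not stated above), a per-batch cross-entropy and perplexity can be estimated by shifting the inputs by one position and ignoring padding (ID 1):

import torch
import torch.nn.functional as F
from lookingglass import LookingGlassLM, LookingGlassTokenizer

model = LookingGlassLM.from_pretrained('./lookingglass-v1')
model.eval()
tokenizer = LookingGlassTokenizer()

inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True, padding=True)
input_ids = inputs['input_ids']

with torch.no_grad():
    logits = model(input_ids)  # (batch, seq_len, 8)

# Position t predicts token t+1; padding positions (ID 1) are ignored
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    input_ids[:, 1:].reshape(-1),
    ignore_index=1,
)
print(f"Cross-entropy: {loss.item():.3f}  Perplexity: {loss.exp().item():.3f}")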

License

MIT License
