|
|
--- |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- biology |
|
|
- dna |
|
|
- genomics |
|
|
- metagenomics |
|
|
- language-model |
|
|
- awd-lstm |
|
|
- transfer-learning |
|
|
license: mit |
|
|
pipeline_tag: feature-extraction |
|
|
library_name: pytorch |
|
|
--- |
|
|
|
|
|
# LookingGlass |
|
|
|
|
|
LookingGlass is a general-purpose "universal language of life" deep learning model for read-length biological sequences. It generates contextually aware, meaningful representations of short DNA reads, enabling transfer learning for a range of downstream tasks.
|
|
|
|
|
This is a **pure PyTorch implementation** with no fastai dependencies. |
|
|
|
|
|
## Links |
|
|
|
|
|
- **Paper**: [Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter](https://doi.org/10.1038/s41467-022-30070-8) (Nature Communications, 2022) |
|
|
- **GitHub**: [ahoarfrost/LookingGlass](https://github.com/ahoarfrost/LookingGlass) |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use LookingGlass, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{hoarfrost2022deep, |
|
|
title={Deep learning of a bacterial and archaeal universal language of life |
|
|
enables transfer learning and illuminates microbial dark matter}, |
|
|
author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana}, |
|
|
journal={Nature Communications}, |
|
|
volume={13}, |
|
|
number={1}, |
|
|
pages={2606}, |
|
|
year={2022}, |
|
|
publisher={Nature Publishing Group} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model |
|
|
|
|
|
| Property | Value |


|----------|-------|
|
|
| Architecture | AWD-LSTM (3-layer, unidirectional) | |
|
|
| Hidden size | 1152 | |
|
|
| Embedding size | 104 | |
|
|
| Parameters | ~17M | |
|
|
| Vocabulary | 8 tokens (4 nucleotides + 4 special tokens) |
|
|
| Training data | Metagenomic sequences | |
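
As a sanity check, the ~17M figure is roughly consistent with a standard 3-layer LSTM of these dimensions. A back-of-the-envelope estimate (assuming standard LSTM gate parameterization and a final layer that projects back to the 104-dim embedding, as in ULMFiT-style AWD-LSTMs; this is an illustration, not the exact checkpoint accounting):

```python
# Rough parameter estimate for a 3-layer LSTM: emb 104 -> 1152 -> 1152 -> 104.
# An LSTM layer has 4 gates, each with (input + hidden + bias) x hidden weights.
def lstm_params(n_in, n_hid):
    return 4 * (n_in + n_hid + 1) * n_hid

emb, hid, vocab = 104, 1152, 8
total = (
    vocab * emb                 # token embedding table
    + lstm_params(emb, hid)     # layer 1
    + lstm_params(hid, hid)     # layer 2
    + lstm_params(hid, emb)     # layer 3 projects back to embedding size
)
print(f"~{total / 1e6:.1f}M parameters")  # ~16.9M
```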
|
|
|
|
|
## Vocabulary |
|
|
|
|
|
| Token | ID | Description | |
|
|
|-------|-----|-------------| |
|
|
| `xxunk` | 0 | Unknown | |
|
|
| `xxpad` | 1 | Padding | |
|
|
| `xxbos` | 2 | Beginning of sequence | |
|
|
| `xxeos` | 3 | End of sequence | |
|
|
| `G` | 4 | Guanine | |
|
|
| `A` | 5 | Adenine | |
|
|
| `C` | 6 | Cytosine | |
|
|
| `T` | 7 | Thymine | |
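
The table above fully determines the encoding. A minimal pure-Python sketch of how a read maps to token IDs (the actual tokenizer prepends `xxbos` by default, per the Usage section; `encode` here is an illustrative helper, not the library API):

```python
# Illustrative re-implementation of the vocabulary mapping above.
VOCAB = {"xxunk": 0, "xxpad": 1, "xxbos": 2, "xxeos": 3,
         "G": 4, "A": 5, "C": 6, "T": 7}

def encode(seq):
    """Map a DNA read to token IDs, prepending xxbos and
    falling back to xxunk for ambiguous bases (e.g. N)."""
    return [VOCAB["xxbos"]] + [VOCAB.get(base, VOCAB["xxunk"]) for base in seq]

print(encode("GATTACA"))  # [2, 4, 5, 7, 7, 5, 6, 5]
print(encode("GANC"))     # [2, 4, 5, 0, 6]
```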
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install torch |
|
|
git clone https://huggingface.co/HoarfrostLab/lookingglass-v1 |
|
|
cd lookingglass-v1 |
|
|
``` |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```python |
|
|
from lookingglass import LookingGlass, LookingGlassTokenizer |
|
|
|
|
|
model = LookingGlass.from_pretrained('.') |
|
|
tokenizer = LookingGlassTokenizer() |
|
|
|
|
|
inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True) |
|
|
embeddings = model.get_embeddings(inputs['input_ids']) |
|
|
print(embeddings.shape) # torch.Size([2, 104]) |
|
|
``` |
|
|
|
|
|
### Getting Embeddings |
|
|
|
|
|
The primary use case is extracting sequence embeddings for downstream tasks: |
|
|
|
|
|
```python |
|
|
from lookingglass import LookingGlass, LookingGlassTokenizer |
|
|
import torch |
|
|
|
|
|
model = LookingGlass.from_pretrained('./lookingglass-v1') |
|
|
tokenizer = LookingGlassTokenizer() |
|
|
model.eval() |
|
|
|
|
|
# Your DNA sequences |
|
|
sequences = [ |
|
|
"ATCGATCGATCG", |
|
|
"GATTACAGATTACA", |
|
|
"GCGCGCGCGCGC" |
|
|
] |
|
|
|
|
|
# Tokenize |
|
|
inputs = tokenizer(sequences, return_tensors=True) |
|
|
|
|
|
# Extract embeddings |
|
|
with torch.no_grad(): |
|
|
embeddings = model.get_embeddings(inputs['input_ids']) |
|
|
|
|
|
# embeddings: (3, 104) - one 104-dimensional vector per sequence |
|
|
print(f"Embedding shape: {embeddings.shape}") |
|
|
``` |
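
Once extracted, the 104-dim embeddings can feed any downstream comparison or classifier. As one hypothetical downstream step (pure Python, no model required), cosine similarity between two embedding vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-dim vectors standing in for 104-dim LookingGlass embeddings:
a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a
c = [-4.0, 3.0]  # orthogonal to a
print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```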
|
|
|
|
|
### Language Modeling |
|
|
|
|
|
To access the full language model with prediction head: |
|
|
|
|
|
```python |
|
|
from lookingglass import LookingGlassLM, LookingGlassTokenizer |
|
|
|
|
|
model = LookingGlassLM.from_pretrained('./lookingglass-v1') |
|
|
tokenizer = LookingGlassTokenizer() |
|
|
|
|
|
inputs = tokenizer("GATTACA", return_tensors=True) |
|
|
|
|
|
# Get next-token prediction logits |
|
|
logits = model(inputs['input_ids']) |
|
|
print(logits.shape) # torch.Size([1, 8, 8]) - (batch, seq_len, vocab_size) |
|
|
|
|
|
# Embeddings also available |
|
|
embeddings = model.get_embeddings(inputs['input_ids']) |
|
|
``` |
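
The logits can be turned into next-base probabilities with a softmax over the vocabulary dimension. A self-contained sketch of that last step (using a hypothetical logits row rather than real model output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logits row."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

ID_TO_TOKEN = ["xxunk", "xxpad", "xxbos", "xxeos", "G", "A", "C", "T"]

# Hypothetical logits for the position after "GATTAC" (not real model output):
logits = [-4.0, -4.0, -4.0, -4.0, 0.5, 2.0, 0.1, 0.3]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(ID_TO_TOKEN[best])  # A
```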
|
|
|
|
|
### GPU Usage |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from lookingglass import LookingGlass, LookingGlassTokenizer |
|
|
|
|
|
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') |
|
|
|
|
|
model = LookingGlass.from_pretrained('./lookingglass-v1') |
|
|
model = model.to(device) |
|
|
model.eval() |
|
|
|
|
|
tokenizer = LookingGlassTokenizer() |
|
|
|
|
|
inputs = tokenizer(["GATTACA"], return_tensors=True) |
|
|
input_ids = inputs['input_ids'].to(device) |
|
|
|
|
|
with torch.no_grad(): |
|
|
embeddings = model.get_embeddings(input_ids) |
|
|
``` |
|
|
|
|
|
## API Reference |
|
|
|
|
|
### LookingGlassTokenizer |
|
|
|
|
|
```python |
|
|
tokenizer = LookingGlassTokenizer( |
|
|
add_bos_token=True, # Add xxbos at start (default: True) |
|
|
add_eos_token=False, # Add xxeos at end (default: False) |
|
|
) |
|
|
|
|
|
# Tokenize |
|
|
inputs = tokenizer( |
|
|
sequences, # str or List[str] |
|
|
return_tensors=True, # Return PyTorch tensors |
|
|
padding=True, # Pad to longest sequence |
|
|
max_length=None, # Optional max length |
|
|
truncation=False, # Truncate to max_length |
|
|
) |
|
|
|
|
|
# Decode |
|
|
tokenizer.decode(token_ids, skip_special_tokens=True) |
|
|
``` |
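
When `padding=True`, shorter reads in a batch are padded with `xxpad` (ID 1) to the length of the longest sequence. A pure-Python sketch of that behavior (illustrative, not the library internals):

```python
PAD_ID = 1  # xxpad

def pad_batch(batch):
    """Right-pad a batch of token-ID lists to the longest sequence."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [PAD_ID] * (max_len - len(ids)) for ids in batch]

batch = [[2, 4, 5, 7], [2, 4, 5]]
print(pad_batch(batch))  # [[2, 4, 5, 7], [2, 4, 5, 1]]
```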
|
|
|
|
|
### LookingGlass |
|
|
|
|
|
```python |
|
|
model = LookingGlass.from_pretrained(path) |
|
|
|
|
|
# Get sequence embeddings (recommended) |
|
|
embeddings = model.get_embeddings(input_ids) # (batch, 104) |
|
|
|
|
|
# Get hidden states for all positions |
|
|
hidden = model.get_hidden_states(input_ids) # (batch, seq_len, 104) |
|
|
|
|
|
# Forward pass (same as get_embeddings) |
|
|
embeddings = model(input_ids) # (batch, 104) |
|
|
``` |
|
|
|
|
|
### LookingGlassLM |
|
|
|
|
|
```python |
|
|
model = LookingGlassLM.from_pretrained(path) |
|
|
|
|
|
# Get logits for next-token prediction |
|
|
logits = model(input_ids) # (batch, seq_len, 8) |
|
|
|
|
|
# Get embeddings |
|
|
embeddings = model.get_embeddings(input_ids) # (batch, 104) |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License |
|
|
|