---
language:
- en
tags:
- biology
- dna
- genomics
- metagenomics
- language-model
- awd-lstm
- transfer-learning
license: mit
pipeline_tag: feature-extraction
library_name: pytorch
---
# LookingGlass
LookingGlass is a general-purpose "universal language of life" deep learning model for read-length biological sequences. LookingGlass generates contextually-aware, meaningful representations of short DNA reads, enabling transfer learning for a range of downstream tasks.
This is a **pure PyTorch implementation** with no fastai dependencies.
## Links
- **Paper**: [Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter](https://doi.org/10.1038/s41467-022-30070-8) (Nature Communications, 2022)
- **GitHub**: [ahoarfrost/LookingGlass](https://github.com/ahoarfrost/LookingGlass)
## Citation
If you use LookingGlass, please cite:
```bibtex
@article{hoarfrost2022deep,
  title={Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter},
  author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana},
  journal={Nature Communications},
  volume={13},
  number={1},
  pages={2606},
  year={2022},
  publisher={Nature Publishing Group}
}
```
## Model
| Property | Value |
|---|---|
| Architecture | AWD-LSTM (3-layer, unidirectional) |
| Hidden size | 1152 |
| Embedding size | 104 |
| Parameters | ~17M |
| Vocabulary | 8 tokens (G, A, C, T plus 4 special tokens) |
| Training data | Metagenomic sequences |
## Vocabulary
| Token | ID | Description |
|-------|-----|-------------|
| `xxunk` | 0 | Unknown |
| `xxpad` | 1 | Padding |
| `xxbos` | 2 | Beginning of sequence |
| `xxeos` | 3 | End of sequence |
| `G` | 4 | Guanine |
| `A` | 5 | Adenine |
| `C` | 6 | Cytosine |
| `T` | 7 | Thymine |
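For reference, the mapping above is small enough to apply by hand. Below is a minimal sketch of the encoding in plain Python, independent of the tokenizer class; the `encode` helper is illustrative, not part of the package, and follows the default behavior of prepending `xxbos` (see the API Reference below):
```python
# Token IDs taken from the vocabulary table above
VOCAB = {"xxunk": 0, "xxpad": 1, "xxbos": 2, "xxeos": 3,
         "G": 4, "A": 5, "C": 6, "T": 7}

def encode(seq, add_bos=True):
    """Illustrative re-implementation of the default tokenization."""
    ids = [VOCAB["xxbos"]] if add_bos else []
    ids += [VOCAB.get(base, VOCAB["xxunk"]) for base in seq.upper()]
    return ids

print(encode("GATTACA"))  # [2, 4, 5, 7, 7, 5, 6, 5]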
## Installation
```bash
pip install torch
# The model weights are stored with Git LFS; install it first if needed
git lfs install
git clone https://huggingface.co/HoarfrostLab/lookingglass-v1
cd lookingglass-v1
```
## Usage
### Quick Start
```python
import torch
from lookingglass import LookingGlass, LookingGlassTokenizer

model = LookingGlass.from_pretrained('.')  # path to the cloned repo
model.eval()  # disable dropout so embeddings are deterministic
tokenizer = LookingGlassTokenizer()

inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True)
with torch.no_grad():
    embeddings = model.get_embeddings(inputs['input_ids'])
print(embeddings.shape)  # torch.Size([2, 104])
```
### Getting Embeddings
The primary use case is extracting sequence embeddings for downstream tasks:
```python
from lookingglass import LookingGlass, LookingGlassTokenizer
import torch
model = LookingGlass.from_pretrained('./lookingglass-v1')
tokenizer = LookingGlassTokenizer()
model.eval()
# Your DNA sequences
sequences = [
"ATCGATCGATCG",
"GATTACAGATTACA",
"GCGCGCGCGCGC"
]
# Tokenize
inputs = tokenizer(sequences, return_tensors=True)
# Extract embeddings
with torch.no_grad():
embeddings = model.get_embeddings(inputs['input_ids'])
# embeddings: (3, 104) - one 104-dimensional vector per sequence
print(f"Embedding shape: {embeddings.shape}")
```
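One quick downstream check is how similar two reads look in embedding space. Here is a minimal sketch continuing from the snippet above, using cosine similarity; this is a generic choice of metric, not something prescribed by LookingGlass:
```python
import torch.nn.functional as F

# Pairwise cosine similarity between the three embeddings above
normed = F.normalize(embeddings, dim=1)  # unit-length rows
similarity = normed @ normed.T           # (3, 3) matrix
print(similarity)
```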
### Language Modeling
To use the full language model with its next-token prediction head:
```python
from lookingglass import LookingGlassLM, LookingGlassTokenizer
model = LookingGlassLM.from_pretrained('./lookingglass-v1')
model.eval()
tokenizer = LookingGlassTokenizer()

inputs = tokenizer("GATTACA", return_tensors=True)

# Next-token prediction logits at every position
logits = model(inputs['input_ids'])
print(logits.shape)  # torch.Size([1, 8, 8]) - (batch, seq_len, vocab_size); seq_len = xxbos + 7 bases

# Embeddings are also available from the LM
embeddings = model.get_embeddings(inputs['input_ids'])
```
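To turn the logits at the final position into an explicit distribution over the next base, apply a softmax. A small sketch continuing from the snippet above; the token order follows the vocabulary table:
```python
import torch

# Token names in ID order, per the vocabulary table
TOKENS = ["xxunk", "xxpad", "xxbos", "xxeos", "G", "A", "C", "T"]

# Distribution over the token following "GATTACA"
probs = torch.softmax(logits[0, -1], dim=-1)  # (8,)
for name, p in zip(TOKENS, probs.tolist()):
    print(f"{name}: {p:.3f}")
```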
### GPU Usage
```python
import torch
from lookingglass import LookingGlass, LookingGlassTokenizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LookingGlass.from_pretrained('./lookingglass-v1')
model = model.to(device)
model.eval()
tokenizer = LookingGlassTokenizer()
inputs = tokenizer(["GATTACA"], return_tensors=True)
input_ids = inputs['input_ids'].to(device)
with torch.no_grad():
embeddings = model.get_embeddings(input_ids)
```
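For large read sets, it is usually worth embedding in fixed-size batches so GPU memory stays bounded. A sketch continuing from the snippet above; `embed_in_batches` and the batch size are illustrative choices, not part of the package:
```python
def embed_in_batches(sequences, batch_size=256):
    """Embed many reads in fixed-size chunks, returning an (N, 104) tensor."""
    chunks = []
    for i in range(0, len(sequences), batch_size):
        batch = tokenizer(sequences[i:i + batch_size], return_tensors=True)
        ids = batch['input_ids'].to(device)
        with torch.no_grad():
            chunks.append(model.get_embeddings(ids).cpu())
    return torch.cat(chunks, dim=0)

all_embeddings = embed_in_batches(["GATTACA"] * 1000)
print(all_embeddings.shape)  # torch.Size([1000, 104])
```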
## API Reference
### LookingGlassTokenizer
```python
tokenizer = LookingGlassTokenizer(
add_bos_token=True, # Add xxbos at start (default: True)
add_eos_token=False, # Add xxeos at end (default: False)
)
# Tokenize
inputs = tokenizer(
sequences, # str or List[str]
return_tensors=True, # Return PyTorch tensors
padding=True, # Pad to longest sequence
max_length=None, # Optional max length
truncation=False, # Truncate to max_length
)
# Decode
tokenizer.decode(token_ids, skip_special_tokens=True)
```
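As a concrete example of the options above, here is a padded batch followed by a decode roundtrip. The printed shape assumes the documented defaults (`add_bos_token=True`, padding to the longest sequence):
```python
from lookingglass import LookingGlassTokenizer

tokenizer = LookingGlassTokenizer()

# "GATTACA" becomes xxbos + 7 bases = 8 tokens; "ACGT" is padded to match
inputs = tokenizer(["GATTACA", "ACGT"], return_tensors=True, padding=True)
print(inputs['input_ids'].shape)  # torch.Size([2, 8])

# Decoding drops the special tokens and recovers the original bases
print(tokenizer.decode(inputs['input_ids'][1], skip_special_tokens=True))
```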
### LookingGlass
```python
model = LookingGlass.from_pretrained(path)
# Get sequence embeddings (recommended)
embeddings = model.get_embeddings(input_ids) # (batch, 104)
# Get hidden states for all positions
hidden = model.get_hidden_states(input_ids) # (batch, seq_len, 104)
# Forward pass (same as get_embeddings)
embeddings = model(input_ids) # (batch, 104)
```
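If the built-in pooling is not what a downstream task needs, the per-position hidden states allow a user-side aggregation. Below is a sketch of simple mean pooling; this is an illustrative alternative, not necessarily what `get_embeddings` computes internally:
```python
import torch
from lookingglass import LookingGlass, LookingGlassTokenizer

model = LookingGlass.from_pretrained('./lookingglass-v1')
model.eval()
tokenizer = LookingGlassTokenizer()
input_ids = tokenizer(["GATTACA"], return_tensors=True)['input_ids']

with torch.no_grad():
    hidden = model.get_hidden_states(input_ids)  # (1, 8, 104)

# User-side aggregation: average over all positions
mean_pooled = hidden.mean(dim=1)  # (1, 104)
```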
### LookingGlassLM
```python
model = LookingGlassLM.from_pretrained(path)
# Get logits for next-token prediction
logits = model(input_ids) # (batch, seq_len, 8)
# Get embeddings
embeddings = model.get_embeddings(input_ids) # (batch, 104)
```
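Because the LM predicts the next token, it can also be run autoregressively. A minimal greedy-decoding sketch follows; generation is not a documented use of this checkpoint, and masking the special tokens (IDs 0-3, per the vocabulary table) is an assumption made here to keep the output restricted to bases:
```python
import torch
from lookingglass import LookingGlassLM, LookingGlassTokenizer

model = LookingGlassLM.from_pretrained('./lookingglass-v1')
model.eval()
tokenizer = LookingGlassTokenizer()

input_ids = tokenizer("GATTACA", return_tensors=True)['input_ids']
with torch.no_grad():
    for _ in range(10):  # extend the read by 10 bases
        logits = model(input_ids)[0, -1]  # logits at the last position
        logits[:4] = float('-inf')        # mask xxunk/xxpad/xxbos/xxeos
        next_id = logits.argmax().view(1, 1)
        input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0]))  # original bases plus 10 greedy continuations
```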
## License
MIT License