---
language:
- en
tags:
- biology
- dna
- genomics
- metagenomics
- language-model
- awd-lstm
- transfer-learning
license: mit
pipeline_tag: feature-extraction
library_name: pytorch
---

# LookingGlass

LookingGlass is a general-purpose "universal language of life" deep learning model for read-length biological sequences. LookingGlass generates contextually aware, meaningful representations of short DNA reads, enabling transfer learning for a range of downstream tasks.

This is a **pure PyTorch implementation** with no fastai dependencies.

## Links

- **Paper**: [Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter](https://doi.org/10.1038/s41467-022-30070-8) (Nature Communications, 2022)
- **GitHub**: [ahoarfrost/LookingGlass](https://github.com/ahoarfrost/LookingGlass)

## Citation

If you use LookingGlass, please cite:

```bibtex
@article{hoarfrost2022deep,
  title={Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter},
  author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana},
  journal={Nature Communications},
  volume={13},
  number={1},
  pages={2606},
  year={2022},
  publisher={Nature Publishing Group}
}
```

## Model

| | |
|---|---|
| Architecture | AWD-LSTM (3-layer, unidirectional) |
| Hidden size | 1152 |
| Embedding size | 104 |
| Parameters | ~17M |
| Vocabulary | 8 tokens (G, A, C, T + special tokens) |
| Training data | Metagenomic sequences |

## Vocabulary

| Token | ID | Description |
|-------|-----|-------------|
| `xxunk` | 0 | Unknown |
| `xxpad` | 1 | Padding |
| `xxbos` | 2 | Beginning of sequence |
| `xxeos` | 3 | End of sequence |
| `G` | 4 | Guanine |
| `A` | 5 | Adenine |
| `C` | 6 | Cytosine |
| `T` | 7 | Thymine |

## Installation

```bash
pip install torch
git clone https://huggingface.co/HoarfrostLab/lookingglass-v1
cd lookingglass-v1
```

## Usage

### Quick Start

```python
from lookingglass import LookingGlass, LookingGlassTokenizer

model = LookingGlass.from_pretrained('.')
tokenizer = LookingGlassTokenizer()

inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True)
embeddings = model.get_embeddings(inputs['input_ids'])
print(embeddings.shape)  # torch.Size([2, 104])
```

### Getting Embeddings

The primary use case is extracting sequence embeddings for downstream tasks:

```python
from lookingglass import LookingGlass, LookingGlassTokenizer
import torch

model = LookingGlass.from_pretrained('./lookingglass-v1')
tokenizer = LookingGlassTokenizer()
model.eval()

# Your DNA sequences
sequences = [
    "ATCGATCGATCG",
    "GATTACAGATTACA",
    "GCGCGCGCGCGC"
]

# Tokenize
inputs = tokenizer(sequences, return_tensors=True)

# Extract embeddings
with torch.no_grad():
    embeddings = model.get_embeddings(inputs['input_ids'])

# embeddings: (3, 104) - one 104-dimensional vector per sequence
print(f"Embedding shape: {embeddings.shape}")
```

### Language Modeling

To access the full language model with its next-token prediction head:

```python
from lookingglass import LookingGlassLM, LookingGlassTokenizer

model = LookingGlassLM.from_pretrained('./lookingglass-v1')
tokenizer = LookingGlassTokenizer()

inputs = tokenizer("GATTACA", return_tensors=True)

# Get next-token prediction logits
logits = model(inputs['input_ids'])
print(logits.shape)  # torch.Size([1, 8, 8]) - (batch, seq_len, vocab_size)

# Embeddings also available
embeddings = model.get_embeddings(inputs['input_ids'])
```
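### Inspecting Token IDs

Token IDs follow the vocabulary table above. As a quick sanity check, you can inspect and round-trip the IDs for a short read. This is a minimal sketch using only the API shown in this card; the exact values assume the default `add_bos_token=True`, which is also why the Language Modeling example above reports a `seq_len` of 8 for the 7-base read "GATTACA":

```python
from lookingglass import LookingGlassTokenizer

tokenizer = LookingGlassTokenizer()

inputs = tokenizer("GATTACA", return_tensors=True)
ids = inputs['input_ids']

# With the default add_bos_token=True, "GATTACA" should map to
# [xxbos, G, A, T, T, A, C, A] = [2, 4, 5, 7, 7, 5, 6, 5]
print(ids)

# Decoding with skip_special_tokens=True should recover the original read
print(tokenizer.decode(ids[0], skip_special_tokens=True))  # expect "GATTACA"
```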
### GPU Usage

```python
import torch
from lookingglass import LookingGlass, LookingGlassTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LookingGlass.from_pretrained('./lookingglass-v1')
model = model.to(device)
model.eval()

tokenizer = LookingGlassTokenizer()
inputs = tokenizer(["GATTACA"], return_tensors=True)
input_ids = inputs['input_ids'].to(device)

with torch.no_grad():
    embeddings = model.get_embeddings(input_ids)
```

## API Reference

### LookingGlassTokenizer

```python
tokenizer = LookingGlassTokenizer(
    add_bos_token=True,   # Add xxbos at start (default: True)
    add_eos_token=False,  # Add xxeos at end (default: False)
)

# Tokenize
inputs = tokenizer(
    sequences,            # str or List[str]
    return_tensors=True,  # Return PyTorch tensors
    padding=True,         # Pad to longest sequence
    max_length=None,      # Optional max length
    truncation=False,     # Truncate to max_length
)

# Decode
tokenizer.decode(token_ids, skip_special_tokens=True)
```

### LookingGlass

```python
model = LookingGlass.from_pretrained(path)

# Get sequence embeddings (recommended)
embeddings = model.get_embeddings(input_ids)  # (batch, 104)

# Get hidden states for all positions
hidden = model.get_hidden_states(input_ids)  # (batch, seq_len, 104)

# Forward pass (same as get_embeddings)
embeddings = model(input_ids)  # (batch, 104)
```

### LookingGlassLM

```python
model = LookingGlassLM.from_pretrained(path)

# Get logits for next-token prediction
logits = model(input_ids)  # (batch, seq_len, 8)

# Get embeddings
embeddings = model.get_embeddings(input_ids)  # (batch, 104)
```

## License

MIT License
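## Appendix: Transfer Learning Sketch

The 104-dimensional embeddings are intended as frozen features for downstream tasks. The sketch below is illustrative only: the reads and binary labels are placeholders, and the linear classification head is an assumption for demonstration, not part of the released model.

```python
import torch
import torch.nn as nn
from lookingglass import LookingGlass, LookingGlassTokenizer

model = LookingGlass.from_pretrained('./lookingglass-v1')
tokenizer = LookingGlassTokenizer()
model.eval()  # keep the encoder frozen; only the head is trained

# Placeholder reads and binary labels -- substitute your own data
sequences = ["ATCGATCGATCG", "GATTACAGATTACA", "GCGCGCGCGCGC", "TTTTAAAACCCC"]
labels = torch.tensor([0, 1, 0, 1])

# Extract frozen 104-d features once, up front
inputs = tokenizer(sequences, return_tensors=True)
with torch.no_grad():
    features = model.get_embeddings(inputs['input_ids'])  # (4, 104)

# Hypothetical two-class head on top of the embeddings
head = nn.Linear(104, 2)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    logits = head(features)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

print(head(features).argmax(dim=1))  # predicted classes for the toy reads
```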