---
language:
  - en
tags:
  - biology
  - dna
  - genomics
  - metagenomics
  - classifier
  - awd-lstm
  - transfer-learning
license: mit
pipeline_tag: text-classification
library_name: pytorch
---

# LookingGlass Reading Frame Classifier

Identifies the correct reading frame start position (1, 2, 3, -1, -2, or -3) for DNA reads. Note: currently intended only for prokaryotic sequences with a low proportion of noncoding DNA.

This is a pure PyTorch implementation fine-tuned from the LookingGlass base model.
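For background, the six labels correspond to the three codon offsets on the forward strand and the three on the reverse complement. A minimal pure-Python sketch of what each frame label denotes (illustrative only, not part of the model code):

```python
# Illustration of the six reading frames of a DNA read (not model code).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def reading_frames(seq):
    """Map each frame label to the in-frame subsequence it implies."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames[offset + 1] = seq[offset:]       # frames 1, 2, 3 (forward)
        frames[-(offset + 1)] = rc[offset:]     # frames -1, -2, -3 (reverse)
    return frames

print(reading_frames("GATTACA")[1])   # GATTACA
print(reading_frames("GATTACA")[-1])  # TGTAATC
```

The classifier's job is to pick which of these six offsets is the true coding frame of a read.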

## Citation

```bibtex
@article{hoarfrost2022deep,
  title={Deep learning of a bacterial and archaeal universal language of life
         enables transfer learning and illuminates microbial dark matter},
  author={Hoarfrost, Adrienne and Aptekmann, Ariel and Farfanuk, Gaetan and Bromberg, Yana},
  journal={Nature Communications},
  volume={13},
  number={1},
  pages={2606},
  year={2022},
  publisher={Nature Publishing Group}
}
```

## Model

| Property     | Value                                      |
|--------------|--------------------------------------------|
| Architecture | LookingGlass encoder + classification head |
| Encoder      | AWD-LSTM (3-layer, unidirectional)         |
| Classes      | 6 (reading frames 1, 2, 3, -1, -2, -3)     |
| Parameters   | ~17M                                       |

## Installation

```bash
pip install torch
git clone https://huggingface.co/HoarfrostLab/LGv1_ReadingFrameClassifier
cd LGv1_ReadingFrameClassifier
```

## Usage

```python
from lookingglass_classifier import LookingGlassClassifier, LookingGlassTokenizer

model = LookingGlassClassifier.from_pretrained('.')
tokenizer = LookingGlassTokenizer()
model.eval()

inputs = tokenizer(["GATTACA", "ATCGATCGATCG"], return_tensors=True)

# Get predicted class indices
predictions = model.predict(inputs['input_ids'])
print(predictions)  # tensor([class_idx, class_idx])

# Get class probabilities
probs = model.predict_proba(inputs['input_ids'])
print(probs.shape)  # torch.Size([2, 6])

# Get raw logits
logits = model(inputs['input_ids'])
print(logits.shape)  # torch.Size([2, 6])
```
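To turn predicted class indices back into frame labels, a small post-processing helper can be used. Note the index-to-label order below is an assumption (not documented here); verify it against the model's training labels before relying on it:

```python
# Hypothetical index-to-label mapping; the exact class order is an
# assumption and should be confirmed against the model's training setup.
FRAME_LABELS = [1, 2, 3, -1, -2, -3]

def indices_to_frames(indices):
    """Map predicted class indices (e.g. predictions.tolist()) to frame labels."""
    return [FRAME_LABELS[int(i)] for i in indices]

print(indices_to_frames([0, 4]))  # [1, -2]
```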

## License

MIT License