---
tags:
- genomics
- bioinformatics
- nanopore
- rna-sequencing
- chimera-detection
- token-classification
- hyenadna
- pytorch
- lightning
license: apache-2.0
datasets:
- nanopore-drna-seq
language:
- dna
library_name: deepchopper
pipeline_tag: token-classification
---

# DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing

DeepChopper is a genomic language model that detects and removes chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.

## Model Details

### Model Description

DeepChopper leverages the HyenaDNA-small-32k backbone, a genomic foundation model, combined with a specialized token classification head to detect chimeric artifacts in nanopore direct RNA sequencing reads. The model processes both sequence information and base quality scores to make accurate predictions.
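
The base quality scores mentioned above come from the FASTQ quality string of each read. As a minimal sketch, assuming the standard Phred+33 FASTQ encoding, a quality string can be decoded into integer scores as shown here. The function names are hypothetical and not part of the DeepChopper API, and DeepChopper's own encoding step may preprocess the scores differently.

```python
def decode_phred33(quality: str) -> list[int]:
    """Convert a FASTQ quality string to integer Phred scores (Phred+33)."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Illustrative quality string, not taken from real data.
scores = decode_phred33("!+5?I")
print(scores)  # [0, 10, 20, 30, 40]
print(error_probability(20))  # about 0.01, i.e. a 1% chance the base is wrong
```

Each score describes one base, so the decoded list has the same length as the read sequence and can be aligned with it position by position.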

- **Developed by:** YLab Team ([Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2))
- **Model type:** Token Classification
- **Language(s):** DNA sequences
- **License:** Apache 2.0
- **Base Model:** HyenaDNA-small-32k-seqlen
- **Repository:** [DeepChopper GitHub](https://github.com/ylab-hi/DeepChopper)
- **Paper:** [A Genomic Language Model for Chimera Artifact Detection](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2)

### Model Architecture

- **Backbone:** HyenaDNA-small-32k (256 dimensions)
- **Classification Head:**
  - Linear Layer 1: 256 → 1024 dimensions
  - Linear Layer 2: 1024 → 1024 dimensions
  - Output Layer: 1024 → 2 classes (artifact/non-artifact)
  - Quality Score Integration: Identity layer for base quality incorporation
- **Input:**
  - Tokenized DNA sequences (vocabulary size: 12)
  - Base quality scores
- **Output:** Per-base classification (artifact vs. non-artifact)
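
To give a sense of the size of the classification head, the layer dimensions listed above can be turned into a parameter count. This is a rough worked calculation, assuming each layer is a plain fully connected layer with a bias term. The card does not state the activation functions, dropout, or bias settings, so the real count may differ. An identity layer contributes no trainable parameters.

```python
# Layer sizes taken from the architecture list above.
HEAD_LAYERS = [
    ("linear_1", 256, 1024),
    ("linear_2", 1024, 1024),
    ("output", 1024, 2),
]


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights plus optional bias for one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def head_params(layers=HEAD_LAYERS, bias: bool = True) -> int:
    # Assumption: every layer is a plain linear layer; bias use is not stated in the card.
    return sum(linear_params(i, o, bias) for _, i, o in layers)


for name, i, o in HEAD_LAYERS:
    print(f"{name}: {i} -> {o}, {linear_params(i, o):,} parameters")
print(f"total: {head_params():,}")  # total: 1,314,818
```

Under these assumptions the head adds roughly 1.3 million parameters on top of the backbone, most of them in the 1024 → 1024 layer.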

## Uses

### Direct Use

DeepChopper is designed for:

- Detecting chimeric artifacts in Nanopore direct RNA sequencing data
- Identifying adapter sequences within base-called reads
- Preprocessing RNA-seq data before downstream transcriptomics analysis
- Improving accuracy of transcript annotation and gene fusion detection

### Downstream Use

The cleaned data can be used for:

- Transcript isoform analysis
- Gene expression quantification
- Novel transcript discovery
- Gene fusion detection
- Alternative splicing analysis

### Out-of-Scope Use

This model is NOT designed for:

- DNA sequencing data (it is trained specifically on RNA sequencing reads)
- PacBio or Illumina sequencing platforms
- Genome assembly or variant calling

## Training Details

### Training Data

The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.

### Training Procedure

- **Optimizer:** Adam (lr=0.0002, weight_decay=0)
- **Learning Rate Scheduler:** ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
- **Loss Function:** Continuous Interval Loss (CrossEntropyLoss with no penalty)
- **Framework:** PyTorch Lightning
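
To illustrate what the scheduler settings above mean in practice, here is a simplified, pure-Python stand-in for a reduce-on-plateau schedule. It is a sketch only and not the training code. It assumes the monitored quantity is a validation loss, which the card does not state. It also omits the improvement threshold, cooldown, and minimum learning rate that PyTorch's `ReduceLROnPlateau` supports. The class name `PlateauScheduler` is hypothetical.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau schedule in 'min' mode (lower is better)."""

    def __init__(self, lr: float = 2e-4, factor: float = 0.1, patience: int = 10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's monitored value and return the current learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce only once the number of non-improving epochs exceeds patience.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler()
losses = [0.50, 0.40] + [0.45] * 11  # illustrative values; improvement stops after epoch 2
for epoch, loss in enumerate(losses, start=1):
    lr = scheduler.step(loss)
    print(f"epoch {epoch:2d}  val_loss {loss:.2f}  lr {lr:.1e}")
```

With `patience=10`, the learning rate stays at 0.0002 through ten consecutive non-improving epochs and drops by the factor of 0.1 on the eleventh.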

### Training Hyperparameters

- Learning Rate: 0.0002
- Batch Size: Configured per experiment
- Weight Decay: 0
- Backbone: Fine-tuned (not frozen)

## Evaluation

### Testing Data & Metrics

The model is evaluated on held-out test sets using:

- F1 Score (primary metric)
- Precision
- Recall
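
As a reminder of how these three metrics relate, the sketch below computes them from per-base labels. It assumes label 1 marks an artifact base and is treated as the positive class. The card does not specify the label mapping, or whether metrics are computed per base or aggregated another way, so treat this as an illustration and not the evaluation code.

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Binary precision, recall, and F1 with label 1 (artifact) as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: the prediction is shifted one base to the left of the true artifact.
y_true = [0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [0, 1, 1, 1, 1, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

In the toy example, three artifact bases are found, one is missed, and one clean base is wrongly flagged, so all three metrics come out at 0.75.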

### Results

DeepChopper significantly improves downstream analysis quality by accurately removing chimeric artifacts that would otherwise confound transcriptome analyses.

## How to Use

### Installation

```bash
pip install deepchopper
```

### Python API

```python
import deepchopper

# Load the pretrained model
model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")

# The model is ready for inference
# Use with deepchopper's predict pipeline
```

### Command Line Interface

```bash
# Step 1: Encode your FASTQ data
deepchopper encode input.fq

# Step 2: Predict chimeric artifacts
deepchopper predict input.parquet --output predictions

# Step 3: Remove artifacts and generate clean FASTQ
deepchopper chop predictions input.fq
```

For GPU acceleration:

```bash
deepchopper predict input.parquet --output predictions --gpus 1
```
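
Conceptually, the chop step turns per-base predictions into artifact intervals and keeps the segments between them. The following is a minimal sketch of that idea, not DeepChopper's actual implementation. The function names are hypothetical, label 1 is assumed to mean artifact, and the real tool may additionally smooth predictions or filter short segments.

```python
def artifact_intervals(labels: list[int]) -> list[tuple[int, int]]:
    """Return half-open [start, end) runs where the label is 1 (artifact)."""
    intervals = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i
        elif label != 1 and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(labels)))
    return intervals


def chop(sequence: str, quality: str, labels: list[int]) -> list[tuple[str, str]]:
    """Split a read at artifact runs, keeping the non-artifact segments."""
    if not (len(sequence) == len(quality) == len(labels)):
        raise ValueError("sequence, quality, and labels must align per base")
    segments = []
    cursor = 0
    # A zero-length sentinel interval at the end flushes the final segment.
    for start, end in artifact_intervals(labels) + [(len(labels), len(labels))]:
        if start > cursor:
            segments.append((sequence[cursor:start], quality[cursor:start]))
        cursor = end
    return segments


# Toy read: six bases, a six-base predicted artifact, then four bases.
seq = "ACGTACAAAAAATTGC"
qual = "I" * len(seq)
labels = [0] * 6 + [1] * 6 + [0] * 4
print(artifact_intervals(labels))  # [(6, 12)]
print(chop(seq, qual, labels))  # [('ACGTAC', 'IIIIII'), ('TTGC', 'IIII')]
```

A chimeric read with an internal adapter therefore yields two separate records, each of which can be written back to FASTQ with its own quality string.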

### Web Interface

Try DeepChopper online without installation:

- [Hugging Face Space](https://huggingface.co/spaces/yangliz5/deepchopper)
- Or run locally: `deepchopper web`

## Limitations

- **Platform-specific:** Optimized for Nanopore direct RNA sequencing
- **Read length:** Best performance on reads up to 32k bases (model context window)
- **Species:** Trained primarily on human RNA sequences
- **Computational requirements:** GPU recommended for large datasets
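
Given the read-length limitation above, it can help to check how many reads in a FASTQ file exceed the context window before running prediction. The sketch below is one way to do that with the standard library. It assumes "32k" means 32,768 bases and a simple four-line-per-record FASTQ layout. The function names are hypothetical, and how DeepChopper itself treats longer reads is not described in this card.

```python
import io
from typing import Iterator, TextIO

MAX_CONTEXT = 32_768  # assumption: "32k" read as 32,768; adjust if the real limit differs


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def split_by_length(handle: TextIO, limit: int = MAX_CONTEXT) -> tuple[list[str], list[str]]:
    """Partition read names into those within and those beyond the limit."""
    within, beyond = [], []
    for name, sequence, _ in read_fastq(handle):
        (within if len(sequence) <= limit else beyond).append(name)
    return within, beyond


# In-memory demo with a tiny limit so the split is visible.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGTACGT\n+\nIIIIIIII\n")
print(split_by_length(demo, limit=6))  # (['read1'], ['read2'])
```

For a real file, pass an open handle, for example `open("input.fq")`, and keep the default limit.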

## Citation

If you use DeepChopper in your research, please cite:

```bibtex
@article{Li2024.10.23.619929,
  author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
  title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
  year = {2024},
  doi = {10.1101/2024.10.23.619929},
  journal = {bioRxiv}
}
```

## Contact & Support

- **Issues:** [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues)
- **Documentation:** [Full Tutorial](https://github.com/ylab-hi/DeepChopper/blob/main/documentation/tutorial.md)
- **Repository:** [GitHub](https://github.com/ylab-hi/DeepChopper)

## Model Card Authors

YLab Team

## Model Card Contact

For questions about this model, please open an issue on the [GitHub repository](https://github.com/ylab-hi/DeepChopper/issues).