---
tags:
- genomics
- bioinformatics
- nanopore
- rna-sequencing
- chimera-detection
- token-classification
- hyenadna
- pytorch
- lightning
license: apache-2.0
datasets:
- nanopore-drna-seq
language:
- dna
library_name: deepchopper
pipeline_tag: token-classification
---
# DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing
DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.
## Model Details
### Model Description
DeepChopper leverages the HyenaDNA-small-32k backbone, a genomic foundation model, combined with a specialized token classification head to detect chimeric artifacts in nanopore direct RNA sequencing reads. The model processes both sequence information and base quality scores to make accurate predictions.
- **Developed by:** YLab Team ([Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2))
- **Model type:** Token Classification
- **Language(s):** DNA sequences
- **License:** Apache 2.0
- **Base Model:** HyenaDNA-small-32k-seqlen
- **Repository:** [DeepChopper GitHub](https://github.com/ylab-hi/DeepChopper)
- **Paper:** [A Genomic Language Model for Chimera Artifact Detection](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2)
### Model Architecture
- **Backbone:** HyenaDNA-small-32k (256 dimensions)
- **Classification Head:**
  - Linear Layer 1: 256 → 1024 dimensions
  - Linear Layer 2: 1024 → 1024 dimensions
  - Output Layer: 1024 → 2 classes (artifact/non-artifact)
  - Quality Score Integration: Identity layer for base quality incorporation
- **Input:**
  - Tokenized DNA sequences (vocabulary size: 12)
  - Base quality scores
- **Output:** Per-base classification (artifact vs. non-artifact)
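To make the per-base output concrete, the sketch below collapses a sequence of per-base labels into artifact intervals and splits a read around them. It is illustrative only: `artifact_intervals` and `split_read` are hypothetical helpers, not part of the deepchopper API, and the label convention (1 = artifact, 0 = non-artifact) is an assumption.

```python
from itertools import groupby


def artifact_intervals(labels):
    """Collapse per-base labels into half-open (start, end) artifact intervals.

    Assumes 1 marks an artifact base and 0 a non-artifact base.
    """
    intervals = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == 1:
            intervals.append((pos, pos + length))
        pos += length
    return intervals


def split_read(seq, intervals):
    """Return the non-artifact segments left after cutting out the intervals."""
    segments = []
    prev_end = 0
    for start, end in intervals:
        if start > prev_end:
            segments.append(seq[prev_end:start])
        prev_end = end
    if prev_end < len(seq):
        segments.append(seq[prev_end:])
    return segments


labels = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
read = "ACGTTTTGCA"
intervals = artifact_intervals(labels)
print(intervals)                    # [(4, 7)]
print(split_read(read, intervals))  # ['ACGT', 'GCA']
```

A read with an internal artifact interval yields two segments, which is how a single chimeric read can be separated into its constituent fragments.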
## Uses
### Direct Use
DeepChopper is designed for:
- Detecting chimeric artifacts in Nanopore direct RNA sequencing data
- Identifying adapter sequences within base-called reads
- Preprocessing RNA-seq data before downstream transcriptomics analysis
- Improving accuracy of transcript annotation and gene fusion detection
### Downstream Use
The cleaned data can be used for:
- Transcript isoform analysis
- Gene expression quantification
- Novel transcript discovery
- Gene fusion detection
- Alternative splicing analysis
### Out-of-Scope Use
This model is NOT designed for:
- DNA sequencing data (it's specifically trained on RNA sequences)
- PacBio or Illumina sequencing platforms
- Genome assembly or variant calling
## Training Details
### Training Data
The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.
### Training Procedure
- **Optimizer:** Adam (lr=0.0002, weight_decay=0)
- **Learning Rate Scheduler:** ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
- **Loss Function:** Continuous Interval Loss (CrossEntropyLoss with no penalty)
- **Framework:** PyTorch Lightning
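The scheduler settings above can be hard to picture, so here is a dependency-free simulation of the basic plateau rule with the listed values (lr=0.0002, factor=0.1, patience=10). It is a simplified sketch, not the PyTorch implementation: `simulate_reduce_on_plateau` is a hypothetical helper, and the improvement threshold, cooldown, and minimum learning rate are left out.

```python
def simulate_reduce_on_plateau(losses, lr=2e-4, factor=0.1, patience=10):
    """Return the learning rate in effect after each epoch's validation loss.

    Basic 'min' mode rule: multiply the rate by `factor` once the loss has
    failed to improve for more than `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# The loss improves for 3 epochs, then stalls for 11 epochs.
losses = [1.0, 0.8, 0.7] + [0.7] * 11
rates = simulate_reduce_on_plateau(losses)
print(rates[12], rates[13])
```

With patience=10, the rate stays at its initial value through the tenth stalled epoch and drops by a factor of ten on the eleventh.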
### Training Hyperparameters
- Learning Rate: 0.0002
- Batch Size: Configured per experiment
- Weight Decay: 0
- Backbone: Fine-tuned (not frozen)
## Evaluation
### Testing Data & Metrics
The model is evaluated on held-out test sets using:
- F1 Score (primary metric)
- Precision
- Recall
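For reference, the three metrics can be computed from binary labels as in the sketch below. The card does not say whether the metrics are scored per base or per read; this example assumes per-base labels with 1 marking an artifact, and `precision_recall_f1` is a hypothetical helper rather than the project's evaluation code.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```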
### Results
By removing chimeric artifacts that would otherwise confound transcriptome analyses, DeepChopper improves the quality of downstream analysis. Quantitative results are reported in the [paper](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2).
## How to Use
### Installation
```bash
pip install deepchopper
```
### Python API
```python
import deepchopper
# Load the pretrained model
model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")
# The model is ready for inference
# Use with deepchopper's predict pipeline
```
### Command Line Interface
```bash
# Step 1: Encode your FASTQ data
deepchopper encode input.fq
# Step 2: Predict chimeric artifacts
deepchopper predict input.parquet --output predictions
# Step 3: Remove artifacts and generate clean FASTQ
deepchopper chop predictions input.fq
```
For GPU acceleration:
```bash
deepchopper predict input.parquet --output predictions --gpus 1
```
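The three CLI steps can also be chained from Python. The sketch below only assembles the commands shown above and runs them with `subprocess`; `build_pipeline` and `run_pipeline` are hypothetical helpers, and the assumption that `encode` writes a `.parquet` file alongside the input FASTQ is inferred from the example file names, not confirmed.

```python
import subprocess
from pathlib import Path


def build_pipeline(fastq, output="predictions", gpus=None):
    """Build the encode -> predict -> chop commands as argument lists."""
    fastq = Path(fastq)
    # Assumption: `encode` writes <name>.parquet next to the input FASTQ.
    parquet = fastq.with_suffix(".parquet")
    predict = ["deepchopper", "predict", str(parquet), "--output", output]
    if gpus is not None:
        predict += ["--gpus", str(gpus)]
    return [
        ["deepchopper", "encode", str(fastq)],
        predict,
        ["deepchopper", "chop", output, str(fastq)],
    ]


def run_pipeline(fastq, **kwargs):
    """Run each step in order, stopping at the first failing command."""
    for cmd in build_pipeline(fastq, **kwargs):
        subprocess.run(cmd, check=True)


# Print the commands without executing them.
for cmd in build_pipeline("input.fq", gpus=1):
    print(" ".join(cmd))
```

Calling `run_pipeline("input.fq", gpus=1)` executes the same commands and requires `deepchopper` to be installed and on the `PATH`.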
### Web Interface
Try DeepChopper online without installation:
- [Hugging Face Space](https://huggingface.co/spaces/yangliz5/deepchopper)
- Or run locally: `deepchopper web`
## Limitations
- **Platform-specific:** Optimized for Nanopore direct RNA sequencing
- **Read length:** Best performance on reads up to 32k bases (model context window)
- **Species:** Trained primarily on human RNA sequences
- **Computational requirements:** GPU recommended for large datasets
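Because of the read-length limitation, it can be useful to check how many reads exceed the context window before running predictions. The sketch below does this with a minimal FASTQ reader. `read_fastq` and `partition_by_length` are hypothetical helpers, the reader assumes four lines per record, and reading "32k" as 32,768 bases is an assumption.

```python
import io

MAX_CONTEXT = 32_768  # Assumption: "32k" is taken to mean 32,768 bases.


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def partition_by_length(records, max_len=MAX_CONTEXT):
    """Split read names into those within max_len and those exceeding it."""
    within, too_long = [], []
    for name, seq, _qual in records:
        (within if len(seq) <= max_len else too_long).append(name)
    return within, too_long


# Tiny in-memory example with a short limit so the split is visible.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGTACGT\n+\nIIIIIIII\n")
within, too_long = partition_by_length(read_fastq(demo), max_len=6)
print(within, too_long)  # ['read1'] ['read2']
```

For a real file, pass an open handle, for example `partition_by_length(read_fastq(open("input.fq")))`.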
## Citation
If you use DeepChopper in your research, please cite:
```bibtex
@article{Li2024.10.23.619929,
author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
year = {2024},
doi = {10.1101/2024.10.23.619929},
journal = {bioRxiv}
}
```
## Contact & Support
- **Issues:** [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues)
- **Documentation:** [Full Tutorial](https://github.com/ylab-hi/DeepChopper/blob/main/documentation/tutorial.md)
- **Repository:** [GitHub](https://github.com/ylab-hi/DeepChopper)
## Model Card Authors
YLab Team
## Model Card Contact
For questions about this model, please open an issue on the [GitHub repository](https://github.com/ylab-hi/DeepChopper/issues).