---
tags:
- genomics
- bioinformatics
- nanopore
- rna-sequencing
- chimera-detection
- token-classification
- hyenadna
- pytorch
- lightning
license: apache-2.0
datasets:
- nanopore-drna-seq
language:
- dna
library_name: deepchopper
pipeline_tag: token-classification
---
# DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing
DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.
## Model Details
### Model Description
DeepChopper leverages the HyenaDNA-small-32k backbone, a genomic foundation model, combined with a specialized token classification head to detect chimeric artifacts in nanopore direct RNA sequencing reads. The model processes both sequence information and base quality scores to make accurate predictions.
- **Developed by:** YLab Team ([Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2))
- **Model type:** Token Classification
- **Language(s):** DNA sequences
- **License:** Apache 2.0
- **Base Model:** HyenaDNA-small-32k-seqlen
- **Repository:** [DeepChopper GitHub](https://github.com/ylab-hi/DeepChopper)
- **Paper:** [A Genomic Language Model for Chimera Artifact Detection](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2)
### Model Architecture
- **Backbone:** HyenaDNA-small-32k (256 dimensions)
- **Classification Head:**
  - Linear Layer 1: 256 → 1024 dimensions
  - Linear Layer 2: 1024 → 1024 dimensions
  - Output Layer: 1024 → 2 classes (artifact/non-artifact)
  - Quality Score Integration: Identity layer for base quality incorporation
- **Input:**
  - Tokenized DNA sequences (vocabulary size: 12)
  - Base quality scores
- **Output:** Per-base classification (artifact vs. non-artifact)
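
The layer sizes listed above can be sketched as a small PyTorch module. This is an illustrative sketch, not the DeepChopper source: the class name `TokenClassificationHead` and the ReLU activations are assumptions (the card does not name the activation), and the quality-score integration is left out because the card does not say how the scores are combined with the backbone output.

```python
import torch
from torch import nn


class TokenClassificationHead(nn.Module):
    """Hypothetical head mirroring the sizes in the card: 256 -> 1024 -> 1024 -> 2."""

    def __init__(self, hidden_dim=256, mlp_dim=1024, num_classes=2):
        super().__init__()
        self.linear1 = nn.Linear(hidden_dim, mlp_dim)
        self.linear2 = nn.Linear(mlp_dim, mlp_dim)
        self.output = nn.Linear(mlp_dim, num_classes)
        # Assumption: the card does not state which activation is used.
        self.activation = nn.ReLU()

    def forward(self, hidden_states):
        # hidden_states: (batch, sequence_length, hidden_dim) from the backbone
        x = self.activation(self.linear1(hidden_states))
        x = self.activation(self.linear2(x))
        # One pair of logits (non-artifact, artifact) per base
        return self.output(x)


head = TokenClassificationHead()
hidden = torch.randn(1, 100, 256)  # stand-in for backbone output on a 100-base read
logits = head(hidden)
print(logits.shape)  # torch.Size([1, 100, 2])
```

Taking the argmax over the last dimension of `logits` would give one label per base, which matches the per-base output described above.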
## Uses
### Direct Use
DeepChopper is designed for:
- Detecting chimeric artifacts in Nanopore direct RNA sequencing data
- Identifying adapter sequences within base-called reads
- Preprocessing RNA-seq data before downstream transcriptomics analysis
- Improving accuracy of transcript annotation and gene fusion detection
### Downstream Use
The cleaned data can be used for:
- Transcript isoform analysis
- Gene expression quantification
- Novel transcript discovery
- Gene fusion detection
- Alternative splicing analysis
### Out-of-Scope Use
This model is NOT designed for:
- DNA sequencing data (it's specifically trained on RNA sequences)
- PacBio or Illumina sequencing platforms
- Genome assembly or variant calling
## Training Details
### Training Data
The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.
### Training Procedure
- **Optimizer:** Adam (lr=0.0002, weight_decay=0)
- **Learning Rate Scheduler:** ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
- **Loss Function:** Continuous Interval Loss (CrossEntropyLoss with no penalty)
- **Framework:** PyTorch Lightning
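
In plain PyTorch, the optimizer and scheduler settings above correspond roughly to the following configuration. This is a sketch under assumptions: the `nn.Linear` model is only a stand-in for the real network, and the metric passed to the scheduler (a validation loss here) is not specified in the card.

```python
import torch
from torch import nn

# Stand-in model; the real network is the HyenaDNA backbone plus classification head.
model = nn.Linear(256, 2)

# Adam with the learning rate and weight decay listed in the card
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=0)

# Multiply the learning rate by 0.1 once the monitored value has stopped
# decreasing for more than 10 consecutive epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10
)

# After each validation epoch, the scheduler is stepped with the monitored value.
# Assumption: a validation loss is the monitored quantity.
validation_loss = 0.42  # placeholder value for illustration
scheduler.step(validation_loss)
```

In a PyTorch Lightning module, the same two objects would typically be returned from `configure_optimizers` together with the name of the monitored metric.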
### Training Hyperparameters
- Learning Rate: 0.0002
- Batch Size: Configured per experiment
- Weight Decay: 0
- Backbone: Fine-tuned (not frozen)
## Evaluation
### Testing Data & Metrics
The model is evaluated on held-out test sets using:
- F1 Score (primary metric)
- Precision
- Recall
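
For reference, the three metrics can be computed from per-base labels as in the sketch below. It assumes the metrics are taken per base with label `1` meaning "artifact"; the card does not state the evaluation granularity, and the function name `per_base_metrics` is hypothetical.

```python
def per_base_metrics(true_labels, pred_labels, positive=1):
    """Return (precision, recall, f1) for the positive (artifact) class."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy example: 3 true positives, 1 false positive, 1 false negative
truth = [0, 0, 0, 1, 1, 1, 1, 0]
pred = [0, 0, 1, 1, 1, 1, 0, 0]
precision, recall, f1 = per_base_metrics(truth, pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```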
### Results
DeepChopper significantly improves downstream analysis quality by accurately removing chimeric artifacts that would otherwise confound transcriptome analyses.
## How to Use
### Installation
```bash
pip install deepchopper
```
### Python API
```python
import deepchopper
# Load the pretrained model
model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")
# The model is ready for inference
# Use with deepchopper's predict pipeline
```
### Command Line Interface
```bash
# Step 1: Encode your FASTQ data
deepchopper encode input.fq
# Step 2: Predict chimeric artifacts
deepchopper predict input.parquet --output predictions
# Step 3: Remove artifacts and generate clean FASTQ
deepchopper chop predictions input.fq
```
For GPU acceleration:
```bash
deepchopper predict input.parquet --output predictions --gpus 1
```
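
Conceptually, the `chop` step turns per-base predictions into trimmed reads. The sketch below illustrates that idea only and is not DeepChopper's implementation: the function names `artifact_intervals` and `chop_read` are hypothetical, and it assumes label `1` marks an artifact base.

```python
from itertools import groupby


def artifact_intervals(labels, artifact=1):
    """Return half-open (start, end) intervals of consecutive artifact labels."""
    intervals = []
    position = 0
    for value, run in groupby(labels):
        length = sum(1 for _ in run)
        if value == artifact:
            intervals.append((position, position + length))
        position += length
    return intervals


def chop_read(sequence, quality, labels, artifact=1):
    """Split a read into the non-artifact segments around predicted artifacts."""
    if not (len(sequence) == len(quality) == len(labels)):
        raise ValueError("sequence, quality, and labels must have equal length")
    segments = []
    start = 0
    for a_start, a_end in artifact_intervals(labels, artifact):
        if a_start > start:
            segments.append((sequence[start:a_start], quality[start:a_start]))
        start = a_end
    if start < len(sequence):
        segments.append((sequence[start:], quality[start:]))
    return segments


seq = "ACGTACGGGGTTCA"
qual = "I" * len(seq)
labels = [0] * 5 + [1] * 4 + [0] * 5  # bases 5..8 predicted as artifact

print(artifact_intervals(labels))  # [(5, 9)]
print(chop_read(seq, qual, labels))  # [('ACGTA', 'IIIII'), ('GTTCA', 'IIIII')]
```

A real tool would likely also smooth isolated predictions and drop very short segments; those choices are outside this sketch.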
### Web Interface
Try DeepChopper online without installation:
- [Hugging Face Space](https://huggingface.co/spaces/yangliz5/deepchopper)
- Or run locally: `deepchopper web`
## Limitations
- **Platform-specific:** Optimized for Nanopore direct RNA sequencing
- **Read length:** Best performance on reads up to 32k bases (model context window)
- **Species:** Trained primarily on human RNA sequences
- **Computational requirements:** GPU recommended for large datasets
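
Because of the read-length limit, it can help to check how many reads exceed the context window before running prediction. The sketch below is one way to do that in plain Python. It assumes four-line FASTQ records and takes the window to be 32,768 bases (the card only says "32k"); the names `read_fastq` and `split_by_length` are hypothetical.

```python
import io

MAX_BASES = 32_768  # assumption: "32k" interpreted as 32,768 bases


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def split_by_length(records, max_bases=MAX_BASES):
    """Return (names within the window, names longer than the window)."""
    within, too_long = [], []
    for name, sequence, _quality in records:
        (within if len(sequence) <= max_bases else too_long).append(name)
    return within, too_long


# In-memory demo with one short read and one 40,000-base read
demo = io.StringIO(
    "@short\nACGT\n+\nIIII\n"
    "@long\n" + "A" * 40_000 + "\n+\n" + "I" * 40_000 + "\n"
)
within, too_long = split_by_length(read_fastq(demo))
print(within, too_long)  # ['short'] ['long']
```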
## Citation
If you use DeepChopper in your research, please cite:
```bibtex
@article{Li2024.10.23.619929,
author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
year = {2024},
doi = {10.1101/2024.10.23.619929},
journal = {bioRxiv}
}
```
## Contact & Support
- **Issues:** [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues)
- **Documentation:** [Full Tutorial](https://github.com/ylab-hi/DeepChopper/blob/main/documentation/tutorial.md)
- **Repository:** [GitHub](https://github.com/ylab-hi/DeepChopper)
## Model Card Authors
YLab Team
## Model Card Contact
For questions about this model, please open an issue on the [GitHub repository](https://github.com/ylab-hi/DeepChopper/issues).