---
tags:
- genomics
- bioinformatics
- nanopore
- rna-sequencing
- chimera-detection
- token-classification
- hyenadna
- pytorch
- lightning
license: apache-2.0
datasets:
- nanopore-drna-seq
language:
- dna
library_name: deepchopper
pipeline_tag: token-classification
---

# DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing

DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.

## Model Details

### Model Description

DeepChopper combines the HyenaDNA-small-32k backbone, a genomic foundation model, with a token classification head that labels each base of a Nanopore direct RNA sequencing read as artifact or non-artifact. The model takes both the base sequence and the per-base quality scores as input.

- **Developed by:** YLab Team ([Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2))
- **Model type:** Token Classification
- **Language(s):** DNA sequences
- **License:** Apache 2.0
- **Base Model:** HyenaDNA-small-32k-seqlen
- **Repository:** [DeepChopper GitHub](https://github.com/ylab-hi/DeepChopper)
- **Paper:** [A Genomic Language Model for Chimera Artifact Detection](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2)
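
Base quality scores reach the model from FASTQ, where they are stored as one ASCII character per base. The snippet below is a minimal sketch of decoding them, assuming the standard Phred+33 offset. The `parse_fastq_record` helper is hypothetical and not part of the `deepchopper` API, which performs its own encoding through `deepchopper encode`.

```python
def parse_fastq_record(record: str) -> tuple[str, str, list[int]]:
    """Split one 4-line FASTQ record into (read_id, sequence, phred_scores)."""
    header, sequence, _plus, quality = record.strip().split("\n")
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Phred+33 (assumed): score = ASCII code of the character minus 33.
    scores = [ord(ch) - 33 for ch in quality]
    read_id = header[1:].split()[0]
    return read_id, sequence, scores


record = "@read1 example\nACGT\n+\nI5+!\n"
print(parse_fastq_record(record))  # ('read1', 'ACGT', [40, 20, 10, 0])
```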

### Model Architecture

- **Backbone:** HyenaDNA-small-32k (hidden size: 256)
- **Classification Head:**
  - Linear Layer 1: 256 → 1024 dimensions
  - Linear Layer 2: 1024 → 1024 dimensions
  - Output Layer: 1024 → 2 classes (artifact/non-artifact)
  - Quality Score Integration: Identity layer for base quality incorporation
- **Input:**
  - Tokenized DNA sequences (vocabulary size: 12)
  - Base quality scores
- **Output:** Per-base classification (artifact vs. non-artifact)
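
The layer sizes listed above can be sketched in PyTorch as follows. This is an illustration rather than the DeepChopper source: the class name `TokenClassificationHead` is hypothetical, the ReLU activation is an assumption (the card does not name one), and the quality-score integration is left out because its wiring is not described here. Random tensors stand in for the backbone's hidden states.

```python
import torch
from torch import nn


class TokenClassificationHead(nn.Module):
    """Hypothetical 256 -> 1024 -> 1024 -> 2 per-base classification head."""

    def __init__(self, hidden_size: int = 256, inner_size: int = 1024, num_classes: int = 2):
        super().__init__()
        self.linear1 = nn.Linear(hidden_size, inner_size)
        self.linear2 = nn.Linear(inner_size, inner_size)
        self.output = nn.Linear(inner_size, num_classes)
        # Assumption: ReLU between layers; the card does not specify the activation.
        self.activation = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size), as produced by the backbone.
        x = self.activation(self.linear1(hidden_states))
        x = self.activation(self.linear2(x))
        return self.output(x)  # (batch, seq_len, num_classes)


head = TokenClassificationHead()
backbone_output = torch.randn(2, 100, 256)  # stand-in for HyenaDNA hidden states
logits = head(backbone_output)
labels = logits.argmax(dim=-1)  # one class index per base
print(logits.shape, labels.shape)
```

Because the head is applied position-wise, the output keeps the sequence length of the input, which is what makes per-base labels possible.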

## Uses

### Direct Use

DeepChopper is designed for:
- Detecting chimeric artifacts in Nanopore direct RNA sequencing data
- Identifying adapter sequences within base-called reads
- Preprocessing RNA-seq data before downstream transcriptomics analysis
- Improving accuracy of transcript annotation and gene fusion detection

### Downstream Use

The cleaned data can be used for:
- Transcript isoform analysis
- Gene expression quantification
- Novel transcript discovery
- Gene fusion detection
- Alternative splicing analysis

### Out-of-Scope Use

This model is NOT designed for:
- DNA sequencing data (the model is trained specifically on direct RNA sequencing reads)
- PacBio or Illumina sequencing platforms
- Genome assembly or variant calling

## Training Details

### Training Data

The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.

### Training Procedure

- **Optimizer:** Adam (lr=0.0002, weight_decay=0)
- **Learning Rate Scheduler:** ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
- **Loss Function:** Continuous Interval Loss (CrossEntropyLoss with no penalty)
- **Framework:** PyTorch Lightning
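
The settings above map onto standard PyTorch components. The sketch below shows one training step under those settings; the single `nn.Linear` layer is a placeholder for the real backbone and head, the data are random, and plain `nn.CrossEntropyLoss` is used as a stand-in for the loss named above. In a PyTorch Lightning module the optimizer and scheduler would normally be returned from `configure_optimizers`.

```python
import torch
from torch import nn

model = nn.Linear(256, 2)  # placeholder for backbone + classification head

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=0)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10
)
criterion = nn.CrossEntropyLoss()

# Random stand-in batch: 4 reads, 50 bases each, 256 features per base.
features = torch.randn(4, 50, 256)
targets = torch.randint(0, 2, (4, 50))

logits = model(features)  # (4, 50, 2)
# CrossEntropyLoss expects (N, C) logits and (N,) targets,
# so the batch and sequence axes are flattened together.
loss = criterion(logits.reshape(-1, 2), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()

# ReduceLROnPlateau needs the monitored metric; validation loss is typical.
scheduler.step(loss.item())
```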

### Training Hyperparameters

- Learning Rate: 0.0002
- Batch Size: Configured per experiment
- Weight Decay: 0
- Backbone: Fine-tuned (not frozen)

## Evaluation

### Testing Data & Metrics

The model is evaluated on held-out test sets using:
- F1 Score (primary metric)
- Precision
- Recall
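
For reference, the three metrics can be computed from per-base labels as in the sketch below. It assumes that label `1` marks an artifact base and that scoring is done per base; the card does not state either, and the `precision_recall_f1` helper is illustrative rather than part of `deepchopper`.

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (label 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```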

### Results

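Removing chimeric artifacts before analysis improves the accuracy of downstream transcript annotation and gene fusion detection, which these artifacts would otherwise confound. Quantitative results are reported in the [paper](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2).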

## How to Use

### Installation

```bash
pip install deepchopper
```

### Python API

```python
import deepchopper

# Load the pretrained model
model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")

# The model is ready for inference
# Use with deepchopper's predict pipeline
```

### Command Line Interface

```bash
# Step 1: Encode your FASTQ data
deepchopper encode input.fq

# Step 2: Predict chimeric artifacts
deepchopper predict input.parquet --output predictions

# Step 3: Remove artifacts and generate clean FASTQ
deepchopper chop predictions input.fq
```
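
The three steps above can also be chained from Python. The sketch below only builds and prints the same commands; `build_commands` and `run_pipeline` are hypothetical helpers, and the assumption that `encode` writes a `.parquet` file next to the input FASTQ is taken from the file names in the example above.

```python
import subprocess
from pathlib import Path


def build_commands(fastq: str, output_dir: str = "predictions") -> list[list[str]]:
    """Return the encode, predict and chop invocations as argument lists."""
    # Assumption: encode produces <input stem>.parquet alongside the FASTQ.
    parquet = str(Path(fastq).with_suffix(".parquet"))
    return [
        ["deepchopper", "encode", fastq],
        ["deepchopper", "predict", parquet, "--output", output_dir],
        ["deepchopper", "chop", output_dir, fastq],
    ]


def run_pipeline(fastq: str, output_dir: str = "predictions") -> None:
    """Run the three steps in order, stopping at the first failure."""
    for command in build_commands(fastq, output_dir):
        subprocess.run(command, check=True)


# Print the commands instead of running them, so this works without deepchopper installed.
for command in build_commands("input.fq"):
    print(" ".join(command))
```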

For GPU acceleration:
```bash
deepchopper predict input.parquet --output predictions --gpus 1
```
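
Conceptually, the chop step turns per-base predictions into trimmed reads. The sketch below illustrates that idea on a toy read, assuming label `1` marks artifact bases. It is not DeepChopper's implementation, which may also smooth predictions, enforce minimum lengths, and carry quality strings along; `artifact_intervals` and `chop` are hypothetical names.

```python
def artifact_intervals(labels: list[int]) -> list[tuple[int, int]]:
    """Return half-open [start, end) runs of consecutive artifact labels (1)."""
    intervals = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new artifact run begins
        elif label != 1 and start is not None:
            intervals.append((start, i))  # the run ended at the previous base
            start = None
    if start is not None:
        intervals.append((start, len(labels)))  # run extends to the end of the read
    return intervals


def chop(sequence: str, labels: list[int]) -> list[str]:
    """Drop artifact runs and return the remaining segments of the read."""
    segments = []
    cursor = 0
    for start, end in artifact_intervals(labels):
        if start > cursor:
            segments.append(sequence[cursor:start])
        cursor = end
    if cursor < len(sequence):
        segments.append(sequence[cursor:])
    return segments


sequence = "ACGTACGTTTTTGGCCAA"
labels = [0] * 8 + [1] * 4 + [0] * 6
print(artifact_intervals(labels))  # [(8, 12)]
print(chop(sequence, labels))      # ['ACGTACGT', 'GGCCAA']
```

An internal artifact splits the read into two segments, while an artifact at either end simply trims it.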

### Web Interface

Try DeepChopper online without installation:
- [Hugging Face Space](https://huggingface.co/spaces/yangliz5/deepchopper)
- Or run locally: `deepchopper web`

## Limitations

- **Platform-specific:** Optimized for Nanopore direct RNA sequencing
- **Read length:** Best performance on reads up to 32k bases (model context window)
- **Species:** Trained primarily on human RNA sequences
- **Computational requirements:** GPU recommended for large datasets
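
Given the read-length limitation, it can help to check how many reads exceed the context window before running prediction. The sketch below assumes a limit of 32,768 bases, the nominal size of a "32k" context; confirm the exact value for the checkpoint you use. `partition_by_length` is a hypothetical helper, not part of `deepchopper`.

```python
MAX_CONTEXT = 32_768  # assumption: nominal context of a "32k" backbone


def partition_by_length(
    reads: dict[str, str], max_len: int = MAX_CONTEXT
) -> tuple[dict[str, str], dict[str, str]]:
    """Split reads into those within the context window and those exceeding it."""
    within, too_long = {}, {}
    for read_id, sequence in reads.items():
        target = within if len(sequence) <= max_len else too_long
        target[read_id] = sequence
    return within, too_long


reads = {"short_read": "ACGT" * 500, "long_read": "ACGT" * 10_000}
within, too_long = partition_by_length(reads)
print(sorted(within), sorted(too_long))  # ['short_read'] ['long_read']
```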

## Citation

If you use DeepChopper in your research, please cite:

```bibtex
@article{Li2024.10.23.619929,
    author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
    title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
    year = {2024},
    doi = {10.1101/2024.10.23.619929},
    journal = {bioRxiv}
}
```

## Contact & Support

- **Issues:** [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues)
- **Documentation:** [Full Tutorial](https://github.com/ylab-hi/DeepChopper/blob/main/documentation/tutorial.md)
- **Repository:** [GitHub](https://github.com/ylab-hi/DeepChopper)

## Model Card Authors

YLab Team

## Model Card Contact

For questions about this model, please open an issue on the [GitHub repository](https://github.com/ylab-hi/DeepChopper/issues).