---
tags:
- genomics
- bioinformatics
- nanopore
- rna-sequencing
- chimera-detection
- token-classification
- hyenadna
- pytorch
- lightning
license: apache-2.0
datasets:
- nanopore-drna-seq
language:
- dna
library_name: deepchopper
pipeline_tag: token-classification
---

# DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing

DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.

## Model Details

### Model Description

DeepChopper combines the HyenaDNA-small-32k backbone, a genomic foundation model, with a token classification head that labels each base of a Nanopore direct RNA sequencing read as artifact or non-artifact. The model takes both the read sequence and its base quality scores as input.

- **Developed by:** YLab Team ([Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2))
- **Model type:** Token Classification
- **Language(s):** DNA sequences
- **License:** Apache 2.0
- **Base Model:** HyenaDNA-small-32k-seqlen
- **Repository:** [DeepChopper GitHub](https://github.com/ylab-hi/DeepChopper)
- **Paper:** [A Genomic Language Model for Chimera Artifact Detection](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2)

### Model Architecture

- **Backbone:** HyenaDNA-small-32k (256 dimensions)
- **Classification Head:**
  - Linear Layer 1: 256 → 1024 dimensions
  - Linear Layer 2: 1024 → 1024 dimensions
  - Output Layer: 1024 → 2 classes (artifact/non-artifact)
  - Quality Score Integration: Identity layer for base quality incorporation
- **Input:**
  - Tokenized DNA sequences (vocabulary size: 12)
  - Base quality scores
- **Output:** Per-base classification (artifact vs. non-artifact)
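
As a rough size check on the classification head listed above, the following sketch counts the trainable parameters of its three linear layers. It assumes each layer carries a bias term (the PyTorch default for `nn.Linear`) and that the identity layer used for quality scores adds no parameters. The helper names are illustrative and are not part of the `deepchopper` package.

```python
# Layer sizes taken from the architecture list above: (in_features, out_features).
HEAD_LAYERS = [(256, 1024), (1024, 1024), (1024, 2)]


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of trainable parameters in one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def head_params(layers=HEAD_LAYERS, bias: bool = True) -> int:
    """Total parameters across the head's linear layers (assumes bias by default)."""
    return sum(linear_params(i, o, bias) for i, o in layers)


if __name__ == "__main__":
    for i, o in HEAD_LAYERS:
        print(f"{i:>5} -> {o:<5} {linear_params(i, o):>9,} parameters")
    print(f"total head parameters: {head_params():,}")
```

Under these assumptions the head has about 1.3 million parameters, most of them in the 1024 → 1024 layer.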

## Uses

### Direct Use

DeepChopper is designed for:
- Detecting chimeric artifacts in Nanopore direct RNA sequencing data
- Identifying adapter sequences within base-called reads
- Preprocessing RNA-seq data before downstream transcriptomics analysis
- Improving accuracy of transcript annotation and gene fusion detection
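
To illustrate how per-base predictions can be turned into cleaned reads, here is a minimal sketch that cuts a read at every run of bases labelled as artifact and keeps the remaining segments. It assumes label `1` means artifact and `0` means non-artifact, and the `min_len` filter is a hypothetical parameter. This is not DeepChopper's own post-processing, which may smooth predictions or apply different filters.

```python
from typing import List, Tuple


def keep_intervals(labels: List[int], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) runs of non-artifact (0) labels."""
    intervals = []
    start = None
    for pos, label in enumerate(labels):
        if label == 0 and start is None:
            start = pos  # a non-artifact run begins here
        elif label != 0 and start is not None:
            if pos - start >= min_len:
                intervals.append((start, pos))
            start = None  # the run ended at an artifact base
    if start is not None and len(labels) - start >= min_len:
        intervals.append((start, len(labels)))
    return intervals


def chop_read(seq: str, qual: str, labels: List[int], min_len: int = 1):
    """Split one read into (sequence, quality) segments with artifacts removed."""
    if not (len(seq) == len(qual) == len(labels)):
        raise ValueError("sequence, quality string, and labels must have equal length")
    return [(seq[s:e], qual[s:e]) for s, e in keep_intervals(labels, min_len)]


if __name__ == "__main__":
    # Toy chimeric read: two transcript fragments joined by a 4-base artifact.
    seq = "ACGTACGT" + "AAAA" + "TTGGCC"
    qual = "I" * len(seq)
    labels = [0] * 8 + [1] * 4 + [0] * 6
    for segment, segment_qual in chop_read(seq, qual, labels):
        print(segment, segment_qual)
```

Each returned segment keeps its matching slice of the quality string, so it can be written back out as its own FASTQ record.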

### Downstream Use

The cleaned data can be used for:
- Transcript isoform analysis
- Gene expression quantification
- Novel transcript discovery
- Gene fusion detection
- Alternative splicing analysis

### Out-of-Scope Use

This model is NOT designed for:
- DNA sequencing data (the model is trained specifically on direct RNA sequencing reads)
- PacBio or Illumina sequencing platforms
- Genome assembly or variant calling

## Training Details

### Training Data

The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.

### Training Procedure

- **Optimizer:** Adam (lr=0.0002, weight_decay=0)
- **Learning Rate Scheduler:** ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
- **Loss Function:** Continuous Interval Loss (CrossEntropyLoss with no penalty)
- **Framework:** PyTorch Lightning
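
The scheduler settings above mean that the learning rate is multiplied by 0.1 once the monitored quantity has failed to improve for more than 10 consecutive epochs. The sketch below is a simplified pure-Python re-implementation for illustration only. It assumes the monitored quantity is a validation loss and ignores the `threshold`, `cooldown`, and `min_lr` options of `torch.optim.lr_scheduler.ReduceLROnPlateau`. The class name is hypothetical.

```python
class PlateauSchedule:
    """Simplified plateau-based decay in 'min' mode (illustrative only)."""

    def __init__(self, lr: float = 2e-4, factor: float = 0.1, patience: int = 10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Decay only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


if __name__ == "__main__":
    schedule = PlateauSchedule()
    schedule.step(0.50)  # first epoch sets the best loss
    for epoch in range(1, 12):
        lr = schedule.step(0.50)  # loss stays flat, so no improvement
        print(f"stalled epoch {epoch:2d}: lr = {lr:.1e}")
```

With a flat loss, the learning rate stays at 0.0002 for ten stalled epochs and drops on the eleventh.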

### Training Hyperparameters

- Learning Rate: 0.0002
- Batch Size: Configured per experiment
- Weight Decay: 0
- Backbone: Fine-tuned (not frozen)

## Evaluation

### Testing Data & Metrics

The model is evaluated on held-out test sets using:
- F1 Score (primary metric)
- Precision
- Recall
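
For reference, the three metrics above can be computed from per-base labels as in the sketch below. It assumes the artifact class is the positive class, encoded as `1`, and that scoring is done per base. The card does not state the exact evaluation protocol, so treat this as an illustration of the definitions only.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one positive class over paired labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = [1, 1, 1, 1, 0, 0]
    predicted = [1, 1, 0, 0, 0, 0]
    p, r, f1 = precision_recall_f1(truth, predicted)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```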

### Results

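By removing chimeric artifacts before downstream analysis, DeepChopper is intended to reduce the errors these artifacts introduce into transcript annotation and gene fusion detection. Quantitative results are reported in the [paper](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2).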

## How to Use

### Installation

```bash
pip install deepchopper
```

### Python API

```python
import deepchopper

# Load the pretrained model
model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")

# The model is ready for inference
# Use with deepchopper's predict pipeline
```

### Command Line Interface

```bash
# Step 1: Encode your FASTQ data
deepchopper encode input.fq

# Step 2: Predict chimeric artifacts
deepchopper predict input.parquet --output predictions

# Step 3: Remove artifacts and generate clean FASTQ
deepchopper chop predictions input.fq
```

To run prediction on a GPU, add the `--gpus` flag:

```bash
deepchopper predict input.parquet --output predictions --gpus 1
```
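
The three CLI steps can also be scripted from Python. The sketch below only assembles and runs the commands shown above. It assumes that `deepchopper encode` writes a `.parquet` file next to the input with the same stem, as the example commands imply, and the function names are hypothetical helpers rather than part of the `deepchopper` API.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List


def pipeline_commands(fastq: str, output: str = "predictions", gpus: int = 0) -> List[List[str]]:
    """Build the encode, predict, and chop commands for one FASTQ file."""
    # Assumption: encode produces <stem>.parquet alongside the input file.
    parquet = str(Path(fastq).with_suffix(".parquet"))
    predict = ["deepchopper", "predict", parquet, "--output", output]
    if gpus > 0:
        predict += ["--gpus", str(gpus)]
    return [
        ["deepchopper", "encode", fastq],
        predict,
        ["deepchopper", "chop", output, fastq],
    ]


def run_pipeline(fastq: str, output: str = "predictions", gpus: int = 0) -> None:
    """Run each step in order, stopping at the first non-zero exit code."""
    for command in pipeline_commands(fastq, output, gpus):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for command in pipeline_commands("input.fq", gpus=1):
        print(shlex.join(command))
```

Calling `run_pipeline("input.fq")` requires the `deepchopper` executable to be on `PATH`.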

### Web Interface

Try DeepChopper online without installation:
- [Hugging Face Space](https://huggingface.co/spaces/yangliz5/deepchopper)
- Or run locally: `deepchopper web`

## Limitations

- **Platform-specific:** Optimized for Nanopore direct RNA sequencing
- **Read length:** Best performance on reads up to 32k bases (model context window)
- **Species:** Trained primarily on human RNA sequences
- **Computational requirements:** GPU recommended for large datasets
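
Because of the read-length limitation, it can be useful to check how many reads exceed the context window before running prediction. The sketch below scans a FASTQ stream and lists the reads longer than a limit. It assumes four-line FASTQ records and reads "32k" as 32,768 bases. Both are assumptions, and the card does not say how DeepChopper itself handles longer reads.

```python
import io
from typing import Iterator, List, TextIO, Tuple

MAX_CONTEXT = 32_768  # assumption: "32k" interpreted as 32 * 1024 bases


def read_lengths(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (read_id, length) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield header[1:].split()[0], len(seq)


def over_context(handle: TextIO, limit: int = MAX_CONTEXT) -> List[str]:
    """Return the IDs of reads longer than `limit` bases."""
    return [read_id for read_id, length in read_lengths(handle) if length > limit]


if __name__ == "__main__":
    demo = "@r1 extra\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n"
    # A tiny limit is used here so the toy data triggers the check.
    print(over_context(io.StringIO(demo), limit=5))
```

For a real file, pass an open file handle, for example `over_context(open("input.fq"))`.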

## Citation

If you use DeepChopper in your research, please cite:

```bibtex
@article{Li2024.10.23.619929,
    author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
    title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
    year = {2024},
    doi = {10.1101/2024.10.23.619929},
    journal = {bioRxiv}
}
```

## Contact & Support

- **Issues:** [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues)
- **Documentation:** [Full Tutorial](https://github.com/ylab-hi/DeepChopper/blob/main/documentation/tutorial.md)
- **Repository:** [GitHub](https://github.com/ylab-hi/DeepChopper)

## Model Card Authors

YLab Team

## Model Card Contact

For questions about this model, please open an issue on the [GitHub repository](https://github.com/ylab-hi/DeepChopper/issues).