yangliz5 committed on
Commit 002c34e · verified · 1 Parent(s): 0648c7e

Update README.md

Files changed (1)
  1. README.md +183 -11
README.md CHANGED
@@ -1,17 +1,189 @@
1
  ---
2
  tags:
3
- - model_hub_mixin
4
- - pytorch_model_hub_mixin
5
- license: apache-2.0
6
  language:
7
- - en
8
- metrics:
9
- - accuracy
10
- - recall
11
- - f1
12
  pipeline_tag: token-classification
13
  ---
14
 
15
- This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
16
- - Library: [More Information Needed]
17
- - Docs: [More Information Needed]
1
  ---
2
  tags:
3
+ - genomics
4
+ - bioinformatics
5
+ - nanopore
6
+ - rna-sequencing
7
+ - chimera-detection
8
+ - token-classification
9
+ - hyenadna
10
+ - pytorch
11
+ - lightning
12
+ license: mit
13
+ datasets:
14
+ - nanopore-drna-seq
15
  language:
16
+ - dna
17
+ library_name: deepchopper
18
  pipeline_tag: token-classification
19
  ---
20
 
21
+ # DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing
22
+
23
+ DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.
24
+
25
+ ## Model Details
26
+
27
+ ### Model Description
28
+
29
+ DeepChopper leverages the HyenaDNA-small-32k backbone, a genomic foundation model, combined with a specialized token classification head to detect chimeric artifacts in nanopore direct RNA sequencing reads. The model processes both sequence information and base quality scores to make accurate predictions.
30
+
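The description above says the model reads base quality scores alongside the sequence. As a minimal sketch of where those scores come from, the snippet below decodes the quality line of a FASTQ record using the standard Phred+33 encoding. The function names and the example record are hypothetical, and the card does not state how DeepChopper scales or normalizes the scores before they reach the model.

```python
def phred_scores(quality_line: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_line]


def error_probabilities(scores: list[int]) -> list[float]:
    """Convert Phred scores Q into base-call error probabilities 10 ** (-Q / 10)."""
    return [10 ** (-q / 10) for q in scores]


# A made-up four-line FASTQ record, for illustration only.
record = ["@read_1", "ACGT", "+", "!+5I"]
sequence, quality = record[1], record[3]

scores = phred_scores(quality)
assert len(scores) == len(sequence)  # one quality score per base
print(scores)  # [0, 10, 20, 40]
```

Each base therefore carries two aligned inputs: its token and its quality score.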
31
+ - **Developed by:** YLab Team ([Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2))
32
+ - **Model type:** Token Classification
33
+ - **Language(s):** DNA sequences
34
+ - **License:** MIT
35
+ - **Base Model:** HyenaDNA-small-32k-seqlen
36
+ - **Repository:** [DeepChopper GitHub](https://github.com/ylab-hi/DeepChopper)
37
+ - **Paper:** [A Genomic Language Model for Chimera Artifact Detection](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2)
38
+
39
+ ### Model Architecture
40
+
41
+ - **Backbone:** HyenaDNA-small-32k (256 dimensions)
42
+ - **Classification Head:**
43
+   - Linear Layer 1: 256 → 1024 dimensions
44
+   - Linear Layer 2: 1024 → 1024 dimensions
45
+   - Output Layer: 1024 → 2 classes (artifact/non-artifact)
46
+   - Quality Score Integration: Identity layer for base quality incorporation
47
+ - **Input:**
48
+   - Tokenized DNA sequences (vocabulary size: 12)
49
+   - Base quality scores
50
+ - **Output:** Per-base classification (artifact vs. non-artifact)
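To make the per-base output concrete, here is a rough sketch of how a vector of 0/1 labels (1 = artifact) could be turned into intervals and used to cut a read. This is an illustration only, not DeepChopper's own post-processing: `artifact_intervals` and `split_read` are hypothetical helpers, and any smoothing or minimum-length filtering the real `chop` step applies is not described in this card.

```python
def artifact_intervals(labels: list[int]) -> list[tuple[int, int]]:
    """Return half-open (start, end) intervals covering each run of 1s."""
    intervals = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a run of artifact bases begins here
        elif label != 1 and start is not None:
            intervals.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        intervals.append((start, len(labels)))  # run extends to the read end
    return intervals


def split_read(sequence: str, intervals: list[tuple[int, int]]) -> list[str]:
    """Cut the artifact intervals out of a read and return the remaining pieces."""
    pieces = []
    cursor = 0
    for start, end in intervals:
        if start > cursor:
            pieces.append(sequence[cursor:start])
        cursor = end
    if cursor < len(sequence):
        pieces.append(sequence[cursor:])
    return pieces


# Toy read: two segments joined by a 6-base stretch labeled as artifact.
read = "ACGTACGT" + "A" * 6 + "GGCCAA"
labels = [0] * 8 + [1] * 6 + [0] * 6

intervals = artifact_intervals(labels)
print(intervals)                    # [(8, 14)]
print(split_read(read, intervals))  # ['ACGTACGT', 'GGCCAA']
```

An internal artifact splits a chimeric read into separate pieces, while an artifact at either end simply trims the read.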
51
+
52
+ ## Uses
53
+
54
+ ### Direct Use
55
+
56
+ DeepChopper is designed for:
57
+ - Detecting chimeric artifacts in Nanopore direct RNA sequencing data
58
+ - Identifying adapter sequences within base-called reads
59
+ - Preprocessing RNA-seq data before downstream transcriptomics analysis
60
+ - Improving accuracy of transcript annotation and gene fusion detection
61
+
62
+ ### Downstream Use
63
+
64
+ The cleaned data can be used for:
65
+ - Transcript isoform analysis
66
+ - Gene expression quantification
67
+ - Novel transcript discovery
68
+ - Gene fusion detection
69
+ - Alternative splicing analysis
70
+
71
+ ### Out-of-Scope Use
72
+
73
+ This model is NOT designed for:
74
+ - DNA sequencing data (it's specifically trained on RNA sequences)
75
+ - PacBio or Illumina sequencing platforms
76
+ - Genome assembly or variant calling
77
+
78
+ ## Training Details
79
+
80
+ ### Training Data
81
+
82
+ The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.
83
+
84
+ ### Training Procedure
85
+
86
+ - **Optimizer:** Adam (lr=0.0002, weight_decay=0)
87
+ - **Learning Rate Scheduler:** ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
88
+ - **Loss Function:** Continuous Interval Loss (CrossEntropyLoss with no penalty)
89
+ - **Framework:** PyTorch Lightning
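The scheduler settings above (mode=min, factor=0.1, patience=10) mean the learning rate is multiplied by 0.1 once the monitored loss has failed to improve for more than 10 consecutive epochs. The pure-Python sketch below mimics that rule so the behavior can be checked without a GPU or a training run. `PlateauSchedule` is a hypothetical, simplified stand-in: PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` additionally supports an improvement threshold, cooldown, and a minimum learning rate, and which metric DeepChopper monitors is an assumption here.

```python
class PlateauSchedule:
    """Simplified reduce-on-plateau rule (mode="min"), without threshold or cooldown."""

    def __init__(self, lr: float, factor: float = 0.1, patience: int = 10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor  # plateau detected: shrink the learning rate
            self.bad_epochs = 0
        return self.lr


schedule = PlateauSchedule(lr=0.0002)
# Two improving epochs, then 11 epochs with no improvement (made-up losses).
losses = [0.5, 0.4] + [0.4] * 11
lrs = [schedule.step(loss) for loss in losses]

print(lrs[11])  # still 0.0002 after 10 non-improving epochs
print(lrs[12])  # roughly 2e-05 after the 11th
```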
90
+
91
+ ### Training Hyperparameters
92
+
93
+ - Learning Rate: 0.0002
94
+ - Batch Size: Configured per experiment
95
+ - Weight Decay: 0
96
+ - Backbone: Fine-tuned (not frozen)
97
+
98
+ ## Evaluation
99
+
100
+ ### Testing Data & Metrics
101
+
102
+
103
+ The model is evaluated on held-out test sets using:
104
+ - F1 Score (primary metric)
105
+ - Precision
106
+ - Recall
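For reference, the three metrics can be computed from per-base labels as in the sketch below. It assumes scoring at the base level with artifact (1) as the positive class; the card does not say whether the reported numbers are per base or per read, and the label vectors are made up.

```python
def precision_recall_f1(truth: list[int], predicted: list[int]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
truth = [0, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(truth, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because artifact bases are typically a small fraction of each read, F1 on the artifact class is more informative than plain accuracy, which a model could inflate by labeling every base as non-artifact.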
107
+
108
+ ### Results
109
+
110
+ By removing chimeric artifacts that would otherwise confound transcriptome analyses, DeepChopper improves the quality of downstream analysis. See the paper linked above for quantitative results.
111
+
112
+ ## How to Use
113
+
114
+ ### Installation
115
+
116
+ ```bash
117
+ pip install deepchopper
118
+ ```
119
+
120
+ ### Python API
121
+
122
+ ```python
123
+ import deepchopper
124
+
125
+ # Load the pretrained model
126
+ model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")
127
+
128
+ # The model is ready for inference
129
+ # Use with deepchopper's predict pipeline
130
+ ```
131
+
132
+ ### Command Line Interface
133
+
134
+ ```bash
135
+ # Step 1: Encode your FASTQ data
136
+ deepchopper encode input.fq
137
+
138
+ # Step 2: Predict chimeric artifacts
139
+ deepchopper predict input.parquet --output predictions
140
+
141
+ # Step 3: Remove artifacts and generate clean FASTQ
142
+ deepchopper chop predictions input.fq
143
+ ```
144
+
145
+ For GPU acceleration:
146
+ ```bash
147
+ deepchopper predict input.parquet --output predictions --gpus 1
148
+ ```
149
+
150
+ ### Web Interface
151
+
152
+ Try DeepChopper online without installation:
153
+ - [Hugging Face Space](https://huggingface.co/spaces/yangliz5/deepchopper)
154
+ - Or run locally: `deepchopper web`
155
+
156
+ ## Limitations
157
+
158
+ - **Platform-specific:** Optimized for Nanopore direct RNA sequencing
159
+ - **Read length:** Best performance on reads up to 32k bases (model context window)
160
+ - **Species:** Trained primarily on human RNA sequences
161
+ - **Computational requirements:** GPU recommended for large datasets
162
+
163
+ ## Citation
164
+
165
+ If you use DeepChopper in your research, please cite:
166
+
167
+ ```bibtex
168
+ @article{Li2024.10.23.619929,
169
+ author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
170
+ title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
171
+ year = {2024},
172
+ doi = {10.1101/2024.10.23.619929},
173
+ journal = {bioRxiv}
174
+ }
175
+ ```
176
+
177
+ ## Contact & Support
178
+
179
+ - **Issues:** [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues)
180
+ - **Documentation:** [Full Tutorial](https://github.com/ylab-hi/DeepChopper/blob/main/documentation/tutorial.md)
181
+ - **Repository:** [GitHub](https://github.com/ylab-hi/DeepChopper)
182
+
183
+ ## Model Card Authors
184
+
185
+ YLab Team
186
+
187
+ ## Model Card Contact
188
+
189
+ For questions about this model, please open an issue on the [GitHub repository](https://github.com/ylab-hi/DeepChopper/issues).