---
tags:
  - genomics
  - bioinformatics
  - nanopore
  - rna-sequencing
  - chimera-detection
  - token-classification
  - hyenadna
  - pytorch
  - lightning
license: apache-2.0
datasets:
  - nanopore-drna-seq
language:
  - dna
library_name: deepchopper
pipeline_tag: token-classification
---

# DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing

DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.

## Model Details

### Model Description

DeepChopper leverages the HyenaDNA-small-32k backbone, a genomic foundation model, combined with a specialized token classification head to detect chimeric artifacts in nanopore direct RNA sequencing reads. The model processes both sequence information and base quality scores to make accurate predictions.

- **Developed by:** YLab Team ([Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2))
- **Model type:** Token Classification
- **Language(s):** DNA sequences
- **License:** Apache 2.0
- **Base Model:** HyenaDNA-small-32k-seqlen
- **Repository:** [DeepChopper GitHub](https://github.com/ylab-hi/DeepChopper)
- **Paper:** [A Genomic Language Model for Chimera Artifact Detection](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2)

### Model Architecture

- **Backbone:** HyenaDNA-small-32k (256 dimensions)
- **Classification Head:**
  - Linear Layer 1: 256 → 1024 dimensions
  - Linear Layer 2: 1024 → 1024 dimensions
  - Output Layer: 1024 → 2 classes (artifact/non-artifact)
  - Quality Score Integration: Identity layer for base quality incorporation
- **Input:**
  - Tokenized DNA sequences (vocabulary size: 12)
  - Base quality scores
- **Output:** Per-base classification (artifact vs.
non-artifact)

## Uses

### Direct Use

DeepChopper is designed for:

- Detecting chimeric artifacts in Nanopore direct RNA sequencing data
- Identifying adapter sequences within base-called reads
- Preprocessing RNA-seq data before downstream transcriptomics analysis
- Improving accuracy of transcript annotation and gene fusion detection

### Downstream Use

The cleaned data can be used for:

- Transcript isoform analysis
- Gene expression quantification
- Novel transcript discovery
- Gene fusion detection
- Alternative splicing analysis

### Out-of-Scope Use

This model is NOT designed for:

- DNA sequencing data (it is trained specifically on RNA sequences)
- PacBio or Illumina sequencing platforms
- Genome assembly or variant calling

## Training Details

### Training Data

The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.

### Training Procedure

- **Optimizer:** Adam (lr=0.0002, weight_decay=0)
- **Learning Rate Scheduler:** ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
- **Loss Function:** Continuous Interval Loss (CrossEntropyLoss with no penalty)
- **Framework:** PyTorch Lightning

### Training Hyperparameters

- Learning Rate: 0.0002
- Batch Size: Configured per experiment
- Weight Decay: 0
- Backbone: Fine-tuned (not frozen)

## Evaluation

### Testing Data & Metrics

The model is evaluated on held-out test sets using:

- F1 Score (primary metric)
- Precision
- Recall

### Results

DeepChopper significantly improves downstream analysis quality by accurately removing chimeric artifacts that would otherwise confound transcriptome analyses.
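
To make the metrics listed under Testing Data & Metrics concrete, the following is a minimal sketch of how per-base precision, recall, and F1 could be computed for the artifact class. It is not DeepChopper's evaluation code: the function name `per_base_prf`, the label encoding (1 = artifact, 0 = non-artifact), and the choice to score only the artifact class are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def per_base_prf(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class.

    Assumption: labels are per-base integers, with `positive` marking
    artifact (adapter) bases. This encoding is illustrative only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy read of 10 bases: the true artifact spans positions 3-6,
# while the hypothetical prediction is shifted one base to the left.
truth = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
pred = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = per_base_prf(truth, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints "precision=0.75 recall=0.75 f1=0.75"
```

In the toy example a one-base shift in the predicted artifact boundary costs one false positive and one false negative, which is why boundary accuracy matters for a per-base metric.
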
## How to Use

### Installation

```bash
pip install deepchopper
```

### Python API

```python
import deepchopper

# Load the pretrained model
model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")

# The model is ready for inference
# Use with deepchopper's predict pipeline
```

### Command Line Interface

```bash
# Step 1: Encode your FASTQ data
deepchopper encode input.fq

# Step 2: Predict chimeric artifacts
deepchopper predict input.parquet --output predictions

# Step 3: Remove artifacts and generate clean FASTQ
deepchopper chop predictions input.fq
```

For GPU acceleration:

```bash
deepchopper predict input.parquet --output predictions --gpus 1
```

### Web Interface

Try DeepChopper online without installation:

- [Hugging Face Space](https://huggingface.co/spaces/yangliz5/deepchopper)
- Or run locally: `deepchopper web`

## Limitations

- **Platform-specific:** Optimized for Nanopore direct RNA sequencing
- **Read length:** Best performance on reads up to 32k bases (model context window)
- **Species:** Trained primarily on human RNA sequences
- **Computational requirements:** GPU recommended for large datasets

## Citation

If you use DeepChopper in your research, please cite:

```bibtex
@article{Li2024.10.23.619929,
  author  = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
  title   = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
  year    = {2024},
  doi     = {10.1101/2024.10.23.619929},
  journal = {bioRxiv}
}
```

## Contact & Support

- **Issues:** [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues)
- **Documentation:** [Full Tutorial](https://github.com/ylab-hi/DeepChopper/blob/main/documentation/tutorial.md)
- **Repository:** [GitHub](https://github.com/ylab-hi/DeepChopper)

## Model Card Authors

YLab Team

## Model Card Contact

For questions about this model, please open an issue on the [GitHub repository](https://github.com/ylab-hi/DeepChopper/issues).