	---
	tags:
	- genomics
	- bioinformatics
	- nanopore
	- rna-sequencing
	- chimera-detection
	- token-classification
	- hyenadna
	- pytorch
	- lightning
	license: apache-2.0
	datasets:
	- nanopore-drna-seq
	language:
	- dna
	library_name: deepchopper
	pipeline_tag: token-classification
	---

	# DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing

	DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.

	## Model Details

	### Model Description

	DeepChopper leverages the HyenaDNA-small-32k backbone, a genomic foundation model, combined with a specialized token classification head to detect chimeric artifacts in nanopore direct RNA sequencing reads. The model processes both sequence information and base quality scores to make accurate predictions.

	- Developed by: YLab Team ([Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2))
	- Model type: Token Classification
	- Language(s): DNA sequences
	- License: Apache 2.0
	- Base Model: HyenaDNA-small-32k-seqlen
	- Repository: [DeepChopper GitHub](https://github.com/ylab-hi/DeepChopper)
	- Paper: [A Genomic Language Model for Chimera Artifact Detection](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2)

	### Model Architecture

	- Backbone: HyenaDNA-small-32k (256 dimensions)
	- Classification Head:
	  - Linear Layer 1: 256 → 1024 dimensions
	  - Linear Layer 2: 1024 → 1024 dimensions
	  - Output Layer: 1024 → 2 classes (artifact/non-artifact)
	  - Quality Score Integration: Identity layer for base quality incorporation
	- Input:
	  - Tokenized DNA sequences (vocabulary size: 12)
	  - Base quality scores
	- Output: Per-base classification (artifact vs. non-artifact)
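
As an illustration of the two inputs listed above, the sketch below converts one read into token ids and Phred quality scores. The vocabulary layout (seven special tokens followed by A, C, G, T, N, giving 12 entries) is an assumption modeled on HyenaDNA's character tokenizer, and `encode_read` is a hypothetical helper, not part of the deepchopper API.

```python
# Assumed vocabulary: 7 special tokens + 5 nucleotide symbols = 12 entries.
SPECIAL_TOKENS = ["[CLS]", "[SEP]", "[BOS]", "[MASK]", "[PAD]", "[RESERVED]", "[UNK]"]
BASES = ["A", "C", "G", "T", "N"]
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + BASES)}


def encode_read(sequence: str, quality: str) -> tuple[list[int], list[int]]:
    """Return (token_ids, phred_scores) for one read.

    Quality characters are decoded with the standard Phred+33 offset.
    Unknown symbols map to the [UNK] token.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must have equal length")
    unk = VOCAB["[UNK]"]
    token_ids = [VOCAB.get(base, unk) for base in sequence.upper()]
    phred_scores = [ord(char) - 33 for char in quality]
    return token_ids, phred_scores


token_ids, phred_scores = encode_read("ACGTN", "II5+!")
print(len(VOCAB), token_ids, phred_scores)
```

The real pipeline performs this step through `deepchopper encode`; the sketch only shows the shape of the data the model consumes.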

	## Uses

	### Direct Use

	DeepChopper is designed for:
	- Detecting chimeric artifacts in Nanopore direct RNA sequencing data
	- Identifying adapter sequences within base-called reads
	- Preprocessing RNA-seq data before downstream transcriptomics analysis
	- Improving accuracy of transcript annotation and gene fusion detection

	### Downstream Use

	The cleaned data can be used for:
	- Transcript isoform analysis
	- Gene expression quantification
	- Novel transcript discovery
	- Gene fusion detection
	- Alternative splicing analysis

	### Out-of-Scope Use

	This model is NOT designed for:
	- DNA sequencing data (it's specifically trained on RNA sequences)
	- PacBio or Illumina sequencing platforms
	- Genome assembly or variant calling

	## Training Details

	### Training Data

	The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.

	### Training Procedure

	- Optimizer: Adam (lr=0.0002, weight_decay=0)
	- Learning Rate Scheduler: ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
	- Loss Function: Continuous Interval Loss (CrossEntropyLoss with no penalty)
	- Framework: PyTorch Lightning
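
To make the scheduler settings concrete, here is a simplified pure-Python sketch of the reduce-on-plateau rule with the values listed above. It mirrors the core behaviour of PyTorch's `ReduceLROnPlateau` but ignores its threshold and cooldown options. `simulate_plateau_schedule` is a hypothetical helper, and the validation losses are made up for illustration.

```python
def simulate_plateau_schedule(val_losses, lr=2e-4, factor=0.1, patience=10):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` once the monitored loss has failed
    to improve for more than `patience` consecutive epochs (mode="min").
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Illustrative losses: 3 improving epochs, then a 12-epoch plateau.
losses = [0.9, 0.7, 0.5] + [0.5] * 12
schedule = simulate_plateau_schedule(losses)
print(schedule[0], schedule[-1])
```

With `patience=10`, the rate stays at 0.0002 through the tenth stagnant epoch and drops by a factor of 10 on the eleventh.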

	### Training Hyperparameters

	- Learning Rate: 0.0002
	- Batch Size: Configured per experiment
	- Weight Decay: 0
	- Backbone: Fine-tuned (not frozen)

	## Evaluation

	### Testing Data & Metrics


	The model is evaluated on held-out test sets using:
	- F1 Score (primary metric)
	- Precision
	- Recall
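
For reference, per-base precision, recall, and F1 can be computed as below, treating artifact as the positive class. This is a generic sketch of the metric definitions, not the project's evaluation code. `per_base_metrics`, the label encoding (1 = artifact), and the example labels are assumptions.

```python
def per_base_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) over per-base labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy read of 8 bases: one false positive and one false negative.
truth = [0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0]
precision, recall, f1 = per_base_metrics(truth, predicted)
print(precision, recall, f1)
```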

	### Results

	By removing chimeric artifacts that would otherwise confound transcriptome analyses, DeepChopper improves the quality of downstream analysis. For quantitative results, see the [paper](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2).

	## How to Use

	### Installation

	```bash
	pip install deepchopper
	```

	### Python API

	```python
	import deepchopper

	# Load the pretrained model
	model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")

	# The model is ready for inference
	# Use with deepchopper's predict pipeline
	```

	### Command Line Interface

	```bash
	# Step 1: Encode your FASTQ data
	deepchopper encode input.fq

	# Step 2: Predict chimeric artifacts
	deepchopper predict input.parquet --output predictions

	# Step 3: Remove artifacts and generate clean FASTQ
	deepchopper chop predictions input.fq
	```

	For GPU acceleration:
	```bash
	deepchopper predict input.parquet --output predictions --gpus 1
	```
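
Conceptually, the final step turns per-base labels into artifact intervals and keeps the read segments between them. The sketch below illustrates that idea on a toy read. It is not the implementation behind `deepchopper chop`: the helper names, the label encoding (1 = artifact), and the absence of any smoothing or minimum-length filtering are all assumptions.

```python
from itertools import groupby


def artifact_intervals(labels):
    """Return half-open (start, end) intervals of consecutive artifact labels."""
    intervals = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == 1:
            intervals.append((position, position + length))
        position += length
    return intervals


def chop_read(sequence, quality, labels):
    """Split a read into the segments lying outside artifact intervals."""
    segments = []
    start = 0
    for art_start, art_end in artifact_intervals(labels):
        if art_start > start:
            segments.append((sequence[start:art_start], quality[start:art_start]))
        start = art_end
    if start < len(sequence):
        segments.append((sequence[start:], quality[start:]))
    return segments


# Toy read: 8 bases, a 4-base predicted artifact, then 4 more bases.
sequence = "ACGTACGTTTTTGGCA"
quality = "IIIIIIII!!!!IIII"
labels = [0] * 8 + [1] * 4 + [0] * 4
intervals = artifact_intervals(labels)
segments = chop_read(sequence, quality, labels)
print(intervals)
print(segments)
```

Each returned segment keeps its own quality string, so it can be written back out as a separate FASTQ record.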

	### Web Interface

	Try DeepChopper online without installation:
	- [Hugging Face Space](https://huggingface.co/spaces/yangliz5/deepchopper)
	- Or run locally: `deepchopper web`

	## Limitations

	- Platform-specific: Optimized for Nanopore direct RNA sequencing
	- Read length: Best performance on reads up to 32k bases (model context window)
	- Species: Trained primarily on human RNA sequences
	- Computational requirements: GPU recommended for large datasets
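
A quick pre-flight check can flag reads that exceed the context window before prediction. The sketch below assumes a limit of 32,768 bases (one reading of "32k"; the exact value is an assumption) and plain four-line FASTQ records. `read_fastq` and `split_by_length` are hypothetical helpers, and how DeepChopper itself treats longer reads is not covered here.

```python
import io

MAX_CONTEXT = 32_768  # assumed value for the "32k" context window


def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:].split()[0], sequence, quality


def split_by_length(records, max_len=MAX_CONTEXT):
    """Return (names within the limit, names exceeding the limit)."""
    within, too_long = [], []
    for name, sequence, _quality in records:
        (within if len(sequence) <= max_len else too_long).append(name)
    return within, too_long


# In-memory example: one short read and one 40,000-base read.
fastq_text = (
    "@read1 extra\nACGT\n+\nIIII\n"
    "@read2\n" + "A" * 40000 + "\n+\n" + "I" * 40000 + "\n"
)
within, too_long = split_by_length(read_fastq(io.StringIO(fastq_text)))
print(within, too_long)
```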

	## Citation

	If you use DeepChopper in your research, please cite:

	```bibtex
	@article{Li2024.10.23.619929,
	author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
	title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
	year = {2024},
	doi = {10.1101/2024.10.23.619929},
	journal = {bioRxiv}
	}
	```

	## Contact & Support

	- Issues: [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues)
	- Documentation: [Full Tutorial](https://github.com/ylab-hi/DeepChopper/blob/main/documentation/tutorial.md)
	- Repository: [GitHub](https://github.com/ylab-hi/DeepChopper)

	## Model Card Authors

	YLab Team

	## Model Card Contact

	For questions about this model, please open an issue on the [GitHub repository](https://github.com/ylab-hi/DeepChopper/issues).