yangliz5/deepchopper

@@ -1,17 +1,189 @@
 ---
 tags:
-- model_hub_mixin
-- pytorch_model_hub_mixin
-license: apache-2.0
 language:
-- en
-metrics:
-- accuracy
-- recall
-- f1
 pipeline_tag: token-classification
 ---
-This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
-- Library: [More Information Needed]
-- Docs: [More Information Needed]

 ---
 tags:
+- genomics
+- bioinformatics
+- nanopore
+- rna-sequencing
+- chimera-detection
+- token-classification
+- hyenadna
+- pytorch
+- lightning
+license: mit
+datasets:
+- nanopore-drna-seq
 language:
+- dna
+library_name: deepchopper
 pipeline_tag: token-classification
 ---
+# DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing
+DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.
+## Model Details
+### Model Description
+DeepChopper combines the HyenaDNA-small-32k backbone, a genomic foundation model, with a specialized token classification head to detect chimeric artifacts in Nanopore direct RNA sequencing reads. The model takes both the base sequence and the per-base quality scores as input.
+- **Developed by:** YLab Team ([Li et al., 2024](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2))
+- **Model type:** Token Classification
+- **Language(s):** DNA sequences
+- **License:** MIT
+- **Base Model:** HyenaDNA-small-32k-seqlen
+- **Repository:** [DeepChopper GitHub](https://github.com/ylab-hi/DeepChopper)
+- **Paper:** [A Genomic Language Model for Chimera Artifact Detection](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2)
+### Model Architecture
+- **Backbone:** HyenaDNA-small-32k (256 dimensions)
+- **Classification Head:**
+  - Linear Layer 1: 256 → 1024 dimensions
+  - Linear Layer 2: 1024 → 1024 dimensions
+  - Output Layer: 1024 → 2 classes (artifact/non-artifact)
+  - Quality Score Integration: Identity layer for base quality incorporation
+- **Input:**
+  - Tokenized DNA sequences (vocabulary size: 12)
+  - Base quality scores
+- **Output:** Per-base classification (artifact vs. non-artifact)
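To make the per-base output concrete, here is a minimal sketch of how a vector of per-base labels could be collapsed into contiguous artifact spans. The label encoding (1 = artifact, 0 = non-artifact) and the helper name `artifact_intervals` are assumptions for illustration; this is not DeepChopper's own post-processing, which may apply additional smoothing or length filters.

```python
from itertools import groupby


def artifact_intervals(labels):
    """Return half-open (start, end) spans where the per-base label equals 1."""
    intervals = []
    position = 0
    for label, run in groupby(labels):
        run_length = sum(1 for _ in run)
        if label == 1:  # assumed encoding: 1 marks an artifact base
            intervals.append((position, position + run_length))
        position += run_length
    return intervals


# Hypothetical predictions for a 12-base read.
predictions = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1]
print(artifact_intervals(predictions))  # [(3, 7), (10, 12)]
```

Half-open spans slice directly into the read, so `read[start:end]` would return the bases flagged as artifact.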
+## Uses
+### Direct Use
+DeepChopper is designed for:
+- Detecting chimeric artifacts in Nanopore direct RNA sequencing data
+- Identifying adapter sequences within base-called reads
+- Preprocessing RNA-seq data before downstream transcriptomics analysis
+- Improving accuracy of transcript annotation and gene fusion detection
+### Downstream Use
+The cleaned data can be used for:
+- Transcript isoform analysis
+- Gene expression quantification
+- Novel transcript discovery
+- Gene fusion detection
+- Alternative splicing analysis
+### Out-of-Scope Use
+This model is NOT designed for:
+- DNA sequencing data (the model is trained specifically on direct RNA sequencing reads)
+- PacBio or Illumina sequencing platforms
+- Genome assembly or variant calling
+## Training Details
+### Training Data
+The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.
+### Training Procedure
+- **Optimizer:** Adam (lr=0.0002, weight_decay=0)
+- **Learning Rate Scheduler:** ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
+- **Loss Function:** Continuous Interval Loss (CrossEntropyLoss with no penalty)
+- **Framework:** PyTorch Lightning
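The effect of the scheduler settings above (mode=min, factor=0.1, patience=10) can be illustrated with a simplified, dependency-free stand-in for a reduce-on-plateau schedule. The class name `PlateauScheduler` and the relative improvement threshold of 1e-4 are assumptions for this sketch; the model card does not state a threshold, and this is not the PyTorch class itself.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau schedule for a loss that should decrease."""

    def __init__(self, lr=0.0002, factor=0.1, patience=10, threshold=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold  # assumed relative improvement threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current learning rate."""
        if val_loss < self.best * (1 - self.threshold):
            # The loss improved: remember it and reset the stagnation counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # More than `patience` epochs without improvement: shrink the rate.
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler()
# Hypothetical loss curve: two improving epochs, then eleven stagnant ones.
for epoch, loss in enumerate([1.0, 0.8] + [0.8] * 11, start=1):
    lr = scheduler.step(loss)
    print(f"epoch {epoch:2d}  loss {loss:.2f}  lr {lr:.1e}")
```

With these settings the learning rate stays at 0.0002 through ten stagnant epochs and drops by a factor of ten on the eleventh.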
+### Training Hyperparameters
+- Learning Rate: 0.0002
+- Batch Size: Configured per experiment
+- Weight Decay: 0
+- Backbone: Fine-tuned (not frozen)
+## Evaluation
+### Testing Data & Metrics
+The model is evaluated on held-out test sets using:
+- F1 Score (primary metric)
+- Precision
+- Recall
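As a reference for how these three metrics relate, the sketch below computes them from binary labels. It assumes per-base evaluation with the artifact class (label 1) as the positive class; the model card does not specify the evaluation granularity, and the example labels are made up.

```python
def precision_recall_f1(true_labels, predicted_labels, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical per-base labels for an 8-base read (1 = artifact).
truth = [0, 1, 1, 1, 0, 0, 1, 0]
predicted = [0, 1, 1, 0, 0, 1, 1, 0]
p, r, f1 = precision_recall_f1(truth, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.75 recall=0.75 f1=0.75
```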
+### Results
+DeepChopper improves downstream analysis by removing chimeric artifacts that would otherwise confound transcriptome analyses. See the [paper](https://www.biorxiv.org/content/10.1101/2024.10.23.619929v2) for quantitative results.
+## How to Use
+### Installation
+```bash
+pip install deepchopper
+```
+### Python API
+```python
+import deepchopper
+# Load the pretrained model
+model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")
+# The model is ready for inference
+# Use with deepchopper's predict pipeline
+```
+### Command Line Interface
+```bash
+# Step 1: Encode your FASTQ data
+deepchopper encode input.fq
+# Step 2: Predict chimeric artifacts
+deepchopper predict input.parquet --output predictions
+# Step 3: Remove artifacts and generate clean FASTQ
+deepchopper chop predictions input.fq
+```
+For GPU acceleration:
+```bash
+deepchopper predict input.parquet --output predictions --gpus 1
+```
+### Web Interface
+Try DeepChopper online without installation:
+- [Hugging Face Space](https://huggingface.co/spaces/yangliz5/deepchopper)
+- Or run locally: `deepchopper web`
+## Limitations
+- **Platform-specific:** Optimized for Nanopore direct RNA sequencing
+- **Read length:** Best performance on reads up to 32k bases (model context window)
+- **Species:** Trained primarily on human RNA sequences
+- **Computational requirements:** GPU recommended for large datasets
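Given the read-length limitation above, it can help to check how many reads exceed the context window before running prediction. The following is a minimal sketch that assumes "32k" means 32,768 bases and that the FASTQ file uses plain 4-line records; `split_by_context` and `read_lengths` are hypothetical helpers, not part of the deepchopper package.

```python
import io

MAX_CONTEXT = 32_768  # assumption: "32k" interpreted as 32,768 bases


def read_lengths(handle):
    """Yield (read_id, sequence_length) from 4-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (or first empty line) ends the scan
            break
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield header[1:].split()[0], len(sequence)


def split_by_context(handle, max_len=MAX_CONTEXT):
    """Separate read IDs into those within and those over the length limit."""
    within, over = [], []
    for read_id, length in read_lengths(handle):
        (within if length <= max_len else over).append(read_id)
    return within, over


# Tiny in-memory example with an artificially small limit for illustration.
fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGTACGTAC\n+\nIIIIIIIIII\n")
within, over = split_by_context(fastq, max_len=8)
print("within limit:", within)
print("over limit:", over)
```

For a real file, pass an open file handle, for example `open("input.fq")`, and keep the default limit.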
+## Citation
+If you use DeepChopper in your research, please cite:
+```bibtex
+@article{Li2024.10.23.619929,
+    author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
+    title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
+    year = {2024},
+    doi = {10.1101/2024.10.23.619929},
+    journal = {bioRxiv}
+}
+```
+## Contact & Support
+- **Issues:** [GitHub Issues](https://github.com/ylab-hi/DeepChopper/issues)
+- **Documentation:** [Full Tutorial](https://github.com/ylab-hi/DeepChopper/blob/main/documentation/tutorial.md)
+- **Repository:** [GitHub](https://github.com/ylab-hi/DeepChopper)
+## Model Card Authors
+YLab Team
+## Model Card Contact
+For questions about this model, please open an issue on the [GitHub repository](https://github.com/ylab-hi/DeepChopper/issues).