About CANLoc
CANLoc is a machine-learning system that predicts the subcellular localization of a protein directly from its amino acid sequence. It combines transformer-based embeddings from the ESM2 protein language model with an optimized XGBoost classifier trained on curated protein datasets.
Performance & Evaluation
CANLoc achieves high accuracy, precision, recall, and F1-score for every localization class. The model is additionally validated using:
- Train/test split evaluation
- 10-fold stratified cross-validation
- ROC curves for each class
- Sensitivity and specificity analysis
Taken together, these evaluations indicate that CANLoc predictions are reliable for academic and research workflows.
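The sensitivity and specificity analysis listed above can be illustrated with a short one-vs-rest calculation. This is a minimal sketch and not CANLoc's own evaluation code: the function name `one_vs_rest_rates` and the placeholder class labels `A` to `D` are assumptions made for illustration.

```python
def one_vs_rest_rates(y_true, y_pred):
    """Return {label: (sensitivity, specificity)}, treating each label
    in turn as the positive class and all other labels as negative."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    rates = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = fn = fp = tn = 0
        for truth, pred in zip(y_true, y_pred):
            if truth == label and pred == label:
                tp += 1  # positive sample, predicted positive
            elif truth == label:
                fn += 1  # positive sample, predicted as another class
            elif pred == label:
                fp += 1  # negative sample, wrongly predicted positive
            else:
                tn += 1  # negative sample, predicted negative
        # Guard against classes with no positive or no negative samples.
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        rates[label] = (sensitivity, specificity)
    return rates


# Placeholder labels; the real class names are not assumed here.
y_true = ["A", "A", "B", "B", "C", "D"]
y_pred = ["A", "B", "B", "B", "C", "A"]
for label, (sens, spec) in one_vs_rest_rates(y_true, y_pred).items():
    print(f"{label}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting both values per class helps expose a class that is rarely predicted (low sensitivity) even when overall accuracy looks high.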
Intended Use
- Functional protein studies
- Localization-oriented drug delivery strategies
Model Strengths
- Fast and scalable for single or batch prediction
- Transformer embeddings provide rich biological context
- High accuracy with interpretable confidence metrics
- No alignment or preprocessing required beyond the raw sequence
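Because the raw sequence is the only input, a light sanity check before submitting a batch can catch malformed entries early. The sketch below assumes the sequences are available as FASTA-formatted text and that only the 20 standard amino acid letters are expected; both are assumptions for illustration, and the helper names `parse_fasta` and `invalid_residues` are hypothetical rather than part of CANLoc.

```python
# The 20 standard amino acids (one-letter codes). Whether CANLoc accepts
# other symbols such as X or U is not stated, so this is an assumption.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def invalid_residues(sequence):
    """Return the sorted list of characters outside the standard alphabet."""
    return sorted(set(sequence) - STANDARD_RESIDUES)


batch = """>protein_1 example entry
MKTAYIAK
QRQISFVK
>protein_2
MKXB
"""
for header, sequence in parse_fasta(batch):
    bad = invalid_residues(sequence)
    status = "ok" if not bad else "unexpected residues: " + ",".join(bad)
    print(f"{header}: length={len(sequence)} {status}")
```

Entries flagged here are not necessarily unusable; the check only makes it visible which sequences contain characters outside the standard alphabet.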
Limitations
- Performance depends on sequence length and quality
- Ambiguous sequences may reduce confidence
- Supports only four major localization classes
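One way to handle the reduced confidence that ambiguous sequences produce is to flag predictions whose top class probability falls below a cutoff. The following is a sketch under stated assumptions: it presumes the classifier exposes one probability per class, the function name `summarize_prediction` is hypothetical, the labels `A` to `D` are placeholders, and the `0.6` threshold is an arbitrary illustrative value rather than a CANLoc setting.

```python
def summarize_prediction(probabilities, threshold=0.6):
    """Summarize a {label: probability} mapping.

    Returns (top_label, confidence, margin, is_confident), where margin is
    the gap between the two highest probabilities and is_confident is True
    when the top probability reaches the threshold.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")

    # Rank classes from most to least probable.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    top_label, confidence = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    margin = confidence - runner_up
    return top_label, confidence, margin, confidence >= threshold


# Placeholder class labels and an illustrative threshold.
clear_case = {"A": 0.85, "B": 0.05, "C": 0.05, "D": 0.05}
ambiguous_case = {"A": 0.40, "B": 0.35, "C": 0.15, "D": 0.10}

for name, probs in [("clear", clear_case), ("ambiguous", ambiguous_case)]:
    label, conf, margin, ok = summarize_prediction(probs)
    flag = "accept" if ok else "review manually"
    print(f"{name}: top={label} confidence={conf:.2f} margin={margin:.2f} -> {flag}")
```

A small margin between the two highest probabilities is a useful second signal: it shows when the model is torn between two compartments even if the top probability looks acceptable.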
CANLoc represents a balance between modern deep learning and classical machine learning methods, producing a system that is both reliable and lightweight enough to deploy in real-world web applications.