| | --- |
| | language: en |
| | license: mit |
| | library_name: scikit-learn |
| | tags: |
| | - protein-analysis |
| | - signal-peptide |
| | - bioinformatics |
| | - protein-secretion |
| | - machine-learning |
| | datasets: |
| | - uniprot |
| | pipeline_tag: tabular-classification |
| | widget: |
| | - example_title: "Human Albumin (Secreted)" |
| |   text: "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKT" |
| | - example_title: "GFP (Non-secreted)" |
| |   text: "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFGYGVQCFARY" |
| | --- |
| | |
| | # SignalSeeker: Protein Signal Peptide Prediction |
| |
|
| | ## Model Description |
| |
|
| | SignalSeeker is a machine learning ensemble that predicts whether a protein sequence contains a signal peptide. It computes ProtBERT embeddings of each sequence and classifies them with Random Forest, Extra Trees, SVM, and Logistic Regression models. |
| |
|
| | ## Model Performance |
| |
|
| | - **Best Model**: Logistic Regression (L2) |
| | - **Cross-Validation AUC**: 0.99433 (test AUC: 0.98432, per the Performance Metrics table) |
| | - **Training Data**: 5,000 mixed sequences from UniProt-verified eukaryotic proteins |
| | - **Test Data**: 1,000 mixed sequences from UniProt-verified eukaryotic proteins, held out from the training data |
| |
|
| | ## Intended Use |
| |
|
| | This model is designed to: |
| | - Predict whether a protein sequence contains a signal peptide |
| | - Assist in protein subcellular localization prediction |
| | - Support research in protein secretion pathways |
| | - Aid in biotechnology applications requiring secreted proteins |
| |
|
| | ## How to Use |
| |
|
| | ### Installation |
| |
|
| | ```bash |
| | pip install torch transformers scikit-learn numpy |
| | ``` |
| |
|
| | ### Basic Usage |
| |
|
| | ```python |
| | from signalseeker import SignalSeekerPredictor |
| | |
| | # Initialize predictor |
| | predictor = SignalSeekerPredictor.from_pretrained("your-username/signalseeker") |
| | |
| | # Predict signal peptide |
| | sequence = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFK..." |
| | result = predictor.predict(sequence) |
| | |
| | print(f"Has signal peptide: {result['has_signal_peptide']}") |
| | print(f"Confidence: {result['probability']:.3f}") |
| | ``` |
| |
|
| | ### Batch Prediction |
| |
|
| | ```python |
| | sequences = { |
| |     "protein1": "MKWVTFISLLFLFSSAYS...", |
| |     "protein2": "MSKGEELFTGVVPILVELD...", |
| | } |
| | |
| | results = predictor.predict_batch(sequences) |
| | ``` |
| |
|
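Sequences are often stored as FASTA files rather than as Python dictionaries. The sketch below shows one way to build the `{name: sequence}` mapping used above from FASTA-formatted text. The helper `read_fasta` is hypothetical and not part of SignalSeeker; it uses the first word of each header line as the record name.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} dict.

    Hypothetical helper for illustration; not part of the SignalSeeker API.
    """
    records: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID,
            # falling back to a positional name if the header is empty.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"record{len(records) + 1}"
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped; collect them and join at the end.
            records[current_id].append(line.upper())
    return {rid: "".join(parts) for rid, parts in records.items()}


example = """>protein1 example secreted protein
MKWVTFISLLFLFSSAYS
RGVFRRDAHK
>protein2
MSKGEELFTGVVPILVELD
""".splitlines()

sequences = read_fasta(example)
print(sequences)
```

The resulting dictionary has the same shape as the `sequences` argument shown in the batch example. For a file on disk, pass an open file handle instead of the list of lines.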
| | ## Model Architecture |
| |
|
| | The SignalSeeker ensemble consists of: |
| |
|
| | 1. **Feature Extraction**: ProtBERT embeddings of the N-terminal 50 amino acids |
| | 2. **Ensemble Models**: |
| |    - Random Forest (Regularized) |
| |    - Extra Trees (Regularized) |
| |    - Support Vector Machine |
| |    - Logistic Regression (L2) |
| | 3. **Feature Scaling**: StandardScaler normalization |
| | 4. **Decision Logic**: Weighted ensemble with confidence assessment |
| |
|
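The components listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the released training code: the features are random numbers standing in for ProtBERT embeddings, and the hyperparameters, equal ensemble weights, and the 0.3 confidence margin are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for embedding features (assumption: real inputs would be
# ProtBERT embeddings of the N-terminal 50 residues).
X = rng.normal(size=(200, 32))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Soft-voting ensemble over the four model families named in this card.
# Hyperparameters and weights are illustrative assumptions.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, max_depth=5, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("lr", LogisticRegression(C=1.0, max_iter=1000)),  # L2 is the default penalty
    ],
    voting="soft",
    weights=[1.0, 1.0, 1.0, 1.0],
)

# Scale features first, then apply the weighted ensemble.
model = make_pipeline(StandardScaler(), ensemble)
model.fit(X[:150], y[:150])

# Probability of the positive class for held-out samples.
proba = model.predict_proba(X[150:])[:, 1]

# Simple confidence label: "high" when the probability is far from 0.5.
confidence = np.where(np.abs(proba - 0.5) >= 0.3, "high", "low")
print(proba[:5], confidence[:5])
```

Soft voting averages the per-model class probabilities, so unequal `weights` would let a stronger model contribute more to the final score.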
| | ## Training Data |
| |
|
| | - **Source**: UniProt database |
| | - **Organisms**: Eukaryotic proteins (Human, Mouse, Plant, Fungal) |
| | - **Positive Examples**: Proteins with experimentally verified signal peptides |
| | - **Negative Examples**: Cytoplasmic and nuclear proteins |
| | - **Validation**: Cross-validation with similarity-aware train/test splits |
| |
|
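One common way to make cross-validation similarity-aware is to cluster sequences by identity and keep each cluster entirely within one fold. The sketch below illustrates the idea with scikit-learn's `GroupKFold`. The cluster IDs are made up for the example (in practice they might come from a sequence clustering tool), and this is not necessarily the procedure used to train SignalSeeker.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 samples with one feature each, alternating labels.
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)

# Hypothetical cluster assignments: sequences sharing an ID are similar.
cluster_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

cv = GroupKFold(n_splits=3)
overlaps = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups=cluster_ids)):
    train_clusters = set(cluster_ids[train_idx].tolist())
    test_clusters = set(cluster_ids[test_idx].tolist())
    # Record any cluster that appears on both sides of the split.
    overlaps.append(train_clusters & test_clusters)
    print(f"fold {fold}: test clusters = {sorted(test_clusters)}")
```

Because no cluster is shared between the training and test sides of a fold, near-duplicate sequences cannot leak from training into evaluation.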
| | ## Performance Metrics |
| |
|
| | | Model | CV AUC | Test AUC | Test Accuracy | |
| | |-------|--------|----------|---------------| |
| | | Logistic Regression (L2) | 0.99433 | 0.98432 | 0.92284 | |
| | | Random Forest (Regularized) | 0.98941 | 0.98869 | 0.96192 | |
| | | Extra Trees (Regularized) | 0.99032 | 0.99072 | 0.94899 | |
| | | SVM (Conservative) | 0.98711 | 0.98439 | 0.92284 | |
| |
|
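For readers unfamiliar with the two metrics in the table, the sketch below computes AUC and accuracy on a four-sample toy example with scikit-learn. The labels and scores are invented for illustration, and the 0.5 decision threshold is an assumption, not a documented SignalSeeker setting.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy ground truth (1 = has signal peptide) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is threshold-free: it measures how often a positive example is
# scored higher than a negative one.
auc = roc_auc_score(y_true, y_score)

# Accuracy needs hard labels, so apply a threshold (assumed 0.5 here).
y_pred = [int(score >= 0.5) for score in y_score]
acc = accuracy_score(y_true, y_pred)

print(f"AUC = {auc:.2f}, accuracy = {acc:.2f}")  # AUC = 0.75, accuracy = 0.75
```

AUC depends only on how the scores rank the examples, while accuracy also depends on the chosen threshold. This is why two models with similar AUC can differ in accuracy.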
| | ## Limitations |
| |
|
| | - Trained primarily on eukaryotic sequences |
| | - Performance may vary for prokaryotic proteins |
| | - Uses the N-terminal 50 amino acids as input, so sequences shorter than 50 residues may be predicted less reliably |
| | - May have reduced accuracy for highly divergent organisms |
| |
|
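Given the length requirement above, it can help to check inputs before prediction. The sketch below trims a sequence to its N-terminal 50 residues and flags short inputs. The helper `prepare_nterminal` is hypothetical and not part of SignalSeeker. The space-separated output and the mapping of rare residues (U, Z, O, B) to X follow the preprocessing shown in ProtBERT's published usage examples; whether SignalSeeker applies the same steps internally is an assumption.

```python
import re

N_TERMINAL_LENGTH = 50  # window size stated in the Model Architecture section


def prepare_nterminal(sequence: str, n: int = N_TERMINAL_LENGTH) -> dict:
    """Clean a sequence and extract its N-terminal window.

    Hypothetical helper for illustration; not part of the SignalSeeker API.
    """
    # Remove whitespace and line breaks, then normalize to uppercase.
    cleaned = re.sub(r"\s+", "", sequence).upper()
    # Map rare or ambiguous residues to X, as in ProtBERT usage examples.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    window = cleaned[:n]
    return {
        "window": window,
        "spaced": " ".join(window),  # ProtBERT expects space-separated residues
        "is_short": len(cleaned) < n,  # True if fewer than n residues were given
    }


albumin = (
    "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFK"
    "ALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKT"
)
result = prepare_nterminal(albumin)
print(result["window"], result["is_short"])
```

A caller could warn or skip prediction when `is_short` is true, rather than silently scoring a truncated input.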
| | ## Ethical Considerations |
| |
|
| | - This model is for research purposes only |
| | - Not intended for clinical diagnosis |
| | - Results should be validated experimentally |
| | - Consider potential biases in training data |
| |
|
| | ## Citation |
| |
|
| | If you use SignalSeeker in your research, please cite: |
| |
|
| | ```bibtex |
| | @misc{signalseeker2025, |
| |   title={SignalSeeker: Machine Learning Ensemble for Protein Signal Peptide Prediction}, |
| |   author={Hugo Cooper}, |
| |   year={2025}, |
| |   url={https://huggingface.co/hcoops/signalseeker} |
| | } |
| | ``` |
| |
|
| | ## Contact |
| |
|
| | For questions or issues, please open an issue on the [GitHub repository](https://github.com/hcoo25/signalseeker). |
| |
|
| | ## License |
| |
|
| | This model is released under the MIT License. |
| |
|