---
language:
- en
license: mit
tags:
- tabular
- classification
- scikit-learn
- ensemble-learning
- breast-cancer-detection
- medical-imaging
datasets:
- uci-wdbc
metrics:
- accuracy
- precision
- recall
- f1
- roc_auc
pipeline_tag: tabular-classification
---
# Breast Cancer Detection Ensemble Pipeline
A scikit-learn machine learning pipeline built around a **Soft-Voting Ensemble Classifier**. The model is trained on clinical data to distinguish between malignant and benign tumors, with an emphasis on high sensitivity (recall) to minimize false negatives in diagnostic screening.
The approach follows the methodology discussed in *"Comparison of ML Algorithms for Breast Cancer Prediction" (CTEMS 2018)* and extends that baseline to a five-model ensemble in which every estimator has its own scaling pipeline.
---
# Model Description
The model uses a **soft-voting architecture** that averages the predicted class probabilities of five diverse base estimators. Each base classifier is wrapped in its own preprocessing pipeline with a `StandardScaler`, so scaling statistics are learned from training data only and do not leak from held-out data.
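
To make the aggregation step concrete, here is a minimal sketch of soft voting in plain Python. The probability values are made up for illustration and are not output of the trained model; `soft_vote` is a hypothetical helper name. scikit-learn's `VotingClassifier(voting="soft")` performs the same averaging, with equal weights by default.

```python
# Soft voting: average each class's probability across estimators,
# then pick the class with the highest mean probability.
# NOTE: the probabilities below are illustrative values, not model output.

def soft_vote(per_model_probs):
    """Return (mean_probs, winning_class_index) for one sample."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    mean_probs = [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: mean_probs[c])
    return mean_probs, winner


# One row per base estimator: [P(class 0), P(class 1)]
example_probs = [
    [0.10, 0.90],  # random forest
    [0.20, 0.80],  # kNN
    [0.40, 0.60],  # Gaussian naive Bayes
    [0.15, 0.85],  # SVC
    [0.65, 0.35],  # logistic regression (disagrees with the others)
]

mean_probs, winner = soft_vote(example_probs)
print(mean_probs, winner)
```

Because probabilities are averaged rather than votes counted, one confident estimator can outweigh several lukewarm ones, which is the main difference from hard (majority) voting.
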
## Component Estimators
1. **Random Forest Classifier**
   - 72 estimators
   - Balanced class weights
2. **k-Nearest Neighbors (kNN)**
   - Euclidean distance metric
   - `k = 5`
3. **Gaussian Naive Bayes**
   - Probabilistic baseline classifier
4. **Support Vector Classifier (SVC)**
   - `rbf` kernel
   - Probability estimation enabled
5. **Logistic Regression**
   - Regularized linear classifier
   - Balanced class distributions
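
The five components listed above could be assembled roughly as follows. This is a sketch, not the author's training code: `build_ensemble` is a hypothetical helper name, and `random_state=42`, `max_iter=1000`, and reading "balanced" as `class_weight="balanced"` are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_ensemble(seed=42):  # hypothetical helper; the seed is an assumption
    """Soft-voting ensemble whose base estimators each scale their own input."""
    base = {
        "rf": RandomForestClassifier(
            n_estimators=72, class_weight="balanced", random_state=seed
        ),
        "knn": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
        "gnb": GaussianNB(),
        # probability=True is required so that SVC exposes predict_proba
        "svc": SVC(kernel="rbf", probability=True, random_state=seed),
        # class_weight and max_iter are assumed settings
        "lr": LogisticRegression(class_weight="balanced", max_iter=1000),
    }
    estimators = [
        (name, make_pipeline(StandardScaler(), clf)) for name, clf in base.items()
    ]
    return VotingClassifier(estimators=estimators, voting="soft")


ensemble = build_ensemble()
print([name for name, _ in ensemble.estimators])
```

Wrapping each estimator in its own pipeline means the scaler is refit whenever the ensemble is fit, including inside each cross-validation fold.
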
---
# Dataset & Training Architecture
- **Dataset Source:** Wisconsin Diagnostic Breast Cancer (WDBC), UCI Machine Learning Repository
- **Instances:** 569 samples
  - 357 Benign
  - 212 Malignant
- **Features:** 30 real-valued features computed from digitized images of fine needle aspirate (FNA) samples
- **Split Strategy:** Stratified train-test split
  - Training: 398 samples
  - Testing: 171 samples
The pipeline uses:
- `StratifiedKFold` cross-validation
- Leakage-free preprocessing
- Automated scaling pipelines
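
A stratified 70/30 split reproduces the 398/171 sample counts listed above. The sketch below uses scikit-learn's bundled copy of WDBC; `test_size=0.3` is inferred from those counts, while `random_state=42`, the five folds, and the single logistic-regression pipeline standing in for the full ensemble are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features

# Stratify on y so both splits keep the benign/malignant ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 398 171

# The scaler lives inside the pipeline, so each fold fits it on that
# fold's training portion only; validation rows never influence it.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring="recall" treats label 1 (benign in scikit-learn's copy) as positive.
recall_per_fold = cross_val_score(model, X_train, y_train, cv=cv, scoring="recall")
print(recall_per_fold.mean())
```

Scaling the full dataset before splitting would let test-set statistics influence training, which is the leakage the pipeline design avoids.
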
---
# Performance Metrics
Evaluation prioritizes **Recall (Sensitivity)** to reduce false negatives while maintaining strong overall classification accuracy.
| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| **Ensemble (Soft Voting)** | **0.9766** | **0.9725** | **0.9907** | **0.9815** | **0.9972** |
| Random Forest | 0.9649 | 0.9633 | 0.9813 | 0.9722 | 0.9936 |
| kNN | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9877 |
| Support Vector Machine | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9974 |
| Logistic Regression | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9969 |
| Naive Bayes | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9892 |
> **Note:** Results may vary slightly depending on package versions and random seeds.
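
The ensemble row can be cross-checked against the standard metric definitions. The confusion-matrix counts below are not reported in the source: they are inferred as the integer counts that reproduce the ensemble row on a 171-sample test set, and they assume label 1 (benign, as in the inference example) is the positive class.

```python
# Counts inferred from the reported ensemble metrics (an assumption,
# not values published by the author). Positive class = label 1.
tp, fn = 106, 1   # positives predicted correctly / missed
fp, tn = 3, 61    # negatives flagged as positive / predicted correctly

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Recall of the other class (label 0) under the same inferred counts.
recall_label0 = tn / (tn + fp)

print(f"n={total} acc={accuracy:.4f} prec={precision:.4f} "
      f"rec={recall:.4f} f1={f1:.4f}")
print(f"recall for label 0: {recall_label0:.4f}")
```

If these assumptions hold, the reported recall describes how many benign cases are recovered, and sensitivity to malignant cases corresponds to `recall_label0`. Reporting both is worthwhile when missed malignancies are the main concern.
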
---
# Installation
## Dependencies
```text
scikit-learn>=1.0
numpy
pandas
joblib
huggingface_hub
```
Install dependencies:
```bash
pip install scikit-learn numpy pandas joblib huggingface_hub
```
---
# Dynamic Inference Example
You can download the trained pipeline directly from the Hugging Face Hub and run it locally.
```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# Download the serialized pipeline from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="NethranjaliSE/Breast-Cancer-detection-using-ML-Algorithm",
    filename="ensemble_soft_voting.pkl",
)

# Load the fitted pipeline
pipeline = joblib.load(model_path)

# Example input: one sample with the 30 WDBC features
sample_data = [[
    14.12, 19.28, 91.96, 654.8, 0.096, 0.11, 0.08, 0.04, 0.18, 0.06,
    0.25, 0.89, 1.82, 24.3, 0.006, 0.02, 0.02, 0.01, 0.01, 0.003,
    16.26, 25.67, 107.26, 880.5, 0.132, 0.21, 0.19, 0.09, 0.28, 0.08
]]

# Reuse the training column names when the pipeline recorded them;
# otherwise pandas falls back to default integer column labels
feature_names = getattr(pipeline, "feature_names_in_", None)
input_df = pd.DataFrame(sample_data, columns=feature_names)

# Predict the class label and the per-class probabilities
prediction = pipeline.predict(input_df)
probabilities = pipeline.predict_proba(input_df)[0]

# Label encoding used here: 0 = malignant, 1 = benign
diagnosis = (
    "Benign (Low Risk)"
    if prediction[0] == 1
    else "Malignant (High Risk)"
)

print(f"Diagnostic Assessment: {diagnosis}")
print(
    f"Class probabilities -> "
    f"Malignant: {probabilities[0]:.4f} | "
    f"Benign: {probabilities[1]:.4f}"
)
```
---
# Repository Structure
```text
.
├── ensemble_soft_voting.pkl
├── training_pipeline.ipynb
├── requirements.txt
└── README.md
```
---
# Limitations & Intended Use
This model is developed strictly for:
- Academic research
- Educational purposes
- Machine learning experimentation
- Pipeline prototyping
It is **NOT** approved for:
- Clinical deployment
- Medical diagnosis
- Real-world healthcare decision-making
All diagnostic decisions must be performed by qualified medical professionals using certified medical systems.
---
# Citations
### Research Reference
```bibtex
@article{street1993nuclear,
  title={Nuclear feature extraction for breast tumor diagnosis},
  author={Street, W.N. and Wolberg, W.H. and Mangasarian, O.L.},
  journal={IS\&T/SPIE Biomedical Imaging},
  year={1993}
}
```
### Dataset Reference
- UCI Machine Learning Repository
- Breast Cancer Wisconsin (Diagnostic) Dataset
---
# Acknowledgements
Special thanks to:
- UCI Machine Learning Repository
- Scikit-learn contributors
- Hugging Face Hub
- Open-source ML research community
---
# Model Author
**Sachini Praboda Nethranjali**
Electronic and Computer Science Undergraduate
University of Kelaniya, Sri Lanka