+---
+language:
+- en
+license: mit
+tags:
+- tabular
+- classification
+- scikit-learn
+- ensemble-learning
+- breast-cancer-detection
+- medical-imaging
+datasets:
+- uci-wdbc
+metrics:
+- accuracy
+- precision
+- recall
+- f1
+- roc_auc
+pipeline_tag: tabular-classification
+---
+# 🎗️ Breast Cancer Detection Ensemble Pipeline
+A scikit-learn pipeline built around a **Soft-Voting Ensemble Classifier**. The model is trained on the WDBC clinical dataset to distinguish malignant from benign tumors, and its evaluation emphasizes recall so that false negatives in diagnostic screening are kept low.
+The approach follows the methodology discussed in *"Comparison of ML Algorithms for Breast Cancer Prediction" (CTEMS 2018)* and extends that baseline to a 5-model ensemble in which every estimator is wrapped in its own scaling pipeline.
+---
+# 📊 Model Description
+The model uses a **Soft-Voting architecture**: it averages the predicted class probabilities of five diverse base estimators and predicts the class with the highest mean probability. Each base classifier is wrapped in its own pipeline with a `StandardScaler`, so scaling statistics are fit on training data only and do not leak from held-out samples.
+## Component Estimators
+1. **Random Forest Classifier**
+   - 72 estimators
+   - Balanced class weights
+2. **k-Nearest Neighbors (kNN)**
+   - Euclidean distance metric
+   - `k = 5`
+3. **Gaussian Naive Bayes**
+   - Probabilistic baseline classifier
+4. **Support Vector Classifier (SVC)**
+   - `rbf` kernel
+   - Probability estimation enabled
+5. **Logistic Regression**
+   - Regularized linear classifier
+   - Balanced class weights
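The component list above can be sketched in scikit-learn roughly as follows. This is a minimal illustration, not the author's training code: the helper `scaled`, the `SEED` value, and `max_iter=1000` are assumptions, and only the hyperparameters named in the list come from this card.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 42  # assumed seed; the card does not state one


def scaled(estimator):
    """Wrap an estimator in its own StandardScaler pipeline (hypothetical helper)."""
    return make_pipeline(StandardScaler(), estimator)


def build_ensemble():
    """Assemble the five listed estimators into a soft-voting classifier."""
    return VotingClassifier(
        estimators=[
            ("rf", scaled(RandomForestClassifier(
                n_estimators=72, class_weight="balanced", random_state=SEED))),
            ("knn", scaled(KNeighborsClassifier(n_neighbors=5, metric="euclidean"))),
            ("nb", scaled(GaussianNB())),
            # probability=True is required so SVC exposes predict_proba
            ("svc", scaled(SVC(kernel="rbf", probability=True, random_state=SEED))),
            # max_iter is an assumption, raised to avoid convergence warnings
            ("lr", scaled(LogisticRegression(class_weight="balanced", max_iter=1000))),
        ],
        voting="soft",  # average predict_proba outputs instead of majority vote
    )


ensemble = build_ensemble()
```

Calling `ensemble.fit(X_train, y_train)` fits each scaler together with its estimator, so the scaling statistics come only from the data passed to `fit`.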
+---
+# 📈 Dataset & Training Architecture
+- **Dataset Source:** Wisconsin Diagnostic Breast Cancer (WDBC), UCI Machine Learning Repository
+- **Instances:** 569 samples
+  - 357 Benign
+  - 212 Malignant
+- **Features:** 30 real-valued features computed from digitized fine needle aspirate (FNA) images of breast masses
+- **Split Strategy:** Stratified 70/30 train-test split
+  - Training: 398 samples
+  - Testing: 171 samples
+The pipeline uses:
+- `StratifiedKFold` cross-validation
+- Leakage-free preprocessing
+- Automated scaling pipelines
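A minimal sketch of the split and cross-validation setup described above, assuming scikit-learn's bundled copy of WDBC (`load_breast_cancer`), a 30% test fraction, five folds, and a seed of 42. None of these values are confirmed by this card beyond the 398/171 sample counts, and a single logistic-regression pipeline stands in for the full ensemble to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # assumed seed

# Bundled WDBC copy: 569 samples, 30 features, labels 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)

# Stratified hold-out split that preserves the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)

# Because the scaler lives inside the pipeline, it is re-fit on the training
# portion of every fold and never sees the validation fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Score recall on the malignant class (label 0) rather than the default label 1
malignant_recall = make_scorer(recall_score, pos_label=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=malignant_recall)

print("train/test sizes:", X_train.shape[0], X_test.shape[0])
print("mean malignant recall across folds:", scores.mean())
```

Keeping the scaler inside the pipeline is what makes the preprocessing leakage-free: scaling the full dataset before splitting would let test-set statistics influence training.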
+---
+# ⚡ Performance Metrics
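+Evaluation prioritizes **Recall (Sensitivity)** to reduce false negatives while maintaining strong overall classification accuracy. Precision, recall, and F1 appear to be reported for label `1` (benign), the default positive class in scikit-learn; recall on the malignant class should be checked separately with `pos_label=0`.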
+| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
+|---|---|---|---|---|---|
+| **Ensemble (Soft Voting)** | **0.9766** | **0.9725** | **0.9907** | **0.9815** | **0.9972** |
+| Random Forest | 0.9649 | 0.9633 | 0.9813 | 0.9722 | 0.9936 |
+| kNN | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9877 |
+| Support Vector Machine | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9974 |
+| Logistic Regression | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9969 |
+| Naive Bayes | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9892 |
+> **Note:** Results may vary slightly depending on package versions and random seeds.
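As a worked check of the ensemble row in the table above, the sketch below uses one set of confusion-matrix counts that reproduces the reported figures. The counts are inferred from the reported metrics and an assumed test set of 107 benign and 64 malignant samples (what a stratified 70/30 split of 357/212 would give); they are not taken from the author's notebook.

```python
# Assumed test-set composition under a stratified 70/30 split
N_BENIGN, N_MALIGNANT = 107, 64

# Inferred counts, treating benign (label 1) as the positive class
tp, fn = 106, 1  # benign predicted benign / benign predicted malignant
tn, fp = 61, 3   # malignant predicted malignant / malignant predicted benign
assert tp + fn == N_BENIGN and tn + fp == N_MALIGNANT

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # recall on the benign class
f1 = 2 * precision * recall / (precision + recall)

# Sensitivity for the malignant class, the quantity that matters for screening
malignant_recall = tn / (tn + fp)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print(f"malignant recall={malignant_recall:.4f}")
```

Under these assumptions the first four values match the table (0.9766, 0.9725, 0.9907, 0.9815), while recall on the malignant class would be 61/64, about 0.9531. If the goal is to minimize missed malignancies, that second figure is the one to report and optimize.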
+---
+# 💻 Installation
+## Dependencies
+```text
+scikit-learn>=1.0
+numpy
+pandas
+joblib
+huggingface_hub
+```
+Install dependencies:
+```bash
+pip install scikit-learn numpy pandas joblib huggingface_hub
+```
+---
+# 🚀 Dynamic Inference Example
+You can directly download and run the trained pipeline from Hugging Face Hub.
+```python
+import joblib
+import pandas as pd
+from huggingface_hub import hf_hub_download
+# Download model pipeline
+model_path = hf_hub_download(
+    repo_id="NethranjaliSE/Breast-Cancer-detection-using-ML-Algorithm",
+    filename="ensemble_soft_voting.pkl"
+)
+# Load pipeline
+pipeline = joblib.load(model_path)
+# Example sample input (30 WDBC features)
+sample_data = [[
+    14.12, 19.28, 91.96, 654.8, 0.096, 0.11, 0.08, 0.04, 0.18, 0.06,
+    0.25, 0.89, 1.82, 24.3, 0.006, 0.02, 0.02, 0.01, 0.01, 0.003,
+    16.26, 25.67, 107.26, 880.5, 0.132, 0.21, 0.19, 0.09, 0.28, 0.08
+]]
+# Reuse the training-time column names when the pipeline recorded them;
+# otherwise pandas falls back to integer column labels.
+feature_names = getattr(pipeline, "feature_names_in_", None)
+input_df = pd.DataFrame(sample_data, columns=feature_names)
+# Predict the class label and the per-class probabilities
+prediction = pipeline.predict(input_df)
+probabilities = pipeline.predict_proba(input_df)[0]
+# Label encoding used here: 0 = malignant, 1 = benign.
+# predict_proba columns follow pipeline.classes_, assumed to be [0, 1].
+diagnosis = (
+    "Benign (Low Risk)"
+    if prediction[0] == 1
+    else "Malignant (High Risk)"
+)
+print(f"Diagnostic Assessment: {diagnosis}")
+print(
+    f"Class Probabilities -> "
+    f"Malignant: {probabilities[0]:.4f} | "
+    f"Benign: {probabilities[1]:.4f}"
+)
+```
+---
+# 📂 Repository Structure
+```text
+.
+├── ensemble_soft_voting.pkl
+├── training_pipeline.ipynb
+├── requirements.txt
+└── README.md
+```
+---
+# ⚠️ Limitations & Intended Use
+This model is developed strictly for:
+- Academic research
+- Educational purposes
+- Machine learning experimentation
+- Pipeline prototyping
+It is **NOT** approved for:
+- Clinical deployment
+- Medical diagnosis
+- Real-world healthcare decision-making
+All diagnostic decisions must be made by qualified medical professionals using certified medical systems.
+---
+# 📜 Citations
+### Research Reference
+```bibtex
+@article{street1993nuclear,
+  title={Nuclear feature extraction for breast tumor diagnosis},
+  author={Street, W.N. and Wolberg, W.H. and Mangasarian, O.L.},
+  journal={IS\&T/SPIE Biomedical Imaging},
+  year={1993}
+}
+```
+### Dataset Reference
+- UCI Machine Learning Repository
+- Breast Cancer Wisconsin (Diagnostic) Dataset
+---
+# 🤝 Acknowledgements
+Special thanks to:
+- UCI Machine Learning Repository
+- Scikit-learn contributors
+- Hugging Face Hub
+- Open-source ML research community
+---
+# 🧠 Model Author
+**Sachini Praboda Nethranjali**
+Electronic and Computer Science Undergraduate
+University of Kelaniya, Sri Lanka