---
language:
- en
license: mit
tags:
- tabular
- classification
- scikit-learn
- ensemble-learning
- breast-cancer-detection
- medical-imaging
datasets:
- uci-wdbc
metrics:
- accuracy
- precision
- recall
- f1
- roc_auc
pipeline_tag: tabular-classification
---

# 🎗️ Breast Cancer Detection Ensemble Pipeline

A machine learning pipeline built around a **Soft-Voting Ensemble Classifier**. The model is trained on clinical data to distinguish malignant from benign tumors, with an emphasis on high sensitivity (recall) so that false negatives in diagnostic screening are kept to a minimum.

This repository follows the methodology discussed in *"Comparison of ML Algorithms for Breast Cancer Prediction" (CTEMS 2018)* and extends that baseline to a five-model ensemble in which every estimator is scaled inside its own pipeline.

---

# 📊 Model Description

The model uses a **Soft-Voting architecture**: it averages the predicted class probabilities of five diverse base estimators and predicts the class with the highest averaged probability. Each base classifier sits inside its own preprocessing pipeline with a `StandardScaler`, so scaling statistics are learned from training data only and no information leaks from held-out data.
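To make the aggregation step concrete, here is a minimal sketch of equal-weight soft voting for a single sample. The probability values are invented for illustration and are not outputs of the published model.

```python
import numpy as np

# Hypothetical predict_proba outputs from five base estimators for one sample.
# Columns are [P(class 0), P(class 1)]; the numbers are made up.
base_probas = np.array([
    [0.10, 0.90],
    [0.20, 0.80],
    [0.40, 0.60],
    [0.05, 0.95],
    [0.25, 0.75],
])

# Soft voting with equal weights: average the probabilities per class ...
ensemble_proba = base_probas.mean(axis=0)

# ... then predict the class with the highest averaged probability.
predicted_class = int(np.argmax(ensemble_proba))

print(ensemble_proba, predicted_class)
```

Unlike hard (majority) voting, a single very confident estimator can shift the averaged probability, which is why every base estimator needs to expose calibrated-looking probabilities.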

## Component Estimators

1. **Random Forest Classifier**
   - 72 estimators
   - Balanced class weights

2. **k-Nearest Neighbors (kNN)**
   - Euclidean distance metric
   - `k = 5`

3. **Gaussian Naive Bayes**
   - Probabilistic baseline classifier

4. **Support Vector Classifier (SVC)**
   - `rbf` kernel
   - Probability estimation enabled

5. **Logistic Regression**
   - Regularized linear classifier
   - Balanced class weights
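The five components above could be assembled roughly as follows. This is a sketch, not the author's training code: the helper `scaled`, the `random_state` values, `max_iter=1000`, and the reading of "balanced" as `class_weight="balanced"` are assumptions, and the data comes from scikit-learn's bundled copy of WDBC.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def scaled(estimator):
    """Hypothetical helper: wrap an estimator in its own scaling pipeline."""
    return make_pipeline(StandardScaler(), estimator)


ensemble = VotingClassifier(
    estimators=[
        ("rf", scaled(RandomForestClassifier(
            n_estimators=72, class_weight="balanced", random_state=42))),
        ("knn", scaled(KNeighborsClassifier(n_neighbors=5, metric="euclidean"))),
        ("nb", scaled(GaussianNB())),
        # probability=True is required so SVC can take part in soft voting.
        ("svc", scaled(SVC(kernel="rbf", probability=True, random_state=42))),
        ("lr", scaled(LogisticRegression(class_weight="balanced", max_iter=1000))),
    ],
    voting="soft",
)

# Fit on the full bundled dataset only to show that the object works end to end.
X, y = load_breast_cancer(return_X_y=True)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X[:3])
print(proba.shape)
```

Wrapping each estimator separately keeps the scaler inside whatever cross-validation or splitting is applied to the ensemble as a whole.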

---

# 📈 Dataset & Training Architecture

- **Dataset Source:** Wisconsin Diagnostic Breast Cancer (WDBC), UCI Machine Learning Repository
- **Instances:** 569 samples
  - 357 Benign
  - 212 Malignant
- **Features:** 30 real-valued features computed from digitized fine needle aspirate (FNA) images
- **Split Strategy:** Stratified train-test split
  - Training: 398 samples
  - Testing: 171 samples

The pipeline uses:
- `StratifiedKFold` cross-validation
- Leakage-free preprocessing
- Automated scaling pipelines
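The split and cross-validation setup can be sketched as below. A 30% stratified test split of 569 samples yields the 398/171 sizes listed above; the seed, the five folds, and the stand-in logistic regression model are assumptions, not details taken from the training notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stratified hold-out split: class proportions are preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# The scaler lives inside the pipeline, so each CV fold fits it on its own
# training portion only (no leakage from the validation portion).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="recall")

print(len(X_train), len(X_test))
print(scores.mean())
```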

---

# ⚑ Performance Metrics

Evaluation prioritizes **Recall (Sensitivity)** to reduce false negatives while maintaining strong overall classification accuracy.

| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| **Ensemble (Soft Voting)** | **0.9766** | **0.9725** | **0.9907** | **0.9815** | **0.9972** |
| Random Forest | 0.9649 | 0.9633 | 0.9813 | 0.9722 | 0.9936 |
| kNN | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9877 |
| Support Vector Machine | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9974 |
| Logistic Regression | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9969 |
| Naive Bayes | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9892 |

> **Note:** Results may vary slightly depending on package versions and random seeds.
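The metric columns can be computed with `sklearn.metrics` along these lines. The labels and scores below are toy values, not the model's predictions. Note that scikit-learn's binary metrics treat label `1` as the positive class by default, so with the encoding used in the inference example (0 = malignant, 1 = benign) malignant sensitivity requires `pos_label=0`. This card does not state which class the table treats as positive.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy data for illustration only: 0 = malignant, 1 = benign.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.7, 0.95, 0.4]  # P(benign)

accuracy = accuracy_score(y_true, y_pred)
precision_benign = precision_score(y_true, y_pred)   # positive class = 1
recall_benign = recall_score(y_true, y_pred)         # positive class = 1
f1_benign = f1_score(y_true, y_pred)

# Sensitivity for the malignant class: make label 0 the positive class.
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

# ROC-AUC uses the continuous scores rather than hard predictions.
auc = roc_auc_score(y_true, y_score)

print(accuracy, recall_benign, recall_malignant, auc)
```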

---

# 💻 Installation

## Dependencies

```text
scikit-learn>=1.0
numpy
pandas
joblib
huggingface_hub
```

Install dependencies:

```bash
pip install scikit-learn numpy pandas joblib huggingface_hub
```

---

# 🚀 Dynamic Inference Example

You can download the trained pipeline from the Hugging Face Hub and run it directly.

```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# Download model pipeline
model_path = hf_hub_download(
    repo_id="NethranjaliSE/Breast-Cancer-detection-using-ML-Algorithm",
    filename="ensemble_soft_voting.pkl"
)

# Load pipeline
pipeline = joblib.load(model_path)

# Example sample input (30 WDBC features)
sample_data = [[
    14.12, 19.28, 91.96, 654.8, 0.096, 0.11, 0.08, 0.04, 0.18, 0.06,
    0.25, 0.89, 1.82, 24.3, 0.006, 0.02, 0.02, 0.01, 0.01, 0.003,
    16.26, 25.67, 107.26, 880.5, 0.132, 0.21, 0.19, 0.09, 0.28, 0.08
]]

feature_names = (
    pipeline.feature_names_in_
    if hasattr(pipeline, "feature_names_in_")
    else None
)

input_df = pd.DataFrame(sample_data, columns=feature_names)

# Predict
prediction = pipeline.predict(input_df)
probabilities = pipeline.predict_proba(input_df)[0]

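# Label encoding assumed by this example: 1 = benign, 0 = malignant
diagnosis = (
    "Benign (Low Risk)"
    if prediction[0] == 1
    else "Malignant (High Risk)"
)

print(f"Diagnostic Assessment: {diagnosis}")

# predict_proba columns follow the class order, i.e. [malignant, benign] here
print(
    "Class probabilities -> "
    f"Malignant: {probabilities[0]:.4f} | "
    f"Benign: {probabilities[1]:.4f}"
)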
```
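The example above indexes the probability columns by position. A slightly more defensive variant reads the column order from `classes_` instead. The sketch below uses a small stand-in model trained on scikit-learn's bundled WDBC copy rather than the downloaded pipeline, so the stand-in model and the name `by_class` are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Stand-in model; the published ensemble would be loaded with joblib instead.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

# target_names is ['malignant', 'benign'], i.e. 0 = malignant, 1 = benign.
label_names = {i: str(name) for i, name in enumerate(data.target_names)}

proba = model.predict_proba(data.data[:1])[0]

# Pair each probability with the class label that its column represents.
by_class = {label_names[int(c)]: float(p) for c, p in zip(model.classes_, proba)}

print(by_class)
```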

---

# 📂 Repository Structure

```text
.
├── ensemble_soft_voting.pkl
├── training_pipeline.ipynb
├── requirements.txt
└── README.md
```

---

# ⚠️ Limitations & Intended Use

This model is developed strictly for:
- Academic research
- Educational purposes
- Machine learning experimentation
- Pipeline prototyping

It is **NOT** approved for:
- Clinical deployment
- Medical diagnosis
- Real-world healthcare decision-making

All diagnostic decisions must be performed by qualified medical professionals using certified medical systems.

---

# 📜 Citations

### Research Reference

```bibtex
@article{street1993nuclear,
  title={Nuclear feature extraction for breast tumor diagnosis},
  author={Street, W.N. and Wolberg, W.H. and Mangasarian, O.L.},
  journal={IS&T/SPIE Biomedical Imaging},
  year={1993}
}
```

### Dataset Reference

- UCI Machine Learning Repository  
- Breast Cancer Wisconsin (Diagnostic) Dataset

---

# 🀝 Acknowledgements

Special thanks to:
- UCI Machine Learning Repository
- Scikit-learn contributors
- Hugging Face Hub
- Open-source ML research community

---

# 🧠 Model Author

**Sachini Praboda Nethranjali**  
Electronic and Computer Science Undergraduate  
University of Kelaniya, Sri Lanka