NethranjaliSE committed on
Commit 46a5e5a · verified · 1 Parent(s): dfe4920

Create Readme

  1. README.md +230 -0
README.md ADDED
@@ -0,0 +1,230 @@
---
language:
- en
license: mit
tags:
- tabular
- classification
- scikit-learn
- ensemble-learning
- breast-cancer-detection
- medical-imaging
datasets:
- uci-wdbc
metrics:
- accuracy
- precision
- recall
- f1
- roc_auc
pipeline_tag: tabular-classification
---

# 🎗️ Breast Cancer Detection Ensemble Pipeline

A machine learning pipeline built around a **Soft-Voting Ensemble Classifier**. The model is trained on clinical data to distinguish malignant from benign tumors, with an emphasis on high sensitivity (recall) to limit false negatives in diagnostic screening.

This repository follows the methodology discussed in *"Comparison of ML Algorithms for Breast Cancer Prediction" (CTEMS 2018)* and extends that baseline framework to a five-model ensemble in which every model carries its own scaling step.

---

# 📊 Model Description

The model uses a **Soft-Voting architecture** that aggregates the predicted class probabilities of five diverse base estimators. Each classifier is wrapped in its own preprocessing pipeline with `StandardScaler`, so the scaling statistics are learned only from the data the pipeline is fitted on, which keeps preprocessing leakage-free.

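
To make the aggregation step concrete, here is a minimal pure-Python sketch of soft voting. The function name `soft_vote` and the probability values are hypothetical and chosen only for illustration; they are not outputs of the published model.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model class probabilities; return (winning class index, averages)."""
    n_models = len(probabilities)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return winner, averaged


# Hypothetical [P(malignant), P(benign)] outputs of five models for one sample
model_probs = [
    [0.40, 0.60],
    [0.45, 0.55],
    [0.48, 0.52],
    [0.90, 0.10],
    [0.85, 0.15],
]

winner, averaged = soft_vote(model_probs)
print(winner, [round(p, 3) for p in averaged])
```

In this made-up case three of the five models individually lean towards benign, yet the averaged probabilities favour malignant because the two dissenting models are far more confident. That is the practical difference between soft voting and a simple majority (hard) vote.
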
## Component Estimators

1. **Random Forest Classifier**
   - 72 estimators
   - Balanced class weights

2. **k-Nearest Neighbors (kNN)**
   - Euclidean distance metric
   - `k = 5`

3. **Gaussian Naive Bayes**
   - Probabilistic baseline classifier

4. **Support Vector Classifier (SVC)**
   - `rbf` kernel
   - Probability estimation enabled

5. **Logistic Regression**
   - Regularized linear classifier
   - Balanced class distributions

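
The component list above could be assembled in scikit-learn roughly as follows. This is a sketch, not the author's training code: the estimator names (`"rf"`, `"knn"`, ...), the helper `scaled`, `random_state=42`, `max_iter=1000`, and the reading of "balanced class distributions" as `class_weight="balanced"` are all assumptions. It fits on scikit-learn's bundled copy of the WDBC data only to show that the object works end to end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def scaled(estimator):
    """Wrap an estimator so that scaling is fitted together with it."""
    return make_pipeline(StandardScaler(), estimator)


ensemble = VotingClassifier(
    estimators=[
        ("rf", scaled(RandomForestClassifier(
            n_estimators=72, class_weight="balanced", random_state=42))),
        ("knn", scaled(KNeighborsClassifier(n_neighbors=5, metric="euclidean"))),
        ("nb", scaled(GaussianNB())),
        ("svc", scaled(SVC(kernel="rbf", probability=True, random_state=42))),
        # class_weight="balanced" is an assumed reading of the list above
        ("lr", scaled(LogisticRegression(class_weight="balanced", max_iter=1000))),
    ],
    voting="soft",  # combine predict_proba outputs instead of hard labels
)

X, y = load_breast_cancer(return_X_y=True)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X[:3])
print(proba.shape)
```
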
---

# 📈 Dataset & Training Architecture

- **Dataset Source:** Wisconsin Diagnostic Breast Cancer (WDBC), UCI Machine Learning Repository
- **Instances:** 569 samples
  - 357 Benign
  - 212 Malignant
- **Features:** 30 real-valued features computed from digitized fine needle aspirate (FNA) images
- **Split Strategy:** Stratified train-test split
  - Training: 398 samples
  - Testing: 171 samples

The pipeline uses:
- `StratifiedKFold` cross-validation
- Leakage-free preprocessing
- Automated scaling pipelines

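
A split of this shape and a leakage-free cross-validation loop can be sketched as below. The 398/171 sizes follow from `test_size=0.30` on 569 samples. The seed, the five folds, the `recall` scoring choice, and the single `LogisticRegression` stand-in model are assumptions made to keep the example short; `load_breast_cancer` is scikit-learn's bundled copy of WDBC. Because the scaler sits inside the pipeline, `cross_val_score` refits it on the training part of every fold, so no statistics from the validation part leak into preprocessing.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stratified hold-out split: class proportions are preserved in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42  # seed is an assumption
)

# Scaler and classifier are fitted together, fold by fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="recall")

print(X_train.shape, X_test.shape)
print(scores.mean())
```
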
---

# ⚡ Performance Metrics

Evaluation prioritizes **Recall (Sensitivity)** to reduce false negatives while maintaining strong overall classification accuracy.

| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| **Ensemble (Soft Voting)** | **0.9766** | **0.9725** | **0.9907** | **0.9815** | **0.9972** |
| Random Forest | 0.9649 | 0.9633 | 0.9813 | 0.9722 | 0.9936 |
| kNN | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9877 |
| Support Vector Machine | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9974 |
| Logistic Regression | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9969 |
| Naive Bayes | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9892 |

> **Note:** Results may vary slightly depending on package versions and random seeds.

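
The table does not state which class is treated as positive. As a sanity check, the sketch below recomputes the ensemble row from one set of confusion counts that reproduces it to four decimals. The counts are an inference, not taken from the source: they assume the 171-sample test set holds 107 benign and 64 malignant cases (what a stratified split would give) and that benign (label 1) is the positive class. Under that reading, recall 0.9907 corresponds to 106 of 107 benign cases, and the malignant-class recall would be 61/64, about 0.9531, so it is worth confirming the positive label before reading the Recall column as sensitivity to malignancy.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Inferred counts with benign as the positive class (an assumption, see above)
counts = {"tp": 106, "fp": 3, "fn": 1, "tn": 61}
metrics = binary_metrics(**counts)

# Recall for the malignant class under the same counts
malignant_recall = counts["tn"] / (counts["tn"] + counts["fp"])

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"malignant recall: {malignant_recall:.4f}")
```
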
---

# 💻 Installation

## Dependencies

```text
scikit-learn>=1.0
numpy
pandas
joblib
huggingface_hub
```

Install dependencies:

```bash
pip install scikit-learn numpy pandas joblib huggingface_hub
```

---

# 🚀 Dynamic Inference Example

You can download the trained pipeline directly from the Hugging Face Hub and run it locally.

```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# Download the serialized pipeline from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="NethranjaliSE/Breast-Cancer-detection-using-ML-Algorithm",
    filename="ensemble_soft_voting.pkl",
)

# Load the pipeline
pipeline = joblib.load(model_path)

# Example input: one sample with the 30 WDBC features
sample_data = [[
    14.12, 19.28, 91.96, 654.8, 0.096, 0.11, 0.08, 0.04, 0.18, 0.06,
    0.25, 0.89, 1.82, 24.3, 0.006, 0.02, 0.02, 0.01, 0.01, 0.003,
    16.26, 25.67, 107.26, 880.5, 0.132, 0.21, 0.19, 0.09, 0.28, 0.08,
]]

# Reuse the training column names when the pipeline recorded them
feature_names = (
    pipeline.feature_names_in_
    if hasattr(pipeline, "feature_names_in_")
    else None
)

input_df = pd.DataFrame(sample_data, columns=feature_names)

# Predict the class label and the class probabilities
prediction = pipeline.predict(input_df)
probabilities = pipeline.predict_proba(input_df)[0]

# This example assumes the label encoding 0 = malignant, 1 = benign
diagnosis = (
    "Benign (Low Risk)"
    if prediction[0] == 1
    else "Malignant (High Risk)"
)

print(f"Diagnostic Assessment: {diagnosis}")

print(
    f"Class probabilities -> "
    f"Malignant: {probabilities[0]:.4f} | "
    f"Benign: {probabilities[1]:.4f}"
)
```

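
The example above reads the probabilities by position. scikit-learn classifiers expose a `classes_` attribute, and the columns of `predict_proba` follow that order, so the labels can be attached explicitly instead. The helper `label_probabilities` and the `LABEL_NAMES` mapping below are hypothetical, and the 0 = malignant, 1 = benign encoding is the same assumption the example makes. With the loaded model, the call would be `label_probabilities(pipeline.classes_, pipeline.predict_proba(input_df)[0], LABEL_NAMES)`; the sketch uses made-up numbers so it runs without downloading anything.

```python
# Assumed encoding, matching the inference example: 0 = malignant, 1 = benign
LABEL_NAMES = {0: "Malignant", 1: "Benign"}


def label_probabilities(classes, proba_row, names):
    """Pair each probability with the name of the class it belongs to."""
    return {names[int(c)]: float(p) for c, p in zip(classes, proba_row)}


# Made-up values standing in for pipeline.classes_ and one predict_proba row
example_classes = [0, 1]
example_row = [0.0312, 0.9688]

labeled = label_probabilities(example_classes, example_row, LABEL_NAMES)
most_likely = max(labeled, key=labeled.get)
print(labeled, most_likely)
```
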
---

# 📂 Repository Structure

```text
.
├── ensemble_soft_voting.pkl
├── training_pipeline.ipynb
├── requirements.txt
└── README.md
```

---

# ⚠️ Limitations & Intended Use

This model is developed strictly for:
- Academic research
- Educational purposes
- Machine learning experimentation
- Pipeline prototyping

It is **NOT** approved for:
- Clinical deployment
- Medical diagnosis
- Real-world healthcare decision-making

All diagnostic decisions must be made by qualified medical professionals using certified medical systems.

---

# 📜 Citations

### Research Reference

```bibtex
@article{street1993nuclear,
  title={Nuclear feature extraction for breast tumor diagnosis},
  author={Street, W.N. and Wolberg, W.H. and Mangasarian, O.L.},
  journal={IS&T/SPIE Biomedical Imaging},
  year={1993}
}
```

### Dataset Reference

- UCI Machine Learning Repository
- Breast Cancer Wisconsin (Diagnostic) Dataset

---

# 🤝 Acknowledgements

Special thanks to:
- UCI Machine Learning Repository
- Scikit-learn contributors
- Hugging Face Hub
- Open-source ML research community

---

# 🧠 Model Author

**Sachini Praboda Nethranjali**
Electronic and Computer Science Undergraduate
University of Kelaniya, Sri Lanka