ZainabEman committed on
Commit 0a70de8 · verified · 1 Parent(s): 3715ed5

Update README.md

Files changed (1): README.md (+298 -9)
---
title: Assignment03
emoji: 🐒
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# 📘 Project Documentation

## **Title:** Comparative Feature Engineering and Visualization of Disease Text Data Using TF-IDF and One-Hot Encoding

---

## 🧪 Task 01: TF-IDF Feature Extraction

### 🔹 Subtask 1: Parsing Textual Data into Lists

**➤ What We Did:**
The dataset columns `Risk Factors`, `Symptoms`, and `Signs` were originally stored as stringified Python lists, e.g. `"['fever', 'stress']"`. These string representations were parsed back into actual Python list objects.

**➤ Why We Did It:**
Raw list-like strings cannot be processed directly by text-vectorization tools. We needed valid Python lists to:
- Combine all features into a single document per disease.
- Make them compatible with natural language processing (NLP) techniques.
- Enable further transformations, such as joining into a string and applying vectorization methods.

**➤ Tools Used:**
- `ast.literal_eval()` from Python's built-in `ast` module, which safely evaluates string literals into actual Python list objects without executing arbitrary code.
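A minimal sketch of this parsing step (the raw value is illustrative); in practice it would typically be applied per cell, e.g. via `DataFrame.apply`:

```python
import ast

# A raw cell value as stored in the dataset: a stringified Python list.
raw = "['fever', 'stress']"

# literal_eval parses Python literals safely, without executing arbitrary code.
parsed = ast.literal_eval(raw)

print(type(parsed).__name__, parsed)  # list ['fever', 'stress']
```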
---

### 🔹 Subtask 2: Convert Lists into Strings

**➤ What We Did:**
Each list (e.g., `['fever', 'stress']`) was converted into a space-separated string such as `"fever stress"`.

**➤ Why We Did It:**
NLP techniques such as TF-IDF require raw text input in the form of plain strings. This conversion allows:
- Treating each disease record as a document.
- Ensuring compatibility with vectorizers that expect string inputs.

**➤ Tools Used:**
- Python's `str.join()` method, used to concatenate the items of each list into a single space-separated string per record.
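The conversion itself is a one-liner (values illustrative):

```python
# One parsed record.
symptoms = ['fever', 'stress']

# str.join concatenates the list items into one space-separated document.
document = " ".join(symptoms)

print(document)  # fever stress
```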
---

### 🔹 Subtasks 3 & 4: TF-IDF Vectorization

**➤ What We Did:**
We applied **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorization to each of the following:
- Risk Factors
- Symptoms
- Signs

Each category was processed independently using its own `TfidfVectorizer`.

**➤ Why TF-IDF Was Chosen:**
- Unlike one-hot encoding, which treats all terms equally, **TF-IDF emphasizes terms that are frequent in a document but rare across documents**.
- This yields more meaningful, discriminative features.
- It is useful when the goal is to highlight domain-specific disease terms (e.g., `"chest_pain"` for cardiovascular vs. `"tremors"` for neurological).

**➤ Tools Used:**
- `TfidfVectorizer` from `sklearn.feature_extraction.text`.

**➤ Results:**

| Category     | Rows | Features |
|--------------|------|----------|
| Risk Factors | 25   | 360      |
| Symptoms     | 25   | 424      |
| Signs        | 25   | 236      |
---

### 🔹 Subtask 5: Combine TF-IDF Matrices

**➤ What We Did:**
We horizontally stacked the three separate TF-IDF matrices into one unified feature matrix.

**➤ Why:**
- To obtain a single consolidated representation of all textual features (Risk Factors, Symptoms, and Signs).
- Needed for downstream processes such as dimensionality reduction, classification, or clustering.

**➤ Result:**
- Final matrix shape: **25 diseases × 1020 features**
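The horizontal stacking can be sketched with `scipy.sparse.hstack`; the three matrices below are random stand-ins with the shapes reported above:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Illustrative stand-ins for the three per-category TF-IDF matrices.
X_risk = csr_matrix(np.random.rand(25, 360))
X_symptoms = csr_matrix(np.random.rand(25, 424))
X_signs = csr_matrix(np.random.rand(25, 236))

# Column-wise concatenation: same 25 rows, 360 + 424 + 236 = 1020 columns.
X_all = hstack([X_risk, X_symptoms, X_signs]).tocsr()

print(X_all.shape)  # (25, 1020)
```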
---

### 🔹 Subtask 6: Compare with One-Hot Encoded Matrix

**➤ What We Did:**
We compared the resulting TF-IDF matrix with the **one-hot encoded matrix** provided in `encoded_output2.csv`. The comparison focused on:
- Matrix shape (number of features)
- Sparsity (percentage of zero values)
- Unique features (terms)

**➤ Results:**

| Feature Encoding | Shape     | Sparsity | Unique Features |
|------------------|-----------|----------|-----------------|
| TF-IDF           | (25,1020) | 92.96%   | 1020            |
| One-Hot          | (25,496)  | 95.33%   | 496             |

**➤ Interpretation:**
- **TF-IDF** produced a richer, more detailed representation, albeit slightly less sparse.
- **One-Hot Encoding** was simpler, faster to compute, and highly sparse, but lacked semantic depth.
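The sparsity column can be computed as the percentage of zero entries; a small sketch on a toy one-hot-style matrix:

```python
import numpy as np

def sparsity(matrix: np.ndarray) -> float:
    """Percentage of zero entries in a matrix."""
    return 100.0 * (matrix == 0).sum() / matrix.size

# Illustrative one-hot-style matrix: 25 rows, 4 columns, only 2 nonzero cells.
X = np.zeros((25, 4))
X[0, 1] = 1.0
X[3, 2] = 1.0

print(round(sparsity(X), 2))  # 98.0
```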
---

## 📉 Task 02: Dimensionality Reduction & Visualization

### 🔹 Subtask 1: Apply PCA and Truncated SVD

**➤ What We Did:**
We applied two dimensionality reduction techniques to both the **TF-IDF** and **One-Hot** matrices:
- **PCA (Principal Component Analysis)**: works best with dense matrices.
- **Truncated SVD**: suited to sparse matrices; in text mining it is often called **Latent Semantic Analysis**.

We reduced dimensions to:
- **3 components** for explained-variance analysis.
- **2 components** for visualization in 2D.

**➤ Why:**
- High-dimensional data is often noisy and hard to visualize.
- Reducing to fewer components helps reveal hidden patterns, clusters, and similarities between diseases.

**➤ Tools Used:**
- `PCA` from `sklearn.decomposition`
- `TruncatedSVD` from `sklearn.decomposition`

**➤ Results – Explained Variance Ratios (Top 3 Components):**

| Method        | Matrix  | Explained Variance       |
|---------------|---------|--------------------------|
| PCA           | One-Hot | [0.1054, 0.0917, 0.0678] |
| PCA           | TF-IDF  | [0.0656, 0.0586, 0.0568] |
| Truncated SVD | One-Hot | [0.0225, 0.0920, 0.0891] |
| Truncated SVD | TF-IDF  | [0.0089, 0.0657, 0.0572] |

**➤ Interpretation:**
- **PCA on One-Hot** retained the most variance in its first component.
- **TF-IDF** showed more evenly distributed variance, reflecting richer and more distributed semantic features.
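A minimal sketch of both reductions, applied to a random stand-in matrix with the same shape as the combined TF-IDF matrix:

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((25, 1020))  # illustrative stand-in for the combined feature matrix

# PCA centers the data and is typically run on a dense matrix.
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)  # one ratio per component

# TruncatedSVD skips centering, so it also works directly on sparse input.
svd = TruncatedSVD(n_components=3, random_state=0)
svd.fit(X)
print(svd.explained_variance_ratio_)
```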
---

### 🔹 Subtask 2: 2D Visualization of Reduced Dimensions

**➤ What We Did:**
We created **2D scatter plots** from the PCA and Truncated SVD projections for both encodings. Each disease was color-coded by its category:
- Cardiovascular
- Neurological
- Respiratory
- Endocrine
- Other

**➤ Tools Used:**
- `matplotlib.pyplot`
- `seaborn.scatterplot`

**➤ Observations:**

| Method        | Clustering Observed   | Interpretation                                                      |
|---------------|-----------------------|---------------------------------------------------------------------|
| PCA – One-Hot | ✅ Distinct clusters   | Clear grouping; e.g., cardiovascular diseases sat close together    |
| PCA – TF-IDF  | ⚠️ Mixed clusters      | Points overlapped due to the high dimensionality and dense features |
| SVD – One-Hot | ✅ Moderate structure  | Reasonable groupings, though less tight than PCA                    |
| SVD – TF-IDF  | ❌ Overlapping, noisy  | Features too rich for a 2D projection; mostly noise                 |
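A plain-matplotlib sketch of one such category-colored scatter plot (the projection, labels, and output filename are illustrative; the actual plots also used `seaborn.scatterplot`):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((25, 2))                       # illustrative 2D PCA projection
categories = np.array(["Cardiovascular", "Neurological", "Respiratory",
                       "Endocrine", "Other"] * 5)  # illustrative category labels

fig, ax = plt.subplots(figsize=(6, 4))
for cat in np.unique(categories):
    mask = categories == cat
    ax.scatter(points[mask, 0], points[mask, 1], label=cat)

ax.set_title("PCA - One-Hot Encoded Features (2D)")
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.legend()
fig.savefig("pca_onehot_2d.png", dpi=150)
```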
## 🖼️ Dimensionality Reduction Visualizations

Below are the 2D scatter plots obtained after applying PCA and Truncated SVD to both the One-Hot and TF-IDF feature matrices.

---

### 🔷 PCA Results
![PCA - One-Hot Encoded Features (2D)](img1.png)
![PCA - TF-IDF Features (2D)](img2.png)

---

### 🔶 Truncated SVD Results
![Truncated SVD - One-Hot Encoded Features (2D)](img3.png)
![Truncated SVD - TF-IDF Features (2D)](img4.png)
# 🧪 Task 3: Model Training, Evaluation, and Comparison

## 🎯 Objective

The goal of Task 3 was to evaluate and compare the performance of classification models on disease data represented through two different feature-encoding strategies:

- **TF-IDF vectorized features**
- **One-Hot encoded features**

We benchmarked the models along the following dimensions:

- **Models**: K-Nearest Neighbors (KNN), Logistic Regression
- **KNN distance metrics**: Euclidean, Manhattan, Cosine
- **KNN k-values**: 3, 5, 7
- **Evaluation metrics**: Accuracy, Precision, Recall, F1-Score
- **Validation method**: 5-Fold Cross-Validation

---

## 🧩 Step-by-Step Breakdown

### 🔹 3.1 Train KNN Models

We trained KNN classifiers over the full grid of the following parameters:

- **k-values**: 3, 5, 7
- **Distance metrics**: Euclidean, Manhattan, Cosine
- **Encodings**: TF-IDF and One-Hot

#### ❌ Issue Faced:
All KNN models returned **0% accuracy**, and **cross-validation failed** with `NaN` F1-scores.

#### 🛑 Root Cause:
- The dataset contains **25 unique diseases**, each with only **one sample**.
- Under **K-Fold Cross-Validation**, no class can appear in both the training and test folds.
- As a result:
  - The model never sees the test class during training.
  - `f1_score()` raises a `ValueError` when the expected `pos_label=1` is missing from the test split.
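The root cause can be demonstrated on synthetic data with the same one-sample-per-class structure: any held-out class is absent from training, so no classifier can predict it.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 25 classes, one sample each: an illustrative stand-in for the disease matrix.
rng = np.random.default_rng(0)
X = rng.random((25, 10))
y = np.arange(25)  # every label is unique

# Hold out the last 5 samples: their classes never appear in training.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X[:20], y[:20])
accuracy = (knn.predict(X[20:]) == y[20:]).mean()

print(accuracy)  # 0.0 -- the held-out classes were never seen in training
```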
---

### 🔹 3.2 Report Accuracy, Precision, Recall, and F1-Score

We initially used standard scikit-learn scoring strings such as `'f1'` during cross-validation.

#### ❌ Problem:
- `cross_val_score(..., scoring='f1')` raised runtime errors because the expected `pos_label` was missing.

#### ✅ Resolution:
- Switched to **custom scorers** built with `make_scorer(f1_score, average='macro')`.
- Replaced `cross_val_score()` with a **manual `KFold` loop** to gain control over per-fold evaluation and handle edge cases safely.
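A sketch of the manual `KFold` loop with macro-averaged F1 (the data here is synthetic, with several samples per class so that the loop produces meaningful scores):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((25, 10))   # illustrative feature matrix
y = np.arange(25) % 3      # illustrative labels: 3 classes, several samples each

accuracies, f1s = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = KNeighborsClassifier(n_neighbors=3, metric="cosine")
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    accuracies.append(accuracy_score(y[test_idx], pred))
    # average='macro' avoids the binary pos_label=1 assumption on multiclass targets.
    f1s.append(f1_score(y[test_idx], pred, average="macro", zero_division=0))

print(np.mean(accuracies), np.mean(f1s))
```

The explicit loop makes it possible to catch or log per-fold edge cases (e.g., a class missing from a test fold) instead of letting `cross_val_score` fail with `NaN`.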
---

### 🔹 3.3 Train Logistic Regression

We trained **Logistic Regression** on both:
- the **TF-IDF matrix**
- the **One-Hot matrix**

using the same 5-Fold Cross-Validation and **custom macro-averaged scoring strategy**.

#### 🛑 Observations:
- Logistic Regression also failed to learn anything.
- Accuracy and F1-Score were **0.0 across all folds**, again due to the **one-instance-per-class** issue.
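For completeness, the same macro-averaged scoring can also be wired into `cross_validate` directly; a sketch on synthetic data with repeated classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.random((25, 10))
y = np.arange(25) % 3  # illustrative labels with repeated classes

scoring = {
    "accuracy": "accuracy",
    "f1_macro": make_scorer(f1_score, average="macro", zero_division=0),
}
results = cross_validate(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring=scoring)

print(results["test_accuracy"].mean(), results["test_f1_macro"].mean())
```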
---

### 🔹 3.4 Compare All Configurations

We compiled a table comparing configurations across models, encodings, and parameters:

| Model                | Matrix  | Mean Accuracy | Mean F1-Score |
|----------------------|---------|---------------|---------------|
| KNN (k=3, cosine)    | TF-IDF  | 0.0           | 0.0           |
| KNN (k=5, manhattan) | One-Hot | 0.0           | 0.0           |
| Logistic Regression  | TF-IDF  | 0.0           | 0.0           |
| Logistic Regression  | One-Hot | 0.0           | 0.0           |
| ...                  | ...     | ...           | ...           |

All configurations yielded zero performance.

---
## 🛠️ What We Changed (Summary)

| **Problem**                                     | **Action Taken**                                          |
|-------------------------------------------------|-----------------------------------------------------------|
| `ValueError` due to missing `pos_label=1`       | Switched to `make_scorer(f1_score, average='macro')`      |
| `NaN` scores from `cross_val_score()`           | Replaced with a manual `KFold` loop for safer evaluation  |
| One-instance-per-class structure in the dataset | Identified as a blocker for all supervised classification |

---
## 📌 Conclusion & Next Steps

### 🔚 **Conclusion:**

- **All models failed to classify correctly** due to the **one-instance-per-class** limitation of the dataset.
- F1-scores and accuracy remained **zero** for both KNN and Logistic Regression, regardless of:
  - Encoding (TF-IDF or One-Hot)
  - Model type
  - Distance metric

---

### ✅ **Recommended Fix Going Forward:**

To make the classification task **feasible and meaningful**, we recommend the following:

- **Group diseases into broader categories** (e.g., Cardiovascular, Neurological, Respiratory).
- Replace the original target label `Disease` with a new label `Category`.

#### Benefits of This Change:
- Ensures **multiple samples per class**
- Allows the models to **learn patterns** and generalize
- Produces **non-zero performance metrics**
- Makes **supervised learning** applicable
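The relabeling step can be sketched as a simple dictionary lookup; the disease-to-category mapping below is hypothetical, and a real version would cover all 25 diseases:

```python
# Hypothetical mapping from individual diseases to broader categories.
disease_to_category = {
    "Hypertension": "Cardiovascular",
    "Arrhythmia": "Cardiovascular",
    "Epilepsy": "Neurological",
    "Migraine": "Neurological",
    "Asthma": "Respiratory",
}

# Replace the fine-grained Disease label with the broader Category label.
diseases = ["Hypertension", "Migraine", "Asthma", "Arrhythmia"]
labels = [disease_to_category[d] for d in diseases]

print(labels)  # ['Cardiovascular', 'Neurological', 'Respiratory', 'Cardiovascular']
```

With categories as targets, each class has multiple samples, so stratified cross-validation and all four evaluation metrics become meaningful.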
---

> 📣 This step is critical for turning the dataset into a usable form for machine learning. Without it, all supervised classification models will fail, regardless of their complexity or tuning.