ZainabEman committed on
Commit ddb53b4 · verified · 1 Parent(s): 0a70de8

Update README.md

Files changed (1)
  1. README.md +8 -299
README.md CHANGED
@@ -1,301 +1,10 @@
- # 📘 Project Documentation
-
- ## **Title:** Comparative Feature Engineering and Visualization of Disease Text Data Using TF-IDF and One-Hot Encoding
-
  ---
-
- ## 🧪 Task 01: TF-IDF Feature Extraction
-
- ### 🔹 Subtask 1: Parsing Textual Data into Lists
-
- **➤ What We Did:**
- The dataset columns `Risk Factors`, `Symptoms`, and `Signs` were originally stored as stringified Python lists (e.g., `"['fever', 'stress']"`). These string representations were parsed back into actual Python list objects.
-
- **➤ Why We Did It:**
- List-like strings cannot be processed directly for text vectorization. We needed valid Python lists to:
- Combine all features into a single document per disease.
- Make them compatible with natural language processing (NLP) techniques.
- Enable further transformations, such as joining into a string and applying vectorization methods.
-
- **➤ Tools Used:**
- `ast.literal_eval()` from Python's built-in `ast` module, which safely evaluates string literals into actual Python list objects without executing arbitrary code.
-
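The parsing step can be sketched as follows; the column values below are toy stand-ins for the dataset's actual entries:

```python
import ast

import pandas as pd

# Toy frame mimicking the stringified-list columns (illustrative values only).
df = pd.DataFrame({"Symptoms": ["['fever', 'stress']", "['cough', 'wheezing']"]})

# ast.literal_eval safely turns "['fever', 'stress']" into a real Python list
# without executing arbitrary code (unlike eval).
df["Symptoms"] = df["Symptoms"].apply(ast.literal_eval)

print(df["Symptoms"][0])  # ['fever', 'stress']
```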
  ---
-
- ### 🔹 Subtask 2: Convert Lists into Strings
-
- **➤ What We Did:**
- Each list (e.g., `['fever', 'stress']`) was converted into a space-separated string like `"fever stress"`.
-
- **➤ Why We Did It:**
- NLP techniques such as TF-IDF require plain-string input. This conversion allows:
- Treating each disease record as a document.
- Ensuring compatibility with vectorizers that expect string inputs.
-
- **➤ Tools Used:**
- Python's `str.join()` method to concatenate list items into a single space-separated string per record.
-
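A minimal sketch of the list-to-string conversion, on toy lists:

```python
# Each parsed list becomes one space-separated "document" string.
records = [["fever", "stress"], ["chest pain", "fatigue"]]
docs = [" ".join(terms) for terms in records]
print(docs)  # ['fever stress', 'chest pain fatigue']
```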
- ---
-
- ### 🔹 Subtasks 3 & 4: TF-IDF Vectorization
-
- **➤ What We Did:**
- We applied **TF-IDF (Term Frequency–Inverse Document Frequency)** vectorization to each of the following:
- Risk Factors
- Symptoms
- Signs
-
- Each category was processed independently using its own `TfidfVectorizer`.
-
- **➤ Why TF-IDF Was Chosen:**
- Unlike one-hot encoding, which treats all terms equally, **TF-IDF emphasizes terms that are frequent in a document but rare across documents**.
- This yields more meaningful, discriminative features.
- Useful when the goal is to highlight domain-specific disease terms (e.g., `"chest_pain"` for cardiovascular vs. `"tremors"` for neurological).
-
- **➤ Tools Used:**
- `TfidfVectorizer` from `sklearn.feature_extraction.text`.
-
- **➤ Results:**
-
- | Category     | Rows | Features |
- |--------------|------|----------|
- | Risk Factors | 25   | 360      |
- | Symptoms     | 25   | 424      |
- | Signs        | 25   | 236      |
-
- ---
-
- ### 🔹 Subtask 5: Combine TF-IDF Matrices
-
- **➤ What We Did:**
- We horizontally stacked the three separate TF-IDF matrices into one unified feature matrix.
-
- **➤ Why:**
- To obtain a single consolidated representation of all textual features (Risk Factors, Symptoms, and Signs).
- Needed for downstream steps such as dimensionality reduction, classification, or clustering.
-
- **➤ Result:**
- Final matrix shape: **25 diseases × 1020 features** (360 + 424 + 236)
-
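The horizontal stacking can be sketched like this, with two tiny stand-in matrices in place of the three real ones:

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy per-category matrices standing in for Symptoms and Risk Factors.
X_sym = TfidfVectorizer().fit_transform(["fever stress", "cough fever"])
X_risk = TfidfVectorizer().fit_transform(["smoking obesity", "asthma smoking"])

# Horizontal stacking keeps one row per disease and concatenates the columns.
X_all = sp.hstack([X_sym, X_risk])
print(X_all.shape)  # (2, 6): 3 symptom terms + 3 risk-factor terms
```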
- ---
-
- ### 🔹 Subtask 6: Compare with One-Hot Encoded Matrix
-
- **➤ What We Did:**
- We compared the resulting TF-IDF matrix with the **one-hot encoded matrix** provided in `encoded_output2.csv`. The comparison focused on:
- Matrix shape (number of features)
- Sparsity (percentage of zero values)
- Unique features (terms)
-
- **➤ Results:**
-
- | Feature Encoding | Shape     | Sparsity | Unique Features |
- |------------------|-----------|----------|-----------------|
- | TF-IDF           | (25,1020) | 92.96%   | 1020            |
- | One-Hot          | (25,496)  | 95.33%   | 496             |
-
- **➤ Interpretation:**
- **TF-IDF** produced a richer, more detailed representation, albeit slightly less sparse.
- **One-Hot Encoding** was simpler, faster to compute, and highly sparse, but it carries no information about term importance.
-
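Sparsity values like those in the table can be computed as the percentage of exactly-zero entries; a tiny stand-in matrix illustrates the formula:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in matrix: 4 zeros out of 6 entries.
X = csr_matrix(np.array([[0.0, 0.5, 0.0],
                         [0.0, 0.0, 0.9]]))

# Sparsity = share of entries that are exactly zero; .nnz counts the non-zeros.
sparsity = 100.0 * (1 - X.nnz / (X.shape[0] * X.shape[1]))
print(f"{sparsity:.2f}%")  # 66.67%
```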
- ---
-
- ## 📉 Task 02: Dimensionality Reduction & Visualization
-
- ### 🔹 Subtask 1: Apply PCA and Truncated SVD
-
- **➤ What We Did:**
- Applied two dimensionality reduction techniques to both the **TF-IDF** and **One-Hot** matrices:
- **PCA (Principal Component Analysis)**: Works best with dense matrices.
- **Truncated SVD**: Suitable for sparse matrices; on text data it is often called **Latent Semantic Analysis**.
-
- We reduced dimensions to:
- **3 components** for explained variance analysis.
- **2 components** for visualization in 2D.
-
- **➤ Why:**
- High-dimensional data is often noisy and hard to visualize.
- Reducing to fewer components helps reveal hidden patterns, clusters, and similarities between diseases.
-
- **➤ Tools Used:**
- `PCA` from `sklearn.decomposition`
- `TruncatedSVD` from `sklearn.decomposition`
-
- **➤ Results – Explained Variance Ratios (Top 3 Components):**
-
- | Method        | Matrix  | Explained Variance       |
- |---------------|---------|--------------------------|
- | PCA           | One-Hot | [0.1054, 0.0917, 0.0678] |
- | PCA           | TF-IDF  | [0.0656, 0.0586, 0.0568] |
- | Truncated SVD | One-Hot | [0.0225, 0.0920, 0.0891] |
- | Truncated SVD | TF-IDF  | [0.0089, 0.0657, 0.0572] |
-
- **➤ Interpretation:**
- **PCA on One-Hot** retained the most variance in its first component.
- **TF-IDF** had more evenly distributed variance, reflecting features spread across many terms.
-
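The two reductions can be sketched on random stand-in matrices of the same row count (25 diseases); the column count and values below are illustrative, not the project's:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X_dense = rng.random((25, 40))     # stand-in for the dense One-Hot matrix
X_sparse = csr_matrix(X_dense)     # stand-in for the sparse TF-IDF matrix

pca = PCA(n_components=3).fit(X_dense)            # PCA requires dense input
svd = TruncatedSVD(n_components=3).fit(X_sparse)  # TruncatedSVD accepts sparse

print(pca.explained_variance_ratio_)  # three ratios, each between 0 and 1
print(svd.explained_variance_ratio_)
```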
- ---
-
- ### 🔹 Subtask 2: 2D Visualization of Reduced Dimensions
-
- **➤ What We Did:**
- Created **2D scatter plots** from the PCA and Truncated SVD projections of both encodings. Each disease was color-coded by its category:
- Cardiovascular
- Neurological
- Respiratory
- Endocrine
- Other
-
- **➤ Tools Used:**
- `matplotlib.pyplot`
- `seaborn.scatterplot`
-
- **➤ Observations:**
-
- | Method        | Clustering Observed  | Interpretation                                                    |
- |---------------|----------------------|-------------------------------------------------------------------|
- | PCA – One-Hot | ✅ Distinct clusters  | Clear grouping, e.g., cardiovascular diseases were close together |
- | PCA – TF-IDF  | ⚠️ Mixed clusters    | Dense, high-dimensional features caused points to overlap         |
- | SVD – One-Hot | ✅ Moderate structure | Reasonable groupings, though less tight than PCA                  |
- | SVD – TF-IDF  | ❌ Overlapping, noisy | Features too rich for a 2D projection, producing more noise       |
-
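A sketch of how such a category-colored 2D plot can be produced. The coordinates and category assignments are random stand-ins, and plain `matplotlib` is used here where the project used `seaborn.scatterplot`:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
coords = rng.random((25, 2))  # stand-in for the 2D PCA/SVD projection
labels = np.array(["Cardiovascular", "Neurological", "Respiratory",
                   "Endocrine", "Other"])[rng.integers(0, 5, size=25)]

fig, ax = plt.subplots()
for cat in np.unique(labels):
    mask = labels == cat
    ax.scatter(coords[mask, 0], coords[mask, 1], label=cat)  # one color per category
ax.set_title("PCA projection (2D), colored by disease category")
ax.legend()
fig.savefig("pca_2d.png")
```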
- ## 🖼️ Dimensionality Reduction Visualizations
-
- Below are the 2D scatter plots obtained after applying PCA and Truncated SVD to both the One-Hot and TF-IDF feature matrices.
-
- ---
-
- ### 🔷 PCA Results
- ![PCA - One-Hot Encoded Features (2D)](img1.png)
- ![PCA - TF-IDF Features (2D)](img2.png)
-
- ---
-
- ### 🔶 Truncated SVD Results
- ![Truncated SVD - One-Hot Encoded Features (2D)](img3.png)
- ![Truncated SVD - TF-IDF Features (2D)](img4.png)
-
-
- # 🧪 Task 3: Model Training, Evaluation, and Comparison
-
- ## 🎯 Objective
-
- The goal of Task 3 was to evaluate and compare the performance of classification models on disease data represented through two different feature encoding strategies:
-
- **TF-IDF vectorized features**
- **One-Hot encoded features**
-
- We aimed to benchmark the models on the following:
-
- **Models**: K-Nearest Neighbors (KNN), Logistic Regression
- **KNN Distance Metrics**: Euclidean, Manhattan, Cosine
- **KNN k-values**: 3, 5, 7
- **Evaluation Metrics**: Accuracy, Precision, Recall, F1-Score
- **Validation Method**: 5-Fold Cross-Validation
-
- ---
-
- ## 🧩 Step-by-Step Breakdown
-
- ### 🔹 3.1 Train KNN Models
-
- We trained KNN classifiers over every combination of the following parameters:
-
- **k-values**: 3, 5, 7
- **Distance metrics**: Euclidean, Manhattan, Cosine
- **Encodings**: TF-IDF and One-Hot
-
- #### ❌ Issue Faced:
- All KNN models returned **0% accuracy**, and **cross-validation failed** with `NaN` F1-scores.
-
- #### 🛑 Root Cause:
- The dataset includes **25 unique diseases**, each with only **one sample**.
- In **K-Fold Cross-Validation**, the single sample of each class lands in either the training fold or the test fold, never both.
- As a result:
- The model is always asked to predict classes it has never seen during training.
- `f1_score()` raises a `ValueError` because its default binary averaging expects `pos_label=1`, which is absent from a multi-class test split.
-
- ---
-
- ### 🔹 3.2 Report Accuracy, Precision, Recall, and F1-Score
-
- We attempted to use standard scikit-learn scoring strings such as `'f1'` during cross-validation.
-
- #### ❌ Problem:
- `cross_val_score(..., scoring='f1')` led to runtime errors because the binary-only `pos_label` default does not apply to a multi-class target.
-
- #### ✅ Resolution:
- Switched to **custom scorers** using `make_scorer(f1_score, average='macro')`.
- Replaced `cross_val_score()` with a **manual `KFold` loop** to gain control over per-fold evaluation and handle edge cases safely.
-
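Both fixes can be sketched on toy data with several samples per class; the real dataset, with one sample per disease, is exactly what made these scores collapse:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy multi-class data with 20 samples per class (unlike the real dataset).
X, y = make_classification(n_samples=60, n_classes=3, n_informative=5,
                           random_state=0)

model = KNeighborsClassifier(n_neighbors=3, metric="cosine")

# Fix 1: a macro-averaged scorer sidesteps the binary pos_label=1 default.
macro_f1 = make_scorer(f1_score, average="macro")
cv_scores = cross_val_score(model, X, y, cv=5, scoring=macro_f1)

# Fix 2: a manual KFold loop gives per-fold control over the evaluation.
fold_f1 = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_f1.append(f1_score(y[test_idx], pred, average="macro"))

print(np.mean(cv_scores), np.mean(fold_f1))
```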
- ---
-
- ### 🔹 3.3 Train Logistic Regression
-
- We trained **Logistic Regression** on both:
- **TF-IDF matrix**
- **One-Hot matrix**
-
- using the same 5-Fold Cross-Validation and **custom macro scoring strategy**.
-
- #### 🛑 Observations:
- Logistic Regression fared no better.
- Accuracy and F1-Score were **0.0 across all folds**, again due to the **one-instance-per-class** issue.
-
- ---
-
- ### 🔹 3.4 Compare All Configurations
-
- We compiled a table comparing configurations of models, encodings, and parameters:
-
- | Model                | Matrix  | Mean Accuracy | Mean F1-Score |
- |----------------------|---------|---------------|---------------|
- | KNN (k=3, cosine)    | TF-IDF  | 0.0           | 0.0           |
- | KNN (k=5, manhattan) | One-Hot | 0.0           | 0.0           |
- | Logistic Regression  | TF-IDF  | 0.0           | 0.0           |
- | Logistic Regression  | One-Hot | 0.0           | 0.0           |
- | ...                  | ...     | ...           | ...           |
-
- All configurations yielded zero performance.
-
- ---
-
- ## 🛠️ What We Changed (Summary)
-
- | **Problem**                                     | **Action Taken**                                          |
- |-------------------------------------------------|-----------------------------------------------------------|
- | `ValueError` due to missing `pos_label=1`       | Switched to `make_scorer(f1_score, average='macro')`      |
- | `NaN` scores from `cross_val_score()`           | Replaced with manual `KFold` loop for safer evaluation    |
- | One-instance-per-class structure in the dataset | Identified as a blocker for all supervised classification |
-
- ---
-
- ## 📌 Conclusion & Next Steps
-
- ### 🔚 **Conclusion:**
-
- **All models failed to classify correctly** due to the **one-instance-per-class** limitation in the dataset.
- F1-scores and accuracy remained **zero** across both KNN and Logistic Regression models, regardless of:
- Encoding (TF-IDF or One-Hot)
- Model type
- Distance metric
-
- ---
-
- ### ✅ **Recommended Fix Going Forward:**
-
- To make the classification task **feasible and meaningful**, we recommend the following:
-
- **Group diseases into broader categories** (e.g., Cardiovascular, Neurological, Respiratory).
- Replace the original target label `Disease` with a new label `Category`.
-
- #### Benefits of This Change:
- Ensures **multiple samples per class**
- Allows the models to **learn patterns** and generalize
- Produces **non-zero performance metrics**
- Makes **supervised learning** applicable
-
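The relabeling could look like this; the disease names and category assignments below are illustrative, not the project's actual mapping:

```python
import pandas as pd

# Hypothetical mapping; the real one would cover all 25 diseases.
disease_to_category = {
    "Hypertension": "Cardiovascular",
    "Stroke": "Neurological",
    "Asthma": "Respiratory",
}

df = pd.DataFrame({"Disease": ["Hypertension", "Stroke", "Asthma"]})
df["Category"] = df["Disease"].map(disease_to_category)  # new target label
print(df["Category"].tolist())  # ['Cardiovascular', 'Neurological', 'Respiratory']
```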
- ---
-
- > 📣 This step is critical for transforming the dataset into a usable format for machine learning. Without this change, all supervised classification models will fail regardless of their complexity or tuning.
-
 
 
 
 
 
  ---
+ title: My Cool App
+ emoji: 🚀
+ colorFrom: indigo
+ colorTo: pink
+ sdk: gradio
+ sdk_version: "4.18.0"
+ app_file: app.py
+ pinned: false
  ---