# 📘 Project Documentation

## **Title:** Comparative Feature Engineering and Visualization of Disease Text Data Using TF-IDF and One-Hot Encoding

---

## 🧮 Task 01: Feature Engineering

### 🔹 Subtask 1: Convert String Representations into Lists

**➤ Why We Did It:**

Raw string data in the form of list-like strings cannot be processed directly for text vectorization. We needed valid Python lists to:

- Combine all features into a single document per disease.
- Make them compatible with natural language processing (NLP) techniques.
- Enable further transformations, such as joining into a string and applying vectorization methods.

**➤ Tools Used:**

- `ast.literal_eval()` from Python's built-in `ast` module, which safely evaluates string literals into actual Python list objects without executing arbitrary code.
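A minimal sketch of this step; the raw value shown is illustrative, not taken from the dataset:

```python
import ast

# Hypothetical raw cell value as it appears in the CSV: a list-like string.
raw = "['fever', 'stress']"

# literal_eval safely parses Python literals (lists, strings, numbers)
# without executing arbitrary code, unlike eval().
symptoms = ast.literal_eval(raw)

print(symptoms)        # ['fever', 'stress']
print(type(symptoms))  # <class 'list'>
```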
---

### 🔹 Subtask 2: Convert Lists into Strings

**➤ What We Did:**

Each list (e.g., `['fever', 'stress']`) was converted into a space-separated string such as `"fever stress"`.

**➤ Why We Did It:**

NLP techniques such as TF-IDF require raw text input in the form of plain strings. This conversion allows:

- Treating each disease record as a document.
- Ensuring compatibility with vectorizers that expect string inputs.

**➤ Tools Used:**

- Python's `str.join()` method to concatenate list items into a single space-separated string per record.
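A one-line sketch of the conversion, on an illustrative list:

```python
# Hypothetical parsed list for one disease record.
symptoms = ['fever', 'stress']

# str.join() turns the list into the plain-string "document"
# that text vectorizers expect.
doc = " ".join(symptoms)

print(doc)  # fever stress
```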
---

### 🔹 Subtask 3 & 4: TF-IDF Vectorization

**➤ What We Did:**

We applied **TF-IDF (Term Frequency–Inverse Document Frequency)** vectorization to each of the following:

- Risk Factors
- Symptoms
- Signs

Each category was processed independently using its own `TfidfVectorizer`.

**➤ Why TF-IDF Was Chosen:**

- Unlike one-hot encoding, which treats all terms equally, **TF-IDF emphasizes terms that are frequent in a document but rare across documents**.
- This yields more meaningful and discriminative features.
- It is useful when the goal is to highlight domain-specific disease terms (e.g., `"chest_pain"` for cardiovascular vs. `"tremors"` for neurological).

**➤ Tools Used:**

- `TfidfVectorizer` from `sklearn.feature_extraction.text`.

**➤ Results:**

| Category     | Rows | Features |
|--------------|------|----------|
| Risk Factors | 25   | 360      |
| Symptoms     | 25   | 424      |
| Signs        | 25   | 236      |
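A minimal sketch of per-category vectorization, using made-up symptom documents rather than the project's real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the per-disease "documents"; the real project builds
# one such list per category (Risk Factors, Symptoms, Signs).
symptom_docs = [
    "chest_pain shortness_of_breath fatigue",
    "tremors rigidity slow_movement",
    "fever cough shortness_of_breath",
]

# Each category gets its own vectorizer, so its vocabulary stays separate.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(symptom_docs)

print(tfidf_matrix.shape)  # (n_documents, n_unique_terms)
```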
---

### 🔹 Subtask 5: Combine TF-IDF Matrices

**➤ What We Did:**

We horizontally stacked the three separate TF-IDF matrices into one unified feature matrix.

**➤ Why:**

- To obtain a single consolidated representation of all textual features (Risk Factors, Symptoms, and Signs).
- Needed for downstream processes such as dimensionality reduction, classification, or clustering.

**➤ Result:**

- Final matrix shape: **25 diseases × 1020 features**
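The stacking step can be sketched as follows; the three matrices here are random stand-ins with the shapes reported above:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Random stand-ins for the three per-category TF-IDF matrices,
# all sharing the same 25 rows (one per disease).
risk = csr_matrix(np.random.rand(25, 360))
symptoms = csr_matrix(np.random.rand(25, 424))
signs = csr_matrix(np.random.rand(25, 236))

# Horizontal stacking keeps one row per disease and concatenates the features.
combined = hstack([risk, symptoms, signs])

print(combined.shape)  # (25, 1020)
```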
---

### 🔹 Subtask 6: Compare with the One-Hot Encoded Matrix

**➤ What We Did:**

We compared the resulting TF-IDF matrix with the **one-hot encoded matrix** provided in `encoded_output2.csv`. The comparison focused on:

- Matrix shape (number of features)
- Sparsity (percentage of zero values)
- Unique features (terms)

**➤ Results:**

| Feature Encoding | Shape      | Sparsity | Unique Features |
|------------------|------------|----------|-----------------|
| TF-IDF           | (25, 1020) | 92.96%   | 1020            |
| One-Hot          | (25, 496)  | 95.33%   | 496             |

**➤ Interpretation:**

- **TF-IDF** produced a richer, more detailed representation, albeit slightly less sparse.
- **One-hot encoding** was simpler, faster to compute, and highly sparse, but lacked semantic depth.
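The sparsity figures above come down to counting zero entries; a toy stand-in matrix makes the calculation concrete:

```python
import numpy as np

# Toy matrix standing in for a vectorized feature matrix.
matrix = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.0],
])

# Sparsity = fraction of entries that are exactly zero.
sparsity = 1.0 - np.count_nonzero(matrix) / matrix.size
print(f"{sparsity:.2%}")  # 75.00%
```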
---

## 📉 Task 02: Dimensionality Reduction & Visualization

### 🔹 Subtask 1: Apply PCA and Truncated SVD

**➤ What We Did:**

We applied two dimensionality reduction techniques to both the **TF-IDF** and **One-Hot** matrices:

- **PCA (Principal Component Analysis)**: works best with dense matrices.
- **Truncated SVD**: suited to sparse matrices; in text mining it is often called **Latent Semantic Analysis**.

We reduced the dimensions to:

- **3 components** for explained-variance analysis.
- **2 components** for visualization in 2D.

**➤ Why:**

- High-dimensional data is often noisy and hard to visualize.
- Reducing to fewer components helps reveal hidden patterns, clusters, and similarities between diseases.

**➤ Tools Used:**

- `PCA` from `sklearn.decomposition`
- `TruncatedSVD` from `sklearn.decomposition`

**➤ Results – Explained Variance Ratios (Top 3 Components):**

| Method        | Matrix  | Explained Variance       |
|---------------|---------|--------------------------|
| PCA           | One-Hot | [0.1054, 0.0917, 0.0678] |
| PCA           | TF-IDF  | [0.0656, 0.0586, 0.0568] |
| Truncated SVD | One-Hot | [0.0225, 0.0920, 0.0891] |
| Truncated SVD | TF-IDF  | [0.0089, 0.0657, 0.0572] |

**➤ Interpretation:**

- **PCA on One-Hot** retained the most variance in its first component.
- **TF-IDF** showed more evenly distributed variance, reflecting richer and more distributed semantic features.
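A minimal sketch of both reductions on a random stand-in matrix (shapes chosen to match the 25-disease dataset; the variance values will of course differ from the table above):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
features = rng.random((25, 100))  # stand-in for a 25-disease feature matrix

# PCA centers the data first, which is why it is usually run on dense input.
pca = PCA(n_components=3)
pca_coords = pca.fit_transform(features)
print(pca.explained_variance_ratio_)

# TruncatedSVD skips centering, so it also works directly on sparse input.
svd = TruncatedSVD(n_components=2)
svd_coords = svd.fit_transform(features)

print(pca_coords.shape, svd_coords.shape)  # (25, 3) (25, 2)
```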
---

### 🔹 Subtask 2: 2D Visualization of Reduced Dimensions

**➤ What We Did:**

We created **2D scatter plots** from the PCA and Truncated SVD projections for both encodings. Each disease was color-coded by its category:

- Cardiovascular
- Neurological
- Respiratory
- Endocrine
- Other

**➤ Tools Used:**

- `matplotlib.pyplot`
- `seaborn.scatterplot`

**➤ Observations:**

| Method        | Clustering Observed   | Interpretation                                                    |
|---------------|-----------------------|-------------------------------------------------------------------|
| PCA – One-Hot | ✅ Distinct clusters   | Clear grouping; e.g., cardiovascular diseases sat close together  |
| PCA – TF-IDF  | ⚠️ Mixed clusters      | Points overlapped due to the high-dimensional, dense features     |
| SVD – One-Hot | ✅ Moderate structure  | Reasonable groupings, though less tight than PCA                  |
| SVD – TF-IDF  | ❌ Overlapping, noisy  | Features too rich for a 2D projection; mostly noise               |
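A bare-bones version of such a category-colored scatter plot, using matplotlib directly on random stand-in coordinates (the real plots used `seaborn.scatterplot`; names and data here are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
coords = rng.random((25, 2))         # stand-in for 2D PCA/SVD coordinates
categories = rng.integers(0, 5, 25)  # stand-in for the 5 disease categories

# One scatter call, colored by category; seaborn.scatterplot with
# hue="Category" produces the same picture plus a per-category legend.
fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1], c=categories, cmap="tab10")
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.set_title("2D projection of disease feature vectors")
fig.savefig("projection.png")
```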

## 🖼️ Dimensionality Reduction Visualizations

Below are the 2D scatter plots obtained after applying PCA and Truncated SVD to both the One-Hot and TF-IDF feature matrices.

---

### 🔷 PCA Results

![PCA on One-Hot Encoded Data](images/pca_onehot.png)

![PCA on TF-IDF Data](images/pca_tfidf.png)

---

### 🔶 Truncated SVD Results

![Truncated SVD on One-Hot Encoded Data](images/svd_onehot.png)

![Truncated SVD on TF-IDF Data](images/svd_tfidf.png)
# 🧪 Task 3: Model Training, Evaluation, and Comparison

## 🎯 Objective

The goal of Task 3 was to evaluate and compare the performance of classification models on disease data represented through two different feature-encoding strategies:

- **TF-IDF vectorized features**
- **One-Hot encoded features**

We aimed to benchmark the models along the following dimensions:

- **Models**: K-Nearest Neighbors (KNN), Logistic Regression
- **KNN distance metrics**: Euclidean, Manhattan, Cosine
- **KNN k-values**: 3, 5, 7
- **Evaluation metrics**: Accuracy, Precision, Recall, F1-Score
- **Validation method**: 5-Fold Cross-Validation

---

## 🧩 Step-by-Step Breakdown

### 🔹 3.1 Train KNN Models

We trained KNN classifiers over the full combination of the following parameters:

- **k-values**: 3, 5, 7
- **Distance metrics**: Euclidean, Manhattan, Cosine
- **Encodings**: TF-IDF and One-Hot

#### ❌ Issue Faced:

All KNN models returned **0% accuracy**, and **cross-validation failed** with `NaN` F1-scores.

#### 🛑 Root Cause:

- The dataset contains **25 unique diseases**, each with only **one sample**.
- In **K-Fold Cross-Validation**, no class can therefore appear in both the training and test folds.
- As a result:
  - The model never encounters a test-fold class during training.
  - `f1_score()` raises a `ValueError` when the expected `pos_label=1` is missing from the test split.

---

### 🔹 3.2 Report Accuracy, Precision, Recall, and F1-Score

We initially used standard scikit-learn scoring strings such as `'f1'` during cross-validation.

#### ❌ Problem:

- `cross_val_score(..., scoring='f1')` led to runtime errors due to the missing `pos_label` (the `'f1'` scorer assumes binary classification).

#### ✅ Resolution:

- Switched to **custom scorers** via `make_scorer(f1_score, average='macro')`.
- Replaced `cross_val_score()` with a **manual `KFold` loop** to gain control over per-fold evaluation and handle edge cases safely.
---

### 🔹 3.3 Train Logistic Regression

We trained **Logistic Regression** on both:

- the **TF-IDF matrix**
- the **One-Hot matrix**

using the same 5-Fold Cross-Validation and the **custom macro-averaged scoring strategy**.

#### 🛑 Observations:

- Logistic Regression also failed to learn anything useful.
- Accuracy and F1-score were **0.0 across all folds**, again due to the **one-instance-per-class** issue.

---

### 🔹 3.4 Compare All Configurations

We compiled a table comparing multiple configurations of models, encodings, and parameters:

| Model                | Matrix  | Mean Accuracy | Mean F1-Score |
|----------------------|---------|---------------|---------------|
| KNN (k=3, cosine)    | TF-IDF  | 0.0           | 0.0           |
| KNN (k=5, manhattan) | One-Hot | 0.0           | 0.0           |
| Logistic Regression  | TF-IDF  | 0.0           | 0.0           |
| Logistic Regression  | One-Hot | 0.0           | 0.0           |
| ...                  | ...     | ...           | ...           |

All configurations yielded zero performance.

---

## 🛠️ What We Changed (Summary)

| **Problem**                                     | **Action Taken**                                          |
|-------------------------------------------------|-----------------------------------------------------------|
| `ValueError` due to missing `pos_label=1`       | Switched to `make_scorer(f1_score, average='macro')`      |
| `NaN` scores from `cross_val_score()`           | Replaced with a manual `KFold` loop for safer evaluation  |
| One-instance-per-class structure in the dataset | Identified as a blocker for all supervised classification |

---

## 📌 Conclusion & Next Steps

### 🔚 **Conclusion:**

- **All models failed to classify correctly** due to the **one-instance-per-class** limitation of the dataset.
- F1-scores and accuracy remained **zero** across both KNN and Logistic Regression, regardless of:
  - encoding (TF-IDF or One-Hot)
  - model type
  - distance metric

---

### ✅ **Recommended Fix Going Forward:**

To make the classification task **feasible and meaningful**, we recommend the following:

- **Group diseases into broader categories** (e.g., Cardiovascular, Neurological, Respiratory).
- Replace the original target label `Disease` with a new label `Category`.

#### Benefits of This Change:

- Ensures **multiple samples per class**
- Allows the models to **learn patterns** and generalize
- Produces **non-zero performance metrics**
- Makes **supervised learning** applicable

---

> 📣 This step is critical for transforming the dataset into a usable format for machine learning. Without this change, all supervised classification models will fail regardless of their complexity or tuning.
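A sketch of the proposed relabeling; the disease names and category assignments below are illustrative only, not the project's actual mapping (with pandas this would be `df["Category"] = df["Disease"].map(...)`):

```python
# Hypothetical disease-to-category mapping; the real grouping would be
# defined over all 25 diseases in the dataset.
disease_to_category = {
    "Hypertension": "Cardiovascular",
    "Arrhythmia": "Cardiovascular",
    "Epilepsy": "Neurological",
    "Asthma": "Respiratory",
}

diseases = ["Hypertension", "Epilepsy", "Asthma", "Arrhythmia"]

# New target label: Category instead of Disease, which yields
# multiple samples per class and makes cross-validation meaningful.
categories = [disease_to_category[d] for d in diseases]
print(categories)
```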

The updated README replaces this documentation with the Space's YAML front matter:

---
title: My Cool App
emoji: 🚀
colorFrom: indigo
colorTo: pink
sdk: gradio
sdk_version: "4.18.0"
app_file: app.py
pinned: false
---