# 📄 Project Documentation

## **Title:** Comparative Feature Engineering and Visualization of Disease Text Data Using TF-IDF and One-Hot Encoding

---

## 🧪 Task 01: TF-IDF Feature Extraction

### 🔹 Subtask 1: Parsing Textual Data into Lists

**➤ What We Did:**
The dataset columns `Risk Factors`, `Symptoms`, and `Signs` were originally stored as stringified Python lists, e.g. `"['fever', 'stress']"`. These string representations were parsed back into actual Python list objects.

**➤ Why We Did It:**
List-like strings cannot be fed directly to text vectorizers. We needed valid Python lists to:
- Combine all features into a single document per disease.
- Make them compatible with natural language processing (NLP) techniques.
- Enable further transformations, such as joining the items into a string and applying vectorization methods.

**➤ Tools Used:**
- `ast.literal_eval()` from Python's built-in `ast` module, which safely evaluates string literals into actual Python objects without executing arbitrary code.

---

### 🔹 Subtask 2: Convert Lists into Strings

**➤ What We Did:**
Each list (e.g., `['fever', 'stress']`) was converted into a space-separated string such as `"fever stress"`.

**➤ Why We Did It:**
NLP techniques such as TF-IDF require raw text input in the form of plain strings. This conversion allows:
- Treating each disease record as a document.
- Ensuring compatibility with vectorizers that expect string inputs.

**➤ Tools Used:**
- Python's `str.join()` method to concatenate list items into a single space-separated string per record.

---
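Subtasks 1 and 2 can be sketched together as below. This is a minimal illustration, not the project's actual code; the column values and disease names are placeholders standing in for the real dataset.

```python
import ast

import pandas as pd

# Toy stand-in for the real dataset: a list-valued column stored as strings.
df = pd.DataFrame({
    "Disease": ["Hypertension", "Migraine"],
    "Symptoms": ["['headache', 'fatigue']", "['headache', 'nausea']"],
})

# Subtask 1: safely parse the stringified lists back into Python lists.
df["Symptoms"] = df["Symptoms"].apply(ast.literal_eval)

# Subtask 2: join each list into one space-separated document string.
df["Symptoms_text"] = df["Symptoms"].apply(" ".join)

print(df["Symptoms_text"].tolist())  # ['headache fatigue', 'headache nausea']
```

`ast.literal_eval` only accepts Python literals (strings, numbers, lists, dicts, ...), which is why it is safer here than `eval`.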
### 🔹 Subtask 3 & 4: TF-IDF Vectorization

**➤ What We Did:**
We applied **TF-IDF (Term Frequency–Inverse Document Frequency)** vectorization to each of the following:
- Risk Factors
- Symptoms
- Signs

Each category was processed independently using its own `TfidfVectorizer`.

**➤ Why TF-IDF Was Chosen:**
- Unlike one-hot encoding, which treats all terms equally, **TF-IDF emphasizes terms that are frequent in a document but rare across documents**.
- This yields more meaningful and discriminative features.
- It is useful when the goal is to highlight domain-specific disease terms (e.g., `"chest_pain"` for cardiovascular vs. `"tremors"` for neurological).

**➤ Tools Used:**
- `TfidfVectorizer` from `sklearn.feature_extraction.text`.

**➤ Results:**

| Category     | Rows | Features |
|--------------|------|----------|
| Risk Factors | 25   | 360      |
| Symptoms     | 25   | 424      |
| Signs        | 25   | 236      |

---
### 🔹 Subtask 5: Combine TF-IDF Matrices

**➤ What We Did:**
We horizontally stacked the three separate TF-IDF matrices into one unified feature matrix.

**➤ Why:**
- To obtain a single consolidated representation of all textual features (Risk Factors, Symptoms, and Signs).
- Needed for downstream processes such as dimensionality reduction, classification, or clustering.

**➤ Result:**
- Final matrix shape: **25 diseases × 1020 features** (360 + 424 + 236 = 1020)

---
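Horizontal stacking of sparse matrices can be done with `scipy.sparse.hstack`; a small sketch with random placeholder matrices of the shapes reported above:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

rng = np.random.default_rng(0)

# Placeholder sparse TF-IDF matrices for the three categories:
# same number of rows (diseases), different numbers of columns (terms).
X_risk = csr_matrix(rng.random((25, 360)))
X_symptoms = csr_matrix(rng.random((25, 424)))
X_signs = csr_matrix(rng.random((25, 236)))

# Horizontally stack them into one combined feature matrix.
X_combined = hstack([X_risk, X_symptoms, X_signs]).tocsr()

print(X_combined.shape)  # (25, 1020)
```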
### 🔹 Subtask 6: Compare with One-Hot Encoded Matrix

**➤ What We Did:**
We compared the resulting TF-IDF matrix with the **one-hot encoded matrix** provided in `encoded_output2.csv`. The comparison focused on:
- Matrix shape (number of features)
- Sparsity (percentage of zero values)
- Unique features (terms)

**➤ Results:**

| Feature Encoding | Shape     | Sparsity | Unique Features |
|------------------|-----------|----------|-----------------|
| TF-IDF           | (25,1020) | 92.96%   | 1020            |
| One-Hot          | (25,496)  | 95.33%   | 496             |

**➤ Interpretation:**
- **TF-IDF** produced a richer and more detailed representation, albeit slightly less sparse.
- **One-Hot Encoding** was simpler, faster to compute, and highly sparse, but lacked semantic depth.

---
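The sparsity figures above can be computed as the percentage of zero entries. A minimal sketch (the matrix contents are illustrative, not the project's actual data):

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparsity(matrix) -> float:
    """Percentage of zero entries in a scipy sparse matrix."""
    total = matrix.shape[0] * matrix.shape[1]
    return 100.0 * (1 - matrix.nnz / total)


# Illustrative matrix: 25 x 4 entries, of which 10 are nonzero.
dense = np.zeros((25, 4))
dense[:10, 0] = 1.0
X = csr_matrix(dense)

print(round(sparsity(X), 2))  # 90.0
```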
## 📊 Task 02: Dimensionality Reduction & Visualization

### 🔹 Subtask 1: Apply PCA and Truncated SVD

**➤ What We Did:**
Applied two dimensionality reduction techniques to both the **TF-IDF** and **One-Hot** matrices:
- **PCA (Principal Component Analysis)**: works best with dense matrices.
- **Truncated SVD**: suitable for sparse matrices; in text mining it is often called **Latent Semantic Analysis (LSA)**.

We reduced the dimensions to:
- **3 components** for explained-variance analysis.
- **2 components** for visualization in 2D.

**➤ Why:**
- High-dimensional data is often noisy and hard to visualize.
- Reducing to a few components helps reveal hidden patterns, clusters, and similarities between diseases.

**➤ Tools Used:**
- `PCA` from `sklearn.decomposition`
- `TruncatedSVD` from `sklearn.decomposition`

**➤ Results – Explained Variance Ratios (Top 3 Components):**

| Method        | Matrix  | Explained Variance       |
|---------------|---------|--------------------------|
| PCA           | One-Hot | [0.1054, 0.0917, 0.0678] |
| PCA           | TF-IDF  | [0.0656, 0.0586, 0.0568] |
| Truncated SVD | One-Hot | [0.0225, 0.0920, 0.0891] |
| Truncated SVD | TF-IDF  | [0.0089, 0.0657, 0.0572] |

**➤ Interpretation:**
- **PCA on One-Hot** retained the most variance in its first component.
- **TF-IDF** had more evenly distributed variance, reflecting richer and more distributed semantic features.

---
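The reduction step can be sketched as follows; the input matrix here is random placeholder data, not the project's feature matrices.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((25, 50))  # placeholder: 25 diseases x 50 features

# PCA with 3 components for explained-variance analysis.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # one ratio per component

# Truncated SVD (LSA) also accepts sparse input directly,
# which is why it suits the TF-IDF matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X)
print(X_svd.shape)  # (25, 2) -> ready for 2D plotting
```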
### 🔹 Subtask 2: 2D Visualization of Reduced Dimensions

**➤ What We Did:**
Created **2D scatter plots** from the PCA and Truncated SVD projections of both encodings. Each disease was color-coded by its category:
- Cardiovascular
- Neurological
- Respiratory
- Endocrine
- Other

**➤ Tools Used:**
- `matplotlib.pyplot`
- `seaborn.scatterplot`

**➤ Observations:**

| Method        | Clustering Observed   | Interpretation                                                       |
|---------------|-----------------------|----------------------------------------------------------------------|
| PCA – One-Hot | ✅ Distinct clusters  | Clear grouping, e.g., cardiovascular diseases plotted close together |
| PCA – TF-IDF  | ⚠️ Mixed clusters     | Points overlapped due to the high-dimensional, dense features        |
| SVD – One-Hot | ✅ Moderate structure | Reasonable groupings, though less tight than PCA                     |
| SVD – TF-IDF  | ❌ Overlapping, noisy | Features too rich for a 2D view, producing mostly noise              |
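The plotting step can be sketched with plain `matplotlib` (the project also used `seaborn.scatterplot`, which wraps the same idea); the feature matrix and category labels below are illustrative placeholders.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((25, 50))  # placeholder feature matrix
categories = np.array(["Cardiovascular", "Neurological", "Respiratory",
                       "Endocrine", "Other"] * 5)

# Project to 2D, then draw one color-coded scatter series per category.
X_2d = PCA(n_components=2).fit_transform(X)
fig, ax = plt.subplots()
for cat in np.unique(categories):
    mask = categories == cat
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=cat)
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.legend()

# Save to an in-memory buffer (a real script would use fig.savefig("plot.png")).
buf = io.BytesIO()
fig.savefig(buf, format="png")
```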
## 🖼️ Dimensionality Reduction Visualizations

Below are the 2D scatter plots obtained after applying PCA and Truncated SVD to both the One-Hot and TF-IDF feature matrices.

---

### 🔷 PCA Results
![PCA One-Hot](images/pca_onehot.png)
![PCA TF-IDF](images/pca_tfidf.png)

---

### 🔶 Truncated SVD Results
![SVD One-Hot](images/svd_onehot.png)
![SVD TF-IDF](images/svd_tfidf.png)

# 🧪 Task 3: Model Training, Evaluation, and Comparison

## 🎯 Objective

The goal of Task 3 was to evaluate and compare the performance of classification models on the disease data represented through two different feature encoding strategies:

- **TF-IDF vectorized features**
- **One-Hot encoded features**

We benchmarked the models along the following dimensions:

- **Models**: K-Nearest Neighbors (KNN), Logistic Regression
- **KNN distance metrics**: Euclidean, Manhattan, Cosine
- **KNN k-values**: 3, 5, 7
- **Evaluation metrics**: Accuracy, Precision, Recall, F1-Score
- **Validation method**: 5-Fold Cross-Validation

---

## 🧩 Step-by-Step Breakdown

### 🔹 3.1 Train KNN Models

We trained KNN classifiers over the full grid of the following parameters:

- **k-values**: 3, 5, 7
- **Distance metrics**: Euclidean, Manhattan, Cosine
- **Encodings**: TF-IDF and One-Hot

#### ❌ Issue Faced:
All KNN models returned **0% accuracy**, and **cross-validation failed** with `NaN` F1-scores.

#### 🔍 Root Cause:
- The dataset contains **25 unique diseases**, each with only **one sample**.
- In **K-Fold Cross-Validation**, a class with a single sample can never appear in both the training fold and the test fold.
- As a result:
  - The model is always tested on classes it has never seen.
  - `f1_score()` throws a `ValueError` when the expected `pos_label=1` is missing from the test split.

---
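The parameter grid from 3.1 can be sketched as below. Note the synthetic data here deliberately has several samples per class, so the loop actually runs; with the real one-sample-per-class disease data, this is exactly the setup that produced 0% accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative multi-class data with several samples per class.
X, y = make_classification(n_samples=60, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# Full grid: 3 k-values x 3 distance metrics = 9 KNN configurations.
results = {}
for k in (3, 5, 7):
    for metric in ("euclidean", "manhattan", "cosine"):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        scores = cross_val_score(knn, X, y, cv=5)  # 5-fold accuracy
        results[(k, metric)] = scores.mean()

print(max(results, key=results.get))  # best (k, metric) pair
```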
### 🔹 3.2 Report Accuracy, Precision, Recall, and F1-Score

We initially used standard scikit-learn scoring strings such as `'f1'` during cross-validation.

#### ❌ Problem:
- `cross_val_score(..., scoring='f1')` raised runtime errors: the `'f1'` scorer assumes a binary target with `pos_label=1`, which our multi-class labels do not provide.

#### ✅ Resolution:
- Switched to **custom scorers** using `make_scorer(f1_score, average='macro')`.
- Replaced `cross_val_score()` with a **manual `KFold` loop** to gain control over per-fold evaluation and handle edge cases safely.

---
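The resolution can be sketched as follows, again on synthetic placeholder data: a macro-averaged F1 scorer built with `make_scorer`, evaluated inside a manual `KFold` loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=60, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# Macro-averaged F1 handles multi-class targets, unlike scoring='f1'.
macro_f1 = make_scorer(f1_score, average="macro")

# Manual KFold loop: full control over what happens in each fold.
fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(macro_f1(model, X[test_idx], y[test_idx]))

print(np.mean(fold_scores))  # mean macro-F1 across the 5 folds
```

Inside the loop one could also skip or specially handle degenerate folds, which is what `cross_val_score` did not allow.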
### 🔹 3.3 Train Logistic Regression

We trained **Logistic Regression** on both:
- the **TF-IDF matrix**
- the **One-Hot matrix**

using the same 5-Fold Cross-Validation and the **custom macro-averaged scoring strategy**.

#### 📊 Observations:
- Logistic Regression also failed to learn anything useful.
- Accuracy and F1-Score were **0.0 across all folds**, again due to the **one-instance-per-class** issue.

---

### 🔹 3.4 Compare All Configurations

We compiled a table comparing the configurations of models, encodings, and parameters:

| Model                | Matrix  | Mean Accuracy | Mean F1-Score |
|----------------------|---------|---------------|---------------|
| KNN (k=3, cosine)    | TF-IDF  | 0.0           | 0.0           |
| KNN (k=5, manhattan) | One-Hot | 0.0           | 0.0           |
| Logistic Regression  | TF-IDF  | 0.0           | 0.0           |
| Logistic Regression  | One-Hot | 0.0           | 0.0           |
| ...                  | ...     | ...           | ...           |

All configurations yielded zero performance.

---
## 🛠️ What We Changed (Summary)

| **Problem**                                     | **Action Taken**                                          |
|-------------------------------------------------|-----------------------------------------------------------|
| `ValueError` due to missing `pos_label=1`       | Switched to `make_scorer(f1_score, average='macro')`      |
| `NaN` scores from `cross_val_score()`           | Replaced with a manual `KFold` loop for safer evaluation  |
| One-instance-per-class structure in the dataset | Identified as a blocker for all supervised classification |

---
## 📌 Conclusion & Next Steps

### 🔍 **Conclusion:**

- **All models failed to classify correctly** due to the **one-instance-per-class** limitation of the dataset.
- F1-scores and accuracy remained at **zero** across both KNN and Logistic Regression, regardless of:
  - Encoding (TF-IDF or One-Hot)
  - Model type
  - Distance metric

---

### ✅ **Recommended Fix Going Forward:**

To make the classification task **feasible and meaningful**, we recommend the following:

- **Group diseases into broader categories** (e.g., Cardiovascular, Neurological, Respiratory).
- Replace the original target label `Disease` with a new label `Category`.

#### Benefits of This Change:
- Ensures **multiple samples per class**
- Allows the models to **learn patterns** and generalize
- Produces **non-zero performance metrics**
- Makes **supervised learning** applicable

---
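The recommended relabeling could look like this; the disease-to-category mapping below is hypothetical and would need to cover all 25 diseases in the real dataset.

```python
import pandas as pd

# Hypothetical mapping from individual diseases to broader categories.
disease_to_category = {
    "Hypertension": "Cardiovascular",
    "Arrhythmia": "Cardiovascular",
    "Migraine": "Neurological",
    "Epilepsy": "Neurological",
    "Asthma": "Respiratory",
}

df = pd.DataFrame({"Disease": list(disease_to_category)})

# Replace the one-sample-per-class target with a coarser Category label.
df["Category"] = df["Disease"].map(disease_to_category)

print(df["Category"].value_counts().to_dict())
```

With `Category` as the target, each class has multiple samples, so stratified splits and cross-validation become possible.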

> 📣 This step is critical for transforming the dataset into a usable format for machine learning. Without this change, all supervised classification models will fail regardless of their complexity or tuning.