# 📄 Project Documentation

## **Title:** Comparative Feature Engineering and Visualization of Disease Text Data Using TF-IDF and One-Hot Encoding

---

## 🧪 Task 01: TF-IDF Feature Extraction

### 🔹 Subtask 1: Parsing Textual Data into Lists

**➤ What We Did:**
The dataset columns `Risk Factors`, `Symptoms`, and `Signs` were originally stored as stringified Python lists, e.g. `"['fever', 'stress']"`. These string representations were parsed back into actual Python list objects.

**➤ Why We Did It:**
List-like strings cannot be fed directly to text vectorizers. We needed valid Python lists to:
- Combine all features into a single document per disease.
- Make them compatible with natural language processing (NLP) techniques.
- Enable further transformations, such as joining the items into a string and applying vectorization methods.

**➤ Tools Used:**
- `ast.literal_eval()` from Python's built-in `ast` module, which safely evaluates string literals into actual Python objects without executing arbitrary code.

---

### 🔹 Subtask 2: Convert Lists into Strings

**➤ What We Did:**
Each list (e.g., `['fever', 'stress']`) was converted into a space-separated string such as `"fever stress"`.

**➤ Why We Did It:**
NLP techniques such as TF-IDF require raw text input in the form of plain strings. This conversion allows:
- Treating each disease record as a document.
- Ensuring compatibility with vectorizers that expect string inputs.

**➤ Tools Used:**
- Python's `str.join()` method to concatenate list items into a single space-separated string per record.

---
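Subtasks 1 and 2 can be sketched together as below. This is a minimal illustration, not the project's actual code; the column values and disease names are placeholders standing in for the real dataset.

```python
import ast

import pandas as pd

# Toy stand-in for the real dataset: a list-valued column stored as strings.
df = pd.DataFrame({
    "Disease": ["Hypertension", "Migraine"],
    "Symptoms": ["['headache', 'fatigue']", "['headache', 'nausea']"],
})

# Subtask 1: safely parse the stringified lists back into Python lists.
df["Symptoms"] = df["Symptoms"].apply(ast.literal_eval)

# Subtask 2: join each list into one space-separated document string.
df["Symptoms_text"] = df["Symptoms"].apply(" ".join)

print(df["Symptoms_text"].tolist())  # ['headache fatigue', 'headache nausea']
```

`ast.literal_eval` only accepts Python literals (strings, numbers, lists, dicts, ...), which is why it is safer here than `eval`.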
### 🔹 Subtask 3 & 4: TF-IDF Vectorization

**➤ What We Did:**
We applied **TF-IDF (Term Frequency–Inverse Document Frequency)** vectorization to each of the following:
- Risk Factors
- Symptoms
- Signs

Each category was processed independently using its own `TfidfVectorizer`.

**➤ Why TF-IDF Was Chosen:**
- Unlike one-hot encoding, which treats all terms equally, **TF-IDF emphasizes terms that are frequent in a document but rare across documents**.
- This yields more meaningful and discriminative features.
- It is useful when the goal is to highlight domain-specific disease terms (e.g., `"chest_pain"` for cardiovascular vs. `"tremors"` for neurological).

**➤ Tools Used:**
- `TfidfVectorizer` from `sklearn.feature_extraction.text`.

**➤ Results:**

| Category     | Rows | Features |
|--------------|------|----------|
| Risk Factors | 25   | 360      |
| Symptoms     | 25   | 424      |
| Signs        | 25   | 236      |

---
### 🔹 Subtask 5: Combine TF-IDF Matrices

**➤ What We Did:**
We horizontally stacked the three separate TF-IDF matrices into one unified feature matrix.

**➤ Why:**
- To obtain a single consolidated representation of all textual features (Risk Factors, Symptoms, and Signs).
- Needed for downstream processes such as dimensionality reduction, classification, or clustering.

**➤ Result:**
- Final matrix shape: **25 diseases × 1020 features** (360 + 424 + 236 = 1020)

---
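Horizontal stacking of sparse matrices can be done with `scipy.sparse.hstack`; a small sketch with random placeholder matrices of the shapes reported above:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

rng = np.random.default_rng(0)

# Placeholder sparse TF-IDF matrices for the three categories:
# same number of rows (diseases), different numbers of columns (terms).
X_risk = csr_matrix(rng.random((25, 360)))
X_symptoms = csr_matrix(rng.random((25, 424)))
X_signs = csr_matrix(rng.random((25, 236)))

# Horizontally stack them into one combined feature matrix.
X_combined = hstack([X_risk, X_symptoms, X_signs]).tocsr()

print(X_combined.shape)  # (25, 1020)
```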
### 🔹 Subtask 6: Compare with One-Hot Encoded Matrix

**➤ What We Did:**
We compared the resulting TF-IDF matrix with the **one-hot encoded matrix** provided in `encoded_output2.csv`. The comparison focused on:
- Matrix shape (number of features)
- Sparsity (percentage of zero values)
- Unique features (terms)

**➤ Results:**

| Feature Encoding | Shape     | Sparsity | Unique Features |
|------------------|-----------|----------|-----------------|
| TF-IDF           | (25,1020) | 92.96%   | 1020            |
| One-Hot          | (25,496)  | 95.33%   | 496             |

**➤ Interpretation:**
- **TF-IDF** produced a richer and more detailed representation, albeit slightly less sparse.
- **One-Hot Encoding** was simpler, faster to compute, and highly sparse, but lacked semantic depth.

---
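The sparsity figures above can be computed as the percentage of zero entries. A minimal sketch (the matrix contents are illustrative, not the project's actual data):

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparsity(matrix) -> float:
    """Percentage of zero entries in a scipy sparse matrix."""
    total = matrix.shape[0] * matrix.shape[1]
    return 100.0 * (1 - matrix.nnz / total)


# Illustrative matrix: 25 x 4 entries, of which 10 are nonzero.
dense = np.zeros((25, 4))
dense[:10, 0] = 1.0
X = csr_matrix(dense)

print(round(sparsity(X), 2))  # 90.0
```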
## 📊 Task 02: Dimensionality Reduction & Visualization

### 🔹 Subtask 1: Apply PCA and Truncated SVD

**➤ What We Did:**
Applied two dimensionality reduction techniques to both the **TF-IDF** and **One-Hot** matrices:
- **PCA (Principal Component Analysis)**: works best with dense matrices.
- **Truncated SVD**: suitable for sparse matrices; in text mining it is often called **Latent Semantic Analysis (LSA)**.

We reduced the dimensions to:
- **3 components** for explained-variance analysis.
- **2 components** for visualization in 2D.

**➤ Why:**
- High-dimensional data is often noisy and hard to visualize.
- Reducing to a few components helps reveal hidden patterns, clusters, and similarities between diseases.

**➤ Tools Used:**
- `PCA` from `sklearn.decomposition`
- `TruncatedSVD` from `sklearn.decomposition`

**➤ Results – Explained Variance Ratios (Top 3 Components):**

| Method        | Matrix  | Explained Variance       |
|---------------|---------|--------------------------|
| PCA           | One-Hot | [0.1054, 0.0917, 0.0678] |
| PCA           | TF-IDF  | [0.0656, 0.0586, 0.0568] |
| Truncated SVD | One-Hot | [0.0225, 0.0920, 0.0891] |
| Truncated SVD | TF-IDF  | [0.0089, 0.0657, 0.0572] |

**➤ Interpretation:**
- **PCA on One-Hot** retained the most variance in its first component.
- **TF-IDF** had more evenly distributed variance, reflecting richer and more distributed semantic features.

---
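The reduction step can be sketched as follows; the input matrix here is random placeholder data, not the project's feature matrices.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((25, 50))  # placeholder: 25 diseases x 50 features

# PCA with 3 components for explained-variance analysis.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # one ratio per component

# Truncated SVD (LSA) also accepts sparse input directly,
# which is why it suits the TF-IDF matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X)
print(X_svd.shape)  # (25, 2) -> ready for 2D plotting
```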
### 🔹 Subtask 2: 2D Visualization of Reduced Dimensions

**➤ What We Did:**
Created **2D scatter plots** from the PCA and Truncated SVD projections of both encodings. Each disease was color-coded by its category:
- Cardiovascular
- Neurological
- Respiratory
- Endocrine
- Other

**➤ Tools Used:**
- `matplotlib.pyplot`
- `seaborn.scatterplot`

**➤ Observations:**

| Method        | Clustering Observed   | Interpretation                                                       |
|---------------|-----------------------|----------------------------------------------------------------------|
| PCA – One-Hot | ✅ Distinct clusters  | Clear grouping, e.g., cardiovascular diseases plotted close together |
| PCA – TF-IDF  | ⚠️ Mixed clusters     | Points overlapped due to the high-dimensional, dense features        |
| SVD – One-Hot | ✅ Moderate structure | Reasonable groupings, though less tight than PCA                     |
| SVD – TF-IDF  | ❌ Overlapping, noisy | Features too rich for a 2D view, producing mostly noise              |
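The plotting step can be sketched with plain `matplotlib` (the project also used `seaborn.scatterplot`, which wraps the same idea); the feature matrix and category labels below are illustrative placeholders.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((25, 50))  # placeholder feature matrix
categories = np.array(["Cardiovascular", "Neurological", "Respiratory",
                       "Endocrine", "Other"] * 5)

# Project to 2D, then draw one color-coded scatter series per category.
X_2d = PCA(n_components=2).fit_transform(X)
fig, ax = plt.subplots()
for cat in np.unique(categories):
    mask = categories == cat
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=cat)
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.legend()

# Save to an in-memory buffer (a real script would use fig.savefig("plot.png")).
buf = io.BytesIO()
fig.savefig(buf, format="png")
```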
## 🖼️ Dimensionality Reduction Visualizations

Below are the 2D scatter plots obtained after applying PCA and Truncated SVD to both the One-Hot and TF-IDF feature matrices.

---

### 🔷 PCA Results
![PCA One-Hot](images/pca_onehot.png)
![PCA TF-IDF](images/pca_tfidf.png)

---

### 🔶 Truncated SVD Results
![SVD One-Hot](images/svd_onehot.png)
![SVD TF-IDF](images/svd_tfidf.png)

# 🧪 Task 3: Model Training, Evaluation, and Comparison

## 🎯 Objective

The goal of Task 3 was to evaluate and compare the performance of classification models on the disease data represented through two different feature encoding strategies:

- **TF-IDF vectorized features**
- **One-Hot encoded features**

We benchmarked the models along the following dimensions:

- **Models**: K-Nearest Neighbors (KNN), Logistic Regression
- **KNN distance metrics**: Euclidean, Manhattan, Cosine
- **KNN k-values**: 3, 5, 7
- **Evaluation metrics**: Accuracy, Precision, Recall, F1-Score
- **Validation method**: 5-Fold Cross-Validation

---

## 🧩 Step-by-Step Breakdown

### 🔹 3.1 Train KNN Models

We trained KNN classifiers over the full grid of the following parameters:

- **k-values**: 3, 5, 7
- **Distance metrics**: Euclidean, Manhattan, Cosine
- **Encodings**: TF-IDF and One-Hot

#### ❌ Issue Faced:
All KNN models returned **0% accuracy**, and **cross-validation failed** with `NaN` F1-scores.

#### 🔍 Root Cause:
- The dataset contains **25 unique diseases**, each with only **one sample**.
- In **K-Fold Cross-Validation**, a class with a single sample can never appear in both the training fold and the test fold.
- As a result:
  - The model is always tested on classes it has never seen.
  - `f1_score()` throws a `ValueError` when the expected `pos_label=1` is missing from the test split.

---
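The parameter grid from 3.1 can be sketched as below. Note the synthetic data here deliberately has several samples per class, so the loop actually runs; with the real one-sample-per-class disease data, this is exactly the setup that produced 0% accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative multi-class data with several samples per class.
X, y = make_classification(n_samples=60, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# Full grid: 3 k-values x 3 distance metrics = 9 KNN configurations.
results = {}
for k in (3, 5, 7):
    for metric in ("euclidean", "manhattan", "cosine"):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        scores = cross_val_score(knn, X, y, cv=5)  # 5-fold accuracy
        results[(k, metric)] = scores.mean()

print(max(results, key=results.get))  # best (k, metric) pair
```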
### 🔹 3.2 Report Accuracy, Precision, Recall, and F1-Score

We initially used standard scikit-learn scoring strings such as `'f1'` during cross-validation.

#### ❌ Problem:
- `cross_val_score(..., scoring='f1')` raised runtime errors: the `'f1'` scorer assumes a binary target with `pos_label=1`, which our multi-class labels do not provide.

#### ✅ Resolution:
- Switched to **custom scorers** using `make_scorer(f1_score, average='macro')`.
- Replaced `cross_val_score()` with a **manual `KFold` loop** to gain control over per-fold evaluation and handle edge cases safely.

---
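The resolution can be sketched as follows, again on synthetic placeholder data: a macro-averaged F1 scorer built with `make_scorer`, evaluated inside a manual `KFold` loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=60, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# Macro-averaged F1 handles multi-class targets, unlike scoring='f1'.
macro_f1 = make_scorer(f1_score, average="macro")

# Manual KFold loop: full control over what happens in each fold.
fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(macro_f1(model, X[test_idx], y[test_idx]))

print(np.mean(fold_scores))  # mean macro-F1 across the 5 folds
```

Inside the loop one could also skip or specially handle degenerate folds, which is what `cross_val_score` did not allow.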
### 🔹 3.3 Train Logistic Regression

We trained **Logistic Regression** on both:
- the **TF-IDF matrix**
- the **One-Hot matrix**

using the same 5-Fold Cross-Validation and the **custom macro-averaged scoring strategy**.

#### 📊 Observations:
- Logistic Regression also failed to learn anything useful.
- Accuracy and F1-Score were **0.0 across all folds**, again due to the **one-instance-per-class** issue.

---

### 🔹 3.4 Compare All Configurations

We compiled a table comparing the configurations of models, encodings, and parameters:

| Model                | Matrix  | Mean Accuracy | Mean F1-Score |
|----------------------|---------|---------------|---------------|
| KNN (k=3, cosine)    | TF-IDF  | 0.0           | 0.0           |
| KNN (k=5, manhattan) | One-Hot | 0.0           | 0.0           |
| Logistic Regression  | TF-IDF  | 0.0           | 0.0           |
| Logistic Regression  | One-Hot | 0.0           | 0.0           |
| ...                  | ...     | ...           | ...           |

All configurations yielded zero performance.

---
## 🛠️ What We Changed (Summary)

| **Problem**                                     | **Action Taken**                                          |
|-------------------------------------------------|-----------------------------------------------------------|
| `ValueError` due to missing `pos_label=1`       | Switched to `make_scorer(f1_score, average='macro')`      |
| `NaN` scores from `cross_val_score()`           | Replaced with a manual `KFold` loop for safer evaluation  |
| One-instance-per-class structure in the dataset | Identified as a blocker for all supervised classification |

---
## 📌 Conclusion & Next Steps

### 🔍 **Conclusion:**

- **All models failed to classify correctly** due to the **one-instance-per-class** limitation of the dataset.
- F1-scores and accuracy remained at **zero** across both KNN and Logistic Regression, regardless of:
  - Encoding (TF-IDF or One-Hot)
  - Model type
  - Distance metric

---

### ✅ **Recommended Fix Going Forward:**

To make the classification task **feasible and meaningful**, we recommend the following:

- **Group diseases into broader categories** (e.g., Cardiovascular, Neurological, Respiratory).
- Replace the original target label `Disease` with a new label `Category`.

#### Benefits of This Change:
- Ensures **multiple samples per class**
- Allows the models to **learn patterns** and generalize
- Produces **non-zero performance metrics**
- Makes **supervised learning** applicable

---
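The recommended relabeling could look like this; the disease-to-category mapping below is hypothetical and would need to cover all 25 diseases in the real dataset.

```python
import pandas as pd

# Hypothetical mapping from individual diseases to broader categories.
disease_to_category = {
    "Hypertension": "Cardiovascular",
    "Arrhythmia": "Cardiovascular",
    "Migraine": "Neurological",
    "Epilepsy": "Neurological",
    "Asthma": "Respiratory",
}

df = pd.DataFrame({"Disease": list(disease_to_category)})

# Replace the one-sample-per-class target with a coarser Category label.
df["Category"] = df["Disease"].map(disease_to_category)

print(df["Category"].value_counts().to_dict())
```

With `Category` as the target, each class has multiple samples, so stratified splits and cross-validation become possible.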

> 📣 This step is critical for transforming the dataset into a usable format for machine learning. Without this change, all supervised classification models will fail regardless of their complexity or tuning.