ZainabEman committed on
Commit ddb53b4 · verified · 1 Parent(s): 0a70de8

Update README.md

Files changed (1)
  1. README.md +8 -299
README.md CHANGED
@@ -1,301 +1,10 @@
- # 📘 Project Documentation
-
- ## **Title:** Comparative Feature Engineering and Visualization of Disease Text Data Using TF-IDF and One-Hot Encoding
-
  ---
-
- ## 🧪 Task 01: TF-IDF Feature Extraction
-
- ### 🔹 Subtask 1: Parsing Textual Data into Lists
-
- **➤ What We Did:**
- The dataset columns `Risk Factors`, `Symptoms`, and `Signs` were originally stored as stringified Python lists (e.g., `"['fever', 'stress']"`). These string representations were parsed back into actual Python list objects.
-
- **➤ Why We Did It:**
- List-like strings cannot be processed directly for text vectorization. We needed valid Python lists to:
- Combine all features into a single document per disease.
- Make them compatible with natural language processing (NLP) techniques.
- Enable further transformations, such as joining into a string and applying vectorization methods.
-
- **➤ Tools Used:**
- `ast.literal_eval()` from Python's built-in `ast` module, which safely evaluates string literals into actual Python list objects without executing arbitrary code.
-
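The parsing step can be sketched as follows; the column values below are toy stand-ins for the dataset's actual entries:

```python
import ast

import pandas as pd

# Toy frame mimicking the stringified-list columns (illustrative values only).
df = pd.DataFrame({"Symptoms": ["['fever', 'stress']", "['cough', 'wheezing']"]})

# ast.literal_eval safely turns "['fever', 'stress']" into a real Python list
# without executing arbitrary code (unlike eval).
df["Symptoms"] = df["Symptoms"].apply(ast.literal_eval)

print(df["Symptoms"][0])  # ['fever', 'stress']
```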
  ---
-
- ### 🔹 Subtask 2: Convert Lists into Strings
-
- **➤ What We Did:**
- Each list (e.g., `['fever', 'stress']`) was converted into a space-separated string like `"fever stress"`.
-
- **➤ Why We Did It:**
- NLP techniques such as TF-IDF require plain-string input. This conversion allows:
- Treating each disease record as a document.
- Ensuring compatibility with vectorizers that expect string inputs.
-
- **➤ Tools Used:**
- Python's `str.join()` method to concatenate list items into a single space-separated string per record.
-
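A minimal sketch of the list-to-string conversion, on toy lists:

```python
# Each parsed list becomes one space-separated "document" string.
records = [["fever", "stress"], ["chest pain", "fatigue"]]
docs = [" ".join(terms) for terms in records]
print(docs)  # ['fever stress', 'chest pain fatigue']
```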
- ---
-
- ### 🔹 Subtasks 3 & 4: TF-IDF Vectorization
-
- **➤ What We Did:**
- We applied **TF-IDF (Term Frequency–Inverse Document Frequency)** vectorization to each of the following:
- Risk Factors
- Symptoms
- Signs
-
- Each category was processed independently using its own `TfidfVectorizer`.
-
- **➤ Why TF-IDF Was Chosen:**
- Unlike one-hot encoding, which treats all terms equally, **TF-IDF emphasizes terms that are frequent in a document but rare across documents**.
- This yields more meaningful, discriminative features.
- Useful when the goal is to highlight domain-specific disease terms (e.g., `"chest_pain"` for cardiovascular vs. `"tremors"` for neurological).
-
- **➤ Tools Used:**
- `TfidfVectorizer` from `sklearn.feature_extraction.text`.
-
- **➤ Results:**
-
- | Category     | Rows | Features |
- |--------------|------|----------|
- | Risk Factors | 25   | 360      |
- | Symptoms     | 25   | 424      |
- | Signs        | 25   | 236      |
-
- ---
-
- ### 🔹 Subtask 5: Combine TF-IDF Matrices
-
- **➤ What We Did:**
- We horizontally stacked the three separate TF-IDF matrices into one unified feature matrix.
-
- **➤ Why:**
- To obtain a single consolidated representation of all textual features (Risk Factors, Symptoms, and Signs).
- Needed for downstream steps such as dimensionality reduction, classification, or clustering.
-
- **➤ Result:**
- Final matrix shape: **25 diseases × 1020 features** (360 + 424 + 236)
-
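The horizontal stacking can be sketched like this, with two tiny stand-in matrices in place of the three real ones:

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy per-category matrices standing in for Symptoms and Risk Factors.
X_sym = TfidfVectorizer().fit_transform(["fever stress", "cough fever"])
X_risk = TfidfVectorizer().fit_transform(["smoking obesity", "asthma smoking"])

# Horizontal stacking keeps one row per disease and concatenates the columns.
X_all = sp.hstack([X_sym, X_risk])
print(X_all.shape)  # (2, 6): 3 symptom terms + 3 risk-factor terms
```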
- ---
-
- ### 🔹 Subtask 6: Compare with One-Hot Encoded Matrix
-
- **➤ What We Did:**
- We compared the resulting TF-IDF matrix with the **one-hot encoded matrix** provided in `encoded_output2.csv`. The comparison focused on:
- Matrix shape (number of features)
- Sparsity (percentage of zero values)
- Unique features (terms)
-
- **➤ Results:**
-
- | Feature Encoding | Shape     | Sparsity | Unique Features |
- |------------------|-----------|----------|-----------------|
- | TF-IDF           | (25,1020) | 92.96%   | 1020            |
- | One-Hot          | (25,496)  | 95.33%   | 496             |
-
- **➤ Interpretation:**
- **TF-IDF** produced a richer, more detailed representation, albeit slightly less sparse.
- **One-Hot Encoding** was simpler, faster to compute, and highly sparse, but it carries no information about term importance.
-
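Sparsity values like those in the table can be computed as the percentage of exactly-zero entries; a tiny stand-in matrix illustrates the formula:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in matrix: 4 zeros out of 6 entries.
X = csr_matrix(np.array([[0.0, 0.5, 0.0],
                         [0.0, 0.0, 0.9]]))

# Sparsity = share of entries that are exactly zero; .nnz counts the non-zeros.
sparsity = 100.0 * (1 - X.nnz / (X.shape[0] * X.shape[1]))
print(f"{sparsity:.2f}%")  # 66.67%
```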
- ---
-
- ## 📉 Task 02: Dimensionality Reduction & Visualization
-
- ### 🔹 Subtask 1: Apply PCA and Truncated SVD
-
- **➤ What We Did:**
- Applied two dimensionality reduction techniques to both the **TF-IDF** and **One-Hot** matrices:
- **PCA (Principal Component Analysis)**: Works best with dense matrices.
- **Truncated SVD**: Suitable for sparse matrices; on text data it is often called **Latent Semantic Analysis**.
-
- We reduced dimensions to:
- **3 components** for explained variance analysis.
- **2 components** for visualization in 2D.
-
- **➤ Why:**
- High-dimensional data is often noisy and hard to visualize.
- Reducing to fewer components helps reveal hidden patterns, clusters, and similarities between diseases.
-
- **➤ Tools Used:**
- `PCA` from `sklearn.decomposition`
- `TruncatedSVD` from `sklearn.decomposition`
-
- **➤ Results – Explained Variance Ratios (Top 3 Components):**
-
- | Method        | Matrix  | Explained Variance       |
- |---------------|---------|--------------------------|
- | PCA           | One-Hot | [0.1054, 0.0917, 0.0678] |
- | PCA           | TF-IDF  | [0.0656, 0.0586, 0.0568] |
- | Truncated SVD | One-Hot | [0.0225, 0.0920, 0.0891] |
- | Truncated SVD | TF-IDF  | [0.0089, 0.0657, 0.0572] |
-
- **➤ Interpretation:**
- **PCA on One-Hot** retained the most variance in its first component.
- **TF-IDF** had more evenly distributed variance, reflecting features spread across many terms.
-
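The two reductions can be sketched on random stand-in matrices of the same row count (25 diseases); the column count and values below are illustrative, not the project's:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X_dense = rng.random((25, 40))     # stand-in for the dense One-Hot matrix
X_sparse = csr_matrix(X_dense)     # stand-in for the sparse TF-IDF matrix

pca = PCA(n_components=3).fit(X_dense)            # PCA requires dense input
svd = TruncatedSVD(n_components=3).fit(X_sparse)  # TruncatedSVD accepts sparse

print(pca.explained_variance_ratio_)  # three ratios, each between 0 and 1
print(svd.explained_variance_ratio_)
```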
- ---
-
- ### 🔹 Subtask 2: 2D Visualization of Reduced Dimensions
-
- **➤ What We Did:**
- Created **2D scatter plots** from the PCA and Truncated SVD projections of both encodings. Each disease was color-coded by its category:
- Cardiovascular
- Neurological
- Respiratory
- Endocrine
- Other
-
- **➤ Tools Used:**
- `matplotlib.pyplot`
- `seaborn.scatterplot`
-
- **➤ Observations:**
-
- | Method        | Clustering Observed  | Interpretation                                                    |
- |---------------|----------------------|-------------------------------------------------------------------|
- | PCA – One-Hot | ✅ Distinct clusters  | Clear grouping, e.g., cardiovascular diseases were close together |
- | PCA – TF-IDF  | ⚠️ Mixed clusters    | Dense, high-dimensional features caused points to overlap         |
- | SVD – One-Hot | ✅ Moderate structure | Reasonable groupings, though less tight than PCA                  |
- | SVD – TF-IDF  | ❌ Overlapping, noisy | Features too rich for a 2D projection, producing more noise       |
-
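A sketch of how such a category-colored 2D plot can be produced. The coordinates and category assignments are random stand-ins, and plain `matplotlib` is used here where the project used `seaborn.scatterplot`:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
coords = rng.random((25, 2))  # stand-in for the 2D PCA/SVD projection
labels = np.array(["Cardiovascular", "Neurological", "Respiratory",
                   "Endocrine", "Other"])[rng.integers(0, 5, size=25)]

fig, ax = plt.subplots()
for cat in np.unique(labels):
    mask = labels == cat
    ax.scatter(coords[mask, 0], coords[mask, 1], label=cat)  # one color per category
ax.set_title("PCA projection (2D), colored by disease category")
ax.legend()
fig.savefig("pca_2d.png")
```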
- ## 🖼️ Dimensionality Reduction Visualizations
-
- Below are the 2D scatter plots obtained after applying PCA and Truncated SVD to both the One-Hot and TF-IDF feature matrices.
-
- ---
-
- ### 🔷 PCA Results
- ![PCA - One-Hot Encoded Features (2D)](img1.png)
- ![PCA - TF-IDF Features (2D)](img2.png)
-
- ---
-
- ### 🔶 Truncated SVD Results
- ![Truncated SVD - One-Hot Encoded Features (2D)](img3.png)
- ![Truncated SVD - TF-IDF Features (2D)](img4.png)
-
-
- # 🧪 Task 3: Model Training, Evaluation, and Comparison
-
- ## 🎯 Objective
-
- The goal of Task 3 was to evaluate and compare the performance of classification models on disease data represented through two different feature encoding strategies:
-
- **TF-IDF vectorized features**
- **One-Hot encoded features**
-
- We aimed to benchmark the models on the following:
-
- **Models**: K-Nearest Neighbors (KNN), Logistic Regression
- **KNN Distance Metrics**: Euclidean, Manhattan, Cosine
- **KNN k-values**: 3, 5, 7
- **Evaluation Metrics**: Accuracy, Precision, Recall, F1-Score
- **Validation Method**: 5-Fold Cross-Validation
-
- ---
-
- ## 🧩 Step-by-Step Breakdown
-
- ### 🔹 3.1 Train KNN Models
-
- We trained KNN classifiers over every combination of the following parameters:
-
- **k-values**: 3, 5, 7
- **Distance metrics**: Euclidean, Manhattan, Cosine
- **Encodings**: TF-IDF and One-Hot
-
- #### ❌ Issue Faced:
- All KNN models returned **0% accuracy**, and **cross-validation failed** with `NaN` F1-scores.
-
- #### 🛑 Root Cause:
- The dataset includes **25 unique diseases**, each with only **one sample**.
- In **K-Fold Cross-Validation**, the single sample of each class lands in either the training fold or the test fold, never both.
- As a result:
- The model is always asked to predict classes it has never seen during training.
- `f1_score()` raises a `ValueError` because its default binary averaging expects `pos_label=1`, which is absent from a multi-class test split.
-
- ---
-
- ### 🔹 3.2 Report Accuracy, Precision, Recall, and F1-Score
-
- We attempted to use standard scikit-learn scoring strings such as `'f1'` during cross-validation.
-
- #### ❌ Problem:
- `cross_val_score(..., scoring='f1')` led to runtime errors because the binary-only `pos_label` default does not apply to a multi-class target.
-
- #### ✅ Resolution:
- Switched to **custom scorers** using `make_scorer(f1_score, average='macro')`.
- Replaced `cross_val_score()` with a **manual `KFold` loop** to gain control over per-fold evaluation and handle edge cases safely.
-
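Both fixes can be sketched on toy data with several samples per class; the real dataset, with one sample per disease, is exactly what made these scores collapse:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy multi-class data with 20 samples per class (unlike the real dataset).
X, y = make_classification(n_samples=60, n_classes=3, n_informative=5,
                           random_state=0)

model = KNeighborsClassifier(n_neighbors=3, metric="cosine")

# Fix 1: a macro-averaged scorer sidesteps the binary pos_label=1 default.
macro_f1 = make_scorer(f1_score, average="macro")
cv_scores = cross_val_score(model, X, y, cv=5, scoring=macro_f1)

# Fix 2: a manual KFold loop gives per-fold control over the evaluation.
fold_f1 = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_f1.append(f1_score(y[test_idx], pred, average="macro"))

print(np.mean(cv_scores), np.mean(fold_f1))
```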
- ---
-
- ### 🔹 3.3 Train Logistic Regression
-
- We trained **Logistic Regression** on both:
- **TF-IDF matrix**
- **One-Hot matrix**
-
- using the same 5-Fold Cross-Validation and **custom macro scoring strategy**.
-
- #### 🛑 Observations:
- Logistic Regression fared no better.
- Accuracy and F1-Score were **0.0 across all folds**, again due to the **one-instance-per-class** issue.
-
- ---
-
- ### 🔹 3.4 Compare All Configurations
-
- We compiled a table comparing configurations of models, encodings, and parameters:
-
- | Model                | Matrix  | Mean Accuracy | Mean F1-Score |
- |----------------------|---------|---------------|---------------|
- | KNN (k=3, cosine)    | TF-IDF  | 0.0           | 0.0           |
- | KNN (k=5, manhattan) | One-Hot | 0.0           | 0.0           |
- | Logistic Regression  | TF-IDF  | 0.0           | 0.0           |
- | Logistic Regression  | One-Hot | 0.0           | 0.0           |
- | ...                  | ...     | ...           | ...           |
-
- All configurations yielded zero performance.
-
- ---
-
- ## 🛠️ What We Changed (Summary)
-
- | **Problem**                                     | **Action Taken**                                          |
- |-------------------------------------------------|-----------------------------------------------------------|
- | `ValueError` due to missing `pos_label=1`       | Switched to `make_scorer(f1_score, average='macro')`      |
- | `NaN` scores from `cross_val_score()`           | Replaced with manual `KFold` loop for safer evaluation    |
- | One-instance-per-class structure in the dataset | Identified as a blocker for all supervised classification |
-
- ---
-
- ## 📌 Conclusion & Next Steps
-
- ### 🔚 **Conclusion:**
-
- **All models failed to classify correctly** due to the **one-instance-per-class** limitation in the dataset.
- F1-scores and accuracy remained **zero** across both KNN and Logistic Regression models, regardless of:
- Encoding (TF-IDF or One-Hot)
- Model type
- Distance metric
-
- ---
-
- ### ✅ **Recommended Fix Going Forward:**
-
- To make the classification task **feasible and meaningful**, we recommend the following:
-
- **Group diseases into broader categories** (e.g., Cardiovascular, Neurological, Respiratory).
- Replace the original target label `Disease` with a new label `Category`.
-
- #### Benefits of This Change:
- Ensures **multiple samples per class**
- Allows the models to **learn patterns** and generalize
- Produces **non-zero performance metrics**
- Makes **supervised learning** applicable
-
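The relabeling could look like this; the disease names and category assignments below are illustrative, not the project's actual mapping:

```python
import pandas as pd

# Hypothetical mapping; the real one would cover all 25 diseases.
disease_to_category = {
    "Hypertension": "Cardiovascular",
    "Stroke": "Neurological",
    "Asthma": "Respiratory",
}

df = pd.DataFrame({"Disease": ["Hypertension", "Stroke", "Asthma"]})
df["Category"] = df["Disease"].map(disease_to_category)  # new target label
print(df["Category"].tolist())  # ['Cardiovascular', 'Neurological', 'Respiratory']
```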
- ---
-
- > 📣 This step is critical for transforming the dataset into a usable format for machine learning. Without this change, all supervised classification models will fail regardless of their complexity or tuning.
-
 
 
 
 
 
  ---
+ title: My Cool App
+ emoji: 🚀
+ colorFrom: indigo
+ colorTo: pink
+ sdk: gradio
+ sdk_version: "4.18.0"
+ app_file: app.py
+ pinned: false
  ---