language:
- en
pipeline_tag: tabular-classification
---

# Stroke Prediction Model

This project implements a machine learning pipeline for predicting stroke risk from tabular patient data. Multiple models are trained and the best-performing one is selected. Below is a detailed explanation of how each key consideration was implemented.

### Dataset

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

### Attribute Information

1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"\*
12. stroke: 1 if the patient had a stroke or 0 if not

\*"Unknown" in smoking_status means that the information is unavailable for this patient.

## Key Considerations Implementation

### Data Cleaning

#### Drop id column

The id column is dropped: it serves as a unique identifier for each row but does not contribute to the predictive power of the model.

#### Remove missing values

Entries with a missing bmi value are removed; because they are few in number, dropping them has negligible impact on model accuracy.

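The two cleaning steps above can be sketched with pandas; the toy DataFrame here is only a stand-in for the real dataset, whose loading code is not shown in this README:

```python
import pandas as pd

# Toy frame standing in for the stroke dataset
df = pd.DataFrame({
    "id": [101, 102, 103],
    "age": [67.0, 54.0, 80.0],
    "bmi": [36.6, None, 32.5],
    "stroke": [1, 0, 1],
})

# Drop the id column: it only identifies rows and has no predictive value
df = df.drop(columns=["id"])

# Remove the few rows with a missing bmi value
df = df.dropna(subset=["bmi"]).reset_index(drop=True)
```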
### Feature Engineering

#### Binary Encoding

Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models:

- ever_married: encoded as 0 for "No" and 1 for "Yes".
- Residence_type: encoded as 0 for "Rural" and 1 for "Urban".

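A minimal sketch of this encoding using pandas `map` (the project's exact implementation is not shown here):

```python
import pandas as pd

df = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})

# Map each two-valued categorical column onto {0, 1}
df["ever_married"] = df["ever_married"].map({"No": 0, "Yes": 1})
df["Residence_type"] = df["Residence_type"].map({"Rural": 0, "Urban": 1})
```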
#### One-Hot Encoding for Multi-Class Categorical Features

- For features with more than two categories, such as gender, work_type, and smoking_status, one-hot encoding creates a separate binary column for each category.
- The onehot_encode function is assumed to handle the transformation, adding a column per category while dropping the original column.

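The `onehot_encode` helper itself is not shown in this README; a plausible implementation, assumed here to be built on `pd.get_dummies`, might look like:

```python
import pandas as pd

def onehot_encode(df, column):
    """Replace `column` with one binary indicator column per category."""
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

# Illustrative use on a toy frame
df = pd.DataFrame({"gender": ["Male", "Female", "Other"], "age": [67, 54, 80]})
df = onehot_encode(df, "gender")
```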
#### Split Dataset into Features and Target

- Separate the target variable (stroke) from the features:
  - X: contains all feature columns used as input for the model.
  - y: contains the target column, which indicates whether a stroke occurred.

#### Train-Test Split

- Split the dataset into training and testing sets so the model is evaluated on unseen data, which helps detect overfitting.
- The specific split ratio (e.g., 70% train, 30% test) can be customized as needed.

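These two steps can be sketched with scikit-learn's `train_test_split`; the toy DataFrame and the 70/30 stratified split are illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned, encoded stroke data
df = pd.DataFrame({
    "age": [67, 54, 80, 49, 61, 72, 58, 45, 69, 51],
    "hypertension": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "stroke": [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Separate the target from the features
X = df.drop(columns=["stroke"])
y = df["stroke"]

# 70/30 split; stratify keeps the stroke/no-stroke ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```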
### Model Selection

The following models are evaluated:

- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine (Linear Kernel)
- Support Vector Machine (RBF Kernel)
- Neural Network
- Gradient Boosting

Each model is evaluated on whether it:

- Handles both numerical and categorical features
- Resists overfitting
- Provides feature importance
- Performs well on imbalanced data

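The exact model configurations used in this project are not shown here; a minimal comparison loop over the listed estimators, using scikit-learn and synthetic stand-in data, might look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Imbalanced synthetic data standing in for the prepared stroke features
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM (Linear Kernel)": SVC(kernel="linear"),
    "SVM (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and keep the one with the best held-out accuracy
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
best_name = max(scores, key=scores.get)
```

Accuracy is used here only for brevity; on imbalanced data a metric such as recall or F1 on the stroke class would be more informative.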
### Software Engineering Best Practices

#### A. Logging

A comprehensive logging system is configured:

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
```

Logging features:

- Timestamp for each operation
- Different log levels (INFO, ERROR)
- Operation tracking
- Error capture and reporting

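How the pipeline consumes this configuration is not shown in this README; a hypothetical `run_step` wrapper illustrates how operation tracking and error capture could use it:

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def run_step(name, func, *args):
    """Run one pipeline step, logging its start, completion, and any failure."""
    logger.info("Starting step: %s", name)
    try:
        result = func(*args)
        logger.info("Finished step: %s", name)
        return result
    except Exception:
        # logger.exception records the traceback at ERROR level
        logger.exception("Step failed: %s", name)
        raise

# Illustrative call with a trivial stand-in step
cleaned = run_step("square", lambda x: x * x, 4)
```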
#### B. Documentation

- Docstrings for all classes and methods
- Clear code structure with comments
- This README file
- Logging outputs for tracking