# Machine Learning Model Comparison: Classification Project

This project compares seven supervised machine learning algorithms on a structured classification task. Each model was evaluated on accuracy, speed, and practical usability.
## Models Included

| **No.** | **Model Name** | **Type** |
|---------|----------------|----------|
| 1 | Logistic Regression | Linear Model |
| 2 | Random Forest | Ensemble (Bagging) |
| 3 | K-Nearest Neighbors | Instance-Based (Lazy) |
| 4 | Support Vector Machine | Margin-Based Classifier |
| 5 | ANN (MLPClassifier) | Neural Network |
| 6 | Naive Bayes | Probabilistic |
| 7 | Decision Tree | Tree-Based |
## Accuracy Summary

| **Model** | **Accuracy (%)** | **Speed** |
|-----------|------------------|-----------|
| Logistic Regression | ~92.3 | Very Fast |
| Random Forest | ~87.2 | Medium |
| KNN | ~74.4 | Slow |
| SVM | ~89.7 | Medium |
| ANN (MLP) | ~46.2 | Medium |
| Naive Bayes | ~82.1 | Extremely Fast |
| Decision Tree | ~92.3 | Fast |
## Model Descriptions

### 1. **Logistic Regression**

* A linear model that predicts class probabilities using the sigmoid (logistic) function.
* **Best for:** Interpretable, quick binary classification.
* **Limitations:** Not ideal for non-linear or complex patterns.
* **Performance:** 92.3% accuracy with an excellent precision-recall balance.
### 2. **Random Forest**

* An ensemble of decision trees combined by majority voting.
* **Best for:** Robust predictions and feature importance analysis (see the sketch below).
* **Limitations:** Slower and harder to interpret than simpler models.
* **Performance:** 87.2% accuracy with good generalization.
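Feature importance analysis is one of Random Forest's practical advantages. A minimal sketch, using synthetic data since the project's dataset is not included here:

```python
# Illustrative only: synthetic data stands in for the project's dataset
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Rank features by mean impurity-based importance across the trees
importances = model.feature_importances_
for rank, idx in enumerate(importances.argsort()[::-1], start=1):
    print(f"{rank}. feature_{idx}: {importances[idx]:.3f}")
```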
### 3. **K-Nearest Neighbors (KNN)**

* A lazy learner that predicts based on the classes of the nearest training points.
* **Best for:** Simple implementation and non-parametric classification.
* **Limitations:** Slow prediction on large datasets; sensitive to noise.
* **Performance:** 74.4% accuracy, the lowest among the tested models apart from the ANN.
### 4. **Support Vector Machine (SVM)**

* Separates classes by finding the maximum-margin hyperplane.
* **Best for:** High-dimensional data and non-linear patterns with the RBF kernel.
* **Limitations:** Requires feature scaling; sensitive to hyperparameters (see the sketch below).
* **Performance:** 89.7% accuracy with strong classification boundaries.
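Because SVM requires scaling, a Pipeline keeps the scaler and classifier in sync at fit and predict time. A hedged sketch with synthetic stand-in data:

```python
# Illustrative only: synthetic data stands in for the project's dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling happens inside the pipeline, so predict() scales new data too
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print(f"Test accuracy: {svm.score(X_test, y_test):.3f}")
```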
### 5. **ANN (MLPClassifier)**

* A basic feedforward neural network with hidden layers.
* **Best for:** Learning complex non-linear patterns.
* **Limitations:** Poor performance in this project; needs better tuning and data preprocessing.
* **Performance:** 46.2% accuracy. It severely underperformed, most likely due to missing feature scaling or an unsuitable architecture (see the sketch below).
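One plausible remedy, sketched under assumptions: MLPClassifier is very sensitive to feature scale, so combining it with a StandardScaler in a Pipeline often recovers much of the lost accuracy. The layer sizes and iteration count below are illustrative guesses, not the project's settings.

```python
# Sketch of a possible fix, not the project's actual configuration
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ann = make_pipeline(
    StandardScaler(),  # scaling is the most common missing step for MLPs
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0),
)
ann.fit(X_train, y_train)
print(f"Test accuracy: {ann.score(X_test, y_test):.3f}")
```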
### 6. **Naive Bayes (GaussianNB)**

* A probabilistic classifier that assumes the features are conditionally independent.
* **Best for:** Fast training and text classification.
* **Limitations:** The feature independence assumption rarely holds in practice.
* **Performance:** 82.1% accuracy with extremely fast training time.
### 7. **Decision Tree**

* A tree-based model that splits data on feature thresholds.
* **Best for:** Interpretable rules (see the sketch below) and handling both numerical and categorical data.
* **Limitations:** Prone to overfitting without proper pruning or depth limits.
* **Performance:** 92.3% accuracy with excellent interpretability.
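The learned rules can be printed directly, which is what makes decision trees so interpretable. An illustrative sketch on the classic Iris dataset (not the project's data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# max_depth is a simple guard against the overfitting noted above
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Print the learned decision rules as indented if/else text
print(export_text(tree, feature_names=list(data.feature_names)))
```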
## Recommendation Summary

| **Best For** | **Model** |
|--------------|-----------|
| **Highest Accuracy** | Logistic Regression & Decision Tree (92.3%) |
| **Fastest Training** | Naive Bayes |
| **Best Interpretability** | Decision Tree |
| **Best Baseline** | Logistic Regression |
| **Most Robust** | Random Forest |
| **High-Dimensional Data** | SVM |
| **Needs Improvement** | ANN (MLPClassifier) |
## Model Files Included

* `logistic_regression.pkl` - Linear classification model
* `random_forest_model.pkl` - Ensemble model
* `KNeighborsClassifier_model.pkl` - Instance-based model
* `SVM_model.pkl` - Support Vector Machine
* `ANN_model.pkl` - Neural network (needs optimization)
* `Naive_Bayes_model.pkl` - Probabilistic model
* `DecisionTreeClassifier.pkl` - Tree-based model
## How to Use

### Loading and Using Models

```python
import joblib

# Load any saved model
model = joblib.load("logistic_regression.pkl")

# Models trained on scaled features (SVM, ANN) must reuse the SAME scaler
# that was fitted during training. Do not fit a new scaler on incoming data;
# load the saved one (see the training pipeline below) and call transform().
scaler = joblib.load("scaler.pkl")  # assumes the scaler was saved at training time
X_scaled = scaler.transform(X_new_data)  # X_new_data: new samples, shaped like the training features
prediction = model.predict(X_scaled)

# Models trained on raw features can predict directly
prediction = model.predict(X_new_data)
print(prediction)
```
### Training Pipeline Example

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# X_train, X_test, y_train, y_test are assumed to come from an earlier
# train/test split of the project's dataset.

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model training
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Save the model and the fitted scaler so that predictions on new data
# can reuse the exact same transformation (see the loading example above)
joblib.dump(model, "logistic_regression.pkl")
joblib.dump(scaler, "scaler.pkl")

# Evaluation
y_pred = model.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
```
## Performance Details

### Confusion Matrix Analysis

Most models showed a good precision-recall balance (a sketch for computing a confusion matrix follows this list):

- **True positives:** The top models correctly identified most positive cases
- **False positives:** Low false-alarm rates across the top performers
- **Class balance:** The dataset appears well balanced between classes, so accuracy is a reasonable headline metric
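Continuing from the training pipeline above (reusing its `model`, `X_test_scaled`, and `y_test`), a minimal sketch of producing the confusion matrix for any of the saved models:

```python
# Assumes model, X_test_scaled, and y_test from the training pipeline example
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_pred = model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows: true classes, columns: predicted classes

# Optional: render the matrix as a plot
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()
```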
### Key Insights

1. **Logistic Regression** and **Decision Tree** tied for the best accuracy (92.3%)
2. **ANN** significantly underperformed and needs scaling plus architecture tuning
3. **SVM** showed strong performance with the RBF kernel
4. **Naive Bayes** offers the best speed-accuracy tradeoff for quick prototyping
## Future Improvements

### For the ANN Model:

- Implement proper feature scaling
- Tune hyperparameters (learning rate, architecture)
- Add regularization techniques
- Consider ensemble methods

### General Optimizations:

- Cross-validation for robust performance estimates (see the sketch below)
- Hyperparameter tuning with GridSearchCV/RandomizedSearchCV
- Feature engineering and selection
- Ensemble methods combining the top performers
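A hedged sketch of the first two optimizations, with an illustrative parameter grid and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold cross-validation gives a more robust accuracy estimate
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Grid search over a small, illustrative hyperparameter grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, f"best CV accuracy: {grid.best_score_:.3f}")
```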
## Model Selection Guide

- **Choose Logistic Regression if:** you need interpretability and high accuracy
- **Choose Random Forest if:** you want robust predictions without much tuning
- **Choose SVM if:** you are working with high-dimensional or complex feature spaces
- **Choose Decision Tree if:** interpretability is crucial and you have domain expertise
- **Choose Naive Bayes if:** speed is critical and the features are relatively independent
---

*For detailed performance metrics, confusion matrices, and visualizations, check the accompanying analysis files.*