# Machine Learning Model Comparison – Classification Project
This project compares a variety of supervised machine learning algorithms to evaluate their performance on structured classification tasks. Each model was analyzed based on speed, accuracy, and practical usability.
## Models Included
| **No.** | **Model Name** | **Type** |
|---------|----------------|----------|
| 1 | Logistic Regression | Linear Model |
| 2 | Random Forest | Ensemble (Bagging) |
| 3 | K-Nearest Neighbors | Instance-Based (Lazy) |
| 4 | Support Vector Machine | Margin-based Classifier |
| 5 | ANN (MLPClassifier) | Neural Network |
| 6 | Naive Bayes | Probabilistic |
| 7 | Decision Tree | Tree-based |
## Accuracy Summary
| **Model** | **Accuracy (%)** | **Speed** |
|-----------|------------------|-----------|
| Logistic Regression | ~92.3% | Very Fast |
| Random Forest | ~87.2% | Medium |
| KNN | ~74.4% | Slow |
| SVM | ~89.7% | Medium |
| ANN (MLP) | ~46.2% | Medium |
| Naive Bayes | ~82.1% | Extremely Fast |
| Decision Tree | ~92.3% | Fast |
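The comparison above can be reproduced with a loop over the candidate models. The sketch below is illustrative, not the project's actual pipeline: the project's dataset is not bundled here, so a synthetic `make_classification` dataset stands in for it, and the printed scores will differ from the table.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project's dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale once; the scale-sensitive models (LogReg, KNN, SVM, MLP) need it,
# and it does not hurt the tree-based or probabilistic models.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM (RBF)": SVC(kernel="rbf"),
    "ANN (MLP)": MLPClassifier(max_iter=500, random_state=42),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_train_s, y_train)
    scores[name] = clf.score(X_test_s, y_test)
    print(f"{name}: {scores[name]:.3f}")
```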
## Model Descriptions
### 1. **Logistic Regression**
* A linear model that predicts class probabilities using a sigmoid function.
* **Best for:** Interpretable and quick binary classification.
* **Limitations:** Not ideal for non-linear or complex patterns.
* **Performance:** 92.3% accuracy with excellent precision-recall balance.
### 2. **Random Forest**
* An ensemble of decision trees with majority voting.
* **Best for:** Robust predictions and feature importance analysis.
* **Limitations:** Slower and harder to interpret than simpler models.
* **Performance:** 87.2% accuracy with good generalization.
### 3. **K-Nearest Neighbors (KNN)**
* A lazy learner that predicts based on the nearest data points.
* **Best for:** Simple implementation and non-parametric classification.
* **Limitations:** Very slow for large datasets; sensitive to noise.
* **Performance:** 74.4% accuracy, lowest among tested models.
### 4. **Support Vector Machine (SVM)**
* Separates classes by finding the maximum margin hyperplane.
* **Best for:** High-dimensional data and non-linear patterns with the RBF kernel.
* **Limitations:** Requires feature scaling; sensitive to hyperparameters.
* **Performance:** 89.7% accuracy with strong classification boundaries.
### 5. **ANN (MLPClassifier)**
* A basic feedforward neural network with hidden layers.
* **Best for:** Learning complex non-linear patterns.
* **Limitations:** Poor performance in this project; needs better tuning and data preprocessing.
* **Performance:** 46.2% accuracy; severely underperformed, most likely due to missing feature scaling or an unsuitable architecture.
### 6. **Naive Bayes (GaussianNB)**
* A probabilistic classifier assuming feature independence.
* **Best for:** Fast training and text classification.
* **Limitations:** The feature-independence assumption rarely holds in practice.
* **Performance:** 82.1% accuracy with extremely fast training time.
### 7. **Decision Tree**
* A tree-based model that splits data based on feature thresholds.
* **Best for:** Interpretable rules and handling both numerical and categorical data.
* **Limitations:** Prone to overfitting without proper pruning.
* **Performance:** 92.3% accuracy with excellent interpretability.
## Recommendation Summary
| **Best For** | **Model** |
|--------------|-----------|
| **Highest Accuracy** | Logistic Regression & Decision Tree (92.3%) |
| **Fastest Training** | Naive Bayes |
| **Best Interpretability** | Decision Tree |
| **Best Baseline** | Logistic Regression |
| **Most Robust** | Random Forest |
| **High-Dimensional Data** | SVM |
| **Needs Improvement** | ANN (MLPClassifier) |
## Model Files Included
* `logistic_regression.pkl` - Linear classification model
* `random_forest_model.pkl` - Ensemble model
* `KNeighborsClassifier_model.pkl` - Instance-based model
* `SVM_model.pkl` - Support Vector Machine
* `ANN_model.pkl` - Neural network (needs optimization)
* `Naive_Bayes_model.pkl` - Probabilistic model
* `DecisionTreeClassifier.pkl` - Tree-based model
## How to Use
### Loading and Using Models
```python
import joblib

# Load any saved model
model = joblib.load("logistic_regression.pkl")

# Models trained on scaled features (SVM, ANN) must be fed data
# transformed with the scaler fitted at TRAINING time -- never fit a
# new scaler on prediction data. This assumes the training scaler was
# persisted alongside the model, e.g. as scaler.pkl.
scaler = joblib.load("scaler.pkl")
prediction = model.predict(scaler.transform(X_new_data))

# Models trained on raw features can predict directly
prediction = model.predict(X_new_data)
print(prediction)
```
### Training Pipeline Example
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import joblib
# Data preprocessing (X_train, X_test, y_train, y_test come from an earlier train/test split)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Model training
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
# Save the model and the fitted scaler (predictions must reuse this scaler)
joblib.dump(model, 'logistic_regression.pkl')
joblib.dump(scaler, 'scaler.pkl')
# Evaluation
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Classification Report:\n", classification_report(y_test, y_pred))
```
## Performance Details
### Confusion Matrix Analysis
Most models showed a good precision-recall balance:
- **True positives:** the top models recovered the large majority of positive cases
- **False positives:** false-alarm rates were low across the top performers
- **Class balance:** the dataset appears well balanced, so accuracy is a meaningful headline metric
### Key Insights
1. **Logistic Regression** and **Decision Tree** tied for best accuracy (92.3%)
2. **ANN** significantly underperformed - requires architecture optimization
3. **SVM** showed strong performance with RBF kernel
4. **Naive Bayes** offers best speed-accuracy tradeoff for quick prototyping
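Insights 1 and 3 suggest combining the strongest models in a soft-voting ensemble. A minimal sketch follows; the estimator choices and settings here are illustrative assumptions, not part of the project:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Soft voting averages predicted class probabilities, so SVC must be
# constructed with probability=True.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("svm", SVC(kernel="rbf", probability=True)),
    ],
    voting="soft",
)
# Fit and score like any other sklearn estimator:
# ensemble.fit(X_train_scaled, y_train)
# accuracy = ensemble.score(X_test_scaled, y_test)
```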
## Future Improvements
### For ANN Model:
- Implement proper feature scaling
- Tune hyperparameters (learning rate, architecture)
- Add regularization techniques
- Consider ensemble methods
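The first three fixes above can be combined in one `Pipeline`. This is a sketch with hypothetical parameter values, not a tuned configuration: scaling is applied inside the pipeline, the hidden-layer sizes and learning rate are tuning candidates, and `alpha` adds L2 regularization.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

ann = Pipeline([
    ("scaler", StandardScaler()),      # scaling is essential for MLPs
    ("mlp", MLPClassifier(
        hidden_layer_sizes=(64, 32),   # candidate architecture (default is (100,))
        learning_rate_init=0.001,
        alpha=1e-3,                    # L2 regularization strength
        early_stopping=True,           # hold out 10% and stop when it plateaus
        max_iter=1000,
        random_state=42,
    )),
])
# Fit and score like any other estimator:
# ann.fit(X_train, y_train)
# accuracy = ann.score(X_test, y_test)
```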
### General Optimizations:
- Cross-validation for robust performance estimates
- Hyperparameter tuning with GridSearch/RandomSearch
- Feature engineering and selection
- Ensemble methods combining top performers
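The first two optimizations can be sketched as follows, using SVM as the example model. The grid values are illustrative assumptions; a real search would use the project's dataset and a wider grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the project's dataset
X, y = make_classification(n_samples=500, random_state=0)

# Putting the scaler in the pipeline keeps scaling inside each CV fold,
# avoiding leakage from the validation folds into the training folds.
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# 5-fold cross-validation for a robust performance estimate
cv_scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Small grid search over SVM hyperparameters
grid = GridSearchCV(
    pipe,
    {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```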
## Model Selection Guide
* **Choose Logistic Regression if:** you need interpretability and high accuracy
* **Choose Random Forest if:** you want robust predictions without much tuning
* **Choose SVM if:** you are working with high-dimensional or complex feature spaces
* **Choose Decision Tree if:** interpretability is crucial and you have domain expertise
* **Choose Naive Bayes if:** speed is critical and features are relatively independent
---
*For detailed performance metrics, confusion matrices, and visualizations, check the accompanying analysis files.*