# Quick Start Guide: Model Improvement

## Overview
This guide helps you improve your heart attack risk prediction models using advanced optimization techniques.
## 🐳 Docker Option (Recommended)
If you have Docker installed, this is the easiest way to run optimization:
```bash
# Simple one-command execution
./run_optimization_docker.sh

# Or with custom settings
./run_optimization_docker.sh --trials 50

# Run feature analysis
./run_optimization_docker.sh --script feature_importance_analysis.py
```
See DOCKER_OPTIMIZATION.md for detailed Docker instructions.
## Local Installation Option

### Current Performance
Your current models achieve:
- Accuracy: ~85.1%
- Recall: ~84.3%
- ROC-AUC: ~92.5%
## Quick Start (4 Steps)

### Step 1: Install Dependencies

```bash
pip install -r requirements.txt
```
This will install Optuna and other required packages.
### Step 2: Run Model Optimization

```bash
python improve_models.py
```
What this does:
- Optimizes hyperparameters for XGBoost, CatBoost, and LightGBM using Optuna
- Finds optimal prediction thresholds for each model
- Optimizes ensemble weights
- Saves improved models to `content/models/`

**Time:** ~1-2 hours (100 trials per model)

**Output:**
- `XGBoost_optimized.joblib`
- `CatBoost_optimized.joblib`
- `LightGBM_optimized.joblib`
- `model_metrics_optimized.csv`
- `ensemble_info_optimized.json`
- `best_params_optimized.json`
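The per-model threshold search mentioned above can be pictured with a short, self-contained sketch. This is an illustrative stand-in, not the actual code in `improve_models.py`: it scans candidate thresholds over toy probabilities and keeps the one that maximizes an equally weighted accuracy + recall score (the 50/50 weighting and the example data are assumptions).

```python
# Minimal sketch: pick the probability threshold that maximizes a
# combined accuracy + recall score, as Step 2 does per model.
# y_true / y_prob are toy data, not outputs of improve_models.py.

def combined_score(y_true, y_pred):
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Recall: fraction of true positives that were predicted positive.
    predicted_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    recall = sum(predicted_for_positives) / max(1, sum(y_true))
    return 0.5 * accuracy + 0.5 * recall  # equal weighting is an assumption

def best_threshold(y_true, y_prob, steps=101):
    candidates = [i / (steps - 1) for i in range(steps)]
    scored = []
    for thr in candidates:
        y_pred = [1 if p >= thr else 0 for p in y_prob]
        scored.append((combined_score(y_true, y_pred), thr))
    return max(scored)  # (best score, its threshold)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.4, 0.65, 0.8, 0.3, 0.55, 0.7, 0.2]
score, thr = best_threshold(y_true, y_prob)
print(f"best threshold={thr:.2f}, score={score:.3f}")
```

On this toy data the search lands on a threshold above the default 0.5, which is exactly the kind of adjustment threshold tuning buys you on imbalanced medical data.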
### Step 3: Analyze Feature Importance (Optional)

```bash
python feature_importance_analysis.py
```
What this does:
- Analyzes feature importance across all models
- Performs statistical feature selection
- Generates visualizations
- Provides feature selection recommendations
**Time:** ~5-10 minutes

**Output:**
- `feature_selection_recommendations.json`
- `feature_importance_top30.png`
- `feature_correlation_top30.png`
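One common form of statistical feature selection is ranking features by the strength of their association with the target. The sketch below does this with absolute Pearson correlation on made-up data; `feature_importance_analysis.py` may use different statistics, and the feature names here are purely illustrative.

```python
# Illustrative sketch of statistical feature selection: rank features by
# absolute correlation with the target and keep the top-k.
# Feature names and values are made up for demonstration.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

features = {
    "age":         [63, 45, 58, 70, 39, 61],
    "cholesterol": [240, 190, 220, 260, 180, 230],
    "noise":       [1, 9, 2, 8, 3, 7],
}
target = [1, 0, 1, 1, 0, 1]

# Strongest association first; weakly correlated features fall to the bottom.
ranked = sorted(features, key=lambda f: abs(pearson(features[f], target)),
                reverse=True)
top_2 = ranked[:2]
print("ranked:", ranked, "| selected:", top_2)
```

Features with near-zero correlation (like `noise` here) are the ones a selection report would recommend dropping or re-engineering.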
### Step 4: Compare Results

```bash
python compare_models.py
```
What this does:
- Compares baseline vs optimized models
- Shows improvement metrics
- Displays optimal ensemble configuration
## Expected Improvements
After running the optimization:
| Metric | Current | Expected | Improvement |
|---|---|---|---|
| Accuracy | 85.1% | 86-87% | +1-2% |
| Recall | 84.3% | 86-87.5% | +2-4% |
| F1 Score | 85.0% | 86-87% | +1-2% |
## Key Improvements Implemented

### ✅ Optuna Hyperparameter Optimization
- Tree-structured Parzen Estimator (TPE)
- 100+ trials per model
- Expanded parameter search spaces
### ✅ Multi-Objective Optimization
- Combined accuracy + recall scoring
- Threshold optimization per model
### ✅ Enhanced Ensemble
- Three-model ensemble (XGBoost + CatBoost + LightGBM)
- Optimized weights
- Optimized threshold
### ✅ Feature Analysis
- Importance extraction
- Statistical selection methods
- Recommendations for feature engineering
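Ensemble weight optimization can be pictured as a search over convex combinations of the three models' predicted probabilities. The sketch below runs a coarse grid search with toy probabilities and a fixed 0.5 threshold; `improve_models.py` does this against real validation data and also tunes the ensemble threshold.

```python
# Sketch of ensemble weight optimization: grid search over weights for
# three models' predicted probabilities, scored by accuracy.
# Probabilities and labels are toy values for illustration only.
from itertools import product

y_true = [1, 0, 1, 1, 0]
probs = {
    "XGBoost":  [0.8, 0.4, 0.6, 0.7, 0.3],
    "CatBoost": [0.7, 0.5, 0.55, 0.8, 0.45],
    "LightGBM": [0.9, 0.35, 0.5, 0.6, 0.4],
}

def accuracy(weights, threshold=0.5):
    names = list(probs)
    blended = [sum(w * probs[n][i] for w, n in zip(weights, names))
               for i in range(len(y_true))]
    preds = [1 if p >= threshold else 0 for p in blended]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

# Weights step by 0.1 and must sum to 1 (the third weight is implied).
grid = [w / 10 for w in range(11)]
best = max(
    ((accuracy((a, b, 1 - a - b)), (a, b, round(1 - a - b, 1)))
     for a, b in product(grid, grid) if a + b <= 1),
    key=lambda x: x[0],
)
print("best accuracy:", best[0], "weights:", best[1])
```

A real run would use a finer grid or a proper optimizer, but the idea is the same: the ensemble's weights are fit on validation predictions, not fixed at equal thirds.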
## Faster Alternative

If you want faster (though less optimal) results, edit `improve_models.py` and change:

```python
n_trials = 100  # Change to 30-50 for faster results
```
## Troubleshooting

**Problem:** Script takes too long
- **Solution:** Reduce `n_trials` to 30-50

**Problem:** Memory errors
- **Solution:** Reduce `n_jobs` or use a smaller data sample

**Problem:** No improvement
- **Solution:** Check that data preprocessing matches the training data
## Next Steps
- Run optimization scripts
- Compare results with baseline
- Test optimized models on validation set
- Deploy best performing model
- Monitor performance
## Files Created

- `improve_models.py` - Main optimization script
- `feature_importance_analysis.py` - Feature analysis
- `compare_models.py` - Comparison tool
- `IMPROVEMENTS.md` - Detailed improvement analysis
- `QUICK_START.md` - This guide
## Questions?
See IMPROVEMENTS.md for detailed explanations of all improvements.