# Quick Start Guide: Model Improvement ## Overview This guide helps you improve your heart attack risk prediction models using advanced optimization techniques. ## 🐳 Docker Option (Recommended) If you have Docker installed, this is the easiest way to run optimization: ```bash # Simple one-command execution ./run_optimization_docker.sh # Or with custom settings ./run_optimization_docker.sh --trials 50 # Run feature analysis ./run_optimization_docker.sh --script feature_importance_analysis.py ``` See [DOCKER_OPTIMIZATION.md](DOCKER_OPTIMIZATION.md) for detailed Docker instructions. --- ## Local Installation Option ## Current Performance Your current models achieve: - **Accuracy:** ~85.1% - **Recall:** ~84.3% - **ROC-AUC:** ~92.5% ## Quick Start (3 Steps) ### Step 1: Install Dependencies ```bash pip install -r requirements.txt ``` This will install Optuna and other required packages. ### Step 2: Run Model Optimization ```bash python improve_models.py ``` **What this does:** - Optimizes hyperparameters for XGBoost, CatBoost, and LightGBM using Optuna - Finds optimal prediction thresholds for each model - Optimizes ensemble weights - Saves improved models to `content/models/` **Time:** ~1-2 hours (100 trials per model) **Output:** - `XGBoost_optimized.joblib` - `CatBoost_optimized.joblib` - `LightGBM_optimized.joblib` - `model_metrics_optimized.csv` - `ensemble_info_optimized.json` - `best_params_optimized.json` ### Step 3: Analyze Feature Importance (Optional) ```bash python feature_importance_analysis.py ``` **What this does:** - Analyzes feature importance across all models - Performs statistical feature selection - Generates visualizations - Provides feature selection recommendations **Time:** ~5-10 minutes **Output:** - `feature_selection_recommendations.json` - `feature_importance_top30.png` - `feature_correlation_top30.png` ### Step 4: Compare Results ```bash python compare_models.py ``` **What this does:** - Compares baseline vs optimized models - Shows improvement metrics - Displays optimal ensemble configuration ## Expected Improvements After running the optimization: | Metric | Current | Expected | Improvement | |--------|---------|----------|-------------| | Accuracy | 85.1% | 86-87% | +1-2% | | Recall | 84.3% | 86-87.5% | +2-4% | | F1 Score | 85.0% | 86-87% | +1-2% | ## Key Improvements Implemented 1. ✅ **Optuna Hyperparameter Optimization** - Tree-structured Parzen Estimator (TPE) - 100+ trials per model - Expanded parameter search spaces 2. ✅ **Multi-Objective Optimization** - Combined accuracy + recall scoring - Threshold optimization per model 3. ✅ **Enhanced Ensemble** - Three-model ensemble (XGBoost + CatBoost + LightGBM) - Optimized weights - Optimized threshold 4. ✅ **Feature Analysis** - Importance extraction - Statistical selection methods - Recommendations for feature engineering ## Faster Alternative If you want faster results (less optimal but quicker): Edit `improve_models.py` and change: ```python n_trials = 100 # Change to 30-50 for faster results ``` ## Troubleshooting **Problem:** Script takes too long - **Solution:** Reduce `n_trials` to 30-50 **Problem:** Memory errors - **Solution:** Reduce `n_jobs` or use smaller data sample **Problem:** No improvement - **Solution:** Check data preprocessing matches training data ## Next Steps 1. Run optimization scripts 2. Compare results with baseline 3. Test optimized models on validation set 4. Deploy best performing model 5. Monitor performance ## Files Created - `improve_models.py` - Main optimization script - `feature_importance_analysis.py` - Feature analysis - `compare_models.py` - Comparison tool - `IMPROVEMENTS.md` - Detailed improvement analysis - `QUICK_START.md` - This guide ## Questions? See `IMPROVEMENTS.md` for detailed explanations of all improvements.