Kasilanka Bhoopesh Siva Srikar committed on
Commit 08123aa · 1 Parent(s): 06a2683

Complete Heart Attack Risk Prediction App - Ready for Deployment

- Updated Streamlit app with optimized ensemble models
- Added all 3 models: XGBoost, CatBoost, LightGBM
- Fixed feature alignment and UI display
- Added comprehensive test cases (8 test scenarios)
- Created deployment documentation
- Models: 80.77% accuracy, 93.27% recall
- Ensemble weights: XGB 5%, CAT 85%, LGB 10%
- Ready for Hugging Face Spaces deployment

.gitignore CHANGED
```diff
@@ -1,15 +1,68 @@
- # Python / Streamlit
+ # Python
  __pycache__/
- *.pyc
- .streamlit/secrets.toml
-
- # Model assets - allow model files
- # model_assets/*
- # !model_assets/.gitkeep
-
- # Jupyter checkpoints
- .ipynb_checkpoints/
-
-
- all_files.zip
- content/models/RF_cw.joblib
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # Virtual Environment
+ venv/
+ env/
+ ENV/
+ .venv
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+ *.ipynb
+
+ # Model files (if too large, use Git LFS or exclude)
+ # Uncomment if models are too large for GitHub
+ # *.joblib
+ # content/models/*.joblib
+ # model_assets/*.joblib
+
+ # Data files (usually too large)
+ # content/cardio_train_extended.csv
+
+ # Logs
+ *.log
+ optimization_log.txt
+ optimization_v2_log.txt
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Docker
+ .dockerignore
+
+ # Streamlit
+ .streamlit/secrets.toml
+
+ # Temporary files
+ *.tmp
+ *.bak
+ *.swp
```
COLAB_COMPARISON.md ADDED
@@ -0,0 +1,226 @@
# Google Colab Time Estimate & Setup Guide

## ⏱️ Time Comparison

### Current Local Setup (Docker)
- **CPUs:** 2 cores
- **Memory:** 4 GB
- **Total Time:** ~24.4 hours
  - XGBoost: ~2.9 hours
  - CatBoost: ~12.5 hours
  - LightGBM: ~9.0 hours

---

## 🆓 Google Colab Free Tier (CPU Only)

### Specifications
- **CPUs:** 1-2 cores (variable, shared resources)
- **Memory:** ~12.7 GB RAM
- **GPU:** None
- **Session Timeout:** 12 hours (disconnects after inactivity)

### Estimated Time
- **Total:** ~30.5 hours (~25% slower than local)
  - XGBoost: ~3.7 hours
  - CatBoost: ~15.6 hours
  - LightGBM: ~11.3 hours

### ⚠️ Limitations
- **May timeout before completion** (12-hour limit)
- Slower due to shared resources
- May need to restart and resume from checkpoints

---

## 🎮 Google Colab Free Tier + GPU (T4)

### Specifications
- **CPUs:** 1-2 cores
- **Memory:** ~12.7 GB RAM
- **GPU:** NVIDIA T4 (16 GB)
- **Session Timeout:** 12 hours

### Estimated Time
- **Total:** ~18.0 hours (26% faster than local)
  - XGBoost: ~1.9 hours (50% faster with GPU)
  - CatBoost: ~9.6 hours (30% faster with GPU)
  - LightGBM: ~6.4 hours (40% faster with GPU)

### ⚠️ Limitations
- **May timeout before completion** (12-hour limit)
- GPU availability not guaranteed (may need to wait)
- Requires code modifications for GPU support

---

## 💎 Google Colab Pro ($10/month)

### Specifications
- **CPUs:** 2-4 cores (better allocation)
- **Memory:** ~32 GB RAM
- **GPU:** Better GPU access (T4/V100)
- **Session Timeout:** 24 hours
- **Background Execution:** Yes

### Estimated Time (CPU)
- **Total:** ~20.4 hours (17% faster than local)
  - XGBoost: ~2.4 hours
  - CatBoost: ~10.4 hours
  - LightGBM: ~7.5 hours

### Estimated Time (with GPU)
- **Total:** ~15.0 hours (39% faster than local)
  - XGBoost: ~1.6 hours
  - CatBoost: ~8.0 hours
  - LightGBM: ~5.4 hours

### ✅ Advantages
- Longer session time (24 hours)
- Background execution (can close browser)
- Better resource allocation
- More reliable GPU access

---

## 📊 Summary Table

| Platform | CPUs | GPU | Total Time | Cost | Session Limit |
|----------|------|-----|------------|------|---------------|
| **Local (Docker)** | 2 | No | ~24.4 hrs | Free | None |
| **Colab Free (CPU)** | 1-2 | No | ~30.5 hrs | Free | 12 hrs ⚠️ |
| **Colab Free (GPU)** | 1-2 | T4 | ~18.0 hrs | Free | 12 hrs ⚠️ |
| **Colab Pro (CPU)** | 2-4 | No | ~20.4 hrs | $10/mo | 24 hrs |
| **Colab Pro (GPU)** | 2-4 | T4/V100 | ~15.0 hrs | $10/mo | 24 hrs |

---

## 🚀 Setting Up for Google Colab

### 1. Enable GPU (if using)
In Colab, go to: **Runtime → Change runtime type → Hardware accelerator → GPU**

### 2. Install Dependencies
```python
!pip install xgboost catboost lightgbm optuna pandas numpy scikit-learn joblib
```

### 3. Upload Data
```python
from google.colab import files
# Upload cardio_train_extended.csv
uploaded = files.upload()
```

### 4. Modify Code for GPU Support

You'll need to modify `improve_models.py` to enable GPU:

**For XGBoost:**
```python
# XGBoost < 2.0 uses tree_method='gpu_hist'; XGBoost >= 2.0 uses tree_method='hist' plus device='cuda'
xgb_params = {
    'tree_method': 'gpu_hist',  # use 'hist' on XGBoost >= 2.0
    'device': 'cuda',           # read by XGBoost >= 2.0
    # ... other parameters
}
```

**For CatBoost:**
```python
cat_params = {
    'task_type': 'GPU',  # Enable GPU
    'devices': '0',      # Use first GPU
    # ... other parameters
}
```

**For LightGBM:**
```python
# Requires a GPU-enabled LightGBM build
lgb_params = {
    'device': 'gpu',  # Enable GPU
    'gpu_platform_id': 0,
    'gpu_device_id': 0,
    # ... other parameters
}
```

### 5. Handle Session Timeouts

For long-running training, save checkpoints:

```python
import pickle

import optuna

# Save study state periodically
# (register via study.optimize(..., callbacks=[save_checkpoint]))
def save_checkpoint(study, trial):
    if trial.number % 50 == 0:
        with open('study_checkpoint.pkl', 'wb') as f:
            pickle.dump(study, f)

# Load checkpoint if resuming
try:
    with open('study_checkpoint.pkl', 'rb') as f:
        study = pickle.load(f)
except FileNotFoundError:
    study = optuna.create_study(...)
```
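The same save-and-resume pattern works for any picklable object. A self-contained sketch of the idea, using a plain dict in place of the Optuna study (file name and fields are illustrative):

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.gettempdir(), "study_checkpoint_demo.pkl")

# Save a checkpoint (stands in for pickling the Optuna study).
state = {"completed_trials": 50, "best_score": 0.84}
with open(path, "wb") as f:
    pickle.dump(state, f)

# Resume: load the checkpoint if it exists, otherwise start fresh.
try:
    with open(path, "rb") as f:
        resumed = pickle.load(f)
except FileNotFoundError:
    resumed = {"completed_trials": 0, "best_score": None}

print(resumed["completed_trials"])  # -> 50
os.remove(path)
```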

---

## 💡 Recommendations

### Best Option: **Colab Pro + GPU**
- ✅ Fastest completion (~15 hours)
- ✅ 24-hour session limit (enough time)
- ✅ Background execution
- ✅ Most reliable

### Budget Option: **Colab Free + GPU**
- ✅ Free
- ✅ Faster than local (~18 hours)
- ⚠️ May timeout (12-hour limit)
- ⚠️ Need to implement checkpointing

### Local Option: **Keep Current Setup**
- ✅ No cost
- ✅ No timeouts
- ✅ Full control
- ⚠️ Slower (~24 hours)

---

## 📝 Important Notes

1. **GPU Acceleration:** Requires code modifications to enable GPU support in XGBoost, CatBoost, and LightGBM
2. **Session Limits:** Free tier has 12-hour limits - may need to restart
3. **Resource Availability:** Colab resources vary - actual times may differ
4. **Checkpointing:** Essential for long runs on free tier
5. **Data Upload:** Need to upload dataset to Colab (or use Google Drive)

---

## 🔧 Quick Colab Setup Script

```python
# Run this in a Colab cell
!pip install xgboost catboost lightgbm optuna pandas numpy scikit-learn joblib

# Enable GPU (if available)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# Upload your data file
from google.colab import files
uploaded = files.upload()

# Then run your improve_models.py script
# (with GPU modifications)
```

---

**Last Updated:** November 9, 2025
COMMIT_GUIDE.md ADDED
@@ -0,0 +1,93 @@
# 📤 Quick Commit Guide for GitHub Desktop

## ✅ Good News!
Your repository is already connected to: `https://github.com/kbssrikar7/heart-attack-risk-ensemble.git`

## 📋 Files Ready to Commit

### Modified Files (need to be staged):
- ✅ `streamlit_app.py` - Updated with all fixes
- ✅ `requirements.txt` - Updated dependencies
- ✅ `model_assets/hybrid_metrics.csv` - Updated metrics

### New Files to Add:
- ✅ `TEST_CASES.md` - 8 test cases
- ✅ `DEPLOYMENT_CHECKLIST.md` - Deployment verification
- ✅ `DEPLOYMENT_OPTIONS.md` - Deployment options guide
- ✅ `GITHUB_SETUP.md` - GitHub setup guide
- ✅ `COLAB_COMPARISON.md` - Colab comparison
- ✅ `COMPLETION_ESTIMATE.md` - Completion estimates
- ✅ `DOCKER_OPTIMIZATION.md` - Docker optimization guide
- ✅ `DOCKER_README.md` - Docker readme
- ✅ `IMPROVEMENTS.md` - Improvements documentation
- ✅ `Dockerfile.optimization` - Optimization Dockerfile

## 🎯 Steps in GitHub Desktop

### Step 1: Open GitHub Desktop
1. Launch GitHub Desktop
2. It should automatically detect your repository at:
   `/home/kbs/Documents/heart-attack-risk-ensemble`

### Step 2: Review Changes
1. You'll see all modified and new files in the left panel
2. Review each file to make sure everything looks good

### Step 3: Stage All Files
1. Click the checkbox next to "Changes" to select all files
2. Or manually select the files you want to commit

### Step 4: Write Commit Message
**Summary:**
```
Complete Heart Attack Risk Prediction App - Ready for Deployment
```

**Description:**
```
- Updated Streamlit app with optimized ensemble models
- Added all 3 models: XGBoost, CatBoost, LightGBM
- Fixed feature alignment and UI display
- Added comprehensive test cases (8 test scenarios)
- Created deployment documentation
- Models: 80.77% accuracy, 93.27% recall
- Ensemble weights: XGB 5%, CAT 85%, LGB 10%
- Ready for Hugging Face Spaces deployment
```

### Step 5: Commit
1. Click **"Commit to main"** button
2. Wait for commit to complete

### Step 6: Push to GitHub
1. Click **"Push origin"** button (top right)
2. Wait for push to complete
3. Verify on GitHub.com
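If you prefer the terminal over GitHub Desktop, the same flow is a handful of git commands. The sketch below runs in a throwaway repository purely for illustration; in the real repo you would just run the `git add` / `git commit` / `git push` lines directly:

```shell
# Illustration only: create a scratch repo, stage everything, and commit.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "you@example.com"
git config user.name "Your Name"
echo "placeholder" > streamlit_app.py
git add -A
git commit -q -m "Complete Heart Attack Risk Prediction App - Ready for Deployment"
git log --oneline -1
# In the real repo, finish with: git push origin main
```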

## ✅ Verify on GitHub

After pushing, check:
1. Go to: https://github.com/kbssrikar7/heart-attack-risk-ensemble
2. Verify all files are there
3. Check that model files are uploaded (should be ~15MB each)

## 🚀 Next: Deploy to Hugging Face

Once code is on GitHub:
1. Go to https://huggingface.co/spaces
2. Click "Create new Space"
3. Select "Streamlit"
4. Connect your GitHub repo
5. Deploy!

---

## 📊 File Sizes (All Good!)
- ✅ Largest model: 15MB (under 100MB limit)
- ✅ Total model assets: 44MB
- ✅ All files can be committed to GitHub

## ⚠️ Note
- Make sure repository is **Public** (required for free Hugging Face Spaces)
- If it's private, you'll need Hugging Face Pro
COMPLETION_ESTIMATE.md ADDED
@@ -0,0 +1,59 @@
# ⏱️ Training Completion Time Estimate

## Current Status

**Last Updated:** $(date)

### Progress Summary

| Model | Status | Progress | Time Remaining |
|-------|--------|----------|----------------|
| **XGBoost** | ✅ COMPLETED | 300/300 (100%) | - |
| **CatBoost** | 🔄 IN PROGRESS | 16/300 (5.3%) | ~6 hours |
| **LightGBM** | ⏳ WAITING | 0/300 (0%) | ~4.7 hours |
| **Final Eval** | ⏳ WAITING | - | ~15 minutes |

## Time Breakdown

### CatBoost Optimization
- **Current:** Trial 16/300
- **Remaining:** 284 trials
- **Average time per trial:** ~1.25 minutes (≈75 seconds)
- **Estimated remaining:** ~356 minutes (~6 hours)

### LightGBM Optimization
- **Total trials:** 300
- **Estimated time per trial:** ~0.94 minutes (≈56 seconds, 25% faster than CatBoost)
- **Estimated total:** ~282 minutes (~4.7 hours)

### Final Evaluation
- **Estimated time:** ~15 minutes

## Total Estimate

**Total Remaining Time:** ~653 minutes (~10.9 hours)

**Estimated Completion:** Approximately **10-11 hours** from now

*(Note: Actual completion time may vary based on trial complexity and system performance)*
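The total above is just the sum of the per-model estimates; it can be reproduced from the per-trial averages in the tables (small rounding differences from the ~653 figure are expected):

```python
# Remaining time, using the per-trial averages listed above (in minutes).
catboost_remaining = (300 - 16) * 1.25  # 284 trials at ~1.25 min each
lightgbm_total = 300 * 0.94             # 300 trials at ~0.94 min each
final_eval = 15                         # final evaluation

total_minutes = catboost_remaining + lightgbm_total + final_eval
print(round(total_minutes), "minutes, ~", round(total_minutes / 60, 1), "hours")
# ≈652 minutes, ~10.9 hours
```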

## How to Check Progress

Run these commands to monitor progress:
```bash
# Quick status check
./check_training.sh

# Watch live logs
docker logs -f heart-optimization-v2

# Check container stats
docker stats heart-optimization-v2
```

## Current Best Scores

- **XGBoost Best:** 0.842463 (Trial #224)
- **CatBoost Best:** 0.837881 (Trial #15) *[in progress]*
- **LightGBM Best:** TBD
DEPLOYMENT_CHECKLIST.md ADDED
@@ -0,0 +1,97 @@
# ✅ Final Deployment Checklist

## 📋 Pre-Deployment Verification

### ✅ Code Quality
- [x] All Python files compile without syntax errors
- [x] No linter errors in streamlit_app.py
- [x] All imports are correct and available
- [x] Error handling is in place

### ✅ Model Files
- [x] XGBoost_optimized.joblib exists in content/models/ or model_assets/
- [x] CatBoost_optimized.joblib exists in content/models/ or model_assets/
- [x] LightGBM_optimized.joblib exists in content/models/ or model_assets/
- [x] ensemble_info_optimized.json exists with correct weights
- [x] model_metrics_optimized.csv exists with ensemble metrics

### ✅ Configuration
- [x] Ensemble weights: XGBoost 5%, CatBoost 85%, LightGBM 10%
- [x] Ensemble metrics: Accuracy 80.77%, Recall 93.27%
- [x] requirements.txt includes all dependencies
- [x] Page title and subtitle are correct

### ✅ UI Elements
- [x] Page title: "Predicting Heart Attack Risk: An Ensemble Modeling Approach"
- [x] Subtitle includes: "XGBoost, CatBoost, and LightGBM"
- [x] Sidebar displays optimized ensemble weights correctly
- [x] Sidebar shows Accuracy: 80.77% and Recall: 93.27%
- [x] All input fields are present and functional
- [x] Prediction button works correctly
- [x] Results display with proper formatting

### ✅ Model Display
- [x] All 4 models displayed horizontally: XGBoost, CatBoost, LightGBM, Ensemble
- [x] Each model shows progress bar with percentage inside
- [x] Risk percentage displayed below each bar
- [x] Color coding: Green (low), Orange (moderate), Red (high)
- [x] Ensemble metrics section shows Accuracy and Recall

### ✅ Functionality
- [x] Feature engineering works correctly
- [x] One-hot encoding matches training data
- [x] CatBoost feature alignment is correct
- [x] LightGBM feature alignment is correct
- [x] XGBoost predictions work
- [x] Ensemble prediction uses correct weights
- [x] Risk factors are identified correctly
- [x] Recommendations match risk level

### ✅ Test Cases
- [x] Test Case 1 (Low Risk) - Verified: Ensemble shows ~3.43% (correct)
- [x] LightGBM behavior documented (may show 20-25% for low risk, but ensemble correct)
- [x] All test cases documented in TEST_CASES.md

### ✅ Error Handling
- [x] App handles missing models gracefully
- [x] Invalid inputs show appropriate warnings
- [x] Error messages are user-friendly
- [x] CatBoost feature mismatch errors are handled

### ✅ Documentation
- [x] TEST_CASES.md created with 8 test cases
- [x] Deployment checklist created
- [x] Notes about LightGBM behavior documented

## 🚀 Deployment Ready

### Files to Deploy:
1. `streamlit_app.py` - Main application
2. `requirements.txt` - Dependencies
3. `content/models/` or `model_assets/` - Model files and configs
4. `TEST_CASES.md` - Test documentation

### Key Points:
- ✅ All models load correctly
- ✅ Ensemble weights are optimized (5%, 85%, 10%)
- ✅ UI displays all 4 models horizontally
- ✅ Predictions work correctly
- ✅ LightGBM behavior is expected (higher individual values, but ensemble correct)

## 📊 Expected Behavior

### For Low Risk Patient (Test Case 1):
- XGBoost: ~6-7%
- CatBoost: ~1-2%
- LightGBM: ~20-25% (expected behavior)
- **Ensemble: ~3-4%** ✅ (correct due to weighting)
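Why the ensemble stays low even though LightGBM reads high follows directly from the weights; a quick check using illustrative mid-range values from the list above (the exact per-model probabilities will vary run to run):

```python
# Optimized ensemble weights (XGBoost 5%, CatBoost 85%, LightGBM 10%).
weights = {"XGBoost": 0.05, "CatBoost": 0.85, "LightGBM": 0.10}
# Illustrative per-model risk probabilities for the low-risk patient.
probs = {"XGBoost": 0.065, "CatBoost": 0.015, "LightGBM": 0.225}

# Weighted average: LightGBM's ~22.5% only contributes 10% of the result.
ensemble = sum(weights[m] * probs[m] for m in weights)
print(f"Ensemble risk: {ensemble:.1%}")
```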

### Sidebar Display:
- Ensemble weights: XGBoost 5.0% | CatBoost 85.0% | LightGBM 10.0%
- Accuracy: 80.77%
- Recall: 93.27%

## ✅ Final Status: READY FOR DEPLOYMENT

All checks passed. The application is ready for deployment to Hugging Face Spaces or any other platform.
DEPLOYMENT_OPTIONS.md ADDED
@@ -0,0 +1,168 @@
# 🚀 Deployment Options Guide

## Option 1: Hugging Face Spaces (Recommended - Easiest) ✅

### ✅ **NO Docker Needed**
Hugging Face Spaces automatically handles the environment using `requirements.txt`.

### Steps:
1. Push code to GitHub
2. Go to https://huggingface.co/spaces
3. Create new Space → Select "Streamlit"
4. Connect your GitHub repo
5. Done! Hugging Face handles everything.

### Files Needed:
- ✅ `streamlit_app.py`
- ✅ `requirements.txt`
- ✅ `model_assets/` or `content/models/` (with model files)
- ✅ `.streamlit/config.toml` (optional)
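If you do include the optional config file, a minimal example (these are standard Streamlit `config.toml` options; the values are just a sensible default, not a Spaces requirement):

```toml
# .streamlit/config.toml
[server]
headless = true
port = 8501

[browser]
gatherUsageStats = false
```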

### Pros:
- ✅ Free
- ✅ No Docker needed
- ✅ Easy setup
- ✅ Automatic HTTPS
- ✅ Community-friendly

---

## Option 2: Render (Self-Hosted with Docker) 🐳

### ✅ **YES - Docker Required**
Render uses Docker for deployment.

### Steps:
1. Push code to GitHub
2. Go to https://render.com
3. Create Web Service → Select your repo
4. Runtime: **Docker**
5. Render uses your `Dockerfile` automatically

### Files Needed:
- ✅ `Dockerfile` (already created)
- ✅ `render.yaml` (already created)
- ✅ `streamlit_app.py`
- ✅ `requirements.txt`
- ✅ `model_assets/` (with model files)

### Pros:
- ✅ Free tier available
- ✅ Custom domain support
- ✅ More control
- ✅ Docker ensures consistency

### Cons:
- ⚠️ Free tier: App sleeps after 15 min inactivity
- ⚠️ First request after sleep takes ~30 seconds

---

## Option 3: AWS/GCP/Azure (Self-Hosted with Docker) ☁️

### ✅ **YES - Docker Recommended**
For cloud platforms, Docker provides consistency.

### Steps:
1. Build Docker image: `docker build -t heart-app .`
2. Push to container registry (ECR, GCR, ACR)
3. Deploy to container service (ECS, Cloud Run, Container Instances)

### Pros:
- ✅ Full control
- ✅ Scalable
- ✅ Production-ready

### Cons:
- ⚠️ Costs money (usually)
- ⚠️ More complex setup

---

## Option 4: Local Server (Self-Hosted with Docker) 🖥️

### ✅ **YES - Docker Recommended**
For your own server/VPS.

### Steps:
1. Build: `docker build -t heart-app .`
2. Run: `docker run -d -p 8501:8501 heart-app`
3. Access: `http://your-server-ip:8501`

### Pros:
- ✅ Full control
- ✅ No external dependencies
- ✅ Can be free (if you own the server)

---

## 📊 Comparison Table

| Platform | Docker Needed? | Difficulty | Cost | Best For |
|----------|---------------|------------|------|----------|
| **Hugging Face Spaces** | ❌ No | ⭐ Easy | Free | Quick deployment, sharing |
| **Render** | ✅ Yes | ⭐⭐ Medium | Free/Paid | Self-hosting, custom domain |
| **AWS/GCP/Azure** | ✅ Yes | ⭐⭐⭐ Hard | Paid | Production, scaling |
| **Local Server** | ✅ Yes | ⭐⭐ Medium | Free* | Full control, privacy |

*Free if you own the server

---

## 🎯 Recommendation

### For Quick Deployment:
**Use Hugging Face Spaces** - No Docker needed, easiest option.

### For Self-Hosting:
**Use Render with Docker** - Your `Dockerfile` is already ready!

---

## ✅ Your Dockerfile Status

Your `Dockerfile` is **ready to use** and includes:
- ✅ Python 3.11 base image
- ✅ All system dependencies
- ✅ All Python packages from requirements.txt
- ✅ Streamlit app configured
- ✅ Model assets copied
- ✅ Port 8501 exposed

**You can use it for:**
- Render deployment
- AWS/GCP/Azure deployment
- Local server deployment
- Testing locally

---

## 🚀 Quick Start Commands

### Test Docker Locally:
```bash
# Build image
docker build -t heart-app .

# Run container
docker run -p 8501:8501 heart-app

# Access at http://localhost:8501
```

### Deploy to Render:
1. Push to GitHub
2. Connect repo to Render
3. Select "Docker" runtime
4. Done!

---

## 📝 Summary

**Answer:**
- **Hugging Face Spaces**: NO Docker needed ✅
- **Self-hosting (Render/AWS/etc.)**: YES, use Docker ✅

Your Dockerfile is ready if you want to self-host!
DOCKER_OPTIMIZATION.md ADDED
@@ -0,0 +1,294 @@
# Running Model Optimization with Docker

This guide shows you how to run the model optimization scripts using Docker.

## Prerequisites

- Docker installed and running
- Docker Compose (usually comes with Docker Desktop)
- At least 8GB RAM available for Docker
- Data file: `content/cardio_train_extended.csv`

## Quick Start

### Option 1: Using Docker Compose (Recommended)

```bash
# Build and run optimization
docker-compose -f docker-compose.optimization.yml up --build

# Run in detached mode (background)
docker-compose -f docker-compose.optimization.yml up -d --build

# View logs
docker-compose -f docker-compose.optimization.yml logs -f

# Stop when done
docker-compose -f docker-compose.optimization.yml down
```

### Option 2: Using Docker Directly

```bash
# Build the image
docker build -f Dockerfile.optimization -t heart-optimization .

# Run optimization
docker run --rm \
  -v "$(pwd)/content:/app/content" \
  -v "$(pwd)/model_assets:/app/model_assets:ro" \
  --name heart-optimization \
  heart-optimization

# Run with resource limits
docker run --rm \
  -v "$(pwd)/content:/app/content" \
  -v "$(pwd)/model_assets:/app/model_assets:ro" \
  --cpus="4" \
  --memory="8g" \
  --name heart-optimization \
  heart-optimization
```

## Running Specific Scripts

### Run Model Optimization Only

```bash
docker-compose -f docker-compose.optimization.yml run --rm optimization python improve_models.py
```

### Run Feature Analysis Only

```bash
docker-compose -f docker-compose.optimization.yml run --rm optimization python feature_importance_analysis.py
```

### Run Comparison

```bash
docker-compose -f docker-compose.optimization.yml run --rm optimization python compare_models.py
```

## Customization

### Adjust Resource Limits

Edit `docker-compose.optimization.yml`:

```yaml
deploy:
  resources:
    limits:
      cpus: '8'    # Use more CPUs if available
      memory: 16G  # More RAM for faster processing
```

### Reduce Optimization Time

Edit `improve_models.py` before building:

```python
n_trials = 50  # Reduce from 100 to 50 for faster results
```

Or override at runtime:

```bash
docker run --rm \
  -v "$(pwd)/content:/app/content" \
  -v "$(pwd)/improve_models.py:/app/improve_models.py" \
  heart-optimization python -c "
import sys
sys.path.insert(0, '/app')
# Modify n_trials here or use an environment variable
exec(open('/app/improve_models.py').read().replace('n_trials = 100', 'n_trials = 50'))
"
```

### Use Environment Variables

Create a `.env` file:

```env
N_TRIALS=50
STUDY_TIMEOUT=1800
```

Then use it:

```bash
docker-compose -f docker-compose.optimization.yml --env-file .env up
```
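Note that the variable only takes effect if the script actually reads it; a one-line pattern you could add near the top of `improve_models.py` (assuming the script currently hard-codes `n_trials = 100`):

```python
import os

# Fall back to 100 trials when N_TRIALS is not set in the environment.
n_trials = int(os.environ.get("N_TRIALS", "100"))
print(f"Running {n_trials} Optuna trials")
```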

## Monitoring Progress

### View Real-time Logs

```bash
# Using docker-compose
docker-compose -f docker-compose.optimization.yml logs -f

# Using docker
docker logs -f heart-optimization
```

### Check Container Status

```bash
docker ps
docker stats heart-optimization
```

## Results Location

All results are saved to your host machine in:
- `content/models/` - Optimized models and metrics
- `content/reports/` - Feature importance visualizations

These persist after the container stops.

## Troubleshooting

### Out of Memory

**Error:** `Killed` or memory errors

**Solution:**
1. Reduce `n_trials` in `improve_models.py`
2. Increase the memory limit in docker-compose.optimization.yml
3. Close other applications

### Build Fails

**Error:** Package installation fails

**Solution:**
```bash
# Clean build
docker-compose -f docker-compose.optimization.yml build --no-cache
```

### Data Not Found

**Error:** `Data file not found`

**Solution:**
```bash
# Verify data file exists
ls -lh content/cardio_train_extended.csv

# Check volume mount
docker-compose -f docker-compose.optimization.yml config
```

### Slow Performance

**Solutions:**
1. Increase CPU allocation in docker-compose.yml
2. Use fewer trials: `n_trials = 30`
3. Run on a machine with more resources

## Advanced Usage

### Interactive Shell

```bash
# Get a shell in the container
docker-compose -f docker-compose.optimization.yml run --rm optimization bash

# Then run scripts manually
python improve_models.py
```

### Run Multiple Optimizations

```bash
# Run optimization with different trial counts
# ($trials expands in the double-quoted -c string before Python runs)
for trials in 30 50 100; do
  docker run --rm \
    -v "$(pwd)/content:/app/content" \
    -e N_TRIALS=$trials \
    heart-optimization \
    python -c "import sys; sys.path.insert(0, '/app'); exec(open('/app/improve_models.py').read().replace('n_trials = 100', 'n_trials = $trials'))"
done
```

### Save Container State

```bash
# Commit container to image
docker commit heart-optimization heart-optimization:snapshot

# Use later
docker run --rm -v "$(pwd)/content:/app/content" heart-optimization:snapshot
```

## Performance Tips

1. **Use SSD storage** - Faster I/O for data loading
2. **Allocate more CPUs** - Parallel processing in Optuna
3. **Increase memory** - Better for large datasets
4. **Run overnight** - Let it run while you sleep
5. **Use GPU** (if available) - Requires NVIDIA Docker runtime

## GPU Support (Optional)

If you have an NVIDIA GPU:

```yaml
# Add to docker-compose.optimization.yml
runtime: nvidia
environment:
  - NVIDIA_VISIBLE_DEVICES=all
```

Then build with:
```bash
docker build -f Dockerfile.optimization -t heart-optimization .
```

## Example Workflow

```bash
# 1. Build image
docker-compose -f docker-compose.optimization.yml build

# 2. Run optimization (takes 1-2 hours)
docker-compose -f docker-compose.optimization.yml up

# 3. In another terminal, check progress
docker-compose -f docker-compose.optimization.yml logs -f

# 4. When done, run feature analysis
docker-compose -f docker-compose.optimization.yml run --rm optimization \
  python feature_importance_analysis.py

# 5. Compare results
docker-compose -f docker-compose.optimization.yml run --rm optimization \
  python compare_models.py

# 6. Clean up
docker-compose -f docker-compose.optimization.yml down
```

## Benefits of Using Docker

✅ **Isolation** - No conflicts with your system Python
✅ **Reproducibility** - Same environment every time
✅ **Resource Control** - Limit CPU/memory usage
✅ **Easy Cleanup** - Remove container when done
✅ **Portability** - Run on any machine with Docker

## Next Steps

After optimization completes:
1. Check results in `content/models/model_metrics_optimized.csv`
2. Review feature importance in `content/reports/`
3. Compare with baseline using `compare_models.py`
4. Deploy optimized models to your Streamlit app

---

**Note:** The optimization process can take 1-2 hours. Make sure your laptop is plugged in and won't go to sleep!
DOCKER_README.md ADDED
@@ -0,0 +1,179 @@
+ # 🐳 Running Optimization with Docker
+ 
+ Yes! You can absolutely use Docker to run the optimization code. This is actually **recommended** because:
+ 
+ ✅ **Isolated environment** - No conflicts with your system Python
+ ✅ **Reproducible** - Same results every time
+ ✅ **Easy cleanup** - Just remove the container when done
+ ✅ **Resource control** - Limit CPU/memory usage
+ 
+ ## Quick Start (3 Commands)
+ 
+ ```bash
+ # 1. Make the script executable (one time)
+ chmod +x run_optimization_docker.sh
+ 
+ # 2. Run optimization
+ ./run_optimization_docker.sh
+ 
+ # 3. That's it! Results are saved to content/models/
+ ```
+ 
+ ## What Gets Created
+ 
+ The Docker setup includes:
+ 
+ 1. **`Dockerfile.optimization`** - Docker image definition
+ 2. **`docker-compose.optimization.yml`** - Easy container management
+ 3. **`run_optimization_docker.sh`** - One-command runner script
+ 4. **`DOCKER_OPTIMIZATION.md`** - Detailed documentation
+ 
+ ## Simple Usage Examples
+ 
+ ### Run Full Optimization
+ ```bash
+ ./run_optimization_docker.sh
+ ```
+ Takes ~1-2 hours (100 trials per model).
+ 
+ ### Faster Run (50 trials)
+ ```bash
+ ./run_optimization_docker.sh --trials 50
+ ```
+ Takes ~30-60 minutes.
+ 
+ ### Run Feature Analysis
+ ```bash
+ ./run_optimization_docker.sh --script feature_importance_analysis.py
+ ```
+ Takes ~5-10 minutes.
+ 
+ ### Compare Results
+ ```bash
+ ./run_optimization_docker.sh --script compare_models.py
+ ```
+ 
+ ## Using Docker Compose
+ 
+ If you prefer docker-compose:
+ 
+ ```bash
+ # Build and run
+ docker-compose -f docker-compose.optimization.yml up --build
+ 
+ # View logs
+ docker-compose -f docker-compose.optimization.yml logs -f
+ 
+ # Stop when done
+ docker-compose -f docker-compose.optimization.yml down
+ ```
+ 
+ ## Using Docker Directly
+ 
+ ```bash
+ # Build image
+ docker build -f Dockerfile.optimization -t heart-optimization .
+ 
+ # Run optimization
+ docker run --rm \
+     -v "$(pwd)/content:/app/content" \
+     -v "$(pwd)/model_assets:/app/model_assets:ro" \
+     heart-optimization
+ ```
+ 
+ ## Results Location
+ 
+ All results are automatically saved to your host machine:
+ - `content/models/model_metrics_optimized.csv` - Performance metrics
+ - `content/models/*_optimized.joblib` - Optimized models
+ - `content/models/ensemble_info_optimized.json` - Ensemble configuration
+ - `content/reports/` - Feature importance visualizations
+ 
+ ## Resource Requirements
+ 
+ **Minimum:**
+ - 4GB RAM
+ - 2 CPU cores
+ - 5GB disk space
+ 
+ **Recommended:**
+ - 8GB RAM
+ - 4 CPU cores
+ - 10GB disk space
+ 
+ ## Time Estimates
+ 
+ | Configuration | Time |
+ |--------------|------|
+ | 30 trials | ~20-30 min |
+ | 50 trials | ~30-60 min |
+ | 100 trials | ~1-2 hours |
+ | 200 trials | ~2-4 hours |
+ 
+ ## Troubleshooting
+ 
+ ### Docker not running
+ ```bash
+ # Check Docker status
+ docker info
+ 
+ # Start Docker Desktop (macOS/Windows)
+ # Or: sudo systemctl start docker (Linux)
+ ```
+ 
+ ### Out of memory
+ ```bash
+ # Reduce trials
+ ./run_optimization_docker.sh --trials 30
+ 
+ # Or reduce the timeout
+ STUDY_TIMEOUT=1800 ./run_optimization_docker.sh
+ ```
+ 
+ ### Data file not found
+ ```bash
+ # Verify the data exists
+ ls -lh content/cardio_train_extended.csv
+ ```
+ 
+ ## Advanced Options
+ 
+ ### Custom Resource Limits
+ Edit `docker-compose.optimization.yml`:
+ ```yaml
+ deploy:
+   resources:
+     limits:
+       cpus: '8'    # Use more CPUs
+       memory: 16G  # More RAM
+ ```
+ 
+ ### Environment Variables
+ ```bash
+ N_TRIALS=50 STUDY_TIMEOUT=1800 ./run_optimization_docker.sh
+ ```
+ 
+ ### Interactive Shell
+ ```bash
+ docker-compose -f docker-compose.optimization.yml run --rm optimization bash
+ ```
+ 
+ ## Next Steps
+ 
+ 1. ✅ Run `./run_optimization_docker.sh`
+ 2. ✅ Wait for completion (1-2 hours)
+ 3. ✅ Check results in `content/models/`
+ 4. ✅ Compare with the baseline using `compare_models.py`
+ 5. ✅ Deploy the optimized models
+ 
+ ## Full Documentation
+ 
+ For detailed instructions, see:
+ - **[DOCKER_OPTIMIZATION.md](DOCKER_OPTIMIZATION.md)** - Complete Docker guide
+ - **[QUICK_START.md](QUICK_START.md)** - General quick start
+ - **[IMPROVEMENTS.md](IMPROVEMENTS.md)** - Improvement details
+ 
+ ---
+ 
+ **Pro Tip:** Run the optimization overnight or during a lunch break. The container will save all results automatically!
+ 
Dockerfile.optimization ADDED
@@ -0,0 +1,33 @@
+ FROM python:3.11-slim
+ 
+ # Prevent Python from writing .pyc files and buffering stdout/stderr
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1 \
+     PIP_NO_CACHE_DIR=1
+ 
+ WORKDIR /app
+ 
+ # System deps for lightgbm, xgboost, catboost (build and runtime)
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     libgomp1 \
+     curl \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+ 
+ # Copy dependency list and install
+ COPY requirements.txt /app/requirements.txt
+ RUN pip install --upgrade pip \
+     && pip install -r requirements.txt
+ 
+ # Copy optimization scripts
+ COPY improve_models.py /app/improve_models.py
+ COPY feature_importance_analysis.py /app/feature_importance_analysis.py
+ COPY compare_models.py /app/compare_models.py
+ 
+ # Create output directories (content/ is mounted as a volume at runtime)
+ RUN mkdir -p /app/content/models /app/content/reports
+ 
+ # Default command: run optimization
+ CMD ["python", "improve_models.py"]
+ 
GITHUB_SETUP.md ADDED
@@ -0,0 +1,129 @@
+ # 📤 GitHub Setup Guide for Hugging Face Deployment
+ 
+ ## Step 1: Initialize Git Repository (if not done)
+ 
+ If you haven't initialized git yet, run:
+ ```bash
+ cd /home/kbs/Documents/heart-attack-risk-ensemble
+ git init
+ ```
+ 
+ ## Step 2: Using GitHub Desktop
+ 
+ ### Option A: Clone Existing Repository
+ 1. Open GitHub Desktop
+ 2. Click "File" → "Clone Repository"
+ 3. If you already created a repo on GitHub.com:
+    - Select the "GitHub.com" tab
+    - Choose your repository
+    - Click "Clone"
+ 
+ ### Option B: Create New Repository
+ 1. Open GitHub Desktop
+ 2. Click "File" → "New Repository"
+ 3. Fill in:
+    - **Name**: `heart-attack-risk-ensemble` (or your choice)
+    - **Description**: "Heart Attack Risk Prediction using Ensemble ML Models"
+    - **Local Path**: `/home/kbs/Documents/heart-attack-risk-ensemble`
+    - **Initialize with README**: ✅ Check this
+    - **Git Ignore**: Python
+    - **License**: MIT (optional)
+ 4. Click "Create Repository"
+ 
+ ## Step 3: Add Files to GitHub Desktop
+ 
+ 1. In GitHub Desktop, you'll see all your files listed
+ 2. Review the changes:
+    - ✅ **Include**: All Python files, requirements.txt, configs, documentation
+    - ✅ **Include**: Model files (if under 100MB each)
+    - ⚠️ **Check**: Large files (>100MB) - GitHub has limits
+ 
+ ### Files to Commit:
+ - ✅ `streamlit_app.py`
+ - ✅ `requirements.txt`
+ - ✅ `Dockerfile`
+ - ✅ `render.yaml`
+ - ✅ `.streamlit/config.toml`
+ - ✅ `TEST_CASES.md`
+ - ✅ `DEPLOYMENT_CHECKLIST.md`
+ - ✅ `DEPLOYMENT_OPTIONS.md`
+ - ✅ `README.md`
+ - ✅ `model_assets/` (with optimized models)
+ - ✅ `content/models/` (if needed)
+ - ✅ `.gitignore`
+ 
+ ## Step 4: Commit Changes
+ 
+ 1. In GitHub Desktop, review all the changes
+ 2. **Summary**: Write a commit message like:
+    ```
+    Initial commit: Heart Attack Risk Prediction App
+    - Streamlit app with ensemble models (XGBoost, CatBoost, LightGBM)
+    - Optimized models with 80.77% accuracy, 93.27% recall
+    - Complete UI with model breakdown
+    - Test cases and deployment documentation
+    ```
+ 3. **Description** (optional): Add more details
+ 4. Click **"Commit to main"** (or your branch name)
+ 
+ ## Step 5: Publish to GitHub
+ 
+ 1. Click the **"Publish repository"** button (top right)
+ 2. If creating a new repo:
+    - **Keep code private**: Uncheck (make it public for Hugging Face)
+    - **Add description**: "Heart Attack Risk Prediction using Ensemble ML Models"
+ 3. Click **"Publish Repository"**
+ 
+ ## Step 6: Verify on GitHub.com
+ 
+ 1. Go to https://github.com/YOUR_USERNAME/heart-attack-risk-ensemble
+ 2. Verify all files are there
+ 3. Check that model files are uploaded (if they're not too large)
+ 
+ ## ⚠️ Important Notes
+ 
+ ### File Size Limits:
+ - **GitHub**: 100MB per file (hard limit)
+ - **Git LFS**: For files >100MB, use Git LFS
+ - **Model files**: Usually 10-50MB each, should be fine
+ 
+ ### If Models Are Too Large:
+ 1. Use Git LFS:
+    ```bash
+    git lfs install
+    git lfs track "*.joblib"
+    git add .gitattributes
+    ```
+ 2. Or exclude them from git and upload separately to Hugging Face
+ 
+ ### Repository Visibility:
+ - **Public**: Required for Hugging Face Spaces (free tier)
+ - **Private**: Requires Hugging Face Pro for private spaces
+ 
+ ## ✅ Next Steps After GitHub Push
+ 
+ Once your code is on GitHub:
+ 1. Go to https://huggingface.co/spaces
+ 2. Click "Create new Space"
+ 3. Select "Streamlit"
+ 4. Connect your GitHub repository
+ 5. Deploy!
+ 
+ ---
+ 
+ ## 🐛 Troubleshooting
+ 
+ ### GitHub Desktop Not Showing Files:
+ - Make sure you're in the correct directory
+ - Check that the `.git` folder exists
+ - Try refreshing GitHub Desktop
+ 
+ ### Large File Warnings:
+ - If models are too large, use Git LFS or exclude them
+ - Hugging Face can pull models from other sources if needed
+ 
+ ### Commit Fails:
+ - Check file permissions
+ - Make sure you're not committing sensitive files
+ - Review the `.gitignore` file
+ 
IMPROVEMENTS.md ADDED
@@ -0,0 +1,219 @@
+ # Model Improvement Analysis & Recommendations
+ 
+ ## Current Performance Summary
+ 
+ Based on the existing models:
+ 
+ | Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
+ |-------|----------|-----------|--------|-----|---------|
+ | XGBoost_best | 0.849 | 0.853 | 0.843 | 0.848 | 0.925 |
+ | CatBoost_best | 0.851 | 0.857 | 0.842 | 0.849 | 0.925 |
+ | LightGBM_best | 0.851 | 0.857 | 0.843 | 0.850 | 0.925 |
+ | Ensemble_best | 0.850 | 0.855 | 0.843 | 0.849 | 0.925 |
+ 
+ ## Identified Improvement Opportunities
+ 
+ ### 1. **Hyperparameter Optimization** ⭐⭐⭐
+ **Current State:**
+ - Using `RandomizedSearchCV` with limited iterations (20-25)
+ - Limited parameter search spaces
+ - Scoring only on `roc_auc`
+ 
+ **Improvements:**
+ - ✅ **Optuna-based optimization** (implemented in `improve_models.py`)
+   - Tree-structured Parzen Estimator (TPE) sampler
+   - Median pruner for early stopping
+   - 100+ trials per model
+   - Expanded hyperparameter ranges
+ 
+ **Expected Impact:** +1-3% accuracy, +1-2% recall
+ 
+ ### 2. **Multi-Objective Optimization** ⭐⭐⭐
+ **Current State:**
+ - Optimizing only for ROC-AUC
+ - No explicit focus on recall (critical for medical diagnosis)
+ 
+ **Improvements:**
+ - ✅ **Combined scoring function** (0.5 * accuracy + 0.5 * recall)
+ - ✅ **Threshold optimization** for each model
+ - ✅ **Recall-focused tuning**
+ 
+ **Expected Impact:** +2-4% recall improvement
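The combined scoring idea can be sketched in a few lines (an illustration only; the weights shown match the 0.5/0.5 blend above, but the exact scorer in `improve_models.py` may differ):

```python
def combined_score(y_true, y_pred, w_acc=0.5, w_rec=0.5):
    """Blend accuracy and recall into a single tuning objective."""
    n = len(y_true)
    # Accuracy: fraction of correct predictions.
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Recall: fraction of actual positives that were predicted positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(1 for t in y_true if t == 1)
    rec = tp / positives
    return w_acc * acc + w_rec * rec

print(combined_score([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))  # 0.5*0.6 + 0.5*(2/3) ≈ 0.633
```

Passing this as the CV scorer steers tuning toward models that trade a little precision for fewer missed positives.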
+ 
+ ### 3. **Threshold Optimization** ⭐⭐
+ **Current State:**
+ - Using default threshold of 0.5 for all models
+ - No model-specific threshold tuning
+ 
+ **Improvements:**
+ - ✅ **Per-model threshold optimization**
+ - ✅ **Ensemble threshold optimization**
+ - ✅ **Metric-specific threshold tuning** (F1, recall, combined)
+ 
+ **Expected Impact:** +1-3% recall, +0.5-1% accuracy
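A per-model threshold search of the kind described can be sketched as follows (the scan range and the metric are illustrative assumptions, not the exact code in `improve_models.py`):

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def best_threshold(y_true, proba, metric=accuracy):
    """Sweep candidate thresholds and keep the one that maximizes `metric`."""
    best_t, best_score = 0.5, -1.0
    for t in np.arange(0.30, 0.71, 0.01):
        preds = (proba >= t).astype(int)
        score = metric(y_true, preds)
        if score > best_score:
            best_t, best_score = float(t), score
    return best_t, best_score

y_true = np.array([0, 0, 1, 1])
proba = np.array([0.20, 0.45, 0.55, 0.90])
t, s = best_threshold(y_true, proba)
print(t, s)  # a threshold between 0.45 and 0.55 separates these classes perfectly
```

Swapping `metric` for an F1 or combined accuracy+recall scorer gives the metric-specific tuning listed above.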
+ 
+ ### 4. **Expanded Hyperparameter Search Spaces** ⭐⭐
+ **Current State:**
+ - Limited parameter ranges
+ - Missing important hyperparameters
+ 
+ **Improvements:**
+ - ✅ **XGBoost:** Added `colsample_bylevel`, `gamma`, expanded ranges
+ - ✅ **CatBoost:** Added `border_count`, `bagging_temperature`, `random_strength`
+ - ✅ **LightGBM:** Added `min_split_gain`, expanded `num_leaves` range
+ 
+ **Expected Impact:** +0.5-2% overall improvement
+ 
+ ### 5. **Feature Engineering & Selection** ⭐⭐
+ **Current State:**
+ - Using all features without analysis
+ - No feature importance-based selection
+ 
+ **Improvements:**
+ - ✅ **Feature importance analysis** (implemented in `feature_importance_analysis.py`)
+ - ✅ **Statistical feature selection** (F-test, Mutual Information)
+ - ✅ **Combined importance scoring**
+ - 🔄 **Feature selection experiments** (can be added)
+ 
+ **Expected Impact:** +0.5-1.5% accuracy, potential overfitting reduction
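The combined importance scoring (F-test plus mutual information, each normalized) can be sketched on synthetic data; the normalization scheme here is an illustrative assumption, not necessarily what `feature_importance_analysis.py` does:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))            # 5 features; only 0 and 1 are informative
y = (X[:, 0] + X[:, 1] > 0).astype(int)

f_scores, _ = f_classif(X, y)            # univariate F-test scores
mi_scores = mutual_info_classif(X, y, random_state=0)

# Normalize each score to [0, 1] and sum them into one combined ranking.
combined = f_scores / f_scores.max() + mi_scores / mi_scores.max()
ranking = np.argsort(combined)[::-1]
print(ranking[:2])  # the two informative features should rank first
```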
+ 
+ ### 6. **Ensemble Optimization** ⭐⭐
+ **Current State:**
+ - Simple 50/50 weighting for XGBoost and CatBoost
+ - No optimization of ensemble weights
+ 
+ **Improvements:**
+ - ✅ **Grid search for optimal weights**
+ - ✅ **Three-model ensemble** (XGBoost + CatBoost + LightGBM)
+ - ✅ **Weight optimization with threshold tuning**
+ 
+ **Expected Impact:** +0.5-1.5% accuracy, +0.5-1% recall
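The weight grid search can be sketched like this (a minimal illustration with toy probabilities; the step size, metric, and fixed threshold are assumptions):

```python
import numpy as np

def optimize_weights(y_true, p_xgb, p_cat, p_lgb, threshold=0.5, step=0.05):
    """Grid-search convex weights (w1 + w2 + w3 = 1) for a 3-model probability ensemble."""
    best_w, best_acc = None, -1.0
    for w1 in np.arange(0.0, 1.0 + 1e-9, step):
        for w2 in np.arange(0.0, 1.0 - w1 + 1e-9, step):
            w3 = 1.0 - w1 - w2
            proba = w1 * p_xgb + w2 * p_cat + w3 * p_lgb
            acc = float(np.mean((proba >= threshold).astype(int) == y_true))
            if acc > best_acc:
                best_w, best_acc = (round(w1, 2), round(w2, 2), round(w3, 2)), acc
    return best_w, best_acc

y = np.array([0, 0, 1, 1])
p_xgb = np.array([0.6, 0.4, 0.6, 0.9])
p_cat = np.array([0.1, 0.2, 0.8, 0.9])
p_lgb = np.array([0.7, 0.6, 0.4, 0.5])
w, acc = optimize_weights(y, p_xgb, p_cat, p_lgb)
print(w, acc)  # finds a weighting that classifies all four samples correctly
```

In practice the same loop would score a held-out validation set, and the threshold can be tuned jointly with the weights.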
+ 
+ ### 7. **Early Stopping & Regularization** ⭐
+ **Current State:**
+ - Fixed number of estimators
+ - Basic regularization
+ 
+ **Improvements:**
+ - ✅ **Optuna pruner** (MedianPruner)
+ - ✅ **Enhanced regularization** (expanded ranges)
+ - 🔄 **Early stopping callbacks** (can be added)
+ 
+ **Expected Impact:** Better generalization, reduced overfitting
+ 
+ ## Implementation Guide
+ 
+ ### Step 1: Run Advanced Optimization
+ ```bash
+ python improve_models.py
+ ```
+ 
+ This will:
+ - Run Optuna optimization for all three models (100 trials each)
+ - Optimize thresholds for each model
+ - Optimize ensemble weights
+ - Save optimized models and results
+ 
+ **Time:** ~1-2 hours (depending on hardware)
+ 
+ ### Step 2: Analyze Feature Importance
+ ```bash
+ python feature_importance_analysis.py
+ ```
+ 
+ This will:
+ - Extract feature importance from all models
+ - Perform statistical feature selection
+ - Generate recommendations
+ - Create visualizations
+ 
+ **Time:** ~5-10 minutes
+ 
+ ### Step 3: Compare Results
+ Compare the new `model_metrics_optimized.csv` with the existing `model_metrics_best.csv`:
+ ```bash
+ # View optimized results
+ cat content/models/model_metrics_optimized.csv
+ 
+ # Compare with the previous best
+ cat content/models/model_metrics_best.csv
+ ```
+ 
+ ## Additional Recommendations
+ 
+ ### 1. **Advanced Feature Engineering**
+ - Polynomial features for key interactions (age × BP, BMI × cholesterol)
+ - Binning continuous features
+ - Domain-specific features (e.g., Framingham Risk Score components)
+ 
+ ### 2. **Advanced Ensemble Methods**
+ - **Stacking:** Use a meta-learner to combine base models
+ - **Blending:** Weighted average with learned weights
+ - **Voting:** Hard/soft voting ensembles
+ 
+ ### 3. **Data Augmentation**
+ - SMOTE for minority class oversampling
+ - ADASYN for adaptive synthetic sampling
+ - BorderlineSMOTE for better boundary examples
+ 
+ ### 4. **Cross-Validation Strategy**
+ - Nested cross-validation for unbiased evaluation
+ - Time-based splits (if temporal data)
+ - Group-based splits (if group structure exists)
+ 
+ ### 5. **Model Calibration**
+ - Platt scaling
+ - Isotonic regression
+ - Temperature scaling
+ 
+ ### 6. **Hyperparameter Tuning Enhancements**
+ - Multi-objective optimization (Pareto front)
+ - Bayesian optimization with Gaussian processes
+ - Hyperband for faster search
+ 
+ ## Expected Overall Improvement
+ 
+ With all improvements implemented:
+ 
+ | Metric | Current | Expected | Improvement |
+ |--------|---------|----------|-------------|
+ | Accuracy | 0.851 | 0.860-0.870 | +1-2% |
+ | Recall | 0.843 | 0.860-0.875 | +2-4% |
+ | F1 Score | 0.850 | 0.860-0.870 | +1-2% |
+ | ROC-AUC | 0.925 | 0.930-0.935 | +0.5-1% |
+ 
+ ## Files Created
+ 
+ 1. **`improve_models.py`** - Main optimization script
+ 2. **`feature_importance_analysis.py`** - Feature analysis script
+ 3. **`IMPROVEMENTS.md`** - This document
+ 
+ ## Next Steps
+ 
+ 1. ✅ Run `improve_models.py` to get optimized models
+ 2. ✅ Run `feature_importance_analysis.py` for feature insights
+ 3. 🔄 Test optimized models on a validation set
+ 4. 🔄 Compare with baseline models
+ 5. 🔄 Deploy the best performing model
+ 6. 🔄 Monitor performance in production
+ 
+ ## Notes
+ 
+ - The optimization scripts are designed to be run independently
+ - Results are saved to the `content/models/` directory
+ - All improvements are backward compatible
+ - Existing models are not overwritten (new files get an `_optimized` suffix)
+ 
+ ## Troubleshooting
+ 
+ **Issue:** Optuna optimization takes too long
+ - **Solution:** Reduce `n_trials` in `improve_models.py` (e.g., 50 instead of 100)
+ 
+ **Issue:** Memory errors during optimization
+ - **Solution:** Reduce `n_jobs` or use a smaller data sample
+ 
+ **Issue:** No improvement in metrics
+ - **Solution:** Check that data preprocessing matches the training data
+ - Verify feature alignment
+ - Check for data leakage
+ 
IMPROVEMENTS_V2.md ADDED
@@ -0,0 +1,77 @@
+ # Advanced Model Optimization - Version 2
+ 
+ ## Key Improvements Made
+ 
+ ### 1. **Removed Timeout Barrier** ✅
+ - **Before:** 1-hour timeout limit
+ - **After:** No timeout - the run will complete all iterations
+ - **Impact:** Allows full optimization without interruption
+ 
+ ### 2. **Increased Optimization Trials** ✅
+ - **Before:** 100 trials per model
+ - **After:** 300 trials per model (3x more)
+ - **Impact:** Better hyperparameter search, higher chance of finding optimal parameters
+ 
+ ### 3. **Balanced Accuracy + Recall Optimization** ✅
+ - **Before:** Equal weighting of both metrics (0.5 * accuracy + 0.5 * recall)
+ - **After:** Balanced optimization (0.4 * accuracy + 0.6 * recall) with smart penalties
+ - **Features:**
+   - Penalizes trials where recall is too low relative to accuracy
+   - Bonus if both accuracy > 85% AND recall > 90%
+   - Penalty if accuracy drops below 80%
+ - **Impact:** Should improve both metrics simultaneously
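The penalty/bonus scheme above can be sketched as follows (the exact cutoffs and penalty sizes are illustrative assumptions, not necessarily those used in `improve_models.py`):

```python
def balanced_score(acc, rec):
    """0.4*accuracy + 0.6*recall with the penalties/bonuses described above."""
    score = 0.4 * acc + 0.6 * rec
    if rec < acc - 0.05:            # recall lagging too far behind accuracy
        score -= 0.05
    if acc > 0.85 and rec > 0.90:   # both targets met: small bonus
        score += 0.02
    if acc < 0.80:                  # accuracy below the acceptable floor
        score -= 0.10
    return score

print(balanced_score(0.86, 0.91))  # 0.344 + 0.546 + 0.02 = 0.91
print(balanced_score(0.78, 0.95))  # low accuracy is penalized despite high recall
```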
+ 
+ ### 4. **Improved Threshold Optimization** ✅
+ - **Before:** Simple combined metric
+ - **After:** Balanced threshold optimization that:
+   - Rewards high recall but penalizes if accuracy drops too much
+   - Gives a bonus for high performance on both metrics
+   - Prevents accuracy from dropping below acceptable levels
+ 
+ ## Expected Results
+ 
+ With these improvements, we expect:
+ - **Accuracy:** 84-86% (improved from 81.9%)
+ - **Recall:** 90-93% (maintained high recall)
+ - **F1 Score:** 85-87% (improved balance)
+ - **ROC-AUC:** 92-93% (maintained or improved)
+ 
+ ## Training Configuration
+ 
+ - **Trials per model:** 300 (XGBoost, CatBoost, LightGBM)
+ - **Total trials:** 900
+ - **Timeout:** None (will complete all trials)
+ - **Memory limit:** 4GB
+ - **CPU limit:** 2 cores
+ - **Estimated time:** 3-6 hours (depending on CPU performance)
+ 
+ ## Monitoring Progress
+ 
+ Check progress with:
+ ```bash
+ tail -f optimization_v2_log.txt
+ ```
+ 
+ Or check the Docker logs:
+ ```bash
+ docker logs -f heart-optimization-v2
+ ```
+ 
+ ## What's Different
+ 
+ 1. **No timeout** - Training will complete all 300 trials per model
+ 2. **Better scoring** - Optimizes for both accuracy AND recall
+ 3. **Smarter threshold** - Finds thresholds that balance both metrics
+ 4. **More exploration** - 3x more trials = better hyperparameter space coverage
+ 
+ ## Expected Timeline
+ 
+ - **XGBoost (300 trials):** ~1.5-2 hours
+ - **CatBoost (300 trials):** ~2-3 hours
+ - **LightGBM (300 trials):** ~1-1.5 hours
+ - **Threshold optimization:** ~5 minutes
+ - **Ensemble optimization:** ~10 minutes
+ - **Total:** ~4.5-6.5 hours
+ 
+ The results will be saved automatically when the run completes!
+ 
MONITOR_TRAINING.md ADDED
@@ -0,0 +1,64 @@
+ # How to Monitor Training Progress
+ 
+ ## Training is Currently Running! ✅
+ 
+ The model optimization is running in the Docker container `heart-optimization-v2`.
+ 
+ ## Quick Status Check
+ 
+ ```bash
+ # Check if the container is running
+ docker ps | grep heart-optimization
+ 
+ # See current progress (last 50 lines)
+ docker logs --tail 50 heart-optimization-v2
+ 
+ # Follow progress in real time (like tail -f)
+ docker logs -f heart-optimization-v2
+ ```
+ 
+ ## View Log File
+ 
+ ```bash
+ # Follow the log file
+ tail -f optimization_v2_log.txt
+ 
+ # Or view the last 100 lines
+ tail -100 optimization_v2_log.txt
+ ```
+ 
+ ## Current Progress
+ 
+ Based on the logs, training is at:
+ - **XGBoost:** Trial 4/300 (just started)
+ - **CatBoost:** Waiting (will start after XGBoost)
+ - **LightGBM:** Waiting (will start after CatBoost)
+ 
+ ## Estimated Time Remaining
+ 
+ - **XGBoost (300 trials):** ~1.5-2 hours remaining
+ - **CatBoost (300 trials):** ~2-3 hours
+ - **LightGBM (300 trials):** ~1-1.5 hours
+ - **Total:** ~4.5-6.5 hours
+ 
+ ## What to Look For
+ 
+ The logs show:
+ - Trial number (e.g., "Trial 4/300")
+ - Best score found so far
+ - Progress bar
+ - Estimated time remaining
+ 
+ ## Stop Training (if needed)
+ 
+ ```bash
+ docker stop heart-optimization-v2
+ ```
+ 
+ ## Check Results (when complete)
+ 
+ Results will be saved to:
+ - `content/models/model_metrics_optimized.csv`
+ - `content/models/*_optimized.joblib`
+ - `content/models/ensemble_info_optimized.json`
+ 
PROGRESS_REPORT.md ADDED
@@ -0,0 +1,62 @@
+ # 📊 Training Progress Report
+ 
+ ## Current Status: 🔄 ACTIVE
+ 
+ **Last Updated:** $(date)
+ 
+ ### Overall Progress
+ 
+ | Model | Status | Progress | Best Score |
+ |-------|--------|----------|------------|
+ | **XGBoost** | 🔄 In Progress | 295/300 trials (98.3%) | 0.842463 |
+ | **CatBoost** | ⏳ Waiting | 0/300 trials (0%) | - |
+ | **LightGBM** | ⏳ Waiting | 0/300 trials (0%) | - |
+ 
+ ### Current Details
+ 
+ - **Container:** Running (Up 6+ hours)
+ - **CPU Usage:** 100% (actively training)
+ - **Memory:** 300MB / 1.8GB (normal)
+ - **Best Score Found:** 0.842463
+ - **Current Trial:** 295/300 for XGBoost
+ 
+ ### Timeline
+ 
+ **XGBoost Optimization:**
+ - ✅ Started: ~6 hours ago
+ - 🔄 Current: Trial 295/300
+ - ⏱️ Remaining: ~5-10 minutes
+ - 📊 Progress: 98.3% complete
+ 
+ **Next Steps:**
+ 1. XGBoost will finish in ~5-10 minutes
+ 2. CatBoost will start automatically (~2-3 hours)
+ 3. LightGBM will start after CatBoost (~1-1.5 hours)
+ 4. Final evaluation and ensemble optimization
+ 
+ ### Estimated Completion Time
+ 
+ - **XGBoost:** ~5-10 minutes remaining
+ - **CatBoost:** ~2-3 hours (after XGBoost completes)
+ - **LightGBM:** ~1-1.5 hours (after CatBoost completes)
+ - **Final Evaluation:** ~15 minutes
+ - **Total Remaining:** ~3.5-5 hours
+ 
+ ### What's Happening Now
+ 
+ The optimizer is:
+ - ✅ Testing hyperparameter combinations
+ - ✅ Finding optimal parameters (best score: 0.842463)
+ - ✅ Using 100% CPU (actively working)
+ - ✅ Almost done with XGBoost (98.3% complete)
+ 
+ ### Improvements Found
+ 
+ - **Best Score:** 0.842463 (improved from the initial 0.838024)
+ - **Best Trial:** Trial 224
+ - **Optimization:** Balanced accuracy + recall scoring
+ 
+ ### Next Check
+ 
+ Run `./check_training.sh` to see updated progress!
+ 
PROGRESS_UPDATE.md ADDED
@@ -0,0 +1,65 @@
+ # 📊 Training Progress Update
+ 
+ **Last Updated:** November 9, 2025 at 11:11 AM
+ 
+ ## Current Status
+ 
+ ### ✅ Container Status
+ - **Status:** Running (Up 8 hours)
+ - **CPU Usage:** 99.96% (actively processing)
+ - **Memory:** 484.9 MB / 1.8 GB (26.4%)
+ - **State:** Healthy and working
+ 
+ ### 📈 Model Progress
+ 
+ | Model | Status | Progress | Best Score |
+ |-------|--------|----------|------------|
+ | **XGBoost** | ✅ COMPLETED | 300/300 (100%) | 0.842463 (Trial #224) |
+ | **CatBoost** | 🔄 IN PROGRESS | 61/300 (20.3%) | 0.838067 (Trial #58) |
+ | **LightGBM** | ⏳ WAITING | 0/300 (0%) | - |
+ 
+ ### 🔄 CatBoost Details
+ - **Current Trial:** 61/300
+ - **Remaining:** 239 trials
+ - **Best Score:** 0.838067 (Trial #58)
+ - **Last Activity:** Trial 61 completed at 05:39 AM
+ - **Note:** The container is actively processing (100% CPU). CatBoost trials can take 2-3 minutes each, and the process may be in the middle of a longer trial.
+ 
+ ### ⏱️ Time Estimates
+ 
+ **CatBoost Remaining:**
+ - Average time per trial: ~2.5 minutes
+ - Remaining trials: 239
+ - Estimated time: ~598 minutes (~10 hours)
+ 
+ **LightGBM (Upcoming):**
+ - Total trials: 300
+ - Estimated time: ~540 minutes (~9 hours)
+ 
+ **Final Evaluation:**
+ - Estimated time: ~15 minutes
+ 
+ **Total Remaining:** ~1,153 minutes (~19.2 hours)
+ 
+ **Estimated Completion:** Around **6:23 AM on November 10, 2025**
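The estimate above is simple arithmetic: trials left times average minutes per trial, plus the final evaluation. A tiny sketch (the ~1.8 min/trial LightGBM rate is an assumption backed out of the ~540-minute figure):

```python
# Remaining-time estimate: trials left × average minutes per trial.
catboost_eta = round(239 * 2.5)           # ~598 minutes
lightgbm_eta = round(300 * 1.8)           # ~540 minutes
total = catboost_eta + lightgbm_eta + 15  # plus ~15 min final evaluation
print(total, round(total / 60, 1))        # ~1153 minutes ≈ 19.2 hours
```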
+ 
+ ## Notes
+ 
+ - The container is running normally and using full CPU capacity
+ - CatBoost optimization is progressing (20.3% complete)
+ - No errors detected in the logs
+ - The process may appear slow because CatBoost trials involve cross-validation, which can take time
+ 
+ ## How to Monitor
+ 
+ ```bash
+ # Check status
+ ./check_training.sh
+ 
+ # Watch live logs
+ docker logs -f heart-optimization-v2
+ 
+ # Check container stats
+ docker stats heart-optimization-v2
+ ```
+ 
QUICK_START.md ADDED
@@ -0,0 +1,167 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Quick Start Guide: Model Improvement
2
+
3
+ ## Overview
4
+
5
+ This guide helps you improve your heart attack risk prediction models using advanced optimization techniques.
6
+
7
+ ## 🐳 Docker Option (Recommended)
8
+
9
+ If you have Docker installed, this is the easiest way to run optimization:
10
+
11
+ ```bash
12
+ # Simple one-command execution
13
+ ./run_optimization_docker.sh
14
+
15
+ # Or with custom settings
16
+ ./run_optimization_docker.sh --trials 50
17
+
18
+ # Run feature analysis
19
+ ./run_optimization_docker.sh --script feature_importance_analysis.py
20
+ ```
21
+
22
+ See [DOCKER_OPTIMIZATION.md](DOCKER_OPTIMIZATION.md) for detailed Docker instructions.
23
+
24
+ ---
25
+
26
+ ## Local Installation Option
27
+
28
+ ## Current Performance
29
+
30
+ Your current models achieve:
31
+ - **Accuracy:** ~85.1%
32
+ - **Recall:** ~84.3%
33
+ - **ROC-AUC:** ~92.5%
34
+
35
+ ## Quick Start (3 Steps)
36
+
37
+ ### Step 1: Install Dependencies
38
+
39
+ ```bash
40
+ pip install -r requirements.txt
41
+ ```
42
+
43
+ This will install Optuna and other required packages.
44
+
45
+ ### Step 2: Run Model Optimization
46
+
47
+ ```bash
48
+ python improve_models.py
49
+ ```
50
+
51
+ **What this does:**
52
+ - Optimizes hyperparameters for XGBoost, CatBoost, and LightGBM using Optuna
53
+ - Finds optimal prediction thresholds for each model
54
+ - Optimizes ensemble weights
55
+ - Saves improved models to `content/models/`
56
+
57
+ **Time:** ~1-2 hours (100 trials per model)
58
+
59
+ **Output:**
60
+ - `XGBoost_optimized.joblib`
61
+ - `CatBoost_optimized.joblib`
62
+ - `LightGBM_optimized.joblib`
63
+ - `model_metrics_optimized.csv`
64
+ - `ensemble_info_optimized.json`
65
+ - `best_params_optimized.json`
66
+
67
+ ### Step 3: Analyze Feature Importance (Optional)
68
+
69
+ ```bash
70
+ python feature_importance_analysis.py
71
+ ```
72
+
73
+ **What this does:**
74
+ - Analyzes feature importance across all models
75
+ - Performs statistical feature selection
76
+ - Generates visualizations
77
+ - Provides feature selection recommendations
78
+
79
+ **Time:** ~5-10 minutes
80
+
81
+ **Output:**
82
+ - `feature_selection_recommendations.json`
83
+ - `feature_importance_top30.png`
84
+ - `feature_correlation_top30.png`
85
+
86
+ ### Step 4: Compare Results
87
+
88
+ ```bash
89
+ python compare_models.py
90
+ ```
91
+
92
+ **What this does:**
93
+ - Compares baseline vs optimized models
94
+ - Shows improvement metrics
95
+ - Displays optimal ensemble configuration
96
+
97
+ ## Expected Improvements
98
+
99
+ After running the optimization:
100
+
101
+ | Metric | Current | Expected | Improvement |
102
+ |--------|---------|----------|-------------|
103
+ | Accuracy | 85.1% | 86-87% | +1-2% |
104
+ | Recall | 84.3% | 86-87.5% | +2-4% |
105
+ | F1 Score | 85.0% | 86-87% | +1-2% |
106
+
107
+ ## Key Improvements Implemented
108
+
109
+ 1. ✅ **Optuna Hyperparameter Optimization**
110
+ - Tree-structured Parzen Estimator (TPE)
111
+ - 100+ trials per model
112
+ - Expanded parameter search spaces
113
+
114
+ 2. ✅ **Multi-Objective Optimization**
115
+ - Combined accuracy + recall scoring
116
+ - Threshold optimization per model
117
+
118
+ 3. ✅ **Enhanced Ensemble**
119
+ - Three-model ensemble (XGBoost + CatBoost + LightGBM)
120
+ - Optimized weights
121
+ - Optimized threshold
122
+
123
+ 4. ✅ **Feature Analysis**
124
+ - Importance extraction
125
+ - Statistical selection methods
126
+ - Recommendations for feature engineering
127
+
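The weight search behind the "Enhanced Ensemble" can be illustrated as a coarse grid search over convex combinations of the three models' probabilities. This is a toy sketch, not the production search (which scores held-out validation predictions with a recall-aware objective); the fixed 0.5 cutoff and 0.05 step are assumptions:

```python
import itertools
import numpy as np

def search_weights(y_true, probs, step=0.05):
    """probs: dict of model name -> probability array.
    Try all weight triples on a coarse grid that sum to 1 and keep
    the one with the best accuracy at a fixed 0.5 cutoff."""
    names = list(probs)
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best = (None, -1.0)
    for w1, w2 in itertools.product(grid, grid):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:  # weights must stay a convex combination
            continue
        blend = w1 * probs[names[0]] + w2 * probs[names[1]] + w3 * probs[names[2]]
        acc = ((blend >= 0.5).astype(int) == y_true).mean()
        if acc > best[1]:
            best = ({names[0]: w1, names[1]: w2, names[2]: round(w3, 2)}, acc)
    return best

y = np.array([0, 0, 1, 1])
p = {
    "XGBoost":  np.array([0.6, 0.4, 0.4, 0.6]),   # noisy model
    "CatBoost": np.array([0.1, 0.2, 0.9, 0.8]),   # strong model
    "LightGBM": np.array([0.4, 0.6, 0.6, 0.4]),   # noisy model
}
weights, acc = search_weights(y, p)
```

Even this toy search pushes most of the weight onto the strongest model, which is the same effect that produced the heavily CatBoost-weighted production ensemble.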
128
+ ## Faster Alternative
129
+
130
+ If you want results sooner (at some cost in final model quality):
131
+
132
+ Edit `improve_models.py` and change:
133
+ ```python
134
+ n_trials = 100 # Change to 30-50 for faster results
135
+ ```
136
+
137
+ ## Troubleshooting
138
+
139
+ **Problem:** Script takes too long
140
+ - **Solution:** Reduce `n_trials` to 30-50
141
+
142
+ **Problem:** Memory errors
143
+ - **Solution:** Reduce `n_jobs` or train on a smaller data sample
144
+
145
+ **Problem:** No improvement
146
+ - **Solution:** Verify that inference-time preprocessing (features, encodings, column order) matches the training pipeline
147
+
148
+ ## Next Steps
149
+
150
+ 1. Run optimization scripts
151
+ 2. Compare results with baseline
152
+ 3. Test optimized models on validation set
153
+ 4. Deploy best performing model
154
+ 5. Monitor performance
155
+
156
+ ## Files Created
157
+
158
+ - `improve_models.py` - Main optimization script
159
+ - `feature_importance_analysis.py` - Feature analysis
160
+ - `compare_models.py` - Comparison tool
161
+ - `IMPROVEMENTS.md` - Detailed improvement analysis
162
+ - `QUICK_START.md` - This guide
163
+
164
+ ## Questions?
165
+
166
+ See `IMPROVEMENTS.md` for detailed explanations of all improvements.
167
+
RUN_STREAMLIT_LOCAL.md ADDED
@@ -0,0 +1,93 @@
1
+ # 🚀 Running Streamlit App Locally
2
+
3
+ ## ✅ What's Been Done
4
+
5
+ 1. **Optimized models copied** to `model_assets/`:
6
+ - ✅ XGBoost_optimized.joblib
7
+ - ✅ CatBoost_optimized.joblib
8
+ - ✅ LightGBM_optimized.joblib
9
+ - ✅ ensemble_info_optimized.json
10
+ - ✅ model_metrics_optimized.csv
11
+ - ✅ hybrid_metrics.csv
12
+
13
+ 2. **Streamlit app updated**:
14
+ - ✅ Uses optimized models
15
+ - ✅ Loads ensemble weights from config
16
+ - ✅ Displays optimized ensemble weights in sidebar
17
+ - ✅ All paths configured correctly
18
+
19
+ ## 📋 To Run Locally
20
+
21
+ ### Option 1: Using Docker (Recommended - Already Set Up)
22
+
23
+ ```bash
24
+ # The Docker environment already has all dependencies
25
+ docker run --rm -p 8501:8501 \
26
+ -v "$(pwd)/model_assets:/app/model_assets" \
27
+ -v "$(pwd)/content:/app/content" \
28
+ -v "$(pwd)/streamlit_app.py:/app/streamlit_app.py" \
29
+ heart-optimization \
30
+ streamlit run streamlit_app.py --server.headless=true --server.address=0.0.0.0 --server.port=8501
31
+ ```
32
+
33
+ Then open: http://localhost:8501
34
+
35
+ ### Option 2: Install Dependencies Locally
36
+
37
+ **Note:** Python 3.14.0 may have compatibility issues. Consider using Python 3.11 or 3.12.
38
+
39
+ ```bash
40
+ # Install dependencies
41
+ pip install streamlit pandas numpy scikit-learn xgboost catboost lightgbm joblib
42
+
43
+ # Run the app
44
+ streamlit run streamlit_app.py
45
+ ```
46
+
47
+ ### Option 3: Use a Virtual Environment (Recommended if not using Docker)
48
+
49
+ ```bash
50
+ # Create virtual environment with Python 3.11 or 3.12
51
+ python3.11 -m venv venv
52
+ source venv/bin/activate # On Windows: venv\Scripts\activate
53
+
54
+ # Install dependencies
55
+ pip install -r requirements.txt
56
+
57
+ # Run the app
58
+ streamlit run streamlit_app.py
59
+ ```
60
+
61
+ ## 🎯 What to Test
62
+
63
+ 1. **Model Loading**: Check sidebar shows "Using Optimized Ensemble" with correct weights
64
+ 2. **Input Form**: Fill in patient information
65
+ 3. **Prediction**: Click "Predict Heart Attack Risk" button
66
+ 4. **Results**: Verify prediction and risk percentage display correctly
67
+ 5. **Ensemble Info**: Check that ensemble weights match optimized config (XGB: 5%, CAT: 85%, LGB: 10%)
68
+
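Item 5 can also be checked without starting the UI: the ensemble config is plain JSON, so a few lines verify the weights form a convex combination dominated by CatBoost. A sketch that parses the same shape as `model_assets/ensemble_info_optimized.json` (the inline string here is illustrative; in practice you would `json.load` the file itself):

```python
import json

# Same shape as model_assets/ensemble_info_optimized.json
config_text = """
{
  "weights": {"XGBoost": 0.05, "CatBoost": 0.85, "LightGBM": 0.10},
  "threshold": 0.26
}
"""
config = json.loads(config_text)
weights = config["weights"]

# Weights should sum to 1 and CatBoost should carry most of the weight
total = sum(weights.values())
dominant = max(weights, key=weights.get)
```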
69
+ ## 📊 Expected Results
70
+
71
+ - **Ensemble Weights**: XGBoost: 5.0% | CatBoost: 85.0% | LightGBM: 10.0%
72
+ - **Accuracy**: ~80.8% (from optimized metrics)
73
+ - **Recall**: ~93.3% (from optimized metrics)
74
+ - **ROC-AUC**: ~0.925
75
+
76
+ ## 🐛 Troubleshooting
77
+
78
+ ### If models don't load:
79
+ - Check `model_assets/` folder has all `.joblib` files
80
+ - Verify file permissions are readable
81
+
82
+ ### If dependencies fail:
83
+ - Use Docker (Option 1) - already configured
84
+ - Or use Python 3.11/3.12 instead of 3.14
85
+
86
+ ### If app doesn't start:
87
+ - Check port 8501 is not in use: `lsof -i :8501`
88
+ - Try different port: `streamlit run streamlit_app.py --server.port=8502`
89
+
90
+ ## ✅ Once Working Locally
91
+
92
+ After confirming the app works locally, we'll proceed with Hugging Face deployment!
93
+
TEST_CASES.md ADDED
@@ -0,0 +1,302 @@
1
+ # 🧪 Test Cases for Heart Attack Risk Prediction App
2
+
3
+ ## Test Case 1: Low Risk Patient (Healthy Individual)
4
+ **Input:**
5
+ - Gender: Female (2)
6
+ - Age: 35 years
7
+ - Height: 165 cm
8
+ - Weight: 60 kg
9
+ - Systolic BP: 120 mmHg
10
+ - Diastolic BP: 80 mmHg
11
+ - Cholesterol: Normal (1)
12
+ - Glucose: Normal (1)
13
+ - Smoking: No (0)
14
+ - Alcohol: No (0)
15
+ - Physical Activity: Yes (1)
16
+ - Protein Level: 14.0
17
+ - Ejection Fraction: 60.0
18
+
19
+ **Expected Output:**
20
+ - Risk Level: ✅ Low Risk
21
+ - Risk Probability: < 10% (typically 2-8%)
22
+ - Prediction: No Heart Disease
23
+ - Key Risk Factors: ✅ Health Status: Healthy indicators
24
+ - Model Breakdown:
25
+ - XGBoost: ~5-8% risk
26
+ - CatBoost: ~1-2% risk (most accurate for low risk)
27
+ - LightGBM: ~20-25% risk (Note: LightGBM tends to over-estimate risk on clearly low-risk inputs)
28
+ - Ensemble: ~2-5% risk (weighted: 5% XGB + 85% CAT + 10% LGB)
29
+ - Recommendation: ✅ Low Risk - Continue maintaining a healthy lifestyle!
30
+
31
+ **Note:** LightGBM may show higher individual risk percentages due to its training characteristics, but the ensemble weights (85% CatBoost) ensure the final prediction remains accurate.
32
+
33
+ ---
34
+
35
+ ## Test Case 2: Moderate Risk Patient (Some Risk Factors)
36
+ **Input:**
37
+ - Gender: Male (1)
38
+ - Age: 55 years
39
+ - Height: 175 cm
40
+ - Weight: 85 kg (BMI ~27.8 - Overweight)
41
+ - Systolic BP: 135 mmHg
42
+ - Diastolic BP: 88 mmHg
43
+ - Cholesterol: Above Normal (2)
44
+ - Glucose: Normal (1)
45
+ - Smoking: No (0)
46
+ - Alcohol: Yes (1)
47
+ - Physical Activity: No (0)
48
+ - Protein Level: 6.5
49
+ - Ejection Fraction: 55.0
50
+
51
+ **Expected Output:**
52
+ - Risk Level: ⚠️ Moderate Risk
53
+ - Risk Probability: 30-50% (typically 35-45%)
54
+ - Prediction: May indicate risk
55
+ - Key Risk Factors: ⚠️ High BP, High cholesterol, Alcohol consumption, Physical inactivity
56
+ - Model Breakdown:
57
+ - XGBoost: ~35-45% risk
58
+ - CatBoost: ~35-45% risk
59
+ - LightGBM: ~35-45% risk
60
+ - Ensemble: ~35-45% risk
61
+ - Recommendation: ⚠️ Moderate Risk - Consider consulting a healthcare professional.
62
+
63
+ ---
64
+
65
+ ## Test Case 3: High Risk Patient (Multiple Risk Factors)
66
+ **Input:**
67
+ - Gender: Male (1)
68
+ - Age: 65 years
69
+ - Height: 170 cm
70
+ - Weight: 95 kg (BMI ~32.9 - Obese)
71
+ - Systolic BP: 150 mmHg
72
+ - Diastolic BP: 100 mmHg
73
+ - Cholesterol: Well Above Normal (3)
74
+ - Glucose: Well Above Normal (3)
75
+ - Smoking: Yes (1)
76
+ - Alcohol: Yes (1)
77
+ - Physical Activity: No (0)
78
+ - Protein Level: 6.0
79
+ - Ejection Fraction: 45.0
80
+
81
+ **Expected Output:**
82
+ - Risk Level: 🚨 Very High Risk
83
+ - Risk Probability: > 70% (typically 75-90%)
84
+ - Prediction: Heart Disease Detected
85
+ - Key Risk Factors: ⚠️ High BMI (>30), High BP, High cholesterol, High glucose, Smoking, Alcohol consumption, Physical inactivity
86
+ - Model Breakdown:
87
+ - XGBoost: ~75-90% risk
88
+ - CatBoost: ~75-90% risk
89
+ - LightGBM: ~75-90% risk
90
+ - Ensemble: ~75-90% risk
91
+ - Recommendation: ⚠️ High Risk Detected! Please consult with a healthcare professional immediately.
92
+
93
+ ---
94
+
95
+ ## Test Case 4: Borderline Case (Age Factor)
96
+ **Input:**
97
+ - Gender: Female (2)
98
+ - Age: 50 years
99
+ - Height: 160 cm
100
+ - Weight: 70 kg (BMI ~27.3 - Overweight)
101
+ - Systolic BP: 130 mmHg
102
+ - Diastolic BP: 85 mmHg
103
+ - Cholesterol: Above Normal (2)
104
+ - Glucose: Normal (1)
105
+ - Smoking: No (0)
106
+ - Alcohol: No (0)
107
+ - Physical Activity: Yes (1)
108
+ - Protein Level: 7.0
109
+ - Ejection Fraction: 58.0
110
+
111
+ **Expected Output:**
112
+ - Risk Level: ⚠️ Moderate Risk
113
+ - Risk Probability: 20-40% (typically 25-35%)
114
+ - Prediction: May indicate risk
115
+ - Key Risk Factors: ⚠️ High BMI (>30), High BP, High cholesterol
116
+ - Model Breakdown:
117
+ - XGBoost: ~25-35% risk
118
+ - CatBoost: ~25-35% risk
119
+ - LightGBM: ~25-35% risk
120
+ - Ensemble: ~25-35% risk
121
+ - Recommendation: ⚠️ Moderate Risk - Consider consulting a healthcare professional.
122
+
123
+ ---
124
+
125
+ ## Test Case 5: Young Patient with Lifestyle Risks
126
+ **Input:**
127
+ - Gender: Male (1)
128
+ - Age: 28 years
129
+ - Height: 180 cm
130
+ - Weight: 75 kg (BMI ~23.1 - Normal)
131
+ - Systolic BP: 125 mmHg
132
+ - Diastolic BP: 82 mmHg
133
+ - Cholesterol: Normal (1)
134
+ - Glucose: Normal (1)
135
+ - Smoking: Yes (1)
136
+ - Alcohol: Yes (1)
137
+ - Physical Activity: No (0)
138
+ - Protein Level: 14.5
139
+ - Ejection Fraction: 62.0
140
+
141
+ **Expected Output:**
142
+ - Risk Level: ⚠️ Moderate Risk
143
+ - Risk Probability: 15-30% (typically 20-28%)
144
+ - Prediction: May indicate risk
145
+ - Key Risk Factors: ⚠️ Smoking, Alcohol consumption, Physical inactivity
146
+ - Model Breakdown:
147
+ - XGBoost: ~20-28% risk
148
+ - CatBoost: ~20-28% risk
149
+ - LightGBM: ~20-28% risk
150
+ - Ensemble: ~20-28% risk
151
+ - Recommendation: ⚠️ Moderate Risk - Consider consulting a healthcare professional.
152
+
153
+ ---
154
+
155
+ ## Test Case 6: Elderly Patient with Good Health
156
+ **Input:**
157
+ - Gender: Female (2)
158
+ - Age: 70 years
159
+ - Height: 155 cm
160
+ - Weight: 58 kg (BMI ~24.1 - Normal)
161
+ - Systolic BP: 125 mmHg
162
+ - Diastolic BP: 78 mmHg
163
+ - Cholesterol: Normal (1)
164
+ - Glucose: Normal (1)
165
+ - Smoking: No (0)
166
+ - Alcohol: No (0)
167
+ - Physical Activity: Yes (1)
168
+ - Protein Level: 13.5
169
+ - Ejection Fraction: 58.0
170
+
171
+ **Expected Output:**
172
+ - Risk Level: ✅ Low to Moderate Risk
173
+ - Risk Probability: 10-25% (typically 15-22%)
174
+ - Prediction: No Heart Disease (or low risk)
175
+ - Key Risk Factors: ✅ Health Status: Healthy indicators (or minimal risk factors)
176
+ - Model Breakdown:
177
+ - XGBoost: ~15-22% risk
178
+ - CatBoost: ~15-22% risk
179
+ - LightGBM: ~15-22% risk
180
+ - Ensemble: ~15-22% risk
181
+ - Recommendation: ✅ Low Risk - Continue maintaining a healthy lifestyle! (or Moderate Risk warning)
182
+
183
+ ---
184
+
185
+ ## Test Case 7: Extreme High Risk (All Risk Factors)
186
+ **Input:**
187
+ - Gender: Male (1)
188
+ - Age: 60 years
189
+ - Height: 168 cm
190
+ - Weight: 100 kg (BMI ~35.4 - Obese)
191
+ - Systolic BP: 160 mmHg
192
+ - Diastolic BP: 105 mmHg
193
+ - Cholesterol: Well Above Normal (3)
194
+ - Glucose: Well Above Normal (3)
195
+ - Smoking: Yes (1)
196
+ - Alcohol: Yes (1)
197
+ - Physical Activity: No (0)
198
+ - Protein Level: 5.5
199
+ - Ejection Fraction: 40.0
200
+
201
+ **Expected Output:**
202
+ - Risk Level: 🚨 Very High Risk
203
+ - Risk Probability: > 85% (typically 88-95%)
204
+ - Prediction: Heart Disease Detected
205
+ - Key Risk Factors: ⚠️ High BMI (>30), High BP, High cholesterol, High glucose, Smoking, Alcohol consumption, Physical inactivity
206
+ - Model Breakdown:
207
+ - XGBoost: ~88-95% risk
208
+ - CatBoost: ~88-95% risk
209
+ - LightGBM: ~88-95% risk
210
+ - Ensemble: ~88-95% risk
211
+ - Recommendation: ⚠️ High Risk Detected! Please consult with a healthcare professional immediately.
212
+
213
+ ---
214
+
215
+ ## Test Case 8: Only Physical Inactivity
216
+ **Input:**
217
+ - Gender: Female (2)
218
+ - Age: 40 years
219
+ - Height: 165 cm
220
+ - Weight: 65 kg (BMI ~23.9 - Normal)
221
+ - Systolic BP: 118 mmHg
222
+ - Diastolic BP: 75 mmHg
223
+ - Cholesterol: Normal (1)
224
+ - Glucose: Normal (1)
225
+ - Smoking: No (0)
226
+ - Alcohol: No (0)
227
+ - Physical Activity: No (0)
228
+ - Protein Level: 14.0
229
+ - Ejection Fraction: 60.0
230
+
231
+ **Expected Output:**
232
+ - Risk Level: ✅ Low Risk
233
+ - Risk Probability: < 15% (typically 5-12%)
234
+ - Prediction: No Heart Disease
235
+ - Key Risk Factors: ℹ️ Lifestyle Note: Physical inactivity - Consider adding regular physical activity to reduce risk.
236
+ - Model Breakdown:
237
+ - XGBoost: ~5-12% risk
238
+ - CatBoost: ~5-12% risk
239
+ - LightGBM: ~5-12% risk
240
+ - Ensemble: ~5-12% risk
241
+ - Recommendation: ✅ Low Risk - Continue maintaining a healthy lifestyle!
242
+
243
+ ---
244
+
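The BMI figures quoted in the test cases above can be reproduced directly with the standard formula. The category cutoffs below follow the usual WHO bands, which is an assumption about how the app bins `BMI_Category`:

```python
def bmi(weight_kg, height_cm):
    """Body mass index: kg / m^2."""
    h = height_cm / 100
    return weight_kg / (h * h)

def bmi_category(b):
    if b < 18.5:
        return "Underweight"
    if b < 25:
        return "Normal"
    if b < 30:
        return "Overweight"
    return "Obese"

# (weight kg, height cm) from Test Cases 2, 3, and 7
cases = [(85, 175), (95, 170), (100, 168)]
results = [(round(bmi(w, h), 1), bmi_category(bmi(w, h))) for w, h in cases]
```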
245
+ ## ✅ Verification Checklist
246
+
247
+ ### UI Elements to Verify:
248
+ - [ ] Page title displays correctly: "Predicting Heart Attack Risk: An Ensemble Modeling Approach"
249
+ - [ ] Subtitle includes: "XGBoost, CatBoost, and LightGBM"
250
+ - [ ] Sidebar shows optimized ensemble weights (XGB: 5%, CAT: 85%, LGB: 10%)
251
+ - [ ] Sidebar displays Accuracy: 80.77% and Recall: 93.27%
252
+ - [ ] All input fields are present and functional
253
+ - [ ] Prediction button works correctly
254
+ - [ ] Results display with proper formatting
255
+
256
+ ### Model Display to Verify:
257
+ - [ ] All 4 models displayed horizontally: XGBoost, CatBoost, LightGBM, Ensemble
258
+ - [ ] Each model shows progress bar with percentage inside
259
+ - [ ] Risk percentage displayed below each bar
260
+ - [ ] Color coding: Green (low), Orange (moderate), Red (high)
261
+ - [ ] Ensemble metrics section shows Accuracy and Recall
262
+
263
+ ### Prediction Results to Verify:
264
+ - [ ] Risk probability displayed correctly
265
+ - [ ] Risk level matches probability range
266
+ - [ ] Key risk factors identified correctly
267
+ - [ ] Recommendations match risk level
268
+ - [ ] Model breakdown shows all three models plus the Ensemble
269
+ - [ ] Ensemble method info displayed
270
+
271
+ ### Error Handling:
272
+ - [ ] App handles missing models gracefully
273
+ - [ ] Invalid inputs show appropriate warnings
274
+ - [ ] Error messages are user-friendly
275
+
276
+ ---
277
+
278
+ ## 📊 Expected Ensemble Metrics (Sidebar)
279
+ - **Accuracy**: 80.77%
280
+ - **Recall**: 93.27%
281
+ - **Ensemble Weights**: XGBoost: 5.0%, CatBoost: 85.0%, LightGBM: 10.0%
282
+
283
+ ---
284
+
285
+ ## 🎯 Quick Test Scenarios
286
+
287
+ 1. **Minimum Input Test**: Use default values, click predict → Should show low risk
288
+ 2. **Maximum Risk Test**: Set all risk factors to maximum → Should show very high risk
289
+ 3. **Edge Case Test**: Age 20, all normal → Should show very low risk
290
+ 4. **Edge Case Test**: Age 100, all normal → Should show moderate risk due to age
291
+ 5. **Single Risk Factor**: Only smoking → Should show moderate risk
292
+ 6. **Physical Inactivity Only**: Only inactive, all else normal → Should show info message (not warning)
293
+
294
+ ---
295
+
296
+ ## 📝 Notes
297
+ - Actual risk percentages may vary slightly (±2-3%) due to model variations
298
+ - The ensemble uses weighted average: 5% XGBoost + 85% CatBoost + 10% LightGBM
299
+ - **Important:** LightGBM may show higher individual risk percentages (15-25% for low-risk cases) due to its training characteristics. This is expected behavior and does not affect the final ensemble prediction, which is heavily weighted toward CatBoost (85%).
300
+ - The final ensemble prediction is the weighted average of all three models, so even if LightGBM shows higher values, the ensemble result remains accurate.
301
+ - For low-risk patients: CatBoost typically shows the most accurate low values (~1-2%), while LightGBM may show 20-25%. The ensemble (weighted) will be closer to CatBoost's prediction.
302
+
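The notes above can be checked numerically: even with LightGBM at ~22%, the weighted average stays close to CatBoost. The weights and 0.26 decision threshold are taken from `ensemble_info_optimized.json`; the per-model probabilities are illustrative values from the Test Case 1 ranges, not actual model output:

```python
# Illustrative per-model probabilities for a low-risk patient
probs = {"XGBoost": 0.06, "CatBoost": 0.015, "LightGBM": 0.22}
weights = {"XGBoost": 0.05, "CatBoost": 0.85, "LightGBM": 0.10}
threshold = 0.26

# Weighted average: 5% XGBoost + 85% CatBoost + 10% LightGBM
ensemble_prob = sum(weights[m] * probs[m] for m in probs)
prediction = "Heart Disease" if ensemble_prob >= threshold else "No Heart Disease"
```

Here the ensemble lands near 3.8% risk, well inside the 2-5% range quoted for Test Case 1, despite LightGBM's 22%.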
content/models/best_params_optimized.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "xgb": {
3
+ "n_estimators": 833,
4
+ "max_depth": 10,
5
+ "learning_rate": 0.0035309428506954807,
6
+ "subsample": 0.6532157097008766,
7
+ "colsample_bytree": 0.6442296258468639,
8
+ "colsample_bylevel": 0.8339397199904889,
9
+ "min_child_weight": 3,
10
+ "reg_alpha": 4.2228139324855505,
11
+ "reg_lambda": 4.7357932061965835,
12
+ "gamma": 0.21705740646031307
13
+ },
14
+ "cat": {
15
+ "iterations": 991,
16
+ "depth": 10,
17
+ "learning_rate": 0.012080369899297073,
18
+ "l2_leaf_reg": 6.239239675006592,
19
+ "border_count": 185,
20
+ "bagging_temperature": 0.4861933750669403,
21
+ "random_strength": 3.2121038119129146
22
+ },
23
+ "lgb": {
24
+ "n_estimators": 811,
25
+ "num_leaves": 174,
26
+ "learning_rate": 0.0012510889453566246,
27
+ "subsample": 0.7146448893210848,
28
+ "colsample_bytree": 0.6008014841256174,
29
+ "min_child_samples": 17,
30
+ "reg_alpha": 1.3831605249360786,
31
+ "reg_lambda": 4.834622156480472,
32
+ "min_split_gain": 0.11111280146299513
33
+ }
34
+ }
content/models/ensemble_info_optimized.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "weights": {
3
+ "XGBoost": 0.05,
4
+ "CatBoost": 0.8500000000000001,
5
+ "LightGBM": 0.09999999999999987
6
+ },
7
+ "threshold": 0.2599999999999999,
8
+ "optimal_thresholds": {
9
+ "XGBoost": 0.2799999999999999,
10
+ "CatBoost": 0.22999999999999995,
11
+ "LightGBM": 0.3699999999999999
12
+ }
13
+ }
content/models/model_metrics_optimized.csv ADDED
@@ -0,0 +1,5 @@
1
+ model,threshold,accuracy,precision,recall,f1,roc_auc
2
+ XGBoost_optimized,0.2799999999999999,0.8052142857142857,0.7433587960323794,0.931961120640366,0.82704382571193,0.9223024746293794
3
+ CatBoost_optimized,0.22999999999999995,0.8001428571428572,0.7356308935788056,0.9366781017724414,0.824069416498994,0.9251015265637639
4
+ LightGBM_optimized,0.3699999999999999,0.8022857142857143,0.7407196538373947,0.9298170383076043,0.8245658511851945,0.917906034418297
5
+ Ensemble_optimized,0.2599999999999999,0.8077142857142857,0.7460553395838098,0.9326758147512865,0.8289925041290814,0.9249646489680486
model_assets/ensemble_info_optimized.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "weights": {
3
+ "XGBoost": 0.05,
4
+ "CatBoost": 0.8500000000000001,
5
+ "LightGBM": 0.09999999999999987
6
+ },
7
+ "threshold": 0.2599999999999999,
8
+ "optimal_thresholds": {
9
+ "XGBoost": 0.2799999999999999,
10
+ "CatBoost": 0.22999999999999995,
11
+ "LightGBM": 0.3699999999999999
12
+ }
13
+ }
model_assets/hybrid_metrics.csv CHANGED
@@ -1,4 +1,4 @@
1
  version,accuracy,precision,recall,f1,roc_auc
2
- Ensemble@0.5,0.8487857142857143,0.855249745158002,0.8394797026872498,0.8472913510784101,0.923818485328485
3
- HybridA (moderate=positive),0.8252142857142857,0.7774118794974997,0.9110920526014865,0.8389601842711418,0.923818485328485
4
- HybridB (moderate=negative),0.8334285714285714,0.9136218517204683,0.7362778730703259,0.8154187114136457,0.923818485328485
 
1
  version,accuracy,precision,recall,f1,roc_auc
2
+ Ensemble_best@0.5,0.8499285714285715,0.854967367657723,0.8426243567753001,0.8487509898495429,0.9253097715297214
3
+ HybridA_best (moderate=positive),0.8255,0.776173723159044,0.9145225843339051,0.8396876435461644,0.9253097715297214
4
+ HybridB_best (moderate=negative),0.8357857142857142,0.9173627154789408,0.7378502001143511,0.8178721381605006,0.9253097715297214
model_assets/model_metrics_optimized.csv ADDED
@@ -0,0 +1,5 @@
1
+ model,threshold,accuracy,precision,recall,f1,roc_auc
2
+ XGBoost_optimized,0.2799999999999999,0.8052142857142857,0.7433587960323794,0.931961120640366,0.82704382571193,0.9223024746293794
3
+ CatBoost_optimized,0.22999999999999995,0.8001428571428572,0.7356308935788056,0.9366781017724414,0.824069416498994,0.9251015265637639
4
+ LightGBM_optimized,0.3699999999999999,0.8022857142857143,0.7407196538373947,0.9298170383076043,0.8245658511851945,0.917906034418297
5
+ Ensemble_optimized,0.2599999999999999,0.8077142857142857,0.7460553395838098,0.9326758147512865,0.8289925041290814,0.9249646489680486
requirements.txt CHANGED
@@ -6,4 +6,5 @@ xgboost==3.1.1
6
  catboost==1.2.8
7
  lightgbm==4.6.0
8
  joblib==1.5.2
 
9
 
 
6
  catboost==1.2.8
7
  lightgbm==4.6.0
8
  joblib==1.5.2
9
+ optuna==3.6.1
10
 
streamlit_app.py CHANGED
@@ -296,8 +296,14 @@ def load_performance_metrics():
296
  os.path.join(BASE_DIR, "content", "models", "hybrid_metrics.csv"),
297
  ]
298
 
299
- # Load model metrics
300
- for fp in candidate_model_metrics:
301
  if os.path.exists(fp):
302
  try:
303
  df = pd.read_csv(fp)
@@ -322,8 +328,14 @@ def load_performance_metrics():
322
  if metrics_rows:
323
  break
324
 
325
- # Load hybrid/ensemble metrics
326
- for fp in candidate_hybrid_metrics:
327
  if os.path.exists(fp):
328
  try:
329
  dfh = pd.read_csv(fp)
@@ -376,20 +388,28 @@ def get_algo_metrics(metrics_rows, algo_name: str):
376
  best = row
377
  return best
378
 
379
- def get_ensemble_metrics(hybrid_rows):
380
  """Return the preferred ensemble metrics row.
381
- Preference: 'Ensemble_best@0.5' -> 'Ensemble@0.5' -> first Ensemble row.
382
  """
383
  if not hybrid_rows:
384
  return None
385
  # Normalize
386
  rows = list(hybrid_rows)
387
- # First preference
388
  for r in rows:
389
  ver = str(r.get("version", ""))
390
  if ver.lower() == "ensemble_best@0.5" or ("ensemble_best" in ver.lower() and "@0.5" in ver.lower()):
391
  return r
392
- # Second preference
393
  for r in rows:
394
  ver = str(r.get("version", ""))
395
  if ver.lower() == "ensemble@0.5" or ("ensemble" in ver.lower() and "@0.5" in ver.lower()):
@@ -413,15 +433,15 @@ def load_models():
413
  st.warning(f"Preprocessor load skipped: {e}")
414
 
415
  models = {}
416
- # Resolve paths
417
  xgb_path = find_first_existing([
418
- "XGB_spw.joblib", "XGBoost.joblib", "xgb_model.joblib", "xgb_full.joblib", "XGBoost_best_5cv.joblib"
419
  ])
420
  cat_path = find_first_existing([
421
- "CAT_cw.joblib", "CatBoost.joblib", "catboost.joblib", "cat_model.joblib", "cat_full.joblib", "CatBoost_best_5cv.joblib"
422
  ])
423
  lgb_path = find_first_existing([
424
- "LGBM_cw.joblib", "LightGBM.joblib", "lgb_model.joblib", "LightGBM_best_5cv.joblib"
425
  ])
426
 
427
  # Load each model independently so one failure doesn't break others
@@ -520,17 +540,54 @@ if not ("XGBoost" in models and "CatBoost" in models):
520
  st.error("⚠️ Ensemble requires both XGBoost and CatBoost models. Please ensure both artifacts are present in `model_assets/`.")
521
  st.stop()
522
 
523
  # Main title
524
  st.markdown('<h1 class="main-header">Predicting Heart Attack Risk: An Ensemble Modeling Approach</h1>', unsafe_allow_html=True)
525
- st.markdown('<p class="subtitle">Advanced machine learning ensemble combining XGBoost and CatBoost for accurate cardiovascular risk assessment</p>', unsafe_allow_html=True)
526
  st.markdown('<div class="section-divider"></div>', unsafe_allow_html=True)
527
 
528
  # Sidebar for model info
529
  with st.sidebar:
530
  st.header("📊 Ensemble")
531
- st.success("✅ Using Ensemble Only (50% XGBoost + 50% CatBoost)")
532
  _model_rows, _hybrid_rows = load_performance_metrics()
533
- ens_row = get_ensemble_metrics(_hybrid_rows)
534
  acc_text = f"{ens_row['accuracy']*100:.2f}%" if ens_row and ens_row.get('accuracy') is not None else "n/a"
535
  rec_text = f"{ens_row['recall']*100:.2f}%" if ens_row and ens_row.get('recall') is not None else "n/a"
536
  cols_side = st.columns(2)
@@ -659,7 +716,7 @@ with col7:
659
  risk_factors.append("Alcohol")
660
  if active == 0:
661
  lifestyle_score += 1
662
- risk_factors.append("Inactive")
663
 
664
  if lifestyle_score == 0:
665
  score_label = "✅ Low Risk"
@@ -719,11 +776,11 @@ elif ap_hi < 140 or ap_lo < 90:
719
  else:
720
  bp_category = "Stage 2"
721
 
722
- # Risk Level
723
  if health_risk_score <= 2:
724
  risk_level = "Low"
725
  elif health_risk_score <= 4:
726
- risk_level = "Medium"
727
  else:
728
  risk_level = "High"
729
 
@@ -746,7 +803,7 @@ if lifestyle_score > 0:
746
  if alco == 1:
747
  reasons.append("Alcohol consumption")
748
  if active == 0:
749
- reasons.append("Inactive")
750
  if not reasons:
751
  reasons.append("Healthy indicators")
752
  reason = ", ".join(reasons)
@@ -869,36 +926,38 @@ if predict_button:
869
  X_input = pd.DataFrame([input_row])[feature_cols]
870
 
871
  # The model expects numeric features - categorical columns were one-hot encoded during training
872
- # Load sample data to get all possible categorical values for proper one-hot encoding
873
  sample_csv = os.path.join(BASE_DIR, "content", "cardio_train_extended.csv")
874
  cat_cols = ['Age_Group', 'BMI_Category', 'BP_Category', 'Risk_Level']
875
 
 
876
  if os.path.exists(sample_csv):
877
- # Load sample to get all categorical values
878
- sample_df = pd.read_csv(sample_csv, nrows=1000)
879
- # Get all unique values for each categorical column
880
  cat_values = {}
881
  for col in cat_cols:
882
- if col in sample_df.columns:
883
- cat_values[col] = sorted(sample_df[col].unique().tolist())
 
884
  else:
885
- # Fallback to known values
886
  cat_values = {
887
  'Age_Group': ['20-29', '30-39', '40-49', '50-59', '60+'],
888
- 'BMI_Category': ['Underweight', 'Normal', 'Overweight', 'Obese'],
889
- 'BP_Category': ['Normal', 'Elevated', 'Stage 1', 'Stage 2'],
890
- 'Risk_Level': ['Low', 'Medium', 'High']
891
  }
892
 
893
  # Separate numeric and categorical columns
894
  numeric_cols = [col for col in X_input.columns if col not in cat_cols]
895
  X_numeric = X_input[numeric_cols].copy()
896
 
897
- # One-hot encode categorical columns with all possible categories
 
898
  X_cat_encoded_list = []
899
  for col in cat_cols:
900
  if col in X_input.columns:
901
- # Create one-hot columns for all possible values
902
  for val in cat_values.get(col, []):
903
  col_name = f"{col}_{val}"
904
  X_cat_encoded_list.append(pd.Series([1 if X_input[col].iloc[0] == val else 0], name=col_name))
@@ -913,12 +972,24 @@ if predict_button:
913
  # Ensure all columns are numeric (float)
914
  X_processed = X_processed.astype(float)
915
 
916
- # Use ensemble model (50% XGBoost + 50% CatBoost) if both available, otherwise use best model
917
  predictions = {}
918
  ensemble_probs = []
919
  ensemble_weights = []
920
 
921
- # Try ensemble: XGBoost + CatBoost (0.5 each)
 
 
 
 
 
 
 
 
 
 
 
 
922
  if "XGBoost" in models and "CatBoost" in models:
923
  try:
924
  # Predict with XGBoost
@@ -957,8 +1028,9 @@ if predict_button:
957
 
958
  if hasattr(xgb_model, 'predict_proba'):
959
  xgb_prob = float(xgb_model.predict_proba(X_xgb)[0, 1])
960
- ensemble_probs.append(xgb_prob)
961
- ensemble_weights.append(0.5)
 
962
  predictions["XGBoost"] = xgb_prob
963
  except Exception as e:
964
  st.warning(f"⚠️ XGBoost prediction failed (using CatBoost only): {str(e)}")
@@ -968,35 +1040,84 @@ if predict_button:
968
  if "CatBoost" in models:
969
  try:
970
  cat_model = models["CatBoost"]
971
- if hasattr(cat_model, 'feature_names_in_'):
 
 
 
 
972
  expected_features = list(cat_model.feature_names_in_)
973
- X_aligned = pd.DataFrame(0, index=X_processed.index, columns=expected_features)
 
 
 
 
 
 
974
  for col in X_processed.columns:
975
  if col in X_aligned.columns:
976
- X_aligned[col] = X_processed[col]
977
- X_cat = X_aligned
978
  else:
979
  X_cat = X_processed
980
 
981
  if hasattr(cat_model, 'predict_proba'):
982
  cat_prob = float(cat_model.predict_proba(X_cat)[0, 1])
983
- ensemble_probs.append(cat_prob)
984
- ensemble_weights.append(0.5)
 
985
  predictions["CatBoost"] = cat_prob
986
  except Exception as e:
987
  st.warning(f"CatBoost prediction failed: {e}")
988
 
989
- # Ensemble-only: require both model probabilities
990
  if len(ensemble_probs) >= 2:
991
  # Ensemble prediction (weighted average)
992
  ensemble_prob = np.average(ensemble_probs, weights=ensemble_weights)
993
  predictions["Ensemble"] = ensemble_prob
994
  else:
995
- st.error("Ensemble prediction requires both XGBoost and CatBoost probabilities.")
996
  with st.expander("Debug Info"):
997
  st.write("XGBoost available:", "XGBoost" in models)
998
  st.write("CatBoost available:", "CatBoost" in models)
 
999
  st.write("Ensemble probs count:", len(ensemble_probs))
 
1000
  st.stop()
1001
 
1002
  if not predictions:
@@ -1057,71 +1178,118 @@ if predict_button:
1057
  </div>
1058
  """, unsafe_allow_html=True)
1059
 
1060
- # Display Reason
1061
- st.info(f"**Key Risk Factors Identified:** {reason}")
 
 
 
 
 
 
 
1062
 
1063
  # Detailed breakdown with visual bars
1064
  with st.expander("📊 Model Details & Breakdown"):
1065
- # Ensemble-only display
1066
- display_order = ["Ensemble"] if "Ensemble" in predictions else []
1067
-
1068
  # Load accuracy/recall metrics for display under each model
1069
  _model_rows_all, _hybrid_rows_all = load_performance_metrics()
1070
  xgb_m_all = get_algo_metrics(_model_rows_all, "XGBoost")
1071
  cat_m_all = get_algo_metrics(_model_rows_all, "CatBoost")
1072
- avg_acc_all = None
1073
- if xgb_m_all and cat_m_all and (xgb_m_all.get("accuracy") is not None) and (cat_m_all.get("accuracy") is not None):
1074
- avg_acc_all = (xgb_m_all["accuracy"] + cat_m_all["accuracy"]) / 2.0
1075
- ens_best_all = None
1076
- for hr in _hybrid_rows_all or []:
1077
- if "ENSEMBLE" in hr.get("version", "").upper() and "@0.5" in hr.get("version", ""):
1078
- ens_best_all = hr
 
1079
  break
1080
-
1081
- # Explicit ensemble header with models and average accuracy
1082
- header_text = "Ensemble uses: XGBoost + CatBoost"
1083
- if avg_acc_all is not None:
1084
- st.markdown(f"**{header_text}** · Average@0.5 Accuracy: {avg_acc_all*100:.2f}%")
  else:
1086
  st.markdown(f"**{header_text}**")
1087
-
1088
- # Create columns for display
1089
- if len(display_order) > 0:
1090
- cols = st.columns(len(display_order))
1091
- for idx, name in enumerate(display_order):
1092
- with cols[idx]:
1093
- if name == "Ensemble":
1094
- st.write(f"**🎯 {name} (Final)**")
1095
- risk_prob = float(predictions[name])
1096
- risk_pct = risk_prob * 100
1097
-
1098
- # Custom progress bar that fills proportionally to risk
1099
- st.markdown(f"""
1100
- <div style="background: rgba(148, 163, 184, 0.1); border-radius: 10px; height: 32px; width: 100%; position: relative; overflow: hidden; border: 1px solid rgba(148, 163, 184, 0.2);">
1101
- <div style="background: linear-gradient(90deg, {'#EF4444' if risk_pct >= 50 else '#F59E0B' if risk_pct >= 30 else '#10B981'}, {'#DC2626' if risk_pct >= 50 else '#D97706' if risk_pct >= 30 else '#059669'}); width: {risk_pct}%; height: 100%; border-radius: 10px; transition: width 0.3s ease; display: flex; align-items: center; justify-content: center; color: white; font-weight: 600; font-size: 0.9rem;">
1102
- {risk_pct:.1f}%
1103
- </div>
1104
- </div>
1105
- """, unsafe_allow_html=True)
1106
- st.caption(f"Risk Level: {risk_pct:.2f}%")
1107
-
1108
- # Show ensemble accuracy from average and recorded best if available
1109
- if avg_acc_all is not None:
1110
- st.caption(f"Accuracy (Average@0.5): {avg_acc_all*100:.2f}%")
1111
- if ens_best_all and ens_best_all.get("accuracy") is not None:
1112
- st.caption(f"Recorded Ensemble_best@0.5: {ens_best_all['accuracy']*100:.2f}%")
1113
- st.success("✅ Final decision uses Ensemble (50% XGBoost + 50% CatBoost)")
1114
- else:
1115
- pass # No individual model cards in ensemble-only mode
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1116
 
1117
  # Show ensemble info
1118
  if "Ensemble" in predictions:
1119
- st.info("💡 **Ensemble Method**: Weighted average (50% XGBoost + 50% CatBoost). Final decision uses the Ensemble output.")
 
 
 
 
 
 
 
 
 
 
1120
 
1121
  # Metrics breakdown: show per-model accuracy and averaged accuracy (concise)
1122
  st.markdown("---")
1123
  st.subheader("Ensemble Metrics")
1124
- ens_row_bd = get_ensemble_metrics(_hybrid_rows_all)
1125
  acc_bd = f"{ens_row_bd['accuracy']*100:.2f}%" if ens_row_bd and ens_row_bd.get('accuracy') is not None else "n/a"
1126
  rec_bd = f"{ens_row_bd['recall']*100:.2f}%" if ens_row_bd and ens_row_bd.get('recall') is not None else "n/a"
1127
  cols_acc = st.columns(2)
 
         os.path.join(BASE_DIR, "content", "models", "hybrid_metrics.csv"),
     ]

+    # Load model metrics - prioritize optimized metrics
+    candidate_model_metrics_priority = [
+        os.path.join(BASE_DIR, "content", "models", "model_metrics_optimized.csv"),
+        os.path.join(BASE_DIR, "model_assets", "model_metrics_optimized.csv"),
+        os.path.join(BASE_DIR, "content", "models", "model_metrics_best.csv"),
+    ] + candidate_model_metrics
+
+    for fp in candidate_model_metrics_priority:
         if os.path.exists(fp):
             try:
                 df = pd.read_csv(fp)
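The pattern in this hunk — walk a priority-ordered list of metrics files and load the first one that exists — can be sketched on its own. The `first_existing` helper and the file names below are illustrative, not taken from the repo:

```python
import os
import tempfile

def first_existing(paths):
    """Return the first path in `paths` that exists on disk, or None."""
    for fp in paths:
        if os.path.exists(fp):
            return fp
    return None

# Demo: the optimized file takes priority only once it actually exists.
with tempfile.TemporaryDirectory() as d:
    fallback = os.path.join(d, "model_metrics.csv")
    optimized = os.path.join(d, "model_metrics_optimized.csv")
    with open(fallback, "w") as f:
        f.write("model,accuracy\nXGBoost,0.80\n")
    # Optimized file missing -> fallback wins
    assert first_existing([optimized, fallback]) == fallback
    with open(optimized, "w") as f:
        f.write("model,accuracy\nEnsemble_optimized,0.8077\n")
    # Optimized file present -> it wins
    assert first_existing([optimized, fallback]) == optimized
```

Because the optimized paths are prepended to the old candidate list, the previous behavior is preserved as a fallback when no optimized metrics are shipped.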
 
         if metrics_rows:
             break

+    # Load hybrid/ensemble metrics - prioritize optimized metrics
+    candidate_hybrid_metrics_priority = [
+        os.path.join(BASE_DIR, "content", "models", "hybrid_metrics_best.csv"),
+        os.path.join(BASE_DIR, "model_assets", "hybrid_metrics.csv"),
+        os.path.join(BASE_DIR, "content", "models", "hybrid_metrics.csv"),
+    ] + candidate_hybrid_metrics
+
+    for fp in candidate_hybrid_metrics_priority:
         if os.path.exists(fp):
             try:
                 dfh = pd.read_csv(fp)
 
             best = row
     return best

+def get_ensemble_metrics(hybrid_rows, metrics_rows=None):
     """Return the preferred ensemble metrics row.
+    Preference: 'Ensemble_optimized' from model_metrics -> 'Ensemble_best@0.5' -> 'Ensemble@0.5' -> first Ensemble row.
     """
+    # First, try to get Ensemble_optimized from model_metrics (most recent optimized)
+    if metrics_rows:
+        for row in metrics_rows:
+            model_name = str(row.get("model", "")).upper()
+            if "ENSEMBLE" in model_name and "OPTIMIZED" in model_name:
+                return row
+
+    # Then check hybrid_rows
     if not hybrid_rows:
         return None
     # Normalize
     rows = list(hybrid_rows)
+    # First preference: Ensemble_best@0.5
     for r in rows:
         ver = str(r.get("version", ""))
         if ver.lower() == "ensemble_best@0.5" or ("ensemble_best" in ver.lower() and "@0.5" in ver.lower()):
             return r
+    # Second preference: Ensemble@0.5
     for r in rows:
         ver = str(r.get("version", ""))
         if ver.lower() == "ensemble@0.5" or ("ensemble" in ver.lower() and "@0.5" in ver.lower()):
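The preference chain in `get_ensemble_metrics` boils down to "optimized row from the model metrics wins, then the best hybrid row, then the plain one." A condensed sketch; the rows and the `pick_ensemble_row` name are hypothetical stand-ins for the CSV contents:

```python
# Hypothetical rows mimicking the two metrics CSVs used by the app.
model_rows = [
    {"model": "XGBoost", "accuracy": 0.79},
    {"model": "Ensemble_optimized", "accuracy": 0.8077, "recall": 0.9327},
]
hybrid_rows = [
    {"version": "Ensemble@0.5", "accuracy": 0.795},
    {"version": "Ensemble_best@0.5", "accuracy": 0.801},
]

def pick_ensemble_row(hybrid_rows, metrics_rows=None):
    # 1) Ensemble_optimized from model metrics wins outright
    for row in metrics_rows or []:
        name = str(row.get("model", "")).upper()
        if "ENSEMBLE" in name and "OPTIMIZED" in name:
            return row
    # 2) then Ensemble_best@0.5, 3) then Ensemble@0.5
    for needle in ("ensemble_best", "ensemble"):
        for row in hybrid_rows or []:
            ver = str(row.get("version", "")).lower()
            if needle in ver and "@0.5" in ver:
                return row
    return None

assert pick_ensemble_row(hybrid_rows, model_rows)["model"] == "Ensemble_optimized"
assert pick_ensemble_row(hybrid_rows)["version"] == "Ensemble_best@0.5"
```

This mirrors why the diff threads `metrics_rows` through as a new optional parameter: callers that pass it automatically prefer the optimized ensemble numbers.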
 
     st.warning(f"Preprocessor load skipped: {e}")

 models = {}
+# Resolve paths - prioritize optimized models
 xgb_path = find_first_existing([
+    "XGBoost_optimized.joblib", "XGB_spw.joblib", "XGBoost.joblib", "xgb_model.joblib", "xgb_full.joblib", "XGBoost_best_5cv.joblib"
 ])
 cat_path = find_first_existing([
+    "CatBoost_optimized.joblib", "CAT_cw.joblib", "CatBoost.joblib", "catboost.joblib", "cat_model.joblib", "cat_full.joblib", "CatBoost_best_5cv.joblib"
 ])
 lgb_path = find_first_existing([
+    "LightGBM_optimized.joblib", "LGBM_cw.joblib", "LightGBM.joblib", "lgb_model.joblib", "LightGBM_best_5cv.joblib"
 ])

 # Load each model independently so one failure doesn't break others
 
 st.error("⚠️ Ensemble requires both XGBoost and CatBoost models. Please ensure both artifacts are present in `model_assets/`.")
 st.stop()

+# Load ensemble configuration (weights and thresholds)
+ensemble_config = None
+ensemble_info_paths = [
+    os.path.join(BASE_DIR, "model_assets", "ensemble_info_optimized.json"),
+    os.path.join(BASE_DIR, "content", "models", "ensemble_info_optimized.json"),
+]
+for path in ensemble_info_paths:
+    if os.path.exists(path):
+        try:
+            with open(path, 'r') as f:
+                ensemble_config = json.load(f)
+            break
+        except Exception as e:
+            continue
+
+# Default ensemble weights if config not found
+if ensemble_config:
+    ensemble_weights_config = ensemble_config.get('weights', {})
+    default_xgb_weight = ensemble_weights_config.get('XGBoost', 0.5)
+    default_cat_weight = ensemble_weights_config.get('CatBoost', 0.5)
+    default_lgb_weight = ensemble_weights_config.get('LightGBM', 0.0)
+else:
+    default_xgb_weight = 0.5
+    default_cat_weight = 0.5
+    default_lgb_weight = 0.0
+
 # Main title
 st.markdown('<h1 class="main-header">Predicting Heart Attack Risk: An Ensemble Modeling Approach</h1>', unsafe_allow_html=True)
+st.markdown('<p class="subtitle">Advanced machine learning ensemble combining XGBoost, CatBoost, and LightGBM for accurate cardiovascular risk assessment</p>', unsafe_allow_html=True)
 st.markdown('<div class="section-divider"></div>', unsafe_allow_html=True)

 # Sidebar for model info
 with st.sidebar:
     st.header("📊 Ensemble")
+    # Display ensemble weights
+    if ensemble_config:
+        weights = ensemble_config.get('weights', {})
+        xgb_w = weights.get('XGBoost', 0.5) * 100
+        cat_w = weights.get('CatBoost', 0.5) * 100
+        lgb_w = weights.get('LightGBM', 0.0) * 100
+        if lgb_w > 0:
+            st.success(f"✅ Using Optimized Ensemble\nXGBoost: {xgb_w:.1f}% | CatBoost: {cat_w:.1f}% | LightGBM: {lgb_w:.1f}%")
+        else:
+            st.success(f"✅ Using Optimized Ensemble\nXGBoost: {xgb_w:.1f}% | CatBoost: {cat_w:.1f}%")
+    else:
+        st.success("✅ Using Ensemble (50% XGBoost + 50% CatBoost)")
     _model_rows, _hybrid_rows = load_performance_metrics()
+    ens_row = get_ensemble_metrics(_hybrid_rows, _model_rows)
     acc_text = f"{ens_row['accuracy']*100:.2f}%" if ens_row and ens_row.get('accuracy') is not None else "n/a"
     rec_text = f"{ens_row['recall']*100:.2f}%" if ens_row and ens_row.get('recall') is not None else "n/a"
     cols_side = st.columns(2)
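The config loader above falls back to an even 50/50 XGBoost/CatBoost split whenever `ensemble_info_optimized.json` is missing or unreadable. A minimal sketch of that parse-with-fallback pattern (the `load_weights` helper is hypothetical; the 5/85/10 split comes from the commit message):

```python
import json

def load_weights(config_text=None):
    """Parse ensemble weights from a JSON config, falling back to 50/50 XGB/CAT."""
    defaults = {"XGBoost": 0.5, "CatBoost": 0.5, "LightGBM": 0.0}
    if not config_text:
        return defaults
    try:
        cfg = json.loads(config_text)
    except json.JSONDecodeError:
        return defaults
    w = cfg.get("weights", {})
    # Keep per-model defaults for any weight the config omits
    return {k: w.get(k, defaults[k]) for k in defaults}

cfg = '{"weights": {"XGBoost": 0.05, "CatBoost": 0.85, "LightGBM": 0.10}}'
assert load_weights(cfg) == {"XGBoost": 0.05, "CatBoost": 0.85, "LightGBM": 0.10}
assert load_weights(None)["XGBoost"] == 0.5
```

Defaulting per key (rather than all-or-nothing) keeps the app running on partially written configs, which is the same degradation behavior the diff implements.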
 
     risk_factors.append("Alcohol")
 if active == 0:
     lifestyle_score += 1
+    risk_factors.append("Physical inactivity")

 if lifestyle_score == 0:
     score_label = "✅ Low Risk"
 
 else:
     bp_category = "Stage 2"

+# Risk Level (Note: data uses "Moderate" not "Medium")
 if health_risk_score <= 2:
     risk_level = "Low"
 elif health_risk_score <= 4:
+    risk_level = "Moderate"  # Changed from "Medium" to match training data
 else:
     risk_level = "High"
 
 if alco == 1:
     reasons.append("Alcohol consumption")
 if active == 0:
+    reasons.append("Physical inactivity")
 if not reasons:
     reasons.append("Healthy indicators")
 reason = ", ".join(reasons)
 
 X_input = pd.DataFrame([input_row])[feature_cols]

 # The model expects numeric features - categorical columns were one-hot encoded during training
+# Load FULL dataset to get ALL possible categorical values (matching training)
 sample_csv = os.path.join(BASE_DIR, "content", "cardio_train_extended.csv")
 cat_cols = ['Age_Group', 'BMI_Category', 'BP_Category', 'Risk_Level']

+# Get all categorical values from FULL dataset (not just sample)
 if os.path.exists(sample_csv):
+    # Load full dataset to get ALL unique values (matching training)
+    full_df = pd.read_csv(sample_csv)
     cat_values = {}
     for col in cat_cols:
+        if col in full_df.columns:
+            # Get all unique values and sort them (matching pandas get_dummies behavior)
+            cat_values[col] = sorted(full_df[col].unique().tolist())
 else:
+    # Fallback to known values (matching actual data)
     cat_values = {
         'Age_Group': ['20-29', '30-39', '40-49', '50-59', '60+'],
+        'BMI_Category': ['Normal', 'Obese', 'Overweight', 'Underweight'],  # Sorted order from data
+        'BP_Category': ['Elevated', 'Normal', 'Stage 1', 'Stage 2'],  # Sorted order from data
+        'Risk_Level': ['High', 'Low', 'Moderate']  # Note: "Moderate" not "Medium"
     }

 # Separate numeric and categorical columns
 numeric_cols = [col for col in X_input.columns if col not in cat_cols]
 X_numeric = X_input[numeric_cols].copy()

+# One-hot encode categorical columns with all possible categories in sorted order
+# This matches pandas get_dummies behavior during training
 X_cat_encoded_list = []
 for col in cat_cols:
     if col in X_input.columns:
+        # Create one-hot columns for all possible values in sorted order
         for val in cat_values.get(col, []):
             col_name = f"{col}_{val}"
             X_cat_encoded_list.append(pd.Series([1 if X_input[col].iloc[0] == val else 0], name=col_name))
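The encoding above manually rebuilds the one-hot columns that `pd.get_dummies` produced at training time, emitting a 0/1 column for every known category in sorted order. A tiny sketch of the idea (the `encode_one_hot` helper is hypothetical), which also shows why an unseen label such as "Medium" silently encodes as all zeros instead of raising:

```python
# Categories in sorted order, as pd.get_dummies would emit them during training.
risk_levels = ["High", "Low", "Moderate"]

def encode_one_hot(value, col, categories):
    """One-hot encode a single value against the full training category list."""
    return {f"{col}_{c}": (1 if value == c else 0) for c in categories}

row = encode_one_hot("Moderate", "Risk_Level", risk_levels)
assert row == {"Risk_Level_High": 0, "Risk_Level_Low": 0, "Risk_Level_Moderate": 1}

# An unseen value encodes to all zeros rather than raising, which is why the
# diff renames "Medium" to "Moderate": a mismatched label would feed the models
# an all-zero Risk_Level block and quietly skew predictions.
assert sum(encode_one_hot("Medium", "Risk_Level", risk_levels).values()) == 0
```

Generating the columns from the *full* dataset (not a sample) guarantees the inference-time column set matches training exactly.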
 
 # Ensure all columns are numeric (float)
 X_processed = X_processed.astype(float)

+# Use ensemble model with optimized weights
 predictions = {}
 ensemble_probs = []
 ensemble_weights = []

+# Get ensemble weights from config or use defaults
+xgb_weight = default_xgb_weight if ensemble_config else 0.5
+cat_weight = default_cat_weight if ensemble_config else 0.5
+lgb_weight = default_lgb_weight if ensemble_config else 0.0
+
+# Normalize weights to sum to 1.0
+total_weight = xgb_weight + cat_weight + lgb_weight
+if total_weight > 0:
+    xgb_weight = xgb_weight / total_weight
+    cat_weight = cat_weight / total_weight
+    lgb_weight = lgb_weight / total_weight
+
+# Try ensemble: XGBoost + CatBoost + LightGBM (if available)
 if "XGBoost" in models and "CatBoost" in models:
     try:
         # Predict with XGBoost
 
         if hasattr(xgb_model, 'predict_proba'):
             xgb_prob = float(xgb_model.predict_proba(X_xgb)[0, 1])
+            if xgb_weight > 0:
+                ensemble_probs.append(xgb_prob)
+                ensemble_weights.append(xgb_weight)
             predictions["XGBoost"] = xgb_prob
     except Exception as e:
         st.warning(f"⚠️ XGBoost prediction failed (using CatBoost only): {str(e)}")
 
 if "CatBoost" in models:
     try:
         cat_model = models["CatBoost"]
+        # CatBoost is very strict about feature order and names
+        if hasattr(cat_model, 'feature_names_'):
+            # CatBoost uses feature_names_ (with underscore)
+            expected_features = list(cat_model.feature_names_)
+        elif hasattr(cat_model, 'feature_names_in_'):
             expected_features = list(cat_model.feature_names_in_)
+        else:
+            expected_features = None
+
+        if expected_features:
+            # Create DataFrame with exact feature order and names expected by CatBoost
+            X_aligned = pd.DataFrame(0.0, index=X_processed.index, columns=expected_features, dtype=float)
+            # Match columns by name
             for col in X_processed.columns:
                 if col in X_aligned.columns:
+                    X_aligned[col] = X_processed[col].values
+            X_cat = X_aligned[expected_features]  # Ensure exact order
         else:
             X_cat = X_processed

         if hasattr(cat_model, 'predict_proba'):
             cat_prob = float(cat_model.predict_proba(X_cat)[0, 1])
+            if cat_weight > 0:
+                ensemble_probs.append(cat_prob)
+                ensemble_weights.append(cat_weight)
             predictions["CatBoost"] = cat_prob
     except Exception as e:
         st.warning(f"CatBoost prediction failed: {e}")

+# Predict with LightGBM (if included in ensemble)
+if "LightGBM" in models and lgb_weight > 0:
+    try:
+        lgb_model = models["LightGBM"]
+        # LightGBM is strict about feature order and names
+        if hasattr(lgb_model, 'feature_name_'):
+            # LightGBM uses feature_name_ (with underscore, singular)
+            expected_features = list(lgb_model.feature_name_)
+        elif hasattr(lgb_model, 'feature_names_in_'):
+            expected_features = list(lgb_model.feature_names_in_)
+        else:
+            expected_features = None
+
+        if expected_features:
+            # Create DataFrame with exact feature order and names expected by LightGBM
+            X_aligned = pd.DataFrame(0.0, index=X_processed.index, columns=expected_features, dtype=float)
+            # Match columns by name
+            for col in X_processed.columns:
+                if col in X_aligned.columns:
+                    X_aligned[col] = X_processed[col].values
+            X_lgb = X_aligned[expected_features]  # Ensure exact order
+        else:
+            X_lgb = X_processed
+
+        if hasattr(lgb_model, 'predict_proba'):
+            lgb_prob = float(lgb_model.predict_proba(X_lgb)[0, 1])
+            ensemble_probs.append(lgb_prob)
+            ensemble_weights.append(lgb_weight)
+            predictions["LightGBM"] = lgb_prob
+    except Exception as e:
+        st.warning(f"LightGBM prediction failed: {e}")
+
+# Ensemble: require at least XGBoost and CatBoost probabilities
 if len(ensemble_probs) >= 2:
+    # Normalize weights to sum to 1.0
+    total_weight = sum(ensemble_weights)
+    if total_weight > 0:
+        ensemble_weights = [w / total_weight for w in ensemble_weights]
     # Ensemble prediction (weighted average)
     ensemble_prob = np.average(ensemble_probs, weights=ensemble_weights)
     predictions["Ensemble"] = ensemble_prob
 else:
+    st.error("Ensemble prediction requires at least XGBoost and CatBoost probabilities.")
     with st.expander("Debug Info"):
         st.write("XGBoost available:", "XGBoost" in models)
         st.write("CatBoost available:", "CatBoost" in models)
+        st.write("LightGBM available:", "LightGBM" in models)
         st.write("Ensemble probs count:", len(ensemble_probs))
+        st.write("Ensemble weights:", ensemble_weights)
     st.stop()

 if not predictions:
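The final combination step re-normalizes the collected weights (since a failed model drops its weight from the list) and then takes `np.average(ensemble_probs, weights=ensemble_weights)`. A standalone sketch of that calculation — the `ensemble_probability` function and the sample probabilities are hypothetical; the 5/85/10 split comes from the commit message:

```python
import numpy as np

def ensemble_probability(probs, weights):
    """Weighted average of per-model probabilities; weights renormalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("ensemble weights must sum to a positive value")
    norm = [w / total for w in weights]
    return float(np.average(probs, weights=norm))

# Hypothetical per-model probabilities with the commit's XGB 5% / CAT 85% / LGB 10% weights.
p = ensemble_probability([0.40, 0.70, 0.60], [0.05, 0.85, 0.10])
assert abs(p - (0.40 * 0.05 + 0.70 * 0.85 + 0.60 * 0.10)) < 1e-9
```

Renormalizing after dropping a failed model keeps the remaining models' *relative* weights intact, so the ensemble degrades gracefully instead of shrinking every probability toward zero.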
 
         </div>
         """, unsafe_allow_html=True)

+# Display Reason with better formatting
+if reason and reason != "Healthy indicators":
+    # Check if only "Physical inactivity" is the risk factor (less severe)
+    if reason == "Physical inactivity":
+        st.info(f"**ℹ️ Lifestyle Note:** {reason} - Consider adding regular physical activity to reduce risk.")
+    else:
+        st.warning(f"**⚠️ Key Risk Factors Identified:** {reason}")
+else:
+    st.success(f"**✅ Health Status:** {reason}")

 # Detailed breakdown with visual bars
 with st.expander("📊 Model Details & Breakdown"):
     # Load accuracy/recall metrics for display under each model
     _model_rows_all, _hybrid_rows_all = load_performance_metrics()
     xgb_m_all = get_algo_metrics(_model_rows_all, "XGBoost")
     cat_m_all = get_algo_metrics(_model_rows_all, "CatBoost")
+    lgb_m_all = get_algo_metrics(_model_rows_all, "LightGBM")
+
+    # Get optimized ensemble metrics
+    ens_opt_all = None
+    for row in _model_rows_all or []:
+        model_name = str(row.get("model", "")).upper()
+        if "ENSEMBLE" in model_name and "OPTIMIZED" in model_name:
+            ens_opt_all = row
             break
+
+    # Explicit ensemble header with models and weights
+    if ensemble_config:
+        weights = ensemble_config.get('weights', {})
+        xgb_w = weights.get('XGBoost', 0.5) * 100
+        cat_w = weights.get('CatBoost', 0.5) * 100
+        lgb_w = weights.get('LightGBM', 0.0) * 100
+        if lgb_w > 0:
+            header_text = f"Ensemble uses: XGBoost ({xgb_w:.1f}%) + CatBoost ({cat_w:.1f}%) + LightGBM ({lgb_w:.1f}%)"
+        else:
+            header_text = f"Ensemble uses: XGBoost ({xgb_w:.1f}%) + CatBoost ({cat_w:.1f}%)"
+    else:
+        header_text = "Ensemble uses: XGBoost + CatBoost"
+
+    if ens_opt_all and ens_opt_all.get("accuracy") is not None:
+        st.markdown(f"**{header_text}** · Accuracy: {ens_opt_all['accuracy']*100:.2f}% | Recall: {ens_opt_all['recall']*100:.2f}%")
     else:
         st.markdown(f"**{header_text}**")
+
+    # Helper function to create risk bar with percentage inside
+    def create_risk_bar(risk_pct, model_name):
+        # Use teal/green color for low risk, orange for moderate, red for high
+        if risk_pct >= 50:
+            color = '#EF4444'  # Red
+        elif risk_pct >= 30:
+            color = '#F59E0B'  # Orange
+        else:
+            color = '#14B8A6'  # Teal/Green
+
+        # Ensure bar width doesn't exceed 100%
+        bar_width = min(risk_pct, 100)
+
+        return f"""
+        <div style="background: rgba(148, 163, 184, 0.15); border-radius: 8px; height: 36px; width: 100%; position: relative; overflow: hidden; border: 1px solid rgba(148, 163, 184, 0.3); margin: 8px 0;">
+            <div style="background: {color}; width: {bar_width}%; height: 100%; border-radius: 8px; display: flex; align-items: center; justify-content: flex-start; padding-left: 8px; color: white; font-weight: 600; font-size: 0.85rem; transition: width 0.3s ease;">
+                {risk_pct:.2f}%
+            </div>
+        </div>
+        """
+
+    # Display all models horizontally on the same line (4 columns)
+    models_to_show = []
+
+    # Collect all available models in order
+    if "XGBoost" in predictions:
+        models_to_show.append(("XGBoost Model", "XGBoost"))
+    if "CatBoost" in predictions:
+        models_to_show.append(("CatBoost Model", "CatBoost"))
+    if "LightGBM" in predictions:
+        models_to_show.append(("LightGBM Model", "LightGBM"))
+    if "Ensemble" in predictions:
+        models_to_show.append(("🎯 Ensemble (Final)", "Ensemble"))
+
+    # Create columns for all models - equal width
+    if models_to_show:
+        num_cols = len(models_to_show)
+        model_cols = st.columns(num_cols)
+
+        for idx, (display_name, model_key) in enumerate(models_to_show):
+            with model_cols[idx]:
+                # Model title
+                st.markdown(f"**{display_name}**", unsafe_allow_html=True)
+                # Calculate risk percentage
+                risk_pct = float(predictions[model_key]) * 100
+                # Display progress bar
+                st.markdown(create_risk_bar(risk_pct, model_key), unsafe_allow_html=True)
+                # Risk percentage below bar
+                st.markdown(f"<div style='text-align: center; margin-top: -8px; font-size: 0.85rem; color: #666;'>{risk_pct:.2f}% risk</div>", unsafe_allow_html=True)

 # Show ensemble info
 if "Ensemble" in predictions:
+    if ensemble_config:
+        weights = ensemble_config.get('weights', {})
+        xgb_w = weights.get('XGBoost', 0.5) * 100
+        cat_w = weights.get('CatBoost', 0.5) * 100
+        lgb_w = weights.get('LightGBM', 0.0) * 100
+        if lgb_w > 0:
+            st.info(f"💡 **Ensemble Method**: Weighted average (XGBoost: {xgb_w:.1f}%, CatBoost: {cat_w:.1f}%, LightGBM: {lgb_w:.1f}%). Final decision uses the Ensemble output.")
+        else:
+            st.info(f"💡 **Ensemble Method**: Weighted average (XGBoost: {xgb_w:.1f}%, CatBoost: {cat_w:.1f}%). Final decision uses the Ensemble output.")
+    else:
+        st.info("💡 **Ensemble Method**: Weighted average (50% XGBoost + 50% CatBoost). Final decision uses the Ensemble output.")

 # Metrics breakdown: show per-model accuracy and averaged accuracy (concise)
 st.markdown("---")
 st.subheader("Ensemble Metrics")
+ens_row_bd = get_ensemble_metrics(_hybrid_rows_all, _model_rows_all)
 acc_bd = f"{ens_row_bd['accuracy']*100:.2f}%" if ens_row_bd and ens_row_bd.get('accuracy') is not None else "n/a"
 rec_bd = f"{ens_row_bd['recall']*100:.2f}%" if ens_row_bd and ens_row_bd.get('recall') is not None else "n/a"
 cols_acc = st.columns(2)