aankitdas committed on
Commit
d6ba1de
·
1 Parent(s): 77dd0b0

updated readme

Files changed (1)
  1. README.md +95 -223
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 title: Resource Optimization ML Pipeline
- emoji: 🚀
 colorFrom: blue
 colorTo: green
 sdk: docker
@@ -8,271 +8,143 @@ sdk_version: latest
 app_file: app.py
 pinned: false
 ---
- # 🚀 Resource Optimization ML Pipeline
-
- An end-to-end machine learning solution for optimizing service placement across AWS regions, reducing latency and costs while maintaining reliability.
-
- **Live Dashboard:** [View on Hugging Face Spaces](https://huggingface.co/spaces/aankitdas/resource-optimization-ml)
-
- ## 📊 Project Overview
-
- This project demonstrates a complete ML pipeline inspired by Amazon's Region Flexibility Engineering team challenges:
-
- - **Problem:** Optimize service placement across 5 AWS regions to reduce latency and costs
- - **Solution:** ML-driven placement strategy with A/B testing validation
- - **Results:** 5.25% latency reduction, 4.92% cost savings, statistically significant (p < 0.001)
-
- ## 🎯 Key Results
-
- | Metric | Result |
- |--------|--------|
- | Latency Reduction | **5.25%** ✅ |
- | Cost Savings | **4.92%** ✅ |
- | Critical Service Improvement | **9.30%** ✅ |
- | Statistical Significance | **p < 0.001** ✅ |
- | Placement Efficiency | **378 vs 452 pairs** (-16%) |
-
- ## 🛠️ Architecture
-
- ### Data Pipeline
- - **150+ services** with metadata (memory, CPU, latency sensitivity)
- - **1.6M+ traffic records** across 5 AWS regions
- - **30K+ placement records** with latency and error rates
- - **Regional latency matrix** for cross-region communication costs
-
- ### ML Models
-
- #### Model 1: Latency Prediction (XGBoost Regression)
- - Predicts service latency for a given placement
- - **Features:** Memory, CPU cores, traffic patterns, outbound latency, service dependencies
- - **Performance:** RMSE = 28.7ms, MAE = 24.67ms
- - **Top Features:** Request variability, outbound latency, average traffic
-
- #### Model 2: Placement Strategy (Random Forest Classifier)
- - Classifies services for optimal regional distribution
- - **Features:** Traffic volume, dependencies, latency sensitivity, resource requirements
- - **Performance:** 100% accuracy on test set
-
- ### A/B Testing Framework
- - **Control:** Random service placement (baseline)
- - **Treatment:** ML-optimized placement using model predictions
- - **Statistical Test:** Independent t-test (t=7.02, p<0.001)
- - **Result:** Statistically significant improvement ✅
-
- ## 📁 Project Structure
-
- ```
- resource-optimization-ml/
- ├── data/ # Generated datasets
- │ ├── services.csv # Service metadata
- │ ├── regional_latency.csv # Cross-region latency
- │ ├── traffic_patterns.csv # Hourly traffic by service/region
- │ └── service_placement.csv # Historical placements
- │
- ├── models/ # Trained ML models
- │ ├── xgboost_latency_model.pkl # Latency prediction model
- │ ├── random_forest_placement_model.pkl # Placement strategy model
- │ ├── scaler_latency.pkl # Feature scaler
- │ ├── scaler_classification.pkl # Feature scaler
- │ └── feature_importance_*.csv # Feature importance analysis
- │
- ├── results/ # A/B test results
- │ ├── ab_test_results.json # Statistical comparison
- │ ├── control_placement.csv # Control group placements
- │ └── treatment_placement.csv # Treatment group placements
- │
- ├── notebooks/ # Analysis notebooks (optional)
- │
- ├── data_generation.py # Generate synthetic dataset
- ├── setup_database.py # Load data into SQLite
- ├── explore_data.py # Data exploration and SQL queries
- ├── train_models.py # Train ML models
- ├── ab_test_simulation.py # Run A/B test simulation
- ├── app.py # Streamlit dashboard
- ├── requirements.txt # Python dependencies
- ├── README.md # This file
- └── .gitignore
- ```
-
- ## 🚀 Quick Start
-
- ### Local Development
-
- 1. **Clone the repository**
- ```bash
- git clone https://github.com/YOUR_USERNAME/resource-optimization-ml.git
- cd resource-optimization-ml
- ```
-
- 2. **Install dependencies** (using uv or pip)
- ```bash
- uv pip install -r requirements.txt
- ```
-
- 3. **Generate data**
- ```bash
- uv run python data_generation.py
- ```
-
- 4. **Setup database**
- ```bash
- uv run python setup_database.py
- ```
-
- 5. **Explore data**
- ```bash
- uv run python explore_data.py
- ```
-
- 6. **Train models**
- ```bash
- uv run python train_models.py
- ```
-
- 7. **Run A/B test simulation**
- ```bash
- uv run python ab_test_simulation.py
- ```
-
- 8. **Launch dashboard**
- ```bash
- uv run streamlit run app.py
- ```
-
- The dashboard will open at `http://localhost:8501`
-
- ## 📊 Dashboard Features
-
- ### 📈 Overview
- - Service distribution by memory, CPU, and latency sensitivity
- - Traffic volume analysis across regions
- - Total statistics (150 services, 5 regions, 1.6M records)
-
- ### 🎯 A/B Test Results
- - Side-by-side comparison of control vs treatment strategies
- - Latency reduction: 5.25%
- - Cost savings: 4.92%
- - Statistical significance test results (p-value, t-statistic)
-
- ### 🗺️ Regional Analysis
- - Interactive latency heatmap between all region pairs
- - Regional statistics (min, max, std deviation)
- - Identify high-latency corridors
-
- ### 🔧 Service Details
- - Interactive service explorer
- - Per-service placement across regions
- - Instance count and latency metrics
-
- ## 🧠 Technical Stack
-
- | Component | Tool | Purpose |
- |-----------|------|---------|
- | Data Storage | SQLite | Lightweight database for local development |
- | Data Processing | Pandas, NumPy | Data manipulation and feature engineering |
- | ML Framework | scikit-learn, XGBoost | Model training and prediction |
- | Statistics | SciPy | A/B testing and significance tests |
- | Visualization | Plotly, Streamlit | Interactive dashboards |
- | Deployment | Hugging Face Spaces | Live dashboard hosting |
-
- ## 📈 Model Performance
-
- ### XGBoost (Latency Prediction)
- ```
- RMSE: 28.7007 ms
- MAE: 24.6690 ms
- R²: -0.0674 (indicates high variance in data)
- ```
-
- **Top 5 Important Features:**
- 1. Request Variability (CV): 21.7%
- 2. Outbound Latency: 17.6%
- 3. Average Requests: 14.2%
- 4. Dependencies: 13.5%
- 5. Number of Instances: 11.7%
-
- ### Random Forest (Placement Strategy)
 ```
- Accuracy: 100%
- Precision: 1.00
- Recall: 1.00
- F1-Score: 1.00
 ```
-
- **Top Features:**
- 1. Traffic Volume: 54.5%
- 2. Dependencies: 13.8%
- 3. Latency Sensitivity: 13.7%
-
- ## 🧪 A/B Test Methodology
-
- **Hypothesis:** ML-optimized placement reduces latency compared to random placement
-
- **Sample Size:** 150 services × 5 regions = 750 potential placements
-
- **Metrics:**
- - Primary: Average latency (ms)
- - Secondary: Total cost ($), redundancy score, critical service latency
- - Efficiency: Number of placement pairs (fewer = more efficient)
-
- **Test Type:** Independent samples t-test
- - Null hypothesis (H₀): μ_control = μ_treatment
- - Alternative hypothesis (H₁): μ_control ≠ μ_treatment
- - Significance level: α = 0.05
-
- **Result:** Reject H₀ (p < 0.001)
- - The ML-optimized placement significantly reduces latency
-
- ## 💡 Key Insights
-
- 1. **Latency-critical services benefit most** from optimized placement (9.3% improvement vs 5.25% average)
- 2. **Traffic patterns drive decisions** - high-traffic services benefit from multi-region placement
- 3. **Regional cost differences matter** - avoiding expensive regions saves 4.92% without sacrificing latency
- 4. **Placement efficiency improves** - ML uses 16% fewer placement pairs while reducing latency
- 5. **Statistical rigor matters** - The improvement is not due to chance (p < 0.001)
-
- ## 🚀 Future Enhancements
-
- ### Short-term
- - [ ] Add notebook with exploratory data analysis
- - [ ] Include feature importance visualizations
- - [ ] Create prediction API endpoint
-
- ### Medium-term
- - [ ] Integrate real AWS CloudWatch metrics
- - [ ] Add model retraining pipeline
- - [ ] Implement automated alerting
- - [ ] Support multi-cloud scenarios (GCP, Azure)
-
- ### Long-term
- - [ ] Deploy as microservice recommendation engine
- - [ ] Build feedback loop for model improvement
- - [ ] Create cost optimization module
- - [ ] Add capacity planning features
-
- ## 📚 Learning Resources
-
- This project demonstrates:
- - ✅ SQL data querying and aggregation
- - ✅ Python data manipulation (Pandas, NumPy)
- - ✅ Machine learning model training (scikit-learn, XGBoost)
- - ✅ Feature engineering and preprocessing
- - ✅ Statistical hypothesis testing
- - ✅ A/B testing methodology
- - ✅ Data visualization (Plotly, Streamlit)
- - ✅ Full-stack ML deployment
-
- ## 📝 License
-
- This project is open source and available under the MIT License.
-
- ## 👤 Author
-
- Built as a portfolio project demonstrating ML engineering capabilities for cloud infrastructure optimization.
-
- ---
-
- **Questions or feedback?** Open an issue or reach out!
-
- **Live Dashboard:** [Hugging Face Spaces](https://huggingface.co/spaces/aankitdas/resource-optimization-ml)
- **GitHub:** [resource-optimization-ml](https://github.com/aankitdas/resource-optimization-ml)
 ---
 title: Resource Optimization ML Pipeline
+ emoji:
 colorFrom: blue
 colorTo: green
 sdk: docker
 app_file: app.py
 pinned: false
 ---

+ # Resource Optimization ML Pipeline

+ A data-driven approach to optimizing service placement across cloud regions, reducing latency and infrastructure costs through machine learning.

+ **Live Dashboard:** https://huggingface.co/spaces/aankitdas/resource-optimization-ml

+ ## Problem

+ When Amazon scales infrastructure globally across multiple AWS regions, teams face a critical decision: which services should run in which regions?

+ The naive approach (random placement) is inefficient:
+ - Services get placed in expensive regions unnecessarily
+ - Cross-region communication adds latency
+ - Resources are over-provisioned to ensure redundancy
+ - There is no data-driven strategy for placement decisions

+ This project tackles that problem: **given service characteristics and regional latency patterns, can we predict optimal placement that reduces latency and costs?**

+ ## Solution

+ I built an ML-powered recommendation system that:

+ 1. **Analyzes service characteristics** - memory, CPU, traffic volume, latency sensitivity
+ 2. **Models regional latency** - how long it takes to communicate between regions
+ 3. **Predicts placement impact** - what happens to latency if we place a service in region X vs region Y
+ 4. **Compares strategies** - random placement vs ML-optimized placement through A/B testing

+ ## Results

+ The ML-optimized strategy outperforms random placement:

+ - **5.25% latency reduction** - services respond faster to users
+ - **4.92% cost savings** - expensive regions avoided where possible
+ - **9.30% improvement for critical services** - latency-sensitive workloads benefit most
+ - **Statistical significance** - improvements are not due to chance (p < 0.001)
+ - **16% fewer placements** - more efficient resource usage

+ ## Technical Approach

+ ### Data Pipeline
+ - Generated 150 synthetic services with realistic attributes
+ - Created 1.6M+ traffic records across 5 regions over 90 days
+ - Modeled cross-region latency patterns based on real AWS geography
+ - Stored everything in SQLite for easy SQL querying
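
The last step above can be sketched in a few lines of pandas. This is an illustrative sketch, not the actual `setup_database.py`: the `load_tables` helper and the table names are assumptions for the example.

```python
import sqlite3

import pandas as pd


def load_tables(frames: dict, db_path: str = ":memory:") -> sqlite3.Connection:
    """Write each DataFrame to a SQLite table and return the open connection."""
    conn = sqlite3.connect(db_path)
    for table, df in frames.items():
        # if_exists="replace" makes reruns of the pipeline idempotent
        df.to_sql(table, conn, if_exists="replace", index=False)
    return conn


# Toy frame for illustration; the real pipeline would pass
# pd.read_csv("data/services.csv") and the other generated CSVs.
conn = load_tables({"services": pd.DataFrame({"service_id": [1, 2], "memory_gb": [4, 8]})})
```

Once loaded, any exploration is a plain SQL query, e.g. `conn.execute("SELECT COUNT(*) FROM services")`.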

+ ### Machine Learning

+ **Model 1: Latency Prediction (XGBoost Regressor)**
+ - Predicts service latency given placement characteristics
+ - Input: service memory/CPU, traffic patterns, outbound latency, dependencies
+ - Output: expected latency in milliseconds
+ - Performance: RMSE = 28.7 ms
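
The training step follows the standard fit/predict pattern. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in (`xgboost.XGBRegressor` in the real pipeline exposes the same API); the synthetic features and target are invented for the example:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature table (names are assumptions)
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(1, 64, n),    # memory_gb
    rng.integers(1, 16, n),   # cpu_cores
    rng.uniform(10, 90, n),   # avg_outbound_latency_ms
    rng.uniform(0, 1, n),     # request_variability (coefficient of variation)
])
# Toy latency target: mostly driven by outbound latency and variability
y = 50 + 0.8 * X[:, 2] + 30 * X[:, 3] + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
```

Evaluating with RMSE on a held-out split, as above, is how the 28.7 ms figure would be produced on the project's real features.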

+ **Model 2: Placement Strategy (Random Forest Classifier)**
+ - Determines whether a service should be single-region or multi-region
+ - Input: traffic volume, dependencies, resource requirements
+ - Output: optimal placement strategy
+ - Performance: 100% accuracy on test set
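
A minimal sketch of the classification side, with toy features and a made-up labeling rule (the real labels come from the generated placement data, not this heuristic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
traffic = rng.lognormal(8, 1, n)        # average requests/hour
dependencies = rng.integers(0, 10, n)   # number of downstream services
memory_gb = rng.uniform(1, 64, n)
X = np.column_stack([traffic, dependencies, memory_gb])
# Toy rule for the example: high-traffic services go multi-region (label 1)
y = (traffic > np.median(traffic)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
recommendation = clf.predict(X[:1])  # 0 = single-region, 1 = multi-region
```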

+ ### A/B Testing

+ To validate the ML approach:
+ - **Control**: randomly place services across 2-4 regions
+ - **Treatment**: use ML models to recommend optimal placement
+ - **Test**: independent t-test on latency samples (t = 7.02, p < 0.001)
+ - **Conclusion**: the ML strategy is better, and the difference is statistically significant
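
The significance test itself is one SciPy call. The sketch below uses synthetic latency samples with roughly the reported 5.25% mean gap (the real `ab_test_simulation.py` draws its samples from the simulated placements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(200.0, 30.0, 150)    # latencies (ms), random placement
treatment = rng.normal(189.5, 30.0, 150)  # latencies (ms), ML-optimized, ~5% lower mean

# Independent two-sample t-test: H0 is that the two means are equal
t_stat, p_value = stats.ttest_ind(control, treatment)
significant = p_value < 0.05  # alpha = 0.05
```

With the project's real samples this test yields t = 7.02 and p < 0.001, i.e. H₀ is rejected.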

+ ## How to Use the Dashboard

+ **Overview** - See service distribution across memory tiers and latency sensitivity, plus the top services by traffic volume.

+ **A/B Test Results** - The core finding: a side-by-side comparison of random vs ML-optimized placement with metrics and statistical test results.

+ **Regional Analysis** - Latency heatmap showing communication costs between regions. Higher-latency regions are avoided when possible.

+ ## Project Structure

 ```
+ ├── data_generation.py # Generate synthetic services, traffic, latency data
+ ├── setup_database.py # Load CSVs into SQLite
+ ├── train_models.py # Train XGBoost and Random Forest models
+ ├── ab_test_simulation.py # Run A/B test and save results
+ ├── app.py # Streamlit dashboard
+ ├── results/
+ │ └── ab_test_results.json # A/B test metrics and statistics
+ └── requirements.txt # Python dependencies
 ```

+ ## Technology Stack

+ - **Data Processing**: Python, Pandas, NumPy, SQLite
+ - **Machine Learning**: scikit-learn, XGBoost
+ - **Statistics**: SciPy (hypothesis testing)
+ - **Visualization**: Plotly, Streamlit
+ - **Deployment**: Docker, Hugging Face Spaces, GitHub Actions

+ ## Key Insights

+ 1. **Traffic patterns matter most** - services with high, variable traffic benefit most from multi-region placement
+ 2. **Latency-critical services are placement-sensitive** - a few milliseconds of additional latency can degrade user experience for these workloads
+ 3. **Regional cost differences are significant** - some regions are 80% more expensive than others, and the ML strategy avoids them when latency permits
+ 4. **Efficiency and performance can both improve** - the ML strategy uses fewer total placements while reducing latency
+ 5. **Statistical rigor matters** - raw improvements mean nothing without significance testing

+ ## Running Locally

+ ```bash
+ # Generate data
+ python data_generation.py

+ # Setup database
+ python setup_database.py

+ # Train models
+ python train_models.py

+ # Run A/B test
+ python ab_test_simulation.py

+ # Launch dashboard
+ streamlit run app.py
+ ```

+ ## What This Demonstrates

+ - SQL data analysis and aggregation
+ - Python data manipulation and feature engineering
+ - Machine learning model training and evaluation
+ - Statistical hypothesis testing and A/B testing methodology
+ - End-to-end data product development (from data to dashboard)
+ - Production deployment with Docker and GitHub Actions

+ ## Repository

+ https://github.com/aankitdas/resource-optimization-ml