---
title: Resource Optimization ML Pipeline
emoji: πŸ”₯
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: latest
app_file: app.py
pinned: false
---
# Resource Optimization ML Pipeline
A data-driven approach to optimizing service placement across cloud regions, reducing latency and infrastructure costs through machine learning.
**Live Dashboard:** https://huggingface.co/spaces/aankitdas/resource-optimization-ml
## Problem
When Amazon scales infrastructure globally across multiple AWS regions, teams face a critical decision: which services should run in which regions?
The naive approach (random placement) is inefficient:
- Services get placed in expensive regions unnecessarily
- Cross-region communication adds latency
- Resources are over-provisioned to ensure redundancy
- Placement decisions lack a data-driven strategy
This project tackles that problem: **given service characteristics and regional latency patterns, can we predict optimal placement that reduces latency and costs?**
## Solution
I built an ML-powered recommendation system that:
1. **Analyzes service characteristics** - memory, CPU, traffic volume, latency sensitivity
2. **Models regional latency** - how long it takes to communicate between regions
3. **Predicts placement impact** - what happens to latency if we place a service in region X vs Y
4. **Compares strategies** - random placement vs ML-optimized placement through A/B testing
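The core recommendation step can be sketched as a weighted latency/cost score over candidate regions. This is a minimal illustration, not the repo's actual logic: the function name, weight, and example values are all hypothetical.

```python
def recommend_region(pred_latency_ms: dict, cost_index: dict,
                     latency_weight: float = 0.7) -> str:
    """Pick the region minimizing a weighted latency/cost score (illustrative).

    pred_latency_ms: region -> predicted service latency in ms
    cost_index: region -> relative infrastructure cost (1.0 = cheapest)
    """
    def score(region: str) -> float:
        # Blend latency and cost; scale cost so both terms are comparable
        return (latency_weight * pred_latency_ms[region]
                + (1 - latency_weight) * 100 * cost_index[region])
    return min(pred_latency_ms, key=score)

# Toy inputs: eu-west-1 trades slightly higher cost for much lower latency
pred = {"us-east-1": 120.0, "eu-west-1": 95.0, "ap-southeast-1": 180.0}
cost = {"us-east-1": 1.0, "eu-west-1": 1.3, "ap-southeast-1": 1.8}
best = recommend_region(pred, cost)  # "eu-west-1"
```

Raising `latency_weight` toward 1.0 recovers a pure latency-minimizing policy; lowering it favors cheaper regions.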
## Results
The ML-optimized strategy outperforms random placement:
- **5.25% latency reduction** - services respond faster to users
- **4.92% cost savings** - avoided expensive regions where possible
- **9.30% improvement for critical services** - latency-sensitive workloads benefit most
- **Statistical significance** - improvements are unlikely to be due to chance (p < 0.001)
- **16% fewer placements** - more efficient resource usage
## Technical Approach
### Data Pipeline
- Generated 150 synthetic services with realistic attributes
- Created 1.6M+ traffic records across 5 regions over 90 days
- Modeled cross-region latency patterns based on real AWS geography
- Stored everything in SQLite for easy SQL querying
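The pipeline above can be sketched as follows. Column names, the region list, and the attribute distributions here are illustrative stand-ins, not the repo's actual schema; an in-memory SQLite database replaces the on-disk one for brevity.

```python
import sqlite3

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
REGIONS = ["us-east-1", "us-west-2", "eu-west-1",
           "ap-southeast-1", "ap-northeast-1"]

# Synthetic service catalog (attribute names are hypothetical)
services = pd.DataFrame({
    "service_id": range(150),
    "memory_gb": rng.choice([2, 4, 8, 16, 32], size=150),
    "cpu_cores": rng.choice([1, 2, 4, 8], size=150),
    "latency_sensitive": rng.random(150) < 0.3,
})

# Cross-region latency pairs, loosely stand-in for geography-based values
latency = pd.DataFrame(
    [(a, b, 1.0 if a == b else float(rng.uniform(20, 250)))
     for a in REGIONS for b in REGIONS],
    columns=["src_region", "dst_region", "latency_ms"],
)

conn = sqlite3.connect(":memory:")
services.to_sql("services", conn, index=False)
latency.to_sql("region_latency", conn, index=False)

# SQLite makes downstream analysis a plain SQL query
n_services = conn.execute("SELECT COUNT(*) FROM services").fetchone()[0]
```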
### Machine Learning
**Model 1: Latency Prediction (XGBoost Regressor)**
- Predicts service latency given placement characteristics
- Input: service memory/CPU, traffic patterns, outbound latency, dependencies
- Output: expected latency in milliseconds
- Performance: RMSE=28.7ms
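A regressor of this shape can be sketched as below. To keep the example dependency-light it uses scikit-learn's `GradientBoostingRegressor` as a stand-in; the repo's `xgboost.XGBRegressor` exposes the same `fit`/`predict` interface. Features and the toy target are fabricated for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical feature matrix: memory, CPU, traffic,
# mean outbound latency, dependency count
X = np.column_stack([
    rng.choice([2, 4, 8, 16], n),   # memory_gb
    rng.choice([1, 2, 4, 8], n),    # cpu_cores
    rng.lognormal(8, 1, n),         # requests_per_day
    rng.uniform(20, 250, n),        # outbound_latency_ms
    rng.integers(0, 6, n),          # num_dependencies
])
# Toy target: latency dominated by outbound latency, plus noise
y = 0.8 * X[:, 3] + 0.002 * X[:, 4] * X[:, 3] + rng.normal(0, 10, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
preds = model.predict(X_te)
rmse = mean_squared_error(y_te, preds) ** 0.5  # in the same ms units as y
```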
**Model 2: Placement Strategy (Random Forest Classifier)**
- Determines if a service should be single-region or multi-region
- Input: traffic volume, dependencies, resource requirements
- Output: optimal placement strategy
- Performance: 100% accuracy on the test set (plausible given the rule-based labels in the synthetic data)
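The classifier can be sketched as below, with fabricated features and a simple rule-based label for illustration; a deterministic labeling rule like this is one reason near-perfect test accuracy is achievable on synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 600

# Hypothetical features: traffic volume, dependency count, memory
traffic = rng.lognormal(8, 1.2, n)
deps = rng.integers(0, 8, n)
memory = rng.choice([2, 4, 8, 16, 32], n)
X = np.column_stack([traffic, deps, memory])

# Rule-based label: high-traffic, dependency-heavy services go multi-region
y = ((traffic > np.median(traffic)) & (deps >= 3)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # fraction of correct single- vs multi-region calls
```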
### A/B Testing
To validate the ML approach:
- **Control**: randomly place services across 2-4 regions
- **Treatment**: use ML models to recommend optimal placement
- **Test**: independent t-test on latency samples (t=7.02, p<0.001)
- **Conclusion**: ML strategy is statistically significantly better
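The significance test above amounts to a two-sample t-test on latency draws from each arm. The sketch below uses simulated normal samples whose ~5% gap mirrors the reported improvement; the sample sizes, means, and standard deviations are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated per-request latency samples (ms) for each arm
control = rng.normal(loc=200, scale=30, size=500)    # random placement
treatment = rng.normal(loc=190, scale=30, size=500)  # ML-optimized placement

# Welch's t-test: does not assume equal variances across arms
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Relative latency reduction of treatment over control
improvement = 1 - treatment.mean() / control.mean()
```

A negative `t_stat` with a small `p_value` indicates the treatment arm's latency is significantly lower than the control's.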
## How to Use the Dashboard
**Overview** - Service distribution across memory tiers and latency sensitivity, plus the top services by traffic volume.
**A/B Test Results** - The core finding. Side-by-side comparison of random vs ML-optimized placement with metrics and statistical test results.
**Regional Analysis** - Latency heatmap showing communication costs between regions. Higher latency regions are avoided when possible.
## Project Structure
```
β”œβ”€β”€ data_generation.py # Generate synthetic services, traffic, latency data
β”œβ”€β”€ setup_database.py # Load CSVs into SQLite
β”œβ”€β”€ train_models.py # Train XGBoost and Random Forest models
β”œβ”€β”€ ab_test_simulation.py # Run A/B test and save results
β”œβ”€β”€ app.py # Streamlit dashboard
β”œβ”€β”€ results/
β”‚   └── ab_test_results.json # A/B test metrics and statistics
└── requirements.txt # Python dependencies
```
## Technology Stack
- **Data Processing**: Python, Pandas, NumPy, SQLite
- **Machine Learning**: scikit-learn, XGBoost
- **Statistics**: SciPy (hypothesis testing)
- **Visualization**: Plotly, Streamlit
- **Deployment**: Docker, Hugging Face Spaces, GitHub Actions
## Key Insights
1. **Traffic patterns matter most** - Services with high, variable traffic benefit most from multi-region placement
2. **Latency-critical services are placement-sensitive** - A few milliseconds of additional latency can degrade user experience for these workloads
3. **Regional cost differences are significant** - Some regions are 80% more expensive than others. ML avoids them when latency permits
4. **Efficiency and performance can both improve** - ML uses fewer total placements while reducing latency
5. **Statistical rigor matters** - Raw improvements mean nothing without significance testing
## Running Locally
```bash
# Generate data
python data_generation.py
# Setup database
python setup_database.py
# Train models
python train_models.py
# Run A/B test
python ab_test_simulation.py
# Launch dashboard
streamlit run app.py
```
## What This Demonstrates
- SQL data analysis and aggregation
- Python data manipulation and feature engineering
- Machine learning model training and evaluation
- Statistical hypothesis testing and A/B testing methodology
- End-to-end data product development (from data to dashboard)
- Production deployment with Docker and GitHub Actions
## Repository
https://github.com/aankitdas/resource-optimization-ml