---
title: Resource Optimization ML Pipeline
emoji: 🔥
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: latest
app_file: app.py
pinned: false
---
# Resource Optimization ML Pipeline

A data-driven approach to optimizing service placement across cloud regions, reducing latency and infrastructure costs through machine learning.

**Live Dashboard:** https://huggingface.co/spaces/aankitdas/resource-optimization-ml
## Problem

When Amazon scales infrastructure globally across multiple AWS regions, teams face a critical decision: which services should run in which regions?

The naive approach (random placement) is inefficient:

- Services get placed in expensive regions unnecessarily
- Cross-region communication adds latency
- Resources are over-provisioned to ensure redundancy
- There is no data-driven strategy for placement decisions

This project tackles that problem: **given service characteristics and regional latency patterns, can we predict optimal placement that reduces latency and costs?**
## Solution

I built an ML-powered recommendation system that:

1. **Analyzes service characteristics** - memory, CPU, traffic volume, latency sensitivity
2. **Models regional latency** - how long it takes to communicate between regions
3. **Predicts placement impact** - what happens to latency if we place a service in region X vs. region Y (see the sketch after this list)
4. **Compares strategies** - random placement vs. ML-optimized placement through A/B testing
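
To illustrate step 3, here is a minimal sketch of what such a recommendation loop could look like: score each candidate region with the latency model and keep the best option. The region list, feature names, and the `recommend_region` helper are hypothetical, not the repository's actual API.

```python
import pandas as pd

# Hypothetical region list; the project models 5 regions based on AWS geography.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1", "ap-south-1", "ap-northeast-1"]

def recommend_region(service_row: pd.Series, latency_model) -> str:
    """Return the region with the lowest predicted latency for one service."""
    scored = []
    for region in REGIONS:
        # Build the feature vector the latency model expects (names illustrative).
        features = pd.DataFrame([{
            "memory_gb": service_row["memory_gb"],
            "cpu_cores": service_row["cpu_cores"],
            "avg_traffic_rps": service_row["avg_traffic_rps"],
            "outbound_latency_ms": service_row[f"latency_to_{region}"],
            "num_dependencies": service_row["num_dependencies"],
        }])
        scored.append((region, float(latency_model.predict(features)[0])))
    # Lowest predicted latency wins; ties could be broken by regional cost.
    return min(scored, key=lambda pair: pair[1])[0]
```

In a loop like this, cost can enter as a tie-breaker, which is what lets an optimizer trade a few milliseconds for a cheaper region.
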
## Results

The ML-optimized strategy outperforms random placement:

- **5.25% latency reduction** - services respond faster to users
- **4.92% cost savings** - avoided expensive regions where possible
- **9.30% improvement for critical services** - latency-sensitive workloads benefit most
- **Statistical significance** - improvements are not due to chance (p < 0.001)
- **16% fewer placements** - more efficient resource usage
## Technical Approach

### Data Pipeline

- Generated 150 synthetic services with realistic attributes
- Created 1.6M+ traffic records across 5 regions over 90 days
- Modeled cross-region latency patterns based on real AWS geography
- Stored everything in SQLite for easy SQL querying
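
A condensed sketch of the generate-and-load step, assuming illustrative column names (the actual schema lives in `data_generation.py` and `setup_database.py`):

```python
import numpy as np
import pandas as pd
import sqlite3

rng = np.random.default_rng(42)
n_services = 150

# Synthetic service catalog with a few realistic attributes (columns illustrative)
services = pd.DataFrame({
    "service_id": np.arange(n_services),
    "memory_gb": rng.choice([2, 4, 8, 16, 32], size=n_services),
    "cpu_cores": rng.choice([1, 2, 4, 8], size=n_services),
    "avg_traffic_rps": rng.lognormal(mean=4.0, sigma=1.0, size=n_services).round(1),
    "latency_sensitive": rng.random(n_services) < 0.3,
})

# Persist to SQLite so the rest of the pipeline can query it with plain SQL
conn = sqlite3.connect("pipeline.db")
services.to_sql("services", conn, if_exists="replace", index=False)
conn.close()
```
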
### Machine Learning

**Model 1: Latency Prediction (XGBoost Regressor)**

- Predicts service latency given placement characteristics
- Input: service memory/CPU, traffic patterns, outbound latency, dependencies
- Output: expected latency in milliseconds
- Performance: RMSE = 28.7 ms
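
A minimal training sketch for Model 1, assuming a feature matrix `X` and observed latencies `y` have already been assembled from the SQLite tables; the hyperparameters are illustrative, not the ones used in `train_models.py`:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# X: per-service features (memory, CPU, traffic, outbound latency, dependencies)
# y: observed latency in milliseconds
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE = {rmse:.1f} ms")  # this README reports 28.7 ms on the project's data
```
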
**Model 2: Placement Strategy (Random Forest Classifier)**

- Determines if a service should be single-region or multi-region
- Input: traffic volume, dependencies, resource requirements
- Output: optimal placement strategy
- Performance: 100% accuracy on test set
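
The classifier side follows the same pattern. Again, `X` and the single- vs. multi-region labels `y` are assumed to be prepared upstream:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X: traffic volume, dependency count, resource requirements per service
# y: placement label, e.g. 1 = multi-region, 0 = single-region
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(f"accuracy = {accuracy_score(y_test, clf.predict(X_test)):.2%}")
```
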
### A/B Testing

To validate the ML approach:

- **Control**: randomly place services across 2-4 regions
- **Treatment**: use ML models to recommend optimal placement
- **Test**: independent t-test on latency samples (t = 7.02, p < 0.001)
- **Conclusion**: the ML strategy is statistically significantly better
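
The significance check itself is a one-liner with SciPy. The sketch below uses synthetic stand-in samples; in the project the latency arrays come from the A/B simulation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in latency samples (ms); the real ones come from ab_test_simulation.py
control = rng.normal(loc=120, scale=25, size=500)    # random placement
treatment = rng.normal(loc=114, scale=25, size=500)  # ML-optimized placement

t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```
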
## How to Use the Dashboard

**Overview** - Service distribution across memory tiers and latency sensitivity, plus the top services by traffic volume.

**A/B Test Results** - The core finding: a side-by-side comparison of random vs. ML-optimized placement, with metrics and statistical test results.

**Regional Analysis** - A latency heatmap showing communication costs between regions; higher-latency regions are avoided when possible.
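
For reference, a heatmap like the Regional Analysis tab's takes only a few lines of Plotly. The region names and latency figures below are made up; the dashboard reads real pairwise latencies from SQLite:

```python
import plotly.express as px

# Made-up pairwise round-trip latencies (ms); small diagonal = intra-region
regions = ["us-east-1", "us-west-2", "eu-west-1", "ap-south-1", "ap-northeast-1"]
latency_ms = [
    [1, 65, 80, 190, 160],
    [65, 1, 130, 220, 100],
    [80, 130, 1, 120, 210],
    [190, 220, 120, 1, 130],
    [160, 100, 210, 130, 1],
]

fig = px.imshow(latency_ms, x=regions, y=regions,
                labels=dict(color="latency (ms)"),
                title="Cross-region latency")
fig.show()  # inside the Streamlit app this would be st.plotly_chart(fig)
```
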
## Project Structure

```
├── data_generation.py       # Generate synthetic services, traffic, latency data
├── setup_database.py        # Load CSVs into SQLite
├── train_models.py          # Train XGBoost and Random Forest models
├── ab_test_simulation.py    # Run A/B test and save results
├── app.py                   # Streamlit dashboard
├── results/
│   └── ab_test_results.json # A/B test metrics and statistics
└── requirements.txt         # Python dependencies
```
## Technology Stack

- **Data Processing**: Python, Pandas, NumPy, SQLite
- **Machine Learning**: scikit-learn, XGBoost
- **Statistics**: SciPy (hypothesis testing)
- **Visualization**: Plotly, Streamlit
- **Deployment**: Docker, Hugging Face Spaces, GitHub Actions
## Key Insights

1. **Traffic patterns matter most** - Services with high, variable traffic benefit most from multi-region placement
2. **Latency-critical services are placement-sensitive** - A few milliseconds of additional latency can degrade user experience for these workloads
3. **Regional cost differences are significant** - Some regions are 80% more expensive than others; the ML strategy avoids them when latency requirements permit
4. **Efficiency and performance can both improve** - The ML strategy uses fewer total placements while also reducing latency
5. **Statistical rigor matters** - Raw improvements mean nothing without significance testing
## Running Locally

```bash
# Generate data
python data_generation.py

# Set up the database
python setup_database.py

# Train models
python train_models.py

# Run the A/B test
python ab_test_simulation.py

# Launch the dashboard
streamlit run app.py
```
## What This Demonstrates

- SQL data analysis and aggregation
- Python data manipulation and feature engineering
- Machine learning model training and evaluation
- Statistical hypothesis testing and A/B testing methodology
- End-to-end data product development (from data to dashboard)
- Production deployment with Docker and GitHub Actions
## Repository

https://github.com/aankitdas/resource-optimization-ml