---
title: Resource Optimization ML Pipeline
emoji: πŸ”₯
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: latest
app_file: app.py
pinned: false
---
# Resource Optimization ML Pipeline
A data-driven approach to optimizing service placement across cloud regions, reducing latency and infrastructure costs through machine learning.
**Live Dashboard:** https://huggingface.co/spaces/aankitdas/resource-optimization-ml
## Problem
When Amazon scales infrastructure globally across multiple AWS regions, teams face a critical decision: which services should run in which regions?
The naive approach (random placement) is inefficient:
- Services get placed in expensive regions unnecessarily
- Cross-region communication adds latency
- Resources are over-provisioned to ensure redundancy
- Placement decisions lack a data-driven strategy
This project tackles that problem: **given service characteristics and regional latency patterns, can we predict optimal placement that reduces latency and costs?**
## Solution
I built an ML-powered recommendation system that:
1. **Analyzes service characteristics** - memory, CPU, traffic volume, latency sensitivity
2. **Models regional latency** - how long it takes to communicate between regions
3. **Predicts placement impact** - what happens to latency if we place a service in region X vs Y
4. **Compares strategies** - random placement vs ML-optimized placement through A/B testing
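The core recommendation step can be sketched as a weighted latency/cost score over candidate regions. This is a minimal illustration, not the repo's actual logic: the function name, weight, and example values are all hypothetical.

```python
def recommend_region(pred_latency_ms: dict, cost_index: dict,
                     latency_weight: float = 0.7) -> str:
    """Pick the region minimizing a weighted latency/cost score (illustrative).

    pred_latency_ms: region -> predicted service latency in ms
    cost_index: region -> relative infrastructure cost (1.0 = cheapest)
    """
    def score(region: str) -> float:
        # Blend latency and cost; scale cost so both terms are comparable
        return (latency_weight * pred_latency_ms[region]
                + (1 - latency_weight) * 100 * cost_index[region])
    return min(pred_latency_ms, key=score)

# Toy inputs: eu-west-1 trades slightly higher cost for much lower latency
pred = {"us-east-1": 120.0, "eu-west-1": 95.0, "ap-southeast-1": 180.0}
cost = {"us-east-1": 1.0, "eu-west-1": 1.3, "ap-southeast-1": 1.8}
best = recommend_region(pred, cost)  # "eu-west-1"
```

Raising `latency_weight` toward 1.0 recovers a pure latency-minimizing policy; lowering it favors cheaper regions.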
## Results
The ML-optimized strategy outperforms random placement:
- **5.25% latency reduction** - services respond faster to users
- **4.92% cost savings** - avoided expensive regions where possible
- **9.30% improvement for critical services** - latency-sensitive workloads benefit most
- **Statistical significance** - improvements are unlikely to be due to chance (p < 0.001)
- **16% fewer placements** - more efficient resource usage
## Technical Approach
### Data Pipeline
- Generated 150 synthetic services with realistic attributes
- Created 1.6M+ traffic records across 5 regions over 90 days
- Modeled cross-region latency patterns based on real AWS geography
- Stored everything in SQLite for easy SQL querying
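The pipeline above can be sketched as follows. Column names, the region list, and the attribute distributions here are illustrative stand-ins, not the repo's actual schema; an in-memory SQLite database replaces the on-disk one for brevity.

```python
import sqlite3

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
REGIONS = ["us-east-1", "us-west-2", "eu-west-1",
           "ap-southeast-1", "ap-northeast-1"]

# Synthetic service catalog (attribute names are hypothetical)
services = pd.DataFrame({
    "service_id": range(150),
    "memory_gb": rng.choice([2, 4, 8, 16, 32], size=150),
    "cpu_cores": rng.choice([1, 2, 4, 8], size=150),
    "latency_sensitive": rng.random(150) < 0.3,
})

# Cross-region latency pairs, loosely stand-in for geography-based values
latency = pd.DataFrame(
    [(a, b, 1.0 if a == b else float(rng.uniform(20, 250)))
     for a in REGIONS for b in REGIONS],
    columns=["src_region", "dst_region", "latency_ms"],
)

conn = sqlite3.connect(":memory:")
services.to_sql("services", conn, index=False)
latency.to_sql("region_latency", conn, index=False)

# SQLite makes downstream analysis a plain SQL query
n_services = conn.execute("SELECT COUNT(*) FROM services").fetchone()[0]
```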
### Machine Learning
**Model 1: Latency Prediction (XGBoost Regressor)**
- Predicts service latency given placement characteristics
- Input: service memory/CPU, traffic patterns, outbound latency, dependencies
- Output: expected latency in milliseconds
- Performance: RMSE=28.7ms
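A regressor of this shape can be sketched as below. To keep the example dependency-light it uses scikit-learn's `GradientBoostingRegressor` as a stand-in; the repo's `xgboost.XGBRegressor` exposes the same `fit`/`predict` interface. Features and the toy target are fabricated for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical feature matrix: memory, CPU, traffic,
# mean outbound latency, dependency count
X = np.column_stack([
    rng.choice([2, 4, 8, 16], n),   # memory_gb
    rng.choice([1, 2, 4, 8], n),    # cpu_cores
    rng.lognormal(8, 1, n),         # requests_per_day
    rng.uniform(20, 250, n),        # outbound_latency_ms
    rng.integers(0, 6, n),          # num_dependencies
])
# Toy target: latency dominated by outbound latency, plus noise
y = 0.8 * X[:, 3] + 0.002 * X[:, 4] * X[:, 3] + rng.normal(0, 10, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
preds = model.predict(X_te)
rmse = mean_squared_error(y_te, preds) ** 0.5  # in the same ms units as y
```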
**Model 2: Placement Strategy (Random Forest Classifier)**
- Determines if a service should be single-region or multi-region
- Input: traffic volume, dependencies, resource requirements
- Output: optimal placement strategy
- Performance: 100% accuracy on the test set (plausible given the rule-based labels in the synthetic data)
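The classifier can be sketched as below, with fabricated features and a simple rule-based label for illustration; a deterministic labeling rule like this is one reason near-perfect test accuracy is achievable on synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 600

# Hypothetical features: traffic volume, dependency count, memory
traffic = rng.lognormal(8, 1.2, n)
deps = rng.integers(0, 8, n)
memory = rng.choice([2, 4, 8, 16, 32], n)
X = np.column_stack([traffic, deps, memory])

# Rule-based label: high-traffic, dependency-heavy services go multi-region
y = ((traffic > np.median(traffic)) & (deps >= 3)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # fraction of correct single- vs multi-region calls
```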
### A/B Testing
To validate the ML approach:
- **Control**: randomly place services across 2-4 regions
- **Treatment**: use ML models to recommend optimal placement
- **Test**: independent t-test on latency samples (t=7.02, p<0.001)
- **Conclusion**: ML strategy is statistically significantly better
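The significance test above amounts to a two-sample t-test on latency draws from each arm. The sketch below uses simulated normal samples whose ~5% gap mirrors the reported improvement; the sample sizes, means, and standard deviations are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated per-request latency samples (ms) for each arm
control = rng.normal(loc=200, scale=30, size=500)    # random placement
treatment = rng.normal(loc=190, scale=30, size=500)  # ML-optimized placement

# Welch's t-test: does not assume equal variances across arms
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Relative latency reduction of treatment over control
improvement = 1 - treatment.mean() / control.mean()
```

A negative `t_stat` with a small `p_value` indicates the treatment arm's latency is significantly lower than the control's.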
## How to Use the Dashboard
**Overview** - Service distribution across memory tiers and latency sensitivity, plus the top services by traffic volume.
**A/B Test Results** - The core finding. Side-by-side comparison of random vs ML-optimized placement with metrics and statistical test results.
**Regional Analysis** - Latency heatmap showing communication costs between regions. Higher latency regions are avoided when possible.
## Project Structure
```
β”œβ”€β”€ data_generation.py # Generate synthetic services, traffic, latency data
β”œβ”€β”€ setup_database.py # Load CSVs into SQLite
β”œβ”€β”€ train_models.py # Train XGBoost and Random Forest models
β”œβ”€β”€ ab_test_simulation.py # Run A/B test and save results
β”œβ”€β”€ app.py # Streamlit dashboard
β”œβ”€β”€ results/
β”‚   └── ab_test_results.json # A/B test metrics and statistics
└── requirements.txt # Python dependencies
```
## Technology Stack
- **Data Processing**: Python, Pandas, NumPy, SQLite
- **Machine Learning**: scikit-learn, XGBoost
- **Statistics**: SciPy (hypothesis testing)
- **Visualization**: Plotly, Streamlit
- **Deployment**: Docker, Hugging Face Spaces, GitHub Actions
## Key Insights
1. **Traffic patterns matter most** - Services with high, variable traffic benefit most from multi-region placement
2. **Latency-critical services are placement-sensitive** - A few milliseconds of additional latency can degrade user experience for these workloads
3. **Regional cost differences are significant** - Some regions are 80% more expensive than others. ML avoids them when latency permits
4. **Efficiency and performance can both improve** - ML uses fewer total placements while reducing latency
5. **Statistical rigor matters** - Raw improvements mean nothing without significance testing
## Running Locally
```bash
# Generate data
python data_generation.py
# Setup database
python setup_database.py
# Train models
python train_models.py
# Run A/B test
python ab_test_simulation.py
# Launch dashboard
streamlit run app.py
```
## What This Demonstrates
- SQL data analysis and aggregation
- Python data manipulation and feature engineering
- Machine learning model training and evaluation
- Statistical hypothesis testing and A/B testing methodology
- End-to-end data product development (from data to dashboard)
- Production deployment with Docker and GitHub Actions
## Repository
https://github.com/aankitdas/resource-optimization-ml