---
title: Resource Optimization ML Pipeline
emoji: 🔥
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: latest
app_file: app.py
pinned: false
---

# Resource Optimization ML Pipeline

A data-driven approach to optimizing service placement across cloud regions, reducing latency and infrastructure costs through machine learning.

**Live Dashboard:** https://huggingface.co/spaces/aankitdas/resource-optimization-ml

## Problem

When Amazon scales infrastructure globally across multiple AWS regions, teams face a critical decision: which services should run in which regions? 

The naive approach (random placement) is inefficient:
- Services land in expensive regions unnecessarily
- Cross-region communication adds latency
- Resources are over-provisioned to ensure redundancy
- Placement decisions follow no data-driven strategy

This project tackles that problem: **given service characteristics and regional latency patterns, can we predict optimal placement that reduces latency and costs?**

## Solution

I built an ML-powered recommendation system that:

1. **Analyzes service characteristics** - memory, CPU, traffic volume, latency sensitivity
2. **Models regional latency** - how long it takes to communicate between regions
3. **Predicts placement impact** - what happens to latency if we place a service in region X vs Y
4. **Compares strategies** - random placement vs ML-optimized placement through A/B testing
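
To make step 3 concrete, here is a minimal sketch of comparing candidate regions by traffic-weighted latency. It is a hand-written baseline, not the trained models from this project, and the region names, latency values, and function names (`expected_latency`, `best_region`) are illustrative assumptions.

```python
from typing import Dict

# Illustrative latencies in milliseconds: LATENCY_MS[user_region][host_region].
# These numbers are made up for the example, not taken from the project data.
LATENCY_MS: Dict[str, Dict[str, float]] = {
    "us-east-1": {"us-east-1": 5.0, "eu-west-1": 80.0, "ap-south-1": 190.0},
    "eu-west-1": {"us-east-1": 80.0, "eu-west-1": 5.0, "ap-south-1": 120.0},
    "ap-south-1": {"us-east-1": 190.0, "eu-west-1": 120.0, "ap-south-1": 5.0},
}


def expected_latency(traffic_share: Dict[str, float], host_region: str) -> float:
    """Average latency users would see if the service ran in host_region."""
    return sum(
        share * LATENCY_MS[user_region][host_region]
        for user_region, share in traffic_share.items()
    )


def best_region(traffic_share: Dict[str, float]) -> str:
    """Pick the hosting region with the lowest traffic-weighted latency."""
    return min(LATENCY_MS, key=lambda host: expected_latency(traffic_share, host))


# Hypothetical service whose users are mostly in Europe.
traffic = {"us-east-1": 0.2, "eu-west-1": 0.7, "ap-south-1": 0.1}
for region in LATENCY_MS:
    print(region, expected_latency(traffic, region))
print("best:", best_region(traffic))
```

A learned model can replace `expected_latency` with a prediction that also accounts for service resources and dependencies; the comparison across regions stays the same.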

## Results

The ML-optimized strategy outperforms random placement:

- **5.25% latency reduction** - services respond faster to users
- **4.92% cost savings** - avoided expensive regions where possible
- **9.30% improvement for critical services** - latency-sensitive workloads benefit most
- **Statistical significance** - improvements are unlikely to be due to chance (p < 0.001)
- **16% fewer placements** - more efficient resource usage

## Technical Approach

### Data Pipeline
- Generated 150 synthetic services with realistic attributes
- Created 1.6M+ traffic records across 5 regions over 90 days
- Modeled cross-region latency patterns based on real AWS geography
- Stored everything in SQLite for easy SQL querying
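
As a rough illustration of this pipeline, the sketch below generates a small synthetic dataset and loads it into an in-memory SQLite database. The table layout, column names, region names, and value ranges are assumptions made for the example; the actual schema lives in `data_generation.py` and `setup_database.py`.

```python
import random
import sqlite3

# Assumed region names; the project uses 5 regions but does not list them here.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1", "ap-south-1", "ap-northeast-1"]


def build_database(n_services: int = 150, seed: int = 42) -> sqlite3.Connection:
    """Create an in-memory SQLite database with synthetic services and traffic."""
    rng = random.Random(seed)  # seeded so the data is reproducible
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE services ("
        "service_id INTEGER PRIMARY KEY, memory_mb INTEGER, "
        "cpu_cores INTEGER, latency_sensitive INTEGER)"
    )
    conn.execute(
        "CREATE TABLE traffic (service_id INTEGER, region TEXT, requests INTEGER)"
    )
    services = [
        (i, rng.choice([512, 1024, 2048, 4096]), rng.randint(1, 8), rng.randint(0, 1))
        for i in range(1, n_services + 1)
    ]
    conn.executemany("INSERT INTO services VALUES (?, ?, ?, ?)", services)
    # One aggregated traffic row per service and region (hypothetical granularity).
    traffic_rows = [
        (service[0], region, rng.randint(100, 10_000))
        for service in services
        for region in REGIONS
    ]
    conn.executemany("INSERT INTO traffic VALUES (?, ?, ?)", traffic_rows)
    conn.commit()
    return conn


conn = build_database()
top_services = conn.execute(
    "SELECT service_id, SUM(requests) AS total FROM traffic "
    "GROUP BY service_id ORDER BY total DESC LIMIT 3"
).fetchall()
print(top_services)
```

Keeping the data in SQLite means aggregations such as "top services by traffic" are a single `GROUP BY` query rather than custom Python code.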

### Machine Learning

**Model 1: Latency Prediction (XGBoost Regressor)**
- Predicts service latency given placement characteristics
- Input: service memory/CPU, traffic patterns, outbound latency, dependencies
- Output: expected latency in milliseconds
- Performance: RMSE=28.7ms
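
For reference, RMSE is the square root of the mean squared difference between observed and predicted latency, so it is expressed in the same unit (milliseconds) as the target. The sketch below computes it with the standard library; the sample values are invented for illustration and are unrelated to the reported 28.7ms.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observed vs predicted latencies in milliseconds.
observed_ms = [100.0, 150.0, 200.0]
predicted_ms = [110.0, 140.0, 230.0]
print(f"RMSE: {rmse(observed_ms, predicted_ms):.2f} ms")
```

Because errors are squared before averaging, one large miss (30 ms here) contributes more to the score than several small ones.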

**Model 2: Placement Strategy (Random Forest Classifier)**
- Determines if a service should be single-region or multi-region
- Input: traffic volume, dependencies, resource requirements
- Output: optimal placement strategy
- Performance: 100% accuracy on test set

### A/B Testing

To validate the ML approach:
- **Control**: randomly place services across 2-4 regions
- **Treatment**: use ML models to recommend optimal placement
- **Test**: independent two-sample t-test on latency samples (t=7.02, p<0.001)
- **Conclusion**: the latency reduction from the ML strategy is statistically significant
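
The test statistic can be sketched with the standard library as below. This is an illustration of the pooled-variance two-sample t-test, not the project's code: the project uses SciPy, where `scipy.stats.ttest_ind` returns the exact p-value, whereas this sketch approximates the p-value with a normal distribution, which is only reasonable for large samples. The latency samples are invented.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def pooled_t_test(control: Sequence[float], treatment: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, approximate two-sided p-value) for two samples."""
    n1, n2 = len(control), len(treatment)
    # Pooled sample variance, assuming both groups share the same variance.
    pooled_var = (
        (n1 - 1) * variance(control) + (n2 - 1) * variance(treatment)
    ) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean(control) - mean(treatment)) / standard_error
    # Normal approximation to the t distribution (an assumption of this sketch).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Hypothetical latency samples in milliseconds.
random_placement = [108.0, 110.0, 112.0, 109.0, 111.0] * 20
ml_placement = [100.0, 102.0, 104.0, 101.0, 103.0] * 20

t_stat, p_value = pooled_t_test(random_placement, ml_placement)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

A positive t statistic here means the control group (random placement) has higher mean latency than the treatment group.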

## How to Use the Dashboard

**Overview** - Shows service distribution across memory tiers and latency sensitivity, plus the top services by traffic volume.

**A/B Test Results** - The core finding. Side-by-side comparison of random vs ML-optimized placement with metrics and statistical test results.

**Regional Analysis** - Latency heatmap showing communication costs between regions. Higher-latency regions are avoided when possible.

## Project Structure

```
├── data_generation.py         # Generate synthetic services, traffic, latency data
├── setup_database.py          # Load CSVs into SQLite
├── train_models.py            # Train XGBoost and Random Forest models
├── ab_test_simulation.py      # Run A/B test and save results
├── app.py                     # Streamlit dashboard
├── results/
│   └── ab_test_results.json   # A/B test metrics and statistics
└── requirements.txt           # Python dependencies
```

## Technology Stack

- **Data Processing**: Python, Pandas, NumPy, SQLite
- **Machine Learning**: scikit-learn, XGBoost
- **Statistics**: SciPy (hypothesis testing)
- **Visualization**: Plotly, Streamlit
- **Deployment**: Docker, Hugging Face Spaces, GitHub Actions

## Key Insights

1. **Traffic patterns matter most** - Services with high, variable traffic benefit most from multi-region placement

2. **Latency-critical services are placement-sensitive** - A few milliseconds of additional latency can degrade user experience for these workloads

3. **Regional cost differences are significant** - Some regions are 80% more expensive than others. ML avoids them when latency permits

4. **Efficiency and performance can both improve** - ML uses fewer total placements while reducing latency

5. **Statistical rigor matters** - Raw improvements mean nothing without significance testing

## Running Locally

```bash
# Generate data
python data_generation.py

# Setup database
python setup_database.py

# Train models
python train_models.py

# Run A/B test
python ab_test_simulation.py

# Launch dashboard
streamlit run app.py
```

## What This Demonstrates

- SQL data analysis and aggregation
- Python data manipulation and feature engineering
- Machine learning model training and evaluation
- Statistical hypothesis testing and A/B testing methodology
- End-to-end data product development (from data to dashboard)
- Production deployment with Docker and GitHub Actions

## Repository

https://github.com/aankitdas/resource-optimization-ml