---
title: Resource Optimization ML Pipeline
emoji: πŸ”₯
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: latest
app_file: app.py
pinned: false
---

Resource Optimization ML Pipeline

A data-driven approach to optimizing service placement across cloud regions, reducing latency and infrastructure costs through machine learning.

Live Dashboard: https://huggingface.co/spaces/aankitdas/resource-optimization-ml

Problem

When Amazon scales infrastructure globally across multiple AWS regions, teams face a critical decision: which services should run in which regions?

The naive approach (random placement) is inefficient:

  • Services get placed in expensive regions unnecessarily
  • Cross-region communication adds latency
  • Resources are over-provisioned to ensure redundancy
  • There is no data-driven strategy for placement decisions

This project tackles that problem: given service characteristics and regional latency patterns, can we predict optimal placement that reduces latency and costs?

Solution

I built an ML-powered recommendation system that:

  1. Analyzes service characteristics - memory, CPU, traffic volume, latency sensitivity
  2. Models regional latency - how long it takes to communicate between regions
  3. Predicts placement impact - what happens to latency if we place a service in region X vs Y
  4. Compares strategies - random placement vs ML-optimized placement through A/B testing
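Step 3 above can be sketched as: given a latency model, score each candidate region and recommend the one with the lowest predicted latency. This is a minimal illustration, not the project's actual API; `predict_latency_ms` is a stub standing in for the trained model, and the region latencies are made up.

```python
def predict_latency_ms(service: dict, region: str) -> float:
    # Stub: the real model would use memory, CPU, traffic patterns,
    # and outbound latency rather than a fixed per-region base.
    base = {"us-east-1": 40.0, "eu-west-1": 55.0, "ap-south-1": 70.0}[region]
    return base + 0.01 * service["traffic_rps"]

def recommend_region(service: dict, regions: list[str]) -> str:
    """Return the candidate region with the lowest predicted latency."""
    return min(regions, key=lambda r: predict_latency_ms(service, r))

svc = {"traffic_rps": 1200}
best = recommend_region(svc, ["us-east-1", "eu-west-1", "ap-south-1"])
# → "us-east-1" (lowest base latency under this stub)
```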

Results

The ML-optimized strategy outperforms random placement:

  • 5.25% latency reduction - services respond faster to users
  • 4.92% cost savings - avoided expensive regions where possible
  • 9.30% improvement for critical services - latency-sensitive workloads benefit most
  • Statistical significance - improvements are not due to chance (p < 0.001)
  • 16% fewer placements - more efficient resource usage

Technical Approach

Data Pipeline

  • Generated 150 synthetic services with realistic attributes
  • Created 1.6M+ traffic records across 5 regions over 90 days
  • Modeled cross-region latency patterns based on real AWS geography
  • Stored everything in SQLite for easy SQL querying
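The load step can be sketched with pandas and the stdlib `sqlite3` module: write a DataFrame into a SQLite table, then query it with plain SQL. Table and column names here are illustrative, not the project's actual schema, and an in-memory database stands in for the persisted `.db` file.

```python
import sqlite3
import pandas as pd

# Illustrative services table (the real pipeline loads generated CSVs).
services = pd.DataFrame({
    "service_id": ["svc-001", "svc-002"],
    "memory_gb": [4, 16],
    "latency_sensitive": [True, False],
})

conn = sqlite3.connect(":memory:")  # the project persists to a .db file
services.to_sql("services", conn, index=False, if_exists="replace")

# Example aggregation: count latency-sensitive services via SQL.
n_sensitive = conn.execute(
    "SELECT COUNT(*) FROM services WHERE latency_sensitive = 1"
).fetchone()[0]
```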

Machine Learning

Model 1: Latency Prediction (XGBoost Regressor)

  • Predicts service latency given placement characteristics
  • Input: service memory/CPU, traffic patterns, outbound latency, dependencies
  • Output: expected latency in milliseconds
  • Performance: RMSE=28.7ms

Model 2: Placement Strategy (Random Forest Classifier)

  • Determines if a service should be single-region or multi-region
  • Input: traffic volume, dependencies, resource requirements
  • Output: optimal placement strategy
  • Performance: 100% accuracy on test set
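The strategy classifier can be sketched the same way with scikit-learn. The labeling rule below (high-traffic services go multi-region) is an illustrative simplification of what the real model learns from traffic, dependencies, and resource requirements.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Stand-in features: traffic volume, dependency count, memory, CPU.
X = rng.normal(size=(400, 4))
# Illustrative rule: label 1 = multi-region, 0 = single-region.
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```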

A/B Testing

To validate the ML approach:

  • Control: randomly place services across 2-4 regions
  • Treatment: use ML models to recommend optimal placement
  • Test: independent t-test on latency samples (t=7.02, p<0.001)
  • Conclusion: ML strategy is statistically significantly better

How to Use the Dashboard

Overview - Service distribution across memory tiers and latency sensitivity, plus the top services by traffic volume.

A/B Test Results - The core finding. Side-by-side comparison of random vs ML-optimized placement with metrics and statistical test results.

Regional Analysis - Latency heatmap showing communication costs between regions. Higher latency regions are avoided when possible.

Project Structure

β”œβ”€β”€ data_generation.py         # Generate synthetic services, traffic, latency data
β”œβ”€β”€ setup_database.py          # Load CSVs into SQLite
β”œβ”€β”€ train_models.py            # Train XGBoost and Random Forest models
β”œβ”€β”€ ab_test_simulation.py      # Run A/B test and save results
β”œβ”€β”€ app.py                     # Streamlit dashboard
β”œβ”€β”€ results/
β”‚   └── ab_test_results.json   # A/B test metrics and statistics
└── requirements.txt           # Python dependencies

Technology Stack

  • Data Processing: Python, Pandas, NumPy, SQLite
  • Machine Learning: scikit-learn, XGBoost
  • Statistics: SciPy (hypothesis testing)
  • Visualization: Plotly, Streamlit
  • Deployment: Docker, Hugging Face Spaces, GitHub Actions

Key Insights

  1. Traffic patterns matter most - Services with high, variable traffic benefit most from multi-region placement

  2. Latency-critical services are placement-sensitive - A few milliseconds of additional latency can degrade user experience for these workloads

  3. Regional cost differences are significant - Some regions are 80% more expensive than others. ML avoids them when latency permits

  4. Efficiency and performance can both improve - ML uses fewer total placements while reducing latency

  5. Statistical rigor matters - Raw improvements mean nothing without significance testing

Running Locally

# Generate data
python data_generation.py

# Setup database
python setup_database.py

# Train models
python train_models.py

# Run A/B test
python ab_test_simulation.py

# Launch dashboard
streamlit run app.py

What This Demonstrates

  • SQL data analysis and aggregation
  • Python data manipulation and feature engineering
  • Machine learning model training and evaluation
  • Statistical hypothesis testing and A/B testing methodology
  • End-to-end data product development (from data to dashboard)
  • Production deployment with Docker and GitHub Actions

Repository

https://github.com/aankitdas/resource-optimization-ml