aankitdas committed on
Commit
d6ba1de
·
1 Parent(s): 77dd0b0

updated readme

Files changed (1)
  1. README.md +95 -223
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 title: Resource Optimization ML Pipeline
- emoji: 🚀
 colorFrom: blue
 colorTo: green
 sdk: docker
@@ -8,271 +8,143 @@ sdk_version: latest
 app_file: app.py
 pinned: false
 ---
- # 🚀 Resource Optimization ML Pipeline
-
- An end-to-end machine learning solution for optimizing service placement across AWS regions, reducing latency and costs while maintaining reliability.
-
- **Live Dashboard:** [View on Hugging Face Spaces](https://huggingface.co/spaces/aankitdas/resource-optimization-ml)
-
- ## 📊 Project Overview
-
- This project demonstrates a complete ML pipeline inspired by Amazon's Region Flexibility Engineering team challenges:
-
- - **Problem:** Optimize service placement across 5 AWS regions to reduce latency and costs
- - **Solution:** ML-driven placement strategy with A/B testing validation
- - **Results:** 5.25% latency reduction, 4.92% cost savings, statistically significant (p < 0.001)
-
- ## 🎯 Key Results
-
- | Metric | Result |
- |--------|--------|
- | Latency Reduction | **5.25%** ✅ |
- | Cost Savings | **4.92%** ✅ |
- | Critical Service Improvement | **9.30%** ✅ |
- | Statistical Significance | **p < 0.001** ✅ |
- | Placement Efficiency | **378 vs 452 pairs** (-16%) |
-
- ## 🛠️ Architecture
-
- ### Data Pipeline
- - **150+ services** with metadata (memory, CPU, latency sensitivity)
- - **1.6M+ traffic records** across 5 AWS regions
- - **30K+ placement records** with latency and error rates
- - **Regional latency matrix** for cross-region communication costs
-
- ### ML Models
-
- #### Model 1: Latency Prediction (XGBoost Regression)
- - Predicts service latency for a given placement
- - **Features:** Memory, CPU cores, traffic patterns, outbound latency, service dependencies
- - **Performance:** RMSE = 28.7ms, MAE = 24.67ms
- - **Top Features:** Request variability, outbound latency, average traffic
-
- #### Model 2: Placement Strategy (Random Forest Classifier)
- - Classifies services for optimal regional distribution
- - **Features:** Traffic volume, dependencies, latency sensitivity, resource requirements
- - **Performance:** 100% accuracy on test set
-
- ### A/B Testing Framework
- - **Control:** Random service placement (baseline)
- - **Treatment:** ML-optimized placement using model predictions
- - **Statistical Test:** Independent t-test (t=7.02, p<0.001)
- - **Result:** Statistically significant improvement ✅
-
- ## 📁 Project Structure
-
- ```
- resource-optimization-ml/
- ├── data/ # Generated datasets
- │ ├── services.csv # Service metadata
- │ ├── regional_latency.csv # Cross-region latency
- │ ├── traffic_patterns.csv # Hourly traffic by service/region
- │ └── service_placement.csv # Historical placements
- │
- ├── models/ # Trained ML models
- │ ├── xgboost_latency_model.pkl # Latency prediction model
- │ ├── random_forest_placement_model.pkl # Placement strategy model
- │ ├── scaler_latency.pkl # Feature scaler
- │ ├── scaler_classification.pkl # Feature scaler
- │ └── feature_importance_*.csv # Feature importance analysis
- │
- ├── results/ # A/B test results
- │ ├── ab_test_results.json # Statistical comparison
- │ ├── control_placement.csv # Control group placements
- │ └── treatment_placement.csv # Treatment group placements
- │
- ├── notebooks/ # Analysis notebooks (optional)
- │
- ├── data_generation.py # Generate synthetic dataset
- ├── setup_database.py # Load data into SQLite
- ├── explore_data.py # Data exploration and SQL queries
- ├── train_models.py # Train ML models
- ├── ab_test_simulation.py # Run A/B test simulation
- ├── app.py # Streamlit dashboard
- ├── requirements.txt # Python dependencies
- ├── README.md # This file
- └── .gitignore
- ```
-
- ## 🚀 Quick Start
-
- ### Local Development
-
- 1. **Clone the repository**
- ```bash
- git clone https://github.com/YOUR_USERNAME/resource-optimization-ml.git
- cd resource-optimization-ml
- ```
-
- 2. **Install dependencies** (using uv or pip)
- ```bash
- uv pip install -r requirements.txt
- ```
-
- 3. **Generate data**
- ```bash
- uv run python data_generation.py
- ```
-
- 4. **Setup database**
- ```bash
- uv run python setup_database.py
- ```
-
- 5. **Explore data**
- ```bash
- uv run python explore_data.py
- ```
-
- 6. **Train models**
- ```bash
- uv run python train_models.py
- ```
-
- 7. **Run A/B test simulation**
- ```bash
- uv run python ab_test_simulation.py
- ```
-
- 8. **Launch dashboard**
- ```bash
- uv run streamlit run app.py
- ```
-
- The dashboard will open at `http://localhost:8501`
-
- ## 📊 Dashboard Features
-
- ### 📈 Overview
- - Service distribution by memory, CPU, and latency sensitivity
- - Traffic volume analysis across regions
- - Total statistics (150 services, 5 regions, 1.6M records)
-
- ### 🎯 A/B Test Results
- - Side-by-side comparison of control vs treatment strategies
- - Latency reduction: 5.25%
- - Cost savings: 4.92%
- - Statistical significance test results (p-value, t-statistic)
-
- ### 🗺️ Regional Analysis
- - Interactive latency heatmap between all region pairs
- - Regional statistics (min, max, std deviation)
- - Identify high-latency corridors
-
- ### 🔧 Service Details
- - Interactive service explorer
- - Per-service placement across regions
- - Instance count and latency metrics
-
- ## 🧠 Technical Stack
-
- | Component | Tool | Purpose |
- |-----------|------|---------|
- | Data Storage | SQLite | Lightweight database for local development |
- | Data Processing | Pandas, NumPy | Data manipulation and feature engineering |
- | ML Framework | scikit-learn, XGBoost | Model training and prediction |
- | Statistics | SciPy | A/B testing and significance tests |
- | Visualization | Plotly, Streamlit | Interactive dashboards |
- | Deployment | Hugging Face Spaces | Live dashboard hosting |
-
- ## 📈 Model Performance
-
- ### XGBoost (Latency Prediction)
- ```
- RMSE: 28.7007 ms
- MAE: 24.6690 ms
- R²: -0.0674 (indicates high variance in data)
- ```
-
- **Top 5 Important Features:**
- 1. Request Variability (CV): 21.7%
- 2. Outbound Latency: 17.6%
- 3. Average Requests: 14.2%
- 4. Dependencies: 13.5%
- 5. Number of Instances: 11.7%
-
- ### Random Forest (Placement Strategy)
 ```
- Accuracy: 100%
- Precision: 1.00
- Recall: 1.00
- F1-Score: 1.00
 ```
-
- **Top Features:**
- 1. Traffic Volume: 54.5%
- 2. Dependencies: 13.8%
- 3. Latency Sensitivity: 13.7%
-
- ## 🧪 A/B Test Methodology
-
- **Hypothesis:** ML-optimized placement reduces latency compared to random placement
-
- **Sample Size:** 150 services × 5 regions = 750 potential placements
-
- **Metrics:**
- - Primary: Average latency (ms)
- - Secondary: Total cost ($), redundancy score, critical service latency
- - Efficiency: Number of placement pairs (fewer = more efficient)
-
- **Test Type:** Independent samples t-test
- - Null hypothesis (H₀): μ_control = μ_treatment
- - Alternative hypothesis (H₁): μ_control ≠ μ_treatment
- - Significance level: α = 0.05
-
- **Result:** Reject H₀ (p < 0.001)
- - The ML-optimized placement significantly reduces latency
-
- ## 💡 Key Insights
-
- 1. **Latency-critical services benefit most** from optimized placement (9.3% improvement vs 5.25% average)
- 2. **Traffic patterns drive decisions** - high-traffic services benefit from multi-region placement
- 3. **Regional cost differences matter** - avoiding expensive regions saves 4.92% without sacrificing latency
- 4. **Placement efficiency improves** - ML uses 16% fewer placement pairs while reducing latency
- 5. **Statistical rigor matters** - The improvement is not due to chance (p < 0.001)
-
- ## 🚀 Future Enhancements
-
- ### Short-term
- - [ ] Add notebook with exploratory data analysis
- - [ ] Include feature importance visualizations
- - [ ] Create prediction API endpoint
-
- ### Medium-term
- - [ ] Integrate real AWS CloudWatch metrics
- - [ ] Add model retraining pipeline
- - [ ] Implement automated alerting
- - [ ] Support multi-cloud scenarios (GCP, Azure)
-
- ### Long-term
- - [ ] Deploy as microservice recommendation engine
- - [ ] Build feedback loop for model improvement
- - [ ] Create cost optimization module
- - [ ] Add capacity planning features
-
- ## 📚 Learning Resources
-
- This project demonstrates:
- - ✅ SQL data querying and aggregation
- - ✅ Python data manipulation (Pandas, NumPy)
- - ✅ Machine learning model training (scikit-learn, XGBoost)
- - ✅ Feature engineering and preprocessing
- - ✅ Statistical hypothesis testing
- - ✅ A/B testing methodology
- - ✅ Data visualization (Plotly, Streamlit)
- - ✅ Full-stack ML deployment
-
- ## 📝 License
-
- This project is open source and available under the MIT License.
-
- ## 👤 Author
-
- Built as a portfolio project demonstrating ML engineering capabilities for cloud infrastructure optimization.
-
- ---
-
- **Questions or feedback?** Open an issue or reach out!
-
- **Live Dashboard:** [Hugging Face Spaces](https://huggingface.co/spaces/aankitdas/resource-optimization-ml)
- **GitHub:** [resource-optimization-ml](https://github.com/aankitdas/resource-optimization-ml)
 ---
 title: Resource Optimization ML Pipeline
+ emoji:
 colorFrom: blue
 colorTo: green
 sdk: docker
 app_file: app.py
 pinned: false
 ---

+ # Resource Optimization ML Pipeline

+ A data-driven approach to optimizing service placement across cloud regions, reducing latency and infrastructure costs through machine learning.

+ **Live Dashboard:** https://huggingface.co/spaces/aankitdas/resource-optimization-ml

+ ## Problem

+ When Amazon scales infrastructure globally across multiple AWS regions, teams face a critical decision: which services should run in which regions?

+ The naive approach (random placement) is inefficient:
+ - Services get placed in expensive regions unnecessarily
+ - Cross-region communication adds latency
+ - Resources are over-provisioned to ensure redundancy
+ - There is no data-driven strategy for placement decisions

+ This project tackles that problem: **given service characteristics and regional latency patterns, can we predict optimal placement that reduces latency and costs?**

+ ## Solution

+ I built an ML-powered recommendation system that:

+ 1. **Analyzes service characteristics** - memory, CPU, traffic volume, latency sensitivity
+ 2. **Models regional latency** - how long it takes to communicate between regions
+ 3. **Predicts placement impact** - what happens to latency if we place a service in region X vs region Y
+ 4. **Compares strategies** - random placement vs ML-optimized placement through A/B testing

+ ## Results

+ The ML-optimized strategy outperforms random placement:

+ - **5.25% latency reduction** - services respond faster to users
+ - **4.92% cost savings** - expensive regions avoided where possible
+ - **9.30% improvement for critical services** - latency-sensitive workloads benefit most
+ - **Statistical significance** - improvements are not due to chance (p < 0.001)
+ - **16% fewer placements** - more efficient resource usage

+ ## Technical Approach

+ ### Data Pipeline
+ - Generated 150 synthetic services with realistic attributes
+ - Created 1.6M+ traffic records across 5 regions over 90 days
+ - Modeled cross-region latency patterns based on real AWS geography
+ - Stored everything in SQLite for easy SQL querying
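
The last step above can be sketched in a few lines of pandas. This is an illustrative sketch, not the actual `setup_database.py`: the `load_tables` helper and the table names are assumptions for the example.

```python
import sqlite3

import pandas as pd


def load_tables(frames: dict, db_path: str = ":memory:") -> sqlite3.Connection:
    """Write each DataFrame to a SQLite table and return the open connection."""
    conn = sqlite3.connect(db_path)
    for table, df in frames.items():
        # if_exists="replace" makes reruns of the pipeline idempotent
        df.to_sql(table, conn, if_exists="replace", index=False)
    return conn


# Toy frame for illustration; the real pipeline would pass
# pd.read_csv("data/services.csv") and the other generated CSVs.
conn = load_tables({"services": pd.DataFrame({"service_id": [1, 2], "memory_gb": [4, 8]})})
```

Once loaded, any exploration is a plain SQL query, e.g. `conn.execute("SELECT COUNT(*) FROM services")`.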

+ ### Machine Learning

+ **Model 1: Latency Prediction (XGBoost Regressor)**
+ - Predicts service latency given placement characteristics
+ - Input: service memory/CPU, traffic patterns, outbound latency, dependencies
+ - Output: expected latency in milliseconds
+ - Performance: RMSE = 28.7 ms
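
The training step follows the standard fit/predict pattern. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in (`xgboost.XGBRegressor` in the real pipeline exposes the same API); the synthetic features and target are invented for the example:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature table (names are assumptions)
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(1, 64, n),    # memory_gb
    rng.integers(1, 16, n),   # cpu_cores
    rng.uniform(10, 90, n),   # avg_outbound_latency_ms
    rng.uniform(0, 1, n),     # request_variability (coefficient of variation)
])
# Toy latency target: mostly driven by outbound latency and variability
y = 50 + 0.8 * X[:, 2] + 30 * X[:, 3] + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
```

Evaluating with RMSE on a held-out split, as above, is how the 28.7 ms figure would be produced on the project's real features.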

+ **Model 2: Placement Strategy (Random Forest Classifier)**
+ - Determines whether a service should be single-region or multi-region
+ - Input: traffic volume, dependencies, resource requirements
+ - Output: optimal placement strategy
+ - Performance: 100% accuracy on test set
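
A minimal sketch of the classification side, with toy features and a made-up labeling rule (the real labels come from the generated placement data, not this heuristic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
traffic = rng.lognormal(8, 1, n)        # average requests/hour
dependencies = rng.integers(0, 10, n)   # number of downstream services
memory_gb = rng.uniform(1, 64, n)
X = np.column_stack([traffic, dependencies, memory_gb])
# Toy rule for the example: high-traffic services go multi-region (label 1)
y = (traffic > np.median(traffic)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
recommendation = clf.predict(X[:1])  # 0 = single-region, 1 = multi-region
```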

+ ### A/B Testing

+ To validate the ML approach:
+ - **Control**: randomly place services across 2-4 regions
+ - **Treatment**: use ML models to recommend optimal placement
+ - **Test**: independent t-test on latency samples (t = 7.02, p < 0.001)
+ - **Conclusion**: the ML strategy is better, and the difference is statistically significant
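
The significance test itself is one SciPy call. The sketch below uses synthetic latency samples with roughly the reported 5.25% mean gap (the real `ab_test_simulation.py` draws its samples from the simulated placements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(200.0, 30.0, 150)    # latencies (ms), random placement
treatment = rng.normal(189.5, 30.0, 150)  # latencies (ms), ML-optimized, ~5% lower mean

# Independent two-sample t-test: H0 is that the two means are equal
t_stat, p_value = stats.ttest_ind(control, treatment)
significant = p_value < 0.05  # alpha = 0.05
```

With the project's real samples this test yields t = 7.02 and p < 0.001, i.e. H₀ is rejected.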

+ ## How to Use the Dashboard

+ **Overview** - See service distribution across memory tiers and latency sensitivity, plus the top services by traffic volume.

+ **A/B Test Results** - The core finding: a side-by-side comparison of random vs ML-optimized placement with metrics and statistical test results.

+ **Regional Analysis** - Latency heatmap showing communication costs between regions. Higher-latency regions are avoided when possible.

+ ## Project Structure

 ```
+ ├── data_generation.py # Generate synthetic services, traffic, latency data
+ ├── setup_database.py # Load CSVs into SQLite
+ ├── train_models.py # Train XGBoost and Random Forest models
+ ├── ab_test_simulation.py # Run A/B test and save results
+ ├── app.py # Streamlit dashboard
+ ├── results/
+ │ └── ab_test_results.json # A/B test metrics and statistics
+ └── requirements.txt # Python dependencies
 ```

+ ## Technology Stack

+ - **Data Processing**: Python, Pandas, NumPy, SQLite
+ - **Machine Learning**: scikit-learn, XGBoost
+ - **Statistics**: SciPy (hypothesis testing)
+ - **Visualization**: Plotly, Streamlit
+ - **Deployment**: Docker, Hugging Face Spaces, GitHub Actions

+ ## Key Insights

+ 1. **Traffic patterns matter most** - services with high, variable traffic benefit most from multi-region placement
+ 2. **Latency-critical services are placement-sensitive** - a few milliseconds of additional latency can degrade user experience for these workloads
+ 3. **Regional cost differences are significant** - some regions are 80% more expensive than others, and the ML strategy avoids them when latency permits
+ 4. **Efficiency and performance can both improve** - the ML strategy uses fewer total placements while reducing latency
+ 5. **Statistical rigor matters** - raw improvements mean nothing without significance testing

+ ## Running Locally

+ ```bash
+ # Generate data
+ python data_generation.py

+ # Setup database
+ python setup_database.py

+ # Train models
+ python train_models.py

+ # Run A/B test
+ python ab_test_simulation.py

+ # Launch dashboard
+ streamlit run app.py
+ ```

+ ## What This Demonstrates

+ - SQL data analysis and aggregation
+ - Python data manipulation and feature engineering
+ - Machine learning model training and evaluation
+ - Statistical hypothesis testing and A/B testing methodology
+ - End-to-end data product development (from data to dashboard)
+ - Production deployment with Docker and GitHub Actions

+ ## Repository

+ https://github.com/aankitdas/resource-optimization-ml