	# SGLang CI Monitor

	> Note: This README.md was primarily generated by Claude 4, with some manual adjustments.

	A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes four main tools:

	1. CI Analyzer (`ci_analyzer.py`): Analyzes CI failures and provides detailed failure pattern analysis
	2. Performance Analyzer (`ci_analyzer_perf.py`): Tracks performance metrics over time and generates trend charts
	3. Test Balance Analyzer (`ci_analyzer_balance.py`): Analyzes test time gaps between elapsed and estimated times to help balance CI
	4. Failures Analyzer (`ci_failures_analysis.py`): Tracks consecutive failures, identifies flaky jobs, and monitors runner health

	## Features

	### CI Analyzer (`ci_analyzer.py`)
	- Simple Analysis: Analyze recent CI runs and identify failure patterns
	- Category Classification: Automatically categorize failures by type (unit-test, performance, etc.)
	- Pattern Recognition: Identify common failure patterns (timeouts, build failures, etc.)
	- CI Links: Direct links to recent failed CI runs for detailed investigation
	- Last Success Tracking: Track the last successful run for each failed job with PR information
	- JSON Export: Export detailed analysis data to JSON format

	### Performance Analyzer (`ci_analyzer_perf.py`)
	- Performance Tracking: Monitor performance metrics across CI runs over time
	- Automated Chart Generation: Generate time-series charts for each performance metric
	- Multi-Test Support: Track performance for all test types (throughput, latency, accuracy)
	- CSV Export: Export performance data in structured CSV format
	- Trend Analysis: Visualize performance trends with the generated charts (saved as PNG files)
	- Comprehensive Metrics: Track output throughput, E2E latency, TTFT, accept length, and more
	- Time-Based Sampling: Intelligent sampling strategy to cover extended time periods (up to 30 days) with limited API calls

	### Test Balance Analyzer (`ci_analyzer_balance.py`)
	- Time Gap Analysis: Identify GPU tests with large gaps between elapsed and estimated times
	- CI Balancing: Help optimize CI by identifying tests that need time adjustments
	- Gap Tracking: Track maximum time gaps for each test across multiple CI runs
	- PR Test Focus: Only analyzes GPU jobs from pr-test.yml workflow (excludes AMD and other workflows)
	- Ranking System: Sort tests by time gap severity to prioritize adjustments
	- CSV Export: Export analysis results in CSV format for easy review
	- GitHub Integration: Generate GitHub Actions summaries with recommendations

	### Failures Analyzer (`ci_failures_analysis.py`)
	- Consecutive Failure Tracking: Identify jobs that are currently failing several runs in a row
	- Runner Health Monitoring: Track runner failure rates and identify problematic infrastructure
	- Multi-Workflow Support: Monitors PR Test (Nvidia), PR Test (AMD), and PR Test (Xeon) workflows
	- Queue Time Tracking: Monitor average and P90 queue times per runner type
	- Alert System: Automatic alerts for consecutive failures and runner problems
	- Instance Tracking: Monitor specific runner instances for targeted remediation
	- Slack Notifications: Send condensed alerts to Slack (top 3 jobs/runners by consecutive failures and failure rates)
	- GitHub Integration: Generate comprehensive summaries with actionable recommendations
	- JSON Export: Export detailed analysis data for further processing

	### Common Features
	- Automated Monitoring: GitHub Actions workflow for continuous CI and performance monitoring

	## Installation

	### For CI Analyzer
	No additional dependencies required beyond Python standard library and `requests`:

	```bash
	pip install requests
	```

	### For Performance Analyzer
	Additional dependencies required for chart generation:

	```bash
	pip install requests matplotlib pandas
	```

	### For Test Balance Analyzer
	No additional dependencies required beyond Python standard library and `requests`:

	```bash
	pip install requests
	```

	## Usage

	### CI Analyzer

	#### Basic Usage

	```bash
	# Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens
	python ci_analyzer.py --token YOUR_GITHUB_TOKEN
	```

	#### Advanced Usage

	```bash
	# Analyze last 1000 runs
	python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000

	# Custom output file
	python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json
	```

	### Performance Analyzer

	#### Basic Usage

	```bash
	# Analyze performance trends from recent CI runs
	python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN
	```

	#### Advanced Usage

	```bash
	# Analyze last 1000 PR Test runs (auto-enables uniform sampling for ~30 days coverage)
	python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000

	# Custom output directory
	python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data

	# Use 500 runs (enables uniform sampling, since the limit meets the 500-run threshold)
	python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500

	# Get ALL performance data within a specific date range (recommended for historical analysis)
	python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date 2024-12-01 --end-date 2024-12-31

	# Get complete data for the last week (uses GNU `date -d`; adjust the date command on macOS/BSD)
	python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date "$(date -d '7 days ago' +%Y-%m-%d)" --end-date "$(date +%Y-%m-%d)"

	# Upload results to GitHub repository for sharing
	python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000 --upload-to-github
	```

	### Test Balance Analyzer

	#### Basic Usage

	```bash
	# Analyze PR Test GPU job time gaps from recent CI runs
	python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN
	```

	#### Advanced Usage

	```bash
	# Analyze last 1000 PR Test GPU CI runs for comprehensive test balance analysis
	python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 1000

	# Custom output file
	python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_balance_analysis.json
	```
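The gap ranking that the Test Balance Analyzer reports can be pictured with the short sketch below. It is illustrative only and not the analyzer's actual code: the record layout (`test`, `elapsed`, `estimated`), the use of an absolute difference as the gap, and the function name `rank_time_gaps` are all assumptions.

```python
from collections import defaultdict


def rank_time_gaps(records):
    """Return (test, max_gap_seconds) pairs sorted by largest gap first.

    `records` is a hypothetical list of dicts with keys "test", "elapsed",
    and "estimated" (all in seconds); the real data layout may differ.
    """
    max_gap = defaultdict(float)
    for rec in records:
        # Assumption: the gap is the absolute difference between the two times.
        gap = abs(rec["elapsed"] - rec["estimated"])
        # Keep only the largest gap seen for each test across all runs.
        max_gap[rec["test"]] = max(max_gap[rec["test"]], gap)
    return sorted(max_gap.items(), key=lambda item: item[1], reverse=True)


records = [
    {"test": "test_a.py", "elapsed": 120.0, "estimated": 60.0},
    {"test": "test_a.py", "elapsed": 90.0, "estimated": 60.0},
    {"test": "test_b.py", "elapsed": 30.0, "estimated": 45.0},
]
ranking = rank_time_gaps(records)
for test, gap in ranking:
    print(f"{test}: max gap {gap:.0f}s")
```

Tests at the top of the ranking are the first candidates for an updated time estimate.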

	### Failures Analyzer

	#### Quick Start

	```bash
	# Set token as environment variable (recommended for security)
	export GITHUB_TOKEN="your_token_here"

	# Quick test with recent runs
	python ci_failures_analysis.py --token "$GITHUB_TOKEN" --limit 50 --threshold 2

	# Standard analysis (same as automated workflow)
	python ci_failures_analysis.py --token "$GITHUB_TOKEN" --limit 300 --threshold 2

	# Deep analysis
	python ci_failures_analysis.py --token "$GITHUB_TOKEN" --limit 500 --threshold 3
	```

	#### Monitored Workflows

	The Failures Analyzer monitors the following workflows:

	- PR Test - Nvidia GPU tests (self-hosted runners: 1-gpu-runner, 4-gpu-h100-runner, etc.)
	- PR Test (AMD) - AMD GPU tests (AMD-specific runners)
	- PR Test (Xeon) - Intel Xeon CPU tests (Xeon-specific runners)

	All three workflows are analyzed together, with runner statistics tracked separately by runner type.

	#### Slack Notifications

	The Failures Analyzer can send condensed alerts to Slack. See [SLACK_SETUP.md](SLACK_SETUP.md) for complete setup instructions.

	What gets sent:
	- Top 3 jobs with consecutive failures
	- Top 3 runners with consecutive failures
	- Top 3 jobs with highest total failure rate
	- Top 3 runners with highest total failure rate
	- Queue time summary

	```bash
	# Send Slack notification from analysis JSON
	export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
	python slack_notifier.py --json ci_failure_analysis.json
	```
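To give a feel for what a condensed "top 3" alert might look like, here is a minimal sketch that builds a Slack incoming-webhook payload (a JSON object with a `text` field). The input layout (a mapping from job name to consecutive-failure count) and the function name `build_slack_payload` are hypothetical; the real `slack_notifier.py` and the analysis JSON may be structured differently.

```python
import json


def build_slack_payload(jobs, top_n=3):
    """Build a webhook payload listing the top jobs by failure streak.

    `jobs` maps a job name to its consecutive-failure count
    (hypothetical layout, chosen only for this illustration).
    """
    top = sorted(jobs.items(), key=lambda item: item[1], reverse=True)[:top_n]
    lines = [f"- {name}: {streak} consecutive failures" for name, streak in top]
    return {"text": "Top jobs by consecutive failures\n" + "\n".join(lines)}


payload = build_slack_payload({"job-a": 5, "job-b": 2, "job-c": 7, "job-d": 1})
# The payload would be sent as the JSON body of a POST to SLACK_WEBHOOK_URL.
print(json.dumps(payload, indent=2))
```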

	#### Understanding the Output

	The script generates a 2-section report:

	Section 1: Currently Broken Jobs (Active Consecutive Failures)
	- Shows consecutive failure streaks
	- These need immediate attention

	Section 2: Runner Health Analysis
	- Shows which runners have high failure rates
	- Includes queue time metrics (average and P90)
	- Helps identify infrastructure vs code issues
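For readers unfamiliar with the queue time metrics, the sketch below shows one common way to compute an average and a P90 from a list of queue times. It assumes the nearest-rank percentile definition; the analyzer may use a different method, and the function name `queue_time_summary` is hypothetical.

```python
import math
from statistics import mean


def queue_time_summary(queue_seconds):
    """Return (average, p90) for a non-empty list of queue times in seconds."""
    ordered = sorted(queue_seconds)
    # Nearest-rank P90: the smallest value with at least 90% of samples at or below it.
    rank = math.ceil(len(ordered) * 90 / 100)
    return mean(ordered), ordered[rank - 1]


avg, p90 = queue_time_summary([30, 45, 60, 90, 120, 150, 200, 300, 600, 1200])
print(f"average={avg:.1f}s p90={p90}s")
```

A P90 far above the average, as in this sample, indicates that a few jobs wait much longer than the typical one.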

	#### Alert Types

	Job Alerts (Consecutive Failures):
	- Triggered when a job fails at least `--threshold` times in a row
	- Example: with `--threshold 2`, a job that fails 3 times in a row triggers an alert (3 ≥ 2)

	Runner Alerts:
	- Runner Health: Runner has >30% failure rate with ≥2 different jobs failing
	- Runner Instance: Specific instance has >50% failure rate with ≥3 jobs
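The alert rules above can be sketched as simple predicates. This is a minimal illustration under stated assumptions, not the script's implementation: the function names are hypothetical, run conclusions are assumed to be ordered newest first, and any conclusion other than `"failure"` is assumed to end a streak.

```python
def consecutive_failures(conclusions):
    """Count the current failure streak; `conclusions` is ordered newest first."""
    streak = 0
    for conclusion in conclusions:
        if conclusion != "failure":
            break  # Assumption: anything other than a failure ends the streak.
        streak += 1
    return streak


def job_alert(conclusions, threshold=3):
    """Job rule: alert when the current streak reaches the threshold."""
    return consecutive_failures(conclusions) >= threshold


def runner_alert(failed, total, distinct_failed_jobs):
    """Runner health rule: >30% failure rate with at least 2 different jobs failing."""
    return total > 0 and failed / total > 0.30 and distinct_failed_jobs >= 2


print(job_alert(["failure", "failure", "failure", "success"], threshold=2))
print(job_alert(["success", "failure", "failure"], threshold=2))
print(runner_alert(failed=4, total=10, distinct_failed_jobs=2))
```

The second call does not alert because the most recent run succeeded, so the job is not currently in a failure streak.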

	#### Output Files

	- Console: Human-readable 2-section report (always generated)
	- JSON: Detailed data (optional, only if `--output` is specified)
	- GitHub Summary: Markdown (automatically generated in GitHub Actions)

	Important: Make sure your GitHub token has the `repo` and `workflow` permissions; otherwise, requests for CI data return 404 errors.

	## Data Collection Strategies

	The Performance Analyzer offers multiple strategies for collecting performance data to suit different analysis needs.

	### 1. Uniform Sampling Strategy

	When to use: Daily monitoring and trend analysis over extended periods.

	- Automatically enabled when `--limit >= 500`
	- Disabled for smaller limits (< 500) to maintain backward compatibility

	#### How it works:
	- Collects data uniformly across a 30-day period
	- Ensures even time distribution of samples
	- Provides consistent coverage for trend analysis

	#### Example with 1000 Runs:
	- Time Range: Last 30 days
	- Distribution: 1000 samples evenly distributed across the period
	- Coverage: ~33 samples per day on average
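The arithmetic behind this example can be checked with a small sketch that spreads sample points evenly over a 30-day window. It only illustrates the idea of uniform time distribution; how the analyzer actually maps sample points to CI runs is not described here, and `sample_timestamps` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def sample_timestamps(end, days=30, samples=1000):
    """Return `samples` evenly spaced timestamps covering the `days` before `end`."""
    start = end - timedelta(days=days)
    step = (end - start) / samples
    return [start + step * i for i in range(samples)]


end = datetime(2024, 12, 31, tzinfo=timezone.utc)
points = sample_timestamps(end)
print(len(points), "samples, one every", points[1] - points[0])
print(f"{len(points) / 30:.1f} samples per day")
```

With 1000 samples over 30 days, consecutive sample points are 43 minutes and 12 seconds apart.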

	### 2. Date Range Collection

	When to use: Historical analysis, specific period investigation, or complete data collection.

	Use `--start-date` and `--end-date` parameters to get ALL CI runs within a specific time range.

	#### Features:
	- Complete Data: Gets every CI run in the specified range (no sampling)
	- No Limit: Ignores the `--limit` parameter
	- Flexible Range: Specify any date range you need
	- Historical Analysis: Perfect for investigating specific time periods

	#### Date Format:
	- Use `YYYY-MM-DD` format (e.g., `2024-12-01`)
	- Both parameters are optional:
	  - Only `--start-date`: Gets all runs from that date to now
	  - Only `--end-date`: Gets all runs from 30 days ago to that date
	  - Both: Gets all runs in the specified range
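The defaulting rules above might be resolved along the following lines. This is a sketch under assumptions: the function name `resolve_date_range` is hypothetical, and "30 days ago" is read as 30 days before today rather than 30 days before the end date, which the text does not specify.

```python
from datetime import date, datetime, timedelta


def resolve_date_range(start_date=None, end_date=None, today=None):
    """Resolve optional YYYY-MM-DD strings into a (start, end) pair of dates.

    Intended for the case where at least one of the two dates is given.
    """
    today = today or date.today()
    end = datetime.strptime(end_date, "%Y-%m-%d").date() if end_date else today
    if start_date:
        start = datetime.strptime(start_date, "%Y-%m-%d").date()
    else:
        # Assumption: "30 days ago" is measured from today, not from the end date.
        start = today - timedelta(days=30)
    if start > end:
        raise ValueError("start date must not be after end date")
    return start, end


print(resolve_date_range("2024-12-01", "2024-12-31"))
print(resolve_date_range(end_date="2024-12-10", today=date(2024, 12, 15)))
```

`datetime.strptime` raises `ValueError` for strings that do not match `YYYY-MM-DD`, which gives a basic format check for free.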

	### 3. Sequential Collection (Traditional)

	When to use: Quick checks or when you only need recent data.

	- Default behavior for `--limit < 500`
	- Gets the most recent CI runs in chronological order
	- Fast and simple for immediate analysis

	### Comparison

	| Strategy | Use Case | Time Coverage | Data Completeness | API Efficiency |
	|----------|----------|---------------|-------------------|----------------|
	| Uniform Sampling | Daily monitoring, trends | ~30 days | Sampled | High |
	| Date Range | Historical analysis | Any range | Complete | Variable |
	| Sequential | Quick checks | 3-4 days | Complete (recent) | High |
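Taken together, the three strategies are selected by the command-line arguments. The sketch below restates those rules as a single function; it is a summary of the behavior described in this section, not code from the tool, and `choose_strategy` is a hypothetical name.

```python
def choose_strategy(limit=100, start_date=None, end_date=None):
    """Return the collection strategy implied by the given arguments."""
    if start_date or end_date:
        # A date range collects every run in the range and ignores --limit.
        return "date_range"
    if limit >= 500:
        # Large limits switch to uniform sampling over roughly 30 days.
        return "uniform_sampling"
    # Smaller limits fetch the most recent runs.
    return "sequential"


print(choose_strategy())
print(choose_strategy(limit=1000))
print(choose_strategy(limit=1000, start_date="2024-12-01"))
```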

	### Benefits

	- Flexible Analysis: Choose the right strategy for your needs
	- Extended Coverage: Up to 30 days with sampling, unlimited with date ranges
	- Complete Data: Get every run in a specific period when needed
	- API Efficiency: Optimized for different use patterns

	## Parameters

	### CI Analyzer Parameters

	| Parameter | Default | Description |
	|-----------|---------|-------------|
	| `--token` | Required | GitHub Personal Access Token |
	| `--limit` | 100 | Number of CI runs to analyze |
	| `--output` | ci_analysis.json | Output JSON file for detailed data |

	### Performance Analyzer Parameters

	| Parameter | Default | Description |
	|-----------|---------|-------------|
	| `--token` | Required | GitHub Personal Access Token |
	| `--limit` | 100 | Number of PR Test runs to analyze (ignored when using date range) |
	| `--output-dir` | performance_tables | Output directory for CSV tables and PNG charts |
	| `--start-date` | None | Start date for date range query (YYYY-MM-DD format) |
	| `--end-date` | None | End date for date range query (YYYY-MM-DD format) |
	| `--upload-to-github` | False | Upload results to sglang-bot/sglang-ci-data repository |

	### Test Balance Analyzer Parameters

	| Parameter | Default | Description |
	|-----------|---------|-------------|
	| `--token` | Required | GitHub Personal Access Token |
	| `--limit` | 1000 | Number of CI runs to analyze |
	| `--output` | test_balance_report.json | Output JSON file for detailed analysis data |

	### Failures Analyzer Parameters

	| Parameter | Default | Description |
	|-----------|---------|-------------|
	| `--token` | Required | GitHub Personal Access Token |
	| `--limit` | 500 | Number of workflow runs to analyze |
	| `--threshold` | 3 | Alert threshold for consecutive failures |
	| `--output` | None | Output JSON file (optional, only writes if specified) |

	## Getting GitHub Token

	1. Go to [GitHub Settings > Personal Access Tokens](https://github.com/settings/tokens)
	2. Click "Generate new token" > "Generate new token (classic)"
	3. Important: Select the following permissions:
	   - `repo` (Full control of private repositories) - Required for accessing repository data
	   - `workflow` (Update GitHub Action workflows) - Required for reading CI/CD data
	4. Copy the generated token and use it as `YOUR_GITHUB_TOKEN`

	Note: Without the `repo` and `workflow` permissions, the tool will not be able to access CI run data and will return 404 errors.
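Before running a long analysis, it can help to confirm that the token is able to read workflow runs at all. The sketch below builds a request against the GitHub REST endpoint for listing a repository's workflow runs. It uses the standard library's `urllib` instead of `requests` to stay dependency-free, and the repository name `sgl-project/sglang` and the helper name `build_runs_request` are assumptions made for illustration.

```python
import urllib.request

API_URL = "https://api.github.com/repos/{repo}/actions/runs?per_page=1"


def build_runs_request(token, repo="sgl-project/sglang"):
    """Build (but do not send) a request that lists workflow runs for `repo`."""
    return urllib.request.Request(
        API_URL.format(repo=repo),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


request = build_runs_request("YOUR_GITHUB_TOKEN")
print(request.full_url)
# To send it, call urllib.request.urlopen(request). A denied or invalid token
# raises urllib.error.HTTPError, whose .code holds the HTTP status (e.g. 401 or 404).
```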