| # SGLang CI Monitor |
|
|
| > **Note**: This README.md is primarily generated by Claude 4 with some manual adjustments. |
|
|
| A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes four main tools: |
|
|
| 1. **CI Analyzer** (`ci_analyzer.py`): Analyzes CI failures and provides detailed failure pattern analysis |
| 2. **Performance Analyzer** (`ci_analyzer_perf.py`): Tracks performance metrics over time and generates trend charts |
| 3. **Test Balance Analyzer** (`ci_analyzer_balance.py`): Analyzes test time gaps between elapsed and estimated times to help balance CI |
| 4. **Failures Analyzer** (`ci_failures_analysis.py`): Tracks consecutive failures, identifies flaky jobs, and monitors runner health |
|
|
| ## Features |
|
|
| ### CI Analyzer (`ci_analyzer.py`) |
| - **Simple Analysis**: Analyze recent CI runs and identify failure patterns |
| - **Category Classification**: Automatically categorize failures by type (unit-test, performance, etc.) |
| - **Pattern Recognition**: Identify common failure patterns (timeouts, build failures, etc.) |
| - **CI Links**: Direct links to recent failed CI runs for detailed investigation |
| - **Last Success Tracking**: Track the last successful run for each failed job with PR information |
| - **JSON Export**: Export detailed analysis data to JSON format |
| |
| ### Performance Analyzer (`ci_analyzer_perf.py`) |
| - **Performance Tracking**: Monitor performance metrics across CI runs over time |
| - **Automated Chart Generation**: Generate time-series charts for each performance metric |
| - **Multi-Test Support**: Track performance for all test types (throughput, latency, accuracy) |
| - **CSV Export**: Export performance data in structured CSV format |
| - **Trend Analysis**: Visualize performance trends with the generated time-series charts |
| - **Comprehensive Metrics**: Track output throughput, E2E latency, TTFT, accept length, and more |
| - **Time-Based Sampling**: Intelligent sampling strategy to cover extended time periods (up to 30 days) with limited API calls |
| |
| ### Test Balance Analyzer (`ci_analyzer_balance.py`) |
| - **Time Gap Analysis**: Identify GPU tests with large gaps between elapsed and estimated times |
| - **CI Balancing**: Help optimize CI by identifying tests that need time adjustments |
| - **Gap Tracking**: Track maximum time gaps for each test across multiple CI runs |
| - **PR Test Focus**: Only analyzes GPU jobs from pr-test.yml workflow (excludes AMD and other workflows) |
| - **Ranking System**: Sort tests by time gap severity to prioritize adjustments |
| - **CSV Export**: Export analysis results in CSV format for easy review |
| - **GitHub Integration**: Generate GitHub Actions summaries with recommendations |
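
The gap tracking and ranking described in the list above can be sketched as follows. This is a minimal illustration, not the code in `ci_analyzer_balance.py`: the function name `rank_time_gaps`, the record layout, and the definition of a gap as the absolute difference between elapsed and estimated seconds are all assumptions.

```python
def rank_time_gaps(records):
    """Return (test_name, max_gap_seconds) pairs, largest gap first.

    records: iterable of (test_name, elapsed_seconds, estimated_seconds),
    one entry per test per CI run (hypothetical layout).
    """
    max_gap = {}
    for name, elapsed, estimated in records:
        # Assumption: a "gap" is the absolute difference between the two times.
        gap = abs(elapsed - estimated)
        if gap > max_gap.get(name, -1.0):
            max_gap[name] = gap  # keep the worst gap seen across runs
    return sorted(max_gap.items(), key=lambda item: item[1], reverse=True)


records = [
    ("test_a.py", 120.0, 60.0),
    ("test_b.py", 30.0, 35.0),
    ("test_a.py", 90.0, 60.0),
]
ranking = rank_time_gaps(records)
for name, gap in ranking:
    print(f"{name}: max gap {gap:.0f}s")
```

Tests at the top of the ranking are the first candidates for an updated time estimate.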
| |
| ### Failures Analyzer (`ci_failures_analysis.py`) |
| - **Consecutive Failure Tracking**: Identify jobs that are currently failing several runs in a row |
| - **Runner Health Monitoring**: Track runner failure rates and identify problematic infrastructure |
| - **Multi-Workflow Support**: Monitors PR Test (Nvidia), PR Test (AMD), and PR Test (Xeon) workflows |
| - **Queue Time Tracking**: Monitor average and P90 queue times per runner type |
| - **Alert System**: Automatic alerts for consecutive failures and runner problems |
| - **Instance Tracking**: Monitor specific runner instances for targeted remediation |
| - **Slack Notifications**: Send condensed alerts to Slack (top 3 jobs/runners by consecutive failures and failure rates) |
| - **GitHub Integration**: Generate comprehensive summaries with actionable recommendations |
| - **JSON Export**: Export detailed analysis data for further processing |
| |
| ### Common Features |
| - **Automated Monitoring**: GitHub Actions workflow for continuous CI and performance monitoring |
| |
| ## Installation |
| |
| ### For CI Analyzer |
| No additional dependencies required beyond Python standard library and `requests`: |
| |
| ```bash |
| pip install requests |
| ``` |
| |
| ### For Performance Analyzer |
| Additional dependencies required for chart generation: |
| |
| ```bash |
| pip install requests matplotlib pandas |
| ``` |
| |
| ### For Test Balance Analyzer |
| No additional dependencies required beyond Python standard library and `requests`: |
| |
| ```bash |
| pip install requests |
| ``` |
| |
| ## Usage |
| |
| ### CI Analyzer |
| |
| #### Basic Usage |
| |
| ```bash |
| # Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens |
| python ci_analyzer.py --token YOUR_GITHUB_TOKEN |
| ``` |
| |
| #### Advanced Usage |
| |
| ```bash |
| # Analyze last 1000 runs |
| python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000 |
| |
| # Custom output file |
| python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json |
| ``` |
| |
| ### Performance Analyzer |
| |
| #### Basic Usage |
| |
| ```bash |
| # Analyze performance trends from recent CI runs |
| python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN |
| ``` |
| |
| #### Advanced Usage |
| |
| ```bash |
| # Analyze last 1000 PR Test runs (auto-enables uniform sampling for ~30 days coverage) |
| python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000 |
| |
| # Custom output directory |
| python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data |
| |
| # Limit of exactly 500 runs (meets the >= 500 threshold, so uniform sampling is enabled; use a smaller limit for sequential mode) |
| python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 |
| |
| # Get ALL performance data within a specific date range (recommended for historical analysis) |
| python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date 2024-12-01 --end-date 2024-12-31 |
| |
| # Get complete data for the last week (uses GNU `date -d`; BSD/macOS `date` needs `-v-7d` instead) |
| python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date $(date -d '7 days ago' +%Y-%m-%d) --end-date $(date +%Y-%m-%d) |
| |
| # Upload results to GitHub repository for sharing |
| python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000 --upload-to-github |
| ``` |
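
The "last week" command above relies on `date -d`, which is GNU-specific. As a portable alternative, the two dates can be computed in Python and pasted into the command line. The helper name `last_n_days` is hypothetical and not part of the toolkit.

```python
from datetime import date, timedelta


def last_n_days(n, today=None):
    """Return (start, end) as YYYY-MM-DD strings covering the last n days."""
    today = today or date.today()
    start = today - timedelta(days=n)
    return start.isoformat(), today.isoformat()


start, end = last_n_days(7)
# Prints the flags to append to the ci_analyzer_perf.py command line.
print(f"--start-date {start} --end-date {end}")
```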
| |
| ### Test Balance Analyzer |
| |
| #### Basic Usage |
| |
| ```bash |
| # Analyze PR Test GPU job time gaps from recent CI runs |
| python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN |
| ``` |
| |
| #### Advanced Usage |
| |
| ```bash |
| # Analyze last 1000 PR Test GPU CI runs for comprehensive test balance analysis |
| python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 1000 |
| |
| # Custom output file |
| python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_balance_analysis.json |
| ``` |
| |
| ### Failures Analyzer |
| |
| #### Quick Start |
| |
| ```bash |
| # Set token as environment variable (recommended for security) |
| export GITHUB_TOKEN="your_token_here" |
|
|
| # Quick test with recent runs |
| python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 50 --threshold 2 |
| |
| # Standard analysis (same as automated workflow) |
| python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 300 --threshold 2 |
|
|
| # Deep analysis |
| python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 500 --threshold 3 |
| ``` |
| |
| #### Monitored Workflows |
| |
| The Failures Analyzer monitors the following workflows: |
| |
| - **PR Test** - Nvidia GPU tests (self-hosted runners: 1-gpu-runner, 4-gpu-h100-runner, etc.) |
| - **PR Test (AMD)** - AMD GPU tests (AMD-specific runners) |
| - **PR Test (Xeon)** - Intel Xeon CPU tests (Xeon-specific runners) |
| |
| All three workflows are analyzed together, with runner statistics tracked separately by runner type. |
| |
| #### Slack Notifications |
| |
| The Failures Analyzer can send condensed alerts to Slack. See [SLACK_SETUP.md](SLACK_SETUP.md) for complete setup instructions. |
| |
| **What gets sent:** |
| - Top 3 jobs with consecutive failures |
| - Top 3 runners with consecutive failures |
| - Top 3 jobs with highest total failure rate |
| - Top 3 runners with highest total failure rate |
| - Queue time summary |
| |
| ```bash |
| # Send Slack notification from analysis JSON |
| export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" |
| python slack_notifier.py --json ci_failure_analysis.json |
| ``` |
| |
| #### Understanding the Output |
| |
| The script generates a **2-section report**: |
| |
| **Section 1: Currently Broken Jobs (Active Consecutive Failures)** |
| - Shows consecutive failure streaks |
| - These need immediate attention |
| |
| **Section 2: Runner Health Analysis** |
| - Shows which runners have high failure rates |
| - Includes queue time metrics (average and P90) |
| - Helps identify infrastructure vs code issues |
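
As a rough illustration of the queue time metrics mentioned above, the sketch below computes an average and a P90 from a list of queue times in seconds. The function name and the nearest-rank percentile method are assumptions; the actual script may compute P90 differently.

```python
def queue_time_summary(queue_seconds):
    """Return average and nearest-rank P90 for a list of queue times."""
    if not queue_seconds:
        return None
    ordered = sorted(queue_seconds)
    # Nearest-rank P90: the smallest value with at least 90% of samples
    # at or below it. Integer arithmetic gives ceil(0.9 * n).
    rank = (9 * len(ordered) + 9) // 10
    return {
        "average": sum(ordered) / len(ordered),
        "p90": ordered[rank - 1],
    }


summary = queue_time_summary([10 * i for i in range(1, 11)])
print(summary)
```

A P90 that is far above the average suggests that most jobs start quickly while a minority wait a long time for a runner.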
| |
| #### Alert Types |
| |
| **Job Alerts (Consecutive Failures):** |
| - Triggered when a job fails ≥ threshold times in a row |
| - Example: threshold=2, job fails 3 times → ALERT |
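
The consecutive-failure rule can be sketched like this. It assumes each job's history is available as a list of conclusion strings ordered newest first, and that any conclusion other than `"failure"` ends the streak; both are assumptions for illustration, and the function names are hypothetical.

```python
def current_failure_streak(conclusions):
    """Count failures at the head of a newest-first list of conclusions."""
    streak = 0
    for conclusion in conclusions:
        if conclusion != "failure":
            break  # assumption: anything but a failure ends the streak
        streak += 1
    return streak


def should_alert(conclusions, threshold):
    """True when the current streak reaches the alert threshold."""
    return current_failure_streak(conclusions) >= threshold


history = ["failure", "failure", "failure", "success"]
print(current_failure_streak(history), should_alert(history, threshold=2))
```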
| |
| **Runner Alerts:** |
| - **Runner Health**: Runner has >30% failure rate with ≥2 different jobs failing |
| - **Runner Instance**: Specific instance has >50% failure rate with ≥3 jobs |
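
The two runner rules above could be expressed as simple predicates. This is a sketch under stated assumptions: the function names are hypothetical, and "≥3 jobs" for an instance is read here as at least three jobs executed on that instance, which the text does not spell out.

```python
def runner_health_alert(total_jobs, failed_jobs, distinct_failed_jobs):
    """Runner type: >30% failure rate with at least 2 different jobs failing."""
    if total_jobs == 0:
        return False
    return failed_jobs / total_jobs > 0.30 and distinct_failed_jobs >= 2


def runner_instance_alert(total_jobs, failed_jobs):
    """Single instance: >50% failure rate with at least 3 jobs run on it."""
    if total_jobs < 3:  # assumption: the minimum applies to total jobs
        return False
    return failed_jobs / total_jobs > 0.50


print(runner_health_alert(total_jobs=10, failed_jobs=4, distinct_failed_jobs=2))
print(runner_instance_alert(total_jobs=4, failed_jobs=3))
```

Requiring several distinct failing jobs helps separate a bad runner from a single broken test that happens to land on it.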
| |
| #### Output Files |
| |
| - **Console**: Human-readable 2-section report (always generated) |
| - **JSON**: Detailed data (optional, only if `--output` is specified) |
| - **GitHub Summary**: Markdown (automatically generated in GitHub Actions) |
| |
| **Important**: Make sure your GitHub token has `repo` and `workflow` permissions, otherwise you'll get 404 errors. |
| |
| ## Data Collection Strategies |
| |
| The Performance Analyzer offers multiple strategies for collecting performance data to suit different analysis needs. |
| |
| ### 1. Uniform Sampling Strategy |
| |
| **When to use**: Daily monitoring and trend analysis over extended periods. |
| |
| - **Automatically enabled** when `--limit >= 500` |
| - **Disabled** for smaller limits (< 500) to maintain backward compatibility |
| |
| #### How it works: |
| - Collects data uniformly across a 30-day period |
| - Ensures even time distribution of samples |
| - Provides consistent coverage for trend analysis |
| |
| #### Example with 1000 Runs: |
| - **Time Range**: Last 30 days |
| - **Distribution**: 1000 samples evenly distributed across the period |
| - **Coverage**: ~33 samples per day on average |
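
One way to get an even time distribution is to split the window into equal slots and keep one run per slot. The sketch below illustrates that idea only; the function name, the `(created_at, run_id)` tuple layout, and the "earliest run per slot" choice are assumptions and may differ from what `ci_analyzer_perf.py` does.

```python
from datetime import datetime, timedelta


def sample_by_time_slots(runs, limit, start, end):
    """Pick at most one run per equal-width time slot in [start, end).

    runs: list of (created_at: datetime, run_id) tuples in any order.
    Returns run ids in chronological order, at most `limit` of them.
    """
    if limit <= 0 or end <= start:
        return []
    slot = (end - start) / limit  # width of one slot as a timedelta
    chosen = {}
    for created_at, run_id in sorted(runs):
        if not (start <= created_at < end):
            continue  # outside the sampling window
        index = min(int((created_at - start) / slot), limit - 1)
        chosen.setdefault(index, run_id)  # keep the earliest run per slot
    return [chosen[i] for i in sorted(chosen)]


start = datetime(2024, 12, 1)
end = start + timedelta(days=30)
runs = [
    (start + timedelta(hours=1), 1),
    (start + timedelta(hours=2), 2),
    (start + timedelta(days=1, hours=1), 3),
    (start + timedelta(days=40), 4),
]
sampled = sample_by_time_slots(runs, limit=30, start=start, end=end)
print(sampled)
```

With `limit=1000` over 30 days, each slot is about 43 minutes wide, which matches the roughly 33 samples per day quoted above when every slot contains a run.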
| |
| ### 2. Date Range Collection |
| |
| **When to use**: Historical analysis, specific period investigation, or complete data collection. |
| |
| Use `--start-date` and `--end-date` parameters to get **ALL** CI runs within a specific time range. |
| |
| #### Features: |
| - **Complete Data**: Gets every CI run in the specified range (no sampling) |
| - **No Limit**: Ignores the `--limit` parameter |
| - **Flexible Range**: Specify any date range you need |
| - **Historical Analysis**: Perfect for investigating specific time periods |
| |
| #### Date Format: |
| - Use `YYYY-MM-DD` format (e.g., `2024-12-01`) |
| - Both parameters are optional: |
|   - Only `--start-date`: Gets all runs from that date to now |
|   - Only `--end-date`: Gets all runs from 30 days ago to that date |
|   - Both: Gets all runs in the specified range |
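
The defaulting rules above can be sketched as a small helper. It is illustrative only: `resolve_date_range` is a hypothetical name, and "30 days ago" is read literally as 30 days before today rather than 30 days before the end date.

```python
from datetime import date, timedelta


def resolve_date_range(start_date=None, end_date=None, today=None):
    """Fill in the missing side of a date range given YYYY-MM-DD strings."""
    today = today or date.today()
    # Missing end date: the range runs up to now.
    end = date.fromisoformat(end_date) if end_date else today
    # Missing start date: assumption, start 30 days before today.
    start = (
        date.fromisoformat(start_date)
        if start_date
        else today - timedelta(days=30)
    )
    if start > end:
        raise ValueError(f"start date {start} is after end date {end}")
    return start, end


print(resolve_date_range("2024-12-01", None, today=date(2024, 12, 31)))
```

`date.fromisoformat` raises `ValueError` for input that is not in `YYYY-MM-DD` form, so malformed dates are rejected early.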
| |
| ### 3. Sequential Collection (Traditional) |
| |
| **When to use**: Quick checks or when you only need recent data. |
| |
| - **Default behavior** for `--limit < 500` |
| - Gets the most recent CI runs in chronological order |
| - Fast and simple for immediate analysis |
| |
| ### Comparison |
| |
| | Strategy | Use Case | Time Coverage | Data Completeness | API Efficiency | |
| |----------|----------|---------------|-------------------|----------------| |
| | **Uniform Sampling** | Daily monitoring, trends | ~30 days | Sampled | High | |
| | **Date Range** | Historical analysis | Any range | Complete | Variable | |
| | **Sequential** | Quick checks | 3-4 days | Complete (recent) | High | |
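
Putting the three strategies together, the selection logic described in this section can be summarized in a few lines. The function name and return strings are hypothetical; the 500-run threshold and the rule that a date range overrides `--limit` come from the text above.

```python
def choose_strategy(limit=100, start_date=None, end_date=None):
    """Return the collection strategy implied by the CLI arguments."""
    if start_date or end_date:
        return "date_range"  # --limit is ignored when a date is given
    if limit >= 500:
        return "uniform_sampling"
    return "sequential"


for args in ({"limit": 100}, {"limit": 500}, {"limit": 50, "start_date": "2024-12-01"}):
    print(args, "->", choose_strategy(**args))
```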
| |
| ### Benefits |
| |
| - **Flexible Analysis**: Choose the right strategy for your needs |
| - **Extended Coverage**: Up to 30 days with sampling, unlimited with date ranges |
| - **Complete Data**: Get every run in a specific period when needed |
| - **API Efficiency**: Optimized for different use patterns |
| |
| ## Parameters |
| |
| ### CI Analyzer Parameters |
| |
| | Parameter | Default | Description | |
| |-----------|---------|-------------| |
| | `--token` | Required | GitHub Personal Access Token | |
| | `--limit` | 100 | Number of CI runs to analyze | |
| | `--output` | ci_analysis.json | Output JSON file for detailed data | |
| |
| ### Performance Analyzer Parameters |
| |
| | Parameter | Default | Description | |
| |-----------|---------|-------------| |
| | `--token` | Required | GitHub Personal Access Token | |
| | `--limit` | 100 | Number of PR Test runs to analyze (ignored when using date range) | |
| | `--output-dir` | performance_tables | Output directory for CSV tables and PNG charts | |
| | `--start-date` | None | Start date for date range query (YYYY-MM-DD format) | |
| | `--end-date` | None | End date for date range query (YYYY-MM-DD format) | |
| | `--upload-to-github` | False | Upload results to sglang-bot/sglang-ci-data repository | |
| |
| ### Test Balance Analyzer Parameters |
| |
| | Parameter | Default | Description | |
| |-----------|---------|-------------| |
| | `--token` | Required | GitHub Personal Access Token | |
| | `--limit` | 1000 | Number of CI runs to analyze | |
| | `--output` | test_balance_report.json | Output JSON file for detailed analysis data | |
| |
| ### Failures Analyzer Parameters |
| |
| | Parameter | Default | Description | |
| |-----------|---------|-------------| |
| | `--token` | Required | GitHub Personal Access Token | |
| | `--limit` | 500 | Number of workflow runs to analyze | |
| | `--threshold` | 3 | Alert threshold for consecutive failures | |
| | `--output` | None | Output JSON file (optional, only writes if specified) | |
| |
| ## Getting GitHub Token |
| |
| 1. Go to [GitHub Settings > Personal Access Tokens](https://github.com/settings/tokens) |
| 2. Click "Generate new token" > "Generate new token (classic)" |
| 3. **Important**: Select the following permissions: |
|    - `repo` (Full control of private repositories) - **Required for accessing repository data** |
|    - `workflow` (Update GitHub Action workflows) - **Required for reading CI/CD data** |
| 4. Copy the generated token and use it as `YOUR_GITHUB_TOKEN` |
| |
| **Note**: Without the `repo` and `workflow` permissions, the tool will not be able to access CI run data and will return 404 errors. |
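
To check a classic token before running the tools, you can inspect the `X-OAuth-Scopes` response header that the GitHub API returns for classic personal access tokens. The sketch below is not part of the toolkit; the function names are hypothetical, and fine-grained tokens do not report scopes this way.

```python
import urllib.request

REQUIRED_SCOPES = {"repo", "workflow"}


def missing_scopes(scopes_header, required=REQUIRED_SCOPES):
    """Return the required scopes absent from an X-OAuth-Scopes value."""
    granted = {s.strip() for s in scopes_header.split(",") if s.strip()}
    return sorted(set(required) - granted)


def fetch_scopes_header(token):
    """Call the GitHub API once and return the X-OAuth-Scopes header."""
    request = urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.headers.get("X-OAuth-Scopes", "")


# Example (requires network access and a real token):
#   print(missing_scopes(fetch_scopes_header("YOUR_GITHUB_TOKEN")))
print(missing_scopes("repo, gist"))
```

An empty list means both scopes are present.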
| |