# SGLang CI Monitor

> **Note**: This README.md is primarily generated by Claude 4 with some manual adjustments.

A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project.

This toolkit includes four main tools:

1. **CI Analyzer** (`ci_analyzer.py`): Analyzes CI failures and provides detailed failure pattern analysis
2. **Performance Analyzer** (`ci_analyzer_perf.py`): Tracks performance metrics over time and generates trend charts
3. **Test Balance Analyzer** (`ci_analyzer_balance.py`): Analyzes test time gaps between elapsed and estimated times to help balance CI
4. **Failures Analyzer** (`ci_failures_analysis.py`): Tracks consecutive failures, identifies flaky jobs, and monitors runner health

## Features

### CI Analyzer (`ci_analyzer.py`)

- **Simple Analysis**: Analyze recent CI runs and identify failure patterns
- **Category Classification**: Automatically categorize failures by type (unit-test, performance, etc.)
- **Pattern Recognition**: Identify common failure patterns (timeouts, build failures, etc.)
- **CI Links**: Direct links to recent failed CI runs for detailed investigation
- **Last Success Tracking**: Track the last successful run for each failed job with PR information
- **JSON Export**: Export detailed analysis data to JSON format

### Performance Analyzer (`ci_analyzer_perf.py`)

- **Performance Tracking**: Monitor performance metrics across CI runs over time
- **Automated Chart Generation**: Generate time-series charts for each performance metric
- **Multi-Test Support**: Track performance for all test types (throughput, latency, accuracy)
- **CSV Export**: Export performance data in structured CSV format
- **Trend Analysis**: Visualize performance trends with interactive charts
- **Comprehensive Metrics**: Track output throughput, E2E latency, TTFT, accept length, and more
- **Time-Based Sampling**: Intelligent sampling strategy to cover extended time periods (up to 30 days) with limited API calls

### Test Balance Analyzer (`ci_analyzer_balance.py`)

- **Time Gap Analysis**: Identify GPU tests with large gaps between elapsed and estimated times
- **CI Balancing**: Help optimize CI by identifying tests that need time adjustments
- **Gap Tracking**: Track maximum time gaps for each test across multiple CI runs
- **PR Test Focus**: Only analyzes GPU jobs from the pr-test.yml workflow (excludes AMD and other workflows)
- **Ranking System**: Sort tests by time gap severity to prioritize adjustments
- **CSV Export**: Export analysis results in CSV format for easy review
- **GitHub Integration**: Generate GitHub Actions summaries with recommendations

### Failures Analyzer (`ci_failures_analysis.py`)

- **Consecutive Failure Tracking**: Identify jobs currently failing
- **Runner Health Monitoring**: Track runner failure rates and identify problematic infrastructure
- **Multi-Workflow Support**: Monitors PR Test (Nvidia), PR Test (AMD), and PR Test (Xeon) workflows
- **Queue Time Tracking**: Monitor average and P90 queue times per runner type
- **Alert System**:
  Automatic alerts for consecutive failures and runner problems
- **Instance Tracking**: Monitor specific runner instances for targeted remediation
- **Slack Notifications**: Send condensed alerts to Slack (top 3 jobs/runners by consecutive failures and failure rates)
- **GitHub Integration**: Generate comprehensive summaries with actionable recommendations
- **JSON Export**: Export detailed analysis data for further processing

### Common Features

- **Automated Monitoring**: GitHub Actions workflow for continuous CI and performance monitoring

## Installation

### For CI Analyzer

No additional dependencies required beyond the Python standard library and `requests`:

```bash
pip install requests
```

### For Performance Analyzer

Additional dependencies required for chart generation:

```bash
pip install requests matplotlib pandas
```

### For Test Balance Analyzer

No additional dependencies required beyond the Python standard library and `requests`:

```bash
pip install requests
```

## Usage

### CI Analyzer

#### Basic Usage

```bash
# Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens
python ci_analyzer.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze the last 1000 runs
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output file
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json
```

### Performance Analyzer

#### Basic Usage

```bash
# Analyze performance trends from recent CI runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze the last 1000 PR Test runs (auto-enables uniform sampling for ~30 days of coverage)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output directory
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data

# A limit of 500 reaches the sampling threshold, so uniform sampling is enabled
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500

# Get ALL performance data within a specific date range (recommended for historical analysis)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date 2024-12-01 --end-date 2024-12-31

# Get complete data for the last week
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date $(date -d '7 days ago' +%Y-%m-%d) --end-date $(date +%Y-%m-%d)

# Upload results to a GitHub repository for sharing
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000 --upload-to-github
```

### Test Balance Analyzer

#### Basic Usage

```bash
# Analyze PR Test GPU job time gaps from recent CI runs
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze the last 1000 PR Test GPU CI runs for comprehensive test balance analysis
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output file
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_balance_analysis.json
```

### Failures Analyzer

#### Quick Start

```bash
# Set the token as an environment variable (recommended for security)
export GITHUB_TOKEN="your_token_here"

# Quick test with recent runs
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 50 --threshold 2

# Standard analysis (same as the automated workflow)
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 300 --threshold 2

# Deep analysis
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 500 --threshold 3
```

#### Monitored Workflows

The Failures Analyzer monitors the following workflows:

- **PR Test** - Nvidia GPU tests (self-hosted runners: 1-gpu-runner, 4-gpu-h100-runner, etc.)
- **PR Test (AMD)** - AMD GPU tests (AMD-specific runners)
- **PR Test (Xeon)** - Intel Xeon CPU tests (Xeon-specific runners)

All three workflows are analyzed together, with runner statistics tracked separately by runner type.

#### Slack Notifications

The Failures Analyzer can send condensed alerts to Slack.
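
Under the hood, a Slack incoming webhook is just an HTTP POST of a JSON body with a `text` field. The sketch below shows how a condensed alert could be built and posted; the helper names are hypothetical illustrations, not the actual `slack_notifier.py` API:

```python
import requests


def build_slack_payload(title: str, items: list) -> dict:
    """Build the JSON body expected by a Slack incoming webhook."""
    lines = [f"*{title}*"] + [f"- {item}" for item in items]
    return {"text": "\n".join(lines)}


def post_slack_alert(webhook_url: str, payload: dict) -> bool:
    """POST the payload to the webhook; Slack answers HTTP 200 on success."""
    resp = requests.post(webhook_url, json=payload, timeout=30)
    return resp.status_code == 200
```

For example, `post_slack_alert(os.environ["SLACK_WEBHOOK_URL"], build_slack_payload("Consecutive failures", ["unit-test-backend: 3 in a row"]))` would post a two-line message.
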
See [SLACK_SETUP.md](SLACK_SETUP.md) for complete setup instructions.

**What gets sent:**

- Top 3 jobs with consecutive failures
- Top 3 runners with consecutive failures
- Top 3 jobs with the highest total failure rate
- Top 3 runners with the highest total failure rate
- Queue time summary

```bash
# Send a Slack notification from the analysis JSON
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
python slack_notifier.py --json ci_failure_analysis.json
```

#### Understanding the Output

The script generates a **2-section report**:

**Section 1: Currently Broken Jobs (Active Consecutive Failures)**

- Shows consecutive failure streaks
- These need immediate attention

**Section 2: Runner Health Analysis**

- Shows which runners have high failure rates
- Includes queue time metrics (average and P90)
- Helps identify infrastructure vs. code issues

#### Alert Types

**Job Alerts (Consecutive Failures):**

- Triggered when a job fails ≥ threshold times in a row
- Example: threshold=2, job fails 3 times → ALERT

**Runner Alerts:**

- **Runner Health**: Runner has a >30% failure rate with ≥2 different jobs failing
- **Runner Instance**: A specific instance has a >50% failure rate with ≥3 jobs failing

#### Output Files

- **Console**: Human-readable 2-section report (always generated)
- **JSON**: Detailed data (optional, only if `--output` is specified)
- **GitHub Summary**: Markdown (automatically generated in GitHub Actions)

**Important**: Make sure your GitHub token has the `repo` and `workflow` permissions, otherwise you'll get 404 errors.

## Data Collection Strategies

The Performance Analyzer offers multiple strategies for collecting performance data to suit different analysis needs.

### 1. Uniform Sampling Strategy

**When to use**: Daily monitoring and trend analysis over extended periods.

- **Automatically enabled** when `--limit >= 500`
- **Disabled** for smaller limits (< 500) to maintain backward compatibility

#### How it works:

- Collects data uniformly across a 30-day period
- Ensures even time distribution of samples
- Provides consistent coverage for trend analysis

#### Example with 1000 Runs:

- **Time Range**: Last 30 days
- **Distribution**: 1000 samples evenly distributed across the period
- **Coverage**: ~33 samples per day on average

### 2. Date Range Collection

**When to use**: Historical analysis, specific period investigation, or complete data collection.

Use the `--start-date` and `--end-date` parameters to get **ALL** CI runs within a specific time range.

#### Features:

- **Complete Data**: Gets every CI run in the specified range (no sampling)
- **No Limit**: Ignores the `--limit` parameter
- **Flexible Range**: Specify any date range you need
- **Historical Analysis**: Perfect for investigating specific time periods

#### Date Format:

- Use `YYYY-MM-DD` format (e.g., `2024-12-01`)
- Both parameters are optional:
  - Only `--start-date`: Gets all runs from that date to now
  - Only `--end-date`: Gets all runs from 30 days ago to that date
  - Both: Gets all runs in the specified range

### 3. Sequential Collection (Traditional)

**When to use**: Quick checks or when you only need recent data.

- **Default behavior** for `--limit < 500`
- Gets the most recent CI runs in chronological order
- Fast and simple for immediate analysis

### Comparison

| Strategy | Use Case | Time Coverage | Data Completeness | API Efficiency |
|----------|----------|---------------|-------------------|----------------|
| **Uniform Sampling** | Daily monitoring, trends | ~30 days | Sampled | High |
| **Date Range** | Historical analysis | Any range | Complete | Variable |
| **Sequential** | Quick checks | 3-4 days | Complete (recent) | High |

### Benefits

- **Flexible Analysis**: Choose the right strategy for your needs
- **Extended Coverage**: Up to 30 days with sampling, unlimited with date ranges
- **Complete Data**: Get every run in a specific period when needed
- **API Efficiency**: Optimized for different usage patterns

## Parameters

### CI Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of CI runs to analyze |
| `--output` | ci_analysis.json | Output JSON file for detailed data |

### Performance Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of PR Test runs to analyze (ignored when using a date range) |
| `--output-dir` | performance_tables | Output directory for CSV tables and PNG charts |
| `--start-date` | None | Start date for date range query (YYYY-MM-DD format) |
| `--end-date` | None | End date for date range query (YYYY-MM-DD format) |
| `--upload-to-github` | False | Upload results to the sglang-bot/sglang-ci-data repository |

### Test Balance Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 1000 | Number of CI runs to analyze |
| `--output` | test_balance_report.json | Output JSON file for detailed analysis data |

### Failures Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 500 | Number of workflow runs to analyze |
| `--threshold` | 3 | Alert threshold for consecutive failures |
| `--output` | None | Output JSON file (optional, only written if specified) |

## Getting a GitHub Token

1. Go to [GitHub Settings > Personal Access Tokens](https://github.com/settings/tokens)
2. Click "Generate new token" > "Generate new token (classic)"
3. **Important**: Select the following permissions:
   - `repo` (Full control of private repositories) - **Required for accessing repository data**
   - `workflow` (Update GitHub Action workflows) - **Required for reading CI/CD data**
4. Copy the generated token and use it as `YOUR_GITHUB_TOKEN`

**Note**: Without the `repo` and `workflow` permissions, the tool will not be able to access CI run data and will return 404 errors.
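
To confirm a token has the right scopes before running the tools, you can query the workflow-runs endpoint of GitHub's REST API directly (`GET /repos/{owner}/{repo}/actions/runs`). A minimal sketch, assuming the repository is `sgl-project/sglang` (helper names are illustrative, not part of the toolkit):

```python
import requests

API_BASE = "https://api.github.com"


def runs_url(repo: str) -> str:
    """Workflow-runs endpoint for a repository."""
    return f"{API_BASE}/repos/{repo}/actions/runs"


def auth_headers(token: str) -> dict:
    """Standard authentication headers for the GitHub REST API."""
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }


def token_can_read_ci(token: str, repo: str = "sgl-project/sglang") -> bool:
    """Return True if the token can list workflow runs for the repo.

    GitHub returns 404 (rather than 403) when a token cannot access a
    repository, which is why missing scopes surface as 404 errors.
    """
    resp = requests.get(
        runs_url(repo),
        headers=auth_headers(token),
        params={"per_page": 1},
        timeout=30,
    )
    return resp.status_code == 200
```

If `token_can_read_ci(YOUR_GITHUB_TOKEN)` returns `False`, regenerate the token with the `repo` and `workflow` scopes before retrying.
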