SGLang CI Monitor

Note: This README.md is primarily generated by Claude 4 with some manual adjustments.

A toolkit for analyzing CI failures and performance trends in the SGLang project. It includes four main tools:

  1. CI Analyzer (ci_analyzer.py): Analyzes CI failures and provides detailed failure pattern analysis
  2. Performance Analyzer (ci_analyzer_perf.py): Tracks performance metrics over time and generates trend charts
  3. Test Balance Analyzer (ci_analyzer_balance.py): Analyzes gaps between elapsed and estimated test times to help balance CI
  4. Failures Analyzer (ci_failures_analysis.py): Tracks consecutive failures, identifies flaky jobs, and monitors runner health

Features

CI Analyzer (ci_analyzer.py)

  • Simple Analysis: Analyze recent CI runs and identify failure patterns
  • Category Classification: Automatically categorize failures by type (unit-test, performance, etc.)
  • Pattern Recognition: Identify common failure patterns (timeouts, build failures, etc.)
  • CI Links: Direct links to recent failed CI runs for detailed investigation
  • Last Success Tracking: Track the last successful run for each failed job with PR information
  • JSON Export: Export detailed analysis data to JSON format

Performance Analyzer (ci_analyzer_perf.py)

  • Performance Tracking: Monitor performance metrics across CI runs over time
  • Automated Chart Generation: Generate time-series charts for each performance metric
  • Multi-Test Support: Track performance for all test types (throughput, latency, accuracy)
  • CSV Export: Export performance data in structured CSV format
  • Trend Analysis: Visualize performance trends with interactive charts
  • Comprehensive Metrics: Track output throughput, E2E latency, TTFT, accept length, and more
  • Time-Based Sampling: Intelligent sampling strategy to cover extended time periods (up to 30 days) with limited API calls

Test Balance Analyzer (ci_analyzer_balance.py)

  • Time Gap Analysis: Identify GPU tests with large gaps between elapsed and estimated times
  • CI Balancing: Help optimize CI by identifying tests that need time adjustments
  • Gap Tracking: Track maximum time gaps for each test across multiple CI runs
  • PR Test Focus: Only analyzes GPU jobs from pr-test.yml workflow (excludes AMD and other workflows)
  • Ranking System: Sort tests by time gap severity to prioritize adjustments
  • CSV Export: Export analysis results in CSV format for easy review
  • GitHub Integration: Generate GitHub Actions summaries with recommendations

Failures Analyzer (ci_failures_analysis.py)

  • Consecutive Failure Tracking: Identify jobs with active consecutive-failure streaks
  • Runner Health Monitoring: Track runner failure rates and identify problematic infrastructure
  • Multi-Workflow Support: Monitors PR Test (Nvidia), PR Test (AMD), and PR Test (Xeon) workflows
  • Queue Time Tracking: Monitor average and P90 queue times per runner type
  • Alert System: Automatic alerts for consecutive failures and runner problems
  • Instance Tracking: Monitor specific runner instances for targeted remediation
  • Slack Notifications: Send condensed alerts to Slack (top 3 jobs/runners by consecutive failures and failure rates)
  • GitHub Integration: Generate comprehensive summaries with actionable recommendations
  • JSON Export: Export detailed analysis data for further processing

Common Features

  • Automated Monitoring: GitHub Actions workflow for continuous CI and performance monitoring

Installation

For CI Analyzer

Only requests is required beyond the Python standard library:

pip install requests

For Performance Analyzer

Additional dependencies required for chart generation:

pip install requests matplotlib pandas

For Test Balance Analyzer

Only requests is required beyond the Python standard library:

pip install requests

Usage

CI Analyzer

Basic Usage

# Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens
python ci_analyzer.py --token YOUR_GITHUB_TOKEN

Advanced Usage

# Analyze last 1000 runs
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output file
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json

Performance Analyzer

Basic Usage

# Analyze performance trends from recent CI runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN

Advanced Usage

# Analyze last 1000 PR Test runs (auto-enables uniform sampling for ~30 days coverage)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output directory
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data

# Limits below 500 stay in sequential mode (no uniform sampling)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 400

# Get ALL performance data within a specific date range (recommended for historical analysis)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date 2024-12-01 --end-date 2024-12-31

# Get complete data for the last week
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date $(date -d '7 days ago' +%Y-%m-%d) --end-date $(date +%Y-%m-%d)

# Upload results to GitHub repository for sharing
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000 --upload-to-github

Test Balance Analyzer

Basic Usage

# Analyze PR Test GPU job time gaps from recent CI runs
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN

Advanced Usage

# Analyze last 1000 PR Test GPU CI runs for comprehensive test balance analysis
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output file
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_balance_analysis.json

Failures Analyzer

Quick Start

# Set token as environment variable (recommended for security)
export GITHUB_TOKEN="your_token_here"

# Quick test with recent runs
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 50 --threshold 2

# Standard analysis (same as automated workflow)
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 300 --threshold 2

# Deep analysis
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 500 --threshold 3

Monitored Workflows

The Failures Analyzer monitors the following workflows:

  • PR Test - Nvidia GPU tests (self-hosted runners: 1-gpu-runner, 4-gpu-h100-runner, etc.)
  • PR Test (AMD) - AMD GPU tests (AMD-specific runners)
  • PR Test (Xeon) - Intel Xeon CPU tests (Xeon-specific runners)

All three workflows are analyzed together, with runner statistics tracked separately by runner type.
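
In code terms, run filtering amounts to matching on these workflow names. A minimal sketch (illustrative only; the names are taken from the list above):

# Illustrative: keep only runs from the three monitored workflows
MONITORED_WORKFLOWS = {"PR Test", "PR Test (AMD)", "PR Test (Xeon)"}

def is_monitored(run):
    return run.get("name") in MONITORED_WORKFLOWS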

Slack Notifications

The Failures Analyzer can send condensed alerts to Slack. See SLACK_SETUP.md for complete setup instructions.

What gets sent:

  • Top 3 jobs with consecutive failures
  • Top 3 runners with consecutive failures
  • Top 3 jobs with highest total failure rate
  • Top 3 runners with highest total failure rate
  • Queue time summary

# Send Slack notification from analysis JSON
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
python slack_notifier.py --json ci_failure_analysis.json
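
For reference, a rough sketch of how a "top 3" list could be derived from the exported JSON (the field names below are assumptions, not the tool's actual schema):

# Illustrative only: pick the three jobs with the longest consecutive-failure streaks
import json

with open("ci_failure_analysis.json") as f:
    data = json.load(f)

top_jobs = sorted(data.get("jobs", []),
                  key=lambda j: j.get("consecutive_failures", 0),
                  reverse=True)[:3]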

Understanding the Output

The script generates a 2-section report:

Section 1: Currently Broken Jobs (Active Consecutive Failures)

  • Shows consecutive failure streaks
  • These need immediate attention

Section 2: Runner Health Analysis

  • Shows which runners have high failure rates
  • Includes queue time metrics (average and P90)
  • Helps identify infrastructure vs code issues
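
For reference, a minimal sketch of how the average and P90 queue times could be computed per runner type (queue time taken as the delay between a job being queued and started; illustrative, not the tool's exact implementation):

# Illustrative: average and 90th-percentile queue time for one runner type
# queue_times_s is assumed to be a list of queue durations in seconds
import statistics

def queue_time_stats(queue_times_s):
    avg = sum(queue_times_s) / len(queue_times_s)
    p90 = statistics.quantiles(queue_times_s, n=10)[-1]  # 90th-percentile cut point
    return avg, p90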

Alert Types

Job Alerts (Consecutive Failures):

  • Triggered when a job fails ≥ threshold times in a row
  • Example: threshold=2, job fails 3 times → ALERT

Runner Alerts:

  • Runner Health: Runner has >30% failure rate with ≥2 different jobs failing
  • Runner Instance: Specific instance has >50% failure rate with ≥3 jobs
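
The rules above can be summarized in a short sketch (the inputs here are assumptions for illustration, not the script's actual data structures):

# Illustrative alert checks mirroring the rules described above
def job_alert(consecutive_failures, threshold):
    # Job alert: fails >= threshold times in a row
    return consecutive_failures >= threshold

def runner_health_alert(failure_rate, distinct_failing_jobs):
    # Runner health: >30% failure rate with >=2 different jobs failing
    return failure_rate > 0.30 and distinct_failing_jobs >= 2

def runner_instance_alert(failure_rate, job_count):
    # Runner instance: >50% failure rate with >=3 jobs
    return failure_rate > 0.50 and job_count >= 3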

Output Files

  • Console: Human-readable 2-section report (always generated)
  • JSON: Detailed data (optional, only if --output is specified)
  • GitHub Summary: Markdown (automatically generated in GitHub Actions)

Important: Make sure your GitHub token has repo and workflow permissions, otherwise you'll get 404 errors.

Data Collection Strategies

The Performance Analyzer offers multiple strategies for collecting performance data to suit different analysis needs.

1. Uniform Sampling Strategy

When to use: Daily monitoring and trend analysis over extended periods.

  • Automatically enabled when --limit >= 500
  • Disabled for smaller limits (< 500) to maintain backward compatibility

How it works:

  • Collects data uniformly across a 30-day period
  • Ensures even time distribution of samples
  • Provides consistent coverage for trend analysis

Example with 1000 Runs:

  • Time Range: Last 30 days
  • Distribution: 1000 samples evenly distributed across the period
  • Coverage: ~33 samples per day on average
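
A rough sketch of the idea (illustrative only; the analyzer's actual selection logic may differ):

# Illustrative: pick `limit` runs spread evenly across the last `days` days
from datetime import datetime, timedelta, timezone

def uniform_sample(runs, limit, days=30):
    # runs: list of dicts with an ISO 8601 "created_at" timestamp, newest first
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    in_window = [r for r in runs
                 if datetime.fromisoformat(r["created_at"].replace("Z", "+00:00")) >= cutoff]
    if len(in_window) <= limit:
        return in_window
    step = len(in_window) / limit  # even spacing by index approximates even time coverage
    return [in_window[int(i * step)] for i in range(limit)]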

2. Date Range Collection

When to use: Historical analysis, specific period investigation, or complete data collection.

Use --start-date and --end-date parameters to get ALL CI runs within a specific time range.

Features:

  • Complete Data: Gets every CI run in the specified range (no sampling)
  • No Limit: Ignores the --limit parameter
  • Flexible Range: Specify any date range you need
  • Historical Analysis: Perfect for investigating specific time periods

Date Format:

  • Use YYYY-MM-DD format (e.g., 2024-12-01)
  • Both parameters are optional:
    • Only --start-date: Gets all runs from that date to now
    • Only --end-date: Gets all runs from 30 days ago to that date
    • Both: Gets all runs in the specified range
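
The defaults above could be expressed roughly as follows (an illustrative sketch, not the analyzer's actual code):

# Illustrative: resolve the effective date range from optional arguments
from datetime import datetime, timedelta

def resolve_range(start_date=None, end_date=None):
    now = datetime.now()
    end = datetime.strptime(end_date, "%Y-%m-%d") if end_date else now
    start = datetime.strptime(start_date, "%Y-%m-%d") if start_date else now - timedelta(days=30)
    return start, end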

3. Sequential Collection (Traditional)

When to use: Quick checks or when you only need recent data.

  • Default behavior for --limit < 500
  • Gets the most recent CI runs in chronological order
  • Fast and simple for immediate analysis

Comparison

Strategy          Use Case                  Time Coverage  Data Completeness  API Efficiency
Uniform Sampling  Daily monitoring, trends  ~30 days       Sampled            High
Date Range        Historical analysis       Any range      Complete           Variable
Sequential        Quick checks              3-4 days       Complete (recent)  High

Benefits

  • Flexible Analysis: Choose the right strategy for your needs
  • Extended Coverage: Up to 30 days with sampling, unlimited with date ranges
  • Complete Data: Get every run in a specific period when needed
  • API Efficiency: Optimized for different use patterns

Parameters

CI Analyzer Parameters

Parameter  Default           Description
--token    Required          GitHub Personal Access Token
--limit    100               Number of CI runs to analyze
--output   ci_analysis.json  Output JSON file for detailed data

Performance Analyzer Parameters

Parameter           Default             Description
--token             Required            GitHub Personal Access Token
--limit             100                 Number of PR Test runs to analyze (ignored when using date range)
--output-dir        performance_tables  Output directory for CSV tables and PNG charts
--start-date        None                Start date for date range query (YYYY-MM-DD format)
--end-date          None                End date for date range query (YYYY-MM-DD format)
--upload-to-github  False               Upload results to sglang-bot/sglang-ci-data repository

Test Balance Analyzer Parameters

Parameter  Default                   Description
--token    Required                  GitHub Personal Access Token
--limit    1000                      Number of CI runs to analyze
--output   test_balance_report.json  Output JSON file for detailed analysis data

Failures Analyzer Parameters

Parameter    Default   Description
--token      Required  GitHub Personal Access Token
--limit      500       Number of workflow runs to analyze
--threshold  3         Alert threshold for consecutive failures
--output     None      Output JSON file (optional, only written if specified)

Getting GitHub Token

  1. Go to GitHub Settings > Personal Access Tokens
  2. Click "Generate new token" > "Generate new token (classic)"
  3. Important: Select the following permissions:
    • repo (Full control of private repositories) - Required for accessing repository data
    • workflow (Update GitHub Action workflows) - Required for reading CI/CD data
  4. Copy the generated token and use it as YOUR_GITHUB_TOKEN

Note: Without the repo and workflow permissions, the tools cannot access CI run data and API requests will fail with 404 errors.
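
To verify that a token works before running the analyzers, a quick check like the following can help (illustrative; the repository path is assumed to be sgl-project/sglang):

# Illustrative token check: list one workflow run from the SGLang repository
import requests

TOKEN = "YOUR_GITHUB_TOKEN"
resp = requests.get(
    "https://api.github.com/repos/sgl-project/sglang/actions/runs",
    headers={"Authorization": f"Bearer {TOKEN}",
             "Accept": "application/vnd.github+json"},
    params={"per_page": 1},
)
print(resp.status_code)  # 200 means access is OK; 404 usually means missing token scopes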