# SGLang CI Monitor

> **Note**: This README.md is primarily generated by Claude 4 with some manual adjustments.

A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes four main tools:

1. **CI Analyzer** (`ci_analyzer.py`): Analyzes CI failures and provides detailed failure pattern analysis
2. **Performance Analyzer** (`ci_analyzer_perf.py`): Tracks performance metrics over time and generates trend charts
3. **Test Balance Analyzer** (`ci_analyzer_balance.py`): Analyzes test time gaps between elapsed and estimated times to help balance CI
4. **Failures Analyzer** (`ci_failures_analysis.py`): Tracks consecutive failures, identifies flaky jobs, and monitors runner health

## Features

### CI Analyzer (`ci_analyzer.py`)
- **Simple Analysis**: Analyze recent CI runs and identify failure patterns
- **Category Classification**: Automatically categorize failures by type (unit-test, performance, etc.)
- **Pattern Recognition**: Identify common failure patterns (timeouts, build failures, etc.)
- **CI Links**: Direct links to recent failed CI runs for detailed investigation
- **Last Success Tracking**: Track the last successful run for each failed job with PR information
- **JSON Export**: Export detailed analysis data to JSON format

### Performance Analyzer (`ci_analyzer_perf.py`)
- **Performance Tracking**: Monitor performance metrics across CI runs over time
- **Automated Chart Generation**: Generate time-series charts for each performance metric
- **Multi-Test Support**: Track performance for all test types (throughput, latency, accuracy)
- **CSV Export**: Export performance data in structured CSV format
- **Trend Analysis**: Visualize performance trends with interactive charts
- **Comprehensive Metrics**: Track output throughput, E2E latency, TTFT, accept length, and more
- **Time-Based Sampling**: Intelligent sampling strategy to cover extended time periods (up to 30 days) with limited API calls

### Test Balance Analyzer (`ci_analyzer_balance.py`)
- **Time Gap Analysis**: Identify GPU tests with large gaps between elapsed and estimated times
- **CI Balancing**: Help optimize CI by identifying tests that need time adjustments
- **Gap Tracking**: Track maximum time gaps for each test across multiple CI runs
- **PR Test Focus**: Only analyzes GPU jobs from the `pr-test.yml` workflow (excludes AMD and other workflows)
- **Ranking System**: Sort tests by time gap severity to prioritize adjustments
- **CSV Export**: Export analysis results in CSV format for easy review
- **GitHub Integration**: Generate GitHub Actions summaries with recommendations
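
The gap tracking and ranking described above could look roughly like the following. This is a minimal sketch, not the code in `ci_analyzer_balance.py`: the function name `rank_time_gaps`, the record layout, and the choice to measure the gap as an absolute difference are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_time_gaps(records):
    """Rank tests by their largest elapsed-vs-estimated gap.

    `records` is an iterable of (test_name, elapsed_seconds, estimated_seconds)
    tuples gathered across several CI runs (hypothetical layout).
    """
    max_gap = defaultdict(float)
    for name, elapsed, estimated in records:
        gap = abs(elapsed - estimated)  # assumption: gap is an absolute difference
        if gap > max_gap[name]:
            max_gap[name] = gap
    # Largest gap first, so the tests most in need of adjustment come first.
    return sorted(max_gap.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data: test_a.py appears in two CI runs.
records = [
    ("test_a.py", 120.0, 60.0),
    ("test_b.py", 30.0, 45.0),
    ("test_a.py", 90.0, 60.0),
]
ranking = rank_time_gaps(records)
```

Keeping only the maximum gap per test means a single badly estimated run is enough to surface a test in the ranking.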

### Failures Analyzer (`ci_failures_analysis.py`)
- **Consecutive Failure Tracking**: Identify jobs currently failing
- **Runner Health Monitoring**: Track runner failure rates and identify problematic infrastructure
- **Multi-Workflow Support**: Monitors PR Test (Nvidia), PR Test (AMD), and PR Test (Xeon) workflows
- **Queue Time Tracking**: Monitor average and P90 queue times per runner type
- **Alert System**: Automatic alerts for consecutive failures and runner problems
- **Instance Tracking**: Monitor specific runner instances for targeted remediation
- **Slack Notifications**: Send condensed alerts to Slack (top 3 jobs/runners by consecutive failures and failure rates)
- **GitHub Integration**: Generate comprehensive summaries with actionable recommendations
- **JSON Export**: Export detailed analysis data for further processing

### Common Features
- **Automated Monitoring**: GitHub Actions workflow for continuous CI and performance monitoring

## Installation

### For CI Analyzer
Requires only the Python standard library and `requests`:

```bash
pip install requests
```

### For Performance Analyzer
Additional dependencies required for chart generation:

```bash
pip install requests matplotlib pandas
```

### For Test Balance Analyzer
Requires only the Python standard library and `requests`:

```bash
pip install requests
```

## Usage

### CI Analyzer

#### Basic Usage

```bash
# Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens
python ci_analyzer.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 runs
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output file
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json
```

### Performance Analyzer

#### Basic Usage

```bash
# Analyze performance trends from recent CI runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 PR Test runs (auto-enables uniform sampling for ~30 days coverage)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output directory
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data

# Use 500 runs (meets the --limit >= 500 threshold, so uniform sampling is enabled)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500

# Get ALL performance data within a specific date range (recommended for historical analysis)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date 2024-12-01 --end-date 2024-12-31

# Get complete data for the last week (GNU date syntax; on macOS/BSD use: date -v-7d +%Y-%m-%d)
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date "$(date -d '7 days ago' +%Y-%m-%d)" --end-date "$(date +%Y-%m-%d)"

# Upload results to GitHub repository for sharing
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000 --upload-to-github
```

### Test Balance Analyzer

#### Basic Usage

```bash
# Analyze PR Test GPU job time gaps from recent CI runs
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 PR Test GPU CI runs for comprehensive test balance analysis
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output file
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_balance_analysis.json
```

### Failures Analyzer

#### Quick Start

```bash
# Set token as environment variable (recommended for security)
export GITHUB_TOKEN="your_token_here"

# Quick test with recent runs
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 50 --threshold 2

# Standard analysis (same as automated workflow)
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 300 --threshold 2

# Deep analysis
python ci_failures_analysis.py --token $GITHUB_TOKEN --limit 500 --threshold 3
```

#### Monitored Workflows

The Failures Analyzer monitors the following workflows:

- **PR Test** - Nvidia GPU tests (self-hosted runners: 1-gpu-runner, 4-gpu-h100-runner, etc.)
- **PR Test (AMD)** - AMD GPU tests (AMD-specific runners)
- **PR Test (Xeon)** - Intel Xeon CPU tests (Xeon-specific runners)

All three workflows are analyzed together, with runner statistics tracked separately by runner type.

#### Slack Notifications

The Failures Analyzer can send condensed alerts to Slack. See [SLACK_SETUP.md](SLACK_SETUP.md) for complete setup instructions.

**What gets sent:**
- Top 3 jobs with consecutive failures
- Top 3 runners with consecutive failures
- Top 3 jobs with highest total failure rate
- Top 3 runners with highest total failure rate
- Queue time summary

```bash
# Send Slack notification from analysis JSON
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
python slack_notifier.py --json ci_failure_analysis.json
```

#### Understanding the Output

The script generates a **2-section report**:

**Section 1: Currently Broken Jobs (Active Consecutive Failures)**
- Shows consecutive failure streaks
- These need immediate attention

**Section 2: Runner Health Analysis**
- Shows which runners have high failure rates
- Includes queue time metrics (average and P90)
- Helps identify infrastructure vs code issues
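
For reference, the average and P90 queue time metrics can be computed along these lines. This is a sketch under assumptions: the function name `queue_time_summary` is hypothetical, and the nearest-rank percentile definition used here may differ from the one the script uses.

```python
import math
from statistics import mean


def queue_time_summary(queue_seconds):
    """Return the average and P90 of a list of queue times in seconds."""
    if not queue_seconds:
        return None
    ordered = sorted(queue_seconds)
    # Nearest-rank P90: the smallest value with at least 90% of samples
    # at or below it (assumed definition).
    rank = math.ceil(9 * len(ordered) / 10)
    return {"average": mean(ordered), "p90": ordered[rank - 1]}


# Hypothetical queue times for one runner type.
summary = queue_time_summary([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
```

P90 is less sensitive to a few very short waits than the average, which makes it a useful signal for runner capacity problems.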

#### Alert Types

**Job Alerts (Consecutive Failures):**
- Triggered when a job fails ≥ threshold times in a row
- Example: with `--threshold 2`, a job that has failed 3 times in a row triggers an alert

**Runner Alerts:**
- **Runner Health**: Runner has >30% failure rate with ≥2 different jobs failing
- **Runner Instance**: Specific instance has >50% failure rate with ≥3 jobs
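
The job and runner health rules above can be sketched as follows. The function names and the input shapes are assumptions for illustration, not the actual interface of `ci_failures_analysis.py`.

```python
def current_failure_streak(conclusions):
    """Count failures at the end of a history ordered oldest to newest."""
    streak = 0
    for conclusion in reversed(conclusions):
        if conclusion != "failure":
            break  # the streak ends at the most recent non-failure
        streak += 1
    return streak


def job_alert(conclusions, threshold=3):
    """Alert when the current streak reaches the consecutive-failure threshold."""
    return current_failure_streak(conclusions) >= threshold


def runner_health_alert(failed_runs, total_runs, distinct_failed_jobs):
    """Alert on a >30% failure rate with at least 2 different jobs failing."""
    if total_runs == 0:
        return False
    return failed_runs / total_runs > 0.30 and distinct_failed_jobs >= 2


# Hypothetical job history: one success followed by three failures.
history = ["success", "failure", "failure", "failure"]
alert = job_alert(history, threshold=2)
```

Requiring several distinct failing jobs before flagging a runner helps separate infrastructure problems from a single broken test.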

#### Output Files

- **Console**: Human-readable 2-section report (always generated)
- **JSON**: Detailed data (optional, only if `--output` is specified)
- **GitHub Summary**: Markdown (automatically generated in GitHub Actions)

**Important**: Make sure your GitHub token has the `repo` and `workflow` permissions; otherwise, requests will fail with 404 errors.

## Data Collection Strategies

The Performance Analyzer offers multiple strategies for collecting performance data to suit different analysis needs.

### 1. Uniform Sampling Strategy

**When to use**: Daily monitoring and trend analysis over extended periods.

- **Automatically enabled** when `--limit >= 500`
- **Disabled** for smaller limits (< 500) to maintain backward compatibility

#### How it works:
- Collects data uniformly across a 30-day period
- Ensures even time distribution of samples
- Provides consistent coverage for trend analysis

#### Example with 1000 Runs:
- **Time Range**: Last 30 days
- **Distribution**: 1000 samples evenly distributed across the period
- **Coverage**: ~33 samples per day on average
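
One way to get an even time distribution is to split the window into equal time slots and keep one run per slot. The sketch below illustrates that idea only; `sample_uniformly` is a hypothetical name, and the real sampling logic in `ci_analyzer_perf.py` may work differently.

```python
from datetime import datetime, timedelta, timezone


def sample_uniformly(run_times, limit, end, days=30):
    """Return at most `limit` timestamps spread evenly over the last `days` days."""
    start = end - timedelta(days=days)
    slot = (end - start) / limit  # width of one time slot
    chosen = {}
    for ts in sorted(run_times):
        if not (start <= ts < end):
            continue  # outside the sampling window
        index = min(int((ts - start) / slot), limit - 1)
        chosen.setdefault(index, ts)  # keep the earliest run in each slot
    return [chosen[i] for i in sorted(chosen)]


# Hypothetical data: one run per hour for 30 days, sampled down to 30 runs.
end = datetime(2025, 1, 31, tzinfo=timezone.utc)
hourly_runs = [end - timedelta(days=30) + timedelta(hours=h) for h in range(720)]
sampled = sample_uniformly(hourly_runs, limit=30, end=end)
```

Slots with no runs simply contribute nothing, so the result can contain fewer than `limit` entries on quiet days.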

### 2. Date Range Collection

**When to use**: Historical analysis, specific period investigation, or complete data collection.

Use `--start-date` and `--end-date` parameters to get **ALL** CI runs within a specific time range.

#### Features:
- **Complete Data**: Gets every CI run in the specified range (no sampling)
- **No Limit**: Ignores the `--limit` parameter
- **Flexible Range**: Specify any date range you need
- **Historical Analysis**: Perfect for investigating specific time periods

#### Date Format:
- Use `YYYY-MM-DD` format (e.g., `2024-12-01`)
- Both parameters are optional:
  - Only `--start-date`: Gets all runs from that date to now
  - Only `--end-date`: Gets all runs from 30 days ago to that date
  - Both: Gets all runs in the specified range
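
The defaulting rules above can be expressed as a small helper. This is a sketch, not the tool's own parsing code: `resolve_date_range` is a hypothetical name, and it assumes "30 days ago" is measured from today rather than from the end date.

```python
from datetime import date, datetime, timedelta


def resolve_date_range(start_date=None, end_date=None, today=None):
    """Turn optional YYYY-MM-DD strings into a (start, end) pair of dates."""
    if start_date is None and end_date is None:
        return None  # no date range requested
    today = today or date.today()
    # strptime raises ValueError for anything that is not YYYY-MM-DD.
    end = datetime.strptime(end_date, "%Y-%m-%d").date() if end_date else today
    if start_date:
        start = datetime.strptime(start_date, "%Y-%m-%d").date()
    else:
        start = today - timedelta(days=30)  # assumption: relative to today
    if start > end:
        raise ValueError(f"start date {start} is after end date {end}")
    return start, end


# Both bounds given, as in the --start-date/--end-date example.
december = resolve_date_range("2024-12-01", "2024-12-31")
```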

### 3. Sequential Collection (Traditional)

**When to use**: Quick checks or when you only need recent data.

- **Default behavior** for `--limit < 500`
- Gets the most recent CI runs in chronological order
- Fast and simple for immediate analysis

### Comparison

| Strategy | Use Case | Time Coverage | Data Completeness | API Efficiency |
|----------|----------|---------------|-------------------|----------------|
| **Uniform Sampling** | Daily monitoring, trends | ~30 days | Sampled | High |
| **Date Range** | Historical analysis | Any range | Complete | Variable |
| **Sequential** | Quick checks | 3-4 days | Complete (recent) | High |
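
The selection rules behind the three strategies can be summarized in a few lines. This is an illustrative sketch based on the thresholds documented above; `choose_strategy` and the returned labels are hypothetical names.

```python
SAMPLING_THRESHOLD = 500  # documented cut-off for uniform sampling


def choose_strategy(limit=100, start_date=None, end_date=None):
    """Pick a collection strategy from the command-line parameters."""
    if start_date or end_date:
        return "date_range"  # a date range ignores --limit
    if limit >= SAMPLING_THRESHOLD:
        return "uniform_sampling"
    return "sequential"


strategy = choose_strategy(limit=1000)
```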

### Benefits

- **Flexible Analysis**: Choose the right strategy for your needs
- **Extended Coverage**: Up to 30 days with sampling, unlimited with date ranges
- **Complete Data**: Get every run in a specific period when needed
- **API Efficiency**: Optimized for different use patterns

## Parameters

### CI Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of CI runs to analyze |
| `--output` | ci_analysis.json | Output JSON file for detailed data |

### Performance Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of PR Test runs to analyze (ignored when using date range) |
| `--output-dir` | performance_tables | Output directory for CSV tables and PNG charts |
| `--start-date` | None | Start date for date range query (YYYY-MM-DD format) |
| `--end-date` | None | End date for date range query (YYYY-MM-DD format) |
| `--upload-to-github` | False | Upload results to sglang-bot/sglang-ci-data repository |

### Test Balance Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 1000 | Number of CI runs to analyze |
| `--output` | test_balance_report.json | Output JSON file for detailed analysis data |

### Failures Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 500 | Number of workflow runs to analyze |
| `--threshold` | 3 | Alert threshold for consecutive failures |
| `--output` | None | Output JSON file (optional, only writes if specified) |

## Getting GitHub Token

1. Go to [GitHub Settings > Personal Access Tokens](https://github.com/settings/tokens)
2. Click "Generate new token" > "Generate new token (classic)"
3. **Important**: Select the following permissions:
   - `repo` (Full control of private repositories) - **Required for accessing repository data**
   - `workflow` (Update GitHub Action workflows) - **Required for reading CI/CD data**
4. Copy the generated token and use it as `YOUR_GITHUB_TOKEN`

**Note**: Without the `repo` and `workflow` permissions, the tool will not be able to access CI run data and will return 404 errors.
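
To check a token outside the tools, you can build a request against the GitHub REST endpoint that lists workflow runs. The sketch below only constructs the request and sends nothing. It uses the standard library instead of `requests` to stay self-contained, and the helper name `build_runs_request` and the `sgl-project/sglang` repository path are assumptions.

```python
import os
import urllib.parse
import urllib.request


def build_runs_request(token, owner, repo, per_page=100, page=1):
    """Build (but do not send) a request for a repository's workflow runs."""
    query = urllib.parse.urlencode({"per_page": per_page, "page": page})
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runs?{query}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    return urllib.request.Request(url, headers=headers)


# Read the token from the environment rather than hard-coding it.
token = os.environ.get("GITHUB_TOKEN", "YOUR_GITHUB_TOKEN")
request = build_runs_request(token, "sgl-project", "sglang")  # assumed repository
```

Passing `request` to `urllib.request.urlopen` sends it; a 404 response at that point suggests the token lacks the permissions listed above.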