# Scripts Directory

This directory contains utility scripts for the Pico training framework.

## generate_data.py

Generates `data.json` for the dashboard from training log files.

### What it does

This script parses log files from the `runs/` directory and extracts:
- **Training metrics**: Loss, learning rate, and inf/NaN counts at each step
- **Evaluation results**: Paloma evaluation metrics
- **Model configuration**: Architecture parameters (d_model, n_layers, etc.)

### Usage

```bash
# Generate data.json from the default runs directory
python scripts/generate_data.py

# Specify custom runs directory
python scripts/generate_data.py --runs-dir /path/to/runs

# Specify custom output file
python scripts/generate_data.py --output /path/to/output.json
```
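For orientation, the two flags shown above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the script's actual code: the function name `build_parser`, the help strings, and both default values (`runs` and `data.json`) are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the two documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Generate data.json from training log files."
    )
    # Assumed default: a runs/ directory relative to the working directory.
    parser.add_argument("--runs-dir", default="runs",
                        help="directory containing one subdirectory per run")
    # Assumed default: data.json in the working directory.
    parser.add_argument("--output", default="data.json",
                        help="path of the JSON file to write")
    return parser
```

Calling `build_parser().parse_args()` yields an object with `runs_dir` and `output` attributes, since `argparse` converts the dash in `--runs-dir` to an underscore.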

### How it works

1. **Scans runs directory**: Looks for subdirectories containing training runs
2. **Finds log files**: Locates `.log` files in each run's `logs/` subdirectory
3. **Parses log content**: Uses regex patterns to extract structured data
4. **Generates JSON**: Creates a structured JSON file for the dashboard
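
The first two steps can be sketched with `pathlib`. This is an illustrative approximation under the directory layout described above (`runs/<run>/logs/*.log`); the function name `find_log_files` is hypothetical and the real script may discover files differently.

```python
from pathlib import Path


def find_log_files(runs_dir):
    """Map each run name to the sorted .log files in its logs/ subdirectory."""
    found = {}
    root = Path(runs_dir)
    if not root.is_dir():
        return found
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        logs_dir = run_dir / "logs"
        if not logs_dir.is_dir():
            continue  # a run without a logs/ directory is skipped
        logs = sorted(logs_dir.glob("*.log"))
        if logs:
            found[run_dir.name] = logs
    return found
```

Runs without a `logs/` directory, or with no `.log` files in it, are simply left out of the result.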

### Log Format Requirements

The script expects log files with the following format:

```
2025-08-29 02:09:12 - pico-train - INFO - Step 500 -- 🔄 Training Metrics
2025-08-29 02:09:12 - pico-train - INFO - ├── Loss: 10.8854
2025-08-29 02:09:12 - pico-train - INFO - ├── Learning Rate: 3.13e-06
2025-08-29 02:09:12 - pico-train - INFO - └── Inf/NaN count: 0
```

Evaluation results are expected in this format:

```
2025-08-29 02:15:26 - pico-train - INFO - Step 1000 -- 📊 Evaluation Results
2025-08-29 02:15:26 - pico-train - INFO - └── paloma: 7.125172406420199e+27
```
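
To make the format requirements concrete, here is one way entries like these could be parsed with regular expressions. It is a sketch only: the patterns, the `FIELDS` table, and the function name `parse_log` are assumptions for illustration and are not taken from `generate_data.py`.

```python
import re

# A header line opens a new block; whatever sits between "--" and the title
# (an emoji in the samples above) is skipped by the lazy ".*?".
HEADER_RE = re.compile(r"Step (\d+) -- .*?(Training Metrics|Evaluation Results)")

# Output key -> (pattern, converter) for the value lines inside a block.
FIELDS = {
    "loss": (re.compile(r"Loss: (\S+)"), float),
    "learning_rate": (re.compile(r"Learning Rate: (\S+)"), float),
    "inf_nan_count": (re.compile(r"Inf/NaN count: (\d+)"), int),
    "paloma": (re.compile(r"paloma: (\S+)"), float),
}


def parse_log(text):
    """Return (training_metrics, evaluation_results) as lists of dicts."""
    training, evaluation = [], []
    current = None
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            current = {"step": int(header.group(1))}
            is_training = header.group(2) == "Training Metrics"
            (training if is_training else evaluation).append(current)
            continue
        if current is None:
            continue  # value lines before any header are ignored
        for key, (pattern, convert) in FIELDS.items():
            match = pattern.search(line)
            if match:
                current[key] = convert(match.group(1))
    return training, evaluation
```

Each header line starts a new entry, and the value lines that follow are attached to it until the next header appears.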

### Output Format

The generated `data.json` has this structure:

```json
{
  "runs": [
    {
      "run_name": "model-name",
      "log_file": "log_filename.log",
      "training_metrics": [
        {
          "step": 0,
          "loss": 10.9914,
          "learning_rate": 0.0,
          "inf_nan_count": 0
        }
      ],
      "evaluation_results": [
        {
          "step": 1000,
          "paloma": 59434.76600609756
        }
      ],
      "config": {
        "d_model": 96,
        "n_layers": 12,
        "max_seq_len": 2048,
        "vocab_size": 50304,
        "lr": 0.0003,
        "max_steps": 200000,
        "batch_size": 8
      }
    }
  ],
  "summary": {
    "total_runs": 1,
    "run_names": ["model-name"]
  }
}
```
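
In the structure above, the `summary` block is derivable from the `runs` list. The sketch below shows that relationship; `build_output` and `write_output` are hypothetical helper names, and the indentation setting is an arbitrary choice rather than the script's documented behavior.

```python
import json


def build_output(runs):
    """Wrap a list of per-run dicts in the top-level structure shown above."""
    return {
        "runs": runs,
        "summary": {
            "total_runs": len(runs),
            "run_names": [run["run_name"] for run in runs],
        },
    }


def write_output(data, path):
    """Serialize the structure to a JSON file (2-space indent is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
```

Deriving the summary from the list in one place keeps `total_runs` and `run_names` consistent with the runs actually written.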

### When to use

- **After training**: Generate updated dashboard data
- **Adding new runs**: Include new training sessions in the dashboard
- **Debugging**: Verify log parsing is working correctly
- **Dashboard setup**: Initial setup of the training metrics dashboard

### Troubleshooting

If the script doesn't find any data:
1. Check that log files exist in `runs/*/logs/`
2. Verify log format matches the expected pattern
3. Ensure log files contain training metrics entries
4. Check file permissions and encoding
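
These checks can be run by hand, or with a small helper such as the one below. It is a rough sketch, not part of the framework: the name `diagnose` is hypothetical, it assumes the logs are UTF-8 encoded, and it only tests for the literal text `Training Metrics` instead of validating the full format.

```python
from pathlib import Path


def diagnose(runs_dir):
    """Return a list of human-readable problems found under runs_dir."""
    problems = []
    log_files = sorted(Path(runs_dir).glob("*/logs/*.log"))
    if not log_files:
        problems.append(f"no .log files found under {runs_dir}/*/logs/")
    for path in log_files:
        try:
            # Assumption: logs are UTF-8; adjust if yours use another encoding.
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            problems.append(f"{path}: cannot read ({type(exc).__name__})")
            continue
        if "Training Metrics" not in text:
            problems.append(f"{path}: no 'Training Metrics' entries")
    return problems
```

An empty list means every log file was readable and contained at least one training metrics header.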