# Scripts Directory

This directory contains utility scripts for the Pico training framework.

## generate_data.py

A script to automatically generate `data.json` from training log files for the dashboard.

### What it does

This script parses log files from the `runs/` directory and extracts:
- **Training metrics**: Loss, learning rate, and inf/NaN counts at each logged step
- **Evaluation results**: Paloma evaluation metrics
- **Model configuration**: Architecture and training parameters (`d_model`, `n_layers`, `lr`, etc.)

### Usage

```bash
# Generate data.json from the default runs directory
python scripts/generate_data.py

# Specify custom runs directory
python scripts/generate_data.py --runs-dir /path/to/runs

# Specify custom output file
python scripts/generate_data.py --output /path/to/output.json
```
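
For orientation, the two flags above could be wired up with `argparse` roughly as follows. This is a sketch rather than the script's actual source: the flag names come from the usage examples, while the default values (`runs` and `data.json`) and the helper name `build_parser` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Generate data.json from training log files."
    )
    # Default values are assumptions; check the script for the real ones.
    parser.add_argument("--runs-dir", default="runs",
                        help="directory containing one subdirectory per run")
    parser.add_argument("--output", default="data.json",
                        help="path of the JSON file to write")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["--runs-dir", "/path/to/runs"])
print(args.runs_dir, args.output)
```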

### How it works

1. **Scans runs directory**: Looks for subdirectories containing training runs
2. **Finds log files**: Locates `.log` files in each run's `logs/` subdirectory
3. **Parses log content**: Uses regex patterns to extract structured data
4. **Generates JSON**: Creates a structured JSON file for the dashboard
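
Steps 1 and 2 can be sketched with `pathlib`. The function name `find_log_files` is hypothetical and the real script may organize this differently; the `logs/` subdirectory and the `.log` extension are taken from the steps above. The demo at the bottom builds a throwaway directory tree so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_log_files(runs_dir: str) -> dict[str, list[Path]]:
    """Map each run name to the .log files in its logs/ subdirectory."""
    found: dict[str, list[Path]] = {}
    root = Path(runs_dir)
    if not root.is_dir():
        return found
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        logs_dir = run_dir / "logs"
        if not logs_dir.is_dir():
            continue  # not a training run, or no logs written yet
        logs = sorted(logs_dir.glob("*.log"))
        if logs:
            found[run_dir.name] = logs
    return found


# Demo on a temporary tree: one run with a log, one directory without.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model-name" / "logs").mkdir(parents=True)
    (Path(tmp) / "model-name" / "logs" / "train.log").write_text("")
    (Path(tmp) / "scratch").mkdir()
    demo = {name: [p.name for p in paths]
            for name, paths in find_log_files(tmp).items()}

print(demo)
```

Directories without a `logs/` subdirectory are skipped silently in this sketch, which is one reason a run might be missing from the dashboard.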

### Log Format Requirements

The script expects training metrics to be logged in the following format:

```
2025-08-29 02:09:12 - pico-train - INFO - Step 500 -- 🔄 Training Metrics
2025-08-29 02:09:12 - pico-train - INFO - ├── Loss: 10.8854
2025-08-29 02:09:12 - pico-train - INFO - ├── Learning Rate: 3.13e-06
2025-08-29 02:09:12 - pico-train - INFO - └── Inf/NaN count: 0
```

Evaluation results are expected in this format:

```
2025-08-29 02:15:26 - pico-train - INFO - Step 1000 -- 📊 Evaluation Results
2025-08-29 02:15:26 - pico-train - INFO - └── paloma: 7.125172406420199e+27
```
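
To make the format concrete, here is a minimal parsing sketch written against the sample lines above. The regular expressions, the `parse_log` function, and the key mapping are assumptions for illustration; the script's own patterns may differ. The training key names mirror the fields used in `data.json`.

```python
import re

# Hypothetical patterns written against the sample lines; not the script's own.
HEADER_RE = re.compile(r"Step (\d+) -- \S+ (Training Metrics|Evaluation Results)")
FIELD_RE = re.compile(r"[├└]── ([^:]+): (\S+)")

TRAINING_KEYS = {
    "Loss": "loss",
    "Learning Rate": "learning_rate",
    "Inf/NaN count": "inf_nan_count",
}


def parse_log(text: str) -> tuple[list[dict], list[dict]]:
    """Return (training_metrics, evaluation_results) parsed from log text."""
    training, evaluation = [], []
    current, is_training = None, False
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            current = {"step": int(header.group(1))}
            is_training = header.group(2) == "Training Metrics"
            (training if is_training else evaluation).append(current)
            continue
        field = FIELD_RE.search(line)
        if field is None or current is None:
            current = None  # any other line ends the current block
            continue
        name, raw = field.group(1).strip(), field.group(2)
        key = TRAINING_KEYS.get(name, name) if is_training else name
        try:
            current[key] = int(raw) if key == "inf_nan_count" else float(raw)
        except ValueError:
            pass  # skip values that are not numeric


    return training, evaluation


SAMPLE = """\
2025-08-29 02:09:12 - pico-train - INFO - Step 500 -- 🔄 Training Metrics
2025-08-29 02:09:12 - pico-train - INFO - ├── Loss: 10.8854
2025-08-29 02:09:12 - pico-train - INFO - ├── Learning Rate: 3.13e-06
2025-08-29 02:09:12 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 02:15:26 - pico-train - INFO - Step 1000 -- 📊 Evaluation Results
2025-08-29 02:15:26 - pico-train - INFO - └── paloma: 7.125172406420199e+27
"""

training, evaluation = parse_log(SAMPLE)
print(training)
print(evaluation)
```

The sketch relies on the `Step N --` header and the `├──`/`└──` prefixes, so a log that changes either of them would yield no entries.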

### Output Format

The generated `data.json` has this structure:

```json
{
  "runs": [
    {
      "run_name": "model-name",
      "log_file": "log_filename.log",
      "training_metrics": [
        {
          "step": 0,
          "loss": 10.9914,
          "learning_rate": 0.0,
          "inf_nan_count": 0
        }
      ],
      "evaluation_results": [
        {
          "step": 1000,
          "paloma": 59434.76600609756
        }
      ],
      "config": {
        "d_model": 96,
        "n_layers": 12,
        "max_seq_len": 2048,
        "vocab_size": 50304,
        "lr": 0.0003,
        "max_steps": 200000,
        "batch_size": 8
      }
    }
  ],
  "summary": {
    "total_runs": 1,
    "run_names": ["model-name"]
  }
}
```
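
The `summary` block can be derived entirely from the `runs` list. The sketch below shows one way to assemble the top-level structure; `build_payload` and `sanitize` are hypothetical helpers, not functions confirmed to exist in the script. It also covers one edge case worth knowing about: by default Python's `json` module writes non-finite floats as `NaN` or `Infinity`, which are not valid JSON and are rejected by JavaScript's `JSON.parse`. Whether `generate_data.py` handles this is not stated here, so the replacement with `null` is an assumption.

```python
import json
import math


def sanitize(value):
    """Recursively replace NaN and infinite floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


def build_payload(runs: list[dict]) -> dict:
    """Wrap parsed runs in the top-level structure shown above."""
    clean_runs = sanitize(runs)
    return {
        "runs": clean_runs,
        "summary": {
            "total_runs": len(clean_runs),
            "run_names": [run["run_name"] for run in clean_runs],
        },
    }


# Illustrative run; the infinite paloma value exercises sanitize().
run = {
    "run_name": "model-name",
    "log_file": "log_filename.log",
    "training_metrics": [
        {"step": 0, "loss": 10.9914, "learning_rate": 0.0, "inf_nan_count": 0}
    ],
    "evaluation_results": [{"step": 1000, "paloma": float("inf")}],
    "config": {"d_model": 96, "n_layers": 12},
}
payload = build_payload([run])
# allow_nan=False makes json.dumps raise if a non-finite float slipped through.
text = json.dumps(payload, indent=2, allow_nan=False)
print(text)
```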

### When to use

- **After training**: Generate updated dashboard data
- **Adding new runs**: Include new training sessions in the dashboard
- **Debugging**: Verify log parsing is working correctly
- **Dashboard setup**: Initial setup of the training metrics dashboard

### Troubleshooting

If the script doesn't find any data:
1. Check that log files exist in `runs/*/logs/`
2. Verify log format matches the expected pattern
3. Ensure log files contain training metrics entries
4. Check that the log files are readable and saved in an encoding that preserves the emoji and box-drawing characters (typically UTF-8)
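
The checklist above can be automated with a short diagnostic. This is a sketch under assumptions: `diagnose` and `check_log_text` are hypothetical names, the substring checks are a rough stand-in for the script's real patterns, and UTF-8 is assumed because the sample logs contain emoji.

```python
from pathlib import Path


def check_log_text(text: str) -> list[str]:
    """Return a list of problems found in one log file's contents."""
    problems = []
    if "Training Metrics" not in text:
        problems.append("no 'Training Metrics' header lines")
    if "Loss:" not in text:
        problems.append("no 'Loss:' lines")
    return problems


def diagnose(runs_dir: str = "runs") -> list[str]:
    """Walk through the checklist above and report what looks wrong."""
    root = Path(runs_dir)
    if not root.is_dir():
        return [f"runs directory not found: {runs_dir}"]
    logs = sorted(root.glob("*/logs/*.log"))
    if not logs:
        return [f"no .log files under {runs_dir}/*/logs/"]
    problems = []
    for path in logs:
        try:
            # UTF-8 is an assumption, based on the emoji in the sample logs.
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            problems.append(f"{path}: cannot read ({exc})")
            continue
        problems.extend(f"{path}: {issue}" for issue in check_log_text(text))
    return problems


for problem in diagnose("runs"):
    print(problem)
```

An empty result means every log file was readable and contained at least one training metrics entry; it does not guarantee that every line parses.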