# Scripts Directory

This directory contains utility scripts for the Pico training framework.

## generate_data.py

A script to automatically generate `data.json` from training log files for the dashboard.

### What it does

This script parses log files from the `runs/` directory and extracts:

- **Training metrics**: Loss, learning rate, and inf/NaN counts at each step
- **Evaluation results**: Paloma evaluation metrics
- **Model configuration**: Architecture parameters (d_model, n_layers, etc.)

### Usage

```bash
# Generate data.json from the default runs directory
python scripts/generate_data.py

# Specify a custom runs directory
python scripts/generate_data.py --runs-dir /path/to/runs

# Specify a custom output file
python scripts/generate_data.py --output /path/to/output.json
```
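
For orientation, the two flags shown above could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual source: the function name `build_parser` and both default values (`runs` and `data.json`) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI definition mirroring the flags in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Generate data.json from training log files."
    )
    # Assumed default: a `runs` directory relative to the working directory.
    parser.add_argument("--runs-dir", default="runs",
                        help="Directory containing one subdirectory per run")
    # Assumed default: `data.json` in the working directory.
    parser.add_argument("--output", default="data.json",
                        help="Path of the JSON file to write")
    return parser


args = build_parser().parse_args(["--runs-dir", "/path/to/runs"])
print(args.runs_dir, args.output)
```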

### How it works

1. **Scans runs directory**: Looks for subdirectories containing training runs
2. **Finds log files**: Locates `.log` files in each run's `logs/` subdirectory
3. **Parses log content**: Uses regex patterns to extract structured data
4. **Generates JSON**: Creates a structured JSON file for the dashboard
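
Steps 1 and 2 above can be sketched with `pathlib`. This is an illustration of the directory layout the steps describe, not the script's own code; the helper name `find_log_files` is hypothetical.

```python
from pathlib import Path


def find_log_files(runs_dir):
    """Map each run name to the sorted .log files found in <run>/logs/."""
    runs_dir = Path(runs_dir)
    found = {}
    if not runs_dir.is_dir():
        return found
    # Each immediate subdirectory is treated as one training run.
    for run_dir in sorted(p for p in runs_dir.iterdir() if p.is_dir()):
        logs = sorted((run_dir / "logs").glob("*.log"))
        if logs:  # Skip runs that have no log files.
            found[run_dir.name] = logs
    return found


for run_name, logs in find_log_files("runs").items():
    print(run_name, [log.name for log in logs])
```

Runs without a `logs/` subdirectory, or with no `.log` files in it, are simply omitted from the result.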

### Log Format Requirements

The script expects training metrics to be logged in the following format:

```
2025-08-29 02:09:12 - pico-train - INFO - Step 500 -- 🔄 Training Metrics
2025-08-29 02:09:12 - pico-train - INFO - ├── Loss: 10.8854
2025-08-29 02:09:12 - pico-train - INFO - ├── Learning Rate: 3.13e-06
2025-08-29 02:09:12 - pico-train - INFO - └── Inf/NaN count: 0
```

Evaluation results are expected in this format:

```
2025-08-29 02:15:26 - pico-train - INFO - Step 1000 -- 📊 Evaluation Results
2025-08-29 02:15:26 - pico-train - INFO - └── paloma: 7.125172406420199e+27
```
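
To make the format concrete, here is a minimal sketch of regex-based parsing for these two block types. The patterns and the `parse_log` helper are illustrative assumptions, not the patterns used by `generate_data.py`. They match only on the text labels (`Step N`, `Loss:`, and so on), so they do not depend on the decorative characters in front of each field.

```python
import re

# Assumed patterns: keyed on the labels shown in the sample log lines.
HEADER_RE = re.compile(r"Step (\d+) -- .*(Training Metrics|Evaluation Results)")
TRAIN_FIELDS = {
    "loss": (re.compile(r"Loss: (\S+)"), float),
    "learning_rate": (re.compile(r"Learning Rate: (\S+)"), float),
    "inf_nan_count": (re.compile(r"Inf/NaN count: (\d+)"), int),
}
EVAL_RE = re.compile(r"(\w+): ([-+]?\d[\d.eE+-]*)\s*$")


def parse_log(lines):
    """Return (training_metrics, evaluation_results) as lists of dicts."""
    training, evaluation = [], []
    current, kind = None, None
    for line in lines:
        header = HEADER_RE.search(line)
        if header:
            # A header line opens a new block for the given step.
            current = {"step": int(header.group(1))}
            kind = header.group(2)
            (training if kind == "Training Metrics" else evaluation).append(current)
            continue
        if current is None:
            continue
        matched = False
        if kind == "Training Metrics":
            for key, (pattern, cast) in TRAIN_FIELDS.items():
                m = pattern.search(line)
                if m:
                    current[key] = cast(m.group(1))
                    matched = True
        else:
            m = EVAL_RE.search(line)
            if m:
                current[m.group(1)] = float(m.group(2))
                matched = True
        if not matched:
            current = None  # An unrelated line ends the current block.
    return training, evaluation
```

Calling `parse_log(open(path, encoding="utf-8"))` on a log file yields dictionaries with the same keys as the `training_metrics` and `evaluation_results` entries in `data.json`.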

### Output Format

The generated `data.json` has this structure:

```json
{
  "runs": [
    {
      "run_name": "model-name",
      "log_file": "log_filename.log",
      "training_metrics": [
        {
          "step": 0,
          "loss": 10.9914,
          "learning_rate": 0.0,
          "inf_nan_count": 0
        }
      ],
      "evaluation_results": [
        {
          "step": 1000,
          "paloma": 59434.76600609756
        }
      ],
      "config": {
        "d_model": 96,
        "n_layers": 12,
        "max_seq_len": 2048,
        "vocab_size": 50304,
        "lr": 0.0003,
        "max_steps": 200000,
        "batch_size": 8
      }
    }
  ],
  "summary": {
    "total_runs": 1,
    "run_names": ["model-name"]
  }
}
```
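
As a quick way to inspect a file in this shape, the following sketch reports the last logged training step and loss for each run. The helper `summarize_runs` is hypothetical, and the inline sample stands in for a real `data.json`; in practice the data would come from `json.load` on the generated file.

```python
import json


def summarize_runs(data):
    """Return (run_name, last_step, last_loss) for runs with training metrics."""
    rows = []
    for run in data.get("runs", []):
        metrics = run.get("training_metrics", [])
        if not metrics:
            continue
        # Pick the entry with the highest step rather than relying on order.
        last = max(metrics, key=lambda m: m["step"])
        rows.append((run["run_name"], last["step"], last["loss"]))
    return rows


# Inline sample using the documented structure (values from the example above).
sample = json.loads("""
{
  "runs": [
    {
      "run_name": "model-name",
      "log_file": "log_filename.log",
      "training_metrics": [
        {"step": 0, "loss": 10.9914, "learning_rate": 0.0, "inf_nan_count": 0}
      ],
      "evaluation_results": [{"step": 1000, "paloma": 59434.76600609756}],
      "config": {"d_model": 96, "n_layers": 12}
    }
  ],
  "summary": {"total_runs": 1, "run_names": ["model-name"]}
}
""")

for name, step, loss in summarize_runs(sample):
    print(f"{name}: step {step}, loss {loss}")
```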

### When to use

- **After training**: Generate updated dashboard data
- **Adding new runs**: Include new training sessions in the dashboard
- **Debugging**: Verify log parsing is working correctly
- **Dashboard setup**: Initial setup of the training metrics dashboard

### Troubleshooting

If the script doesn't find any data:

1. Check that log files exist in `runs/*/logs/`
2. Verify log format matches the expected pattern
3. Ensure log files contain training metrics entries
4. Check file permissions and encoding
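
The checks above can be partly automated. The sketch below is a standalone diagnostic, not part of `generate_data.py`; the function name `diagnose` is hypothetical, and it assumes the logs are UTF-8 encoded and that a usable log contains the text `Training Metrics`.

```python
from pathlib import Path


def diagnose(runs_dir="runs"):
    """Return (path, problem) pairs for logs that would likely yield no data."""
    problems = []
    logs = sorted(Path(runs_dir).glob("*/logs/*.log"))
    if not logs:
        problems.append((str(runs_dir), "no .log files under */logs/"))
    for log in logs:
        try:
            # Assumption: logs are UTF-8; a decode error points to an encoding issue.
            text = log.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            problems.append((str(log), f"unreadable: {exc.__class__.__name__}"))
            continue
        if "Training Metrics" not in text:
            problems.append((str(log), "no 'Training Metrics' entries"))
    return problems


for path, problem in diagnose("runs"):
    print(f"{path}: {problem}")
```

An empty result means every log was readable and contained at least one training metrics header; it does not guarantee that each field line matches the expected format.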