LightningRodLabs/Golf-Forecaster

Tags: Text Generation, reinforcement-learning, mixture-of-experts, future-as-label

Bturtel committed on Feb 14

Commit 9d23dd2 (verified) · 1 parent: 90c19f5

Upload README.md with huggingface_hub

Files changed (1)

README.md +6 -0


@@ -58,6 +58,12 @@ Evaluated on 855 held-out test questions (temporal split, Aug 2025+). Golf-Forec
 | gpt-oss-120b (base) | 0.218 | +12.8% | 0.083 |
 | GPT-5.1 | 0.218 | +12.8% | 0.106 |
+![Brier Skill Score](https://huggingface.co/datasets/LightningRodLabs/GolfForecasting/resolve/main/brier_skill_score.png)
+![Brier Score Comparison](https://huggingface.co/datasets/LightningRodLabs/GolfForecasting/resolve/main/brier_score_comparison.png)
+![ECE Comparison](https://huggingface.co/datasets/LightningRodLabs/GolfForecasting/resolve/main/ece_comparison.png)
 ### Metrics
 - **Brier Score**: Mean squared error between predicted probability and outcome (0 or 1). Lower is better. **Brier Skill Score (BSS)** expresses this as improvement over always predicting the base rate — positive means the model learned something useful beyond historical frequency.
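
The two metric definitions above can be illustrated with a minimal Python sketch. It is not the evaluation code behind the reported numbers. The function names, the toy inputs, and the `base_rate` argument (assumed here to be a historical frequency supplied by the caller) are hypothetical choices made for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float], outcomes: Sequence[int], base_rate: float
) -> float:
    """Relative improvement over always predicting `base_rate`.

    BSS = 1 - BS_model / BS_reference. Positive values mean the forecasts
    beat the constant base-rate forecast; zero means no improvement.
    """
    # Reference forecaster (assumption): the same base rate for every question.
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    if reference == 0:
        raise ValueError("reference Brier score is zero; BSS is undefined")
    return 1.0 - brier_score(probs, outcomes) / reference


if __name__ == "__main__":
    # Toy data, not taken from the Golf-Forecaster test set.
    outcomes = [1, 0, 1, 0]
    probs = [0.8, 0.2, 0.6, 0.4]
    print("Brier score:", brier_score(probs, outcomes))
    print("Brier skill score:", brier_skill_score(probs, outcomes, base_rate=0.5))
```

On the toy data, the Brier score is about 0.1 and the reference score at a base rate of 0.5 is 0.25, so the skill score is about 0.6, which would be displayed as +60% in a table formatted like the one above.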