---
tags:
- grpo
- lora
- mixture-of-experts
- politics
- trump
- future-as-label
datasets:
- LightningRodLabs/WWTD-2025
base_model: openai/gpt-oss-120b
---

# Trump-Forecaster

### RL-Tuned gpt-oss-120b for Predicting Trump Administration Actions

We fine-tuned [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) with reinforcement learning to predict Trump administration actions. Trained on the [WWTD-2025](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025) dataset of 2,108 binary forecasting questions generated with the [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk), the model beats GPT-5 on held-out forecasting questions.

[Dataset](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025) · [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk) · [Future-as-Label paper](https://arxiv.org/abs/2601.06336) · [Outcome-based RL paper](https://arxiv.org/abs/2505.17989)

---
## Results

Evaluated on 682 held-out test questions under two conditions: with news context, and without context (question only). The no-context condition reveals whether a model knows what it doesn't know: untrained models project false confidence, while RL training reduces that overconfidence.

| Model | Brier (With Context) | BSS (With Context) | Brier (No Context) | BSS (No Context) | ECE (With Context) | ECE (No Context) |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|
| GPT-5 | 0.200 | +0.14 | 0.258 | -0.11 | 0.091 | 0.191 |
| gpt-oss-120b (base) | 0.213 | +0.08 | 0.260 | -0.12 | 0.111 | 0.190 |
| **gpt-oss-120b RL (this model)** | **0.194** | **+0.16** | **0.242** | **-0.04** | **0.079** | **0.164** |

### Metrics

- **Brier Score**: Mean squared error between predicted probability and outcome (0 or 1). Lower is better. The **Brier Skill Score (BSS)** expresses this as improvement over always predicting the base rate; positive means the model learned something useful beyond historical frequency.
- **Expected Calibration Error (ECE)**: Measures whether predicted probabilities match actual frequencies: "70%" predictions should resolve "yes" 70% of the time. Lower is better.

---
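All three metrics are straightforward to compute from predicted probabilities and resolved outcomes. A minimal sketch (the function names are ours, not part of this repo; ECE uses ten equal-width bins, a common but not universal choice):

```python
def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    """Improvement over a reference forecaster that always predicts the base rate.
    1.0 is a perfect forecaster; 0.0 matches the base rate; negative is worse."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier(probs, outcomes) / reference

def ece(probs, outcomes, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence, then take the
    weighted average gap between mean confidence and mean outcome per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    error = 0.0
    for b in bins:
        if b:
            avg_conf = sum(p for p, _ in b) / len(b)
            avg_acc = sum(y for _, y in b) / len(b)
            error += (len(b) / total) * abs(avg_conf - avg_acc)
    return error
```

A perfectly calibrated, perfectly sharp forecaster scores 0 on Brier and ECE and 1.0 on BSS; always predicting the base rate scores exactly 0 BSS.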
## Training

- **Training steps**: 50
- **Max tokens**: 16,384

---
## Usage

```python
import sglang as sgl

# `prompt` is built from the forecasting question (construction not shown here)
engine = sgl.Engine(model_path="LightningRodLabs/Trump-Forecaster", trust_remote_code=True)
output = engine.generate(prompt, sampling_params={"max_new_tokens": 4096, "stop": ["</answer>"]})
```

---
## Links

- **Dataset**: [LightningRodLabs/WWTD-2025](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025)
- **Training platform**: [Tinker](https://tinker.computer)
- **Data generation**: [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk)
- **Future-as-Label paper**: [arxiv:2601.06336](https://arxiv.org/abs/2601.06336)
- **Outcome-based RL paper**: [arxiv:2505.17989](https://arxiv.org/abs/2505.17989)