Bturtel committed (verified)
Commit 0afb1aa · Parent: 131ff9f

Upload README.md with huggingface_hub

Files changed (1): README.md (+33 -27)
README.md CHANGED
@@ -10,6 +10,9 @@ tags:
 - grpo
 - lora
 - mixture-of-experts
 datasets:
 - LightningRodLabs/WWTD-2025
 base_model: openai/gpt-oss-120b
@@ -35,31 +38,36 @@ model-index:

 # Trump-Forecaster

-**RL-tuned gpt-oss-120b for predicting Trump administration actions. Beats GPT-5 on held-out forecasting questions.**

-This model was fine-tuned with reinforcement learning (GRPO) using Brier score as the reward signal, trained on the [WWTD-2025](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025) dataset of 2,108 binary forecasting questions about Trump's actions from January-December 2025.

 ## Results

-Evaluated on 682 held-out test questions (with news context):

-| Model | Brier | BSS | ECE |
-|---|---|---|---|
-| **gpt-oss-120b RL (this model)** | **0.194** | **0.16** | **0.079** |
-| GPT-5 | 0.200 | 0.14 | 0.091 |
-| gpt-oss-120b (base) | 0.213 | 0.08 | 0.111 |

-Without context (question only):

-| Model | Brier | BSS | ECE |
-|---|---|---|---|
-| **gpt-oss-120b RL** | **0.242** | **-0.04** | 0.164 |
-| GPT-5 | 0.258 | -0.11 | 0.191 |
-| gpt-oss-120b (base) | 0.260 | -0.12 | 0.189 |

-- **Brier Score**: Mean squared error between predicted probability and outcome (lower = better)
-- **BSS (Brier Skill Score)**: Improvement over base-rate guessing (positive = better than naive)
-- **ECE**: Expected Calibration Error (lower = better calibrated)

 ## Training
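The model card says training used GRPO with Brier score as the reward signal. A minimal sketch of what such a reward function could look like; the `<answer>`-tag probability format is an assumption borrowed from the `</answer>` stop token in the Usage snippet, not a confirmed detail of the actual training code:

```python
import re

# Hypothetical reward for RL fine-tuning on binary forecasting questions:
# negative Brier score, so a perfect forecast earns 0.0 and the worst
# possible forecast earns -1.0. The <answer> tag format is an assumption.
ANSWER_RE = re.compile(r"<answer>\s*(1(?:\.0+)?|0?\.\d+|0)\s*(?:</answer>)?")

def brier_reward(completion: str, outcome: int) -> float:
    """Score a sampled completion against the resolved outcome (0 or 1)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return -1.0  # unparseable completions get the worst possible reward
    prob = min(max(float(match.group(1)), 0.0), 1.0)
    return -((prob - outcome) ** 2)

print(brier_reward("Reasoning... <answer>0.8", 1))  # ≈ -0.04
```

Because the reward is the (negative) Brier score of the stated probability, the policy is optimized directly for calibrated forecasts rather than for hard yes/no accuracy.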
 
@@ -71,6 +79,8 @@ Without context (question only):
 - **Training steps**: 50
 - **Max tokens**: 16,384

 ## Usage

 ```python
@@ -104,16 +114,12 @@ engine = sgl.Engine(model_path="LightningRodLabs/Trump-Forecaster", trust_remote
 output = engine.generate(prompt, sampling_params={"max_new_tokens": 4096, "stop": ["</answer>"]})
 ```

-## Dataset
-
-Trained on [LightningRodLabs/WWTD-2025](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025):
-- 2,790 binary forecasting questions about Trump administration actions
-- Auto-generated from news (Jan-Dec 2025) using the [Lightning Rod SDK](https://lightningrod.ai/sdk)
-- Ground-truth labels from web search verification
-- Temporal split: 2,108 train / 682 test (no leakage)

 ## Links

-- Dataset: [LightningRodLabs/WWTD-2025](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025)
-- Training platform: [Tinker](https://tinker.computer)
-- Data generation: [Lightning Rod SDK](https://lightningrod.ai/sdk)
 - grpo
 - lora
 - mixture-of-experts
+ - politics
+ - trump
+ - future-as-label
 datasets:
 - LightningRodLabs/WWTD-2025
 base_model: openai/gpt-oss-120b
 
 # Trump-Forecaster

+### RL-Tuned gpt-oss-120b for Predicting Trump Administration Actions
+
+We fine-tuned [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) with reinforcement learning to predict Trump administration actions. Trained on the [WWTD-2025](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025) dataset of 2,108 binary forecasting questions generated with the [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk), the model beats GPT-5 on held-out forecasting questions.
+
+[Dataset](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025) · [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk) · [Future-as-Label paper](https://arxiv.org/abs/2601.06336) · [Outcome-based RL paper](https://arxiv.org/abs/2505.17989)
+
+---

 ## Results

+Evaluated on 682 held-out test questions under two conditions: with news context, and without context (question only). The no-context condition reveals whether the model knows what it doesn't know: untrained models project false confidence, while RL training reduces overconfidence.
+
+| Model | Brier (With Context) | BSS (With Context) | Brier (No Context) | BSS (No Context) | ECE (With Context) | ECE (No Context) |
+|-------|:---:|:---:|:---:|:---:|:---:|:---:|
+| GPT-5 | 0.200 | +0.14 | 0.258 | -0.11 | 0.091 | 0.191 |
+| gpt-oss-120b | 0.213 | +0.08 | 0.260 | -0.12 | 0.111 | 0.190 |
+| **gpt-oss-120b RL (this model)** | **0.194** | **+0.16** | **0.242** | **-0.04** | **0.079** | **0.164** |
+
+![Brier Skill Score](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025/resolve/main/brier_skill_score.png)
+
+![Brier Score Comparison](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025/resolve/main/brier_score_comparison.png)
+
+![ECE Comparison](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025/resolve/main/ece_comparison.png)
+
+### Metrics
+
+- **Brier Score**: Mean squared error between predicted probability and outcome (0 or 1). Lower is better. **Brier Skill Score (BSS)** expresses this as improvement over always predicting the base rate; positive means the model learned something useful beyond historical frequency.
+- **Expected Calibration Error (ECE)**: Measures whether predicted probabilities match actual frequencies: "70%" predictions should resolve "yes" 70% of the time. Lower is better.
+
+---
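The metric definitions above can be made concrete in code. This is an illustrative sketch on toy numbers (not data from the evaluation), using the standard equal-width-binning definition of ECE:

```python
# `preds` are predicted probabilities of "yes"; `outcomes` are the
# resolved labels (0 or 1) of each forecasting question.

def brier(preds, outcomes):
    # Mean squared error between probability and outcome; lower is better.
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)

def brier_skill_score(preds, outcomes):
    # 1 - Brier / Brier_of_base_rate: positive means the model beats
    # always predicting the historical frequency of "yes".
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier(preds, outcomes) / reference

def ece(preds, outcomes, n_bins=10):
    # Expected Calibration Error with equal-width probability bins: the
    # frequency-weighted gap between mean confidence and observed frequency.
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(preds, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    total = len(preds)
    return sum(
        (len(b) / total)
        * abs(sum(p for p, _ in b) / len(b) - sum(o for _, o in b) / len(b))
        for b in bins
        if b
    )

preds = [0.9, 0.2, 0.7, 0.4]
outcomes = [1, 0, 1, 1]
print(round(brier(preds, outcomes), 3))  # 0.125
```

Note that a model can have a good Brier score yet a negative BSS if the questions' base rate is very lopsided, which is why the card reports both.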
 ## Training
 
 
 - **Training steps**: 50
 - **Max tokens**: 16,384

+---
+
 ## Usage

 ```python
 
 output = engine.generate(prompt, sampling_params={"max_new_tokens": 4096, "stop": ["</answer>"]})
 ```

+---

 ## Links

+- **Dataset**: [LightningRodLabs/WWTD-2025](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025)
+- **Training platform**: [Tinker](https://tinker.computer)
+- **Data generation**: [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk)
+- **Future-as-Label paper**: [arxiv:2601.06336](https://arxiv.org/abs/2601.06336)
+- **Outcome-based RL paper**: [arxiv:2505.17989](https://arxiv.org/abs/2505.17989)