---
tags:
- grpo
- lora
- mixture-of-experts
- politics
- trump
- future-as-label
datasets:
- LightningRodLabs/WWTD-2025
base_model: openai/gpt-oss-120b
---

# Trump-Forecaster

### RL-Tuned gpt-oss-120b for Predicting Trump Administration Actions

We fine-tuned [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) with reinforcement learning to predict Trump administration actions. Trained on the [WWTD-2025](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025) dataset of 2,108 binary forecasting questions generated with the [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk), the model beats GPT-5 on held-out forecasting questions.

[Dataset](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025) · [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk) · [Future-as-Label paper](https://arxiv.org/abs/2601.06336) · [Outcome-based RL paper](https://arxiv.org/abs/2505.17989)

---
## Results

Evaluated on 682 held-out test questions under two conditions: with news context, and without context (question only). The no-context condition reveals whether a model knows what it doesn't know: untrained models project false confidence, while RL training reduces that overconfidence.

| Model | Brier (With Context) | BSS (With Context) | Brier (No Context) | BSS (No Context) | ECE (With Context) | ECE (No Context) |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|
| GPT-5 | 0.200 | +0.14 | 0.258 | -0.11 | 0.091 | 0.191 |
| gpt-oss-120b (base) | 0.213 | +0.08 | 0.260 | -0.12 | 0.111 | 0.190 |
| **gpt-oss-120b RL (this model)** | **0.194** | **+0.16** | **0.242** | **-0.04** | **0.079** | **0.164** |

### Metrics

- **Brier Score**: Mean squared error between predicted probability and outcome (0 or 1). Lower is better. The **Brier Skill Score (BSS)** expresses this as improvement over always predicting the base rate; positive means the model learned something useful beyond historical frequency.
- **Expected Calibration Error (ECE)**: Measures whether predicted probabilities match actual frequencies: "70%" predictions should resolve "yes" 70% of the time. Lower is better.

---
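All three metrics are straightforward to compute from predicted probabilities and resolved outcomes. A minimal sketch (the function names are ours, not part of this repo; ECE uses ten equal-width bins, a common but not universal choice):

```python
def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    """Improvement over a reference forecaster that always predicts the base rate.
    1.0 is a perfect forecaster; 0.0 matches the base rate; negative is worse."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier(probs, outcomes) / reference

def ece(probs, outcomes, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence, then take the
    weighted average gap between mean confidence and mean outcome per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    error = 0.0
    for b in bins:
        if b:
            avg_conf = sum(p for p, _ in b) / len(b)
            avg_acc = sum(y for _, y in b) / len(b)
            error += (len(b) / total) * abs(avg_conf - avg_acc)
    return error
```

A perfectly calibrated, perfectly sharp forecaster scores 0 on Brier and ECE and 1.0 on BSS; always predicting the base rate scores exactly 0 BSS.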
## Training

- **Training steps**: 50
- **Max tokens**: 16,384

---
## Usage

```python
import sglang as sgl

# `prompt` is built from the forecasting question (construction not shown here)
engine = sgl.Engine(model_path="LightningRodLabs/Trump-Forecaster", trust_remote_code=True)
output = engine.generate(prompt, sampling_params={"max_new_tokens": 4096, "stop": ["</answer>"]})
```

---
## Links

- **Dataset**: [LightningRodLabs/WWTD-2025](https://huggingface.co/datasets/LightningRodLabs/WWTD-2025)
- **Training platform**: [Tinker](https://tinker.computer)
- **Data generation**: [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk)
- **Future-as-Label paper**: [arxiv:2601.06336](https://arxiv.org/abs/2601.06336)
- **Outcome-based RL paper**: [arxiv:2505.17989](https://arxiv.org/abs/2505.17989)