---
language:
- en
license: apache-2.0
library_name: peft
tags:
- forecasting
- prediction
- reinforcement-learning
- grpo
- lora
- mixture-of-experts
- golf
- sports
- future-as-label
datasets:
- LightningRodLabs/GolfForecasting
base_model: openai/gpt-oss-120b
pipeline_tag: text-generation
model-index:
- name: Golf-Forecaster
  results:
  - task:
      type: text-generation
      name: Probabilistic Forecasting
    dataset:
      name: GolfForecasting
      type: LightningRodLabs/GolfForecasting
      split: test
    metrics:
    - type: brier_score
      value: 0.207
      name: Brier Score
    - type: ece
      value: 0.062
      name: Expected Calibration Error
---

# Golf-Forecaster

### RL-Tuned gpt-oss-120b for Predicting Professional Golf Outcomes

Starting from nothing but 9 search queries, we used the [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk) to automatically generate [3,178 forecasting questions](https://huggingface.co/datasets/LightningRodLabs/GolfForecasting) from news articles, label them using real outcomes, and train this model via RL. **No expertise required. No manual labeling. No domain-specific engineering.** The result beats GPT-5 on held-out questions.

You can do this in any domain — just change the search queries. See [how we built the dataset](https://huggingface.co/datasets/LightningRodLabs/GolfForecasting).

This repo contains a **LoRA adapter** for [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b). A standalone `merge.py` script is included to merge it into a full model.

---

## Results

Evaluated on 855 held-out test questions (temporal split, Aug 2025+).

| Model | Brier Score | Brier Skill Score | ECE |
|-------|:---:|:---:|:---:|
| **Golf-Forecaster** | **0.207** | **+17.0%** | **0.062** |
| gpt-oss-120b (base) | 0.218 | +12.8% | 0.083 |
| GPT-5 | 0.218 | +12.8% | 0.106 |

![Brier Skill Score](https://huggingface.co/datasets/LightningRodLabs/GolfForecasting/resolve/main/brier_skill_score.png)

![Brier Score Comparison](https://huggingface.co/datasets/LightningRodLabs/GolfForecasting/resolve/main/brier_score_comparison.png)

![ECE Comparison](https://huggingface.co/datasets/LightningRodLabs/GolfForecasting/resolve/main/ece_comparison.png)

**Brier Score**: mean squared error between the predicted probability and the 0/1 outcome; lower is better. **Brier Skill Score (BSS)**: relative improvement over always predicting the base rate; higher is better. **ECE** (Expected Calibration Error): how far predicted probabilities deviate from observed frequencies; lower is better.
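For a concrete reference, the three metrics can be computed in a few lines of NumPy. This is our own sketch, not the evaluation code used for the table above:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    probs, outcomes = np.asarray(probs, dtype=float), np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def brier_skill_score(probs, outcomes):
    """Relative improvement over always predicting the base rate (higher is better)."""
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    ref = brier_score(np.full(len(outcomes), base_rate), outcomes)
    return 1.0 - brier_score(probs, outcomes) / ref

def ece(probs, outcomes, n_bins=10):
    """Expected Calibration Error: weighted gap between mean confidence
    and observed frequency within equal-width probability bins."""
    probs, outcomes = np.asarray(probs, dtype=float), np.asarray(outcomes, dtype=float)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(total)
```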

---

## Training

- **Base model**: [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) (120B MoE, 5.1B active params)
- **Method**: GRPO with Brier score reward via [Tinker](https://tinker.computer)
- **LoRA rank**: 32, learning rate 4e-5, batch size 32, group size 8, 100 steps
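The exact reward function is not published in this repo. A minimal sketch of a Brier-based reward, assuming the model emits its probability inside `<answer></answer>` tags as in the prompt shown under Usage, might look like:

```python
import re

def brier_reward(completion: str, outcome: int) -> float:
    """Hypothetical per-sample reward: parse the probability from the
    <answer> tag and score it as 1 minus the squared error, so a confident
    correct forecast earns ~1.0 and a confidently wrong one ~0.0."""
    match = re.search(r"<answer>\s*([01]?\.?\d*)\s*</answer>", completion)
    if match is None:
        return 0.0  # malformed output gets the worst reward
    try:
        p = float(match.group(1))
    except ValueError:
        return 0.0
    p = min(max(p, 0.0), 1.0)  # clamp to a valid probability
    return 1.0 - (p - outcome) ** 2
```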

---

## Usage

The adapter uses Tinker's module naming convention, so it requires a merge step before inference. A standalone `merge.py` script is included.

### Merge into full model

```bash
pip install torch transformers safetensors tqdm huggingface-hub
python merge.py --output ./golf-forecaster-merged
```

### Inference

```python
import sglang as sgl

engine = sgl.Engine(
    model_path="./golf-forecaster-merged",
    tokenizer_path="openai/gpt-oss-120b",
    trust_remote_code=True,
    dtype="bfloat16",
    tp_size=2,
)

news_context = "... relevant news articles ..."

prompt = f"""You are a forecasting expert. Given the question and context below, predict the probability that the answer is "Yes".

Question: Will Scottie Scheffler win the 2025 Masters?

Context:
{news_context}

Respond with your reasoning, then give your final answer as a probability between 0 and 1 inside <answer></answer> tags."""

output = engine.generate(prompt, sampling_params={"max_new_tokens": 4096, "stop": ["</answer>"]})
print(output["text"])
```
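Because sampling stops at the closing `</answer>` tag, that tag is typically absent from the generated text. A small helper (ours, not part of this repo) can recover the final probability:

```python
import re

def extract_probability(text: str):
    """Return the number following the last <answer> tag, clamped to [0, 1],
    or None if no answer tag was emitted."""
    matches = re.findall(r"<answer>\s*([0-9]*\.?[0-9]+)", text)
    if not matches:
        return None
    return min(max(float(matches[-1]), 0.0), 1.0)
```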

---

## Links

- **Dataset**: [LightningRodLabs/GolfForecasting](https://huggingface.co/datasets/LightningRodLabs/GolfForecasting)
- **Training platform**: [Tinker](https://tinker.computer)
- **Data generation**: [Lightning Rod SDK](https://github.com/lightning-rod-labs/lightningrod-python-sdk)
- **Future-as-Label paper**: [arxiv:2601.06336](https://arxiv.org/abs/2601.06336)
- **Outcome-based RL paper**: [arxiv:2505.17989](https://arxiv.org/abs/2505.17989)