language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - forecasting
  - prediction
  - reinforcement-learning
  - grpo
  - lora
  - mixture-of-experts
  - politics
  - trump
  - future-as-label
datasets:
  - LightningRodLabs/WWTD-2025
base_model: openai/gpt-oss-120b
pipeline_tag: text-generation
model-index:
  - name: Trump-Forecaster
    results:
      - task:
          type: text-generation
          name: Probabilistic Forecasting
        dataset:
          name: WWTD-2025
          type: LightningRodLabs/WWTD-2025
          split: test
        metrics:
          - type: brier_score
            value: 0.194
            name: Brier Score
          - type: ece
            value: 0.079
            name: Expected Calibration Error

Trump-Forecaster

RL-Tuned gpt-oss-120b for Predicting Trump Administration Actions

We fine-tuned gpt-oss-120b with reinforcement learning to predict Trump administration actions. The model was trained on the WWTD-2025 dataset of 2,108 binary forecasting questions generated with the Lightning Rod SDK. On held-out forecasting questions it scores better than GPT-5 on both Brier score and calibration error, with and without news context (see Results).

Dataset · Lightning Rod SDK · Future-as-Label paper · Outcome-based RL paper

Results

Evaluated on 682 held-out test questions under two conditions: with news context, and without context (question only). The no-context condition tests whether the model knows what it doesn't know. The untrained models project false confidence (ECE of about 0.19 without context), while RL training reduces this miscalibration (ECE of 0.164) without eliminating it.

Model	Brier (With Context)	BSS (With Context)	Brier (No Context)	BSS (No Context)	ECE (With Context)	ECE (No Context)
GPT-5	0.200	+0.14	0.258	-0.11	0.091	0.191
gpt-oss-120b	0.213	+0.08	0.260	-0.12	0.111	0.190
gpt-oss-120b RL (this model)	0.194	+0.16	0.242	-0.04	0.079	0.164

Metrics

Brier Score: Mean squared error between predicted probability and outcome (0 or 1). Lower is better. Brier Skill Score (BSS) expresses this as relative improvement over always predicting the base rate: BSS = 1 - Brier / Brier_base_rate. Positive means the model learned something useful beyond historical frequency; zero or negative means it did no better than the base rate.
Expected Calibration Error (ECE): Measures whether predicted probabilities match actual frequencies. "70%" predictions should resolve "yes" 70% of the time. Lower is better.
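
The metric definitions above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the evaluation code behind the table: the function names are hypothetical, the base rate is assumed to be the mean outcome of the evaluated questions, and ECE is assumed to use 10 equal-width probability bins (this card does not state either choice).

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """1 - Brier / Brier of a constant base-rate forecast (assumed in-sample)."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    if reference == 0.0:
        raise ValueError("All outcomes are identical; BSS is undefined.")
    return 1.0 - brier_score(probs, outcomes) / reference


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between average forecast and observed frequency per bin."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # A forecast of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_prob = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_prob - frequency)
    return ece


# Toy data for illustration only; these are not WWTD-2025 results.
toy_probs = [0.9, 0.8, 0.3, 0.1]
toy_outcomes = [1, 1, 0, 0]
print(brier_score(toy_probs, toy_outcomes))
print(brier_skill_score(toy_probs, toy_outcomes))
print(expected_calibration_error(toy_probs, toy_outcomes))
```

As a rough consistency check on the results table, a Brier score of 0.194 with a BSS of +0.16 implies a base-rate Brier score of about 0.194 / (1 - 0.16) ≈ 0.23. That value is inferred from the reported numbers rather than stated in this card.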

Training

Base model: openai/gpt-oss-120b (120B-parameter MoE, 5.1B active parameters per token, 128 experts with top-4 routing)
Method: GRPO with Brier score reward via Tinker
LoRA rank: 32
Learning rate: 4e-5
Batch size: 32, group size 8
Training steps: 50
Max tokens: 16,384
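
To make "GRPO with Brier score reward" concrete, here is a minimal sketch of how a group of sampled forecasts for one question could be scored and turned into group-relative advantages. It is not the Tinker API and not the training code for this model. The reward shape (negative squared error), the mean/std normalization, and all function names are assumptions; only the group size of 8 comes from the settings above.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def brier_reward(prob: float, outcome: int) -> float:
    """Assumed reward: negative squared error, so a better forecast scores higher."""
    return -((prob - outcome) ** 2)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group, as in the usual GRPO formulation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight hypothetical sampled forecasts (group size 8) for a question that resolved "yes".
group_probs = [0.55, 0.70, 0.40, 0.85, 0.60, 0.30, 0.75, 0.65]
rewards = [brier_reward(p, 1) for p in group_probs]
advantages = group_relative_advantages(rewards)

for p, a in zip(group_probs, advantages):
    print(f"forecast={p:.2f} advantage={a:+.2f}")
```

In this sketch, samples closer to the resolved outcome receive positive advantages and samples further away receive negative ones, so the policy update favors better-scoring forecasts relative to the rest of the group rather than relative to an absolute threshold.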

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "LightningRodLabs/Trump-Forecaster",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("LightningRodLabs/Trump-Forecaster", trust_remote_code=True)

prompt = """You are a forecasting expert. Given the question and context below, predict the probability that the answer is "Yes".

Question: Will Trump impose 25% tariffs on all goods from Canada by February 1, 2025?

Respond with your reasoning, then give your final answer as a probability between 0 and 1 inside <answer></answer> tags."""

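# Tokenize the raw prompt and sample one completion.
# outputs[0] holds the prompt tokens followed by the generated tokens.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)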
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
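
Since the prompt asks for the final probability inside <answer></answer> tags, the decoded text still needs to be parsed. The helper below is a hypothetical sketch of one way to do that (the function name and the validation rules are assumptions, not part of this repository). It takes the last tag that contains a number, so the empty tags quoted in the prompt are ignored, and it tolerates a missing closing tag, which may occur when generation is cut off at a stop string.

```python
import re
from typing import Optional

# Matches "<answer>" followed by a plain decimal number such as 0.35, .35 or 1.
_ANSWER_RE = re.compile(r"<answer>\s*([0-9]*\.?[0-9]+)")


def extract_probability(text: str) -> Optional[float]:
    """Return the last probability found in <answer> tags, or None if absent or invalid."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None
    value = float(matches[-1])
    # Reject values outside [0, 1], e.g. a percentage such as 75.
    return value if 0.0 <= value <= 1.0 else None


# Hypothetical model output, for illustration only.
sample = (
    "give your final answer inside <answer></answer> tags. "
    "The deadline is close, so this looks unlikely. <answer>0.12</answer>"
)
print(extract_probability(sample))
```

Returning None instead of raising lets the caller decide how to treat unparseable completions, for example by resampling.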

For faster inference with the MoE architecture, use SGLang:

import sglang as sgl

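# Reuses the `prompt` string defined in the transformers example above.
engine = sgl.Engine(model_path="LightningRodLabs/Trump-Forecaster", trust_remote_code=True, dtype="bfloat16")
# Generation halts once the stop string "</answer>" is produced.
output = engine.generate(prompt, sampling_params={"max_new_tokens": 4096, "stop": ["</answer>"]})
print(output["text"])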
