Files changed (1)
README.md +40 -48
README.md CHANGED
@@ -2,22 +2,36 @@
  license: apache-2.0
  base_model:
  - Qwen/Qwen3-32B
  ---
- # Foresight-32B
-
- A 32-billion parameter language model fine-tuned for probabilistic forecasting of real-world events.
-
- ## Overview
-
- Foresight-32B is a general-purpose forecasting model developed by [Lightning Rod Labs](https://lightningrod.ai). Built on Qwen3-32B and trained using outcome-based reinforcement learning, it achieves state-of-the-art forecasting performance among open-weight models, outperforming frontier LLMs 10-100x its size on prediction market benchmarks.

  ## Key Results

- In a forward-looking evaluation on 251 live Polymarket questions (July-August 2025):

  | Model | Brier Score ↓ | ECE ↓ | Profitable |
  |-------|---------------|-------|------------|
- | **Foresight-32B** | **0.199** | **6.0%** | ✓ |
  | OpenAI o3 | 0.205 | 7.8% | ✓ |
  | Gemini 2.5 Pro | 0.213 | 8.2% | ✗ |
  | Grok-4 | 0.218 | 9.1% | ✗ |
@@ -25,64 +39,42 @@ In a forward-looking evaluation on 251 live Polymarket questions (July-August 20
  | Qwen3-32B (base) | 0.253 | 19.2% | ✗ |
  | Polymarket (market) | 0.170 | — | — |

- Foresight-32B led all tested LLMs on every metric: Brier score, expected calibration error (ECE), and profitability.

  ## How It Works

- See: [LLMs Can Teach Themselves to Better Predict the Future](https://arxiv.org/abs/2502.05253)
- See: [Outcome-based Reinforcement Learning to Predict the Future](https://arxiv.org/abs/2505.17989)
-
- ### Synthetic Training Data (Foresight Learning)
-
- We augment limited real-world prediction market data with synthetically generated forecasting questions using our data generation framework. It generates questions from streams of data (e.g., news articles) that are difficult to predict at one point in time but verifiable later. The model was trained on ~10,000 real Polymarket questions plus ~100,000 synthetic questions, with nearly 70% of training data being synthetic.

- ## Training Details

- - **Base Model:** Qwen3-32B
- - **Training Method:** GRPO
- - **Training Data:** ~10k Polymarket questions + ~100k synthetic forecasting questions
- - **Evaluation:** Held-out test set of 1,265 questions with temporal separation to prevent leakage

- ## Usage

- Foresight-32B is available for use at [dashboard.lightningrod.ai](https://dashboard.lightningrod.ai).

- ### Input Format

- The model accepts a forecasting question along with relevant context (news articles, background information) and outputs a probability estimate with reasoning. Include instructions for how the answer should be formatted to get a well-structured response.

- ```
- Question: Will [event] happen by [date]?
-
- Context:
- [Relevant news headlines and information up to prediction date]
-
- Output: Probability estimate (0-100%) with reasoning
- ```

- ## Citation
-
- If you use Foresight-32B in your research, please cite:
-
- ```bibtex
- @article{turtel2025outcome,
-   title={Outcome-based Reinforcement Learning to Predict the Future},
-   author={Turtel, Benjamin and others},
-   journal={arXiv preprint arXiv:2505.17989},
-   year={2025}
- }
-
- @article{turtel2025llms,
-   title={LLMs Can Teach Themselves to Better Predict the Future},
-   author={Turtel, Benjamin and Franklin, Danny and Schoenegger, Philipp},
-   journal={arXiv preprint arXiv:2502.05253},
-   year={2025}
- }
- ```

- ## Contact

- If you are interested in generating training data for your own models or fine-tuning custom prediction agents on your domain-specific data, reach out to [support@lightningrod.ai](mailto:support@lightningrod.ai).

  ## License
 
  license: apache-2.0
  base_model:
  - Qwen/Qwen3-32B
+ tags:
+ - forecasting
+ - prediction
+ - reinforcement-learning
+ - calibration
+ - polymarket
+ pipeline_tag: text-generation
  ---
+ # Foresight V1 32B - Open-Source Forecasting Model
+ **Lightning Rod Labs** | [lightningrod.ai](https://lightningrod.ai/)

+ Foresight V1 32B is a forecasting model fine-tuned from Qwen3-32B via outcome-based RL. Despite being 10-100x smaller than the frontier models it was tested against, it has **outperformed them** on Brier score, ECE, and profitability.

+ Our latest model, Foresight V3, can be tested at [dashboard.lightningrod.ai](https://dashboard.lightningrod.ai/).

+ Lightning Rod Labs takes you from raw data to fine-tuned model, with automated training data generation, fine-tuning, and evaluation all in one place. No manual labeling required.

+ ### 3rd Party Benchmarks 🏆

+ Feb 2026: Foresight V1 32B ranked #1 on Prophet Arena Sports, a benchmark run by SIGMA Lab at UChicago, beating Grok-4, GPT-5.2, Gemini 3 Pro, and Claude Opus 4.5 on live prediction questions.

+ Jan 2026: Foresight V1 32B is the [only non-frontier model in the top 5](https://forecastingresearch.substack.com/p/llms-are-closing-the-gap-on-human) on ForecastBench, an independent forecasting benchmark run by the Forecasting Research Institute, where AIs compete on real-world forecasting questions.

  ## Key Results

+ Evaluated on August 25, 2025 against 251 live Polymarket questions, **Foresight V1 32B outperformed every frontier model tested** on accuracy (Brier score), calibration (ECE), and profitability.

  | Model | Brier Score ↓ | ECE ↓ | Profitable |
  |-------|---------------|-------|------------|
+ | **Foresight V1 32B** | **0.199** | **6.0%** | ✓ |
  | OpenAI o3 | 0.205 | 7.8% | ✓ |
  | Gemini 2.5 Pro | 0.213 | 8.2% | ✗ |
  | Grok-4 | 0.218 | 9.1% | ✗ |
  | Qwen3-32B (base) | 0.253 | 19.2% | ✗ |
  | Polymarket (market) | 0.170 | — | — |

+ Further details on our methodology and results are available [here](https://blog.lightningrod.ai/p/foresight-32b-beats-frontier-llms-on-live-polymarket-predictions).
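The two headline metrics can be computed from (probability, outcome) pairs: the Brier score is the mean squared error against the 0/1 outcome, and ECE bins forecasts by confidence and averages each bin's gap between mean confidence and empirical accuracy. A minimal sketch, assuming 10 equal-width bins (the evaluation's exact binning scheme is not specified and may differ):

```python
# Sketch of the table's two metrics: Brier score and expected calibration
# error (ECE). The 10-bin equal-width binning is an assumption for
# illustration, not the evaluation's documented setup.

def brier_score(preds, outcomes):
    """Mean squared error between probability and 0/1 outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)

def ece(preds, outcomes, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(preds, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, o))
    total = len(preds)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        avg_acc = sum(o for _, o in b) / len(b)
        err += len(b) / total * abs(avg_conf - avg_acc)
    return err

# A forecaster that always says 0.5 on a 50/50 question set:
preds = [0.5, 0.5, 0.5, 0.5]
outcomes = [1, 0, 1, 0]
print(brier_score(preds, outcomes))  # 0.25
print(ece(preds, outcomes))          # 0.0
```

Lower is better for both, and they measure different things: always forecasting the base rate is perfectly calibrated (ECE 0) but uninformative (Brier 0.25 on a 50/50 set), which is why the table reports both.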

  ## How It Works

+ Foresight V1 32B was trained using outcome-based RL. The model was shown only information available at prediction time, forced to commit to a probability, and scored against the realized outcome using the Brier score as the reward signal. Confident wrong predictions were penalized more heavily than uncertain ones, directly incentivizing calibration over overconfidence.
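The scoring rule described above is simple to state. A minimal sketch of a Brier-style reward (illustrative only, not the actual training code): on an event that resolves "no", a 95% forecast is penalized roughly 2.5x harder than a 60% forecast.

```python
# Illustrative Brier-style reward for outcome-based RL (not the actual
# training code): higher is better, and confident misses are penalized
# quadratically harder than hedged ones.

def brier_reward(p: float, outcome: int) -> float:
    """Negative squared error between the stated probability and the 0/1 outcome."""
    assert 0.0 <= p <= 1.0 and outcome in (0, 1)
    return -((p - outcome) ** 2)

# Confident miss vs hedged miss on an event that did not happen:
print(brier_reward(0.95, 0))  # -0.9025
print(brier_reward(0.60, 0))  # -0.36
```

Because the penalty is quadratic in the error, expected reward is maximized by reporting one's true belief: the Brier score is a proper scoring rule, which is what makes it usable as an RL reward without incentivizing hedging or overconfidence.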
 
 
 
 
 

+ Training data was generated using our Foresight Data platform, which automatically transformed unstructured sources into labeled training datasets, with no human annotation required.

+ The same framework has been applied across domains, including finance, healthcare, insurance, and sports analytics, to create prediction agents and domain expert models.

+ See: [LLMs Can Teach Themselves to Better Predict the Future](https://arxiv.org/abs/2502.05253) · [Outcome-based Reinforcement Learning to Predict the Future](https://arxiv.org/abs/2505.17989) · [Future-as-Label: Scalable Supervision from Real-World Outcomes](https://arxiv.org/abs/2601.06336)

+ ## Output Format

+ Our recommended usage is for predictions, but the model also works with the OpenAI API.
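An earlier version of this card documented the input as a question plus context plus an output instruction. A minimal sketch of assembling such a prompt, assuming that shape still applies (`build_prompt` is a hypothetical helper, and the exact wording is an assumption):

```python
# Sketch of a forecasting prompt in the question/context/output-instruction
# shape documented for earlier Foresight releases. The wording is an
# assumption; adapt it to your serving setup (e.g. an OpenAI-compatible
# chat endpoint).

def build_prompt(question: str, context_items: list[str]) -> str:
    """Assemble a forecasting prompt from a question and supporting headlines."""
    context = "\n".join(f"- {item}" for item in context_items)
    return (
        f"Question: {question}\n\n"
        f"Context:\n{context}\n\n"
        "Output: A probability estimate (0-100%) with reasoning."
    )

prompt = build_prompt(
    "Will [event] happen by [date]?",
    ["[Relevant news headline 1]", "[Relevant news headline 2]"],
)
print(prompt)
```

The bracketed placeholders are kept from the original template; fill them with the actual question and the news available up to the prediction date.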

+ ## About Lightning Rod Labs

+ Lightning Rod Labs takes you from raw data to fine-tuned model, with automated training data generation, fine-tuning, and evaluation all in one place. No manual labeling required. Our research is peer-reviewed and published, including in Transactions on Machine Learning Research (TMLR). Our models have been benchmarked live and have outperformed the world's best.

+ A few highlights:

+ - 🏆 #1 on Prophet Arena Sports, beating GPT-5.2, Gemini 3 Pro, and Grok-4 (Feb 2026)
+ - 📊 Top 5 on ForecastBench, outperforming Claude, o3, and Grok-4 (Jan 2026)
+ - 🔬 Published in TMLR: a 14B model matches o1 accuracy and generates >10% profit in live trading simulations [[link]](https://arxiv.org/abs/2505.17989)
+ - 🏛️ Vetted and awardable for U.S. defense procurement via the DARPA ERIS and CDAO Tradewinds marketplaces
+ - 📰 Featured in The Atlantic, TIME, and the Forecasting Research Institute

+ ## Contact

+ Interested in generating training data for your own models or building a custom prediction model?

+ - 📧 [support@lightningrod.ai](mailto:support@lightningrod.ai)
+ - 📅 [Book a demo](https://calendly.com/d/ctq4-7gd-nyq/lightning-rod-demo)
+ - 🌐 [lightningrod.ai/about](https://www.lightningrod.ai/about)

  ## License