new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jul 3

MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios

Financial decision-making is contextual: forecasting prices, valuing companies, and assessing event exposure weigh price history, accounting fundamentals, macroeconomic regime, and contemporaneous text. A benchmark over these four signals is hard to build because finance violates four assumptions of time-series evaluation: text must be gated by its publication date to prevent look-ahead, quarterly fundamentals are reported with a one- to ninety-day lag, filing text is partly redundant with the numerical statement fields it accompanies, and macroeconomic regimes leak across calendar splits. No public benchmark addresses all four signals jointly. MacroLens covers 4,416 U.S. small- and micro-cap equities over 2021-2026. Seven tasks share one point-in-time panel of prices, 46.8M XBRL accounting facts, 53 macroeconomic series, 295,860 SEC filings, and 215,882 news articles, plus a scenario layer of 1,130 macroeconomic events across 49 types automatically detected and rendered as natural language. Tasks span contextual forecasting, public and private valuation, statement generation from fundamentals and descriptions, scenario-conditioned returns, and real-estate valuation. We evaluate 19 methods across six families spanning naive heuristics through time-series foundation models, fine-tuned LLM-based time-series models, and zero-shot large language models (LLMs), plus a five-step feature-context ablation on two frontier LLMs and a gradient-boosted baseline. MacroLens is released at https://huggingface.co/datasets/DeepAuto-AI/MacroLens.

  • 5 authors
·
Jun 22

VegSim: A Geospatial World Model for Scenario-Conditioned Vegetation Simulation

Vegetation monitoring under climate stress requires answering not only how it will evolve given the expected weather, but how it would respond to alternative meteorological conditions. Forecasting models return the expected vegetation state for the observed weather and cannot answer these scenario-conditioned questions, because future weather is fixed to the recorded trajectory. We present VegSim, a geospatial world model for scenario-conditioned vegetation simulation. VegSim infers a latent vegetation state from sparse satellite-derived NDVI histories, past meteorological covariates, and static spatial context, propagates it forward under future weather forcing through recurrent latent dynamics, and decodes predictive NDVI quantiles at each lead time. Because future forcing enters as a controllable input, the same trained model supports probabilistic forecasting under observed weather and conditional simulation under user-defined meteorological forcing, without supervision on scenario responses. We evaluate VegSim on GreenEarthNet across in-distribution data and spatial, temporal, and joint spatial-temporal shift, where it achieves strong point and probabilistic accuracy against time series and Earth observation forecasting baselines while using a compact architecture. We then simulate vegetation responses across Europe under four meteorological scenarios, and in a France summer 2022 case study, obtaining spatially coherent patterns consistent with known sensitivity to temperature and precipitation. The code is available at https://github.com/arco-group/vegsim.

  • 4 authors
·
Jun 19

Dichotomy of Control: Separating What You Can Control from What You Cannot

Future- or return-conditioned supervised learning is an emerging paradigm for offline reinforcement learning (RL), where the future outcome (i.e., return) associated with an observed action sequence is used as input to a policy trained to imitate those same actions. While return-conditioning is at the heart of popular algorithms such as decision transformer (DT), these methods tend to perform poorly in highly stochastic environments, where an occasional high return can arise from randomness in the environment rather than the actions themselves. Such situations can lead to a learned policy that is inconsistent with its conditioning inputs; i.e., using the policy to act in the environment, when conditioning on a specific desired return, leads to a distribution of real returns that is wildly different than desired. In this work, we propose the dichotomy of control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity). We achieve this separation by conditioning the policy on a latent variable representation of the future, and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment. Theoretically, we show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior. Empirically, we show that DoC is able to achieve significantly better performance than DT on environments that have highly stochastic rewards and transition

  • 4 authors
·
Oct 24, 2022

An Introduction to Artificial Prediction Markets for Classification

Prediction markets are used in real life to predict outcomes of interest such as presidential elections. This paper presents a mathematical theory of artificial prediction markets for supervised learning of conditional probability estimators. The artificial prediction market is a novel method for fusing the prediction information of features or trained classifiers, where the fusion result is the contract price on the possible outcomes. The market can be trained online by updating the participants' budgets using training examples. Inspired by the real prediction markets, the equations that govern the market are derived from simple and reasonable assumptions. Efficient numerical algorithms are presented for solving these equations. The obtained artificial prediction market is shown to be a maximum likelihood estimator. It generalizes linear aggregation, existent in boosting and random forest, as well as logistic regression and some kernel methods. Furthermore, the market mechanism allows the aggregation of specialized classifiers that participate only on specific instances. Experimental comparisons show that the artificial prediction markets often outperform random forest and implicit online learning on synthetic data and real UCI datasets. Moreover, an extensive evaluation for pelvic and abdominal lymph node detection in CT data shows that the prediction market improves adaboost's detection rate from 79.6% to 81.2% at 3 false positives/volume.

  • 2 authors
·
Jul 8, 2012

Predicting Rare Events by Shrinking Towards Proportional Odds

Training classifiers is difficult with severe class imbalance, but many rare events are the culmination of a sequence with much more common intermediate outcomes. For example, in online marketing a user first sees an ad, then may click on it, and finally may make a purchase; estimating the probability of purchases is difficult because of their rarity. We show both theoretically and through data experiments that the more abundant data in earlier steps may be leveraged to improve estimation of probabilities of rare events. We present PRESTO, a relaxation of the proportional odds model for ordinal regression. Instead of estimating weights for one separating hyperplane that is shifted by separate intercepts for each of the estimated Bayes decision boundaries between adjacent pairs of categorical responses, we estimate separate weights for each of these transitions. We impose an L1 penalty on the differences between weights for the same feature in adjacent weight vectors in order to shrink towards the proportional odds model. We prove that PRESTO consistently estimates the decision boundary weights under a sparsity assumption. Synthetic and real data experiments show that our method can estimate rare probabilities in this setting better than both logistic regression on the rare category, which fails to borrow strength from more abundant categories, and the proportional odds model, which is too inflexible.

  • 2 authors
·
May 29, 2023

Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning

In the field of quantitative trading, it is common practice to transform raw historical stock data into indicative signals for the market trend. Such signals are called alpha factors. Alphas in formula forms are more interpretable and thus favored by practitioners concerned with risk. In practice, a set of formulaic alphas is often used together for better modeling precision, so we need to find synergistic formulaic alpha sets that work well together. However, most traditional alpha generators mine alphas one by one separately, overlooking the fact that the alphas would be combined later. In this paper, we propose a new alpha-mining framework that prioritizes mining a synergistic set of alphas, i.e., it directly uses the performance of the downstream combination model to optimize the alpha generator. Our framework also leverages the strong exploratory capabilities of reinforcement learning~(RL) to better explore the vast search space of formulaic alphas. The contribution to the combination models' performance is assigned to be the return used in the RL process, driving the alpha generator to find better alphas that improve upon the current set. Experimental evaluations on real-world stock market data demonstrate both the effectiveness and the efficiency of our framework for stock trend forecasting. The investment simulation results show that our framework is able to achieve higher returns compared to previous approaches.

  • 7 authors
·
May 25, 2023

RETuning: Upgrading Inference-Time Scaling for Stock Movement Prediction with Large Language Models

Recently, large language models (LLMs) have demonstrated outstanding reasoning capabilities on mathematical and coding tasks. However, their application to financial tasks-especially the most fundamental task of stock movement prediction-remains underexplored. We study a three-class classification problem (up, hold, down) and, by analyzing existing reasoning responses, observe that: (1) LLMs follow analysts' opinions rather than exhibit a systematic, independent analytical logic (CoTs). (2) LLMs list summaries from different sources without weighing adversarial evidence, yet such counterevidence is crucial for reliable prediction. It shows that the model does not make good use of its reasoning ability to complete the task. To address this, we propose Reflective Evidence Tuning (RETuning), a cold-start method prior to reinforcement learning, to enhance prediction ability. While generating CoT, RETuning encourages dynamically constructing an analytical framework from diverse information sources, organizing and scoring evidence for price up or down based on that framework-rather than on contextual viewpoints-and finally reflecting to derive the prediction. This approach maximally aligns the model with its learned analytical framework, ensuring independent logical reasoning and reducing undue influence from context. We also build a large-scale dataset spanning all of 2024 for 5,123 A-share stocks, with long contexts (32K tokens) and over 200K samples. In addition to price and news, it incorporates analysts' opinions, quantitative reports, fundamental data, macroeconomic indicators, and similar stocks. Experiments show that RETuning successfully unlocks the model's reasoning ability in the financial domain. Inference-time scaling still works even after 6 months or on out-of-distribution stocks, since the models gain valuable insights about stock movement prediction.

  • 10 authors
·
Oct 24, 2025

Unravelling the Probabilistic Forest: Arbitrage in Prediction Markets

Polymarket is a prediction market platform where users can speculate on future events by trading shares tied to specific outcomes, known as conditions. Each market is associated with a set of one or more such conditions. To ensure proper market resolution, the condition set must be exhaustive -- collectively accounting for all possible outcomes -- and mutually exclusive -- only one condition may resolve as true. Thus, the collective prices of all related outcomes should be \1, representing a combined probability of 1 of any outcome. Despite this design, Polymarket exhibits cases where dependent assets are mispriced, allowing for purchasing (or selling) a certain outcome for less than (or more than) 1, guaranteeing profit. This phenomenon, known as arbitrage, could enable sophisticated participants to exploit such inconsistencies. In this paper, we conduct an empirical arbitrage analysis on Polymarket data to answer three key questions: (Q1) What conditions give rise to arbitrage (Q2) Does arbitrage actually occur on Polymarket and (Q3) Has anyone exploited these opportunities. A major challenge in analyzing arbitrage between related markets lies in the scalability of comparisons across a large number of markets and conditions, with a naive analysis requiring O(2^{n+m}) comparisons. To overcome this, we employ a heuristic-driven reduction strategy based on timeliness, topical similarity, and combinatorial relationships, further validated by expert input. Our study reveals two distinct forms of arbitrage on Polymarket: Market Rebalancing Arbitrage, which occurs within a single market or condition, and Combinatorial Arbitrage, which spans across multiple markets. We use on-chain historical order book data to analyze when these types of arbitrage opportunities have existed, and when they have been executed by users. We find a realized estimate of 40 million USD of profit extracted.

  • 4 authors
·
Aug 4, 2025

gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

Inventory-policy comparisons are often difficult to interpret because performance depends on the evaluation contract as much as on the policy itself. Differences in topology, demand regime, information access, feasibility constraints, shortage treatment, and Key Performance Indicator (KPI) definitions can change method rankings. We present gym-invmgmt, a Gymnasium-compatible extension of the OR-Gym inventory-management lineage for auditable cross-paradigm evaluation. The benchmark evaluates optimization, heuristic, and learned controllers under a shared CoreEnv transition, reward, action-bound, and KPI contract, while varying stress conditions through a 22-scenario core grid plus four supplemental MARL-mode rows. Within these released scenarios, informed stochastic programming provides the strongest non-oracle reference, reflecting the value of scenario hedging under forecast access, but at substantially higher online computational cost. Among learned controllers, the Proximal Policy Optimization Transformer variant (PPO-Transformer) achieves the strongest learned-policy quality at fast inference, while Residual Reinforcement Learning (Residual RL) provides competitive hybrid performance. The graph neural network variant (PPO-GNN) is highly competitive on the default divergent topology but less robust on the serial topology. Imitation learning performs well in stationary regimes but degrades under demand shift, and the bounded Large Language Model (LLM) policy-parameter baseline is best interpreted as a diagnostic controller rather than an autonomous inventory optimizer. Overall, the benchmark identifies scenario-conditioned leaders while showing that performance depends jointly on information access, demand shift, topology, and policy representation.

  • 2 authors
·
May 11

A Taxonomy of Event-Linked Perpetual Futures: Variant Designs Beyond the Single-Market Binary Case

Paper 1 of this research programme develops a resolution-aware risk-design framework for the simplest event-linked perpetual: a contract whose underlying tracks a single binary prediction-market probability through resolution. The instrument class is broader. Variants span conditional probabilities P(A|B), spreads p^A - p^B, weighted baskets sum w_i p^(i), derivatives on variance or entropy of the probability process, contracts on liquidity itself, perpetual-on-expiring-event roll structures, and funding-only derivatives with no settlement. Each variant inherits some framework components from the single-market binary case and requires its own design adaptations. This paper develops a formal taxonomy of seven pure-form canonical variants beyond the probability-index perpetual of Paper 1, organised along four orthogonal design axes: underlying geometry, temporal structure, settlement structure, and venue composition. The list is not exhaustive; combinations are not treated separately. For each variant we provide a precise payoff definition; an inheritance map identifying which Paper 1 components carry over, are modified, or fail; variant-specific design constraints; microstructure properties; empirical evaluability on the PMXT v2 archive; and limitations. Notable findings: the conditional variant admits a candidate non-portability proposition (denominator instability as the conditioning event becomes improbable); the spread variant requires a three-channel decomposition of resolution risk; the volatility/entropy variant avoids random binary terminal-collapse but introduces estimator-convention and entropy-decay issues; the basket variant requires multi-period jump-aware margin whose aggregation is correlation-dependent. The paper is theoretical primarily; it specifies how demonstrative time series can be constructed and provides evaluability criteria to guide future work.

  • 1 authors
·
May 10

TRADES: Generating Realistic Market Simulations with Diffusion Models

Financial markets are complex systems characterized by high statistical noise, nonlinearity, and constant evolution. Thus, modeling them is extremely hard. We address the task of generating realistic and responsive Limit Order Book (LOB) market simulations, which are fundamental for calibrating and testing trading strategies, performing market impact experiments, and generating synthetic market data. Previous works lack realism, usefulness, and responsiveness of the generated simulations. To bridge this gap, we propose a novel TRAnsformer-based Denoising Diffusion Probabilistic Engine for LOB Simulations (TRADES). TRADES generates realistic order flows conditioned on the state of the market, leveraging a transformer-based architecture that captures the temporal and spatial characteristics of high-frequency market data. There is a notable absence of quantitative metrics for evaluating generative market simulation models in the literature. To tackle this problem, we adapt the predictive score, a metric measured as an MAE, by training a stock price predictive model on synthetic data and testing it on real data. We compare TRADES with previous works on two stocks, reporting an x3.27 and x3.47 improvement over SoTA according to the predictive score, demonstrating that we generate useful synthetic market data for financial downstream tasks. We assess TRADES's market simulation realism and responsiveness, showing that it effectively learns the conditional data distribution and successfully reacts to an experimental agent, giving sprout to possible calibrations and evaluations of trading strategies and market impact experiments. We developed DeepMarket, the first open-source Python framework for market simulation with deep learning. Our repository includes a synthetic LOB dataset composed of TRADES's generates simulations. We release the code at github.com/LeonardoBerti00/DeepMarket.

  • 3 authors
·
Jan 31, 2025

Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification

Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer's in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks, and different from all the existing literature, we consider a bi-objective prediction task of predicting both the conditional expectation E[Y|X] and the conditional variance Var(Y|X). This additional uncertainty quantification objective provides a handle to (i) better design out-of-distribution experiments to distinguish ICL from in-weight learning (IWL) and (ii) make a better separation between the algorithms with and without using the prior information of the training distribution. Theoretically, we show that the trained Transformer reaches near Bayes-optimum, suggesting the usage of the information of the training distribution. Our method can be extended to other cases. Specifically, with the Transformer's context window S, we prove a generalization bound of mathcal{O}(min{S, T/(n T)}) on n tasks with sequences of length T, providing sharper analysis compared to previous results of mathcal{O}(1/n). Empirically, we illustrate that while the trained Transformer behaves as the Bayes-optimal solution as a natural consequence of supervised training in distribution, it does not necessarily perform a Bayesian inference when facing task shifts, in contrast to the equivalence between these two proposed in many existing literature. We also demonstrate the trained Transformer's ICL ability over covariates shift and prompt-length shift and interpret them as a generalization over a meta distribution.

  • 4 authors
·
May 23, 2024

Ensembling Portfolio Strategies for Long-Term Investments: A Distribution-Free Preference Framework for Decision-Making and Algorithms

This paper investigates the problem of ensembling multiple strategies for sequential portfolios to outperform individual strategies in terms of long-term wealth. Due to the uncertainty of strategies' performances in the future market, which are often based on specific models and statistical assumptions, investors often mitigate risk and enhance robustness by combining multiple strategies, akin to common approaches in collective learning prediction. However, the absence of a distribution-free and consistent preference framework complicates decisions of combination due to the ambiguous objective. To address this gap, we introduce a novel framework for decision-making in combining strategies, irrespective of market conditions, by establishing the investor's preference between decisions and then forming a clear objective. Through this framework, we propose a combinatorial strategy construction, free from statistical assumptions, for any scale of component strategies, even infinite, such that it meets the determined criterion. Finally, we test the proposed strategy along with its accelerated variant and some other multi-strategies. The numerical experiments show results in favor of the proposed strategies, albeit with small tradeoffs in their Sharpe ratios, in which their cumulative wealths eventually exceed those of the best component strategies while the accelerated strategy significantly improves performance.

  • 1 authors
·
Jun 5, 2024

Optimizing Return Distributions with Distributional Dynamic Programming

We introduce distributional dynamic programming (DP) methods for optimizing statistical functionals of the return distribution, with standard reinforcement learning as a special case. Previous distributional DP methods could optimize the same class of expected utilities as classic DP. To go beyond expected utilities, we combine distributional DP with stock augmentation, a technique previously introduced for classic DP in the context of risk-sensitive RL, where the MDP state is augmented with a statistic of the rewards obtained so far (since the first time step). We find that a number of recently studied problems can be formulated as stock-augmented return distribution optimization, and we show that we can use distributional DP to solve them. We analyze distributional value and policy iteration, with bounds and a study of what objectives these distributional DP methods can or cannot optimize. We describe a number of applications outlining how to use distributional DP to solve different stock-augmented return distribution optimization problems, for example maximizing conditional value-at-risk, and homeostatic regulation. To highlight the practical potential of stock-augmented return distribution optimization and distributional DP, we combine the core ideas of distributional value iteration with the deep RL agent DQN, and empirically evaluate it for solving instances of the applications discussed.

  • 9 authors
·
Jan 22, 2025

Advancing Investment Frontiers: Industry-grade Deep Reinforcement Learning for Portfolio Optimization

This research paper delves into the application of Deep Reinforcement Learning (DRL) in asset-class agnostic portfolio optimization, integrating industry-grade methodologies with quantitative finance. At the heart of this integration is our robust framework that not only merges advanced DRL algorithms with modern computational techniques but also emphasizes stringent statistical analysis, software engineering and regulatory compliance. To the best of our knowledge, this is the first study integrating financial Reinforcement Learning with sim-to-real methodologies from robotics and mathematical physics, thus enriching our frameworks and arguments with this unique perspective. Our research culminates with the introduction of AlphaOptimizerNet, a proprietary Reinforcement Learning agent (and corresponding library). Developed from a synthesis of state-of-the-art (SOTA) literature and our unique interdisciplinary methodology, AlphaOptimizerNet demonstrates encouraging risk-return optimization across various asset classes with realistic constraints. These preliminary results underscore the practical efficacy of our frameworks. As the finance sector increasingly gravitates towards advanced algorithmic solutions, our study bridges theoretical advancements with real-world applicability, offering a template for ensuring safety and robust standards in this technologically driven future.

  • 2 authors
·
Feb 27, 2024

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.

Revisiting Design Choices in Offline Model-Based Reinforcement Learning

Offline reinforcement learning enables agents to leverage large pre-collected datasets of environment transitions to learn control policies, circumventing the need for potentially expensive or unsafe online data collection. Significant progress has been made recently in offline model-based reinforcement learning, approaches which leverage a learned dynamics model. This typically involves constructing a probabilistic model, and using the model uncertainty to penalize rewards where there is insufficient data, solving for a pessimistic MDP that lower bounds the true MDP. Existing methods, however, exhibit a breakdown between theory and practice, whereby pessimistic return ought to be bounded by the total variation distance of the model from the true dynamics, but is instead implemented through a penalty based on estimated model uncertainty. This has spawned a variety of uncertainty heuristics, with little to no comparison between differing approaches. In this paper, we compare these heuristics, and design novel protocols to investigate their interaction with other hyperparameters, such as the number of models, or imaginary rollout horizon. Using these insights, we show that selecting these key hyperparameters using Bayesian Optimization produces superior configurations that are vastly different to those currently used in existing hand-tuned state-of-the-art methods, and result in drastically stronger performance.

  • 5 authors
·
Oct 8, 2021

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

Predicting real-world events from live market signals demands systems that fuse qualitative news with quantitative order-book dynamics under strict temporal discipline -- a challenge existing benchmarks fail to capture. We present PolyBench, a multimodal benchmark derived from Polymarket that records point-in-time cross-sections of 38,666 binary prediction markets spanning 4,997 events, synchronously coupling each snapshot with a Central Limit Order Book (CLOB) state and a real-time news stream. Using PolyBench, we evaluate seven state-of-the-art Large Language Models -- spanning open- and closed-source families -- generating 36,165 predictions under identical, timestamp-locked market states collected between February 6 and 12, 2026. Our multidimensional framework assesses directional accuracy, our proposed Confidence-Weighted Return (CWR), Annualized Percentage Yield (APY), and Sharpe ratio via realistic order-book execution simulation. The results reveal a pronounced performance divergence: only two of seven models achieve positive financial returns -- MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR -- while the remaining five incur losses despite uniformly high stated confidence. These findings highlight the gap between surface-level language fluency and genuine probabilistic reasoning under live market uncertainty, and establish PolyBench as a contamination-proof, financially-grounded evaluation standard for future LLM research. Our dataset and code available at \href{https://github.com/PolyBench/PolyBench{https://github.com/PolyBench/PolyBench}}.

  • 3 authors
·
Apr 2

Pre-training Time Series Models with Stock Data Customization

Stock selection, which aims to predict stock prices and identify the most profitable ones, is a crucial task in finance. While existing methods primarily focus on developing model structures and building graphs for improved selection, pre-training strategies remain underexplored in this domain. Current stock series pre-training follows methods from other areas without adapting to the unique characteristics of financial data, particularly overlooking stock-specific contextual information and the non-stationary nature of stock prices. Consequently, the latent statistical features inherent in stock data are underutilized. In this paper, we propose three novel pre-training tasks tailored to stock data characteristics: stock code classification, stock sector classification, and moving average prediction. We develop the Stock Specialized Pre-trained Transformer (SSPT) based on a two-layer transformer architecture. Extensive experimental results validate the effectiveness of our pre-training methods and provide detailed guidance on their application. Evaluations on five stock datasets, including four markets and two time periods, demonstrate that SSPT consistently outperforms the market and existing methods in terms of both cumulative investment return ratio and Sharpe ratio. Additionally, our experiments on simulated data investigate the underlying mechanisms of our methods, providing insights into understanding price series. Our code is publicly available at: https://github.com/astudentuser/Pre-training-Time-Series-Models-with-Stock-Data-Customization.

  • 3 authors
·
Jun 20, 2025

Contextual Bandits with Online Neural Regression

Recent works have shown a reduction from contextual bandits to online regression under a realizability assumption [Foster and Rakhlin, 2020, Foster and Krishnamurthy, 2021]. In this work, we investigate the use of neural networks for such online regression and associated Neural Contextual Bandits (NeuCBs). Using existing results for wide networks, one can readily show a {O}(T) regret for online regression with square loss, which via the reduction implies a {O}(K T^{3/4}) regret for NeuCBs. Departing from this standard approach, we first show a O(log T) regret for online regression with almost convex losses that satisfy QG (Quadratic Growth) condition, a generalization of the PL (Polyak-\L ojasiewicz) condition, and that have a unique minima. Although not directly applicable to wide networks since they do not have unique minima, we show that adding a suitable small random perturbation to the network predictions surprisingly makes the loss satisfy QG with unique minima. Based on such a perturbed prediction, we show a {O}(log T) regret for online regression with both squared loss and KL loss, and subsequently convert these respectively to mathcal{O}(KT) and mathcal{O}(KL^* + K) regret for NeuCB, where L^* is the loss of the best policy. Separately, we also show that existing regret bounds for NeuCBs are Omega(T) or assume i.i.d. contexts, unlike this work. Finally, our experimental results on various datasets demonstrate that our algorithms, especially the one based on KL loss, persistently outperform existing algorithms.

  • 5 authors
·
Dec 12, 2023

Verifiable Rewards for Calibrated Probabilistic Forecasting

Reinforcement learning with verifiable rewards can in principle train calibrated probabilistic forecasters, since a proper scoring rule such as the Brier score is computed from outcomes alone and is minimized in expectation by the true probability. In practice it degrades calibration, and existing remedies address epistemic uncertainty, where a model's confidence accompanies a verifiably correct or incorrect answer. We study aleatoric forecasting, where the forecast itself is the output and the label is one stochastic outcome, taking NFL in-game win probability as a testbed with the betting market as a reference. Rewarding the realized per-play outcome fails, because the single outcome is a noisy target and the policy gradient corrupts the chain of thought. We introduce a verifiable, label-free reward, a state-conditioned empirical win rate estimated from past outcomes, that removes the label noise, and we keep the gradient off the reasoning, by direct prediction or a gradient mask, so it cannot be corrupted. Trained with this reward alone, without human labels or supervised fine-tuning, a 7B model reaches the calibration of the betting market by direct prediction and is better calibrated than a zero-shot frontier model. That frontier model and a tabular estimator reach the same Brier score as this model, identifying the market's small remaining edge as live in-game information beyond their shared inputs. Masking the gradient, rather than dropping the chain of thought, preserves reasoning from which the forecast follows, which ordinary chain-of-thought training corrupts.

  • 3 authors
·
Jun 29

LiveTradeBench: Seeking Real-World Alpha with Large Language Models

Large language models (LLMs) achieve strong performance across benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but these tests occur in static settings, lacking real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem-solving rather than decision-making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) Live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real-time uncertainty; (ii) a portfolio-management abstraction that extends control from single-asset actions to multi-asset allocation, integrating risk management and cross-asset reasoning; and (iii) multi-market evaluation across structurally distinct environments--U.S. stocks and Polymarket prediction markets--differing in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50-day live evaluations of 21 LLMs across families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real-world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.

  • 3 authors
·
Nov 5, 2025 2

Efficient Variance-reduced Estimation from Generative EHR Models: The SCOPE and REACH Estimators

Generative models trained using self-supervision of tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction. This is typically done using Monte Carlo simulation for future patient trajectories. However, existing approaches suffer from three key limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational costs, and high sampling variance. We propose two new estimators: the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH), that leverage next-token probability distributions discarded by standard Monte Carlo. We prove both estimators are unbiased and that REACH guarantees variance reduction over Monte Carlo sampling for any model and outcome. Empirically, on hospital mortality prediction in MIMIC-IV using the ETHOS-ARES framework, SCOPE and REACH match 100-sample Monte Carlo performance using only 10-11 samples (95% CI: [9,11]), representing a ~10x reduction in inference cost without degrading calibration. For ICU admission prediction, efficiency gains are more modest (~1.2x), which we attribute to the outcome's lower "spontaneity," a property we characterize theoretically and empirically. These methods substantially improve the feasibility of deploying generative EHR models in resource-constrained clinical settings.

  • 6 authors
·
Feb 2

Incorporating Surrogate Gradient Norm to Improve Offline Optimization Techniques

Offline optimization has recently emerged as an increasingly popular approach to mitigate the prohibitively expensive cost of online experimentation. The key idea is to learn a surrogate of the black-box function that underlines the target experiment using a static (offline) dataset of its previous input-output queries. Such an approach is, however, fraught with an out-of-distribution issue where the learned surrogate becomes inaccurate outside the offline data regimes. To mitigate this, existing offline optimizers have proposed numerous conditioning techniques to prevent the learned surrogate from being too erratic. Nonetheless, such conditioning strategies are often specific to particular surrogate or search models, which might not generalize to a different model choice. This motivates us to develop a model-agnostic approach instead, which incorporates a notion of model sharpness into the training loss of the surrogate as a regularizer. Our approach is supported by a new theoretical analysis demonstrating that reducing surrogate sharpness on the offline dataset provably reduces its generalized sharpness on unseen data. Our analysis extends existing theories from bounding generalized prediction loss (on unseen data) with loss sharpness to bounding the worst-case generalized surrogate sharpness with its empirical estimate on training data, providing a new perspective on sharpness regularization. Our extensive experimentation on a diverse range of optimization tasks also shows that reducing surrogate sharpness often leads to significant improvement, marking (up to) a noticeable 9.6% performance boost. Our code is publicly available at https://github.com/cuong-dm/IGNITE

  • 4 authors
·
Mar 6, 2025

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.

  • 4 authors
·
Apr 11

Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction

Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer's correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.

  • 1 authors
·
Aug 27, 2025

Quantitative Risk Management in Volatile Markets with an Expectile-Based Framework for the FTSE Index

This research presents a framework for quantitative risk management in volatile markets, specifically focusing on expectile-based methodologies applied to the FTSE 100 index. Traditional risk measures such as Value-at-Risk (VaR) have demonstrated significant limitations during periods of market stress, as evidenced during the 2008 financial crisis and subsequent volatile periods. This study develops an advanced expectile-based framework that addresses the shortcomings of conventional quantile-based approaches by providing greater sensitivity to tail losses and improved stability in extreme market conditions. The research employs a dataset spanning two decades of FTSE 100 returns, incorporating periods of high volatility, market crashes, and recovery phases. Our methodology introduces novel mathematical formulations for expectile regression models, enhanced threshold determination techniques using time series analysis, and robust backtesting procedures. The empirical results demonstrate that expectile-based Value-at-Risk (EVaR) consistently outperforms traditional VaR measures across various confidence levels and market conditions. The framework exhibits superior performance during volatile periods, with reduced model risk and enhanced predictive accuracy. Furthermore, the study establishes practical implementation guidelines for financial institutions and provides evidence-based recommendations for regulatory compliance and portfolio management. The findings contribute significantly to the literature on financial risk management and offer practical tools for practitioners dealing with volatile market environments.

  • 1 authors
·
Jul 16, 2025 1

Resolution-Aware Perpetual Futures on Binary Prediction Markets: An Empirical Risk-Design Framework Using Polymarket Data

We develop and counterfactually evaluate a resolution-aware risk-design framework (PIRAP) for perpetual futures whose underlying tracks a single binary prediction-market probability through resolution. The framework specifies six components: an index estimator combining mid-price, depth-weighted mid, and time-decayed VWAP; jump-aware tiered margin sized against bounded-event terminal-collapse magnitude; leverage compression schedule contracting toward resolution; resolution-aware funding rule with boundary-aware correction; a multi-stage halt protocol; and an eligibility framework. Two formal non-portability propositions establish that standard basis-only funding paired with continuous-vol static margin fails on bounded-event underlyings. Empirical evaluation uses Polymarket's PMXT v2 archive for 2026-04-21 to 2026-04-27 (13,298-market analysis sample passing adequacy gates from 61,087 ingested; 13,115 resolved within the empirical window for E3). E1 evaluates two pre-registered stylized facts; E2 conducts counterfactual replay across three engine configurations; E3 isolates the resolution-zone protocol's contribution. Results are mixed. Five pre-registered floors: stylized-fact floors (boundary depth asymmetry, terminal-jump magnitude) PASS; welfare-side directional floors (final-hour liquidation -6%, drawdown -5.1% pooled, median PnL +14%) two FAIL one PASS; E3 mechanic floors (final-hour liquidation -80% by halt construction PASS; bad-debt frequency +2.4% FAIL). Three of five materiality floors fail: the framework as specified does not validate deployment, but the empirical record establishes a halt-versus-margin scope distinction (halt addresses execution-channel risk; terminal-jump bad-debt remains margin-side) and documents a pre-emption trade-off constraining the dynamic-margin component. The paper concludes with structural recommendations and explicit non-deployable status.

  • 1 authors
·
May 10

CAMEF: Causal-Augmented Multi-Modality Event-Driven Financial Forecasting by Integrating Time Series Patterns and Salient Macroeconomic Announcements

Accurately forecasting the impact of macroeconomic events is critical for investors and policymakers. Salient events like monetary policy decisions and employment reports often trigger market movements by shaping expectations of economic growth and risk, thereby establishing causal relationships between events and market behavior. Existing forecasting methods typically focus either on textual analysis or time-series modeling, but fail to capture the multi-modal nature of financial markets and the causal relationship between events and price movements. To address these gaps, we propose CAMEF (Causal-Augmented Multi-Modality Event-Driven Financial Forecasting), a multi-modality framework that effectively integrates textual and time-series data with a causal learning mechanism and an LLM-based counterfactual event augmentation technique for causal-enhanced financial forecasting. Our contributions include: (1) a multi-modal framework that captures causal relationships between policy texts and historical price data; (2) a new financial dataset with six types of macroeconomic releases from 2008 to April 2024, and high-frequency real trading data for five key U.S. financial assets; and (3) an LLM-based counterfactual event augmentation strategy. We compare CAMEF to state-of-the-art transformer-based time-series and multi-modal baselines, and perform ablation studies to validate the effectiveness of the causal learning mechanism and event types.

  • 5 authors
·
Aug 7, 2025

Revisiting Discriminative vs. Generative Classifiers: Theory and Implications

A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes parameters in the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier in linear evaluation except for the default logistic regression. Inspired by the statistical efficiency of naive Bayes, the paper revisits the classical topic on discriminative vs. generative classifiers. Theoretically, the paper considers the surrogate loss instead of the zero-one loss in analyses and generalizes the classical results from binary cases to multiclass ones. We show that, under mild assumptions, multiclass naive Bayes requires O(log n) samples to approach its asymptotic error while the corresponding multiclass logistic regression requires O(n) samples, where n is the feature dimension. To establish it, we present a multiclass H-consistency bound framework and an explicit bound for logistic loss, which are of independent interests. Simulation results on a mixture of Gaussian validate our theoretical findings. Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the number of data increases. Besides, naive Bayes shows promise in few-shot cases and we observe the "two regimes" phenomenon in pre-trained supervised models. Our code is available at https://github.com/ML-GSAI/Revisiting-Dis-vs-Gen-Classifiers.

  • 6 authors
·
Feb 5, 2023

Environment-Adaptive Covariate Selection: Learning When to Use Spurious Correlations for Out-of-Distribution Prediction

Out-of-distribution (OOD) prediction is often approached by restricting models to causal or invariant covariates, avoiding non-causal spurious associations that may be unstable across environments. Despite its theoretical appeal, this strategy frequently underperforms empirical risk minimization (ERM) in practice. We investigate the source of this gap and show that such failures naturally arise when only a subset of the true causes of the outcome is observed. In these settings, non-causal spurious covariates can serve as informative proxies for unobserved causes and substantially improve prediction, except under distribution shifts that break these proxy relationships. Consequently, the optimal set of predictive covariates is neither universal nor necessarily exhibits invariant relationships with the outcome across all environments, but instead depends on the specific type of shift encountered. Crucially, we observe that different covariate shifts induce distinct, observable signatures in the covariate distribution itself. Moreover, these signatures can be extracted from unlabeled data in the target OOD environment and used to assess when proxy covariates remain reliable and when they fail. Building on this observation, we propose an environment-adaptive covariate selection (EACS) algorithm that maps environment-level covariate summaries to environment-specific covariate sets, while allowing the incorporation of prior causal knowledge as constraints. Across simulations and applied datasets, EACS consistently outperforms static causal, invariant, and ERM-based predictors under diverse distribution shifts.

  • 2 authors
·
Jan 5

EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science

Socio-economic causal effects depend heavily on their specific institutional and environmental context. A single intervention can produce opposite results depending on regulatory or market factors, contexts that are often complex and only partially observed. This poses a significant challenge for large language models (LLMs) in decision-support roles: can they distinguish structural causal mechanisms from surface-level correlations when the context changes? To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals. Through a rigorous four-stage pipeline combining multi-run consensus, context refinement, and multi-critic filtering, we ensure each claim is grounded in peer-reviewed research with explicit identification strategies. Our evaluation reveals critical limitations in current LLMs' context-dependent reasoning. While top models achieve approximately 88 percent accuracy in fixed, explicit contexts, performance drops sharply under context shifts, with a 32.6 percentage point decline, and falls to 37 percent when misinformation is introduced. Furthermore, models exhibit severe over-commitment in ambiguous cases and struggle to recognize null effects, achieving only 9.5 percent accuracy, exposing a fundamental gap between pattern matching and genuine causal reasoning. These findings underscore substantial risks for high-stakes economic decision-making, where the cost of misinterpreting causality is high. The dataset and benchmark are publicly available at https://github.com/econaikaist/econcausal-benchmark.

  • 6 authors
·
Oct 8, 2025

Bayesian Robust Financial Trading with Adversarial Synthetic Market Data

Algorithmic trading relies on machine learning models to make trading decisions. Despite strong in-sample performance, these models often degrade when confronted with evolving real-world market regimes, which can shift dramatically due to macroeconomic changes-e.g., monetary policy updates or unanticipated fluctuations in participant behavior. We identify two challenges that perpetuate this mismatch: (1) insufficient robustness in existing policy against uncertainties in high-level market fluctuations, and (2) the absence of a realistic and diverse simulation environment for training, leading to policy overfitting. To address these issues, we propose a Bayesian Robust Framework that systematically integrates a macro-conditioned generative model with robust policy learning. On the data side, to generate realistic and diverse data, we propose a macro-conditioned GAN-based generator that leverages macroeconomic indicators as primary control variables, synthesizing data with faithful temporal, cross-instrument, and macro correlations. On the policy side, to learn robust policy against market fluctuations, we cast the trading process as a two-player zero-sum Bayesian Markov game, wherein an adversarial agent simulates shifting regimes by perturbing macroeconomic indicators in the macro-conditioned generator, while the trading agent-guided by a quantile belief network-maintains and updates its belief over hidden market states. The trading agent seeks a Robust Perfect Bayesian Equilibrium via Bayesian neural fictitious self-play, stabilizing learning under adversarial market perturbations. Extensive experiments on 9 financial instruments demonstrate that our framework outperforms 9 state-of-the-art baselines. In extreme events like the COVID, our method shows improved profitability and risk management, offering a reliable solution for trading under uncertain and shifting market dynamics.

  • 10 authors
·
Jan 13

Aligning Language Models with Observational Data: Opportunities and Risks from a Causal Perspective

Large language models are being widely used across industries to generate content that contributes directly to key performance metrics, such as conversion rates. Pretrained models, however, often fall short when it comes to aligning with human preferences or optimizing for business objectives. As a result, fine-tuning with good-quality labeled data is essential to guide models to generate content that achieves better results. Controlled experiments, like A/B tests, can provide such data, but they are often expensive and come with significant engineering and logistical challenges. Meanwhile, companies have access to a vast amount of historical (observational) data that remains underutilized. In this work, we study the challenges and opportunities of fine-tuning LLMs using observational data. We show that while observational outcomes can provide valuable supervision, directly fine-tuning models on such data can lead them to learn spurious correlations. We present empirical evidence of this issue using various real-world datasets and propose DeconfoundLM, a method that explicitly removes the effect of known confounders from reward signals. Using simulation experiments, we demonstrate that DeconfoundLM improves the recovery of causal relationships and mitigates failure modes found in fine-tuning methods that ignore or naively incorporate confounding variables. Our findings highlight that while observational data presents risks, with the right causal corrections, it can be a powerful source of signal for LLM alignment. Please refer to the project page for code and related resources.

  • 1 authors
·
May 30, 2025

Navigating the Alpha Jungle: An LLM-Powered MCTS Framework for Formulaic Factor Mining

Alpha factor mining is pivotal in quantitative investment for identifying predictive signals from complex financial data. While traditional formulaic alpha mining relies on human expertise, contemporary automated methods, such as those based on genetic programming or reinforcement learning, often struggle with search inefficiency or yield alpha factors that are difficult to interpret. This paper introduces a novel framework that integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to overcome these limitations. Our framework leverages the LLM's instruction-following and reasoning capability to iteratively generate and refine symbolic alpha formulas within an MCTS-driven exploration. A key innovation is the guidance of MCTS exploration by rich, quantitative feedback from financial backtesting of each candidate factor, enabling efficient navigation of the vast search space. Furthermore, a frequent subtree avoidance mechanism is introduced to enhance search diversity and prevent formulaic homogenization, further improving performance. Experimental results on real-world stock market data demonstrate that our LLM-based framework outperforms existing methods by mining alphas with superior predictive accuracy and trading performance. The resulting formulas are also more amenable to human interpretation, establishing a more effective and efficient paradigm for formulaic alpha mining.

  • 3 authors
·
May 16, 2025

Adding Conditional Control to Diffusion Models with Reinforcement Learning

Diffusion models are powerful generative models that allow for precise control over the characteristics of the generated samples. While these diffusion models trained on large datasets have achieved success, there is often a need to introduce additional controls in downstream fine-tuning processes, treating these powerful models as pre-trained diffusion models. This work presents a novel method based on reinforcement learning (RL) to add such controls using an offline dataset comprising inputs and labels. We formulate this task as an RL problem, with the classifier learned from the offline dataset and the KL divergence against pre-trained models serving as the reward functions. Our method, CTRL (Conditioning pre-Trained diffusion models with Reinforcement Learning), produces soft-optimal policies that maximize the abovementioned reward functions. We formally demonstrate that our method enables sampling from the conditional distribution with additional controls during inference. Our RL-based approach offers several advantages over existing methods. Compared to classifier-free guidance, it improves sample efficiency and can greatly simplify dataset construction by leveraging conditional independence between the inputs and additional controls. Additionally, unlike classifier guidance, it eliminates the need to train classifiers from intermediate states to additional controls. The code is available at https://github.com/zhaoyl18/CTRL.

  • 7 authors
·
Jun 17, 2024

LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations such as syntax variation and option reordering. This leaves open whether the measured properties survive when the surrounding task context changes, as it does in most real deployments. We test this directly across two established pairwise paradigms: ranking country preferences and eliciting utility judgements. In both, we make the deployment context -- the high-level task the model is performing while making concrete value-dependent choices -- our controlled variable, varied across framings such as writing a Reddit post or a news article. Across five LLMs and over 1.2M pairwise decisions, deployment context produces variation far larger than prompt paraphrasing and temperature controls. In country preference rankings over 15 countries, context induces widespread, statistically significant rank shifts; the aggregate Global North favouritism reported in prior work is itself context-dependent, with each model's bias shifting systematically across contexts. In utility elicitation over 50 outcomes, broad cross-category ordering is preserved, but fine-grained rankings within domains vary substantially, and cardinal exchange rates between outcomes (e.g. how many lives in one region equal one in another) shift by a factor of 2.47 at the median. Reported model-level preferences and utilities are therefore better understood as context-conditioned measurements than fixed model-level properties: safety guarantees obtained under one framing provide limited assurance in another.

  • 5 authors
·
Jun 10

Similarity-Distance-Magnitude Universal Verification

We address the neural network robustness problem by adding Similarity (i.e., correctly predicted depth-matches into training)-awareness and Distance-to-training-distribution-awareness to the existing output Magnitude (i.e., decision-boundary)-awareness of the softmax function. The resulting SDM activation function provides strong signals of the relative epistemic (reducible) predictive uncertainty. We use this novel behavior to further address the complementary HCI problem of mapping the output to human-interpretable summary statistics over relevant partitions of a held-out calibration set. Estimates of prediction-conditional uncertainty are obtained via a parsimonious learned transform over the class-conditional empirical CDFs of the output of a final-layer SDM activation function. For decision-making and as an intrinsic model check, estimates of class-conditional accuracy are obtained by further partitioning the high-probability regions of this calibrated output into class-conditional, region-specific CDFs. The uncertainty estimates from SDM calibration are remarkably robust to test-time distribution shifts and out-of-distribution inputs; incorporate awareness of the effective sample size; provide estimates of uncertainty from the learning and data splitting processes; and are well-suited for selective classification and conditional branching for additional test-time compute based on the predictive uncertainty, as for selective LLM generation, routing, and composition over multiple models and retrieval. Finally, we construct SDM networks, LLMs with uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties. We provide open-source software implementing these results.

  • 1 authors
·
Feb 27, 2025

When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should improve simulation fidelity. We argue that this assumption can fail when the objective is not to solve a strategic problem, but to sample plausible boundedly rational behavior. In such settings, reasoning-enhanced models can become better solvers and worse simulators: they can over-optimize for strategically dominant actions, collapse compromise-oriented terminal behavior, and sometimes exhibit a diversity-without-fidelity pattern in which local variation survives without outcome-level fidelity. We study this solver-sampler mismatch in three multi-agent negotiation environments adapted from earlier simulation work: an ambiguous fragmented-authority trading-limits scenario, an ambiguous unified-opposition trading-limits scenario, and a new-domain grid-curtailment case in emergency electricity management. We compare three reflection conditions, no reflection, bounded reflection, and native reasoning, across two primary model families and then extend the same protocol to direct OpenAI runs with GPT-4.1 and GPT-5.2. Across all three experiments, bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. In the direct OpenAI extension, GPT-5.2 native ends in authority decisions in 45 of 45 runs across the three experiments, while GPT-5.2 bounded recovers compromise outcomes in every environment. The contribution is not a claim that reasoning is generally harmful. It is a methodological warning: model capability and simulation fidelity are different objectives, and behavioral simulation should qualify models as samplers, not only as solvers.

  • 1 authors
·
Apr 11 2

Harnessing Earnings Reports for Stock Predictions: A QLoRA-Enhanced LLM Approach

Accurate stock market predictions following earnings reports are crucial for investors. Traditional methods, particularly classical machine learning models, struggle with these predictions because they cannot effectively process and interpret extensive textual data contained in earnings reports and often overlook nuances that influence market movements. This paper introduces an advanced approach by employing Large Language Models (LLMs) instruction fine-tuned with a novel combination of instruction-based techniques and quantized low-rank adaptation (QLoRA) compression. Our methodology integrates 'base factors', such as financial metric growth and earnings transcripts, with 'external factors', including recent market indices performances and analyst grades, to create a rich, supervised dataset. This comprehensive dataset enables our models to achieve superior predictive performance in terms of accuracy, weighted F1, and Matthews correlation coefficient (MCC), especially evident in the comparison with benchmarks such as GPT-4. We specifically highlight the efficacy of the llama-3-8b-Instruct-4bit model, which showcases significant improvements over baseline models. The paper also discusses the potential of expanding the output capabilities to include a 'Hold' option and extending the prediction horizon, aiming to accommodate various investment styles and time frames. This study not only demonstrates the power of integrating cutting-edge AI with fine-tuned financial data but also paves the way for future research in enhancing AI-driven financial analysis tools.

  • 10 authors
·
Aug 13, 2024

An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation

The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we present a unifying framework with topics in causal inference to make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce bias in causal effect estimation arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs) -- sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for mitigating confounding bias and improving predictive performance across interventions even when certain IV properties are relaxed. Finally, we cast parameterized DA as an IVL regression problem and show that when used in composition can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.

  • 5 authors
·
Oct 28, 2025 1