arxiv:2602.23546

Humans and LLMs Diverge on Probabilistic Inferences

Published on Feb 26

Submitted by Gaurav Kamath on Mar 4

McGill NLP Group

Abstract

AI-generated summary: The ProbCOPA dataset reveals that LLMs consistently fail to replicate human probabilistic reasoning patterns in open-ended inferences, despite strong performance on logical and mathematical tasks.

Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25 to 30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
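
As a rough illustration of what comparing a human response distribution with a model response distribution for a single inference could look like, the sketch below computes a one-dimensional earth mover's (Wasserstein) distance between two rating histograms. Everything here is an assumption made for illustration: the abstract does not state the rating scale or the comparison metric used in the paper, so the 5-point scale (`BINS`), the helper names `rating_histogram` and `wasserstein_1d`, and the ratings themselves are hypothetical.

```python
from collections import Counter

# Hypothetical 5-point likelihood scale; the paper's actual scale is not
# given in the abstract.
BINS = [1, 2, 3, 4, 5]


def rating_histogram(ratings, bins):
    """Return the share of ratings falling in each bin, in bin order."""
    counts = Counter(ratings)
    total = len(ratings)
    return [counts[b] / total for b in bins]


def wasserstein_1d(p, q):
    """Earth mover's distance between two histograms over the same
    ordered, unit-spaced bins: the sum of absolute CDF differences."""
    cdf_gap = 0.0
    distance = 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i
        distance += abs(cdf_gap)
    return distance


# Made-up ratings for one inference: graded, varied human judgments
# versus a model that gives the same answer on every sample.
human_ratings = [2, 3, 3, 4, 4, 4, 5, 3, 2, 4]
model_ratings = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]

human_hist = rating_histogram(human_ratings, BINS)
model_hist = rating_histogram(model_ratings, BINS)

print("human:", human_hist)
print("model:", model_hist)
print("distance:", wasserstein_1d(human_hist, model_hist))
```

A distance of zero would mean the two histograms coincide; larger values mean more probability mass has to be moved further along the scale to turn one distribution into the other. Other divergence measures could be substituted without changing the overall shape of the comparison.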