125 kB

Title: Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems

URL Source: https://arxiv.org/html/2411.01173

Markdown Content: Back to arXiv

This is experimental HTML to improve accessibility. We invite you to report rendering errors. Use Alt+Y to toggle on accessible reporting links and Alt+Shift+Y to toggle off. Learn more about this project and help improve conversions.

Why HTML? Report Issue Back to Abstract Download PDF Abstract 1Introduction 2Related Work 3Solving BPs with MLLMs 4Bongard-RWR 5Experiments 6Conclusions References

HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

failed: pgfkeys failed: fvextra failed: spverbatim

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: arXiv.org perpetual non-exclusive license arXiv:2411.01173v2 [cs.AI] 23 Jun 2025 Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems Mikołaj Małkiński Szymon Pawlonka Jacek Mańdziuk Abstract

Abstract visual reasoning (AVR) involves discovering shared concepts across images through analogy, akin to solving IQ test problems. Bongard Problems (BPs) remain a key challenge in AVR, requiring both visual reasoning and verbal description. We investigate whether multimodal large language models (MLLMs) can solve BPs by formulating a set of diverse MLLM-suited solution strategies and testing 4 proprietary and 4 open-access models on 3 BP datasets featuring synthetic (classic BPs) and real-world (Bongard HOI and Bongard-OpenWorld) images. Despite some successes on real-world datasets, MLLMs struggle with synthetic BPs. To explore this gap, we introduce Bongard-RWR, a dataset representing synthetic BP concepts using real-world images. Our findings suggest that weak MLLM performance on classical BPs is not due to the domain specificity, but rather comes from their general AVR limitations. Code and dataset are available at: https://github.com/pavonism/bongard-rwr

Multimodal Large Language Models, Abstract Visual Reasoning, Bongard Problems 1Introduction

Analogy-making is a critical aspect of human cognition, tightly linked with fluid intelligence, the capacity to apply learned skills in novel settings (Lake et al., 2017). Several approaches have been proposed to build systems capable of making analogies. Notably, the structure-mapping theory explores methods for discovering structural correspondences between pre-existing object representations (Winston, 1982; Gentner, 1983; Carbonell, 1983; Falkenhainer et al., 1989; Holyoak & Thagard, 1989). However, these approaches often overlook the perceptual aspect, assuming object representations are already given. Chalmers et al. (1992) highlight that forming useful representations is an intricate challenge. In particular, perception is not merely a passive reception of sensory data, but rather an active interpretation influenced by prior knowledge. This process involves the detection of patterns, recognition of analogies, and abstraction of concepts. The resultant representations may vary significantly depending on the context, which underscores the importance of modeling perception and cognition jointly (Hofstadter, 1995).

(a)Bongard Problem # ⁢ 31 (b)Bongard-RWR # ⁢ 31 (c)Bongard Problem # ⁢ 36 (d)Bongard-RWR # ⁢ 36 Figure 1:Bongard Problems. (a), (c): Manually-designed synthetic BPs (Bongard, 1970). (b), (d): Their real-world equivalents from the Bongard-RWR dataset proposed in this work. BP # ⁢ 31 : Left: One line. Right: Two lines. BP # ⁢ 36 : Left: Triangle above circle. Right: Circle above triangle.

Multiple problems that necessitate combined perception and reasoning have been identified (Hofstadter, 1999). Among these tasks are Bongard Problems (BPs), introduced by Bongard (1968; 1970). Initial BPs were designed manually, leading to the formulation of a few hundred task instances by individual contributors (Foundalis, 2006b). A typical BP consists of two sides, left and right, each comprising six image panels arranged in a grid (Figs. 1(a), 1(c)). All images on one side illustrate a shared concept absent in the images on the opposite side. The task is to identify the underlying rule that differentiates the sides and articulate it in natural language. Initial BPs (Bongard, 1968), akin to human IQ tests, featured abstract 2D geometric shapes, putting the focus on abstract reasoning. However, recent works have expanded the set of BPs to include real-world images, which broadens the scope of presented objects, attributes and relations. Specifically, the matrices in Bongard HOI (Jiang et al., 2022) depict human-object interactions, while Bongard-OpenWorld (Wu et al., 2024) employs open-world free-form concepts, increasing the diversity of featured scenes.

A central theme in BPs is recognition of concepts in a context-dependent manner, as object representations need to be formed specifically for the presented matrix, rather than described a priori (Linhares, 2000). For example, consider the matrix in Fig. 1(a) – an analysis restricted to its left side may yield multiple concepts, such as the presence of curves or an object centered in the image. Only through a comprehensive understanding of both matrix sides one can recognize that the left side depicts a single line, while the right side presents two lines. Such concept-based tasks are argued to offer a more accurate evaluation of the model’s generalization ability and its capacity for abstraction (Mitchell, 2021; Odouard & Mitchell, 2022). In contrast to other abstract reasoning problems, such as Raven’s Progressive Matrices (RPMs) (Raven, 1936; Raven & Court, 1998; Małkiński & Mańdziuk, 2025a) that have recently witnessed the development of large-scale benchmarks (Barrett et al., 2018; Zhang et al., 2019; Małkiński & Mańdziuk, 2025c), BPs allow to assess model’s ability to derive concepts from a limited set of examples (typically six images per matrix side), which positions the task within a few-shot learning setting (Fei-Fei et al., 2006; Wang et al., 2020). The above aspects make BPs a valuable testbed for assessing abstract reasoning abilities of AI models.

Motivation.

The quest to build systems capable of forming abstract concepts dates back to the 1950s (McCarthy et al., 2006). The advent of Deep Learning (DL) opened new possibilities to tackle BPs (Kharagorgiev, 2018; Nie et al., 2020). However, despite significant advancements, methods for consistently solving BPs (and other problems that involve abstract reasoning) are still lacking (Mitchell, 2021; van der Maas et al., 2021; Stabinger et al., 2021). Typically, DL approaches omit the generation of natural language answers by casting BP into a binary classification task, in which a test image had to be assigned to the matching side of the matrix. Conversely, a parallel stream of research on large language models (LLMs) demonstrated promising results in open-ended language generation (Brown et al., 2020). In particular, LLMs were applied to selected AVR tasks (Webb et al., 2023), though, lately Xu et al. (2024) pointed certain LLM limitations in solving AVR problems represented as text despite using information lossless translation through direct-grid encoding. Recent works have combined the vision and language modalities into multimodal large language models (MLLMs) (Achiam et al., 2023; Reid et al., 2024; Anthropic, 2024), inviting their application to diverse tasks (Yin et al., 2024; Wu et al., 2023). Motivated by these recent developments we examine the reasoning capabilities of MLLMs in solving BPs.

Contribution.

The contribution of this paper is fourfold.

(1) We consider BPs in the context of MLLMs and propose a diverse set of strategies to solve BP instances in two setups: open-ended language generation and binary classification.

(2) We evaluate 4 state-of-the-art proprietary MLLMs and 4 open MLLMs on both synthetic and real-world BPs, and identify their severe abstract reasoning limitations.

(3) To further examine the observed MLLM limitations in solving both synthetic and real-world BPs, we introduce a focused dataset of BPs (Bongard-RWR, Figs. 1(b), 1(d)) that represent concepts from synthetic BPs using real world images. Thanks to relying on the same abstract concepts as synthetic BPs, Bongard-RWR facilitates direct comparisons of the MLLM performance in both domains.

(4) We perform a detailed comparative analysis of 8 MLLMs on Bongard-RWR vs. synthetic BPs, shedding light on the reasons of their generally poor performance.

2Related Work AVR Tasks.

The AVR field encompasses a broad set of problems aimed at studying various aspects of visual cognition (Gardner & Richards, 2006; Małkiński & Mańdziuk, 2023). Recent DL research in this domain gravitated towards utilizing certain well-established datasets, e.g. with visual analogies (Hill et al., 2019; Webb et al., 2020) or RPMs (Zhang et al., 2019; Barrett et al., 2018), to measure the progress of DL models (Małkiński & Mańdziuk, 2024a, b, 2025b). However, such benchmarks evaluate system performance in learning a particular task, rather than assessing its general ability to acquire new AVR skills. To address this limitation, certain tasks have adopted few-shot learning setups, requiring models to learn from a few demonstrations, as exemplified by SVRT (Fleuret et al., 2011) or Bongard-LOGO (Nie et al., 2020). Nonetheless, these benchmarks follow a discriminative setting where a set of possible answers is provided. Conversely, other datasets such as ARC (Chollet, 2019) or PQA (Qi et al., 2021) pose a generative challenge, which may be considered more difficult due to its open-ended nature. In addition to synthetic tasks featuring 2D geometric shapes, certain datasets present analogous reasoning tasks using real-world images (Teney et al., 2020; Ichien et al., 2021; Bitton et al., 2023). This approach extends the range of concepts that can be expressed and, above all, allows employing models pre-trained on large image datasets. In this work, we concentrate on several BP datasets that present a few-shot learning challenge, cover both synthetic and real-world images, and consider settings involving both binary classification and answer generation in natural language.

Approaches to solve BPs.

Initial approaches to tackle BPs involved cognitive architectures (Foundalis, 2006a), program synthesis coupled with inductive logic programming (Saito & Nakano, 1996; Sonwane et al., 2021), and the application of Bayesian inference within a visual language framework (Depeweg et al., 2018, 2024). (Kharagorgiev, 2018) trained a convolutional network on a generated synthetic dataset with geometric shapes and applied a one-level decision tree to solve BPs framed as a binary classification task. Nie et al. (2020) introduced Bongard-LOGO with synthetically generated BPs and used it to evaluate CNN-based models focused on meta-learning (Snell et al., 2017; Mishra et al., 2018; Lee et al., 2019; Raghu et al., 2020; Chen et al., 2021) and relational reasoning (Barrett et al., 2018). Jiang et al. (2022) applied the Relation Network (Santoro et al., 2017) to objects detected with Faster R-CNN (Ren et al., 2015) and employed the model to solve matrices from the real world Bongard HOI dataset. Despite high diversity of approaches, none of them has fully addressed the abstract and open-ended nature of BPs. Most related to our work, Wu et al. (2024) considered hybrid approaches that caption each image panel with an image-to-text model and applied LLMs for processing these text descriptions. Differently, in this work we focus on MLLMs that are inherently capable of jointly processing images and text.

Abstract reasoning of MLLMs.

While the abstract reasoning capabilities of MLLMs have only recently begun to be explored, they have been studied more extensively in several related tasks. Initial works focused on LLMs and evaluating their abstract reasoning performance in simplified analogy tasks. Webb et al. (2023) showed that GPT-3 and GPT-4 (text-only variants) performed on the human level, or even outcompeted humans, in certain RPM-like tasks in a zero-shot manner without additional fine-tuning. However, they represented the image objects as text using a fixed small vocabulary, thus omitting the need for identifying concepts from open-ended shapes, a key challenge of BPs. Recent research concerning the evaluation of abstract reasoning skills of LLMs concentrates around the Abstraction and Reasoning Corpus (ARC) task (Chollet, 2019). It was demonstrated that LLMs can solve certain ARC problems transformed to the text domain (Moskvichev et al., 2023; Mirchandani et al., 2023; Camposampiero et al., 2023; Xu et al., 2024). Despite these important stepping stones, the text-based representation taken in these works simplifies the perception task by presenting the model with pre-existing higher level representations. Only recently, thanks to the appearance of MLLMs, vision and text started to be treated jointly in a unified manner. Perception Test (Patraucean et al., 2023) assessed multimodal video understanding across various skills and types of reasoning. Cao et al. (2024) proposed a suite of AVR tasks to compare MLLM and human performance. Jiang et al. (2024) assessed AVR skills of MLLMs on an introduced multidimensional benchmark combining AVR and perceptual questions. KiVA (Yiu et al., 2025) evaluated how models generalize visual transformations to analogical reasoning tasks inspired by the developmental psychology. Most relevant to our paper is the concurrent work of Wüst et al. (2024), which investigated the capabilities of MLLMs in solving synthetic BPs, using both open-ended text generation and multi-class concept classification setups. Our work complements this stream of research by exploring BPs and providing comparative insights into MLLM analogy-making performance in both synthetic and real-world domains.

MLLM 𝑦 ^ (a) Direct . . . MLLM . . . MLLM MLLM 𝑦 ^ (b) Descriptive(-direct) . . . MLLM MLLM . . . MLLM 𝑦 ^ 𝐿 𝑦 ^ 𝑅 MLLM 𝑦 ^ Shared context per side (c) Descriptive-iterative ( ) . . . . . . ( ) MLLM MLLM . . . MLLM 𝑦 ^ (d) Contrastive(-direct) ( ) ( ) . . . . . . ( ) MLLM MLLM MLLM . . . 𝑦 ^ Shared context (e) Contrastive-iterative Figure 2:Generation strategies. Direct (a) feeds the image of the whole matrix to the model. Descriptive (b), Contrastive (d), and their iterative variants, (c) and (e), present individual image panels to the model in a fixed order. Their direct variants, (b) and (d), additionally include the image of the whole matrix. Grey background marks a sequence of requests run in a single context window. 3Solving BPs with MLLMs

In this paper, we propose a set of novel strategies for solving BPs using MLLMs. Definition of each strategy includes the input on which the model operates and the sequence of reasoning steps performed by the model. A high-level overview of these methods is provided in Fig. 2. In the main tested setting, we follow the initial BP formulation that requires providing an answer in natural language, and propose a model-based approach to automatically evaluate such model predictions. In addition, we consider simpler formulations of the problem, casting it into a binary classification framework that enables detailed evaluation of AVR abilities of the tested MLLMs. An illustration of these evaluation settings is presented in Fig. 3. In what follows, let ℬ ⁢ 𝒫 𝑋

{ ℒ 𝑋 , ℛ 𝑋 , 𝑦 𝑋 } denote a BP instance ( 𝑋 ∈ 𝒩 is an index), composed of ℒ 𝑋

{ 𝐿 1 𝑋 , … , 𝐿 6 𝑋 } left and ℛ 𝑋

{ 𝑅 1 𝑋 , … , 𝑅 6 𝑋 } right panels, resp., and its concept 𝑦 𝑋 expressed in natural language.

3.1Answer Generation in Natural Language

We start by defining the strategies for generating answers in natural language. In each strategy, the model receives a general description of Bongard Problem with two BP examples with correct answers. Additionally, besides this generic introductory information, a given task ℬ ⁢ 𝒫 𝑋 to be solved is presented in a strategy-specific way. Appendix I.4 presents the exact prompt formulations.

Direct (Fig. 2a). The model receives an image presenting ℬ ⁢ 𝒫 𝑋 and is asked to directly formulate an answer (i.e., describe the difference between ℒ 𝑋 and ℛ 𝑋 panels in natural language).

Descriptive (Fig. 2b). Defines a more granular approach in which the model is first requested to generate a textual description of each image panel of the matrix. Each description is generated in a separate context, such that the model doesn’t have access to the prior panels nor to their descriptions. Next, the model is requested to provide an answer to the problem based only on the generated textual descriptions of all image panels. In this strategy and in all subsequent ones, we evaluate only the final answer provided by the model, without considering the intermediate outputs.

Descriptive-iterative (Fig. 2c). Evaluates the role of the reasoning context and utilizes a context window comprising the dialog history concerning all images in the given side of the problem. After generating the description of the first image, the model iteratively refines its output based on subsequent images from the same side. Based on the textual descriptions of both sides of the problem, the model is requested to provide the final answer.

Descriptive-direct (Fig. 2b with a dashed element). In both above Descriptive strategies, the model is never presented with the image of the whole matrix ℬ ⁢ 𝒫 𝑋 . Descriptive-direct strategy extends Descriptive by providing the image of ℬ ⁢ 𝒫 𝑋 along with the textual panel descriptions.

Contrastive (Fig. 2d). A critical aspect of BPs is the focus on forming concepts within the specific context of the matrix ℬ ⁢ 𝒫 𝑋 . It’s often the case that correct identification of the concept governing one side requires analysis of the other side to identify their key differences. In Descriptive strategies, the model provides image descriptions concerning a single problem side ℒ 𝑋 or ℛ 𝑋 without taking into account the images from the other side. Conversely, in the Contrastive strategy, the model is tasked with describing the difference between a pair of corresponding images from both sides of the problem ( 𝐿 1 𝑋 , 𝑅 1 𝑋 ) , … , ( 𝐿 6 𝑋 , 𝑅 6 𝑋 ) . After describing the differences between all six image pairs in separate contexts, the model generates its final answer based on these textual descriptions.

Contrastive-iterative (Fig. 2e). Extends Contrastive by performing all reasoning steps in a single context window, enabling the model to gradually improve its understanding of the rule separating both sides.

Contrastive-direct (Fig. 2d with a dashed element). Extends Contrastive by adding the image of the whole matrix to textual descriptions of differences within each panel pair.

MLLM 𝑦 𝑦 ^ 0 / 1 (a) Evaluate generated description MLLM 𝑦 ¯ 0 / 1 (b) Assess solution correctness MLLM L / R L / R (c) Assign images to sides Figure 3:Evaluation settings. We consider the three settings to solve BPs: (a) the ground-truth answer 𝑦 is paired with a description 𝑦 ^ generated by the MLLM and the model needs to verify if they describe the same concepts; (b) given a possible solution 𝑦 ¯ the model needs to assess whether it’s correct; (c) two test images corresponding to the left (L) and right (R) sides, resp., are randomly shuffled, and the model needs to assign each image from the pair to the proper side of the problem. 3.2Evaluation of Natural Language Solutions

The correct answer to a BP may be formulated in natural language in many different ways. To account for this inherent variability, we utilize a model-based approach to assess whether the generated answer 𝑦 ^ matches the ground-truth 𝑦 . In the proposed setting, an MLLM ensemble receives both 𝑦 ^ and 𝑦 and is requested to output a binary yes/no answer whether both descriptions refer to the same concept (see Fig. 3a). Specifically, we assess the correctness of 𝑦 ^ using all four considered proprietary MLLMs and count it as correct if at least two models agree with this class (see details in Appendix E). In contrast to solving the BP task, which requires abstract reasoning abilities, the task of determining whether two answers refer to the same concept boils down to assessing the semantic similarity between two texts, and MLLMs are known to excel in such setup (Lu et al., 2023).

3.3Binary Classification Formulations

To dive deeper into the AVR capabilities of the studied models, we cast the BP task into three binary classification settings, reducing the task’s difficulty.

Firstly, we provide the model with the image of the whole matrix ℬ ⁢ 𝒫 𝑋 along with a possible solution, and the model is prompted to generate a binary score assessing the correctness of the provided answer (Fig. 3b). Two settings are considered, in which the solution is formed by either the actual ground-truth answer (the expected answer is yes), or by an incorrect answer taken from a different matrix ℬ ⁢ 𝒫 𝒳 , where 𝒳

( 𝑋 + 20 ) mod 100 (the expected answer is no). We manually reviewed the resulting pairings to confirm that the selected negative concept was distinct from the original concept. We refer to these two classification setups as Ground-Truth and Incorrect Label, resp.

Secondly, we follow a setting exemplified in the Bongard-LOGO dataset (Nie et al., 2020) in which two test images have to be classified to different sides of the problem (Fig. 3c). To this end, we take two test images corresponding to the respective sides of the matrix, randomly shuffle the images, and request the model to determine the side to which each image belongs. In synthetic BPs we create the test set by removing the 6th image from each side of the matrix, while in BPs from Bongard HOI, Bongard-OpenWorld and Bongard-RWR we use the additional test images. We refer to this setup as Images to Sides.

Prompts for all three binary classification formulations are listed in Appendix I.3.

4Bongard-RWR

One of the interesting research avenues is to compare the MLLMs performance on synthetic BPs vs. real-world ones. Note, however, that a direct performance comparison on synthetic Bongard dataset vs. real-world Bongard HOI and Bongard-OpenWorld datasets is not meaningful, as these datasets depict different concepts. To enable a meaningful comparison and additionally determine whether the MLLMs performance score is domain-related, we introduce Bongard Real-World Representations, a focused dataset that expresses concepts present in synthetic BPs using real-world images, thus creating their real-life equivalents, as illustrated in Fig. 1. Appendix F contains additional examples.

(a)Synthetic BPs (b)Bongard HOI (c)Bongard-OpenWorld (d)Bongard-RWR Figure 4:Binary classification. The top row presents the results for the Ground-Truth and Incorrect Label strategies, while the bottom row refers to the Images to Sides strategy. Results of a random baseline are marked with a dashed line. For Bongard-RWR, we also report the percentage of correctly solved problems in parentheses. Dataset generation.

For a given instance ℬ ⁢ 𝒫 𝑋 , we first use GPT-4o to describe its underlying concept 𝑦 𝑋 in 𝑁

10 different ways using the prompt listed in Prompt I.1 (Appendix I). We obtain 𝑁 real-world textual descriptions 𝐷 𝑖 𝑋

{ 𝐷 𝑖 𝑋 ⁢ 𝐿 , 𝐷 𝑖 𝑋 ⁢ 𝑅 } , 𝑖

0 , … , 𝑁 − 1 , of each side 𝑆 ∈ { 𝐿 , 𝑅 } . Then, we use image search engine Pexels API (Pexels, 2024) to download 𝑀

15 images per each described side 𝐷 𝑖 𝑋 ⁢ 𝑆 . We employ GPT-4o (see Prompt I.1 in Appendix I) to select only those images that properly illustrate the concept of the respective side and are indeed distinguishable from the alternative concept. We stop the selection procedure after having a set of 𝑇

3 descriptions { 𝐷 𝑖 1 𝑋 , 𝐷 𝑖 2 𝑋 , 𝐷 𝑖 3 𝑋 } , each with 2 appropriate images: 𝐼 𝑘 𝑋 ⁢ 𝑆 ⁢ ( 1 ) , 𝐼 𝑘 𝑋 ⁢ 𝑆 ⁢ ( 2 ) , 𝑘

𝑖 1 , 𝑖 2 , 𝑖 3 per each side 𝑆 ( 6 left and 6 right ones).

To decrease the possibility of generating a problem with a trivial answer, which is more probable if the images come from a common textual description 𝐷 𝑖 𝑋 , the corresponding real-world problem instance ℛ ⁢ 𝒲 ⁢ ℛ 𝑋

{ ℒ 𝑋 , ℛ 𝑋 } is constructed as follows (see Algorithm 1 in Appendix F): 𝒮 𝑋

{ 𝐼 𝑖 ⁢ 1 𝑋 ⁢ 𝑆 ⁢ ( 1 ) , 𝐼 𝑖 ⁢ 2 𝑋 ⁢ 𝑆 ⁢ ( 1 ) , 𝐼 𝑖 ⁢ 3 𝑋 ⁢ 𝑆 ⁢ ( 1 ) , 𝐼 𝑖 ⁢ 1 𝑋 ⁢ 𝑆 ⁢ ( 2 ) , 𝐼 𝑖 ⁢ 2 𝑋 ⁢ 𝑆 ⁢ ( 2 ) , 𝐼 𝑖 ⁢ 3 𝑋 ⁢ 𝑆 ⁢ ( 2 ) } , 𝑆 ∈ { 𝐿 , 𝑅 } .

We run Algorithm 1 for the first 100 synthetic BPs. After applying the exclusion criteria, this led to the generation of 50 instances ℛ ⁢ 𝒲 ⁢ ℛ 𝑋 . However, as we noticed through a manual inspection, some of them were not well depicting the respective problem concept. Hence, we modified the dataset in the following way: 14 problems were entirely removed and, out of the remaining 36 , 24 were adjusted through a manual selection of the images that well represent the considered concept. Furthermore, we extended the dataset by adding 17 problems with manually translated concepts (i.e., with no use of GPT-4o), for which images were also selected manually, and 7 constructed by hand, i.e., by means of making photos of manually-built scenes reflecting the respective concepts.

All in all, out of 60 real-world problems obtained (Bongard-RWR), 12 were generated fully automatically, 24 were generated automatically and then adjusted manually (manual selection of the images), and 24 were constructed entirely manually. Appendix F presents the details.

Dataset variants.

Bongard-RWR was designed to capture real-world complexity, including variations in aspect ratio, resolution, and layout. To complement the base dataset, we developed three additional variants: RWR-S (square images), RWR-G (greyscale images), and RWR-SG (square greyscale images). In the square image variants, we manually cropped all images to square format and rescaled them to 512 × 512 resolution. The BP matrix was resized to 1024 ⁢ px width while maintaining aspect ratio, resulting in a height of 763 ⁢ px .

Table 1:Language generation. The number of correct answers to 100 synthetic BPs, 100 selected BPs from each of Bongard HOI and Bongard-OpenWorld, and all 60 BPs from Bongard-RWR. Three main strategies: Direct, Descriptive, and Contrastive, denoted as Di, De, and Co, resp., are considered. The best result for a given strategy is marked in bold and the second best is underlined. For Bongard-RWR, we also report the percentage of correctly solved problems in parentheses. For comparison, the average human performance on Bongard-RWR is 39.2 correct answers ( 65 % ). Synthetic HOI OpenWorld Bongard-RWR Di De Co Di De Co Di De Co Di De Co GPT-4o 17 17

10 35 42 18 40 46

19 ¯

5 ⁢ ( 8.3 % )

8 ¯ ⁢ ( 13.3 % )

2 ⁢ ( 3.3 % )

GPT-4 Turbo 6

22 45 5

21 57 12

1 ⁢ ( 1.7 % )

5 ⁢ ( 8.3 % )

0 ⁢ ( 0.0 % )

Gemini 1.5 Pro 7 21 17 23

15 ¯

3 ¯ ⁢ ( 5.0 % )

7 ⁢ ( 11.7 % )

1 ⁢ ( 1.7 % )

Claude 3.5 Sonnet 13 ¯

19 ¯

15 ¯

44 ¯

53 ¯ 21 1 ⁢ ( 1.7 % )

13 ⁢ ( 21.7 % )

2 ⁢ ( 3.3 % )

InternVL2-8B 0

0 ⁢ ( 0.0 % )

LLaVA-1.6 Mistral-7B 0

0 ⁢ ( 0.0 % )

Phi-3.5-Vision 0

0 ⁢ ( 0.0 % )

Pixtral 12B 1

28 ¯

33 ¯

1 ⁢ ( 1.7 % )

0 ⁢ ( 0.0 % ) 5Experiments

To evaluate the AVR capabilities of MLLMs, we conduct experiments in two main settings, involving 3 binary classification setups and 7 proposed generation methods. Our evaluation spans a range of MLLMs, including 4 proprietary models accessible via API: GPT-4o, GPT-4 Turbo (Achiam et al., 2023), Gemini 1.5 Pro (Reid et al., 2024), and Claude 3.5 Sonnet (Anthropic, 2024), alongside 4 open-access models run locally on an NVIDIA DGX A100 node: InternVL2-8B (Chen et al., 2024c, b), LLaVA-1.6 Mistral-7B (Liu et al., 2023, 2024; Jiang et al., 2023), Phi-3.5-Vision (Abdin et al., 2024), and Pixtral 12B (MistralAI, 2024). We consider four BP datasets covering both synthetic and real-world images. To balance diversity and computational feasibility, we cap the dataset size at 100 instances. Specifically, we use the first 100 manually constructed (synthetic) BPs from (Bongard, 1970), 100 problem samples from each of the Bongard HOI and Bongard-OpenWorld datasets selected from their test splits through stratified sampling based on problem concept labels, and all 60 instances from Bongard-RWR. Extended results are presented in Appendix C.

Binary classification. Fig. 4 presents the results of binary classification tasks. In the Ground-Truth setting, most proprietary and some open-access models outperform a random classifier baseline. In the Incorrect Label setting, since rejecting incorrect concepts is a generally easier task, most models perform better than in the Ground-Truth setup. However, the consistently shifted performance of open-access models suggests a potential bias toward agreeing or disagreeing with provided concepts. In the Images to Sides task, proprietary models demonstrate strong performance, while Pixtral stands out among open-access models. Nevertheless, binary classification tasks do not fully reveal whether the solver truly grasped the presented concept or simply relied on surface-level similarities, leaving the space for other challenging evaluation setups.

Generative capabilities in the Direct setting. As presented in Table 1, model performance using the Direct generation strategy is generally weak on synthetic BPs, with the best model, GPT-4o, solving only 17 out of 100 problems. This indicates that the models struggle to identify abstract, synthetic concepts and express them in natural language. The challenge is even more apparent on Bongard-RWR, where the best model, GPT-4o, solves only 5 out of 60 problems. Nevertheless, performance improves on Bongard HOI and Bongard-OpenWorld, with best results of 35 / 100 and 40 / 100 , resp. Notably, while GPT-4o achieves the highest scores on these two datasets, Pixtral 12B ranks second ( 28 / 100 and 33 / 100 , resp.), showing that smaller open-access models can still be competitive in this setting. While the better performance on Bongard HOI and Bongard-OpenWorld may be attributed to a higher ratio of real-world images in the training data, the weak results on Bongard-RWR suggest that the discrepancy is more related to the specific underlying concepts than the visual domain as such (see Fig. 14 in Appendix F for the details).

Independent image description. In the next experiment we analyse whether an iterative reasoning approach, in which the model first generates separate captions for each image and then combines them into a final answer, can improve performance. As shown in Table 1, compared to the Direct strategy, the Descriptive one improves the best results across all datasets: from 17 to 21 on synthetic BPs, from 35 to 45 on Bongard HOI, from 40 to 57 on Bongard-OpenWorld, and from 5 to 13 on Bongard-RWR. A clear improvement can be observed individually for each proprietary model. The gain is less pronounced (if at all) for individual open-access models. We hypothesize that this improvement is due to the Descriptive setting being more aligned with the model’s training, where it primarily learns to caption individual images. However, this strategy doesn’t leverage the additional information present in the joint BP image, and certain context-dependent visual features may be missed in captioning. We believe that with further advancements in reasoning over multi-part compositional images, models in the Direct setting should eventually outperform the Descriptive strategy.

Contrastive reasoning. Correct identification of concepts in BPs requires a joint processing of the images from both sides of the problem. The Contrastive strategy evaluates the model ability to extract underlying differentiating concepts within such image pairs. Across all datasets, the models evaluated under the Contrastive strategy perform worse than with the Descriptive strategy (cf. Table 1). This points to the fundamental difference between human and machine approaches to solving AVR tasks. Humans often rely on direct comparisons between image panels from different categories to highlight differences (Nüssli et al., 2009), whereas the tested methods perform better when making comparisons on text-based image descriptions, potentially disregarding critical visual details missed during image captioning. This discrepancy indicates the need for further modeling improvements to fully leverage the Contrastive strategy.

Figure 5:The impact of -direct and -iterative variants. Bars in each group correspond to models in the following order: GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro, Claude 3.5 Sonnet, InternVL2-8B, LLaVA-1.6 Mistral-7B, Phi-3.5-Vision, and Pixtral 12B.

Iterative reasoning. Next, we tested whether preserving responses from past turns in the dialog context could improve concept identification in both Descriptive and Contrastive settings. As shown in Fig. 5, Descriptive-iterative strategy visibly worsens the results compared to its non-iterative counterpart across all datasets and models, except for negligible improvement of InternVL2-8B on synthetic BPs and several cases of a complete failure (accuracy of 0 ) of both strategies. In contrast, Contrastive-iterative brings no improvement over Contrastive in only 5 cases, 2 of synthetic BPs, and 3 of Bongard-OpenWorld. Despite these improvements, Contrastive-iterative is generally inferior to Descriptive (see Table 4, Appendix C). Overall, the study indicates that contemporary models struggle to effectively utilize additional information from the context window.

Multimodal answer generation. In the final experiment, we assessed whether incorporating an image of the entire matrix at the answer generation step would improve the performance of the Descriptive and Contrastive strategies. As shown in Fig. 5, Descriptive-direct shows performance improvements over Descriptive in 12 out of 32 (dataset, model) cases. Contrastive-direct improves upon Contrastive in all (dataset, proprietary model) configurations, and additionally improves in certain (dataset, open-access model) settings. However, despite these gains, Contrastive-direct overall performs worse than Descriptive, except for GPT-4o and InternVL2-8B on synthetic BPs, and InternVL2-8B on Bongard HOI (see Table 4, Appendix C). This suggests that contemporary models are to some extent capable of utilizing additional visual inputs to improve reasoning performance, in particular the newest GPT-4o, which displays improvement from incorporating the -direct extension in 7 out of 8 cases. Nevertheless, further work is needed to improve consistency across all models.

Comparison of prompting strategies. Across all models, the Descriptive strategy achieves the highest scores on Bongard-RWR and Bongard-OpenWorld. In Bongard HOI, it ties with Descriptive-direct, while in synthetic BPs, it ranks just behind its -direct extension. As shown in Appendix G, altogether Descriptive strategies solve the same number of synthetic BPs as Contrastive strategies ( 44 ; Fig. 16), but lead in Bongard HOI ( 82 vs. 63 ; Fig. 18), Bongard-OpenWorld ( 90 vs. 76 ; Fig. 20), and Bongard-RWR ( 20 vs. 11 ; Fig. 22). This overall advantage of Descriptive over Contrastive strategies indicates that current MLLMs perform better with prompting strategies focused on processing single images. This also highlights the need to improve multi-image reasoning capabilities of MLLMs for tasks that require reasoning across multiple images. Figs. 16 – 23 in Appendix G further show that altogether the considered approaches solved 54 , 89 , 93 , and 23 problems from synthetic BPs, Bongard HOI, Bongard-OpenWorld, and Bongard-RWR, resp. This raises the question of whether an ensemble combining all proposed strategies could further enhance model reasoning performance. We leave the exploration of this emerging direction for future work.

Proprietary vs. open-access models. Proprietary models generally outperform open-access ones, leading in 35 out of 40 (dataset, strategy) pairs (see Appendix C, Table 4). The black-box nature of proprietary models makes it challenging to attribute their advantage to specific aspects, whether it be the number of parameters, the size and composition of training data, or the pre- and post-processing methods. However, the recently released Pixtral 12B model performs competitively in multiple settings, occasionally surpassing proprietary models. This highlights the viability of developing competitive MLLMs without sacrificing accessibility. At the same time, a clear performance drop of Pixtral 12B on synthetic BPs and Bongard-RWR suggests its intrinsic weakness in reasoning about abstract concepts, whether reflected in synthetic or real-world manner, similarly to other open models.

Dataset variants. We evaluated the four open-access models on the three binary classification setups presented in Section 3.3 using the additional Bongard-RWR variants. In the Ground-Truth and Incorrect Label settings, all models performed similarly in the original Bongard-RWR and its three new variants. However, in the Images to Sides setting, model performance differed. On RWR-G, most models solved more problems compared to Bongard-RWR, e.g., Phi-3.5-Vision ( + 15 ), suggesting that removing color helped models focus on structural concepts. The difference was also present for RWR-S, e.g., for InternVL2-8B ( + 13 ), Pixtral 12B ( + 9 ), and Phi-3.5-Vision ( + 7 ), likely due to cropping, which increased the signal-to-noise ratio by eliminating empty space between panels. RWR-SG followed a similar pattern. Overall, these additional dataset variants not only make the task more approachable but also help measure robustness to matrix layout and sensitivity to image color. Extended results are provided in Appendix C.

Concept selection. Similarly to the evaluation setup proposed in (Wüst et al., 2024), we conducted a multi-class classification experiment in which the MLLM selects the correct concept from a list of 𝑘 ∈ { 2 , 4 , 8 , 16 } choices. Negative candidates were sourced from subsequent matrices, ensuring they were semantically different from the correct option. Across all four datasets, models achieved the highest accuracy on Bongard HOI and Bongard-OpenWorld (exceeding 90 % in most cases), while synthetic BPs and Bongard-RWR posed a greater challenge (e.g., for InternVL2-8B, 16 % and 25 % resp., for 𝑘

16 ). These results suggest that real-world concepts, such as “Swing tennis racket vs. Not swing tennis racket”, are easier for the models to recognize, likely because they can rely on surface-level object or action recognition to exclude incorrect options. In contrast, synthetic BPs and Bongard-RWR demand recognizing abstract concepts, such as “Convex vs. Nonconvex”, which require a deeper understanding of the images. The gap between Bongard-RWR and synthetic BPs further highlights the additional challenge posed by representing abstract concepts in real-world images. The detailed results are provided in Appendix C.

Model scaling. We investigated the effect of model scaling on abstract reasoning capabilities by evaluating a range of open-access and proprietary models of varying sizes across all four datasets using both the Direct and Descriptive strategies. Our results reveal that while larger models generally perform better, especially among open-access models, smaller variants occasionally outperform their larger counterparts. Nevertheless, even smaller proprietary models (e.g., GPT-4o mini) consistently outperformed the largest open-access alternatives, suggesting that scaling alone may be insufficient to achieve strong abstract reasoning capabilities. This highlights the need for explicit, dedicated efforts to improve abstract reasoning skils of MLLMs in future work. Appendix D provides an in-depth analysis of these findings.

Comparison with state-of-the-art. A direct comparison with the results from (Wu et al., 2024) is challenging due to the different ranges of test problems used in each study. With this caveat, we concentrate on key high-level observations from both works. Wu et al. (2024) primarily focus on a binary classification setting corresponding to the Images to Sides setup in our work. On Bongard-OpenWorld, our best performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieved 96 % accuracy, while their top method, SNAIL—a meta-learning approach leveraging pre-trained OpenCLIP image representations—achieved 64 % . This suggests that MLLMs, which uniformly process images and text, outperform decoupled two-stage approaches, which handle image captioning and text-based reasoning with different models. They also briefly consider a natural language generation task, where models describe concepts presented in the BP instance (Wu et al., 2024, Appendix E). They again use a two-step approach comprising fine-tuned BLIP-2 for image captioning and ChatGPT for concept generation. In contrast, we employ a single MLLM for both tasks. For evaluating free-form concept generation, they use automated text-based metrics, which provide a general measure of text similarity. We, however, employ a voting MLLM ensemble, offering a more direct assessment of solution correctness.

Human results.

We conducted a study on Bongard-RWR with 30 participants, as detailed in Appendix B. Participants were asked to describe BP concepts in an open-ended manner. Their responses were evaluated by expert annotators, who compared them to ground-truth labels and classified as correct or incorrect. Humans solved from 23 to 59 problems, with average of 39.2 , achieving 65 % accuracy. Notably, the lowest number of problems solved by a human participant ( 23 ) exceeded the number of problems solved by all models in total ( 22 , see Fig. 23), which signifies a clear performance gap.

6Conclusions

This paper investigates the reasoning capabilities of proprietary and open-access MLLMs using BPs as a case study. Despite rapid progress, MLLMs still exhibit significant reasoning limitations. Across all proposed answer generation strategies, the best-performing model solved only 22 out of 100 synthetic BPs. On the other hand, model performance improved moderately with real-world concepts, as shown by the results on Bongard HOI and Bongard-OpenWorld. To delve deeper into the performance discrepancies between synthetic and real-world domains, we introduced Bongard-RWR, a new BP dataset designed to represent concepts from synthetic BPs via real-world images. Focused experiments with this dataset suggest that the models’ weak performance on synthetic BPs is not domain-specific but rather indicative of broader limitations in their reasoning abilities. Specifically, MLLMs struggle with recognizing abstract concepts, fail to benefit from a human-like multi-image reasoning approach, demonstrate limitations in utilizing context window effectively, and require further work to consistently integrate text and vision modalities at the answer generation step. On a positive note, experiments conducted in three binary classification settings show that some models achieve encouraging results, suggesting that current limitations may be overcome with future advancements.

Acknowledgements

This research was carried out with the support of the Laboratory of Bioinformatics and Computational Genomics and the High Performance Computing Center of the Faculty of Mathematics and Information Science Warsaw University of Technology. Mikołaj Małkiński was funded by the Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) programme.

Impact Statement

We believe that the limitations of MLLMs discussed in this work may eventually be overcome with the development of improved models and solution strategies. While such advancements could lead to breakthroughs in abstract reasoning, they also present potential risks of misuse. For instance, a method that excels in solving BPs could be exploited for solving online cognitive assessment tests, such as IQ tests. This may allow individuals to unfairly claim having higher cognitive abilities than they actually possess, leading to unethical advantages in job recruitment, academic admissions, or other competitive contexts related to IQ scores.

References Abdin et al. (2024) ↑ Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H., et al.Phi-3 technical report: A highly capable language model locally on your phone.arXiv:2404.14219, 2024. Achiam et al. (2023) ↑ Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.GPT-4 technical report.arXiv:2303.08774, 2023. Anthropic (2024) ↑ Anthropic.The Claude 3 model family: Opus, Sonnet, Haiku.https://www.anthropic.com/claude-3-model-card, 2024.Accessed: 2025-01-30. Bai et al. (2023) ↑ Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J.Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv:2308.12966, 2023. Barrett et al. (2018) ↑ Barrett, D., Hill, F., Santoro, A., Morcos, A., and Lillicrap, T.Measuring abstract reasoning in neural networks.In International Conference on Machine Learning, pp. 511–520. PMLR, 2018. Biscione et al. (2024) ↑ Biscione, V., Yin, D., Malhotra, G., Dujmovic, M., Montero, M. L., Puebla, G., Adolfi, F., Heaton, R. F., Hummel, J. E., Evans, B. D., et al.MindSet: Vision. A toolbox for testing DNNs on key psychological experiments.arXiv:2404.05290, 2024. Bitton et al. (2023) ↑ Bitton, Y., Yosef, R., Strugo, E., Shahaf, D., Schwartz, R., and Stanovsky, G.VASR: Visual analogies of situation recognition.In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 241–249, 2023. Bongard (1968) ↑ Bongard, M. M.The recognition problem.Technical report, Foreign Technology Div Wright-Patterson AFB Ohio, 1968. Bongard (1970) ↑ Bongard, M. M.Pattern Recognition.Spartan Books, 1970. Brown et al. (2020) ↑ Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D.Language models are few-shot learners.In Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901, 2020. Camposampiero et al. (2023) ↑ Camposampiero, G., Houmard, L., Estermann, B., Mathys, J., and Wattenhofer, R.Abstract visual reasoning enabled by language.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2643–2647, 2023. Cao et al. (2024) ↑ Cao, X., Lai, B., Ye, W., Ma, Y., Heintz, J., Chen, J., Cao, J., and Rehg, J. M.What is the visual cognition gap between humans and multimodal LLMs?arXiv:2406.10424, 2024. Carbonell (1983) ↑ Carbonell, J. G.Learning by analogy: Formulating and generalizing plans from past experience.In Machine learning, pp. 137–161. Elsevier, 1983. Chalmers et al. (1992) ↑ Chalmers, D. J., French, R. M., and Hofstadter, D. R.High-level perception, representation, and analogy: A critique of artificial intelligence methodology.Journal of Experimental & Theoretical Artificial Intelligence, 4(3):185–211, 1992. Chen et al. (2021) ↑ Chen, Y., Liu, Z., Xu, H., Darrell, T., and Wang, X.Meta-baseline: Exploring simple meta-learning for few-shot learning.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9062–9071, 2021. Chen et al. (2024a) ↑ Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv:2412.05271, 2024a. Chen et al. (2024b) ↑ Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024b. Chen et al. (2024c) ↑ Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24185–24198, 2024c. Chollet (2019) ↑ Chollet, F.On the measure of intelligence.arXiv:1911.01547, 2019. Depeweg et al. (2018) ↑ Depeweg, S., Rothkopf, C. A., and Jäkel, F.Solving bongard problems with a visual language and pragmatic reasoning.arXiv:1804.04452, 2018. Depeweg et al. (2024) ↑ Depeweg, S., Rothkopf, C. A., and Jäkel, F.Solving bongard problems with a visual language and pragmatic constraints.Cognitive Science, 48(5):e13432, 2024. Falkenhainer et al. (1989) ↑ Falkenhainer, B., Forbus, K. D., and Gentner, D.The structure-mapping engine: Algorithm and examples.Artificial Intelligence, 41(1):1–63, 1989. Fei-Fei et al. (2006) ↑ Fei-Fei, L., Fergus, R., and Perona, P.One-shot learning of object categories.IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006. Fleuret et al. (2011) ↑ Fleuret, F., Li, T., Dubout, C., Wampler, E. K., Yantis, S., and Geman, D.Comparing machines and humans on a visual categorization test.Proceedings of the National Academy of Sciences, 108(43):17621–17625, 2011. Foundalis (2006a) ↑ Foundalis, H. E.Phaeaco: A cognitive architecture inspired by Bongard’s problems.PhD dissertation, Indiana University, 2006a. Foundalis (2006b) ↑ Foundalis, H. E.Index of bongard problems.http://www.foundalis.com/res/bps/bpidx.htm, 2006b.Accessed: 2025-01-30. Galatzer-Levy et al. (2024) ↑ Galatzer-Levy, I. R., Munday, D., McGiffin, J., Liu, X., Karmon, D., Labzovsky, I., Moroshko, R., Zait, A., and McDuff, D.The cognitive capabilities of generative AI: A comparative analysis with human benchmarks.arXiv:2410.07391, 2024. Gardner & Richards (2006) ↑ Gardner, M. and Richards, D.The colossal book of short puzzles and problems.Norton, 2006. Gentner (1983) ↑ Gentner, D.Structure-mapping: A theoretical framework for analogy.Cognitive Science, 7(2):155–170, 1983. Hill et al. (2019) ↑ Hill, F., Santoro, A., Barrett, D., Morcos, A., and Lillicrap, T.Learning to make analogies by contrasting abstract relational structure.In International Conference on Learning Representations, 2019. Hoffmann et al. (2022) ↑ Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J. W., and Sifre, L.Training compute-optimal large language models.In Advances in Neural Information Processing Systems, volume 35, pp. 30016–30030, 2022. Hofstadter (1995) ↑ Hofstadter, D. R.Fluid concepts and creative analogies: Computer models of the fundamental mechanisms of thought.Basic books, 1995. Hofstadter (1999) ↑ Hofstadter, D. R.Gödel, Escher, Bach: an eternal golden braid.Basic books, 1999. Holyoak & Thagard (1989) ↑ Holyoak, K. J. and Thagard, P.Analogical mapping by constraint satisfaction.Cognitive Science, 13(3):295–355, 1989. Ichien et al. (2021) ↑ Ichien, N., Liu, Q., Fu, S., Holyoak, K. J., Yuille, A., and Lu, H.Visual analogy: Deep learning versus compositional models.arXiv:2105.07065, 2021. Jiang et al. (2023) ↑ Jiang, A., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al.Mistral 7B (2023).arXiv:2310.06825, 2023. Jiang et al. (2022) ↑ Jiang, H., Ma, X., Nie, W., Yu, Z., Zhu, Y., and Anandkumar, A.Bongard-HOI: Benchmarking few-shot visual reasoning for human-object interactions.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19056–19065, 2022. Jiang et al. (2024) ↑ Jiang, Y., Sun, K., Sourati, Z., Ahrabian, K., Ma, K., Ilievski, F., Pujara, J., et al.MARVEL: Multidimensional abstraction and reasoning through visual evaluation and learning.In Advances in Neural Information Processing Systems, volume 37, pp. 46567–46592, 2024. Kaplan et al. (2020) ↑ Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D.Scaling laws for neural language models.arXiv:2001.08361, 2020. Kharagorgiev (2018) ↑ Kharagorgiev, S.Solving bongard problems with deep learning.https://k10v.github.io/2018/02/25/Solving-Bongard-problems-with-deep-learning/, February 2018.Accessed: 2025-01-30. Lake et al. (2017) ↑ Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J.Building machines that learn and think like people.Behavioral and Brain Sciences, 40:e253, 2017. Lee et al. (2019) ↑ Lee, K., Maji, S., Ravichandran, A., and Soatto, S.Meta-learning with differentiable convex optimization.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10657–10665, 2019. Linhares (2000) ↑ Linhares, A.A glimpse at the metaphysics of bongard problems.Artificial Intelligence, 121(1-2):251–270, 2000. Liu et al. (2023) ↑ Liu, H., Li, C., Wu, Q., and Lee, Y. J.Visual instruction tuning.In Advances in Neural Information Processing Systems, volume 36, pp. 34892–34916, 2023. Liu et al. (2024) ↑ Liu, H., Li, C., Li, Y., and Lee, Y. J.Improved baselines with visual instruction tuning.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024. Lu et al. (2023) ↑ Lu, Y., Yang, X., Li, X., Wang, X. E., and Wang, W. Y.LLMScore: unveiling the power of large language models in text-to-image synthesis evaluation.In Advances in Neural Information Processing Systems, volume 36, pp. 23075–23093, 2023. Małkiński & Mańdziuk (2023) ↑ Małkiński, M. and Mańdziuk, J.A review of emerging research directions in abstract visual reasoning.Information Fusion, 91:713–736, 2023. Małkiński & Mańdziuk (2024a) ↑ Małkiński, M. and Mańdziuk, J.Multi-label contrastive learning for abstract visual reasoning.IEEE Transactions on Neural Networks and Learning Systems, 35(2):1941–1953, 2024a. Małkiński & Mańdziuk (2024b) ↑ Małkiński, M. and Mańdziuk, J.One self-configurable model to solve many abstract visual reasoning problems.In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 14297–14305, 2024b. Małkiński & Mańdziuk (2025a) ↑ Małkiński, M. and Mańdziuk, J.Deep learning methods for abstract visual reasoning: A survey on raven’s progressive matrices.ACM Computing Surveys, 57(7):1–36, 2025a. Małkiński & Mańdziuk (2025b) ↑ Małkiński, M. and Mańdziuk, J.Advancing generalization across a variety of abstract visual reasoning tasks.In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, 2025b.(Accepted). Małkiński & Mańdziuk (2025c) ↑ Małkiński, M. and Mańdziuk, J.A-I-RAVEN and I-RAVEN-Mesh: Two new benchmarks for abstract visual reasoning.In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, 2025c.(Accepted). McCarthy et al. (2006) ↑ McCarthy, J., Minsky, M. L., Rochester, N., and Shannon, C. E.A proposal for the dartmouth summer research project on artificial intelligence, august 31, 1955.AI magazine, 27(4):12–12, 2006. Mirchandani et al. (2023) ↑ Mirchandani, S., Xia, F., Florence, P., Ichter, B., Driess, D., Arenas, M. G., Rao, K., Sadigh, D., and Zeng, A.Large language models as general pattern machines.In Proceedings of the 7th Conference on Robot Learning (CoRL), 2023. Mishra et al. (2018) ↑ Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P.A simple neural attentive meta-learner.In International Conference on Learning Representations, 2018. MistralAI (2024) ↑ MistralAI.Announcing Pixtral 12B.https://mistral.ai/news/pixtral-12b/, 2024.Accessed: 2025-01-30. Mitchell (2021) ↑ Mitchell, M.Abstraction and analogy-making in artificial intelligence.Annals of the New York Academy of Sciences, 1505(1):79–101, 2021. Mitchell et al. (2023) ↑ Mitchell, M., Palmarini, A. B., and Moskvichev, A. K.Comparing humans, GPT-4, and GPT-4v on abstraction and reasoning tasks.In AAAI 2024 Workshop on “Are Large Language Models Simply Causal Parrots?”, 2023. Moskvichev et al. (2023) ↑ Moskvichev, A. K., Odouard, V. V., and Mitchell, M.The conceptARC benchmark: Evaluating understanding and generalization in the ARC domain.Transactions on Machine Learning Research, 2023.ISSN 2835-8856. Nie et al. (2020) ↑ Nie, W., Yu, Z., Mao, L., Patel, A. B., Zhu, Y., and Anandkumar, A.Bongard-LOGO: A new benchmark for human-level concept learning and reasoning.In Advances in Neural Information Processing Systems, volume 33, pp. 16468–16480, 2020. Nüssli et al. (2009) ↑ Nüssli, M.-A., Jermann, P., Sangin, M., and Dillenbourg, P.Collaboration and abstract representations: towards predictive models based on raw speech and eye-tracking data.In CSCL’09: Proceedings of the 2009 conference on Computer support for collaborative learning. International Society of the Learning Sciences, 2009. Odouard & Mitchell (2022) ↑ Odouard, V. V. and Mitchell, M.Evaluating understanding on conceptual abstraction benchmarks.arXiv:2206.14187, 2022. Patraucean et al. (2023) ↑ Patraucean, V., Smaira, L., Gupta, A., Recasens, A., Markeeva, L., Banarse, D., Koppula, S., Malinowski, M., Yang, Y., Doersch, C., et al.Perception Test: A diagnostic benchmark for multimodal video models.In Advances in Neural Information Processing Systems, volume 36, pp. 42748–42761, 2023. Pexels (2024) ↑ Pexels.Pexels API.https://www.pexels.com/api, 2024.Accessed: 2025-01-30. Qi et al. (2021) ↑ Qi, Y., Zhang, K., Sain, A., and Song, Y.-Z.PQA: Perceptual question answering.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12056–12064, 2021. Raghu et al. (2020) ↑ Raghu, A., Raghu, M., Bengio, S., and Vinyals, O.Rapid learning or feature reuse? Towards understanding the effectiveness of MAML.In International Conference on Learning Representations, 2020. Raven (1936) ↑ Raven, J. C.Mental tests used in genetic studies: The performance of related individuals on tests mainly educative and mainly reproductive.Master’s thesis, University of London, 1936. Raven & Court (1998) ↑ Raven, J. C. and Court, J. H.Raven’s progressive matrices and vocabulary scales.Oxford pyschologists Press Oxford, England, 1998. Reid et al. (2024) ↑ Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J.-b., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., et al.Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv:2403.05530, 2024. Ren et al. (2015) ↑ Ren, S., He, K., Girshick, R., and Sun, J.Faster R-CNN: towards real-time object detection with region proposal networks.In Advances in Neural Information Processing Systems, volume 28, pp. 91–99, 2015. Saito & Nakano (1996) ↑ Saito, K. and Nakano, R.A concept learning algorithm with adaptive search.In Machine intelligence 14: applied machine intelligence, pp. 347–363, 1996. Santoro et al. (2017) ↑ Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T.A simple neural network module for relational reasoning.In Advances in Neural Information Processing Systems, volume 30, pp. 4967–4976, 2017. Snell et al. (2017) ↑ Snell, J., Swersky, K., and Zemel, R.Prototypical networks for few-shot learning.In Advances in Neural Information Processing Systems, volume 30, pp. 4080–4090, 2017. Sonwane et al. (2021) ↑ Sonwane, A., Chitlangia, S., Dash, T., Vig, L., Shroff, G., and Srinivasan, A.Using program synthesis and inductive logic programming to solve bongard problems.arXiv:2110.09947, 2021. Stabinger et al. (2021) ↑ Stabinger, S., Peer, D., Piater, J., and Rodríguez-Sánchez, A.Evaluating the progress of deep learning for visual relational concepts.Journal of Vision, 21(11):8–8, 2021. Teney et al. (2020) ↑ Teney, D., Wang, P., Cao, J., Liu, L., Shen, C., and van den Hengel, A.V-PROM: A benchmark for visual reasoning using visual progressive matrices.In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 12071–12078, 2020. van der Maas et al. (2021) ↑ van der Maas, H. L., Snoek, L., and Stevenson, C. E.How much intelligence is there in artificial intelligence? a 2020 update.Intelligence, 87:101548, 2021. Wang et al. (2024) ↑ Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., and Lin, J.Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution.arXiv:2409.12191, 2024. Wang et al. (2020) ↑ Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M.Generalizing from a few examples: A survey on few-shot learning.ACM Computing Surveys, 53(3):1–34, 2020. Webb et al. (2020) ↑ Webb, T., Dulberg, Z., Frankland, S., Petrov, A., O’Reilly, R., and Cohen, J.Learning representations that support extrapolation.In International Conference on Machine Learning, pp. 10136–10146. PMLR, 2020. Webb et al. (2023) ↑ Webb, T., Holyoak, K. J., and Lu, H.Emergent analogical reasoning in large language models.Nature Human Behaviour, 7(9):1526–1541, 2023. Wechsler (2008) ↑ Wechsler, D.WAIS-IV Administration and Scoring Manual.PsychCorp, 2008.URL https://books.google.co.il/books?id=Bf-DswEACAAJ. Winston (1982) ↑ Winston, P. H.Learning new principles from precedents and exercises.Artificial intelligence, 19(3):321–350, 1982. Wu et al. (2023) ↑ Wu, J., Gan, W., Chen, Z., Wan, S., and Philip, S. Y.Multimodal large language models: A survey.In 2023 IEEE International Conference on Big Data, pp. 2247–2256. IEEE, 2023. Wu et al. (2024) ↑ Wu, R., Ma, X., Zhang, Z., Wang, W., Li, Q., Zhu, S.-C., and Wang, Y.Bongard-OpenWorld: Few-shot reasoning for free-form visual concepts in the real world.In International Conference on Learning Representations, 2024. Wüst et al. (2024) ↑ Wüst, A., Tobiasch, T., Helff, L., Ibs, I., Stammer, W., Dhami, D. S., Rothkopf, C. A., and Kersting, K.Bongard in wonderland: Visual puzzles that still make AI go mad?In NeurIPS 2024 Workshop on “System-2 Reasoning at Scale”, 2024.Preprint available at: arXiv:2410.19546. Xu et al. (2024) ↑ Xu, Y., Li, W., Vaezipoor, P., Sanner, S., and Khalil, E. B.LLMs and the abstraction and reasoning corpus: Successes, failures, and the importance of object-based representations.Transactions on Machine Learning Research, 2024.ISSN 2835-8856. Yin et al. (2024) ↑ Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E.A survey on multimodal large language models.National Science Review, 11(12), 2024. Yiu et al. (2025) ↑ Yiu, E., Qraitem, M., Majhi, A. N., Wong, C., Bai, Y., Ginosar, S., Gopnik, A., and Saenko, K.KiVA: Kid-inspired visual analogies for testing large multimodal models.In International Conference on Learning Representations, 2025. Zerroug et al. (2022) ↑ Zerroug, A., Vaishnav, M., Colin, J., Musslick, S., and Serre, T.A benchmark for compositional visual reasoning.In Advances in Neural Information Processing Systems, volume 35, pp. 29776–29788, 2022. Zhang et al. (2019) ↑ Zhang, C., Gao, F., Jia, B., Zhu, Y., and Zhu, S.-C.RAVEN: A dataset for relational and analogical visual reasoning.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5317–5327, 2019. Appendix ALimitations and Future Work Going beyond Bongard Problems.

BPs fundamentally require solvers to articulate answers in natural language, making them a valuable testbed for assessing the reasoning capabilities of MLLMs. However, to comprehensively explore the challenges posed by the AVR domain, it is crucial to consider a broader range of problems.

For instance, VCog-Bench (Cao et al., 2024) is a benchmark designed to evaluate the reasoning capabilities of MLLMs across 3 datasets: 560 problem instances from RAVEN (Zhang et al., 2019), 309 from CVR (Zerroug et al., 2022), and 480 from MaRs-VQA. These datasets present multi-class classification tasks, offering between 4 to 8 options per problem. While the classification setting in VCog-Bench differs from our focus on natural language generation, both studies echo a shared conclusion – MLLMs struggle in complex, multi-image reasoning tasks. We argue that generative problem formulations, such as those used in our study, pose a more substantial challenge than discriminative tasks, in which the solution may be induced from correlations or by making educated guesses. Further advances in abstract reasoning may require the development of new AVR benchmarks with generative evaluation settings.

A compelling example of a generative problem formulation is the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019), in which each instance involves transforming a source grid into a target grid based on an induced transformation rule. Each instance is accompanied by a few demonstrations to guide the solver. (Mitchell et al., 2023) explored multi-modal reasoning of GPT-4V on ConceptARC (Odouard & Mitchell, 2022), a variant of ARC categorizing tasks into distinct types. The study employed 3 prompting strategies: presenting all demonstrations in a single image, using separate images for each source and target grid pair, and separating each grid pair into distinct images. These settings are related to the Direct, Contrastive, and Descriptive strategies from our study, resp. The model performed best with the last approach, which aligns with the leading performance of the Descriptive strategy in our paper. Their study revealed that ARC tasks pose a significant challenge for MLLMs, aligning with our results. Similar to our findings, GPT-4V evaluation on ConceptARC demonstrated that generative problem formulations pose a significant challenge for contemporary MLLMs.

Fine-grained analysis of MLLM perception.

Related studies emphasize the importance of evaluating fine-grained aspects of model performance in visual reasoning tasks. Notably, (Biscione et al., 2024) propose the MindSet: Vision toolbox, which categorizes tasks into three main domains: low- and mid-level vision, visual illusions, and shape and object recognition. This benchmark is specifically designed to test models on 30 psychological findings inspired by human visual perception, providing a framework for understanding similarities and differences in human and machine vision. Preliminary evaluations using ResNet-152 and GPT-4 on selected tasks revealed notable differences in perception between humans and machines. Applying MLLMs and the reasoning strategies proposed in our work to the MindSet: Vision toolbox opens a promising direction for future research, which could offer deeper insights into the perceptual capabilities of MLLMs.

Incorporating proposed strategies to enhance abstract reasoning abilities.

(Galatzer-Levy et al., 2024) compared the cognitive abilities of MLLMs to humans using the Wechsler Adult Intelligence Scale (WAIS-IV) (Wechsler, 2008). Their findings reveal that while MLLMs excel in tasks related to verbal comprehension and working memory, they significantly underperform in perceptual reasoning tasks. The evaluation setting used in this study involved presenting models with an image of an abstract reasoning matrix alongside a text prompt describing the task, closely aligning with the Direct strategy employed in our work. However, as discussed in Section 5, the Direct strategy poses notable challenges for MLLMs. Our experiments show that models consistently achieve better performance with alternative approaches, such as the Descriptive strategy. This highlights the importance of selecting appropriate strategies when evaluating MLLMs on abstract reasoning tasks. We believe that the diverse suite of strategies proposed in our work can be extended to other studies in abstract reasoning to fully capitalize on MLLM reasoning capabilities.

Cross-domain analysis of MLLM perception.

A possible hypothesis for the subpar performance of MLLMs on AVR tasks involving synthetic datasets, such as VCog-Bench or ARC, is the limited representation of synthetic images in their training data. This assumption is supported by the observed performance gap between synthetic BPs and real-world image BPs, such as those in Bongard HOI and Bongard-OpenWorld, which might suggest that MLLMs perform better at abstract reasoning with real-world images. However, our experiments with Bongard-RWR challenge this notion. Despite using real-world images, Bongard-RWR demonstrates that MLLMs still struggle with abstract reasoning, indicating that the performance gap cannot be solely attributed to differences in data domains. Instead, this suggests more fundamental challenges in visual reasoning. Future work could extend this research line by leveraging datasets that include both synthetic and real-world images, such as Raven’s Progressive Matrices (Zhang et al., 2019; Teney et al., 2020) or Visual Analogy Problems (Hill et al., 2019; Bitton et al., 2023). Contrasting MLLM performance on such dataset pairs may provide valuable insights into whether their limitations are rooted in data domain or in broader domain-free reasoning challenges.

Appendix BHuman Performance on Bongard-RWR Foundation of the study.

Our tests on MLLMs using the Bongard-RWR dataset revealed their poor performance in solving synthetic concepts depicted in real-world images. However, the difficulty and reliability of this new dataset remains an open question. To address this issue, we decided to assess human capabilities in solving these problems.

Methodology.

We compiled all Bongard-RWR problems into a single document, including a brief introduction that explains what BPs are (see Prompt B), along with a few detailed examples. The examples included one problem from the original BPs ( # ⁢ 133 ), one from Bongard-OpenWorld, and an additional BP ( # ⁢ 336 ) manually translated to the real-world domain. Bongard-RWR problem instances were positioned randomly in the document and were posed in an open-ended manner, allowing participants to provide any response they deemed appropriate.

Participants in our human evaluation predominantly belonged to the academic community, including Master students and (ocassionally) faculty members and PhD students, primarily due to accessibility. This demographic was selected based on the ease of reaching and engaging with individuals who are readily available in academic settings.

All answers were collected using an online form, ensuring a streamlined and efficient process for submission. Each participant was allowed to make only a single submission, to maintain the integrity and reliability of the data. In addition, the form contained a few more questions to gather basic statistics on our new dataset and the quality of submissions:

How would you assess the readability of the images included in the problems? (Scale 1 − 10 )

How would you assess the difficulty of the tasks you received? (Scale 1 − 10 )

What is your level of education? (Primary, Secondary, Higher, I prefer not to say)

How much time did you spend solving the tasks? (Less than 30 minutes, From 30 minutes to an hour, From one to two hours, More than two hours)

Answers evaluation.

In contrast to the evaluation of MLLM solutions, human responses were evaluated entirely manually. Initially, two humans reviewed the complete set of answers independently, achieving a 94.5 % agreement on the correctness of the responses. The discrepancies were then discussed and a consensus was reached leading to a single, unified evaluation.

[tbp] Prompt 1: Text used as a brief introduction in human testing. The presented problems represent a type of logic puzzle. Each problem consists of two sides separated by a vertical line. Each side contains six images. The task is to find a characteristic that applies to all the images on the left side but does not apply to those on the right side. Some problems may be less obvious and require a broader perspective or focus on details. Simply comparing the general content of the images might not be enough. Answers may repeat.

Respondent overview.

Overall, we successfully collected 30 responses, with 26.7 % of participants having secondary education and 73.3 % having higher education. Although the sample of respondents was small, we observed no significant discrepancies between the results of individuals with higher education and those with secondary education, both in the responses to additional questions (Fig. 6) and in the problem-solving results (Fig. 8).

Findings from respondent responses.

In Fig. 8, we present the distribution of responses to the Bongard-RWR dataset. Every problem was solved by at least one respondent, confirming the solvability of the dataset. The results are consistent across respondents (see Fig. 7), with the number of solved problems ranging from 23 to 59 . Moreover, the findings demonstrate the superiority of humans over MLLMs in tackling this type of task. Notably, the lowest human score exceeded the combined score of all the models. Half of the respondents solved more than 40 problems, with mean and median equal to 39.2 and 40.5 , resp., resulting in 65 % average accuracy.

The difficulty of each problem can be estimated based on the number of respondents who successfully solved it. As shown in Fig. 8, the problems exhibit varying levels of difficulty: 22 of them were solved by at least 25 respondents, while 10 were solved by fewer than 10 respondents. In addition, three problems— 10 , 88 and 100 —were solved by all respondents. Overall, the dataset was rated as quite difficult, with an average difficulty score of 7.6 across all respondents.

Overall, the results demonstrate the robustness and applicability of Bongard-RWR as a novel dataset for investigating the performance differences between human and model-based visual reasoning.

Figure 6:The respondent’s answers to the questions attached in the form. Even though the majority of respondents had higher education, they, on average, spent more than one hour on solving the problems. Moreover, only one person rated the problems as relatively easy (giving them a difficulty score of 4 out of 10). Figure 7:The number of problems solved by humans. The histogram illustrates the consistency across multiple respondents solving the Bongard-RWR problems. Notably, half of the respondents solved more than 40 problems, with none of them solving fewer than 23 ones. In the histogram, the lower bounds of the bins are inclusive, and the upper bounds are exclusive, except for the last bin, which is [ 55 , 59 ] . Figure 8:Human performance on Bongard-RWR. As shown in the plot, the models struggled with many problems that humans found relatively easy to solve. On the other hand, the models were able to solve problem # ⁢ 87 that appeared to be relatively demanding for human solvers. In the histogram, the lower bounds of the bins are inclusive, and the upper bounds are exclusive. Appendix CExtended Results

Table 4 presents results across all models, strategies and datasets discussed in Section 5. In the following paragraphs, we extend the discussion concerning binary and multi-class classification settings.

Table 2:Binary classification on Bongard-RWR variants. The number of problems solved by open-access models in the Images to Sides task from Bongard-RWR and its three variants, with the difference relative to Bongard-RWR in parentheses. Bongard-RWR RWR-G RWR-S RWR-SG InternVL2 8B 16

( + 1 )

( + 13 )

( + 11 )

InternVL2.5 38B 13

( + 0 )

( + 12 )

( + 10 )

InternVL2.5 78B 15

( + 2 )

( + 9 )

( + 8 )

Qwen2VL 72B-Instruct 20

( + 5 )

( + 0 )

( + 1 )

LLaVA-1.6 Mistral 7B 0

( + 3 )

( + 0 )

LLaVA-NeXT 110B 24

( + 5 )

( + 8 )

Phi-3.5-Vision 1

( + 15 )

( + 7 )

( + 17 )

Pixtral 12B 26

( − 5 )

( + 9 )

( + 7 )

Binary classification (Ground-Truth). We assessed whether MLLMs can determine if a given concept matches a problem instance. On synthetic BPs, 3 proprietary (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) and 2 open-access (LLaVA-1.6 Mistral-7B, Pixtral 12B) models outperform a random classifier by a notable margin. On Bongard HOI, 3 proprietary (GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro) and the same 2 open-access models also surpass random guessing. Notably, Pixtral 12B attained a perfect score on this dataset. On Bongard-OpenWorld GPT-4o, Gemini 1.5 Pro, LLaVA-1.6 Mistral-7B, and Pixtral 12B achieve reasonable results. Again, the leading model is Pixtral 12B with the outstanding 99 % outcome. Model accuracy drops significantly on Bongard-RWR, where only LLaVA-1.6 Mistral-7B and Pixtral 12B outperform a random classifier. This suggests that correctly identifying concepts expressed in Bongard-RWR likely requires more advanced reasoning abilities, even in the relatively simpler binary classification setting.

Binary classification (Incorrect Label). Rejecting a possible solution is intuitively simpler than confirming its correctness, as it boils down to finding at least one image that doesn’t match the provided concept. Accordingly, 6 models perform better in the Incorrect Label setting than in Ground-Truth, with 7 perfect scores, 2 of them on Bongard-RWR. The exceptions are LLaVA-1.6 Mistral-7B, and Pixtral 12B which are below a random guessing threshold for all four datasets, despite being above this threshold in Ground-Truth. This suggests that their strong performance in the Ground-Truth setting may be due to a potential bias toward agreeing with the provided concept.

Binary classification (Images to Sides). We also evaluate the models’ ability to correctly assign two test images to the appropriate sides of the problem. A problem is considered solved if both images are correctly assigned to the respective sides. Proprietary models perform well in this task across synthetic BPs, Bongard HOI and Bongard-OpenWorld. Conversely, among open-access models, only Pixtral 12B consistently achieves strong results. Notably, on Bongard-RWR Pixtral 12B solves 26 / 60 problems, outperforming all proprietary models, however, all models perform below the level of random guessing. The remaining open-access models show poor results in this setting. Notably, weak results of LLaVA-1.6 Mistral-7B on real-world datasets are primarily attributed to its incorrect generation of JSON output required to format the result.

Binary classification on RWR variants. In the Ground-Truth and Incorrect Label settings, models performed similarly across the Bongard-RWR dataset and its three variants (RWR-G, RWR-S, RWR-SG), indicating that color removal and cropping did not substantially affect performance in these settings. However, in the Images to Sides setup (Table 2), several models showed clear differences. For example, Phi-3.5-Vision improved from solving 1 problem on RWR to 16 on RWR-G, 8 on RWR-S, and 18 on RWR-SG, suggesting that color removal and cropping reduce unnecessary variation, and shift the focus of the model toward structural patterns. Encouraged by these improvements, we evaluated larger model variants, including InternVL2.5 38B and 78B (Chen et al., 2024a, c), Qwen2VL 72B-Instruct (Bai et al., 2023; Wang et al., 2024), and LLaVA-NeXT 110B, which showed similar trends. Overall, the RWR-G variant highlighted models’ sensitivity to color, while RWR-S and RWR-SG revealed how cropping affects performance by reducing empty space between panels. Nevertheless, in most cases the models were unable to surpass the random guess threshold, highlighting their inherent limitations in recognizing Bongard-RWR concepts.

Figure 9: Concept selection. Multi-class classification accuracy of MLLMs across datasets and 𝑘 ∈ { 2 , 4 , 8 , 16 } choices. Detailed results are presented in Table 3. Table 3: Concept selection. Multi-class classification accuracy of MLLMs across datasets and 𝑘 ∈ { 2 , 4 , 8 , 16 } choices. InternVL2 is denoted as IVL2, InternVL2.5 as IVL2.5, and Qwen2VL Instruct as Q2VL. High-level trends are illustrated in Fig. 9. Bongard HOI B-OpenWorld Bongard-RWR Synthetic BPs

IVL2 8B 0.95

0.97

0.90

0.92

0.89

0.92

0.81

0.48

0.50

0.25

0.61

0.49

0.30

0.16

IVL2.5 8B 0.97

0.96

0.92

0.93

0.96

0.94

0.87

0.82

0.52

0.45

0.27

0.73

0.56

0.41

0.30

IVL2.5 38B 1.00

0.97

0.99

0.97

1.00

0.99

0.98

0.73

0.65

0.52

0.32

0.82

0.71

0.66

0.46

IVL2.5 78B 1.00

0.99

0.98

0.99

1.00

0.99

1.00

0.88

0.65

0.68

0.38

0.86

0.83

0.75

0.70

Q2VL 72B 1.00

0.99

0.95

0.96

0.98

0.99

0.97

0.85

0.67

0.57

0.32

0.91

0.69

0.59

0.41

Multi-class classification (concept selection). Fig. 9 and Table 3 present detailed results of the multi-class classification experiments, where models were tasked with selecting the correct concept from 𝑘 ∈ { 2 , 4 , 8 , 16 } choices across all four datasets. First, the performance improves from InternVL2 8B to InternVL2.5 8B, reflecting improvements in training data and model architecture in newer model versions. Next, moving from InternVL2.5 8B to InternVL2.5 38B and InternVL2.5 78B shows further gains, highlighting the benefits of scaling up model size, which we explore further in Appendix D. Comparing InternVL2.5 78B with Qwen2VL 72B-Instruct reveals that state-of-the-art models also from other families face challenges on this task, particularly on Bongard-RWR and synthetic BPs. It’s worth noting that the average human accuracy of 65 % on Bongard-RWR observed in the free-form setup is similar to the performance of best models in the much simpler multiple-choice setup with only 𝑘

4 candidate answers. This observation underscores the limitations of current models compared to humans.

Table 4:Evaluation results. The number of correct answers to the first 100 synthetic BPs, 100 selected BPs from Bongard HOI and Bongard-OpenWorld, and all 60 BPs from Bongard-RWR. The best result for a given strategy is marked in bold, and the second best is underlined. Synthetic BPs GPT-4 GPT-4 Gemini Claude InternVL2 LLaVa-1.6 Phi Pixtral o Turbo 1.5 Pro 3.5 Sonnet 8B Mistral 7B 3.5V 12B Ground-Truth 87 ¯

22 92 Incorrect Label 79 ¯ 89 70

79 ¯

Images to Sides 72 ¯

69 75 32

Direct 17 6

13 ¯

Descriptive 17

15 21 19 ¯

Descriptive-iterative 9 7

8 ¯

Descriptive-direct 22 13

20 22 1

Contrastive 10

8 17 15 ¯

Contrastive-iterative 20 9

12 ¯

Contrastive-direct 19 ¯

9 20 18

Bongard HOI GPT-4 GPT-4 Gemini Claude InternVL2 LLaVa-1.6 Phi Pixtral o Turbo 1.5 Pro 3.5 Sonnet 8B Mistral 7B 3.5V 12B Ground-Truth 73

74 ¯

49 100 Incorrect Label 100 100 96 100 90

Images to Sides 99 92

95 99 13

Direct 35 22

28 ¯

Descriptive 42 45 40

44 ¯

Descriptive-iterative 32 23 ¯

Descriptive-direct 45 40 ¯

Contrastive 18 5

15 ¯

Contrastive-iterative 22 20

21 ¯

Contrastive-direct 25 ¯

12 27 15

Bongard-OpenWorld GPT-4 GPT-4 Gemini Claude InternVL2 LLaVa-1.6 Phi Pixtral o Turbo 1.5 Pro 3.5 Sonnet 8B Mistral 7B 3.5V 12B Ground-Truth 80 ¯

31 99 Incorrect Label 100 100 98

Images to Sides 94

86 96 96 19

Direct 40 21

33 ¯

Descriptive 46 57 32

53 ¯

Descriptive-iterative 31 24 ¯

Descriptive-direct 52 52 25

Contrastive 19 ¯

11 21 7

Contrastive-iterative 25 ¯

8 34 5

Contrastive-direct 35 25

27 ¯

Bongard-RWR GPT-4 GPT-4 Gemini Claude InternVL2 LLaVa-1.6 Phi Pixtral o Turbo 1.5 Pro 3.5 Sonnet 8B Mistral 7B 3.5V 12B Ground-Truth 22

38 ¯

21 58 Incorrect Label 59 60 52 60 53

Images to Sides 15

24 ¯

1 26 Direct 5 1

3 ¯

Descriptive 8 ¯

7 13 0

Descriptive-iterative 4 1

3 ¯

Descriptive-direct 5 7 5

6 ¯

Contrastive 2 0

1 2 0

Contrastive-iterative 5 3

4 ¯

Contrastive-direct 4 ¯

1 5 4 ¯

0 Appendix DThe Impact of Model Scaling on Abstract Reasoning Abilities Figure 10:The impact of model scaling on abstract reasoning abilities. We considered a diverse set of model sizes across proprietary and open-access MLLMs. The experiments cover all 4 datasets using the Direct and Descriptive solution strategies.

Performance of MLLMs on downstream tasks is often correlated with the number of model parameters and the size of training datasets (Kaplan et al., 2020; Hoffmann et al., 2022). To investigate the relationship between model scaling and abstract reasoning performance, we conducted experiments with a diverse set of model sizes across proprietary and open-access MLLMs. To this end, we evaluated both smaller and larger variants of the selected models. Specifically, we considered GPT-4o mini and Gemini 1.5 Flash as smaller counterparts to GPT-4o and Gemini 1.5 Pro, resp. Also, we tested multiple configurations of InternVL2 and LLaVA-NeXT model families including InternVL2-8B, InternVL2-26B, InternVL2-40B, InternVL2-Llama3-76B, LLaVA-v1.6 Vicuna-13B, LLaVA-v1.6 34B, LLaVA-NeXT 72B, and LLaVA-NeXT 110B. We conducted experiments on all 4 datasets using two solution strategies, including Direct, which is an intuitive baseline, and Descriptive, the most effective strategy identified in the main experiments.

The results are presented in Fig. 10. In general, larger proprietary models outperformed their smaller counterparts in 10 out of 16 cases. However, smaller variants sometimes performed better than larger ones. For instance, on Bongard HOI with the Direct strategy, GPT-4o mini and Gemini 1.5 Flash surpassed their larger alternatives. This suggests that smaller models can achieve competitive performance in abstract reasoning.

For open-access models, performance consistently improved with model size. For example, the results of InternVL2 on Bongard HOI increased from 12 to 25 and from 2 to 29 for Direct and Descriptive strategies, resp. Similarly, on Bongard HOI, the performance of LLaVA-NeXT improved from 5 to 27 and from 4 to 27 for the two strategies. Analogous improvements were observed on Bongard-OpenWorld, highlighting the potential benefits of model scaling.

Despite these significant improvements in open-access models, proprietary models consistently outperformed them. In particular, GPT-4o mini achieved worse results than the best open-access model in a single case only, i.e., Bongard HOI using the Descriptive strategy ( 26 vs. 29 ). Although model scaling demonstrates its potential to enhance abstract reasoning, as shown by the open-access models, the relatively strong performance of GPT-4o mini shows that a large parameter count is not necessarily critical for excelling in abstract reasoning tasks. Consequently, these results suggest that simply scaling model size may be insufficient to achieve stronger abstract reasoning capabilities and future efforts should explicitly address this aspect, e.g., by incorporating AVR datasets into model training.

Appendix EEvaluation of MLLMs Answers

Preliminary experiments revealed that proprietary MLLMs are generally much more effective in solving Bongard Problems than open, publicly-available MLLMs. Therefore, all efforts devoted to optimizing the final scores, in particular tuning the evaluation prompt were performed using these 4 commercial MLLMs.

Open-ended characteristics of BPs stemming from a textual form of an answer, and the number of considered models ( 8 ), generation strategies ( 7 ), datasets ( 4 ), and BP instances per dataset ( 60 in Bongard-RWR and 100 in the remaining cases) require the use of an automated NLP-based evaluation of the model’s answers. For this task we employed MLLMs with a specially designed prompt. The initial version of the evaluation prompt (see Prompt E.1) was intentionally relatively simple – a model received an answer to be evaluated as well as the ground-truth labels, and was requested to output a binary yes/no answer. This prompt formulation turned out to be too simplistic. While the level of agreement between all models was relatively high ( 87 % of responses were rated unanimously by all models), as illustrated in Fig. 11(a), manual inspection of the selected answers revealed that the assessment was generally too optimistic and relatively many evaluations wrongly pointed to correct answers.

(a)Results with the initial evaluation prompt. (b)Results with the final evaluation prompt. Figure 11:Models’ agreement on the evaluation of BPs. The assessed solutions were generated by GPT-4o with the Descriptive strategy. The numbers refer to the BP tasks from (Bongard, 1968). Green indicates tasks unanimously evaluated as correctly solved by all models, while white indicates unanimous incorrect evaluations. Other colors highlight tasks marked as correctly solved by individual models. E.1Evaluation Prompt Optimization

Due to the above evaluation disagreement, we made an attempt to optimize the prompt based on the GPT-4o solutions for the additional 20 BPs ( # ⁢ 101 − # ⁢ 120 ) that were not used in the main experiments. First, following the few-shot prompting technique, we expanded the prompt to include two examples showing a possible logical difference between correct and incorrect answers. Furthermore, we added a sentence which requested a strict logical compliance with the provided labels. However, this refinement appeared to be too strong, as 2 (out of 4 ) models didn’t evaluate any of the solutions as correct.

To impose some flexibility, we changed the word strictly to logically, but this resulted in an increased rate of false-positive evaluations. Finally, we combined these two prompts, obtaining the outcome closest to the manual (our human) evaluation. The final version of our evaluation prompt is listed in Prompt E.1. Additionally, we attempted to attach the image of the evaluated BP instance to each version of our prompt, but this actually confused the models rather than improving their results, so we ultimately abandoned this option and stuck with the fully text-based prompt.

Although the consistency of results regarded as the number of unanimous assessments stayed at the same level ( 87 % ) (see Fig. 11(b)), the number of answers rated as correct significantly decreased, which was in accordance with our random manual verification.

Despite lowering the results variation, there were still BPs for which the assessment varied. Therefore, we eventually decided to use hard voting to ensemble all models’ evaluations. We marked a solution as correct if at least 2 of the 4 models evaluated it as correct. This approach brought better results than the majority voting.

[tbp] Prompt 2: Initial prompt used in MLLM answer evaluation. We focused on its clarity and simplicity. You are a logic module designed to provide accurate answers. In a Bongard Problem the objective is to spot the difference between the contents of images located on the two opposite sides of the problem. You are given correct labels of these sides and must decide whether the answer provided by the user is correct and matches with those labels. Answer with ’OK’ or ’WRONG’. LEFT SIDE LABEL: RIGHT SIDE LABEL: USER ANSWER:

[tbp] Prompt 3: The final version of the evaluation prompt. It is enriched with the few-shot prompting technique and imposes a logical compliance with provided labels. You are a logic module designed to provide accurate answers. In a Bongard Problem the objective is to spot the difference between the contents of images located on the two opposite sides of the problem. You are given correct labels of these sides and must decide whether the answer provided by the user is correct and matches with those labels. Answer with ’OK’ or ’WRONG’. The user’s answer has to strictly logically match the labels, as shown in examples. FIRST EXAMPLE: LEFT SIDE LABEL: All shapes are small. RIGHT SIDE LABEL: All shapes are big. USER ANSWER: On the left side, one of the shapes is small. On the right side, all of the shapes are big. EVALUATION: WRONG END OF FIRST EXAMPLE. SECOND EXAMPLE: LEFT SIDE LABEL: All shapes are small. RIGHT SIDE LABEL: All shapes are big. USER ANSWER: On the left side, all of the shapes are small. On the right side, all of the shapes are big. EVALUATION: OK END OF SECOND EXAMPLE. LEFT SIDE LABEL: RIGHT SIDE LABEL: USER ANSWER:

E.2Manual Verification of the Evaluation Performance of MLLMs

In order to finally assess the efficacy of Prompt E.1 we manually checked the models’ evaluation performance on the 100 BPs solved by GPT-4o using the Descriptive strategy. As shown in Table 4 in the main paper, all proprietary models achieved better scores on incorrect labels classification. For this reason, we decided to manually verify only the problems evaluated as correct by at least one MLLM, assuming that those incorrect are generally evaluated properly. The comparison between the evaluation performance of the initial and final prompts and our manual evaluation is presented in Table 5. All models denotes evaluations where a solution is marked as correct only if all models evaluate it as correct. Similarly, any model refers to the cases where a solution is marked as correct if at least one model evaluates it as correct. Voting refers to the hard-voting scheme described in section E.1. It is important to observe that the chosen voting evaluation method differed from the manual evaluation in only one specific problem, which is depicted in Fig. 12.

In addition, we checked the performance of our enhanced evaluation prompt on 20 new, not used in other experiments, manually evaluated Bongard-OpenWorld problems solved by GPT-4o using the Descriptive strategy. Again, the use of Prompt E.1 visibly increased the consensus with manual evaluation (see Table 6). The difference between our manual evaluation and the voting scheme occurred only in 2 problem solutions whose correctness is disputable (see Fig. 13).

While the choice of examples shown in the prompt may in some cases impact the evaluation performance, the finally proposed evaluation prompt seems to be well suited to both domains: synthetic and real-world, and should potentially be effective in other similar datasets and solving strategies.

Table 5:Consensus with manual evaluation on synthetic BPs. The percentage of the solutions evaluated the same as in our manual evaluation in BP instances # ⁢ 1 − # ⁢ 100 (Bongard, 1970). The assessed solutions were obtained by GPT-4o using the Descriptive prompting strategy. initial prompt final prompt All models Any model voting All models Any model voting

0.93

0.90

0.94

0.90

0.96

0.99 Figure 12:Synthetic BP # ⁢ 85 . This is the only BP instance for which the voting-based evaluation of MLLM solutions differed from our manual evaluation. Left: Three parts. Right: Five parts. GPT-4o answer: Left: All images are composed of exactly three lines. Right: All images are composed of more than three lines. Voting marked it as incorrect, whereas in manual evaluation it was marked as correct. The first difference lies in the meaning of the words lines and parts, which, in this visual context, seems identical. The second difference stems from the number of the parts on the right side of the problem. The answer seems to be correct, as obviously, five is more than three. However, one could argue that the answer is incomplete, as each of the squares on the right side clearly depicts exactly five parts. Table 6:Consensus with manual evaluation on Bongard-OpenWorld. The percentage of the solutions evaluated the same as in our manual evaluation in the additional 20 Bongard-OpenWorld instances # ⁢ 101 − # ⁢ 120 (Wu et al., 2024), not used in the main experiment. The solutions were obtained by GPT-4o using the Descriptive prompting strategy. initial prompt final prompt All models Any model voting All models Any model voting

0.75

0.70

0.65

0.85

0.90 (a)Left: Underground tunnels beneath the city. Right: NOT Underground tunnels beneath the city. GPT-4o solution: Left: All images depict scenes that are primarily indoors or underground. Right: All images depict scenes that are primarily outdoors. We evaluated it as correct, while the voting marked it as incorrect. The difference arises from how the left concept is perceived. Although the images on the left depict an underground setting, they do not appear to represent an indoor scene. Nevertheless, one of the statements is still true. (b)Left: A woman wearing a white wedding dress. Right: NOT A woman wearing a white wedding dress. GPT-4o solution: Left: All images feature women in white wedding dresses or wedding-related scenes. Right: All images feature women in non-wedding attire, wearing dresses or suits of various colors other than white. We evaluated it as incorrect, while the voting marked it as correct. The difference stems from a small detail in the right concept. One of the images on the right depicts a woman in a white suit, which conflicts with the model’s answer. Figure 13:The only two Bongard-OpenWorld problems (out of the selected 20 ) for which the voting evaluation differed from our manual evaluation. The correctness of GPT-4o’s solutions to these problems is disputable. Appendix FBongard-RWR Dataset

The dataset generation algorithm is presented in Algorithm 1 using notation introduced in Section 4.

1: Input: A set of synthetic concepts 𝑌 2: Output: A set of generated instances ℛ ⁢ 𝒲 ⁢ ℛ 3: 𝑀 ← 15 , 𝑁 ← 10 , 𝑇 ← 3 4: 5: for 𝑦 𝑋 ∈ 𝑌 do 6: 𝐷 𝑋 ← GenerateTranslations( 𝑦 𝑋 , 𝑁 ) 7: 𝐼 𝑋 ← ∅ , 𝑃 ← ∅ 8: 9: for 𝐷 𝑖 𝑋 ∈ 𝐷 𝑋 do 10: for 𝑚 ← 1 to 𝑀 do 11: for 𝑆 ∈ { 𝐿 , 𝑅 } do 12: 𝐼 ← DownloadImage( 𝐷 𝑖 𝑋 ⁢ 𝑆 , 𝑚 ) 13: if 𝐼 is accepted by model then 14: 𝐼 𝑖 𝑋 ⁢ 𝑆 ← 𝐼 𝑖 𝑋 ⁢ 𝑆 ∪ { 𝐼 } 15: end if 16: end for 17: 18: if | 𝐼 𝑖 𝑋 ⁢ 𝐿 | ≥ 2 and | 𝐼 𝑖 𝑋 ⁢ 𝑅 | ≥ 2 then 19: 𝑃 ← 𝑃 ∪ { 𝑖 } 20: break 21: end if 22: end for 23: 24: if | 𝑃 | ≥ 𝑇 then 25: Break 26: end if 27: end for 28: 29: ℒ 𝑋 ← ∅ , ℛ 𝑋 ← ∅ 30: if | 𝑃 | ≥ 𝑇 then 31: for 𝑘 ← 1 to 6 do 32: 𝑝 ← 𝑃 ⁢ [ 𝑘 mod 𝑇 ] 33: 𝑗 ← 𝑘 ÷ 𝑇 34: for 𝑆 ∈ { 𝐿 , 𝑅 } do 35: 𝒮 𝑋 ← 𝒮 𝑋 ∪ { 𝐼 𝑝 𝑋 ⁢ 𝑆 ⁢ ( 𝑗 ) } 36: end for 37: end for 38: 39: ℛ ⁢ 𝒲 ⁢ ℛ 𝑋 ← { ℒ 𝑋 , ℛ 𝑋 } 40: ℛ ⁢ 𝒲 ⁢ ℛ ← ℛ ⁢ 𝒲 ⁢ ℛ ∪ { ℛ ⁢ 𝒲 ⁢ ℛ 𝑋 } 41: end if 42: end for Algorithm 1 The Bongard-RWR dataset generation.

Furthermore, Fig. 15 provides additional examples of the proposed Bongard-RWR dataset. Each subfigure presents a comparison between the synthetic Bongard problem and its respective real-world translation in Bongard-RWR. Examples 15(a) and 15(b) were translated automatically, whereas 15(c) and 15(d) were constructed fully manually, including building an appropriate scene and taking a picture. Additionally, Fig. 14 shows a particular approach taken when translating a given synthetic BP to its Bongard-RWR counterpart (problems not translated and those rejected after translation are combined into one category).

Figure 14:The structure of the Bongard-RWR dataset. Each color denotes the genesis of the translation of the respective problem. Problems not translated and those rejected after translation are combined into one category. Red color outlines denote problems which were solved by any model using any strategy. There is no visible correlation between the set of solved problem and the methods used for their generation. (a)Synthetic BP # ⁢ 8 with its automatically translated Bongard-RWR version. Left: Figures on the right side. Right: Figures on the left side. (b)Synthetic BP # ⁢ 10 with its automatically translated Bongard-RWR version. Left: Triangles. Right: Quadrangles. (c)Synthetic BP # ⁢ 41 with it manually constructed Bongard-RWR version. Left: Outline circles on one straight line. Right: Outline circles not on one straight line. (d)Synthetic BP # ⁢ 47 with its manually constructed Bongard-RWR version. Left: Triangle on top of the circle. Right: Circle on top of the triangle. Figure 15:Additional examples of Bongard-RWR instances. Appendix GCoverage of Bongard-RWR Instances

Even though the final results of individual models and strategies solving Bongard-RWR are somewhat unsatisfactory, especially in the case of open language response generation, it is worth to examine the degree of overlap of correct answers across the tested models. On the one hand, the results of ground-truth classification and model’s disagreement on the solution evaluation clearly confirm inability of any single model to solving all problems from the Bongard-RWR dataset. On the other hand, it is likely that the overlap is not complete, and it is possible to expand the solution coverage by appropriate mixture-of-experts approach. This is indeed the case in our experiments, as illustrated in Figs. 16 – 23. While single proprietary MLLMs solved between 9 and 14 instances, the number of instances solved by any MLLM equaled 23 . We leave exploration of this path for future research.

Figure 16:Overall result of each strategy on synthethic BPs. Colormaps depict all problems solved by any tested model using the respective prompting strategy. The right figure aggregates strategies into corresponding groups for better coverage exposure. Figure 17:Final summary of all experiments on the synthetic Bongard Problems dataset. The colormap aggregates all problems solved by any tested model using any generation prompting strategy. Overall, the models collectively managed to solve 53 problems. Figure 18:Overall result of each strategy on Bongard HOI. Colormaps depict all problems solved by any tested model using the respective prompting strategy. The right figure aggregates strategies into corresponding groups for better coverage exposure. Figure 19:Final summary of all experiments on the Bongard HOI dataset. The colormap aggregates all problems solved by any tested model using any generation prompting strategy. Overall, the models collectively managed to solve 88 problems. Figure 20:Overall result of each strategy on Bongard-OpenWorld. Colormaps depict all problems solved by any tested model using the respective prompting strategy. The right figure aggregates strategies into corresponding groups for better coverage exposure. Figure 21:Final summary of all experiments on the Bongard-OpenWorld dataset. The colormap aggregates all problems solved by any tested model using any generation prompting strategy. Overall, the models collectively managed to solve 93 problems. Figure 22:Overall result of each strategy on Bongard-RWR. Colormaps depict all problems solved by any tested model using the respective prompting strategy. The right figure aggregates strategies into corresponding groups for better coverage exposure. Figure 23:Final summary of all experiments on the Bongard-RWR dataset. The colormap aggregates all problems solved by any tested model using any generation prompting strategy. Overall, the models collectively managed to solve 22 problems. Appendix HComparison of Synthetic Bongard vs. Bongard-RWR results

Our research shows that all of the tested models have difficulty solving synthetic concepts when applied to real-world images. Comparing the results for both datasets (see Figs. 17 and 23) we identified some discrepancies. Four problems that remained unsolved in the synthetic BPs were successfully solved in the real-world domain of the Bongard-RWR dataset: # ⁢ 56 , # ⁢ 87 , # ⁢ 88 , and # ⁢ 98 . However, three of these problems differ slightly from their synthetic counterparts. The images in # ⁢ 56 from Bongard-RWR feature a variety of colors instead of the usual black-and-white figures. Furthermore, in real-world version of problem # ⁢ 87 more images feature disjoint elements instead of multi-part objects, which may have nudged the model toward the correct answer. Additionally, in problem # ⁢ 98 , the figures are shown against a hatched texture, which was not accounted for in the real-world translation.

Conversely, 22 problems that were solved in the synthetic BPs were not solved in the Bongard-RWR dataset. In most cases, the models focused on general associations that did not apply to all the images. For example, the concept presented in problem # ⁢ 3 is: ”LEFT: Outline figures, RIGHT: Solid figures.” Claude 3.5 Sonnet, using the Descriptive strategy, responded: ”LEFT: focuses on practical, everyday objects or scenes, RIGHT: emphasizes aesthetic, artistic, or decorative elements that serve more for visual appeal than utility”. Nevertheless, both sides feature a red coffee cup with a saucer, which matches both generated descriptions. The key difference lies in the color of the saucer’s rim, which determines whether it should be considered as an outline or a solid figure.

(a)Model’s answer: ”Left: All shapes are filled. Right: At least one shape is unfilled”. Evaluated as incorrect. (b)Model’s answer: ”Left: Images are monochromatic (containing only shades of a single color). Right: All images contain at least one hollow (unfilled) shape”. Evaluated as correct. (a)Model’s answer: ”Left: All shapes are composed of straight lines that form only acute or right angles. Right: All shapes contain at least one instance of obtuse angles or curves (e.g., pentagon, ”H” shape, radial pattern)”. Evaluated as incorrect. (b)Model’s answer: ”Left: All images contain exactly four of something: four orange slices, Four cake slices, four lobes of the four-leaf clover, four pineapples, four fingers with painted nails (thumb not visible), four visible elements (key, keychain, watch/compass, leather strap). Right: Images do not contain exactly four of any element: single four-leaf clover, multiple keys on a ring, single street lamp with multiple lights, single ladybug, multiple oranges (more than four), stack of five cookies”. Evaluated as correct. Figure 25:Synthetic BP # ⁢ 87 with its manually translated Bongard-RWR version. Correct answer: ”Left: Four parts. Right: Five parts”. Provided answers belong to Claude 3.5 Sonnet using the Descriptive strategy, as it was the only combination that solved this Bongard-RWR instance correctly. (a)Model’s answer: ”Left: Each image contains an even number of shapes. Right: Each image contains an odd number of shapes”. Evaluated as incorrect. (b)Model’s answer: ”Left: Contains groups of three items. Right: Contains groups of five items”. Evaluated as correct. Figure 26:Synthetic BP # ⁢ 88 with its manually translated Bongard-RWR version. Correct answer: ”Left: Three parts. Right: Five parts”. Provided answers belong to GPT-4o using the Direct strategy. Overall this problem was solved by three models and four strategies. (a)Model’s answer: ”Left: All images contain geometric shapes or patterns that are oriented or have components that predominantly point to the right. Right: All images contain geometric shapes or patterns that are either static (not pointing in any specific direction) or have components that do not predominantly point to the right”. Evaluated as incorrect. (b)Model’s answer: ”Left: All images on the left side feature objects that are primarily triangular in shape. This includes the slices of cake and pizza, which are cut into triangular pieces, and the sailboats, which have triangular sails. Right: All images on the right side feature objects or scenes that are primarily rectangular or square in shape. This includes the modern buildings, the sandwich, the grid-like urban area, the arrangement of kites, and the plate of toasted bread, all of which emphasize rectangular or square forms”. Evaluated as correct. Figure 27:Synthetic BP # ⁢ 98 with its manually corrected Bongard-RWR version. Correct answer: ”Left: Three parts. Right: Five parts”. Provided answers belong to GPT-4 Turbo using the Descriptive-direct strategy. Overall this problem was solved by two models using two different strategies. Appendix IMLLM Prompts I.1Prompts for Bongard-RWR Generation

At first, Prompt I.1 was employed to generate real-world representations of synthetic concepts. After downloading the images corresponding to these textual descriptions, Prompt I.1 was used to select those images that correctly represent given concept translation. In addition to the left and right concepts, we also provided prompts briefly explaining the context that the image should match. These prompts were generated during the translation stage of our algorithm (see the fourth line in Algorithm 1).

[tbp] Prompt 4: Initial concept-describing prompt used in construction of Bongard-RWR. Your goal is to translate a comparison concept from the geometric domain to the real-world domain. Your translations should be expressible as images. Example: Geometric domain: triangles vs squares { "left": { "concept": "pyramids" }, "right": { "concept": "rectangular buildings" } } Give unique translations for the following concept as a raw JSON array of objects (same as in the example above).

[tbp] Prompt 5: Prompt used for the selection of proper images for a translated concept. You translated a concept comparison from geometric domain to the real-world domain as follows: Geometric domain: vs Real world domain: { "left": { "concept": , "prompt": }, "right": { "concept": , "prompt": } } Now, you need to check if the queried image matches your translation and provides enough information to distinguish it from the other concept. Don’t focus too much on the prompt. It’s just a hint for you to understand the concept better. Provided image represents Give your answer in the following format: EVALUATION: OK EXPLANATION: or EVALUATION: REJECTED EXPLANATION:

I.2Prompt Describing the Bongard Problem

Prompt I.2 describing the BP task has been placed at the beginning of each solving strategy introduced in Section 3.1.

[tbp] Prompt 6: Prompt explaining Bongard problems to an MLLM. A Bongard Problem is composed of left and right sides separated by a line. Each side contains six images. All images belonging to one side present a common concept, which is lacking in all images from the other side. The goal is to describe the rule that fits all images on the left side, but none on the right, and, conversely, the rule that fits all images on the right side, but none on the left. The description of the rule should be simple and concise. Example 1: All shapes on left are small. All shapes on right are big. Example 2: The left side contains circles. The right side contains triangles.

I.3Prompts for Classification Strategies

Prompt I.3 was used to assess solution correctness (see Section 3.2). Prompt I.3 was used to assign images to sides (see Section 3.3).

[tbp] Prompt 7: Prompt used for images to side classification (see Fig. 3c). Two examples were provided to not bias the results of the model. You are a vision understanding module designed to provide short, clear and accurate answers. Your goal is to classify two test images to the corresponding side of the Bongard Problem, LEFT or RIGHT. Each image belongs to exactly one class. The test images belong to different classes. The images are always provided correctly. Respond only to the specific request. Respond in json using the following format. FIRST EXAMPLE: Left images: Right images: First test image: Second test image: Response: { "first": { "explanation": "The test image shows a small shape, similarly as all images on the left side. Conversely, the images on the right side feature big shapes.", "concept": "small vs big", "answer": "LEFT" }, "second": { "explanation": "The test image shows a big shape, similarly as all images on right. The images on left, on the other hand, feature small shapes.", "concept": "small vs big", "answer": "RIGHT" } } END OF FIRST EXAMPLE SECOND EXAMPLE: Left images: Right images: First test image: Second test image:

[tbp] Prompt 8: Prompt used for the solution correctness assessment (see Fig. 3b). You are a vision understanding module designed to provide short, clear and accurate answers. Your goal is to evaluate the correctness of the provided answer to the given Bongard Problem. All images are provided correctly. Do not explain the answer, just evaluate it. Respond ’OK’ if the answer is correct, otherwise respond ’WRONG’. User answer:

I.4Prompts for Natural Language Answer Generation Strategies

Prompts I.4–I.4 were used for natural language answer generation (see Section 3.1).

[tbp] Prompt 9: Prompt used for the Direct strategy. (see Fig. 2a). You are a vision understanding module designed to provide short, clear and accurate answers. Your goal is to solve the provided Bongard Problem. What is the difference between the two sides of the problem?

[tbp] Prompt 10: Prompt used to obtain the image descriptions in the Descriptive strategy (see Fig. 2b). The provided image is a part of an abstract visual reasoning problem. Describe all crucial properties of the image. Your description should be as concise as possible. Focus on the most important details. The image is provided correctly. Respond only with descriptions.

[tbp] Prompt 11: Prompt used for the Descriptive and Descriptive-direct strategies (see Fig. 2b). You are a vision understanding module designed to provide short, clear and accurate answers. Your goal is to solve the provided Bongard Problem using descriptions of its images. LEFT IMAGES: RIGHT IMAGES: What is the difference between the two sides of the problem?

[tbp] Prompt 12: Prompt used to obtain the image descriptions in the Descriptive-iterative strategy (see Fig. 2c). After the last image, we used the prompt: “That was the last image. Now provide your final answer.” You’ll receive a sequence of images that are a part of a single side of a Bongard Problem. The images will be provided one by one. Your goal is to find a common concept presented in all images. Your description should be as concise as possible. Focus on the most important details. Try to enhance the description of the concept after each image. The image is always provided correctly. Respond only to the specific request. The first image will be provided in the next message.

[tbp] Prompt 13: Prompt used for the Descriptive-iterative strategy (see Fig. 2c). You are a vision understanding module designed to provide short, clear and accurate answers. Your goal is to solve the provided Bongard Problem using descriptions of two sides of the problem. LEFT SIDE DESCRIPTION: RIGHT SIDE DESCRIPTION: What is the difference between the two sides of the problem?

[tbp] Prompt 14: Prompt used to obtain the comparison between the left and right image in the Contrastive strategy (see Fig. 2d). After the last image, we used the prompt: “That was the last image. Now provide your final answer.” You are given two images extracted from the left and right side of a Bongard Problem, respectively. Your goal is to compare the images. Your comparison should be as concise as possible.

[tbp] Prompt 15: Prompt used for the Contrastive and Contrastive-direct strategies (see Fig. 2d). You are a vision understanding module designed to provide short, clear and accurate answers. Your goal is to solve the provided Bongard Problem using comparisons between pairs of images. Each pair contains one image from the left and one from the right side of the problem. COMPARISONS: What is the difference between the two sides of the problem?

[tbp] Prompt 16: Prompt used for the Contrastive-iterative strategy (see Fig. 2e). After the last pair of images, we used the prompt: “It was the last pair of images. What is the difference between the two sides of the problem?” You are a vision understanding module designed to provide short, clear and accurate answers. Your goal is to solve the provided Bongard Problem. You’ll receive a sequence of image pairs. Each pair contains one image from the left and one from the right side of the problem. In each step compare the two images and refine the definitions of concepts that describe left and right sides of the problem. Your description should be as concise as possible. Focus on the most important details. The first pair will be provided in the next message.

Report Issue Report Issue for Selection Generated by L A T E xml Instructions for reporting errors

We are continuing to improve HTML versions of papers, and your feedback helps enhance accessibility and mobile support. To report errors in the HTML that will help us improve conversion and rendering, choose any of the methods listed below:

Click the "Report Issue" button. Open a report feedback form via keyboard, use "Ctrl + ?". Make a text selection and click the "Report Issue for Selection" button near your cursor. You can use Alt+Y to toggle on and Alt+Shift+Y to toggle off accessible reporting links at each section.

Our team has already identified the following issues. We appreciate your time reviewing and reporting rendering errors we may not have found yet. Your efforts will help us improve the HTML versions for all readers, because disability should not be a barrier to accessing research. Thank you for your continued support in championing open access for all.

Have a free development cycle? Help support accessibility at arXiv! Our collaborators at LaTeXML maintain a list of packages that need conversion, and welcome developer contributions.

Xet Storage Details

Size:: 125 kB
Xet hash:: 854f75e853b9a0270bbf5e1a0d4953d6c9d3613c0c90bfc2ed12af240365ac45

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.