Title: AdaptMI: Adaptive Skill-based In-context Math Instructions for Small Language Models

URL Source: https://arxiv.org/html/2505.00147

Published Time: Fri, 12 Sep 2025 00:07:52 GMT

3.1 Experimental Settings

Datasets. We evaluate on the MATH (7.5k training samples and 5k test samples) (Hendrycks et al., 2021) and GSM8K (7.4k training samples and 1.3k test samples) (Cobbe et al., 2021) datasets. We follow Didolkar et al. (2024a) to label skills on both the training and test sets using GPT-4o-mini (OpenAI, 2024), and run inference experiments on the whole test set. Section A.1 shows the prompt and examples of our skill annotation pipeline. We sample in-context examples from the training set. These two datasets are not overly challenging for SLMs, which ensures relatively interpretable model outputs for stable failure detection. Meanwhile, they are sufficiently representative to offer meaningful insights into our method’s efficacy.

Model settings. We test our methods on five instruction-tuned small language models: Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct (Yang et al., 2024; Meta AI, 2024). We evaluate the models on 5-shot ICL performance, using generation temperature 0.0 for all experiments. We also compare against consistency@5 voting (Wang et al., 2022) with 5-shot fixed examples, where we draw 5 generations at temperature 1.0 and evaluate the majority-consistent response. For classifying easy and difficult questions in the first stage, we use RLHFlow/Llama3.1-8B-PRM-Mistral-Data (Xiong et al., 2024), an 8B process reward model fine-tuned from Llama-3.1-8B, with filtering thresholds τ₁ = 0.85 and τ₂ = 0.7. We use GPT-4o-mini for skill annotation as well as for labeling missing skills in AdaptMI+.
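The consistency@5 baseline reduces to a majority vote over the final answers of the sampled generations. A minimal sketch (the function name and tie-breaking by first occurrence are our own choices, not the paper's code):

```python
from collections import Counter

def consistency_vote(answers):
    """Return the most common final answer among sampled generations.

    `answers` holds the extracted final answers of the 5 sampled
    generations (temperature 1.0). Ties are broken by first occurrence,
    since Counter.most_common preserves insertion order for equal counts.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]
```

In practice the answers would first be normalized (e.g. stripping LaTeX wrappers) so that equivalent answers vote together.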

Baselines. We compare our method to non-adaptive in-context example selection methods: feeding in fixed examples, random examples, or skill-based examples (Didolkar et al., 2024a) for all queries.

3.2 Performances of AdaptMI and AdaptMI+

Section 3 reports the main results of our adaptive in-context learning method. The baseline methods with non-adaptive in-context examples (fixed, random, or skill-based) result in largely similar Pass@1 accuracy, while consistency@5 can improve accuracy by a few percentage points. Across all model sizes, our methods AdaptMI and AdaptMI+ consistently outperform the non-adaptive Pass@1 baselines, and are on par with consistency@5 performance in most subareas. The overall improvements are especially pronounced for the smaller models, Qwen2.5-1.5B-Instruct and Llama-3.2-1B-Instruct.

While AdaptMI surpasses consistency@5 performance on most domains, it slightly lags behind on certain subjects such as Geometry and Precalculus for 1B and 3B models. These subjects are relatively difficult for the model, as suggested by their loss scores compared to other subjects (see Section D.3 in Appendix). Since AdaptMI requires models to have sufficient capabilities to leverage the given skill-based examples, it may not work better than consistency@5 on these harder topics.

Notably, AdaptMI+ brings significant performance gain across all areas by up to 6%, reflecting its strength in accurately targeting model failures. AdaptMI also substantially improves performance by up to 3.6% for Qwen2.5-1.5B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct on MATH. This indicates that our adaptive instruction methods are effective on lower-performing models even without the aid of an LLM.

On stronger models such as Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct, however, AdaptMI is less effective than AdaptMI+. This may suggest that higher-performing models require a more intelligent and targeted skill identification process. Overall, these results demonstrate the effectiveness of adaptive example selection and highlight the potential of our approach to elicit the full reasoning capabilities of small language models.


Figure 2: SLM performances under iterative skill-based example selection (AdaptMI+) vs. iterative random example retrieval. Each iteration involves model inference, difficult question detection, and random/skill-based example re-selection with GPT-4o-mini. Iterative AdaptMI+ yields a continuous accuracy gain of up to 7.2%, while the baseline leads to fluctuating performance.

3.3 Iterative AdaptMI+

Our method can be extended to an iterative loop of adaptive example selection. Each iteration begins with model inference, followed by detecting difficult questions and using GPT-4o-mini to select skill-based examples. The selected examples are then fed in with difficult questions for model inference in the next iteration. This iterative AdaptMI+ is essentially pushing the SLM to tackle a gradually refined set of difficult questions by adaptive teaching. We compare iterative AdaptMI+ with a baseline of iterative random retrieval, where the loop involves inference, random example resampling, and re-inference.
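The loop described above can be abstracted as follows, with inference, difficulty detection, and example selection passed in as callables (all names here are illustrative, not the paper's code):

```python
def iterative_adaptmi_plus(questions, infer, is_difficult,
                           select_skill_examples, fixed_examples, n_iters=10):
    """Sketch of the iterative AdaptMI+ loop.

    infer(question, examples) -> model response
    is_difficult(question, response) -> bool (e.g. a PRM-based filter)
    select_skill_examples(question, response) -> skill-based examples
    Questions that pass the difficulty filter keep their current examples;
    those still judged difficult get freshly selected skill-based examples
    for the next round of inference.
    """
    examples = {q: fixed_examples for q in questions}
    responses = {}
    for _ in range(n_iters):
        responses = {q: infer(q, examples[q]) for q in questions}
        # Re-select examples only for questions still judged difficult.
        for q in questions:
            if is_difficult(q, responses[q]):
                examples[q] = select_skill_examples(q, responses[q])
    return responses
```

The gradually shrinking set of difficult questions is exactly what the callables carve out each round.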

Figure 2 shows that iterative AdaptMI+ consistently improves the reasoning performance on MATH for all three Qwen small language models, while the baseline method struggles to keep pushing the accuracy boundary after the first few iterations. For the 1.5B and 3B models, the performance grows rapidly in the first four iterations, and improves more gradually thereafter. The 7B model's performance, while starting to degrade by the 10th loop, still increases substantially compared to the baseline. Through iterative re-selection of targeted in-context examples, iterative AdaptMI+ demonstrates the potential of progressively guiding small language models to tackle unsolved problems.

4 Discussion

Table 2: Accuracy of Qwen2.5-1.5B-Instruct on difficult and easy questions, respectively under fixed, random, and skill-based examples. Skill-based examples boost performance on difficult questions across all categories, while significantly underperforming on easy questions. We provide the results on Number Theory, Intermediate Algebra, and Counting & Probability, as well as the results on other Qwen models in Appendix D.

4.1 Why does adaptive selection work better than non-adaptive skill-based selection?

To better understand this, we compare performance under fixed, random, and skill-based in-context examples on easy and difficult questions. From Table 2, we observe a clear trend: skill-based examples harm an SLM’s performance on the set of easy questions, while effectively boosting performance on the difficult ones. To gain deeper insight into how skill-based in-context examples might harm performance on easy questions, we present two illustrative cases where the model’s performance regresses when using such prompts.

Case Study 1: Skill-based examples lead the model to overlook key problem constraints. In this example (see Section C.1), the Qwen2.5-7B-Instruct model is given an algebra question that includes multiple geometric constraints. When prompted with fixed examples, the model correctly identifies two possible answers and chooses the correct one according to the given condition "both coordinates are negative." On the other hand, when conditioned on examples that represent algebraic skills, the model overly emphasizes algebraic completeness but overlooks this important problem condition. It finally selects the incorrect answer by a random guess.

Case Study 2: Symbol-heavy skill-based examples cause the model to overthink. This question (see Section C.2) requires a plug-in-and-test approach instead of solving an equation. With fixed in-context examples, the model finds the correct answer by directly plugging in and trying out small values. However, the skill-based examples that involve equation solving may have caused the model to overthink. After failing its first plug-in-and-test attempt, the model ends up attempting to solve the equation system and eventually fails.

4.1.1 Fine-grained Analysis: Effect of skill-based examples across five difficulty levels

The above observations motivate a more fine-grained analysis. We partition our evaluation set into five levels of difficulty, based on the probability of success under Best-of-n sampling (Gui et al., 2024), verified using ground-truth labels. Formally, a question belongs to Difficulty Level ℓ (1 ≤ ℓ ≤ 4) if it can be solved with Best-of-2^(ℓ−1) sampling, but not with any lower n. Questions that belong to Level 5 cannot be solved with Best-of-8 sampling. We provide no in-context examples when measuring the success of Best-of-n sampling and use a temperature of 1.0. Intuitively, questions in Level 2 are those where the model is more susceptible to minor issues like formatting, where fixed in-context examples could help. For questions in higher levels, on the other hand, the model might benefit more from guidance with carefully selected in-context examples.
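Under these definitions, assigning a level from a fixed pool of 8 sampled solutions could look like the sketch below, where we approximate Best-of-n success as "any of the first n samples is correct" (the paper's exact sampling protocol may differ):

```python
def difficulty_level(correct_flags):
    """Assign a difficulty level in 1..5 from 8 sampled attempts.

    correct_flags: list of 8 booleans, one per sampled solution
    (temperature 1.0, no in-context examples). Level ℓ (1..4) means
    Best-of-2^(ℓ-1) succeeds but no smaller budget does; Level 5 means
    even Best-of-8 fails.
    """
    for level, n in enumerate([1, 2, 4, 8], start=1):
        if any(correct_flags[:n]):
            return level
    return 5  # unsolved even with Best-of-8 sampling
```

Because the budgets are nested (1 ⊂ 2 ⊂ 4 ⊂ 8), the first budget that succeeds is automatically the smallest one.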

After splitting the questions into 5 levels, we compare the effect of skill-based in-context examples with fixed in-context examples on the model’s responses to questions at each difficulty level. Figure 3 reports the results for Qwen2.5-3B-Instruct on the MATH dataset.

Primary observations: We clearly observe that skill-based in-context examples can perform worse than fixed in-context examples at Levels 1 and 2. On the other hand, skill-based in-context examples can substantially help the model on questions at Levels 3–5. Furthermore, we observe that the model's responses are substantially longer with skill-based in-context examples than with fixed in-context examples.

This shows that with skill-based examples, the model can return unnecessarily long responses and make mistakes on easier questions, where simple strategies like Best-of-2 sampling or prompting with fixed in-context examples would have sufficed. This aligns with existing work on the issues of longer chain-of-thought reasoning in language models and how it relates to "problems of over-thinking" in humans (Liu et al., 2024b; Diaconis & Mazur, 2003). We also present results using the difficulty split of questions annotated in the original MATH dataset in Section B.3. Differences in performance and generation length between skill-based and fixed in-context examples are less pronounced across those difficulty levels. This is expected, as the model's own responses serve as a finer-grained indicator of question difficulty.


Figure 3: Accuracy and average output length of Qwen2.5-3B-Instruct on questions of Difficulty Levels 1–5, defined using its Best-of-n performance, with fixed and skill-based examples. Skill-based examples hinder performance on Levels 1 and 2, while helping on Levels 3–5. At all difficulty levels, skill-based examples result in noticeably longer outputs.

4.2 Ablation Studies

Effect of in-context example choices in Stage 2. Our main method combines difficult questions with skill-based examples and easy ones with fixed examples, based on the observation that models only need targeted instructions on more challenging cases. To better understand its effectiveness, we conduct an ablation study exploring alternative combinations of in-context examples. Our primary observations are:

  • As shown in Figure 4, our combination of "difficult + skill-based; easy + fixed" consistently outperforms all other configurations. Notably, the accuracy gap between the best- and worst-performing combinations can reach 7.1%, which stresses the importance of carefully choosing in-context examples for SLMs.
  • The sensitivity to in-context example selection varies across model sizes, with the 1.5B model being the most sensitive and the 7B model the most stable.
Effect of threshold values on the reward model prediction.

We investigate the effect of τ₁ and τ₂ (defined in Section 2.2) on the classification of easy and difficult questions. Specifically, we measure whether our classification of questions as easy or difficult corresponds to the correctness of responses assessed using ground-truth labels. In Table 3, we report four metrics (accuracy / precision / recall / F1) evaluating the prediction quality under different filtering thresholds. Note that τ₁ = 0 or τ₂ = 0 means completely removing the corresponding constraint. Across all evaluated combinations of threshold values, our choice (τ₁ = 0.85, τ₂ = 0.7) gives a good balance of prediction scores. To further visualize this effect, we run AdaptMI on top of all threshold combinations and report the final accuracy in Table 4. Our choice of threshold values yields the highest final accuracy among all combinations.
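As an illustration only (the precise two-threshold rule is given in Section 2.2, which is outside this excerpt), one plausible reading applies τ₁ to the average PRM step reward and τ₂ to the minimum step reward; the four reported metrics then follow from standard definitions:

```python
def predict_easy(step_scores, tau1=0.85, tau2=0.7):
    """Hedged guess at the two-threshold rule: flag a response as 'easy'
    if its average step reward clears tau1 and every individual step
    clears tau2. Setting tau1=0 or tau2=0 disables that constraint."""
    avg = sum(step_scores) / len(step_scores)
    return avg >= tau1 and min(step_scores) >= tau2

def prf1(preds, labels):
    """Accuracy / precision / recall / F1 for binary predictions."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```

Sweeping `tau1` and `tau2` over a grid and tabulating `prf1` against ground-truth correctness reproduces the shape of the Table 3 experiment.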

Table 3: Reward model performance (accuracy / precision / recall / F1) on classifying correct/incorrect responses from Qwen2.5-1.5B-Instruct on MATH, across different thresholds. τ₁ = 0 or τ₂ = 0 means completely removing τ₁ or τ₂. Our choice of threshold values (τ₁ = 0.85, τ₂ = 0.7) gives a good combination of prediction scores.

Table 4: Final AdaptMI performance of Qwen2.5-1.5B-Instruct on MATH, with different thresholds. Our choice of threshold values (τ₁ = 0.85, τ₂ = 0.7) leads to the highest accuracy.

Additional ablations. We compare a process reward model with an outcome reward model in Section B.1. We further show the potential of alternate heuristic filtering methods to classify easy and difficult questions in place of reward models. We find that these heuristic strategies, with appropriate hyperparameters, could replace reward models; we leave a full exploration to future work. We also explore an alternative strategy for constructing adaptive in-context instruction, where we feed in natural-language instructions provided by an LLM in place of in-context examples, in Section B.2. We find that the models simply ignore in-context information that contains long, unstructured natural-language feedback.


Figure 4: ICL performance, measured in terms of accuracy, across different combinations of in-context examples for easy and difficult questions on the MATH dataset. Across all models, we observe that skill-based in-context examples for difficult questions and fixed in-context examples for the easy questions work the best.

5 Related Works

In-context learning example selection.

As a key feature of language models, in-context learning (Brown et al., 2020) enables models to improve performance without undergoing gradient-based training. This ability can be maximally activated with carefully chosen in-context demonstrations. Prior works have extensively studied the dynamics of in-context learning (Chen et al., 2024) and effective techniques for in-context example selection (Zhang et al., 2022; Cheng et al., 2023; An et al., 2023; Didolkar et al., 2024a; Liu et al., 2024a) for larger models (>13B). These heuristics often rely solely on the semantic relation between the question and examples, and they typically require training a dedicated example selection model. Meanwhile, the in-context learning dynamics of small language models are understudied.

Classifying model failures.

Identifying and understanding language model failures helps us adaptively improve model performance, e.g., via targeted training data selection (Zeng et al., 2025). Prior works have utilized models’ test-time failure patterns to build adaptive datasets with difficult questions (Dinan et al., 2019; Nie et al., 2020; Ribeiro & Lundberg, 2022; Gao et al., 2023; Li et al., 2025). However, these failure identification and classification approaches have rarely been applied to inform in-context example selection.

Symbolic and Skill-based Reasoning.

Performing symbolic reasoning can substantially enhance language models’ math reasoning ability (Sullivan & Elsayed, 2024; Alotaibi et al., 2024; Xu et al., 2024; Shaik & Doboli, 2025). As SLMs generally possess weaker capabilities to understand complex in-context information, symbolic knowledge aids SLM reasoning by providing structured, less noisy contextual information (Liao et al., 2024). Notably, the concept of “skill” has proven to be a useful criterion for clustering symbolic knowledge (Didolkar et al., 2024a), guiding contextual example selection (Didolkar et al., 2024a; An et al., 2023), and mixture-of-experts routing (Chen et al., 2025).

6 Conclusion

Our work explores reasons behind the failure of skill-based in-context examples to boost ICL performance of SLMs. We show that skill-based selection can make the model “overthink” on easier questions, which leads to a degradation in ICL performance. We then propose adaptive in-context selection strategies, AdaptMI and AdaptMI+, that use skill-based selection only for difficult questions.

While our primary focus is on improving ICL performance in SLMs, an important question is whether similar strategies can also guide the training of better SLMs. Current approaches often rely on distilling (Hinton et al., 2015) an SLM directly from the logits or generations of a frontier LLM, which requires careful curation of the training data and pipeline for optimal and efficient benefits (Hsieh et al., 2023; Ivison et al., 2023; Kaur et al., 2024). Recent studies suggest that additional in-context information can help models learn more effectively or efficiently. However, these strategies employ static or manually crafted curricula and in-context information (Zhu et al., 2025; Gao et al., 2025; Liao et al., 2024; Allen-Zhu & Li, 2024). An important open direction, thus, is how to adapt AdaptMI and AdaptMI+ to enable SLMs to train more effectively using frontier LLMs.

Acknowledgements

We thank the members of Princeton Language and Intelligence for their helpful discussion and feedback. Sanjeev Arora and Abhishek Panigrahi are funded by NSF, Darpa, ONR, and Schmidt Foundation. Abhishek Panigrahi is a current Apple AIML scholar.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Allen-Zhu & Li (2024) Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws. arXiv preprint arXiv:2404.05405, 2024.
  • Alotaibi et al. (2024) Fatimah Alotaibi, Adithya Kulkarni, and Dawei Zhou. Graph of logic: Enhancing llm reasoning with graphs and symbolic logic. In 2024 IEEE International Conference on Big Data (BigData), pp. 5926–5935. IEEE, 2024.
  • An et al. (2023) Shengnan An, Bo Zhou, Zeqi Lin, Qiang Fu, Bei Chen, Nanning Zheng, Weizhu Chen, and Jian-Guang Lou. Skill-based few-shot selection for in-context learning, 2023. URL https://arxiv.org/abs/2305.14210.
  • Bandura & Walters (1977) Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice hall Englewood Cliffs, NJ, 1977.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
  • Chen et al. (2025) Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, and Mohit Bansal. Symbolic mixture-of-experts: Adaptive skill-based routing for heterogeneous reasoning, 2025. URL https://arxiv.org/abs/2503.05641.
  • Chen et al. (2024) Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, and He He. On the relation between sensitivity and accuracy in in-context learning, 2024. URL https://arxiv.org/abs/2209.07661.
  • Cheng et al. (2023) Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, and Qi Zhang. Uprise: Universal prompt retrieval for improving zero-shot evaluation, 2023. URL https://arxiv.org/abs/2303.08518.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • Diaconis & Mazur (2003) Persi Diaconis and Barry C Mazur. The problem of thinking too much. Bulletin of the American Academy of Arts and Sciences, 56(3):26–38, 2003.
  • Didolkar et al. (2024a) Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Jimenez Rezende, Yoshua Bengio, Michael C Mozer, and Sanjeev Arora. Metacognitive capabilities of llms: An exploration in mathematical problem solving. Advances in Neural Information Processing Systems, 37:19783–19812, 2024a.
  • Didolkar et al. (2024b) Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Rezende, Yoshua Bengio, Michael Mozer, and Sanjeev Arora. Metacognitive capabilities of llms: An exploration in mathematical problem solving, 2024b. URL https://arxiv.org/abs/2405.12205.
  • Dinan et al. (2019) Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4537–4546, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1461. URL https://aclanthology.org/D19-1461/.
  • Gao et al. (2023) Irena Gao, Gabriel Ilharco, Scott Lundberg, and Marco Tulio Ribeiro. Adaptive testing of computer vision models, 2023. URL https://arxiv.org/abs/2212.02774.
  • Gao et al. (2025) Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, and Danqi Chen. Metadata conditioning accelerates language model pre-training. arXiv preprint arXiv:2501.01956, 2025.
  • Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  • Gui et al. (2024) Lin Gui, Cristina Gârbacea, and Victor Veitch. Bonbon alignment for large language models and the sweetness of best-of-n sampling. arXiv preprint arXiv:2406.00832, 2024.
  • Hattie & Timperley (2007) John Hattie and Helen Timperley. The power of feedback. Review of educational research, 77(1):81–112, 2007.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Hsieh et al. (2023) Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.
  • Ivison et al. (2023) Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702, 2023.
  • Kaur et al. (2024) Simran Kaur, Simon Park, Anirudh Goyal, and Sanjeev Arora. Instruct-skillmix: A powerful pipeline for llm instruction tuning. arXiv preprint arXiv:2408.14774, 2024.
  • Kirschner et al. (2006) Paul A Kirschner, John Sweller, and Richard E Clark. Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching. Educational psychologist, 41(2):75–86, 2006.
  • Li et al. (2025) Xiang Lisa Li, Farzaan Kaiyom, Evan Zheran Liu, Yifan Mai, Percy Liang, and Tatsunori Hashimoto. Autobencher: Towards declarative benchmark construction, 2025. URL https://arxiv.org/abs/2407.08351.
  • Liao et al. (2024) Huanxuan Liao, Shizhu He, Yupu Hao, Xiang Li, Yuanzhe Zhang, Jun Zhao, and Kang Liu. SKIntern: Internalizing symbolic knowledge for distilling better cot capabilities into small language models, 2024. URL https://arxiv.org/abs/2409.13183.
  • Liu et al. (2024a) Haoyu Liu, Jianfeng Liu, Shaohan Huang, Yuefeng Zhan, Hao Sun, Weiwei Deng, Furu Wei, and Qi Zhang. se²: Sequential example selection for in-context learning, 2024a. URL https://arxiv.org/abs/2402.13874.
  • Liu et al. (2024b) Ryan Liu, Jiayi Geng, Addison J Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L Griffiths. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse. arXiv preprint arXiv:2410.21333, 2024b.
  • Meta AI (2024) Meta AI. Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models, 2024. URL https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/.
  • Nie et al. (2020) Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4885–4901, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.441. URL https://aclanthology.org/2020.acl-main.441/.
  • OpenAI (2024) OpenAI. Gpt-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024.
  • Randi (2022) Judi Randi. Adaptive teaching. In Routledge encyclopedia of education, educational psychology. Routledge, 2022.
  • Ribeiro & Lundberg (2022) Marco Tulio Ribeiro and Scott Lundberg. Adaptive testing and debugging of NLP models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3253–3267, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.230. URL https://aclanthology.org/2022.acl-long.230/.
  • Shaik & Doboli (2025) Hashmath Shaik and Alex Doboli. Using a symbolic knowledge graph to address llm limitations in analog circuit topology generation. In 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), pp. 00528–00533. IEEE, 2025.
  • Sullivan & Elsayed (2024) Rob Sullivan and Nelly Elsayed. Can large language models act as symbolic reasoners? arXiv preprint arXiv:2410.21490, 2024.
  • Sweller (2011) John Sweller. Chapter two - cognitive load theory. volume 55 of Psychology of Learning and Motivation, pp. 37–76. Academic Press, 2011. doi: https://doi.org/10.1016/B978-0-12-387691-1.00002-8. URL https://www.sciencedirect.com/science/article/pii/B9780123876911000028.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
  • Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
  • Xiong et al. (2024) Wei Xiong, Hanning Zhang, Nan Jiang, and Tong Zhang. An implementation of generative prm. https://github.com/RLHFlow/RLHF-Reward-Modeling, 2024.
  • Xu et al. (2024) Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. Faithful logical reasoning via symbolic chain-of-thought, 2024. URL https://arxiv.org/abs/2405.18357.
  • Yang et al. (2024) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
  • Zeng et al. (2025) Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, and Pang Wei Koh. Evaltree: Profiling language model weaknesses via hierarchical capability trees, 2025. URL https://arxiv.org/abs/2503.08893.
  • Zhang et al. (2022) Yiming Zhang, Shi Feng, and Chenhao Tan. Active example selection for in-context learning, 2022. URL https://arxiv.org/abs/2211.04486.
  • Zhu et al. (2025) Xingyu Zhu, Abhishek Panigrahi, and Sanjeev Arora. On the power of context-enhanced learning in llms. arXiv preprint arXiv:2503.01821, 2025.

Appendix


Appendix A Experimental Details

A.1 Skill Annotation on MATH and GSM8K

As described in Section 3, we follow Didolkar et al. (2024a) to label skills on both the training and test sets of MATH and GSM8K using GPT-4o-mini (OpenAI, 2024). We list all skills that we used to annotate the questions in the MATH and GSM8K datasets in Tables 6, 7 and A.1, which have been taken from Didolkar et al. (2024a). We ask the LLM to read the question and provide up to five skills required to solve it, drawn from the given existing skill list. We show an example prompt for annotating MATH Number Theory questions as follows.

Table 5 shows some example MATH questions and their corresponding annotated skills. From the skill annotation, we construct a Skill Bank (see Figure 1 and Section 2.1) that stores the required skills for each question.

Table 5: Example MATH questions, and the annotated skills generated by GPT-4o-mini.

Table 6: List of skills used for annotating questions in each subject in MATH dataset

Table 7: List of skills used for annotating questions in each subject of MATH dataset (continued from Table 6)

A.2 Missing skill Identification from Model Responses

As described in Section 2.3, we use GPT-4o-mini to label the skills that are missing from a model response. We ask the LLM to read the question along with the SLM response and provide the skills that the model fails to leverage in the response, from the given existing skill list. Below we show an example prompt for labeling missing skills for MATH Number Theory questions, as well as an example LLM output.

A.3 Skill-based Example Retrieval

We outline our algorithm for retrieving in-context examples tailored to a specific set of skills. Leveraging the Skill-Map definition in Section 2.1, which annotates each question with its associated skills, we construct an inverse mapping called Example-Bank: Skill-Bank(𝒬) → 𝒫. This map associates each skill s with the subset of in-context examples in the pool 𝒫 that are linked to s according to Skill-Map. Given a question q and a target skill set K, we retrieve in-context examples by randomly selecting one example from Example-Bank(s) for each skill s in K. The algorithm is given in Section A.3.
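Constructing the inverse map is a straightforward dictionary inversion; a sketch, with the dict layout as our own assumption:

```python
from collections import defaultdict

def build_example_bank(skill_map):
    """Invert a Skill-Map (question -> list of skills) into an
    Example-Bank (skill -> list of questions usable as in-context
    examples). The names follow the paper; the dict layout is ours."""
    bank = defaultdict(list)
    for question, skills in skill_map.items():
        for s in skills:
            bank[s].append(question)
    return bank
```

Each skill then indexes directly into the pool of candidate examples for retrieval.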

Algorithm 1 Skill-based example retrieval

```
Input:  list of skills K = [k_1, ..., k_n]  (n ≤ 5)
Output: selected 5-shot examples E = [e_1, ..., e_5]

E ← []
if K is not empty then
    ▷ allow additional repeated in-context examples for the first 5 − n slots
    for i = 1 to 5 − n do
        E′ ← Example-Bank(k_1)
        if E′ is not empty then
            e ← random_choice(E′)
            E ← E + [e]
        end if
    end for
    for each k in K do
        E′ ← Example-Bank(k)
        if E′ is not empty then
            e ← random_choice(E′)
            E ← E + [e]
        end if
    end for
end if
E ← Set(E)    ▷ remove repeated instances
if len(E) < 5 then
    ▷ rare case: the pool lacks enough examples for a skill
    append examples from the fixed in-context set to fill the remaining shots
end if
return E
```
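The retrieval procedure above can be sketched in Python as follows. This is a minimal sketch with function and variable names of our own choosing; we assume examples are hashable strings so duplicates can be removed, and `example_bank` is the skill-to-examples mapping described earlier.

```python
import random

def retrieve_examples(skills, example_bank, fixed_examples, n_shots=5):
    """Sketch of skill-based in-context example retrieval:
    fill the 5 - n slack from the first skill's pool, pick one example
    per target skill, deduplicate, then top up from the fixed set."""
    selected = []
    if skills:
        # Extra (possibly repeated) examples drawn for the first skill.
        for _ in range(n_shots - len(skills)):
            pool = example_bank.get(skills[0], [])
            if pool:
                selected.append(random.choice(pool))
        # One example per skill in the target set.
        for skill in skills:
            pool = example_bank.get(skill, [])
            if pool:
                selected.append(random.choice(pool))
    # Remove duplicates while preserving order.
    seen, unique = set(), []
    for ex in selected:
        if ex not in seen:
            seen.add(ex)
            unique.append(ex)
    # Rare case: top up with fixed examples to reach n_shots.
    for ex in fixed_examples:
        if len(unique) >= n_shots:
            break
        if ex not in seen:
            seen.add(ex)
            unique.append(ex)
    return unique[:n_shots]
```
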

Appendix B Ablation Study

B.1 Ablations on the reward filtering method in Stage 1

Recall that in Stage 1 of the AdaptMI pipeline, we use an off-the-shelf process reward model (RLHFlow/Llama3.1-8B-PRM-Mistral-Data) to score small language models’ responses, in order to filter out a set of difficult questions for each model. Here, we conduct various ablation studies on the reward filtering process.
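As a rough illustration of two-threshold reward filtering, consider the hypothetical rule below. This is a sketch under our own assumptions, not the paper's exact rule (which is specified in Section 2.2): we assume a response is flagged as difficult when its final-step reward falls below τ₁ = 0.85 or any intermediate step reward falls below τ₂ = 0.7.

```python
def is_difficult(step_rewards, tau1=0.85, tau2=0.7):
    """Hypothetical two-threshold filter over per-step PRM scores:
    flag as difficult if the final step scores below tau1 or any
    step scores below tau2."""
    if not step_rewards:
        return True
    return step_rewards[-1] < tau1 or min(step_rewards) < tau2
```
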

Out-of-distribution (OOD) prediction performance of reward model.

Although we primarily evaluated AdaptMI on MATH and GSM8K, our method can potentially be extended to other math datasets. While the reward model we used in Stage 1 was trained only on the MATH and GSM8K distributions, we show that it is capable of scoring responses on various OOD math datasets. Table 8 reports the reward model's performance on classifying correct/incorrect responses from Qwen2.5-7B-Instruct on four popular math benchmarks: AMC23, AIME24, AIME25, and MATH 2. The reward model achieves comparably high performance when scoring SLM responses on these significantly more difficult OOD benchmarks, indicating that it generalizes well. This implies the potential to extend our method to new datasets without training a specialized reward model for each one.

Table 8: Reward model prediction metrics across four OOD math benchmarks. Despite not being trained on these benchmarks, the reward model’s prediction capability is largely generalizable to them.

Reward Filtering vs. Simple Heuristics for classifying difficult questions.

Considering the computational overhead of calling a separate PRM, we explored alternative, computation-free heuristics for classifying questions. Specifically, we experimented with two heuristic strategies:

  • Consistency heuristic: We measure the consistency of the model across five sampled generations per question and classify low-consistency questions as difficult. Specifically, a question is difficult if, among the 5 sampled generations, the most common response appears fewer than 2 times.
  • Length heuristic: We use the length of the model's responses as a proxy and classify questions with longer responses as difficult. Specifically, a question is difficult if the model's average response length on this question is ≥ 800 words.
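The two heuristics can be sketched directly from their definitions. The function names and the exact notion of response length (whitespace-split word count) are our assumptions; the thresholds match the ones stated above.

```python
from collections import Counter

def is_difficult_by_consistency(answers, min_count=2):
    """Consistency heuristic: difficult if the most common of the
    sampled answers appears fewer than min_count times."""
    if not answers:
        return True
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count < min_count

def is_difficult_by_length(responses, max_words=800):
    """Length heuristic: difficult if the average response length
    is at least max_words words."""
    avg_len = sum(len(r.split()) for r in responses) / len(responses)
    return avg_len >= max_words
```
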

Table 9 shows that both heuristics yield reasonably accurate predictions. Moreover, applying AdaptMI on top of the heuristic-classified difficult questions still improves final accuracy by 2%. However, we leave a more thorough investigation of the robustness and generalizability of these strategies relative to PRM-based classification for future work.

Table 9: Performance of the consistency heuristic and length heuristic on classifying difficult questions. The classification accuracy of the simple heuristics is on par with the reward filtering method. Applying Stage 2 of AdaptMI on top of the heuristic-classified difficult questions improves the final accuracy by 2%.

Process Reward vs. Outcome Reward.

We also compare the prediction accuracy of our process reward model (PRM) with threshold filtering (see Section 2.2) against loading the same reward model as an outcome reward model (ORM). Our preliminary experiments indicated 0.9 as the optimal threshold for the outcome rewards. With τ = 0.9, the ORM's prediction metrics are Precision = 0.54 / Recall = 0.90 / F1 = 0.68, whereas the PRM with optimal thresholds achieves Precision = 0.70 / Recall = 0.92 / F1 = 0.80. Our method using PRM with threshold filtering is therefore superior to directly using an ORM.
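As a quick sanity check, the reported F1 scores are consistent with the harmonic mean of the stated precision and recall:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

orm_f1 = f1(0.54, 0.90)  # close to the reported 0.68
prm_f1 = f1(0.70, 0.92)  # close to the reported 0.80
```
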

B.2 Comparing few-shot instructions with natural language instructions

Here, we explore an alternative strategy for constructing adaptive in-context instructions. We test whether the additional LLM supervision in AdaptMI+ could instead be provided as feedback in the form of natural language instructions.

Table 10: Qwen2.5-7B-Instruct accuracy under LLM-generated natural language instructions.

For difficult questions, we modify our adaptive instruction as follows. After obtaining the predicted missing skills for a model response from an LLM, we prompt the LLM again with the missing skills and the corresponding skill-based in-context examples, asking it to return concise natural language feedback that criticizes the model's response and gives hints on how to apply the required skills. See below for an example prompt.

We report the behavior of the modified AdaptMI+ on Qwen2.5-7B-Instruct. Interestingly, we observe that even 7B models tend not to benefit from the unstructured instructions (see Table 10). Furthermore, even when skill-based in-context examples are provided alongside the LLM feedback, the SLM's performance remains nearly unchanged, which suggests the model simply ignores in-context information that consists of long, unstructured natural language feedback.

B.3 Fine-grained analysis of skill-based and fixed in-context examples on original manual split of MATH dataset


Figure 5: Accuracy and average output length of Qwen2.5-3B-Instruct on questions of Level 1–5 defined in the MATH dataset. Compared to Figure 3, the performance gap between fixed and skill-based examples is unnoticeable across all levels.

We repeat our experiment from Section 4.1.1. However, instead of using Best-of-n sampling to split the evaluation set into 5 levels, we now use the manual difficulty split given in the original MATH dataset. We report comparisons between the skill-based and fixed in-context example selection strategies in Figure 5.

Interestingly, the differences in the SLM's ICL performance and generation length between skill-based and fixed in-context examples are less pronounced across the 5 difficulty levels, compared to the results in Figure 3. This suggests that the manual difficulty split in the MATH dataset may not align well with the model's own perception of question difficulty. To capture more fine-grained distinctions between the two strategies, Best-of-n sampling over the model's own responses serves as a more reliable indicator of question difficulty.
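As a hypothetical illustration of a Best-of-n difficulty split (the paper's actual procedure is described in Section 4.1.1; the mapping below is an assumption of ours), one could bucket questions by how many of the n sampled generations are correct:

```python
def difficulty_level(n_correct, n_samples=5):
    """Hypothetical Best-of-n split: more correct samples -> easier level.
    Levels run from 1 (easiest) to n_samples (hardest); a question with
    zero correct samples shares the hardest level."""
    assert 0 <= n_correct <= n_samples
    return n_samples + 1 - max(n_correct, 1)
```
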

Appendix C Case Studies

In this section, we conduct case studies to gain deeper insight into how skill-based in-context examples might harm performance on easy questions, as mentioned in Section 4. We present two questions that the SLM solves successfully with fixed examples but fails with skill-based examples.

C.1 Skill-based examples lead the model to overlook key problem constraints

In the example below, the Qwen2.5-7B-Instruct model is given an algebra question that includes multiple geometric constraints. Although the question involves both Geometry and Algebra, it is classified only as an Algebra question in MATH, and hence paired with algebraic skill examples. When prompted with fixed examples, the model correctly identifies two possible answers and chooses the correct one according to the given condition "both coordinates are negative." In contrast, when conditioned on examples representing algebraic skills, the model overemphasizes algebraic completeness and overlooks this key problem condition, ultimately selecting the incorrect answer by random guess.

C.2 Symbol-heavy skill-based examples cause the model to overthink.

The question below requires a plug-in-and-test approach rather than solving an equation. With fixed in-context examples, the model finds the correct answer by directly plugging in and trying out small values. However, the skill-based examples, which involve equation solving, appear to cause the model to overthink: after its first plug-in-and-test attempt fails, it resorts to solving the equation system and eventually fails.

Appendix D Additional Results

D.1 Classification results of easy and difficult questions

In Stage 1 of AdaptMI (see Section 2.2), we identify a set of difficult questions for each individual model using a process reward model along with a filtering heuristic. Table 11 reports the proportion of questions classified as difficult for each model in each math domain. Consistent with the accuracy results in Section 3, the proportions of difficult questions closely track each model's accuracy, even though the pipeline never accesses the ground truth. Notably, our classification method captures not only questions that the model gets wrong, but also questions that the model passes with a flawed solution process.

Table 11: Proportions of difficult questions (%) classified by AdaptMI for each model. Although our method did not access the ground truth, the proportion of classified difficult questions still closely mirrors each model’s accuracy (see Section 3) in each domain.

D.2 AdaptMI and AdaptMI+ performances

In addition to Section 3, we report the accuracy results on Number Theory, Intermediate Algebra, and Counting & Probability in MATH in Table 12. These results align with our main findings: AdaptMI and AdaptMI+ yield substantial improvements over all Pass@1 baselines, while being on par with the Consistency@5 results.

D.3 Effect of skill-based examples on difficult and easy questions

In Section 4, we introduce our observation that skill-based examples boost SLM performance only on difficult questions and harm performance on easier ones. We present additional results for Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct in Tables 13 and 14. Similar to Table 2, there is a clear performance drop on easy questions with skill-based examples, although the drop for Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct is less significant than for Qwen2.5-1.5B-Instruct.

Table 12: Additional results for Section 3. AdaptMI and AdaptMI+ also demonstrate consistent accuracy gains compared with baseline methods. All results are Pass@1 accuracy unless otherwise indicated. Exp. stands for Examples. The selection methods for fixed, random, and skill-based examples are introduced in Section 2.1.

Table 13: Accuracy of Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, and Qwen2.5-7B-Instruct on difficult and easy questions, respectively under fixed, random, and skill-based examples (additional results for Table 2). Skill-based examples boost performance on difficult questions across all categories, while significantly underperforming on easy questions. The gap between easy and difficult questions is more pronounced for smaller models.

Table 14: Accuracy of Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, and Qwen2.5-7B-Instruct on difficult and easy questions, respectively under fixed, random, and skill-based examples (additional results for Table 2). Skill-based examples boost performance on difficult questions across all categories, while significantly underperforming on easy questions. The gap between easy and difficult questions is more pronounced for smaller models.
