AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
Paper • 2508.16402 • Published • 14
The Truthfulness Spectrum Hypothesis
AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking