Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance? Paper • 2606.12250 • Published 25 days ago
PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues Paper • 2601.17277 • Published Jan 24 • 6
INTIMA: A Benchmark for Human-AI Companionship Behavior Paper • 2508.09998 • Published Aug 4, 2025 • 11
Grounding Computer Use Agents on Human Demonstrations Paper • 2511.07332 • Published Nov 10, 2025 • 107