Article Community Evals: Because we're done trusting black-box leaderboards over the community • 4 days ago • 39
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques Paper • 2602.03837 • Published 4 days ago • 4
Failing to Explore: Language Models on Interactive Tasks Paper • 2601.22345 • Published 9 days ago • 2