TRUEBench: Explore and compare language model performance across categories and languages