TRUEBench: Explore and compare language model performance across categories and languages