CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation Paper • 2311.18702 • Published Nov 30, 2023
AlignBench: Benchmarking Chinese Alignment of Large Language Models Paper • 2311.18743 • Published Nov 30, 2023 • 1
EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training Paper • 2108.01547 • Published Aug 3, 2021
CharacterBench: Benchmarking Character Customization of Large Language Models Paper • 2412.11912 • Published Dec 16, 2024
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models Paper • 2508.06471 • Published Aug 8, 2025 • 206
CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models Paper • 2311.16832 • Published Nov 28, 2023 • 1
IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation Paper • 2603.04738 • Published 8 days ago • 1
Benchmarking Complex Instruction-Following with Multiple Constraints Composition Paper • 2407.03978 • Published Jul 4, 2024 • 1
IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation Paper • 2511.01014 • Published Nov 2, 2025 • 1
HPSS: Heuristic Prompting Strategy Search for LLM Evaluators Paper • 2502.13031 • Published Feb 18, 2025
The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning Paper • 2601.14127 • Published Jan 20, 2026 • 5
PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning Paper • 2509.19894 • Published Sep 24, 2025 • 34