---
title: README
emoji: ๐
colorFrom: blue
colorTo: pink
sdk: static
pinned: false
---
## Consensus-driven LLM Evaluation
The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks.

To address the challenge of ranking LLMs on highly subjective tasks such as emotional intelligence, creative writing, or persuasiveness, the **Language Model Council (LMC)** operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury.
Our initial research deploys a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, robust, and less biased than those from any individual LLM judge, and more consistent with a human-established leaderboard than other benchmarks such as Chatbot Arena or MMLU.
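For concreteness, the sketch below walks through the three-phase council protocol in plain Python. It is a minimal illustration under stated assumptions, not the project's actual pipeline: the council member names, the `query_model` helper, and the win-rate aggregation are all hypothetical placeholders.

```python
# Minimal sketch of the three-phase council protocol (illustrative only).
from collections import defaultdict
from itertools import permutations

COUNCIL = ["model_a", "model_b", "model_c"]  # placeholder member names

def query_model(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model` and return its reply."""
    raise NotImplementedError("wire this to your LLM API of choice")

def run_council(seed_dilemma: str) -> dict[str, float]:
    # 1) Formulate the test set: every member contributes a prompt equally.
    test_set = [
        query_model(m, f"Expand this interpersonal dilemma into a test prompt:\n{seed_dilemma}")
        for m in COUNCIL
    ]

    # 2) Administer the test: every member answers every prompt.
    responses = {
        (m, i): query_model(m, prompt)
        for i, prompt in enumerate(test_set)
        for m in COUNCIL
    }

    # 3) Evaluate as a collective jury: each member votes on pairwise
    #    match-ups, and the votes are tallied into a simple win-rate ranking.
    wins = defaultdict(int)
    for i, _ in enumerate(test_set):
        for a, b in permutations(COUNCIL, 2):
            for judge in COUNCIL:
                verdict = query_model(
                    judge,
                    "Which response better handles the dilemma? Answer A or B.\n"
                    f"A: {responses[(a, i)]}\nB: {responses[(b, i)]}",
                )
                wins[a if verdict.strip().upper().startswith("A") else b] += 1

    total = sum(wins.values()) or 1
    return {m: wins[m] / total for m in COUNCIL}
```

In practice the jury step would also need to control for known judge biases (e.g. position and self-preference), which a simple tally like the one above does not address.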
| Roadmap: | |
| - Use the Council to benchmark evaluative characteristics of LLM-as-a-Judge/Jury like bias, affinity, and agreement. | |
| - Expand to more domains, use cases, and sophisticated agentic interactions. | |
| - Produce a generalized user interface for Council-as-a-Service. | |