| Rank | Algorithm | #Features | Mean F1 | Mean AUC | Time (s) |
|---|---|---|---|---|---|
AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, and reliable. It is based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions. Model responses on this set are then compared to reference responses (Davinci003 for AlpacaEval, GPT-4 Preview for AlpacaEval 2.0) by the provided GPT-4-based auto-annotators, which results in the win rates presented above. AlpacaEval displays a high agreement rate with ground-truth human annotations, and leaderboard rankings on AlpacaEval are highly correlated with leaderboard rankings based on human annotators. Please see our documentation for more details on our analysis.
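To make the scoring concrete, a win rate is just the fraction of pairwise comparisons in which the auto-annotator preferred the model's response over the reference. The sketch below is illustrative only: the annotation representation (a flat list of preference labels, with ties as 0.5) is an assumption for exposition, not AlpacaEval's internal format.

```python
def win_rate(preferences):
    """Percentage of pairwise comparisons won by the candidate model.

    `preferences` is a list of labels: 1.0 if the candidate's response
    was preferred over the reference, 0.0 otherwise; a tie can be
    modeled as the literal 0.5. (Assumed representation, for
    illustration only.)
    """
    if not preferences:
        raise ValueError("no annotations to aggregate")
    return 100.0 * sum(preferences) / len(preferences)


# A model preferred in 3 of 4 comparisons scores a 75% win rate.
print(win_rate([1, 1, 0, 1]))  # -> 75.0
```

The leaderboard aggregates exactly this kind of per-comparison preference over the whole evaluation set, which is why agreement between the auto-annotator and human annotators translates directly into agreement between the two leaderboards.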
We welcome new model contributions to the leaderboard from the community! To do so, please follow the steps in the contributions section. Specifically, you'll need to run the model on the evaluation set, auto-annotate the outputs, and submit a PR with the model config and leaderboard results. We've also set up a Discord for community support and discussion.
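When preparing a submission, the model outputs are shared as per-example records. The sketch below validates that each record carries the fields a submission is assumed to need; the field names (`instruction`, `output`, `generator`) follow the public example outputs but should be treated as assumptions, and `my-model-v1` is a hypothetical model name.

```python
# Minimal sketch of checking submission records before opening a PR.
# Field names are assumptions based on the public example outputs.
REQUIRED_KEYS = {"instruction", "output", "generator"}


def validate_outputs(records):
    """Return the record count if every record has the expected fields."""
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} missing fields: {sorted(missing)}")
    return len(records)


sample = [
    {
        "instruction": "What are the names of some famous actors?",
        "output": "Some famous actors include ...",
        "generator": "my-model-v1",  # hypothetical model name
    }
]
print(validate_outputs(sample))  # -> 1
```

A quick check like this catches malformed records locally, before the auto-annotation step or the PR review does.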
We also welcome contributions of new evaluators and new eval sets! For making new evaluators, we release our ground-truth human annotations and comparison metrics. We also release a rough guide to follow for making new eval sets. We specifically encourage contributions of harder instruction distributions and of safety testing for LLMs.