# Evaluate on the Hub

You can evaluate AI models on the Hub in several ways, and this page guides you through the options:

- **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone by ranking them.
- **Model Cards** provide a comprehensive overview of a model's capabilities from the author's perspective.
- **Libraries and Packages** give you the tools to evaluate your models on the Hub.

## Eval Results on the Hub

The Hub provides a decentralized system for tracking model evaluation results. Benchmark datasets can host leaderboards, and model repos store evaluation scores that automatically appear on both the model page and the benchmark's leaderboard.

![Eval Results on the Hub](https://huggingface.co/huggingface/documentation-images/resolve/main/evaluation-results/benchmark-preview.png)

You can add evaluation results to any model by submitting a YAML file to the `.eval_results/` folder in the model repo. These results display with badges indicating whether they are verified, community-provided, or linked to a benchmark leaderboard.

For full details on adding evaluation results to models and registering benchmark datasets, see the [Evaluation Results documentation](https://huggingface.co/docs/hub/eval-results).
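As a minimal sketch, here is how you might compose such a file and push it with the `huggingface_hub` client. Note that the YAML field names below are illustrative assumptions, and the repo ID is a placeholder; consult the Eval Results documentation linked above for the official schema.

```python
# Sketch: compose an evaluation-result YAML file for a model repo's
# `.eval_results/` folder. The field names are illustrative assumptions,
# not the official schema -- check the Eval Results docs for the real one.
payload = """\
task: text-classification
dataset: glue/sst2
metric: accuracy
value: 0.94
"""

# Pushing the file with the `huggingface_hub` client (requires a write
# token for the target repo):
#
#   from huggingface_hub import HfApi
#   HfApi().upload_file(
#       path_or_fileobj=payload.encode(),
#       path_in_repo=".eval_results/sst2-accuracy.yaml",
#       repo_id="your-username/your-model",
#   )
print(payload)
```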

## Community Managed Leaderboards

Community leaderboards show how models perform on a given task or domain. For example, there are leaderboards for question answering, reasoning, classification, vision, and audio. If you're tackling a new task, you can use a leaderboard to see how existing models perform on it before choosing one.

Here are some examples of community leaderboards:

| Leaderboard | Model Type | Description |
| --- | --- | --- |
| [MTEB](https://huggingface.co/spaces/mteb/leaderboard)| Embedding | The Massive Text Embedding Benchmark leaderboard compares 100+ text and image embedding models across 1000+ languages. Refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help improve zero-shot annotations, or propose other changes to the leaderboard. |
| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). (See [the paper](https://arxiv.org/abs/2311.12983) for more details.) |
| [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)| Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
| [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)| Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
| [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardware, backends and optimizations using Optimum-Benchmark. |

There are many more leaderboards on the Hub. Check out all the leaderboards via this [search](https://huggingface.co/spaces?category=model-benchmarking) or use this [dedicated Space](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) to find a leaderboard for your task.

## Model Cards

Model cards summarize a model's capabilities and evaluation results, as documented by the model's author or the community. They are a great way to understand a model's strengths and limitations.

![Qwen model card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)

Unlike leaderboards, model card evaluation scores are often created by the author, rather than by the community.

> [!TIP]
> For information on reporting results, see details on [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results).

## Libraries and packages

There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or measure performance on a custom evaluation task.

### LightEval

LightEval is Hugging Face's library for evaluating LLMs. It is designed to be comprehensive and customizable, and it is actively maintained. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.

### 🤗 Evaluate

A library for easily evaluating machine learning models and datasets.

With a single line of code, you get access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!). Be it on your local machine or in a distributed training setup, you can evaluate your models in a consistent and reproducible way!

Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) for a full list of available metrics. Each metric has a dedicated Space with an interactive demo showing how to use it, and a documentation card detailing the metric's limitations and usage.


The 🤗 Evaluate documentation is organized into four sections:

- **Tutorials**: Learn the basics and become familiar with loading, computing, and saving with 🤗 Evaluate. Start here if you are using 🤗 Evaluate for the first time!
- **How-to guides**: Practical guides to help you achieve a specific goal. Take a look at these guides to learn how to use 🤗 Evaluate to solve real-world problems.
- **Conceptual guides**: High-level explanations to build a better understanding of important topics, such as the considerations that go into evaluating a model or dataset and the difference between metrics, measurements, and comparisons.
- **Reference**: Technical descriptions of how 🤗 Evaluate classes and methods work.