LLM Performance Leaderboard
View the LLM leaderboard rankings
For consumer applications and chat experiences, speed and responsiveness are critical to user engagement. Users expect near-instant responses, and delays can directly lead to reduced engagement. When building more complex applications involving tool use or agentic systems, speed and cost become even more important, and can become the limiting factor on overall system capability. The time taken by sequential requests to LLMs can quickly stack up for each user request adding to the cost.
This is why Artificial Analysis (@ArtificialAnlys) developed a leaderboard evaluating price, speed and quality across >100 serverless LLM API endpoints, now coming to Hugging Face.
Find the leaderboard here!
The LLM Performance Leaderboard aims to provide comprehensive metrics to help AI engineers make decisions on which LLMs (both open & proprietary) and API providers to use in AI-enabled applications.
When making decisions regarding which AI technologies to use, engineers need to consider quality, price and speed (latency & throughput). The LLM Performance Leaderboard brings all three together to enable decision making in one place across both proprietary & open models.
Source: LLM Performance Leaderboard
The metrics reported are:
For further definitions, see our full methodology page.
The leaderboard allows exploration of performance under several different workloads (6 combinations in total):
We test every API endpoint on the leaderboard 8 times per day, and leaderboard figures represent the median measurement of the last 14 days. We also have percentile breakdowns within the collapsed tabs.
Quality metrics are currently collected on a per-model basis and show results reports by model creators, but watch this space as we begin to share results from our independent quality evaluations across each endpoint.
For further definitions, see our full methodology page.
Our chart of Quality vs. Throughput (tokens/s) shows the range of options with different quality and performance characteristics.
Source: artificialanalysis.ai/models
In some cases, design patterns involving multiple requests with faster and cheaper models can result in not only lower cost but better overall system quality compared to using a single larger model.
For example, consider a chatbot that needs to browse the web to find relevant information from recent news articles. One approach would be to use a large, high-quality model like GPT-4 Turbo to run a search then read and process the top handful of articles. Another would be to use a smaller, faster model like Llama 3 8B to read and extract highlights from dozens web pages in parallel, and then use GPT-4 Turbo to assess and summarize the most relevant results. The second approach will be more cost effective, even after accounting for reading 10x more content, and may result in higher quality results.
Please follow us on Twitter and LinkedIn for updates. We're available via message on either, as well as on our website and via email.
View the LLM leaderboard rankings