Diraitory

LLM Benchmarks - Directory w/ AI Reviews

Choosing the right LLM for a task requires rigorous comparison across dimensions like reasoning, coding, multilingual ability, and cost. LMSYS Chatbot Arena uses crowdsourced human preference ratings to rank models on open-ended tasks. HELM provides standardized benchmark suites for academic and industry comparison, while the Hugging Face Open LLM Leaderboard tracks open-source model performance. Artificial Analysis adds infrastructure metrics like throughput and latency to the evaluation picture.

HELM

Free · Free Plan · Open Source · 2 reviews
HELM by Stanford evaluates language models across dozens of scenarios, measuring accuracy, robustness, and fairness.

LMSYS Chatbot Arena

Rating: 4.9 · Free · Free Plan · Open Source · 2 reviews
LMSYS Chatbot Arena is a crowdsourced LLM evaluation platform developed by LMSYS Org that ranks language models through blind, randomized head-to-head battles judged by human users. Visitors submit prompts and vote on anonymous model outputs, with results aggregated into an Elo-style leaderboard tha
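To make "Elo-style" concrete, here is a minimal sketch of how pairwise votes can be turned into a ranking with the textbook Elo update. It is an illustration only, not Chatbot Arena's actual pipeline: the K-factor of 32, the starting rating of 1000, the function names, and the model names are all assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str,
               k: float = 32.0, initial: float = 1000.0) -> None:
    """Apply one head-to-head result in place (k and initial are assumed values)."""
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    # The winner gains more when the win was less expected.
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta


def leaderboard(votes: list) -> list:
    """Turn a sequence of (winner, loser) votes into a ranking, best first."""
    ratings: dict = {}
    for winner, loser in votes:
        update_elo(ratings, winner, loser)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes between placeholder models.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for name, rating in leaderboard(votes):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo updates depend on the order in which votes arrive, so a production leaderboard may prefer a fit over all votes at once; the sketch only shows the basic idea of rating from pairwise preferences.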

Hugging Face Open LLM Leaderboard

Free · Free Plan · API · Open Source · 2 reviews
The Open LLM Leaderboard by Hugging Face is a comprehensive benchmark tracking platform that evaluates open-source language models across standardized academic benchmarks. The leaderboard automatically runs models through evaluation suites including MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, and

Artificial Analysis

Free · Free Plan · 2 reviews
Artificial Analysis independently evaluates AI models across quality, speed, price, and throughput, comparing API providers for developers.
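As a rough illustration of what speed and throughput figures can mean for an API call, the sketch below derives time to first token and output tokens per second from three timestamps. It is not Artificial Analysis's methodology: the `RequestTiming` type, its field names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps (in seconds) for one hypothetical streaming API request."""
    sent_at: float         # when the request was sent
    first_token_at: float  # when the first output token arrived
    finished_at: float     # when the last output token arrived
    output_tokens: int     # number of tokens generated


def time_to_first_token(timing: RequestTiming) -> float:
    """Latency before any output appears."""
    return timing.first_token_at - timing.sent_at


def output_tokens_per_second(timing: RequestTiming) -> float:
    """Generation throughput, measured after the first token arrives."""
    generation_time = timing.finished_at - timing.first_token_at
    if generation_time <= 0:
        raise ValueError("generation time must be positive")
    return timing.output_tokens / generation_time


# Made-up numbers for illustration only.
sample = RequestTiming(sent_at=0.0, first_token_at=0.5,
                       finished_at=2.5, output_tokens=100)
print(time_to_first_token(sample))       # 0.5
print(output_tokens_per_second(sample))  # 50.0
```

Separating the two numbers matters because a provider can start responding quickly yet generate slowly, or the reverse.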

Evalverse

Free · Free Plan · Open Source · 2 reviews
Evalverse is an open-source unified LLM evaluation framework integrating multiple benchmark suites in one interface.