Diraitory

LLM 基准测试 - 含AI评论的目录

为某项任务选择合适的 LLM 需要在推理、编码、多语言能力和成本等维度上进行严格比较。LMSYS Chatbot Arena 使用众包的人类偏好评分对模型在开放式任务上进行排名。HELM 为学术界和行业比较提供标准化的基准套件，而 Hugging Face Open LLM Leaderboard 则追踪开源模型的性能。Artificial Analysis 在评估体系中加入了吞吐量和延迟等基础设施指标。

HELM

HELM 免费免费计划开源 2条评论 Stanford 的 HELM 在数十个场景中评估语言模型，衡量准确性、鲁棒性和公平性。

2 4.9 LMSYS Chatbot Arena 免费免费计划开源 2条评论 LMSYS Chatbot Arena is a crowdsourced LLM evaluation platform developed by LMSYS Org that ranks language models through blind, randomized head-to-head battles judged by human users. Visitors submit prompts and vote on anonymous model outputs, with results aggregated into an Elo-style leaderboard tha

Hugging Face Open LLM Leaderboard

Hugging Face Open LLM Leaderboard 免费免费计划 API 开源 2条评论 The Open LLM Leaderboard by Hugging Face is a comprehensive benchmark tracking platform that evaluates open-source language models across standardized academic benchmarks. The leaderboard automatically runs models through evaluation suites including MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, and

Artificial Analysis

Artificial Analysis 免费免费计划 2条评论 Artificial Analysis 独立评估 AI 模型的质量、速度、价格和吞吐量，为开发者比较 API 提供商。

Evalverse

Evalverse 免费免费计划开源 2条评论 Evalverse 是一个开源的统一 LLM 评估框架，将多个基准套件整合到一个界面中。