About

Whisper is an open-source automatic speech recognition (ASR) system developed by OpenAI, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Released in September 2022, Whisper approaches human-level robustness and accuracy on English speech recognition and demonstrates strong performance across many languages, accents, and audio conditions without the need for fine-tuning. The model is designed as a general-purpose speech recognition system capable of multilingual speech recognition, speech translation to English, and language identification.

Whisper employs a transformer-based encoder-decoder architecture and was trained using a large-scale weak supervision approach, learning from a diverse dataset of audio paired with transcripts sourced from the internet. This training methodology enables Whisper to handle a wide variety of audio conditions including background noise, technical jargon, accented speech, and low-quality recordings that challenge traditional ASR systems.

The model is available in multiple sizes to accommodate different computational requirements: Tiny (39M parameters), Base (74M), Small (244M), Medium (769M), and Large (1.55B parameters), with larger models providing higher accuracy at the cost of increased inference time and memory usage. Whisper supports transcription and translation across approximately 100 languages, with particularly strong performance in English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese, and many other widely spoken languages. The model includes automatic punctuation, capitalization, and timestamp generation at both the segment and word level.

As open-source software released under the MIT license, Whisper can be run locally on consumer hardware with a compatible GPU, integrated into applications through Python, or accessed through OpenAI's commercial API for managed inference. The open-source release has spawned a large ecosystem of tools, integrations, and optimized variants including faster-whisper, whisper.cpp, and WhisperX that offer improved inference speed and additional features.
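
As a rough illustration of the local Python route described above, the sketch below picks the largest model size that fits a parameter budget and then transcribes a file. It assumes the `openai-whisper` package and ffmpeg are installed. The helpers `pick_model` and `transcribe_file` and the budget-based selection rule are hypothetical and not part of Whisper; the parameter counts are the ones quoted above.

```python
# Parameter counts (in millions) as quoted in the description above.
MODEL_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def pick_model(max_params_m: float) -> str:
    """Return the largest model whose parameter count fits the budget.

    Hypothetical helper: the budget rule is an illustration only.
    """
    fitting = [name for name, size in MODEL_PARAMS_M.items() if size <= max_params_m]
    if not fitting:
        raise ValueError("no Whisper model fits within the given parameter budget")
    return max(fitting, key=MODEL_PARAMS_M.__getitem__)


def transcribe_file(path: str, max_params_m: float = 244) -> str:
    """Transcribe an audio file locally and return the recognized text."""
    # Imported lazily so the size helper works without the package installed.
    import whisper  # requires `pip install openai-whisper`

    model = whisper.load_model(pick_model(max_params_m))
    result = model.transcribe(path)
    return result["text"]
```

With the default budget, `transcribe_file("meeting.mp3")` would load the Small model; raising the budget trades speed and memory for accuracy, as the size list above suggests.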

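AI Audio Model

Whisper is an open-source foundation audio model that processes speech with a Transformer-based encoder-decoder architecture and was trained on 680,000 hours of multilingual data. Available in five sizes ranging from 39 million to 1.55 billion parameters, it serves as a core audio-understanding model in the AI ecosystem for speech recognition, translation, and language identification tasks.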

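AI Research Tool

Whisper is a foundational tool for speech and audio research, providing a high-quality open-source baseline for benchmarking ASR systems. Researchers use Whisper and its variants to study multilingual speech recognition and audio processing techniques, and as a component of multimodal AI systems that combine speech understanding with other modalities.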

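AI Transcription

Whisper is one of the most capable and widely used AI transcription systems available, providing automatic speech recognition across roughly 100 languages with near-human accuracy in English. It handles a range of audio conditions, including background noise, accented speech, and technical terminology, and produces transcripts with automatic punctuation, capitalization, and timestamps.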
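
To make the timestamped output more concrete, here is a minimal sketch that turns transcript segments into SRT subtitles. It assumes each segment is a dict with `start` and `end` times in seconds plus a `text` field, which is the shape the open-source Python package returns under `result["segments"]`. The function names are illustrative only.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render segments (dicts with "start", "end", "text") as SRT subtitles."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        span = f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}"
        # Segment text often carries a leading space, so strip it.
        blocks.append(f"{index}\n{span}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n" if blocks else ""
```

Passing the segments from a transcription result to `segments_to_srt` and writing the returned string to a `.srt` file should give subtitles that most video players can load.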

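Open-Source LLM

Although technically a speech model rather than a language model, Whisper is one of OpenAI's most significant open-source AI releases, published under the MIT license. Its open availability has fostered an entire ecosystem of derivative tools and optimized implementations, making it a cornerstone of the open-source AI community for audio processing tasks.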

Tool Details (Free)

Pricing: Free and open source (API access via OpenAI: $0.006/minute)
Platform: Self-hosted, API
Headquarters: San Francisco, California
Founded: 2022
Free plan
API available
Open source
4.6 (1 review)
Format Flexibility: 4.7
Audio Fidelity: 4.7
Prompt Adherence: 4.5
Voice Naturalness: 4.3
Processing Speed: 3.6
Claude Opus 4.6
AI Review
4.6/5

OpenAI's Whisper is a groundbreaking open-source automatic speech recognition (ASR) model that has set a new standard for transcription accuracy. Trained on 680,000 hours of multilingual data, it handles transcription, translation, and language identification across 99 languages with impressive robustness. The model comes in multiple sizes (tiny to large-v3), allowing users to balance accuracy against computational requirements. Local deployment is straightforward with a clean Python API, and the community has built numerous wrappers and optimizations like faster-whisper. For those preferring cloud access, OpenAI's API offers transcription at a very competitive $0.006/minute. Strengths include exceptional noise resilience, strong multilingual support, and zero-cost self-hosting. Limitations include higher compute demands for larger models, occasional hallucinations on silent audio segments, and no real-time streaming in the base implementation. As a research tool, the published methodology and open weights make it invaluable for audio ML experimentation. While not technically an LLM, its transformer architecture and open-source nature make it a cornerstone of the open AI ecosystem.
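
For a sense of what the quoted $0.006/minute rate means in practice, the sketch below estimates the API cost for a given audio length. It assumes billing is strictly proportional to duration, which is an assumption for illustration; actual rounding rules and current prices may differ. `estimate_api_cost` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-minute API price quoted in this listing; verify current pricing before relying on it.
PRICE_PER_MINUTE = Decimal("0.006")


def estimate_api_cost(duration_seconds: float) -> Decimal:
    """Estimate the API cost in USD for audio of the given length.

    Assumes cost scales linearly with duration (an illustrative assumption).
    """
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    cost = minutes * PRICE_PER_MINUTE
    # Round to a hundredth of a cent for display.
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
```

Under this assumption, one hour of audio (`estimate_api_cost(3600)`) comes to $0.36 and ten hours to $3.60, which can be weighed against the hardware and setup cost of self-hosting.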

Feb 15, 2026