Voxtral Transcribes at the Speed of Sound: Introducing Voxtral Transcribe 2
Today, we're excited to release Voxtral Transcribe 2, our next generation of speech-to-text models. The release includes two models: Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. Voxtral Realtime is released with open weights under the Apache 2.0 license, so you can self-host and customize it.
We've also launched an interactive audio playground in Mistral Studio (https://console.mistral.ai/build/audio/speech-to-text) where you can instantly test Voxtral Transcribe 2's capabilities, including diarization and timestamps.
Here's a closer look at the key features:
Voxtral Mini Transcribe V2:
- State-of-the-art transcription with speaker diarization, context biasing, and word-level timestamps in 13 languages.
- Pairs the lowest word error rate with the lowest price point among the models compared below: about 4% average WER on FLEURS at $0.003 per minute.
Voxtral Realtime:
- Purpose-built for live transcription, with latency that can go below 200ms, ideal for voice agents and real-time applications.
- Multilingual, supporting 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
- Runs efficiently on edge devices, so sensitive audio can stay on-device.
Voxtral Realtime in Detail:
- Uses a novel streaming architecture to transcribe audio as it arrives, minimizing delay.
- At a 2.4-second delay, its accuracy matches Voxtral Mini Transcribe V2.
- At a 480ms delay, it stays close to offline accuracy, which lets voice agents respond quickly while keeping word error rate (WER) low.
Voxtral Mini Transcribe V2's Performance:
- Achieves an average diarization error rate of 4% across five English benchmarks and the TalkBank multilingual benchmark.
- Reaches an average word error rate of 4% on the FLEURS transcription benchmark, offering the best price-performance ratio in the market.
- Outperforms competitors like GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova in accuracy and processing speed.
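For readers who want to check word error rate on their own recordings, the metric can be sketched as follows. This is a minimal illustration, assuming simple lowercasing and whitespace tokenization; benchmark evaluations typically apply more careful text normalization, and the function name `word_error_rate` is our own, not part of any Voxtral API.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: lowercasing plus whitespace splitting is enough normalization
    for a quick check. Real benchmarks normalize text more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A 4% WER corresponds to roughly one word-level error for every 25 reference words.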
Model Features:
- Speaker Diarization: Generates transcriptions with speaker labels and precise start/end times, ideal for meeting transcription and multi-party call processing.
- Context Biasing: Lets you supply context terms that guide the model toward correct spellings of names, technical terms, and domain-specific vocabulary.
- Word-Level Timestamps: Enables precise subtitle generation, audio search, and content alignment.
- Expanded Language Support: Supports 13 languages, with non-English accuracy significantly ahead of competitors.
- Noise Robustness: Maintains accuracy in challenging acoustic environments.
- Longer Audio Support: Processes recordings up to 3 hours in a single request.
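To illustrate how word-level timestamps can drive subtitle generation, here is a small sketch that groups timed words into SRT cues. The input shape (a list of dicts with `word`, `start`, and `end` in seconds) is an assumption made for this example and is not the documented API response schema; `max_words` and `max_gap` are hypothetical tuning parameters.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7, max_gap: float = 0.8) -> str:
    """Group timed words into subtitle cues and render them as SRT text.

    Assumed input (hypothetical shape): [{"word": str, "start": float, "end": float}, ...]
    A new cue starts when the current one is full or after a long pause.
    """
    cues: List[List[Dict]] = []
    current: List[Dict] = []
    for w in words:
        pause = w["start"] - current[-1]["end"] if current else 0.0
        if current and (len(current) >= max_words or pause > max_gap):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_srt_time(cue[0]["start"])
        end = format_srt_time(cue[-1]["end"])
        text = " ".join(w["word"] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    sample = [
        {"word": "Hello", "start": 0.0, "end": 0.4},
        {"word": "world", "start": 0.5, "end": 0.9},
        {"word": "Again", "start": 2.5, "end": 3.0},
    ]
    print(words_to_srt(sample))
```

Splitting on pauses as well as word count keeps cues aligned with natural phrase boundaries, which is only possible when timestamps are available per word.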
Audio Playground:
- Test Voxtral Transcribe 2 directly in Mistral Studio (https://console.mistral.ai/build/audio/speech-to-text).
- Upload up to 10 audio files, toggle diarization, choose timestamp granularity, and add context bias terms.
Transforming Voice Applications:
- Meeting Intelligence: Transcribe multilingual recordings with clear speaker attribution, enabling efficient meeting content annotation.
- Voice Agents and Virtual Assistants: Build conversational AI with sub-200ms latency, creating natural and responsive voice interfaces.
- Contact Center Automation: Transcribe calls in real time, enabling sentiment analysis, response suggestions, and CRM field population.
- Media and Broadcast: Generate live multilingual subtitles with minimal latency, handling technical terminology.
- Compliance and Documentation: Monitor and transcribe interactions for regulatory compliance, ensuring clear speaker attribution and precise audit trails.
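As a sketch of how speaker attribution feeds meeting notes or audit trails, the snippet below merges consecutive segments from the same speaker into turns and renders a readable transcript. The segment shape (`speaker`, `start`, `end`, `text`) and the speaker labels are assumptions for illustration, not the documented output format of the API.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments spoken by the same speaker into one turn.

    Assumed input (hypothetical shape):
    [{"speaker": str, "start": float, "end": float, "text": str}, ...]
    """
    turns: List[Dict] = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker continues: extend the current turn.
            turns[-1]["end"] = max(turns[-1]["end"], seg["end"])
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append(dict(seg))  # copy so the input is not mutated
    return turns


def render_transcript(turns: List[Dict]) -> str:
    """Render turns as '[MM:SS] speaker: text' lines."""
    lines = []
    for turn in turns:
        minutes, seconds = divmod(int(turn["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {turn['speaker']}: {turn['text']}")
    return "\n".join(lines)


if __name__ == "__main__":
    segments = [
        {"speaker": "speaker_1", "start": 0.0, "end": 2.0, "text": "Good morning."},
        {"speaker": "speaker_1", "start": 2.1, "end": 4.0, "text": "Let's begin."},
        {"speaker": "speaker_2", "start": 65.0, "end": 67.0, "text": "Sounds good."},
    ]
    print(render_transcript(merge_speaker_turns(segments)))
```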
Deployment and Pricing:
- Voxtral Mini Transcribe V2 is available via API at $0.003 per minute.
- Voxtral Realtime is available via API at $0.006 per minute and as open weights on Hugging Face.
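The per-minute prices above make cost estimates easy to compute. The sketch below assumes billing scales linearly with audio duration (actual rounding or minimum charges are not stated here), and the dictionary keys are illustrative labels rather than official API model identifiers.

```python
from decimal import Decimal

# Prices from this announcement, in USD per minute of audio.
# The keys are illustrative labels, not official API model IDs.
PRICE_PER_MINUTE_USD = {
    "voxtral-mini-transcribe-v2": Decimal("0.003"),
    "voxtral-realtime": Decimal("0.006"),
}


def estimate_cost_usd(model_label: str, audio_seconds: float) -> Decimal:
    """Estimate API cost, assuming charges are prorated linearly by duration."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    minutes = Decimal(str(audio_seconds)) / Decimal(60)
    cost = minutes * PRICE_PER_MINUTE_USD[model_label]
    return cost.quantize(Decimal("0.0001"))


if __name__ == "__main__":
    three_hours = 3 * 60 * 60  # the maximum single-request length, in seconds
    for label in PRICE_PER_MINUTE_USD:
        print(label, estimate_cost_usd(label, three_hours))
```

Under these assumptions, a full 3-hour recording (180 minutes) costs about $0.54 with Voxtral Mini Transcribe V2 and about $1.08 with Voxtral Realtime.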
Get started with Voxtral today and unlock the power of advanced speech-to-text technology!