Assembly AI claims its new Universal-1 model has 30% fewer hallucinations than Whisper

AI-as-a-service provider Assembly AI has a new speech recognition model called Universal-1. The company says the model, trained on more than 12.5 million hours of multilingual audio data, delivers strong speech-to-text accuracy across English, Spanish, French and German. It also claims that, compared to OpenAI’s Whisper Large-v3 model, Universal-1 reduces hallucinations by 30% on speech data and by 90% on ambient noise.

In a blog post, the company describes Universal-1 as “another milestone in our mission to provide accurate, faithful and robust speech-to-text capabilities for multiple languages, helping our customers and developers worldwide build various Speech AI applications.” Along with a better understanding of four major languages, the model can code-switch, transcribing multiple languages within a single audio file.

A chart from Assembly AI showing how its Universal-1 speech recognition model compares against industry peers in generated correct words. Image credit: Assembly AI

Universal-1 also supports improved timestamp estimation, which is important for audio and video editing and for conversation analytics. Assembly AI claims the new model is 13% better at this than its predecessor, Conformer-2. As a result, speaker diarization is also better: the company reports a 14% improvement in concatenated minimum-permutation word error rate (cpWER) and a 71% improvement in speaker count estimation accuracy.
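For readers unfamiliar with cpWER, the metric can be sketched in a few lines of Python. The version below follows the commonly cited definition (concatenate each speaker's utterances, score every mapping between reference and hypothesized speakers, keep the lowest error). It is not Assembly AI's evaluation code, and the function names and sample transcripts are made up for illustration.

```python
from itertools import permutations


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Illustrative cpWER: lowest total word errors over all speaker mappings,
    divided by the number of reference words."""
    # Concatenate each speaker's utterances into one word sequence.
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]

    # Pad with empty "speakers" so both sides have the same count.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference transcript is empty")

    # Brute force over mappings: fine for a handful of speakers only.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


# Hypothetical two-speaker conversation.
reference = {"A": ["hello there", "how are you"], "B": ["fine thanks"]}
hypothesis = {"spk0": ["fine thanks"], "spk1": ["hello there", "how are we"]}
print(cp_wer(reference, hypothesis))
```

Because the score depends on which words are attributed to which speaker, a transcript with every word correct can still score poorly if the diarization assigns utterances to the wrong person.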

Finally, parallel inference has been made more efficient, reducing the turnaround time for long audio files. Universal-1 is said to process such files five times faster than Whisper Large-v3. Assembly AI compared the two models' processing speed on Nvidia Tesla T4 machines with 16GB of VRAM. With a batch size of 64, Universal-1 took 21 seconds to transcribe one hour of audio. Whisper Large-v3, run with a much smaller batch size of 24, took 107 seconds to accomplish the same task.
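The "five times faster" figure can be checked against the reported timings with some quick arithmetic. The snippet below is only a back-of-the-envelope sketch using the numbers quoted above; the helper names are illustrative, and the real-time factor (processing time divided by audio duration) is a standard way to express speed that the company's figures imply but that is computed here, not quoted.

```python
AUDIO_SECONDS = 60 * 60          # one hour of audio, as in the reported benchmark
UNIVERSAL_1_SECONDS = 21         # reported, batch size 64
WHISPER_LARGE_V3_SECONDS = 107   # reported, batch size 24


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; lower is faster."""
    return processing_seconds / audio_seconds


ratio = speedup(WHISPER_LARGE_V3_SECONDS, UNIVERSAL_1_SECONDS)
print(f"Speedup: {ratio:.1f}x")  # about 5.1x
print(f"Universal-1 RTF: {real_time_factor(UNIVERSAL_1_SECONDS, AUDIO_SECONDS):.4f}")
print(f"Whisper Large-v3 RTF: {real_time_factor(WHISPER_LARGE_V3_SECONDS, AUDIO_SECONDS):.4f}")
```

Note that the two runs used different batch sizes, so the ratio reflects the configurations Assembly AI chose to report.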

Improved speech-to-text models offer several benefits. AI notetakers can generate more accurate, hallucination-free notes, identify action items and sort out metadata such as proper nouns, who’s speaking and timing information. Better transcription also helps creator tools that incorporate AI-powered video editing workflows, as well as telehealth platforms that automate clinical note entry and claims submission, where accuracy is important.

The Universal-1 model is available through Assembly AI’s API.