Meta’s FAIR (Fundamental AI Research) team has unveiled a groundbreaking achievement in the field of Automatic Speech Recognition (ASR) — the “Omnilingual ASR” system, a suite of models claimed to provide speech recognition capabilities for over 1,600 languages, setting a new benchmark for both scale and quality within the industry.
Meta emphasizes that this initiative seeks to address the long-standing issue of ASR technology and resources being heavily concentrated in a small number of high-resource languages. By introducing a universal transcription framework, the company aims to extend high-quality speech-to-text technology to underrepresented linguistic communities, thereby narrowing the global digital divide.
Alongside this announcement, Meta has also open-sourced several key assets under the Apache 2.0 license, including:
• Omnilingual ASR Model Family: Available in multiple configurations, ranging from a lightweight 300-million-parameter model optimized for low-power devices to a 7-billion-parameter version delivering state-of-the-art accuracy.
• Omnilingual wav2vec 2.0 Base Model: A large-scale multilingual speech representation model scaled up to 7 billion parameters, designed to serve as a foundation not only for ASR but also for other speech-related tasks.
• Omnilingual ASR Corpus: A massive dataset released under CC-BY licensing, featuring transcribed speech from 350 under-served languages.
To overcome the traditional technical limitations of scaling ASR systems, Omnilingual ASR introduces two key architectural innovations. First, Meta has expanded its wav2vec 2.0 speech encoder to 7 billion parameters, enabling the generation of rich, multilingual semantic representations from vast amounts of unlabeled audio.
Second, the team developed two decoder variants — one using Connectionist Temporal Classification (CTC), and another leveraging a Transformer-based decoder referred to as “LLM-ASR.”
According to Meta’s published research, the 7-billion-parameter LLM-ASR system achieved state-of-the-art (SOTA) performance across more than 1,600 languages, with 78% of them exhibiting a Character Error Rate (CER) below 10%.
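For readers unfamiliar with the metric, Character Error Rate is the character-level Levenshtein (edit) distance between the hypothesis and the reference transcript, divided by the reference length; a CER below 10% means fewer than one character error per ten reference characters. A minimal reference implementation (plain Python, not Meta’s evaluation code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    # Row-by-row dynamic-programming edit distance over characters.
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, 1):
        curr = [i]
        for j, hc in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (rc != hc)))  # substitution / match
        prev = curr
    return prev[-1] / len(reference)


print(cer("speech", "speach"))  # one substitution in six characters → ~0.167
```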
One of Omnilingual ASR’s most remarkable breakthroughs lies in its transformation of how new languages can be added, introducing the “Bring Your Own Language” (BYOL) paradigm. Inspired by large language model principles, this system incorporates powerful in-context learning capabilities, allowing it to adapt rapidly to new languages with minimal data.
In practice, this means that users working with currently unsupported languages need only provide a handful of paired audio-text samples, enabling the AI to produce usable transcription quality without extensive fine-tuning, expert intervention, or large-scale computational resources. This marks a pivotal step toward community-driven language expansion.
To include languages with little to no digital footprint, the team combined public datasets with local partnerships — collaborating with organizations such as the Mozilla Foundation, Lanfrica, and NaijaVoices — directly engaging native speakers to record and contribute speech samples, with fair compensation for their efforts.
The resulting dataset, released as part of the Omnilingual ASR Corpus, is now one of the largest collections of natural speech ever assembled for speech recognition in ultra-low-resource languages.
Currently, all related models, datasets, transcription demos, and language exploration tools are publicly available through GitHub, Hugging Face, and the Meta AI website.