Universal Speech Model

published on 2023/03/30

Universal Speech Model (USM) is a family of state-of-the-art speech models with 2B parameters trained on 12 million hours of speech and 28 billion sentences of text, spanning 300+ languages. USM, which is for use in YouTube (e.g., for closed captions), can perform automatic speech recognition (ASR) on widely-spoken languages like English and Mandarin, but also languages like Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, Nzema to name a few. Some of these languages are spoken by fewer than twenty million people, making it very hard to find the necessary training data.

We demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of our model and fine-tuning on a smaller set of labeled data enables us to recognize these under-represented languages. Moreover, our model training process is effective for adapting to new languages and data.

Google Research