Microsoft releases largest publicly available speech data for three Indian languages to aid researchers

Published on September 6, 2018

Microsoft India has announced the availability of the Microsoft Indian Language Speech Corpus to help researchers and academia build Indian language speech recognition for all applications where speech is used.

Available for Telugu, Tamil, and Gujarati, this is the largest publicly available Indian language speech dataset and includes audio and corresponding transcripts.

The Speech Corpus content is provided as part of the Microsoft Research Open Data initiative, a collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.

The Microsoft Indian Language Speech Corpus was launched to address the scarcity of adequate digital data for text, speech, and linguistic resources, which are imperative in building large machine learning models for many vernacular languages across the world. The development of accurate digital tools in Indian languages has been slow owing to subtle differences in enunciation, accent, diction, and slang across various regions in India.

Microsoft’s Indian Language Speech Corpus was tested at Interspeech 2018, the world’s largest and most comprehensive conference on the science and technology of spoken language processing, and was used to create high-quality speech recognition models, thus validating the efficacy of the Corpus.

It is imperative that India’s increasing digital literacy be supported by a multilingual digital world, and initiatives like these for researchers and academia will help accelerate innovation in voice-based computing for India.

Radu Tyrsina

Radu Tyrsina has been a Windows fan ever since he got his first PC, a Pentium III (a monster at that time). For most kids his age, the Internet was an amazing way to play and communicate with others, but he was deeply impressed by the flow of information and how easily you can find anything on the web. Prior to founding Windows Report, this particular curiosity about digital content enabled him to grow a number of sites that helped hundreds of millions of people find the answers they were looking for faster.