Tamil, Telugu, Gujarati Data Description


The Tamil, Telugu, and Gujarati datasets are taken from Microsoft Research Open Data (msropendata.com). They are the same datasets used in the Interspeech 2018 Special Session: Low Resource Speech Recognition Challenge for Indian Languages. Each of the three datasets contains 40 hours of speech in the train set and 5 hours in the test set. The audio files are sampled at 16 kHz with 16-bit encoding.
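
The stated audio format can be checked after download with a short script. The following is a minimal sketch, assuming the audio is stored as uncompressed WAV files (the container format is not stated in this description); the function name `check_wav_format` and the constants are hypothetical and chosen only for illustration.

```python
import wave

# Values taken from the description above.
EXPECTED_SAMPLE_RATE = 16_000  # Hz
EXPECTED_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit encoding


def check_wav_format(path):
    """Return (ok, sample_rate, sample_width_bytes, duration_seconds).

    Assumes `path` points to an uncompressed WAV file; this is an
    assumption, not something the dataset description confirms.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()
    # Duration in seconds; guard against a zero sample rate.
    duration = frames / rate if rate else 0.0
    ok = rate == EXPECTED_SAMPLE_RATE and width == EXPECTED_SAMPLE_WIDTH
    return ok, rate, width, duration
```

Summing the returned durations over every file in a split and dividing by 3600 gives the total in hours, which should come out close to 40 for train and 5 for test if the assumption about the file format holds.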