Odia Data Description


The text data was collected from four districts, each representing a dialect (given in parentheses): Sambalpur (North Western Odia), Mayurbhanj (North Eastern Odia), Puri (Central and Standard Odia) and Koraput (Southern Odia). Data collection focused on three domains: agriculture, healthcare and finance. It was carried out in the field, from farmers and agriculture officers for the agriculture domain; nurses, doctors and associate professionals (front desk staff, naturopathy practitioners) for the healthcare domain; and bank employees for the finance domain. A total of 885 sentences were obtained for speech data collection and split across a train set and a test set, with 94.54 hours and 5.49 hours of audio respectively. The test set has 65 unique sentences, none of which overlap with the 820 unique sentences in the train set. The audio files are sampled at 8 kHz with 16-bit encoding. The vocabulary size is 1644.
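The stated audio format can be checked on a downloaded file. The following is a minimal sketch, assuming the audio is stored as uncompressed WAV (the description does not name the container format). The function names describe_wav and matches_stated_format are hypothetical helpers introduced here for illustration and are not part of the dataset release.

```python
import wave

# Values taken from the description above: 8 kHz sampling, 16-bit encoding.
EXPECTED_SAMPLE_RATE_HZ = 8000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def describe_wav(path):
    """Return (sample_rate_hz, sample_width_bytes, duration_seconds).

    Assumes `path` points to an uncompressed PCM WAV file.
    """
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        sample_width = wav_file.getsampwidth()
        duration = wav_file.getnframes() / sample_rate
    return sample_rate, sample_width, duration


def matches_stated_format(path):
    """True if the file has the sampling rate and bit depth stated above."""
    sample_rate, sample_width, _ = describe_wav(path)
    return (
        sample_rate == EXPECTED_SAMPLE_RATE_HZ
        and sample_width == EXPECTED_SAMPLE_WIDTH_BYTES
    )
```

Summing the duration returned by describe_wav over all files in a split gives a figure that can be compared against the stated hours of audio for that split.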