Description

Data Characteristics:
--------------------

This data was created by selecting 20 files from each of the 10 largest classes in the Reuters-21578 collection ([Web Link]). The files were read aloud by three Indian speakers, and an Automatic Speech Recognition (ASR) system was used to generate the transcripts. More about the ASR system can be found in [1]. The dataset is useful for studying the effect of speech recognition noise on text mining algorithms. The first work that referred to this dataset was on noisy text classification [2].

Data Format:
----------

There are 10 directories, each labeled with a topic name. Each directory contains 20 transcription files.

References:
----------

[1] L. R. Bahl, S. Balakrishnan-Aiyer, J. Bellegarda, M. Franz, P. Gopalakrishnan, D. Nahamoo, M. Novak, M. Padmanabhan, M. Picheny, and S. Roukos, "Performance of the IBM Large Vocabulary Continuous Speech Recognition System on the ARPA Wall Street Journal Task", In Proc. of ICASSP 95, pages 41-44, Detroit, MI, 1995.

[2] S. Agarwal, S. Godbole, D. Punjani, and S. Roy, "How Much Noise is too Much: A Study in Automatic Text Classification", In Proc. of ICDM 2007.
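The directory layout described under Data Format (one directory per topic, each holding transcription files) could be read into parallel lists of texts and labels along the following lines. This is only a minimal sketch: the function name `load_transcripts`, the `root` argument, and the UTF-8 encoding are assumptions for illustration, not part of the dataset's documentation.

```python
from pathlib import Path


def load_transcripts(root):
    """Read a root directory that holds one subdirectory per topic.

    Returns (texts, labels), where labels[i] is the name of the
    directory that texts[i] was read from. The encoding is an
    assumption; undecodable bytes are replaced rather than raising.
    """
    texts, labels = [], []
    topic_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for topic_dir in topic_dirs:
        files = sorted(p for p in topic_dir.iterdir() if p.is_file())
        for path in files:
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(topic_dir.name)
    return texts, labels
```

For the layout described above, a call such as `load_transcripts("path/to/dataset")` (a hypothetical path) would be expected to yield 200 texts spread over 10 distinct labels.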

Related datasets