Description

Uncompressing rcv1rcv2aminigoutte.tar.bz2 creates a directory containing 5 subdirectories, EN, FR, GR, IT and SP, one for each of the 5 languages. Each subdirectory contains 5 files, each holding the indexed documents written in or translated into that language. For example, EN contains the files:

- Index_EN-EN : Original English documents
- Index_FR-EN : French documents translated to English
- Index_GR-EN : German documents translated to English
- Index_IT-EN : Italian documents translated to English
- Index_SP-EN : Spanish documents translated to English

And similarly for the 4 other languages.

Each file contains one indexed document per line, in a format similar to SVM_light. Each line is of the form:

<cat> <feature>:<value> <feature>:<value> ...

where <cat> is the category label, i.e. one of C15, CCAT, E21, ECAT, GCAT or M11, and each <feature>:<value> is a feature, value pair, listed in ascending order of feature index.

The order of documents is maintained across corresponding files: for example, FR/Index_EN-FR and EN/Index_EN-EN have the same number of documents (and therefore the same number of lines), in the same order.
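The line format can be read with a few lines of Python. The following is a minimal sketch, not a reference loader: it assumes each line is a category label followed by whitespace-separated feature:value tokens, that feature indices are integers and values are numeric, and that the files are readable as UTF-8. The helper names parse_line and read_index_file are hypothetical and are not part of the dataset distribution.

```python
from typing import Dict, Iterator, Tuple

# Category labels listed in the description.
CATEGORIES = {"C15", "CCAT", "E21", "ECAT", "GCAT", "M11"}


def parse_line(line: str) -> Tuple[str, Dict[int, float]]:
    """Parse one 'label feature:value feature:value ...' line."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label = tokens[0]
    if label not in CATEGORIES:
        raise ValueError(f"unexpected category label: {label!r}")
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, _, value = token.partition(":")
        # Assumption: integer feature indices and numeric values.
        features[int(index)] = float(value)
    return label, features


def read_index_file(path: str) -> Iterator[Tuple[str, Dict[int, float]]]:
    """Yield (label, features) for each non-blank line of an index file."""
    # Assumption: the index files can be decoded as UTF-8.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

Because corresponding files keep the same document order, iterating two of them in parallel, for instance zip(read_index_file("EN/Index_EN-EN"), read_index_file("FR/Index_EN-FR")) with paths taken relative to the extracted directory, should pair the English and French views of the same document.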

Related datasets