Description

The Extreme Classification Repository: Multi-label Datasets & Code

Kush Bhatia, Himanshu Jain, Prateek Jain, Manik Varma

The objective in extreme multi-label learning is to learn a classifier that can automatically tag a datapoint with the most relevant subset of labels from an extremely large label set. This page provides benchmark datasets and code that can be used to evaluate the performance of extreme multi-label algorithms.

These multi-label datasets have been processed from their original sources to create train/test splits in which each label occurs at least once in both the training set and the test set. This yields more realistic train/test splits than uniform sampling, which can drop many of the infrequently occurring, and hard to classify, labels from the test set. For example, on the WikiLSHTC-325K dataset, uniform sampling might lose ninety thousand of the hardest to classify labels from the test set.

Results computed on the train/test splits provided on this page are therefore not comparable to results computed on the original sources or through uniform sampling. Please cite both the original source and the appropriate FastXML or SLEEC paper to avoid any ambiguity about which train/test split was used. Please also note that the Ads-1M and Ads-9M datasets are proprietary and not available for download.
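The split constraint described above (every label present in both the training and the test set) can be illustrated with a small sketch. This is not the procedure used to build the splits on this page, which the text does not describe; it is a simple greedy heuristic written for illustration. The function name `label_covering_split` and the parameters `test_fraction` and `seed` are hypothetical. The heuristic can fail to cover a label when all of that label's points have already been claimed by one side, and labels that occur only once cannot be covered at all.

```python
import random
from collections import defaultdict


def label_covering_split(label_sets, test_fraction=0.3, seed=0):
    """Greedy sketch (an assumption, not the repository's method).

    label_sets: list where label_sets[i] is the set of labels of point i.
    Returns (train_indices, test_indices). Tries to place every label that
    occurs at least twice on both sides before splitting the rest at random.
    """
    rng = random.Random(seed)
    n = len(label_sets)

    # Map each label to the indices of the points that carry it.
    points_by_label = defaultdict(list)
    for i, labels in enumerate(label_sets):
        for lab in labels:
            points_by_label[lab].append(i)

    assignment = {}  # point index -> "train" or "test"

    # Visit rare labels first, since they have the fewest candidate points.
    for lab in sorted(points_by_label, key=lambda l: len(points_by_label[l])):
        points = points_by_label[lab]
        if len(points) < 2:
            continue  # a label seen once cannot appear on both sides
        for side in ("train", "test"):
            if any(assignment.get(i) == side for i in points):
                continue  # this side already contains the label
            free = [i for i in points if i not in assignment]
            if free:  # otherwise the greedy pass gives up on this label
                assignment[rng.choice(free)] = side

    # Assign every remaining point by uniform sampling.
    for i in range(n):
        if i not in assignment:
            assignment[i] = "test" if rng.random() < test_fraction else "train"

    train = [i for i in range(n) if assignment[i] == "train"]
    test = [i for i in range(n) if assignment[i] == "test"]
    return train, test


# Toy example: six points, three labels, each label carried by three points.
label_sets = [{0, 1}, {0}, {1, 2}, {2}, {0, 2}, {1}]
train_idx, test_idx = label_covering_split(label_sets)
```

On a real dataset, it would be worth checking after the split which labels, if any, are missing from either side, since the greedy pass gives no guarantee.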

Related datasets