Description

For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words). After tokenization and removal of stopwords, the vocabulary of unique words was truncated by keeping only words that occurred more than ten times.

These data sets have no class labels. For copyright reasons, individual document names (i.e. an identifier for each docID), filenames, and other document-level metadata are not provided. These data sets are well suited to clustering and topic modeling experiments.

For each text collection we provide docword.*.txt (the bag-of-words file in sparse format) and vocab.*.txt (the vocabulary file).

Enron Emails:
orig source: www.cs.cmu.edu/~enron
D=39861 W=28102 N=6,400,000 (approx)

NIPS full papers:
orig source: books.nips.cc
D=1500 W=12419 N=1,900,000 (approx)

KOS blog entries:
orig source: dailykos.com
D=3430 W=6906 N=467714

NYTimes news articles:
orig source: ldc.upenn.edu
D=300000 W=102660 N=100,000,000 (approx)

PubMed abstracts:
orig source: www.pubmed.gov
D=8200000 W=141043 N=730,000,000 (approx)
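As a rough illustration of how D, W, NNZ, and N relate, the sketch below parses a sparse bag-of-words stream and recomputes N from the counts. The file layout is an assumption, since it is not spelled out in the description above: three header lines holding D, W, and NNZ, followed by NNZ lines of the form "docID wordID count". The function names read_docword and total_words are hypothetical, and the four-entry sample is made up for the example, not taken from any of the collections.

```python
import io
from collections import defaultdict


def read_docword(lines):
    """Parse a sparse bag-of-words stream.

    ASSUMED layout (not confirmed by the description above):
    three header lines D, W, NNZ, then one "docID wordID count"
    triple per line.
    """
    it = iter(lines)
    D = int(next(it))    # number of documents
    W = int(next(it))    # vocabulary size
    NNZ = int(next(it))  # number of nonzero counts

    counts = defaultdict(dict)  # docID -> {wordID: count}
    n_read = 0
    for line in it:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        doc_id, word_id, count = map(int, parts)
        counts[doc_id][word_id] = count
        n_read += 1

    if n_read != NNZ:
        raise ValueError(f"header says NNZ={NNZ}, but read {n_read} triples")
    return D, W, NNZ, dict(counts)


def total_words(counts):
    """N: the sum of all counts, i.e. total words in the collection."""
    return sum(c for doc in counts.values() for c in doc.values())


# Made-up toy data: 2 documents, 3 vocabulary words, 4 nonzero counts.
sample = io.StringIO("2\n3\n4\n1 1 2\n1 3 1\n2 2 5\n2 3 1\n")
D, W, NNZ, counts = read_docword(sample)
N = total_words(counts)
```

For a real file, the same function could be given an open file handle, for example `with open("docword.kos.txt") as f: read_docword(f)`, assuming the file follows the layout described here. For the larger collections a dictionary of dictionaries would likely be too memory-hungry, and a dedicated sparse matrix type would be a more practical choice.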

Related datasets