Description

The "spam" concept is diverse: advertisements for products and web sites, make-money-fast schemes, chain letters, pornography, and so on. Our collection of spam e-mails came from our postmaster and from individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, so the word 'george' and the area code '650' are indicators of non-spam. These indicators are useful when constructing a personalized spam filter. To build a general-purpose spam filter, one would either have to blind such non-spam indicators or collect a much wider sample of non-spam.

For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998.

(a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
(b) Determine whether a given email is spam or not.
(c) ~7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passed through the filter.
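The zero-false-positive trade-off mentioned in (c) can be illustrated with a minimal sketch. It assumes a hypothetical classifier that assigns each message a spam score (higher means more spam-like); the function name, the scores, and the labels below are made up for illustration and are not taken from the dataset or from the original report.

```python
def zero_fp_threshold(scores, labels):
    """Pick the lowest threshold that flags no non-spam message.

    scores: hypothetical spam scores, higher means more spam-like.
    labels: 1 for spam, 0 for non-spam.
    A message is flagged only when its score is strictly above the threshold.
    Returns (threshold, fraction of spam that passes through unflagged).
    """
    ham_scores = [s for s, y in zip(scores, labels) if y == 0]
    spam_scores = [s for s, y in zip(scores, labels) if y == 1]
    # Setting the threshold at the highest non-spam score guarantees
    # that no non-spam message on this data is flagged.
    threshold = max(ham_scores)
    missed = sum(1 for s in spam_scores if s <= threshold)
    return threshold, missed / len(spam_scores)


# Toy data (illustrative only): four non-spam and four spam messages.
scores = [0.05, 0.20, 0.60, 0.10, 0.95, 0.80, 0.55, 0.99]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, missed_fraction = zero_fp_threshold(scores, labels)
print(threshold, missed_fraction)
```

On this toy data one of the four spam messages scores below the highest-scoring non-spam message, so it passes through the filter. The guarantee holds only for the data used to set the threshold; unseen mail can still produce false positives.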

Related Papers

  • Christos Dimitrakakis and Samy Bengio. Online Policy Adaptation for Ensemble Classifiers. IDIAP.
  • C. Titus Brown and Harry W. Bullen and Sean P. Kelly and Robert K. Xiao and Steven G. Satterfield and John G. Hagedorn and Judith E. Devaney. Visualization and Data Mining in an 3D Immersive Environment: Summer Project 2003.
  • Yongmei Wang and Ian H. Witten. Modeling for Optimal Probability Prediction. ICML. 2002.
  • Don R. Hush and Clint Scovel and Ingo Steinwart. Stability of Unstable Learning Algorithms. Modeling, Algorithms and Informatics Group, CCS-3, Los Alamos National Laboratory. 2003.

Related datasets