MPI-INF/SWS Research Reports 1991-2021


MPI-I-2004-5-001

Goal-oriented methods and meta methods for document classification and their parameter tuning

Siersdorfer, Stefan and Sizov, Sergej and Weikum, Gerhard

2004, 32 pages.

Status: available - back from printing

Automatic text classification methods come with various calibration parameters such as thresholds for probabilities in Bayesian classifiers or for hyperplane distances in SVM classifiers. In a given application context these parameters should be set so as to meet the relative importance of various result quality metrics such as precision versus recall. In this work we consider classifiers that can accept a document for a topic, reject it, or abstain. We aim to meet the application's goals in terms of accuracy (i.e., avoid false acceptances or rejections) and loss (i.e., limit the fraction of documents for which no decision is made). To this end we investigate restrictive forms of SVM classifiers and we develop meta methods that split the training data into subsets for independently trained classifiers and then combine the results of these classifiers. These techniques tend to improve accuracy at the expense of document loss. We develop estimators that help to predict the accuracy and loss for a given setting of the methods' tuning parameters, and a methodology for efficiently deriving a setting that meets the application's goals. Our experiments confirm the practical viability of the approach.
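The accept/reject/abstain scheme and the accuracy-versus-loss trade-off described in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the symmetric threshold rule, the unanimous-vote combination, the definition of accuracy as the fraction of correct decisions among non-abstained documents, and all names and scores below are assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence, Tuple

ACCEPT, REJECT, ABSTAIN = 1, -1, 0


def restrictive_decision(score: float, threshold: float) -> int:
    """Decide only if the signed hyperplane distance is at least `threshold`
    away from zero; otherwise abstain (assumed symmetric rule)."""
    if score >= threshold:
        return ACCEPT
    if score <= -threshold:
        return REJECT
    return ABSTAIN


def unanimous_vote(decisions: Sequence[int]) -> int:
    """Assumed meta method: keep a decision only if every independently
    trained classifier made that same decision."""
    if all(d == ACCEPT for d in decisions):
        return ACCEPT
    if all(d == REJECT for d in decisions):
        return REJECT
    return ABSTAIN


def accuracy_and_loss(
    predictions: Sequence[int], labels: Sequence[int]
) -> Tuple[Optional[float], float]:
    """Accuracy over decided documents, loss as the fraction left undecided."""
    decided = [(p, y) for p, y in zip(predictions, labels) if p != ABSTAIN]
    loss = 1.0 - len(decided) / len(labels)
    accuracy = sum(p == y for p, y in decided) / len(decided) if decided else None
    return accuracy, loss


def evaluate(
    scores: Sequence[Sequence[float]], labels: Sequence[int], threshold: float
) -> Tuple[List[int], Optional[float], float]:
    """Apply the thresholded decision per classifier, then combine per document."""
    predictions = [
        unanimous_vote([restrictive_decision(s, threshold) for s in doc_scores])
        for doc_scores in scores
    ]
    accuracy, loss = accuracy_and_loss(predictions, labels)
    return predictions, accuracy, loss


# Hypothetical signed distances from two classifiers trained on disjoint
# subsets of the training data; one tuple per document.
scores = [(1.4, 0.9), (0.3, 1.1), (-1.2, -0.8), (-0.2, 0.4), (0.7, -0.6), (-0.9, -1.5)]
labels = [1, 1, -1, 1, 1, -1]

# Baseline: the first classifier alone, deciding on every document.
single = evaluate([(s[0],) for s in scores], labels, threshold=0.0)
# Restrictive meta classifier: both must agree with a margin of 0.5.
meta = evaluate(scores, labels, threshold=0.5)

print("single:", single)
print("meta:  ", meta)
```

On this toy data the single classifier decides every document but makes one error, while the restrictive meta classifier makes no errors on the documents it decides and abstains on half of them. Sweeping `threshold` over a grid and recording the resulting (accuracy, loss) pairs is one simple way to look for a setting that meets a given application goal.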

  • Attachment: MPI-I-2004-5-001.ps (29924 KBytes)

URL to this document: https://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2004-5-001

BibTeX
@TECHREPORT{SiersdorferSizovWeikum2004,
  AUTHOR = {Siersdorfer, Stefan and Sizov, Sergej and Weikum, Gerhard},
  TITLE = {Goal-oriented methods and meta methods for document classification and their parameter tuning},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2004-5-001},
  YEAR = {2004},
  ISSN = {0946-011X},
}