Max-Planck-Institut für Informatik


Automated retraining methods for document classification and their parameter tuning

Siersdorfer, Stefan and Weikum, Gerhard

MPI-I-2005-5-002. September 2005, 23 pages. | Status: available - back from printing

Abstract in LaTeX format:
This paper addresses the problem of semi-supervised classification on
document collections using retraining (also called self-training). A
possible application is focused Web crawling, which may start with very
few manually selected training documents but can be enhanced by
automatically adding initially unlabeled, positively classified Web
pages for retraining. Such an approach is by itself not robust and faces
tuning problems regarding parameters like the number of selected
documents, the number of retraining iterations, and the ratio of
positively and negatively classified samples used for retraining. The
paper develops methods for automatically tuning these parameters, based
on predicting the leave-one-out error for a retrained classifier and on
preventing the classifier from being diluted by selecting too many or
too weak documents for retraining. Our experiments with three different
datasets confirm the practical viability of the approach.
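
The retraining loop described in the abstract can be illustrated with a minimal sketch. This is not the report's method: it assumes a toy nearest-centroid classifier on small numeric feature vectors, and the names `self_train`, `top_k`, `neg_ratio`, and `iterations` are hypothetical stand-ins for the three parameters the abstract mentions (number of selected documents, ratio of positive to negative samples, number of retraining iterations). Here they are fixed by hand; the report's contribution is tuning them automatically, which this sketch does not attempt.

```python
from typing import List, Tuple

Vector = Tuple[float, ...]

def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def score(doc: Vector, pos_c: Vector, neg_c: Vector) -> float:
    """Confidence score: > 0 means the document is closer to the positive centroid."""
    d_pos = sum((a - b) ** 2 for a, b in zip(doc, pos_c))
    d_neg = sum((a - b) ** 2 for a, b in zip(doc, neg_c))
    return d_neg - d_pos

def self_train(pos, neg, unlabeled, iterations=3, top_k=1, neg_ratio=1.0):
    """Generic self-training loop (illustrative assumption, not the report's algorithm).

    Each iteration retrains on the current labeled sets, then moves the
    top_k most confidently positive and round(top_k * neg_ratio) most
    confidently negative unlabeled documents into the training sets.
    """
    pos, neg, pool = list(pos), list(neg), list(unlabeled)
    for _ in range(iterations):
        if not pool:
            break
        pos_c, neg_c = centroid(pos), centroid(neg)
        # Rank the pool from most positive to most negative.
        scored = sorted(
            ((score(d, pos_c, neg_c), i) for i, d in enumerate(pool)),
            reverse=True,
        )
        n_neg = round(top_k * neg_ratio)
        take_pos = [i for s, i in scored[:top_k] if s > 0]
        take_neg = [i for s, i in reversed(scored) if s < 0][:n_neg]
        if not take_pos and not take_neg:
            break  # nothing was classified confidently enough
        pos += [pool[i] for i in take_pos]
        neg += [pool[i] for i in take_neg]
        taken = set(take_pos) | set(take_neg)
        pool = [d for i, d in enumerate(pool) if i not in taken]
    return pos, neg, pool

if __name__ == "__main__":
    seed_pos = [(1.0, 1.0)]      # one manually selected positive document
    seed_neg = [(-1.0, -1.0)]    # one manually selected negative document
    crawl = [(0.9, 0.8), (0.5, 0.6), (-0.7, -0.9), (-0.4, -0.5)]
    final_pos, final_neg, rest = self_train(seed_pos, seed_neg, crawl, iterations=2)
    print(len(final_pos), len(final_neg), len(rest))
```

With hand-picked values the loop can easily absorb weak or misclassified documents, which is the robustness problem the abstract points to.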

To download this research report, please select the type of document that best fits your needs.
Attachment size(s):
MPI-I-2005-5-002.ps, 262 KBytes
Please note: If you don't have a viewer for PostScript on your platform, try to install GhostScript and GhostView
BibTeX:
  AUTHOR = {Siersdorfer, Stefan and Weikum, Gerhard},
  TITLE = {Automated retraining methods for document classification and their parameter tuning},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2005-5-002},
  MONTH = {September},
  YEAR = {2005},
  ISSN = {0946-011X},