Automatic classification of data items based on training
samples can be improved by considering the neighborhood of data items
in a graph structure (e.g., neighboring documents in a hyperlink environment
or co-authors and their publications for bibliographic data entries).
This talk presents a new method for graph-based classification.
It places particular emphasis on hyperlinked text documents but
applies to other graph-structured data as well.
Our approach is based on iterative relaxation labeling and
can be combined with either Bayesian or SVM classifiers that operate on
the feature spaces of the given data items. The graph neighborhood is taken into
consideration to exploit locality patterns while at the same time avoiding
overfitting. In contrast to prior work along these lines, our approach
employs a number of novel techniques: judicious pruning of edges from
the neighborhood graph based on node dissimilarities and node degrees,
weighting edges by content similarity measures, and weighting the
influence of edges based on a distance metric between the classification
labels of interest (e.g., different scientific fields for bibliographic data).
Our techniques considerably improve the robustness and accuracy of the
classification outcome, as shown in systematic experimental comparisons
with previously published methods on three different real-world datasets.
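
The ideas named above (edge pruning, similarity-weighted edges, label-distance
weighting, and iterative relaxation labeling) can be illustrated with a minimal
sketch. It is not the implementation presented in the talk. The function names,
the thresholds min_sim and max_degree, the cosine similarity measure, the
label affinity table (taken here as one minus a normalized label distance),
and the multiplicative update rule are all assumptions chosen for illustration.
The per-node content probabilities stand in for the output of any base
classifier, such as a Bayesian or SVM classifier.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_graph(features, edges, min_sim=0.1, max_degree=50):
    """Prune edges by node degree and dissimilarity; weight the rest by similarity.

    min_sim and max_degree are hypothetical thresholds, not values from the talk.
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    neighbors = {n: {} for n in features}
    for a, b in edges:
        if degree[a] > max_degree or degree[b] > max_degree:
            continue  # drop edges that touch very high-degree nodes
        sim = cosine(features[a], features[b])
        if sim < min_sim:
            continue  # drop edges between dissimilar nodes
        neighbors[a][b] = sim
        neighbors[b][a] = sim
    return neighbors


def relaxation_labeling(content_probs, neighbors, affinity, fixed=None, iterations=10):
    """Iteratively combine content-based estimates with neighbor label estimates.

    content_probs: node -> {label: probability} from a base classifier.
    affinity[l][m]: assumed compatibility in [0, 1] of a neighbor label m with l.
    fixed: node -> known label (training nodes), kept unchanged.
    """
    fixed = fixed or {}
    labels = list(affinity)
    probs = {n: dict(p) for n, p in content_probs.items()}
    for n, known in fixed.items():
        probs[n] = {l: 1.0 if l == known else 0.0 for l in labels}
    for _ in range(iterations):
        new_probs = {}
        for n in probs:
            nbrs = neighbors.get(n, {})
            if n in fixed or not nbrs:
                new_probs[n] = probs[n]
                continue
            total_weight = sum(nbrs.values())
            scores = {}
            for l in labels:
                # Similarity-weighted average of how strongly neighbors support label l.
                support = sum(
                    w * sum(affinity[l][m] * probs[j][m] for m in labels)
                    for j, w in nbrs.items()
                ) / total_weight
                scores[l] = content_probs[n][l] * support
            z = sum(scores.values())
            new_probs[n] = {l: s / z for l, s in scores.items()} if z > 0 else probs[n]
        probs = new_probs
    return probs


# Toy data: "a", "b", and "d" are labeled training nodes; "c" is unlabeled.
features = {
    "a": {"query": 1.0, "index": 1.0},
    "b": {"query": 1.0, "join": 1.0},
    "c": {"query": 1.0, "model": 1.0},
    "d": {"kernel": 1.0},
}
edges = [("c", "a"), ("c", "b"), ("c", "d")]
neighbors = build_graph(features, edges)
affinity = {"db": {"db": 1.0, "ml": 0.2}, "ml": {"db": 0.2, "ml": 1.0}}
result = relaxation_labeling(
    {"c": {"db": 0.45, "ml": 0.55}},
    neighbors,
    affinity,
    fixed={"a": "db", "b": "db", "d": "ml"},
)
print(max(result["c"], key=result["c"].get))
```

In this toy run the edge between "c" and "d" is pruned because the two nodes
share no terms, so only the two similar neighbors influence "c". Its label
changes from the content-only guess "ml" to "db".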