Consequently, the contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of web pages that would otherwise distract pure n-gram-based approaches such as shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Our experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as shingling or I-Match, and execution times up to a factor of 3 faster than Locality Sensitive Hashing (LSH), on a demonstrative “Gold Set” of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
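To make the first contribution concrete, the following is a minimal sketch of how signatures anchored at stopword antecedents might be extracted. It is an illustration under stated assumptions, not the implementation evaluated in the experiments: the function name `spot_signatures`, the antecedent and stopword lists, the tokenizer, and the chain length of 2 are all hypothetical choices made for this example.

```python
import re

# Hypothetical parameter choices, for illustration only.
ANTECEDENTS = {"a", "an", "the", "is"}
STOPWORDS = ANTECEDENTS | {"to", "of", "for", "at", "off", "and", "in", "on"}


def spot_signatures(text, antecedents=ANTECEDENTS, stopwords=STOPWORDS,
                    chain_length=2):
    """Return one signature per antecedent occurrence that is followed by
    at least `chain_length` content (non-stopword) terms."""
    # Assumed tokenizer: lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    signatures = []
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        # Chain of the next content terms, skipping intervening stopwords.
        chain = [t for t in tokens[i + 1:] if t not in stopwords][:chain_length]
        if len(chain) == chain_length:
            signatures.append(":".join([token] + chain))
    return signatures


sentence = ("at a rally to kick off a weeklong campaign "
            "for the South Carolina primary")
print(spot_signatures(sentence))
# ['a:rally:kick', 'a:weeklong:campaign', 'the:south:carolina']

# Navigation-style text without stopwords yields no signatures at all.
print(spot_signatures("Home Sports Weather Contact"))
# []
```

The second call hints at why such signatures tend to ignore page noise: navigation bars, link lists, and similar fragments rarely contain stopwords, so they contribute no signatures, whereas plain n-gram shingles would be drawn from them as well.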
See also our Stanford InfoBlog entry:
http://infoblog.stanford.edu/2008/08/spotsigs-are-stopwords-finally-good-for.html