Database systems are islands of structure in a sea of unstructured
data sources. Many real-world applications now need to bridge the
two, integrating semi-structured sources with existing structured
databases so that both can be queried and mined seamlessly. This
integration requires extracting structured column values from the
unstructured source and mapping them to known database entities.
Existing methods of data integration do not effectively exploit the
wealth of information available in multi-relational entities.
We present statistical models for co-reference resolution and
information extraction in a database setting. We then discuss the
performance challenges of training and applying these models
efficiently over very large databases. Doing so requires us to break
open a black-box statistical model and extract predicates over
indexable attributes of the database. We show how to extract such
predicates for several classification models, including naive Bayes
classifiers and support vector machines. We extend these indexing
methods to support the similarity predicates needed during data
integration.
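
To make the idea of extracting indexable predicates from a classifier
concrete, here is a minimal sketch for a two-class naive Bayes model
over categorical attributes. It is an illustration only and is not
taken from the talk: the toy probabilities, the attribute names
(city, type), and the function names are all assumptions. For each
attribute it keeps only the values that could still yield a positive
prediction when every other attribute takes its most favourable value.
The resulting per-attribute value lists form a necessary condition (a
superset filter) that an ordinary index can evaluate before the exact
model is applied to the surviving rows.

```python
import math

def log_odds_weights(cond_pos, cond_neg):
    """Per-attribute, per-value log-likelihood ratios log P(v|+) - log P(v|-)."""
    return {
        attr: {v: math.log(cond_pos[attr][v]) - math.log(cond_neg[attr][v])
               for v in cond_pos[attr]}
        for attr in cond_pos
    }

def classify(weights, bias, record):
    """Exact naive Bayes decision: positive iff the total log-odds exceed 0."""
    return bias + sum(weights[a][record[a]] for a in weights) > 0

def envelope_predicate(weights, bias):
    """For each attribute, the values that can still lead to a positive
    prediction if all other attributes take their best possible value.
    The conjunction of these value sets is a necessary condition for a
    positive prediction, so it never drops a true positive."""
    best = {a: max(w.values()) for a, w in weights.items()}
    total_best = sum(best.values())
    allowed = {}
    for a, w in weights.items():
        rest = total_best - best[a]  # best case for the other attributes
        allowed[a] = {v for v, wv in w.items() if bias + wv + rest > 0}
    return allowed

def passes(allowed, record):
    """Index-friendly filter: every attribute value is in its allowed set."""
    return all(record[a] in allowed[a] for a in allowed)

# Hypothetical toy model: P(value | class) per attribute, plus class priors.
COND_POS = {"city": {"a": 0.7, "b": 0.2, "c": 0.1}, "type": {"x": 0.8, "y": 0.2}}
COND_NEG = {"city": {"a": 0.1, "b": 0.3, "c": 0.6}, "type": {"x": 0.3, "y": 0.7}}
BIAS = math.log(0.3) - math.log(0.7)  # log prior odds of the positive class

WEIGHTS = log_odds_weights(COND_POS, COND_NEG)
ALLOWED = envelope_predicate(WEIGHTS, BIAS)

# Render the filter as a SQL-like condition over indexable columns.
condition = " AND ".join(
    f"{attr} IN ({', '.join(repr(v) for v in sorted(vals))})"
    for attr, vals in ALLOWED.items()
)
print(condition)
```

The filter is deliberately conservative: a row that passes it may still
be rejected by the exact model, so classify would be re-applied to
the rows that the index returns. A tighter predicate is possible at the
cost of a more complex condition.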
Homepage: http://www.it.iitb.ac.in/~sunita/
Biography:
Sunita Sarawagi conducts research in databases, data mining, and
machine learning. She is an associate professor at IIT Bombay. Prior
to that she was a research staff member at IBM Almaden Research Center.
She received her PhD in databases from the University of California at
Berkeley and a bachelor's degree from IIT Kharagpur. She was a visiting
associate professor at CMU from January to May 2004. She has published
several papers in international conferences on databases and data
mining and holds several patents. She has served on the program
committees of the ACM SIGMOD, VLDB, ACM SIGKDD, IEEE ICDE, and ICML
conferences and is editor-in-chief of the ACM SIGKDD newsletter.