Data quality is a serious concern in complex industrial-scale
databases, which often have thousands of tables and tens of thousands
of columns. Commonly encountered problems include duplicates and
default values in columns treated as keys, data inconsistencies, and
poor-quality join paths. Compounding the data quality problems are
incomplete and out-of-date metadata about the database and the
processes used to populate the database. These problems make the task
of analyzing data particularly challenging. The Bellman data quality
browser was built to address such problems. Bellman
profiles the database and computes concise statistical summaries of
the contents of the database to identify approximate keys, frequent
values of a field (often default values), and joinable fields, and to
understand database dynamics (changes in a database over time). In
this talk, I'll describe the technology underlying Bellman and how
it is used to help make sense of complex databases.
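
To make the profiling ideas above concrete, here is a minimal sketch in Python. It is not Bellman's implementation: it computes exact statistics on small in-memory lists, whereas the abstract describes concise statistical summaries suited to very large databases. The function names (profile_column, resemblance) and the sample columns are hypothetical, chosen only for illustration.

```python
from collections import Counter


def profile_column(values, top_k=3):
    """Return simple profile statistics for one column (illustrative only)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    distinct = len(counts)
    return {
        "rows": len(values),
        "distinct": distinct,
        # A ratio close to 1.0 suggests the column is an approximate key.
        "uniqueness": distinct / len(values) if values else 0.0,
        # Unusually frequent values are often defaults (0, "N/A", ...).
        "top_values": counts.most_common(top_k),
    }


def resemblance(a, b):
    """Jaccard resemblance of two columns' value sets.

    A high value hints that the two fields may be joinable.
    """
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical sample data: 0 stands in for a default value.
    customers_id = [1, 2, 3, 4, 5]
    orders_customer_id = [1, 2, 2, 3, 0, 0, 0, 4]

    print(profile_column(customers_id))
    print(profile_column(orders_customer_id))
    print(resemblance(orders_customer_id, customers_id))
```

In this toy data, customers_id has one distinct value per row and so looks like a key, while the most frequent value in orders_customer_id is the default 0, which lowers that column's uniqueness and would pollute a join on it. Computing these quantities exactly requires a scan of every column pair, which is why a tool working at the scale described above would rely on compact summaries instead.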