Campus Event Calendar

Event Entry

What and Who

Enhanced Probabilistic Models for Remote Homology Detection

Thomas Plötz
University of Bielefeld
Talk
AG 1, AG 2, AG 3, AG 4, AG 5  
MPI Audience

Date, Time and Location

Wednesday, 13 October 2004
09:30
-- Not specified --
46.1 - MPII
024
Saarbrücken

Abstract

Over the last decade, probabilistic models of protein families have
increasingly become the methodology of choice for remote homology
detection. Instead of direct sequence-to-sequence comparison,
stochastic models of the protein families of interest are trained, and
unknown sequences are aligned to these models, yielding scores for
classification. Profile Hidden Markov Models (HMMs) in particular are
the most promising models of protein families. Unfortunately, the
problem of remote homology classification is far from being solved,
even when using the most powerful techniques.
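
As a rough illustration of how scoring a sequence against a family model
can work, the following sketch computes a log-odds score with the forward
algorithm. It is a simplified, plain discrete HMM without the silent delete
states of a real Profile HMM; the function names and the data layout
(start, trans, emit) are assumptions made for this example and are not
taken from the talk.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, start, trans, emit):
    """Return log P(seq | model) for a discrete HMM (forward algorithm).

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s]     : dict mapping a symbol to its emission probability in s
    """
    n = len(start)
    alpha = [_log(start[s]) + _log(emit[s].get(seq[0], 0.0)) for s in range(n)]
    for symbol in seq[1:]:
        alpha = [
            logsumexp([alpha[r] + _log(trans[r][s]) for r in range(n)])
            + _log(emit[s].get(symbol, 0.0))
            for s in range(n)
        ]
    return logsumexp(alpha)


def log_odds(seq, model, background):
    """Score = log P(seq | family model) - log P(seq | background)."""
    null = sum(_log(background[c]) for c in seq)
    return forward_log_likelihood(seq, *model) - null
```

A positive score means the family model explains the sequence better than
the background distribution; in practice a threshold on this score would
decide family membership.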

At present, Profile HMMs are created in a more or less
straightforward manner. Based on raw sequence data, discrete models
are created for rather large protein units, resulting in huge numbers
of states (match, insert, delete). Due to the complex and almost
ergodic model architecture, vast numbers of parameters need to be
trained to obtain robust HMMs, which is often problematic because of
the lack of suitable data.
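
To give a feeling for the parameter counts involved, here is a small
back-of-the-envelope sketch. It assumes a node layout with seven
transitions per model column (match to match/insert/delete, insert to
match/insert, delete to match/delete), as used in some Profile HMM
implementations, and it ignores boundary effects at the model ends. The
function name and the example length are hypothetical.

```python
def profile_hmm_free_parameters(n_columns, alphabet_size=20):
    """Approximate number of free parameters of a discrete Profile HMM.

    Assumed per column (illustrative, not taken from the talk):
      - one match state with a distribution over the alphabet
      - one insert state with a distribution over the alphabet
      - 7 transitions: 3 leaving match, 2 leaving insert, 2 leaving delete
    Each probability distribution with k entries has k - 1 free parameters.
    """
    match_emissions = n_columns * (alphabet_size - 1)
    insert_emissions = n_columns * (alphabet_size - 1)
    transitions = n_columns * ((3 - 1) + (2 - 1) + (2 - 1))
    return match_emissions + insert_emissions + transitions


# Example: a model with 200 columns over the 20 amino acids.
print(profile_hmm_free_parameters(200))
```

Under these assumptions each column contributes 42 free parameters, so a
model for a protein unit of a few hundred residues already has thousands
of parameters to estimate.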

In the GRASSP project, a research cooperation between the Genomics
group of Boehringer Ingelheim Pharma KG and the Applied Computer Science
group of Bielefeld University, principles and techniques
successfully applied to general pattern classification problems (such as
automatic recognition of spoken language or handwritten script) are
adapted to the task of remote homology detection. The goal is to
develop HMMs for protein families that outperform present Profile HMMs
while requiring significantly fewer training samples.

This talk gives an overview of the GRASSP project. The focus is on
the development of feature-based continuous HMMs for sequence
families, i.e. models containing "continuous" emissions instead of
discrete amino acid distributions. To this end, a feature
representation of sequence data based on biochemical properties of
adjacent residues was developed, and the discrete emissions of
Profile HMMs were replaced by continuous parametric
representations. An evaluation on a representative SCOP dataset
shows the superior performance of the new approach.
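
The idea of feature-based continuous emissions can be sketched as follows.
The abstract does not say which biochemical properties, window size, or
density family are used, so everything concrete here is an assumption for
illustration: the Kyte-Doolittle hydropathy scale, a crude side-chain
charge table, a window of five residues, and a diagonal-covariance
Gaussian as the continuous emission density.

```python
import math

# Kyte-Doolittle hydropathy values per amino acid (assumed feature 1).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Crude side-chain charge at neutral pH (assumed feature 2); others are 0.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def window_features(seq, half_width=2):
    """Map each residue to a feature vector averaged over its neighbours.

    Returns one (mean hydropathy, mean charge) tuple per position, computed
    over a window of up to 2 * half_width + 1 adjacent residues.
    """
    features = []
    for i in range(len(seq)):
        lo = max(0, i - half_width)
        hi = min(len(seq), i + half_width + 1)
        window = seq[lo:hi]
        hydropathy = sum(KYTE_DOOLITTLE[c] for c in window) / len(window)
        charge = sum(CHARGE.get(c, 0.0) for c in window) / len(window)
        features.append((hydropathy, charge))
    return features


def diag_gaussian_log_density(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance.

    In a continuous HMM, a state would use such a density (or a mixture of
    them) in place of a discrete table of amino acid probabilities.
    """
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Usage: emission log-scores of one hypothetical state along a sequence.
feats = window_features("MKVLAAGIVE")
scores = [diag_gaussian_log_density(f, mean=(2.0, 0.0), var=(1.5, 0.1)) for f in feats]
```

Because each state then needs only a mean and a variance per feature
dimension rather than a full amino acid distribution, this kind of
representation is one way the number of emission parameters can shrink.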

In addition, current research activities will be discussed.
Based on the new feature representation of protein sequences,
alternative, significantly less complex model architectures become
possible. The number of free parameters, and thus the number of
training samples required, could be reduced dramatically.
Preliminary but encouraging results are presented, showing comparable
recognition performance with models that contain only a fraction of
the parameters to be trained.
--------------

Contact

Ingolf Sommer

Ruth Schneppen-Christmann, 10/05/2004 11:18 -- Created document.