Martinetz, Thomas;Gewehr, Jan E.;Kim, Jan T. - A Statistical Learning Approach to Predicting Protein-DNA-Binding Sites

ECCB 2002 Poster sorted by: Author | Number

Next | Previous poster (in order of the view you have selected)

Title: A Statistical Learning Approach to Predicting Protein-DNA-Binding Sites	P100
Martinetz, Thomas; Gewehr, Jan E.; Kim, Jan T. kim@inb.uni-luebeck.de Institut für Neuro- und Bioinformatik, Universität zu Lübeck

Predicting the bindings sites of DNA binding proteins is a highly relevant task in bioinformatics. For example, the bindings sites of transcription factors are key elements of regulatory networks and determine the location of genes on a genome. Usually, for a given DNA binding protein, only a few binding sequences are known experimentally. The task then is to deduce the global binding characteristics of the protein based on these few examples. A widespread approach is the so-called Position-Weight-or Profile-Matrix (PM) [1]. A PM for binding sites of length L is a 4 x L matrix. It is used for scoring words of length L within genomic sequences. Positions of high scoring words, i.e. where the score exceeds a threshold value S_min, define the potential binding sites detected by the PM.

The PM can be interpreted as a linear classifier (binding word class/non-binding word class) within the space of sequence words, with the profile of the experimentally verified binding sites determining its parameters. To demonstrate this, a PM p can be represented as a 4*L-dimensional vector by conceptually concatenating the L columns of the PM. Likewise, a sequence word w, which is to be assessed using the PM, can also be represented as a 4*L-dimensional vector by concatenating L 4-tuples which correspond to the L bases in w. In each 4-tuple, three components are 0 and one component, corresponding to the base in w, is 1. For example, the word "GAT" would be represented by w = (0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1). It is easy to show that the vector representations of all words of length L are homogeneously distributed on a 3*L-dimensional hypersphere around their center of gravity <w>.

The linear classification defined by p and S_min is a separation of sequence words according to whether they satisfy

S(w) = p * w >= S_min (1)

This inequality slices a segment off the 3*L-dimensional hypersphere; the words on the surface of this cap are those classified as binding words.

In this contribution a novel approach called Binding-Matrix (BM) is introduced. As the PM the BM realizes a linear classification, but in contrast to the PM approach the classifier (matrix) is now determined by maximum likelihood estimation. As a point of departure, it is noted that the transcription factor itself also performs a classification on the set of words with length L: If the binding energy of the factor to a word w exceeds a certain threshold E_min, the word is a binding word. It is well established that in very good approximation, each base pair in a word contributes to the binding energy independently of the other base pairs [2]. Thus, within the framework of vector representations, the classification performed by a transcription factor is defined by

E(w) = e * w >= E_min (2)

where e is a matrix / vector whose components are the position specific binding energy values.

The small set of experimentally determined binding words can be considered as a random sample drawn from the words that satisfy (2). The BM is the maximum likelihood estimation of e. It can be shown that in the 4L-dimensional space in which the 3*L-dimensional hypersphere is embedded, the maximum likelihood estimation is the plane which cuts off the smallest possible sphere segment which still contains all data points, i.e. all experimentally determined binding words. This is sketched in Fig. 1.

Generally, the BM is not identical to, or linearly dependent on the PM. They both converge towards e when the entire set of binding words is provided as training data, but may vary substantially if the training data consists of a small subset of all binding words. The BM can efficiently be computed by convex optimization from training data (i.e. from a set of experimentally determined binding words).

For testing the BM on empirical data, all transcription factors for which at least 5 binding words are available were selected from the TRANSFAC database. For each set of binding words, a leave-one-out test was performed. For two transcription factors for which the highest number (about 70) of experimentally determined binding words were available, tests were performed in which 2/3 of the words were employed as training data and the remaining words served as test data. In all tests, the thresholds were set such that a 100% sensitivity level was just achieved, i.e. all experimentally known binding words were correctly classified. As an estimate for specificity, the total number of words which had to be classified as binding words in order to achieve 100% sensitivity was determined. Fig. 2 shows the aggregated results of the leave-one-out tests. All tests revealed that the specificity of the BM method exceeds the specificity of the PM method by about one order of magnitude.

[1] K. Frech, K. Quandt and T. Werner (1997) "Finding Protein-Binding Sites in DNA Sequences: The Next Generation." Trends Biochem. Sci. 22: 103-104.
[2] G.D. Stormo, S. Strobl, M. Yoshioka and J.S. Lee (1993) "Specificity of the Mnt Protein. Independent Effects of Mutations at Different Positions in the Operator." J. Mol. Biol 229: 821-826.