ECCB 2002 Poster

Title: The Effects of the Transcription Factor on Binding Site Information are Constrained by Genetic Autonomy
P79
Kim, Jan T.; Martinetz, Thomas; Polani, Daniel

kim@inb.uni-luebeck.de
Institut für Neuro- und Bioinformatik

All living organisms are endowed with a genome that has an extensive information storage capacity. Now that whole genome sequencing has given access to this information, understanding the mechanisms by which genetic information is turned into phenotypic processes and traits is a major challenge for bioinformatics. Transcription factors form regulatory networks, and these ensure that, from the very high-dimensional space of all possible expression patterns, biologically meaningful patterns are selected with extremely high specificity. Transcription factors are thus central components of the mechanisms which interpret genetic information. Consequently, genomes and transcription factors are tightly interlinked by coevolutionary constraints: If they fail to "understand" each other, massive pleiotropic effects and lethal consequences almost certainly result.

The information content of binding site positions (R_freq) and that of binding word sequences (R_seq) [1] are key quantities for analyzing this coevolutionary process. In recent studies, we developed a formal framework for investigating the coevolution of genomes and transcription factors using maximum entropy analysis [2, 3]. The basis of this framework is a combined state space, given by the Cartesian product of the genome space and the transcription factor space. On this basis, the maximally likely value of R_seq can be calculated with R_freq as the independent variable.
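As an illustration of the two quantities, the following Python sketch computes them in the sense of Schneider et al. [1]: R_seq as the sum over positions of (2 - H) bits for a set of aligned binding sites, and R_freq as log2 of the number of possible positions divided by the number of sites. It is a minimal sketch under stated assumptions, not the framework of [2, 3]: it assumes equiprobable bases in the background and omits the small-sample correction used in [1]. The function names and the example inputs are hypothetical.

```python
import math
from collections import Counter


def r_seq(sites):
    """Information content (bits) of aligned binding sites.

    Assumption: uniform background over A, C, G, T (2 bits per position)
    and no small-sample correction.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    total = 0.0
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        # Shannon entropy of the base distribution at this position
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


def r_freq(num_positions, num_sites):
    """Information (bits) needed to locate num_sites among num_positions."""
    return math.log2(num_positions / num_sites)
```

For example, `r_seq(["ACGT", "ACGT"])` gives 8 bits because every position is fully conserved, whereas a set of sites with all four bases equally frequent at a position contributes 0 bits for that position.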

In this contribution, we extend our analysis to account for a fundamental biological fact: Living systems which are genetically autonomous, i.e. which do not depend on genetic information provided by host organisms, must encode all transcription factors they need within their own genome. This places a constraint on the ratio between the cardinality of genome space and the cardinality of transcription factor space.

Let N denote the length of the genome, G the space of all possible genomes and T the space of all possible transcription factors. Evidently, |G| = 4^N. Further, let M denote the number of genes in the genome. Then, N/M is an upper bound for the average gene length (equality would occur if there were no intergenic regions). Since each transcription factor is encoded by a single gene, |T| <= 4^{N/M} holds as an estimate. One can thus deduce that

log(|T|) / log(|G|) <= 1/M (1)

is valid, since log(|T|) <= (N/M) log(4) and log(|G|) = N log(4). As an alternative approach, one can use the typical number of amino acids in a DNA binding domain to estimate |T|.
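The estimate in inequality (1) can be checked numerically with a short Python sketch. It works in log2 units throughout so that 4^N never has to be evaluated. The function names are hypothetical. The last function is one possible reading of the alternative approach: it assumes that a binding domain of a given number of residues spans 20^a sequences, which is an assumption of this sketch and not a formula taken from the abstract.

```python
import math


def log2_genome_space(n):
    """log2 |G| for a genome of length n, using |G| = 4^n."""
    return 2.0 * n


def log2_tf_space_bound(n, m):
    """Upper estimate of log2 |T| from |T| <= 4^(n/m)."""
    return 2.0 * n / m


def ratio_bound(n, m):
    """Upper estimate of log|T| / log|G|; reduces to 1/m for any n."""
    return log2_tf_space_bound(n, m) / log2_genome_space(n)


def log2_tf_space_from_domain(num_residues):
    """Alternative estimate of log2 |T|.

    Assumption: |T| is about 20^num_residues (20 amino acids per residue
    of the DNA binding domain).
    """
    return num_residues * math.log2(20)
```

With illustrative values such as n = 4,600,000 and m = 4,300 (chosen only to be of roughly bacterial scale), `ratio_bound(n, m)` returns 1/4300 up to floating-point rounding, independent of n.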

It can be shown that a uniform probability distribution in G induces a tendency towards equality of the maximally likely value of R_seq and R_freq. In the generic case, however, the maximally likely R_seq may assume any value, depending on the probability distribution in T. The inequality (1) nevertheless enables us to derive a limited range for the maximally likely R_seq. These ranges are depicted in Fig. 1. In the graph, diamonds indicate R_seq = R_freq and the bars indicate the range which the maximally likely R_seq values cannot exceed for any probability distribution in T.

The R_seq range clearly increases in size with R_freq. Only for small R_freq (up to about 4 bits) does the equality R_seq = R_freq appear to be a good approximation. However, such a low sequence specificity of DNA binding is insufficient for transcription factors. For transcription factors and other sequence-specific DNA binding proteins, the expected binding site frequency is about 1 out of 500 positions, i.e. R_freq is about 9 bits [1]. Fig. 1 shows that at such R_freq values, R_seq may noticeably deviate from R_freq but is still quite strongly constrained (for the figure, see the web representation of the poster abstract).
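The relation between binding site frequency and R_freq used here is a base-2 logarithm, as the following Python sketch shows. The helper names are hypothetical and the code only illustrates the arithmetic.

```python
import math


def bits_from_frequency(one_in):
    """R_freq in bits for one binding site per `one_in` positions."""
    return math.log2(one_in)


def frequency_from_bits(bits):
    """Number of positions per expected site for a given R_freq in bits."""
    return 2.0 ** bits
```

One site per 500 positions corresponds to log2(500), slightly below 9 bits, whereas R_freq = 4 bits corresponds to one site per 16 positions, which illustrates why such low values are too unspecific for transcription factors.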

Our study demonstrates that a correlation between R_freq and R_seq exists in biological systems that are genetically autonomous, i.e. which encode their own machinery for controlling gene expression. From this perspective, it appears particularly interesting to empirically investigate R_seq and R_freq in non-autonomous systems such as viruses, mitochondria and plastids. In the longer term, our finding may provide a point of departure for characterizing key properties of the probability distribution in the space of DNA binding protein function and thus support progress in functional genomics.
[1] Schneider, T.D., Stormo, G.D. and Gold, L. (1986) "Information Content of Binding Sites on Nucleotide Sequences." J. Mol. Biol. 188: 415-431.
[2] Kim, J.T., Martinetz, T. and Polani, D. (2001) "On the Effects of Transcription Factor Properties on the Information Contents of Binding Sites." In E. Wingender and R. Hofestaedt (editors): Proceedings of the German Conference on Bioinformatics 2001. German Research Center for Biotechnology (GBF), Braunschweig.
[3] Kim, J.T., Martinetz, T. and Polani, D. (2002) "Bioinformatic Principles Underlying the Information Content of Transcription Factor Binding Sites." Submitted to J. theor. Biol.