ECCB 2002 Poster

Title: Predicting protein-protein interactions for the C. elegans interactome project
P164
Thierry-Mieg, Nicolas

Nicolas.Thierry-Mieg@imag.fr
Laboratoire Logiciels-Systemes-Reseaux/IMAG, Grenoble, France

Protein-protein interactions are critical in a wide range of biological processes, from the formation of macromolecular complexes to the transduction of signals in biological pathways. As such, they are the focus of several high-throughput functional genomics projects worldwide, relying on the two-hybrid system [8], and more recently on phage display [13] or mass spectrometry of complexes [9]. We work in close collaboration with Marc Vidal's group at the Dana-Farber Cancer Institute ([14], [11], [7]), whose ultimate goal is to identify all protein-protein interactions in Caenorhabditis elegans, using a high-throughput version of the two-hybrid system [15].
As of today, most protein interactions from this lab have been identified by screening selected genes used as "baits" against a cDNA library. This method is reliable and quickly productive, but it requires the often redundant sequencing of all positives, therefore becoming costly and time-consuming at the genome scale. To efficiently identify the complete interactome, it might be advantageous to set up a two-hybrid array approach, where pairs of known proteins are tested for interaction. Such a method requires the initial cloning of all nematode coding regions, a goal largely achieved in the Vidal lab by way of the so-called C. elegans ORFeome project [11]. In this context, our goal is to predict potential protein-protein interactions in the hope of prioritizing experiments, thereby promoting the quick identification of interacting pairs of proteins.
We proceed as follows. In a first step, a prediction-oriented database of protein-protein interactions, called InterDB [12], is built. In InterDB, proteins from any species that are known to interact physically are characterized by "descriptors", selected for their potential relevance to protein-protein interactions. Current descriptors include InterPro domains [2], as well as SwissProt [5] keywords and subcellular localization information. Annotations from the Gene Ontology consortium [3], as well as interactions from the BIND database [4], will be included soon. The version of InterDB used here contains 2464 interactions involving 2032 proteins.
In a second step, the data collected in InterDB is used in a KDD (Knowledge Discovery in Databases) approach to extract predictive rules. These rules are of the form: if protein A is described by descriptors D1, D2, ..., Dm and protein B is described by descriptors D'1, D'2, ..., D'n, then proteins A and B potentially interact. Our approach relies on the data mining method known as frequent itemset mining. Given a boolean matrix where columns are attributes and rows are observations, the idea is to identify all sets of attributes (i.e. itemsets) that are frequently true together, i.e. in the same observations. In collaboration with Jean-Francois Boulicaut from the LISI laboratory of INSA-Lyon (France), we use the min-ex algorithm [6], a variant of the close algorithm [10] particularly suited to highly correlated and sparse datasets. This algorithm allows the use of very low frequency thresholds, as opposed to less sophisticated algorithms such as apriori [1]. As a result, it extracts a large number of potential predictive rules, yielding predictive models (i.e. sets of predictive rules) that are more sensitive but less specific. Experimentation has shown the necessity of this strategy, as the sensitivity of predictive models obtained with higher frequency thresholds is too low to be of value. Our work then consists in applying post-processing filters to separate, as far as possible, the meaningful rules from the noise. To this end, we have developed four post-processing filters controlled by three parameters.
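As an illustration of frequent itemset mining, here is a minimal Python sketch of a plain level-wise (apriori-style) enumeration. It is not the min-ex algorithm used in this work, and it would not scale to the very low thresholds discussed above. The encoding of each interaction as a set of (side, descriptor) pairs, the function name frequent_itemsets, the min_count parameter, and the descriptor names in the toy data are all assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(observations, min_count):
    """Return {itemset: count} for every itemset found in >= min_count observations.

    Each observation is a set of items; an itemset is a frozenset of items.
    This is a naive level-wise search, shown only to illustrate the idea.
    """
    # Level 1: count every single item.
    counts = {}
    for obs in observations:
        for item in obs:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(current)

    k = 2
    while current:
        # Candidates of size k: unions of two frequent (k-1)-itemsets whose
        # (k-1)-subsets are all frequent (the classic pruning step).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count how many observations contain each candidate.
        counts = {c: sum(1 for obs in observations if c <= obs) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        result.update(current)
        k += 1
    return result


# Toy data (hypothetical descriptors): one set per known interaction, with
# each descriptor tagged by the protein it describes ("A" or "B").
interactions = [
    {("A", "SH3"), ("B", "Proline_rich")},
    {("A", "SH3"), ("B", "Proline_rich"), ("B", "Kinase")},
    {("A", "SH3"), ("B", "Kinase")},
    {("A", "PDZ"), ("B", "Kinase")},
]
itemsets = frequent_itemsets(interactions, min_count=2)
```

In this encoding, a frequent itemset that mixes "A" and "B" items can be read as a candidate rule of the form given above.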
The first filter eliminates rules containing a "bad" descriptor, for example SwissProt keywords such as Hypothetical_Protein or 3D-Structure. The second post-processing filter simply discards itemsets that concern only one protein. Such itemsets may reflect meaningful correlations between descriptors of a single protein, for example synonyms, but they cannot produce predictive rules for protein-protein interactions. The third and most important post-processor extracts statistically significant rules by comparing the observed and randomly expected frequencies of each frequent itemset. Finally, the fourth filter removes rules that are generalizations of other, more significant rules.
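The first three filters can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not say which statistic the third filter uses, so the sketch substitutes the standard lift ratio (observed frequency divided by the frequency expected if the two proteins' descriptors were independent). The (side, descriptor) encoding, the function names, the min_lift parameter, and the toy data are hypothetical.

```python
# Descriptors named in the text as uninformative.
BAD_DESCRIPTORS = {"Hypothetical_Protein", "3D-Structure"}


def has_bad_descriptor(itemset):
    """Filter 1: True if the itemset contains an uninformative descriptor."""
    return any(name in BAD_DESCRIPTORS for _, name in itemset)


def concerns_both_proteins(itemset):
    """Filter 2: True only if descriptors of both protein A and protein B occur."""
    return {side for side, _ in itemset} == {"A", "B"}


def support(itemset, observations):
    """Fraction of observations that contain every item of the itemset."""
    return sum(1 for obs in observations if itemset <= obs) / len(observations)


def lift(itemset, observations):
    """Filter 3 (assumed statistic): observed frequency over expected frequency."""
    a_part = frozenset(item for item in itemset if item[0] == "A")
    b_part = frozenset(item for item in itemset if item[0] == "B")
    expected = support(a_part, observations) * support(b_part, observations)
    if expected == 0:
        return 0.0
    return support(itemset, observations) / expected


def filter_rules(itemsets, observations, min_lift):
    """Keep itemsets that pass filters 1 to 3, in their original order."""
    return [
        s
        for s in itemsets
        if not has_bad_descriptor(s)
        and concerns_both_proteins(s)
        and lift(s, observations) >= min_lift
    ]


# Toy data (hypothetical descriptors), one set per known interaction.
observations = [
    {("A", "SH3"), ("B", "Proline_rich")},
    {("A", "SH3"), ("B", "Proline_rich")},
    {("A", "SH3"), ("B", "Kinase")},
    {("A", "PDZ"), ("B", "Kinase"), ("B", "Hypothetical_Protein")},
]
candidates = [
    frozenset({("A", "SH3"), ("B", "Proline_rich")}),
    frozenset({("A", "SH3"), ("B", "Kinase")}),
    frozenset({("A", "SH3")}),
    frozenset({("A", "PDZ"), ("B", "Hypothetical_Protein")}),
]
kept = filter_rules(candidates, observations, min_lift=1.2)
```

On this toy data only the SH3 / Proline_rich itemset survives: the other candidates are, respectively, seen less often than expected by chance, restricted to one protein, or tainted by a "bad" descriptor.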
To find the most promising parameter values and to evaluate the whole prediction system, we used a set of recently identified interactions. These interactions, which concern proteins involved in the C. elegans proteasome [7], were not present in the learning set. The results are encouraging: well-chosen predictive models have sensitivities on the order of 20%, while their specificity ratios of 3% represent a one-hundred-fold improvement over the random predictive model.
[1] Agrawal R et al (1996). Advances in Knowledge Discovery and Data Mining, 307-28, AAAI Press.
[2] Apweiler R et al (2001). Nucleic Acids Research 29(1), 37-40.
[3] Ashburner M et al (2000). Nature Genetics 25(1), 25-9.
[4] Bader GD et al (2001). Nucleic Acids Research 29(1), 242-5.
[5] Bairoch A, Apweiler R (1999). Nucleic Acids Research 27(1), 49-54.
[6] Boulicaut JF, Bykowski A (2000). Lecture Notes in Artificial Intelligence 1805, PaKDD'00.
[7] Davy A et al (2001). EMBO Reports 2(9), 821-8.
[8] Fields S, Song O (1989). Nature 340, 245-6.
[9] Ho Y et al (2002). Nature 415(6868), 180-3.
[10] Pasquier N et al (1999). Information Systems 24(1), 25-46.
[11] Reboul J et al (2001). Nature Genetics 27(3), 332-6.
[12] Thierry-Mieg N, Trilling L (2001). Lecture Notes in Computer Science 2066, 144-54.
[13] Tong A et al (2002). Science 295(5553), 321-4.
[14] Walhout A et al (2000). Science 287, 116-22.
[15] Walhout A, Vidal M (2001). Methods 24(3), 297-306.