Reeder, Jens;Giegerich, Robert - Improved Prediction of RNA Secondary Structures Including Pseudoknots

ECCB 2002 Poster sorted by: Author | Number

Next | Previous poster (in order of the view you have selected)

Title: Improved Prediction of RNA Secondary Structures Including Pseudoknots	P130
Reeder, Jens; Giegerich, Robert robert@techfak.uni-bielefeld.de, jreeder@techfak.uni-bielefeld.de Bielefeld University

MOTIVATION
Pseudoknots have been proved to be functionally relevant in RNA secondary structures, for example in ribosomal RNA, in the catalytic core of Group I Introns or in RNaseP. Yet, they are excluded from consideration by standard RNA folding programs because of computational complexity: While folding a sequence of length n into unknotted structures requires O(n³) time and O(n²) space, finding the best structure including pseudoknots in the widely accepted MFE-model has been proved to be NP-complete [1,2]. However, for a significant subclass of pseudoknots, polynomial time algorithms have been designed: Rivas and Eddy achieve O(n⁶) time and O(n⁴) space [3,4], improved by Lyngso [1] to O(n⁵) time and O(n⁴) space. An algorithm by Akutsu [2] achieves O(n⁴) time and O(n⁴) space, but does not work for the standard energy model. Note that the space complexity of O(n⁴), common to all algorithms, may be the larger obstacle for practical use. The Rivas and Eddy algorithm is available and, in spite of the high computational cost, it is actually used in practice.

RESULTS
We present an improved algorithm for finding the best RNA structure including a certain subclass of pseudoknots in the MFE model that achieves O(n⁴) time and O(n²) space. The improvement of two orders of magnitude over Rivas/Eddy in both time and space results from an idea of canonization, which may be successfully applied in other dynamic programming problems as well. Moreover, the resulting algorithm only adds a special case to the traditional folding algorithm for unknotted structures, and hence is much simpler to understand and to implement.

MODELLING PSEUDOKNOTS
We consider RNA secondary structures in full generality, but only discuss the aspects related to pseudoknots here. A pseudoknot of the class considered here is shown in Figure 1. Its sequence can be decomposed into compartments a,u,b,v,a',w,b' with boundaries at sequence positions i,e,k,g,f,l,h,j (Fig. 2). Compartment a forms a helix with a', and b with b'. Both parts of a helix must have the same length, so we conclude
f == l-(e-i)
h == j-(g-k).
We are left with 6 out of eight boundaries that vary independently. To consider all possible pseudoknots during energy minimization, this requires a six-fold nested loop and, therefore, O(n⁶) time.

CANONIZATION
We call a pseudoknot canonical if the helices a,a' and b,b' facing each other have maximal extent, or in other words, compartment v is as short as possible under the rules of base pairing. Clearly, for each family of pseudoknots delineated by i,k,l,j there is a canonical one. Considering canonical pseudoknots is a heuristic restriction, but a well-supported one: The energy model strongly favors maximal helix extension. But there are exceptions, such that in some (rare) cases the canonical pseudoknot is only near-optimal. Another case where the true optimal structure may be missed is the case when there is competition for helix formation from within compartment v. So far, neither of these cases has occurred in practice.

COMPLEXITY REDUCTION
For canonical pseudoknots, we observe that the length of a and a' is known once i and k are chose. The maximal helix length maxhel(i,l) can be precomputed and stored in an O(n²) table. For forming a pseudoknot between i and j, we now choose k and l, and determine
e == i+maxhel(i,l)
g == k+maxhel(k,j).
Thus, we are left with only four independently moving boundaries - i, k, l, j -, and obtain an algorithm with runtime O(n⁴). The space requirements of O(n²) are in the same asymptotic efficiency class as in folding of unknotted structures.

IMPLEMENTATION
The algorithm has been developed and implemented with the method of Algebraic Dynamic Programming (ADP)[5]. The standard recurrences for RNA folding can be taken from [6], the ADP code for the pseudoknot case is shown in Figure 3. A technical difficulty lies with the case where the length of v becomes negative if both enclosing helices are assumed to be maximal. This can be overcome without affecting overall asymptotic complexity. The program has not yet been fully evaluated, but a stable preliminary version can be obtained from the authors upon request.

FIRST EVALUATIONS
We tested our algorithm on some of the sequences mentioned in [3] and compared the resulting structures. Since our algorithm computes a subset of Rivas' and Eddy's algorithm, we expected not to find any spurious pseudoknots in the prediction of 8 randomly chosen tRNAs. Indeed all tRNAs fold into a cloverleaf structure similar to those in the database. As an example of short pseudoknots we folded seven HIV-1-RT-ligands and obtained for all of them the identical structures as Rivas/Eddy and the folding time was reduced from 9 to less than one second. Even for the 3' end of the tymv genome (86 nucleotides) we got an almost identical result, but taking only 26 seconds instead of the 52 minutes Rivas/Eddy needed. Thus we are able to fold sequences that are longer than their limit of 130-140 nucleotides [3] with comparable results. E.g. a subsequence (250 nuc.) of the RNaseP RNA was folded in 53 minutes and the complete sequence (400 nuc.) in 18 hours. Furthermore the storage use did not exceed the theoretically expected bounds.

(For the figures see web representation of the poster abstract)

[1] R.B. Lyngso, C.N.S. Pedersen. RNA Pseudoknot Prediction in Energy Based Models. J. Comp. Biol., vol. 7(3/4), 409-428, 2000.
[2] T. Akutsu. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discr. Appl. Math. 104, 45-62, 2000.
[3] E. Rivas and S.R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol. 285:2053-2068, 1999.
[4] E. Rivas and S.R. Eddy. The language of RNA: A formal grammar that includes pseudoknots. Bioinformatics 16:334-340, 2000.
[5] R. Giegerich and C. Meyer. Algebraic dynamic programming. In 9th International Conference on Algebraic Methodology And Software Technology (AMAST), 2002.
To appear.
[6] D. Evers, R. Giegerich: ISMB 2002 Tutorial Notes, http://bibiserv.techfak.uni-bielefeld.DE/ dynprog/tutorial/