ECCB 2002 Poster

Title: Paracel TranscriptAssembler: Identification of Alternative Splice Forms and polyA Sites from EST and mRNA Data
P16
Boysen, Cecilie; Qian, Jun; Gill, Tristan; Messenger, Richard J.; Mo, Yi; Sievers, Michael; Zhu, Lingyan; Borkowski, Joseph A.

boysen@paracel.com
Paracel, Inc., Pasadena, CA, USA.

The detection of splice variants and alternatively used polyA sites is important in biology and pharmacology. Their presence can mean the difference between functional and non-functional proteins. In some cases, two splice variants can have antagonistic effects. Even if they do not affect the function of the protein, they may alter the level of transcript or protein available. Quantitative gene expression profiling studies, whether they apply ESTs, SAGE, or hybridization-based techniques such as oligo- or gene-arrays, have largely ignored alternative forms of the transcripts. And yet, the information about splice variants or alternatively used polyA sites can be derived from these EST sequences.

Most available EST clustering and assembly software uses a genomic assembly engine and often divides the EST data into smaller, manageable sets based on tissue type. These practices can thwart the identification of transcript variations in a gene. We overcame these deficiencies by developing the Paracel TranscriptAssembler (PTA). PTA efficiently handles very large numbers of sequences, and has been specifically designed to address biological and technical issues such as splice variants, chimeras, and poor quality data.

PTA automatically performs all steps necessary to transform input sequences into output contigs and singlets, making it fairly simple, yet flexible, to use. The PTA pipeline includes transformation of many different sequence formats into a common XML format and detection of contaminants, repeats, polyA tails, and other special sequences that have to be treated in prescribed ways for downstream processing to succeed. The sequences are then clustered into bins. Each bin generally represents one gene, but includes the various splice forms for this gene. The sequences in an individual bin are then assembled, with emphasis on distinguishing splice variants or alternative polyA tails from low quality sequence ends and identifying the alternative transcripts represented. PTA is designed to work both with and without base quality values, so this can be done whether or not they are available. PTA also performs a thorough mutual alignment of these variant transcripts. The alignment relationships can be visually displayed in one of PTA's many powerful viewers. The underlying assemblies can also be displayed to check for potential SNPs in coding or non-coding regions. If a genomic sequence is available, the resulting transcripts can be aligned and viewed in relationship to the genomic sequence. This alignment takes splice sites into consideration, thus providing accurate exon-intron boundaries for annotation and other downstream processes. Whether the genomic sequence is available or not, PTA provides information helpful for the design of splice variant-specific oligos, which are very important in gene expression experiments. This information makes PTA ideal for all species, even those for which the genomic sequence is not available.
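The binning step described above can be illustrated with a minimal sketch. It is not PTA's algorithm, which the abstract does not describe; it assumes, purely for illustration, that two sequences belong in the same bin when they share at least one k-mer, and it merges such pairs with a union-find structure. The function names, the value of k, and the min_shared threshold are all hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_into_bins(seqs, k=8, min_shared=1):
    """Group sequence indices into bins by transitive k-mer sharing.

    Assumption for illustration only: sharing min_shared k-mers is
    treated as evidence that two sequences come from the same gene.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    kmer_sets = [kmers(s, k) for s in seqs]
    # All-against-all comparison: quadratic, acceptable only for a toy input.
    for i, j in combinations(range(len(seqs)), 2):
        if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
            union(i, j)

    bins = {}
    for i in range(len(seqs)):
        bins.setdefault(find(i), []).append(i)
    return sorted(bins.values())
```

A production tool handling millions of ESTs would need an indexed comparison rather than the pairwise loop used here, and would have to mask repeats and screen chimeras first so that unrelated genes are not merged into one bin.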

PTA has been used to cluster and assemble millions of ESTs and mRNAs in many species. Alternative transcripts have been found in numerous plant, insect, and mammalian species to a much higher degree than previously speculated. With sufficient EST coverage, the majority of human genes have been shown to produce several splice variants or to employ alternative polyA sites. Most of these splice variants and polyA sites have been confirmed via alignments to the genome. In some instances, however, the genomic sequence to which the transcripts align does not contain all the bases in the canonical splice sites. The absence of some of these bases raises the question of how frequently mutations and polymorphisms occur in splice sites and how they may relate to biology and disease.
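Checking an aligned intron for the canonical splice-site bases can be sketched as follows. The sketch assumes the common GT donor and AG acceptor dinucleotides on the forward strand and a 0-based, half-open intron coordinate pair; the function name and the coordinate convention are assumptions made for illustration, not part of PTA.

```python
def classify_intron(genome, start, end):
    """Report the boundary dinucleotides of an intron and whether they are GT...AG.

    genome: genomic sequence (forward strand) as a string.
    start, end: 0-based, half-open intron coordinates (assumed convention).
    Returns (donor, acceptor, is_canonical).
    """
    donor = genome[start:start + 2].upper()    # first two intron bases
    acceptor = genome[end - 2:end].upper()     # last two intron bases
    return donor, acceptor, donor == "GT" and acceptor == "AG"
```

Introns that fail this check are the cases the paragraph above refers to: they may reflect a polymorphism or mutation at the splice site, a sequencing error, or a shifted alignment, so each would need separate inspection.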

In summary, Paracel TranscriptAssembler clusters and assembles ESTs and mRNAs automatically, with emphasis on resolving alternative splice variants and polyA site usage. The results can be visually inspected as splice variants aligned with each other or with the genomic sequence, if available, for validation or annotation purposes. One output report includes information about the gene segments used alternatively in the different transcripts of any given gene. This information can be used in the development of splice variant-specific gene expression profiling assays. The resulting alternative transcripts can further help elucidate results stemming from proteomic studies. Studies using PTA indicate that the majority of human genes can produce several different transcripts as a result of alternative splicing or alternative polyA site usage.
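The report of gene segments used alternatively across the transcripts of a gene can be approximated by a small sketch. The format of PTA's report is not given here, so the following only assumes that each transcript is described by its exon intervals on a shared coordinate system (0-based, half-open); the function name and the data layout are hypothetical.

```python
def alternative_segments(transcripts):
    """List segments used by some, but not all, transcripts of a gene.

    transcripts: dict mapping a transcript name to a list of (start, end)
    exon intervals, half-open, on a shared coordinate system (assumed layout).
    Returns a list of ((start, end), [names of transcripts using the segment]).
    """
    # Every exon boundary splits the gene into elementary segments.
    bounds = sorted({p for exons in transcripts.values() for iv in exons for p in iv})
    result = []
    for lo, hi in zip(bounds, bounds[1:]):
        users = sorted(
            name
            for name, exons in transcripts.items()
            if any(s <= lo and hi <= e for s, e in exons)
        )
        # Keep segments that are present in at least one transcript but not in all.
        if users and len(users) < len(transcripts):
            result.append(((lo, hi), users))
    return result
```

Segments returned by such a function are natural candidates for variant-specific probes, since a probe placed inside one of them would hybridize only to the transcripts listed for that segment.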