Kelkar, Bhooshan P. - Finding and Characterizing Novel Cancer Related Genes in Genomic Sequences Using IBM Life Sciences Framework

ECCB 2002 Poster

Title: Finding and Characterizing Novel Cancer Related Genes in Genomic Sequences Using IBM Life Sciences Framework	P77
Kelkar, Bhooshan P. bkelkar@us.ibm.com IBM Life Sciences Strategic Initiatives

This proposal describes an architecture built on federated-database technology and Web Services, together with one simple Life Sciences application of that architecture.

The need: Federation of data and interoperability of resources

In the Life Sciences field, data is generated and stored in many ways. The data sources are not only distributed but also disparate: data is stored in relational tables, flat files, instrumentation-specific files, etc. The sources are geographically distributed, and some are accessed over the Internet. This data needs to be accessed not only optimally but also in a consistent and secure manner.
The other issue is the abundance of tools, algorithms, methods and technologies that are often limited by platform dependence and lack of interoperability. Interoperability and knowledge sharing are critical because Life Sciences has evolved, and continues to evolve, as an inherently interdisciplinary activity. Many silos, such as chemistry and microbiology, exist in pharmaceutical and biotech companies, small and large alike. An architecture based on a mutually agreed common platform, such as CORBA or DCOM, is very useful for application integration within an enterprise, but does little to leverage the ubiquity of the Internet or today's multi-vendor environment. This causes a severe problem for client-to-server communications, especially when the client machines are scattered across the Internet, which is fairly common in the Life Sciences area.
From the point of view of resource planning and optimal knowledge sharing, it is crucial that these processes be automated as workflows as far as possible and be enabled for interoperability.

The Solution: IBM DiscoveryLink and IBM Life Sciences Framework

IBM DiscoveryLink is a product that enables federation of data. The basic concept is to make data from seemingly disparate, distributed sources available to the user as though it came from one consistent source. IBM DiscoveryLink optimizes data integration by enabling queries across disparate and distributed data sources.
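
The federation idea can be illustrated with a toy sketch. This is not DiscoveryLink's interface; it only shows the concept of answering one query from two disparate sources as though they were one. The table layout, the flat-file contents, and the function names (`make_relational_source`, `federated_query`) are all hypothetical and chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational table of genes (in-memory SQLite).
def make_relational_source():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE genes (gene_id TEXT PRIMARY KEY, symbol TEXT)")
    conn.executemany(
        "INSERT INTO genes VALUES (?, ?)",
        [("G1", "TP53"), ("G2", "BRCA1"), ("G3", "MYC")],
    )
    return conn

# Hypothetical source 2: a flat file of measurements (CSV text, made-up values).
FLAT_FILE = "gene_id,expression\nG1,8.5\nG3,2.1\n"

def federated_query(conn, flat_text, min_expression):
    """Return (symbol, expression) pairs joined across both sources."""
    # Load the flat file into a lookup keyed by gene id.
    expression = {
        row["gene_id"]: float(row["expression"])
        for row in csv.DictReader(io.StringIO(flat_text))
    }
    results = []
    query = "SELECT gene_id, symbol FROM genes ORDER BY gene_id"
    for gene_id, symbol in conn.execute(query):
        value = expression.get(gene_id)
        # Keep only genes present in both sources and above the threshold.
        if value is not None and value >= min_expression:
            results.append((symbol, value))
    return results

rows = federated_query(make_relational_source(), FLAT_FILE, 5.0)
print(rows)
```

The caller asks a single question and never sees that the answer was assembled from a database table and a flat file; a real federation layer would additionally push work down to each source and optimize the query plan.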

The Life Sciences Framework offering uses common web protocols to integrate applications without prior coordination between them. It allows for a loose confederation of resources held together by relatively simple open protocols, such as XML and Web Services. XML is open and can be used to exchange data between applications and with other users in a platform-independent way. IBM DiscoveryLink for data federation can also be integrated into the IBM Life Sciences Framework solution.
The functions, capabilities and solutions delivered through the Life Sciences Framework span: data storage, query, mining, integration, analysis and visualization; application development and integration; project, workflow and knowledge management; and networking, system and Web integration. The generic orthogonal benefits include security, scalability, reliability, flexibility and manageability. The IBM Life Sciences Framework supports component-based development. The framework promotes highly productive application development by supporting code reuse and by enabling application and data integration in the Life Sciences domain. The Life Sciences Framework focuses on portability, cross-platform interoperability, platform independence, security, workflow management, and knowledge and data management. New technologies constantly appear for new application niches. The framework solution helps ensure that mission-critical systems are rooted in standards that will adapt to new hardware capabilities and software platforms.
This will enable Life Sciences companies to integrate applications from many vendors on their internal intranets or over the Internet, adding value to the overall solution and providing new, more robust solutions. Researchers will benefit from the integration of applications across the domain, which enables the development of new and more efficient workflows for disease diagnosis, drug discovery and submissions for FDA approval.
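
As a minimal sketch of platform-independent data exchange with XML, the following serializes a sequence record on one side and parses it on the other using only the Python standard library. The element and attribute names (`sequenceRecord`, `description`, `residues`, `id`) are assumptions made for this example and are not taken from the Life Sciences Framework.

```python
import xml.etree.ElementTree as ET

def record_to_xml(seq_id, description, sequence):
    """Serialize a sequence record to an XML string (hypothetical schema)."""
    root = ET.Element("sequenceRecord", attrib={"id": seq_id})
    ET.SubElement(root, "description").text = description
    ET.SubElement(root, "residues").text = sequence
    return ET.tostring(root, encoding="unicode")

def xml_to_record(xml_text):
    """Parse the XML string back into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "description": root.findtext("description"),
        "sequence": root.findtext("residues"),
    }

# One application produces the message; another consumes it.
message = record_to_xml("seq001", "example fragment", "ATGGCC")
record = xml_to_record(message)
print(message)
print(record)
```

Because the message is plain text with a self-describing structure, the producer and the consumer need to agree only on the element names, not on a shared platform or programming language.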

Illustration:
An example of the IBM Life Sciences Framework is demonstrated using various publicly available bioinformatics tools in the particular scenario of finding and characterizing novel cancer-related genes in genomic sequences. The following labor-intensive, error-prone manual steps can easily be put into a workflow and automated using the Life Sciences Framework.

1. Search Medline using PubMed to obtain relevant documents and manually filter the documents to extract sequence ids.
2. Manually retrieve FASTA-formatted sequences from genomic databases for each sequence id.
3. Manually copy and paste each sequence to run BLAST analysis against the human genome.
4. Manually filter results by a stringency criterion.
5. Manually format and feed data into a visualization tool.
6. Manually retrieve the sequences corresponding to the BLAST hits and run multiple gene prediction tools for each interesting sequence.
7. Manually submit each putative coding region and run BLAST analysis against the non-redundant database.
8. Manually format and feed data into a visualization tool.
9. Manually copy and paste the sequence and run BLAST analysis against SwissProt to identify hits to known protein sequences.
10. Manually filter, format and feed data into a visualization tool.
11. Manually retrieve relevant hit sequences from SwissProt, then copy and paste the sequences to run multiple sequence alignment using ClustalW.
12. Manually copy and paste the alignment and run phylogenetic analysis and HMMER programs individually.
13. Visualize homologues of a protein family of interest and the phylogenetic tree.
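
The steps above can be chained programmatically instead of by copy and paste. The following is a minimal sketch of that idea, not the Life Sciences Framework itself: it covers only stand-ins for FASTA parsing (step 2) and stringency filtering (step 4), and it assumes a tab-separated BLAST hit table with the E-value in the eleventh column. The function names and the sample data are hypothetical and made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence id to sequence."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is the first word after '>'.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}

def filter_hits(table_text, max_evalue):
    """Keep (query, subject, evalue) for hits at or below max_evalue."""
    kept = []
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        evalue = float(fields[10])  # assumed column layout, see lead-in
        if evalue <= max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept

def run_workflow(steps, data):
    """Apply each step in order, feeding each output to the next step."""
    for step in steps:
        data = step(data)
    return data

# Made-up inputs standing in for retrieved sequences and BLAST output.
FASTA = ">seqA demo\nATGC\nGGTA\n>seqB demo\nTTAA\n"
HITS = (
    "seqA\tchr1\t98.0\t8\t0\t0\t1\t8\t100\t107\t1e-30\t60.0\n"
    "seqB\tchr2\t80.0\t4\t1\t0\t1\t4\t5\t8\t0.5\t12.0\n"
)

sequences = parse_fasta(FASTA)
significant_queries = run_workflow(
    [
        lambda table: filter_hits(table, 1e-5),          # stringency filter
        lambda hits: sorted({hit[0] for hit in hits}),   # unique query ids
    ],
    HITS,
)
# Sequences that survive the filter would go on to the next analysis step.
selected = {seq_id: sequences[seq_id] for seq_id in significant_queries}
print(selected)
```

Each manual hand-off in the list becomes a function whose output feeds the next one, so the stringency threshold and the order of steps are recorded in one place and the whole chain can be rerun without retyping.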