ECCB 2002 Poster

Title: SNP Database Integration Project
P183
Wong, Marie; Choo, Keng Wah; Liu, Jianjun

giswty@nus.edu.sg
Genome Institute of Singapore

Biologists working on SNPs (Single Nucleotide Polymorphisms) at the Genome Institute of Singapore were manually retrieving SNPs in and around genes of interest from several public databases. This took considerable time and effort, especially for a long list of genes. To enable more efficient large-scale retrieval of SNP information, the SNP Database Integration Project was initiated.

The databases of interest were: The SNP Consortium, NCBI's dbSNP, Japan SNP Database and Celera's Human RefSNP.

Although dbSNP is supposed to be the main SNP repository, it is not updated as quickly as new SNP data are added to the other, individual databases. Thus, in order to always have the most up-to-date information, a biologist would have to search each database separately for the latest data. To ease this process, an integration of the databases was vital.

To accomplish the first version of the database integration, a collaboration was established with Nanyang Polytechnic, which contributed both manpower and computing power.

In consultation with the biologists, a set of requirements was drawn up. The first was for the SNPs to be mapped back to the Human Genome Assembly (NCBI's Build 29 was chosen). The second was for each SNP to carry the following information, where available:
· SNP ID (rs#, IMS-JST, TSC, hCV)
· Map position Polymorphic Site
· Minimum 500 bp flanking sequence
· Repeats marked in lower case
· Identification method
· Frequency information
· Population information
Lastly, a web GUI was needed to query the integrated database and view the results.
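
Two of the requirements above (a flanking sequence of at least 500 bp, with repeats marked in lower case) can be illustrated with a minimal sketch. The function name, the 0-based coordinates and the half-open repeat intervals are assumptions made for this example, not details taken from the project.

```python
def flank_with_soft_mask(chrom_seq, snp_pos, repeats, flank=500):
    """Return (left flank, polymorphic base, right flank) for one SNP.

    Assumptions for this sketch: snp_pos is 0-based, and repeats is a
    list of half-open (start, end) intervals on the same coordinates.
    Bases that fall inside a repeat interval are written in lower case.
    """
    # Clip the window to the ends of the sequence.
    start = max(0, snp_pos - flank)
    end = min(len(chrom_seq), snp_pos + flank + 1)
    window = list(chrom_seq[start:end].upper())

    # Lower-case every base covered by a repeat interval.
    for r_start, r_end in repeats:
        for i in range(max(r_start, start), min(r_end, end)):
            window[i - start] = window[i - start].lower()

    offset = snp_pos - start
    left = "".join(window[:offset])
    site = window[offset]
    right = "".join(window[offset + 1:])
    return left, site, right
```

With flank=500, a SNP closer than 500 bp to either end of the sequence simply gets a shorter flank on that side.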

The approach was to first download all the databases and then use a combination of parsers and database queries, depending on the database format, to extract all those SNPs without a dbSNP rs number (rs#). Since dbSNP and Build 29 are both maintained by the NCBI, we assumed that the map positions of SNPs that already had an rs# would be accurate and up to date. Those without an rs#, which usually corresponded to new SNPs, would have to be re-BLASTed against Build 29 in order to be mapped.
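
The triage step described above might look roughly like the following. The record layout (dicts with "rs_id", "source_id" and "flank" keys) and the function names are hypothetical, chosen only for illustration; the sketch stops at producing FASTA text and does not run any alignment.

```python
def split_by_rs(records):
    """Separate records that already carry an rs# from those that do not.

    Each record is assumed to be a dict with an optional "rs_id" key.
    """
    mapped, to_align = [], []
    for rec in records:
        if rec.get("rs_id"):
            mapped.append(rec)      # map position taken from dbSNP
        else:
            to_align.append(rec)    # must be aligned to the assembly
    return mapped, to_align


def to_fasta(records):
    """Format the flanking sequences of unmapped SNPs as FASTA text."""
    return "".join(
        ">{}\n{}\n".format(rec["source_id"], rec["flank"]) for rec in records
    )
```

The FASTA text produced for the second group would then serve as the query set for alignment against the genome assembly.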

The corresponding IDs from all the databases would be stored together in a new table, so that when a user ran a query, the relevant IDs could be used to retrieve the additional SNP information from the original databases. This avoided storing redundant information in the new table, which in turn saved space.
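
A minimal sketch of such a cross-reference table, using SQLite purely for illustration (the poster does not name the database system used). The table name, column names and helper functions are all assumptions.

```python
import sqlite3

# One column per source database; names are hypothetical.
SOURCE_COLUMNS = ("rs_id", "tsc_id", "jsnp_id", "hcv_id")


def build_xref(conn):
    """Create a table holding only the map position and the source IDs."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS snp_xref (
               snp_key INTEGER PRIMARY KEY,
               chrom   TEXT NOT NULL,
               pos     INTEGER NOT NULL,
               rs_id   TEXT,
               tsc_id  TEXT,
               jsnp_id TEXT,
               hcv_id  TEXT
           )"""
    )


def add_snp(conn, chrom, pos, ids):
    """Insert one SNP; ids maps column names to source identifiers."""
    values = [ids.get(col) for col in SOURCE_COLUMNS]
    conn.execute(
        "INSERT INTO snp_xref (chrom, pos, rs_id, tsc_id, jsnp_id, hcv_id) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        [chrom, pos, *values],
    )


def source_ids(conn, chrom, pos):
    """Return the source IDs known for the SNP at a given position."""
    row = conn.execute(
        "SELECT rs_id, tsc_id, jsnp_id, hcv_id FROM snp_xref "
        "WHERE chrom = ? AND pos = ?",
        (chrom, pos),
    ).fetchone()
    if row is None:
        return {}
    return {col: val for col, val in zip(SOURCE_COLUMNS, row) if val is not None}
```

The IDs returned by source_ids would then be used to look up annotations such as frequency or population data in the original databases, so the new table itself stays small.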

Consequently, the result of a web query (e.g. a gene's symbol) would be a picture with multiple tracks: the top track showing the relevant portion of the human genome map, with one subsequent track for each SNP database. Features would include zooming in to a single SNP to show the above list of annotations, as well as zooming out to show SNPs in the regions flanking the gene of interest.
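
The per-track lookup behind such a view could be sketched as below, assuming each database's SNP positions on a chromosome are kept as a sorted list. The function names and the flank parameter (which stands in for zooming out) are hypothetical.

```python
from bisect import bisect_left, bisect_right


def snps_in_region(sorted_positions, gene_start, gene_end, flank=0):
    """Return the positions within [gene_start - flank, gene_end + flank].

    sorted_positions must be sorted in ascending order.
    """
    lo = bisect_left(sorted_positions, gene_start - flank)
    hi = bisect_right(sorted_positions, gene_end + flank)
    return sorted_positions[lo:hi]


def tracks_for_gene(tracks, gene_start, gene_end, flank=0):
    """Build one list of SNP positions per source database (one per track)."""
    return {
        name: snps_in_region(positions, gene_start, gene_end, flank)
        for name, positions in tracks.items()
    }
```

Increasing flank widens the window around the gene, which corresponds to zooming out to see neighbouring SNPs.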