Hubans, Christine;Kerkaert, Jean Pierre;Van Hoecke, Marie Pierre - XX_frag: design of cDNA sequences for microarrays-application to CGH

ECCB 2002 Poster

Title: XX_frag: design of cDNA sequences for microarrays-application to CGH	P66
Hubans, Christine; Kerkaert, Jean Pierre; Van Hoecke, Marie Pierre hubans@lifl.fr, jpk@lille.inserm.fr, vanhoeck@lifl.fr Genopole de Lille, FRANCE

1. Introduction
Biochips, or microarrays, allow faster and more efficient experiments in genome studies. The probes, which are cDNA sequences or genomic sequences, are spotted on a slide. The targets, whole-cell DNA labelled with the fluorochrome Cy3 for the first set (control) and Cy5 for the second set (test), are then hybridized on this slide.
Complementary strands hybridize. The readout of the experiment is the fluorescence intensity, which indicates the presence or absence of a given DNA sequence. In a CGH (Comparative Genomic Hybridization) experiment, this readout reveals an abnormality in the test genome, for instance a deletion or an amplification present in tumour cells.
Probe selection must satisfy several requirements. The first is the specificity of the cDNA sequences (to avoid cross-hybridization); the second is the specificity of the primers (to ensure that the right cDNA sequences are produced). The third, specific to CGH experiments, is a regular distribution of the probe sequences over the genome or over part of it.
In this article we present our probe selection method and the XX_frag software.

2. Algorithms
The algorithm follows three major steps:
- step 1: extraction of a distributed region of a genome.
- step 2: raw computation of fragments over the virtual genome.
- step 3: specificity assessment of the calculated probes using BLAST and Primer3, iterating back to step 2 until sufficient specificity is obtained for both the sequence and the pair of primers.

2.1 Extraction of a distributed region of a genome.
The user chooses the genomic region to study (for instance one chromosome, cytogenetic bands, genes, exons/total DNA, transcripts...). All the selected segments are then assembled into a virtual genome, which we call the "Pseudogenome". We record the positions of the junctions, called GAPs, between consecutive segments.
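The construction of the Pseudogenome can be illustrated with a minimal Python sketch. It is not the XX_frag code: the function name `build_pseudogenome`, the segment tuples and the half-open coordinate convention are assumptions made for the example.

```python
from itertools import accumulate

def build_pseudogenome(segments):
    """Concatenate selected segments into a virtual genome.

    segments: list of (name, real_start, real_end) tuples, half-open
              intervals on the real genome, in the chosen order
              (hypothetical representation).
    Returns (L, gaps): the Pseudogenome length and the positions of the
    junctions (GAPs) between consecutive segments.
    """
    lengths = [end - start for _, start, end in segments]
    cumulative = list(accumulate(lengths))  # end of each segment on the Pseudogenome
    L = cumulative[-1] if cumulative else 0
    gaps = cumulative[:-1]                  # one junction between each pair of neighbours
    return L, gaps

# Three illustrative segments (made-up coordinates).
segments = [("exonA", 1000, 1400), ("exonB", 5000, 5250), ("exonC", 9000, 9600)]
L, gaps = build_pseudogenome(segments)
```

With these made-up segments the Pseudogenome is 1250 bases long and has two GAPs, at positions 400 and 650.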

2.2 Raw computation of fragments over the extracted sequence or Pseudogenome.
We designed this computation to obtain DNA fragments that follow the user's parameters. The fragments are regularly distributed over the Pseudogenome according to: the length of the virtual genome or Pseudogenome (L), the number of fragments (N), the length of each probe (T) and the distance (D) between two consecutive probes. Each probe is defined by two positions on the Pseudogenome, its start position and its end position. The computation needs an initial point (I), which is either defined by the user or chosen randomly by the algorithm. The start and end positions of the i-th fragment are given by (exp 1):
Start[i] = I + (i-1) * (T+D) and End[i] = Start[i] + T (exp 1)
However, the location of each fragment is checked against the GAP positions. A fragment that spans two different biological entities is of no experimental interest, because a single probe covering two entities cannot test either of them. If fragment i, defined by (Start[i], End[i]), is split by a GAP, XX_frag keeps the longest part of the fragment, Start[i]-GAP or GAP-End[i]. This part is then stretched so as to form a probe of length T, separated from probe (i+1) by a distance D. The extension is made possible by the margins. To restore fragment i to length T, the distance is first reduced, by at most the margin on D. If this margin is not sufficient, the length is reduced, by at most the margin on T. Thus, fragment i has a length in [T-margin, T] and its distance to fragment (i+1) is in [D-margin, D]. (figure 1)
Two fragments cannot overlap, because the margin on T is less than T/2 and the margin on D is less than D/2. Fragments that do not meet the margin criteria are eliminated, so that only correct fragments are kept.
(figure 1)
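The placement rule (exp 1) and the GAP check can be sketched as follows. This is one possible reading of the description, not the XX_frag implementation: the function names, the half-open intervals, the choice to stretch the kept part away from the GAP, and the decision to drop fragments containing more than one GAP are all assumptions. The sketch also does not check whether the stretched part runs into another GAP.

```python
def raw_fragments(I, T, D, N):
    """(exp 1): Start[i] = I + (i-1)*(T+D), End[i] = Start[i] + T, for i = 1..N."""
    return [(I + (i - 1) * (T + D), I + (i - 1) * (T + D) + T)
            for i in range(1, N + 1)]

def adjust_for_gaps(fragments, gaps, T, margin_T, margin_D):
    """Keep fragments that lie in a single segment, repairing split ones if possible."""
    kept = []
    for start, end in fragments:
        inside = [g for g in gaps if start < g < end]
        if not inside:
            kept.append((start, end))      # not split by any GAP
            continue
        if len(inside) > 1:
            continue                       # assumption: too fragmented, eliminate
        g = inside[0]
        if g - start >= end - g:
            # Longest part is Start[i]-GAP: stretch it leftwards, using at most
            # margin_D of the neighbouring spacing.
            new_start, new_end = max(g - T, start - margin_D), g
        else:
            # Longest part is GAP-End[i]: stretch it rightwards in the same way.
            new_start, new_end = g, min(g + T, end + margin_D)
        if new_end - new_start >= T - margin_T:
            kept.append((new_start, new_end))  # length is within [T-margin, T]
        # otherwise the margins are not sufficient and the fragment is eliminated
    return kept

# Illustrative values: T = 100, D = 50, 10% margins, initial point I = 0.
fragments = raw_fragments(I=0, T=100, D=50, N=4)
selected = adjust_for_gaps(fragments, gaps=[395, 500], T=100, margin_T=10, margin_D=5)
```

Here the third raw fragment (300, 400) is split by the GAP at 395 and becomes (295, 395), at distance 45 from the previous fragment; the fourth, (450, 550), is cut in half by the GAP at 500 and is eliminated.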
Thus, there are six parameters. Two of T, D and N must be supplied; the third is calculated from the total length of the Pseudogenome (L) (exp 2):
L = N * (T+D) (exp 2)
The other parameters are optional. By default, the margins are 10% and the initial point is chosen randomly.
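As an illustration of (exp 2), the missing parameter can be derived from the two supplied ones. The function name `complete_parameters` and the use of integer division (rounding down) are assumptions of this sketch; the abstract does not say how non-integer results are handled.

```python
def complete_parameters(L, T=None, D=None, N=None):
    """Solve L = N * (T + D) for whichever of T, D, N is missing."""
    missing = [name for name, value in (("T", T), ("D", D), ("N", N)) if value is None]
    if len(missing) != 1:
        raise ValueError("exactly two of T, D and N must be supplied")
    if N is None:
        N = L // (T + D)      # number of fragments that fit
    elif T is None:
        T = L // N - D        # probe length left once the spacing is removed
    else:
        D = L // N - T        # spacing left once the probe length is removed
    if N <= 0 or T <= 0 or D < 0:
        raise ValueError("parameters are not compatible with L")
    return T, D, N

# Illustrative values only.
params_from_lengths = complete_parameters(L=1_000_000, T=500, D=1500)
params_from_count = complete_parameters(L=1_000_000, T=500, N=500)
```

Both calls describe the same layout: 500 probes of 500 bases separated by 1500 bases over a Pseudogenome of one million bases.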
Once this work is finished, we have a list of coordinates of fragments that are correctly positioned on the Pseudogenome. The nucleic sequences must then be retrieved from the real genome. For this step, the real coordinates are computed from the fragment positions on the Pseudogenome and the references of all the segments that compose it. There is one particular case of sequence retrieval: when a fragment covers several exons. This case is handled by computing the real start and the exon end.
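The conversion from Pseudogenome coordinates back to real coordinates can be sketched as below. The function name `to_real_coordinates`, the segment tuples and the half-open intervals are assumptions for the example; a fragment that covers several segments (for instance several exons) is returned as one piece per segment.

```python
def to_real_coordinates(segments, start, end):
    """Map the Pseudogenome interval [start, end) onto the real genome.

    segments: list of (name, real_start, real_end), half-open, in
              Pseudogenome order (hypothetical representation).
    Returns a list of (name, real_start, real_end) pieces.
    """
    pieces = []
    offset = 0  # Pseudogenome position where the current segment begins
    for name, seg_start, seg_end in segments:
        seg_len = seg_end - seg_start
        lo = max(start, offset)
        hi = min(end, offset + seg_len)
        if lo < hi:  # the fragment overlaps this segment
            pieces.append((name, seg_start + lo - offset, seg_start + hi - offset))
        offset += seg_len
    return pieces

# Made-up segments occupying Pseudogenome positions 0-400, 400-650 and 650-1250.
segments = [("exonA", 1000, 1400), ("exonB", 5000, 5250), ("exonC", 9000, 9600)]
single = to_real_coordinates(segments, 100, 200)
spanning = to_real_coordinates(segments, 350, 450)
```

The first fragment falls entirely within one segment; the second crosses the junction at 400 and is therefore retrieved as two pieces, the end of the first segment and the beginning of the second.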

2.3 Checking of fragment specificity to avoid cross-hybridization.
At this stage the probes are not yet known to be specific. The test is an alignment against a bank of nucleic sequences that all belong to the same genome. The software used for this step is BLASTn. The results are processed by the algorithm using two parameters: the maximal size of a match and the percentage of homology. A fragment is specific when both values are below these parameters; it is then stored. Otherwise, if either value exceeds the corresponding parameter, the fragment is not specific and is sent back to the computing module to be modified. This iteration continues until all the fragments are specific. The computing module modifies the fragment within the margins. If a fragment cannot be made specific because the margins are not sufficient, it is eliminated.
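The decision rule applied to the alignment results can be sketched as a simple filter. This sketch does not call BLAST: it assumes the hits have already been parsed into (match length, percentage of homology) pairs and that the alignment of the fragment with its own locus has been removed. The function name, the data layout and the treatment of values equal to a threshold as non-specific are assumptions.

```python
def is_specific(hits, max_match_length, max_homology_percent):
    """Return True when every hit is below both thresholds.

    hits: iterable of (match_length, homology_percent) pairs for
          alignments other than the fragment's own locus.
    """
    return all(length < max_match_length and homology < max_homology_percent
               for length, homology in hits)

# Illustrative hits and thresholds (not values from the poster).
weak_hits = [(18, 70.0), (25, 65.5)]
strong_hits = [(18, 70.0), (120, 95.0)]
probe_a_ok = is_specific(weak_hits, max_match_length=30, max_homology_percent=80.0)
probe_b_ok = is_specific(strong_hits, max_match_length=30, max_homology_percent=80.0)
```

A fragment for which the filter returns False would be sent back to the computing module, shifted or shortened within its margins, and tested again.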
The last part is primer design with the Primer3 software. There is an iteration process on the specificity of the primers. Here the parameters are melting temperature, GC percentage, ...

3. Implementation
The first part, the extraction of a region of interest, is performed on a database: the Santa Cruz database at http://genome.ucsc.edu/, which is explored on the Lille Genopole site. This database is implemented in MySQL and is associated with sequence files. The extraction of a genomic region is performed by a PHP script, and the computation is written in C. All the result files are stored and users can retrieve them. These files contain the probe sequences in FASTA format.

4. Conclusion
XX_frag is software intended for biologists. So far it supports probe selection for human CGH. We will add other genomes, both public and private. We also wish to allow users to work on a family of genes, such as apoptosis genes. Many improvements to this tool remain possible.

(for the figures see web representation of the poster abstract)

[1] V. Cheung, M. Morley, F. Aguilar, A. Massimi, R. Kucherlapati, and G. Childs, "Making and reading microarrays", Nature Genetics, Vol. 21, pp. 15-19, January 1999
[2] T.R. Hughes, M. Mao, A.R. Jones, "Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer", Nature Biotechnology, Vol. 19, April 2001
[3] S. Rozen and H.J. Skaletsky, "Primer3"