Freschi, Valerio;Bogliolo, Alessandro - Accurate Base Calling from Multiple Electropherograms

ECCB 2002 Poster

Title: Accurate Base Calling from Multiple Electropherograms	P39
Freschi, Valerio; Bogliolo, Alessandro (alessandro.bogliolo@uniurb.it)
STI - University of Urbino - 61029 Urbino - Italy

DNA sequencing is an error-prone process composed of two main steps: generation of an electropherogram (or trace) representative of a DNA sample, and interpretation of the electropherogram in terms of base sequence. The first step entails chemical processing of the DNA sample, electrophoresis and data acquisition [1]; the second step, known as base calling, entails digital signal processing and decoding usually performed by software running on a PC [2].
In order to improve the accuracy and reliability of DNA sequencing, multiple experiments may be independently performed on the same DNA sample. In most cases, forward and reverse experiments are performed by sequencing a DNA segment from the two ends. Bases that appear at the beginning of the forward electropherogram appear (complemented) at the end of the reverse one. Since most of the noise sources are position-dependent (e.g., there is a sizable degradation of the signal-to-noise ratio during each experiment), starting from opposite sides provides valuable information for error compensation.
When forward and reverse electropherograms are available, the traditional approach to determine the unknown sequence consists of independently performing base calling on the two traces to obtain forward and reverse sequences, aligning the two sequences, and performing a minimum number of editing steps to obtain a consensus sequence. In this flow, the results of the two experiments are combined only after they have been independently decoded, without taking advantage of the availability of two electropherograms to reduce decoding uncertainties. Moreover, once base-calling errors have been made on each sequence, wrong bases take part in the alignment as if they were correct. In case of a mismatch between forward and reverse sequences, manual edits have to be performed by an experienced operator in order to make the correct decision, possibly looking back at the corresponding traces.
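
To illustrate the forward/reverse relationship, here is a minimal Python sketch. The helper name `reverse_complement` and the example reads are hypothetical and are not taken from the authors' implementation.

```python
# Hypothetical helper: express a reverse-strand read in forward-strand terms.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Complement every base (N stays N), then reverse the order."""
    return seq.upper().translate(COMPLEMENT)[::-1]

forward_read = "ACGGTCA"   # made-up example segment
reverse_read = "TGACCGT"   # the same segment read from the opposite end

# After reversal and complementation the two reads can be compared base by base.
assert reverse_complement(reverse_read) == forward_read
```

The same idea applies to the traces themselves: the sample order is reversed and the A/T and C/G channels are swapped.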

We propose a different approach to base calling from multiple electropherograms: we first obtain an averaged electropherogram by combining all the experiments available for the given DNA, then we perform base calling on the averaged electropherogram, directly obtaining the consensus base sequence. The rationale behind our approach is two-fold. First, electropherograms are much more informative than the corresponding base sequences, so that their comparison provides more opportunities for noise filtering and error correction. Second, each electropherogram is the result of a complex measurement experiment affected by random errors. Since the average of multiple independent measurements has a lower standard error, the averaged electropherogram has improved quality with respect to the original ones.
Averaging independent electropherograms is not a trivial task, since they usually have different starting points, different numbers of samples, different base spacing and different distortions. In order to compute the point-wise average of the traces, we first need to re-align the traces so that samples belonging to the same peak (i.e., representing the same base) are in the same position. The overall procedure is outlined below, assuming that forward and reverse traces are available and that the reverse trace has already been reversed and complemented.
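
The second point can be made concrete with the standard result for the mean of independent measurements. As a simplifying assumption (not stated in the abstract), suppose that after alignment each of the n traces s_i(x) equals the true signal plus zero-mean noise that is independent across traces and has the same standard deviation sigma:

```latex
\bar{s}(x) = \frac{1}{n}\sum_{i=1}^{n} s_i(x),
\qquad
\operatorname{Var}\!\left[\bar{s}(x)\right] = \frac{\sigma^{2}}{n},
\qquad
\operatorname{SE}\!\left[\bar{s}(x)\right] = \frac{\sigma}{\sqrt{n}}
```

Under these assumptions, averaging a forward and a reverse trace (n = 2) reduces the noise standard deviation by a factor of 1/sqrt(2), roughly 0.71. Real traces have position-dependent noise, so this is only an idealized bound.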
1. Perform independent base calling on the original traces annotating the position (i.e., the point in the trace) of each base.
2. Align the base sequences, possibly inserting a "-" character to represent a missing base (i.e., a gap).
3. Associate a virtual position with each missing base, assuming that it is equally spaced from the preceding and the following bases.
4. Number the bases according to the alignment, taking into account the presence of gaps. Aligned bases on the two traces are associated with the same number.
5. Re-position all bases on a common x axis. The new position (xk) of base k is computed as the average of the positions of the k-th bases on the two original traces.
6. Shrink the original traces in order to adapt them to the common x axis. The peaks associated with the k-th base on the two shrunk traces should be in position xk.
7. Re-sample the traces using a common sampling step.
8. Compute a sample-wise average of the two traces to obtain the averaged electropherogram.
9. Smooth the combined electropherogram to remove small artificial peaks caused by the above steps.
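
Steps 3 to 9 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' C implementation: all function names are hypothetical, each trace is a single channel (a real electropherogram has four, which would be warped identically), the aligned peak positions from steps 1, 2 and 4 are given as lists with `None` marking a gap, no gap occurs at either end, warping is piecewise linear, and smoothing is a moving average.

```python
from bisect import bisect_right

def fill_gaps(pos):
    """Step 3: give each gap (None) a virtual position, equally spaced
    between the nearest called bases (assumes no gap at either end)."""
    out = list(pos)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while out[j] is None:          # find the next called base
                j += 1
            step = (out[j] - out[i - 1]) / (j - i + 1)
            for k in range(i, j):
                out[k] = out[i - 1] + step * (k - i + 1)
            i = j
        i += 1
    return out

def interp(xs, ys, x):
    """Piecewise-linear interpolation of (xs, ys) at x; xs strictly increasing."""
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

def warp(trace, src_pos, dst_pos, grid):
    """Steps 6-7: stretch or shrink a trace so that base k moves from
    src_pos[k] to dst_pos[k], re-sampled at the points in grid."""
    idx = list(range(len(trace)))
    return [interp(idx, trace, interp(dst_pos, src_pos, x)) for x in grid]

def smooth(y, width=3):
    """Step 9: moving average (an assumed filter; the abstract names none)."""
    half = width // 2
    out = []
    for i in range(len(y)):
        window = y[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def merge_traces(tr_f, pos_f, tr_r, pos_r, step=1.0):
    """Return (xk, grid, merged trace) for one channel of two aligned traces."""
    pf, pr = fill_gaps(pos_f), fill_gaps(pos_r)
    xk = [(a + b) / 2 for a, b in zip(pf, pr)]             # step 5
    n = int((xk[-1] - xk[0]) / step) + 1
    grid = [xk[0] + i * step for i in range(n)]            # common sampling step
    wf = warp(tr_f, pf, xk, grid)
    wr = warp(tr_r, pr, xk, grid)
    avg = [(a + b) / 2 for a, b in zip(wf, wr)]            # step 8
    return xk, grid, smooth(avg)
```

The sketch only covers the span between the first and last aligned peaks; how the leading and trailing samples are handled is left open here.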

We tested our approach on a representative set of known DNA samples. Forward and reverse electropherograms were obtained using an ABI PRISM 310 Genetic Analyzer [3]. Phred [2] was used for base calling, while procedures for trace alignment and averaging were implemented in C and run on a PC under Linux.
For each sample we generated two sets of results: a consensus sequence obtained by applying the traditional approach (forward and reverse sequences were called by Phred and aligned, and their consensus was computed according to Table 1) and a merged sequence obtained by our approach (forward and reverse traces were aligned and averaged and base calling was performed by Phred on the averaged electropherogram). The quality of both sequences was evaluated by performing pair-wise alignment against the actual sequence.
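
One simple way to score a called sequence against the known one is the edit distance of their optimal pair-wise alignment. The Python sketch below assumes unit costs for substitutions, insertions and deletions; the abstract does not state the scoring scheme actually used, and the function name is hypothetical.

```python
def edit_distance(called: str, actual: str) -> int:
    """Minimum number of substitutions, insertions and deletions needed
    to turn `called` into `actual` (unit costs are an assumption)."""
    prev = list(range(len(actual) + 1))
    for i, c in enumerate(called, 1):
        cur = [i]
        for j, a in enumerate(actual, 1):
            cur.append(min(prev[j] + 1,              # base in called only
                           cur[j - 1] + 1,           # base in actual only
                           prev[j - 1] + (c != a)))  # match or substitution
        prev = cur
    return prev[-1]
```

With this scoring, an unrecognized base (N) counts as a mismatch; counting such bases separately would need an extra pass over the called sequence.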

Experimental results are reported in Table 2 in terms of quality (Q) of the original and merged electropherograms (computed from the error probabilities provided by Phred [2]), of unrecognized bases (N) and of calling errors (E) in the consensus and merged sequences. In general, the merged electropherograms improve the accuracy of the original ones and enable much more accurate base calling.
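
For reference, Phred relates the quality value q of a base to its estimated error probability p by q = -10 log10(p). The Python sketch below shows that per-base conversion only; how per-base values are aggregated into the trace-level Q reported here is not specified in this abstract, and the function names are hypothetical.

```python
import math

def phred_quality(p_error: float) -> float:
    """Phred quality value for a base with error probability p_error."""
    return -10.0 * math.log10(p_error)

def error_probability(q: float) -> float:
    """Inverse mapping: error probability for a Phred quality value q."""
    return 10.0 ** (-q / 10.0)
```

For example, an error probability of 1 in 1000 corresponds to a quality value of about 30.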

[1] Sanger, F. et al., "DNA sequencing with chain-terminating inhibitors", Proc. Natl. Acad. Sci. 74, 1977, pp. 5463-5467.
[2] Ewing, B. et al., "Base-calling of automated sequencer traces using Phred", Genome Research 8, 1998, pp. 175-194.
[3] ABI, ABI PRISM 310 Genetic Analyzer user's manual. PE Applied Biosystems, Foster City, CA.