Algorithms for large-scale problems in computational biology
Date of Award
Doctor of Philosophy (PhD)
Electrical Engineering and Computer Science
Computational biology, Solvent-accessible, Parallel computing
Computer Sciences | Molecular Biology | Numerical Analysis and Scientific Computing
Advancements in biological research have enabled researchers to obtain large amounts of data, especially on DNA and protein sequences. Algorithmic inventions to process biological data are necessary to match their increasing size. In this dissertation, we present algorithms for solving large-scale problems in sequence comparison, accessible surface area computation and string search.
Comparison of long biological sequences is of interest due to the availability of long DNA sequences such as chromosomes, which may be up to several hundred million base pairs long. Sequence comparison algorithms take time proportional to the product of the input sequences. We considered constant and affine gap penalty functions, full-sequence and subsequence matching, syntenic alignments, and also developed space-saving algorithms. Our algorithms solve all these problems in optimal O [Special characters omitted.] time, and O [Special characters omitted.] space.
Solvent accessible surface area (ASA) is used in determining the energy of protein molecules, and protein folding prediction using molecular dynamics simulation computes ASA repeatedly. Each atom in the protein is modeled as a sphere. Using domain specific knowledge, we show that the number of sphere intersections is O ( n ), where n is the number of atoms in the protein molecule. For computing sphere intersections, we present a hash-based algorithm that runs in O ( n ) expected sequential time and a sort-based algorithm that runs in worst-case O ( n log n ) sequential time. Optimal parallelization of both the algorithms are also presented. Our algorithms are a factor of n faster than currently known algorithms. We also present a Monte Carlo algorithm for computing the solvent accessible surface area, and provide error bounds as a function of the sample size.
PATRICIA tree, suffix tree and suffix array are useful data structures for string search because the search time using them is only a function of the length of the pattern, and not the length of the text. We present a fast algorithm to construct PATRICIA trees. Our algorithm can construct a PATRICIA tree in O ( n ) expected running time and O [Special characters omitted.] worst-case running time, where n is the number of the strings indexed by PATRICIA tree and m is the sum of the length of all the strings. Finally, we present two fast string search algorithms using PATRICIA trees. One algorithm uses linear space and searches pattern p with length | p | in O [Special characters omitted.] time when | p | = O [Special characters omitted.] , where w is the number of bits a processor can process in unit time (typically w = 32), and O [Special characters omitted.] time otherwise. The other algorithm uses O ( n log 2 | p |) space and searches a PATRICIA tree in O (log 2 | p |) time when | p | = O [Special characters omitted.] .
Surface provides description only. Full text is available to ProQuest subscribers. Ask your Librarian for assistance.
Futamura, Natsuhiko, "Algorithms for large-scale problems in computational biology" (2002). Electrical Engineering and Computer Science - Dissertations. 87.