Computational Identification of cis-Regulatory Elements and Prediction of Gene Expression Level

Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)


Electrical Engineering and Computer Science


Kishan G. Mehrotra

Second Advisor

Philip N. Borer

Third Advisor

Wenliang Du


Cis-elements, Gene expression level prediction, Gene regulation, Hammer, Motifs identification, Structured motifs

Subject Categories

Computer Engineering | Computer Sciences


This dissertation develops computational methods to discover cis-elements in the promoter regions of co-regulated genes and to predict gene expression levels from the identified cis-elements.

Discovering cis-elements in the promoter regions of co-regulated genes is an important problem in molecular biology and has recently received extensive attention. In my Ph.D. research I developed an algorithm, HAMMER, that is faster and more accurate than well-known tools currently used to identify cis-elements. The HAMMER algorithm searches for subsequences of a desired length whose frequency of occurrence is relatively high, while accounting for slightly perturbed variants using a hash table and modulo arithmetic. Candidate cis-elements are evaluated using profile matrices and a higher-order Markov background model. Simulation results show that HAMMER discovers more of the cis-elements present in the test sequences than two widely used motif-discovery tools (MDScan and AlignACE). HAMMER also produces very promising results on real data sets that contain many known cis-elements.
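To make the counting idea above concrete, here is a minimal Python sketch of frequency-based candidate search with a hash table and modulo arithmetic. It is an illustration under stated assumptions, not the HAMMER implementation: the base-4 rolling encoding, the restriction to single-mismatch variants, and every function name (`encode_kmers`, `neighbors`, `top_candidates`) are hypothetical choices made for this example, and the profile-matrix and Markov-background evaluation is not shown.

```python
from collections import Counter

BASES = "ACGT"
DIGIT = {base: i for i, base in enumerate(BASES)}


def encode_kmers(seq, k):
    """Yield the base-4 integer code of every k-mer in seq via a rolling update."""
    modulus = 4 ** k
    code = 0
    valid = 0  # consecutive unambiguous bases seen so far
    for base in seq.upper():
        if base not in DIGIT:
            code, valid = 0, 0  # restart after an ambiguous symbol such as N
            continue
        # The modulo drops the oldest base as the newest one is shifted in.
        code = (code * 4 + DIGIT[base]) % modulus
        valid += 1
        if valid >= k:
            yield code


def neighbors(code, k):
    """Yield the codes of all k-mers that differ from code at exactly one position."""
    for pos in range(k):
        weight = 4 ** pos
        digit = (code // weight) % 4
        for other in range(4):
            if other != digit:
                yield code + (other - digit) * weight


def decode(code, k):
    """Turn a base-4 code back into its nucleotide string."""
    chars = []
    for _ in range(k):
        chars.append(BASES[code % 4])
        code //= 4
    return "".join(reversed(chars))


def top_candidates(sequences, k, top_n=5):
    """Rank k-mers by their exact count plus the counts of one-mismatch variants."""
    exact = Counter()  # hash table: k-mer code -> number of occurrences
    for seq in sequences:
        exact.update(encode_kmers(seq, k))
    scored = {
        code: count + sum(exact.get(n, 0) for n in neighbors(code, k))
        for code, count in exact.items()
    }
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    return [(decode(code, k), score) for code, score in ranked[:top_n]]
```

Calling `top_candidates(promoter_sequences, k=8)` on a list of promoter strings would return the highest-scoring 8-mers with their variant-aware counts. Ranking by raw counts alone favors low-complexity sequence, which is why a background-model evaluation such as the one the abstract describes would follow this step.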

Building on the cis-elements found by the HAMMER algorithm, I further developed an algorithm to identify structured motifs, which consist of two simpler patterns (half-sites) separated by a gap, with no restriction on the number of nucleotides that may occur within the gap. First, the HAMMER algorithm is used to search for individual cis-elements, which serve as half-sites for creating structured motifs. These structured motifs are then evaluated based on the relative frequency of the half-sites and the distribution of gap lengths. Unlike other recent structured-motif detection algorithms, the new algorithm does not require the gap length to be prespecified. The algorithm has successfully extracted structured motifs from both synthetic and real test data.
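The gap-length idea above can be sketched as follows. This is a simplified illustration, not the dissertation's scoring scheme: it assumes exact string matches for the half-sites, pairs each left half-site with the nearest right half-site downstream, and summarizes the observed gaps with a hypothetical `concentration` statistic (the share of pairs that use the most common gap). All function names are invented for this example.

```python
from collections import Counter


def find_all(seq, pattern):
    """Return the start index of every (possibly overlapping) exact match."""
    starts = []
    pos = seq.find(pattern)
    while pos != -1:
        starts.append(pos)
        pos = seq.find(pattern, pos + 1)
    return starts


def gap_distribution(sequences, left, right):
    """Histogram of gap lengths from each left half-site to the nearest right one."""
    gaps = Counter()
    for seq in sequences:
        right_starts = find_all(seq, right)
        for start in find_all(seq, left):
            left_end = start + len(left)
            downstream = [r for r in right_starts if r >= left_end]
            if downstream:
                gaps[min(downstream) - left_end] += 1
    return gaps


def score_pair(sequences, left, right):
    """Summarise a candidate structured motif without fixing the gap in advance."""
    gaps = gap_distribution(sequences, left, right)
    total = sum(gaps.values())
    if total == 0:
        return None  # the two half-sites never co-occur in the required order
    gap, count = gaps.most_common(1)[0]
    return {"gap": gap, "support": total, "concentration": count / total}
```

A pair of half-sites whose gaps cluster tightly around one length yields a concentration near 1, while unrelated half-sites spread their gaps across many lengths. No gap length is supplied by the caller; it is read off the observed distribution.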

Gene expression level is influenced significantly by the presence or absence of cis-elements. I developed several classification systems in which the occurrences of both activator and repressor motifs constitute important inputs for predicting whether a gene will be up-regulated, down-regulated, or neither. I experimented with several classification approaches; the best performance was obtained using Support Vector Machine (SVM) models with linear kernels and a hierarchical structure. On Saccharomyces cerevisiae data, the SVM models yielded 71% accuracy for 3-category classification (up-regulated, down-regulated, neutral) and 85% accuracy for 2-category classification (up-regulated, down-regulated).
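One way a hierarchical arrangement of linear SVMs could look is sketched below. The abstract does not say how the hierarchy is organized, so the two-stage split used here (regulated versus neutral, then up versus down) is an assumption, as are the two-feature input (activator and repressor motif counts), the hyperparameters, and all function names. The trainer is a plain subgradient descent on the L2-regularized hinge loss, written in pure Python to keep the example self-contained; it stands in for whatever SVM software was actually used.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(X, y, lam=0.001, eta=0.01, epochs=1000):
    """Fit weights and bias by subgradient descent on the regularised hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            violated = label * (dot(w, x) + b) < 1  # inside the margin or misclassified
            w = [wi * (1 - eta * lam) for wi in w]  # step on the L2 penalty
            if violated:
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
                b += eta * label
    return w, b


def decide(model, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    w, b = model
    return 1 if dot(w, x) + b >= 0 else -1


def train_hierarchy(X, labels):
    """Stage 1 separates regulated from neutral genes; stage 2 separates up from down."""
    stage1_y = [-1 if lab == "neutral" else 1 for lab in labels]
    regulated = [(x, lab) for x, lab in zip(X, labels) if lab != "neutral"]
    stage2_X = [x for x, _ in regulated]
    stage2_y = [1 if lab == "up" else -1 for _, lab in regulated]
    return train_linear_svm(X, stage1_y), train_linear_svm(stage2_X, stage2_y)


def predict(models, x):
    """Classify one gene as 'up', 'down', or 'neutral'."""
    regulated_model, direction_model = models
    if decide(regulated_model, x) < 0:
        return "neutral"
    return "up" if decide(direction_model, x) > 0 else "down"


# Toy feature vectors: (activator motif count, repressor motif count). Invented data.
X_toy = [(3, 0), (4, 1), (0, 3), (1, 4), (0, 0), (1, 0), (0, 1)]
labels_toy = ["up", "up", "down", "down", "neutral", "neutral", "neutral"]
models_toy = train_hierarchy(X_toy, labels_toy)
```

With this layout, dropping the first stage leaves a plain 2-category classifier, which mirrors the two evaluation settings reported above. The toy data are invented purely to exercise the code and say nothing about the yeast results.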


Surface provides description only. Full text is available to ProQuest subscribers. Ask your Librarian for assistance.