Date of Award

June 2017

Degree Type


Degree Name

Doctor of Philosophy (PhD)


Electrical Engineering and Computer Science


Heng Yin


Binary Analysis, Machine Learning, Memory Forensics, Software Security, Vulnerability Search

Subject Categories



The past decade has witnessed an explosion of applications and devices. This big-data era challenges existing security technologies: new analysis techniques should scale to codebases of “big data” size; they should become smart and proactive, using the data to understand what the vulnerable points are and where they are located; and they should provide effective protection for the dissemination and analysis of data involving sensitive information on an unprecedented scale.

In this dissertation, I argue that code search techniques can boost existing security analysis techniques (vulnerability identification and memory analysis) in terms of scalability and accuracy. To demonstrate these benefits, I use code analysis to address two issues of code search: scalability and accountability. I further demonstrate the benefit of code search by applying it to scalable vulnerability identification [57] and cross-version memory analysis [55, 56].

First, I address the scalability problem of code search by learning “higher-level” semantic features from code [57]. Rather than conducting fine-grained testing on a single device or program, it has become much more crucial to scan devices and programs for vulnerabilities quickly at a “big data” scale. However, discovering vulnerabilities in “big code” is like finding a needle in a haystack, even when dealing with known vulnerabilities. This new challenge demands a scalable code search approach. To this end, I leverage successful techniques from image search in the computer vision community and propose a novel code encoding method for scalable vulnerability search in binary code. The evaluation results show that this approach can achieve comparable or even better accuracy and efficiency than the baseline techniques.
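
The abstract does not spell out the encoding itself, so the following is only a minimal sketch of one image-search-style idea, a bag-of-features encoding over a codebook, under stated assumptions. The per-block feature vectors, the codebook centroids, and all function names are hypothetical, and a real system would learn the codebook by clustering and precompute the corpus encodings rather than encoding at query time.

```python
import math

def nearest_codeword(vec, codebook):
    """Index of the codebook centroid closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))

def encode_function(block_features, codebook):
    """Encode one function as an L2-normalized histogram over codewords."""
    hist = [0.0] * len(codebook)
    for vec in block_features:
        hist[nearest_codeword(vec, codebook)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist)) or 1.0
    return [h / norm for h in hist]

def cosine(u, v):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(a * b for a, b in zip(u, v))

def search(query_blocks, corpus, codebook, top_k=3):
    """Rank corpus functions by similarity of their encodings to the query's."""
    q = encode_function(query_blocks, codebook)
    scored = [(cosine(q, encode_function(blocks, codebook)), name)
              for name, blocks in corpus.items()]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical data: each inner list is a numeric feature vector for one basic block.
codebook = [[1.0, 0.0], [5.0, 2.0], [9.0, 6.0]]
corpus = {
    "parse_header": [[1.1, 0.0], [5.2, 2.1], [4.9, 1.8]],
    "copy_buffer": [[9.1, 6.2], [8.8, 5.9], [1.0, 0.2]],
    "checksum": [[1.0, 0.0], [0.9, 0.1]],
}
query = [[1.0, 0.1], [9.0, 6.1], [9.2, 5.8]]

results = search(query, corpus, codebook, top_k=2)
```

Because every function collapses to a fixed-length vector, comparing two functions costs one dot product regardless of their size, which is what makes this style of encoding attractive at scale.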

Second, I tackle the accountability issues left in the vulnerability search problem by designing vulnerability-oriented raw features [58]. Similar code does not always represent a similar vulnerability, so feature engineering for code search should focus on semantic-level features rather than syntactic ones. I propose to extract conditional formulas as higher-level semantic features from the raw binary code to conduct the code search. A conditional formula explicitly captures two cardinal factors of a vulnerability: 1) erroneous data dependencies and 2) missing or invalid condition checks. As a result, binary code search on conditional formulas produces significantly higher accuracy and provides meaningful evidence for human analysts to further examine the search results. The evaluation results show that this approach can further improve the search accuracy of existing bug search techniques with very reasonable performance.
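
To make the two factors concrete, here is a toy sketch, not the dissertation's actual representation, that models a conditional formula as an action (the operation and the data it depends on) plus the set of conditions guarding it. The class, the string-based formulas, and the Jaccard comparison are all illustrative assumptions; a real system would work on symbolic expressions recovered from binary code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConditionalFormula:
    action: str            # hypothetical: a sink operation and its data dependencies
    conditions: frozenset  # hypothetical: guards that must hold before the action runs

def similarity(a, b):
    """Jaccard similarity over guards; zero if the actions differ."""
    if a.action != b.action:
        return 0.0
    union = a.conditions | b.conditions
    if not union:
        return 1.0
    return len(a.conditions & b.conditions) / len(union)

def extra_guards(known_vuln, candidate):
    """Guards present in the candidate but absent from the known-vulnerable formula."""
    return sorted(candidate.conditions - known_vuln.conditions)

# Known-vulnerable pattern: a copy whose length is never bounded.
vuln = ConditionalFormula("memcpy(dst, src, len)", frozenset({"src != NULL"}))

unpatched = ConditionalFormula("memcpy(dst, src, len)", frozenset({"src != NULL"}))
patched = ConditionalFormula("memcpy(dst, src, len)",
                             frozenset({"src != NULL", "len <= sizeof(dst)"}))
unrelated = ConditionalFormula("free(ptr)", frozenset({"ptr != NULL"}))

scores = {
    "unpatched": similarity(vuln, unpatched),
    "patched": similarity(vuln, patched),
    "unrelated": similarity(vuln, unrelated),
}
evidence = extra_guards(vuln, patched)
```

In this sketch the patched candidate scores lower than the unpatched one even though the copy itself is identical, and the differing guard is the kind of evidence an analyst could inspect.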


Finally, I demonstrate the potential of the code search technique in the memory analysis field, and apply it to address the cross-version issue in memory forensics [55, 56]. Memory analysis techniques for COTS software usually rely on so-called “data structure profiles” for their binaries. Constructing such profiles requires expert knowledge about the internal workings of a specific software version, and it remains a cumbersome manual effort most of the time. I propose to leverage the code search technique to enable a notion named “cross-version memory analysis”, which can update a profile for a new version of a piece of software by transferring the knowledge from the model that has already been trained on its old version. The evaluation results show that the code-search-based approach advances existing memory analysis methods by reducing the manual effort while maintaining reasonable accuracy. With the help of collaborators, I further developed two plugins for the Volatility memory forensic framework [2], and show that each of the two plugins can construct a localized profile to perform specified memory forensic tasks on the same memory dump, without the need for manual effort in creating the corresponding profile.
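
One way to picture profile transfer is sketched below, purely as an illustration under assumptions that the abstract does not state. Suppose the old profile records which function reveals each field's offset; code search then finds the best-matching function in the new binary, and the offset observed there becomes the new profile entry. The token features, Jaccard matching, threshold, function names, and offsets are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def transfer_profile(old_profile, old_funcs, new_funcs, threshold=0.5):
    """Build a profile for the new version.

    old_profile: field name -> name of the old function that reveals its offset
    old_funcs / new_funcs: function name -> {"features": [...], "offset": int}
    Fields whose best match scores below the threshold are left out.
    """
    new_profile = {}
    for field, old_name in old_profile.items():
        old_features = old_funcs[old_name]["features"]
        best_name, best_score = None, 0.0
        for name, info in new_funcs.items():
            score = jaccard(old_features, info["features"])
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= threshold:
            new_profile[field] = new_funcs[best_name]["offset"]
    return new_profile

# Hypothetical old-version knowledge and new-version binary.
old_funcs = {
    "get_pid": {"features": ["mov", "load", "ret", "call:lock"], "offset": 0x2A4},
}
new_funcs = {
    "sub_401000": {"features": ["mov", "load", "ret", "call:lock", "nop"], "offset": 0x2B0},
    "sub_402000": {"features": ["push", "xor", "jmp"], "offset": 0x10},
}
old_profile = {"task.pid": "get_pid"}

new_profile = transfer_profile(old_profile, old_funcs, new_funcs)
```

Dropping low-confidence matches rather than guessing keeps a wrong offset out of the profile, at the cost of leaving some fields for manual follow-up.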


Open Access

Included in

Engineering Commons