Efficient Detection of Mutated Clones Using Augmented Scope Trees

Date of Award

May 2019

Degree Type


Degree Name

Doctor of Philosophy (PhD)


Electrical Engineering and Computer Science


James W. Fawcett


Code Analysis, Code Clones, Content Based Hashing, Latent Errors, Mutated Clones, Refactoring

Subject Categories



Clone detection is the process of detecting similar segments of code in one or more source files. The existence of clones in a software project makes maintenance burdensome. Detecting clones is, therefore, an important endeavor in software quality improvement. Clones evolve after they are introduced in a software project. As part of the clone detection, detecting modified parts of a clone may create better opportunities for refactoring and detecting latent errors.

Software developers and researchers have been working on methods of clone detection for more than a decade. Various clone detection techniques have been used focusing on similarity of segments of code. Token-based techniques are used to detect subsequences from streams of tokens. Some of the techniques use suffix trees to efficiently detect similar subsequences. CCFinder and CP-Miner are two prominent token-based clone detection techniques. Unfortunately, they may detect clones that include incomplete parts of two or more adjacent function scopes and may not be suitable for refactoring.

Tree-based techniques, on the other hand, use tree representations of source code to detect clones. Various approaches are proposed to detect syntactically bound clones using tree-based techniques. As direct tree comparison is an inefficient technique, some of the techniques have slow performance. Deckard combined a tree-based approach with a locality sensitive hashing (LSH) technique to detect clones efficiently. It represents frequency of parse tree elements using vectors. After merging these vectors, LSH is used to group vectors using a distance measure. These result in clones that are relatively better for refactoring than token-based clones. However, tree-based approaches have either performance issues or the mitigation used to improve performance may result in clones that are not very useful for some software engineering tasks such as refactoring, e.g., partial scopes from adjacent scope blocks.

We proposed a clone detection technique based on efficiently searching scope trees derived from source code and using some novel hashing techniques to find clones on complete scope boundaries with better performance than existing techniques. Furthermore, our technique reports mutations within clones, that can be useful for detecting sources of error and for refactoring. We use augmented scope trees to represent source code, and find clones. Like other techniques, we detect similar segments of code. However, as part of our clone detection, modified parts of the scopes are identified. These create better opportunities for refactoring and detection of latent errors. Our approach detects clones efficiently using a two-level hashing mechanism. Scope tree nodes are grouped based on their cyclomatic complexity. The nodes are further sub-grouped using secondary hashing based on additional metrics computed for each scope node, such as number of tokens, number of scope children, and number of unique identifiers. Nodes in each subgroup are compared using edit distance with and without identifier mapping (which allows us to ignore simple name changes), to detect clones.

As a result, we detected clones in open source projects such as OpenSSL having more than 400,000 lines of code in 15 seconds on average. The clones detected are also suitable for refactoring and latent error detection because we report clones on scope boundaries, and report mutations within clones. In addition, we were able to process the entire OpenJDK10, having different programing languages including C, C++, and Java and 6.7 million lines of code, in 10 minutes on average. This demonstrates that our technique can be applied successfully to industrial scale applications.


SURFACE provides description only. Full text may be available to ProQuest subscribers. Please ask your Librarian for assistance.

This document is currently not available here.