Date of Award

December 2019

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Electrical Engineering and Computer Science

Advisor(s)

Sucheta Soundarajan

Keywords

data collection, network analysis, network crawling, network sampling, social network mining

Subject Categories

Engineering

Abstract

In recent years, researchers have increasingly used OSN data to study human behavior. Before such a study can begin, one must first obtain appropriate data. A platform, e.g, Facebook or Twitter, may provide an API for accessing data, but such APIs are often ratelimited, restricting the amount of data that an individual collects in a given amount of time. In order for the data collector to efficiently collect data, she needs to make intelligent use of her limited budget. Therefore, when collecting data, efficiency is extremely important. We consider the problem of network sampling through crawling, in which the data collectors have no knowledge of the network of interest except the identity of a starting node. The data collector can expand the observed sample by querying an observed node. While the network science literature has proposed numerous network crawling methods, it is not always easy for the data collector to select an appropriate method: methods that are successful on one network may fail on other networks.

Here, we show that the performance of network crawling methods is highly dependent on the network structural properties. We identify three important network properties: community separation, average community size, and node degree. In addition, we provide guidelines to data collectors on how to select an appropriate crawling method for a particular network. Secondly, we propose a novel crawling algorithm, called DE-Crawler, and demonstrate that it performs the best across different network domains. Lastly, we consider the scenario in which there is are errors in the data collection process. These errors then lead to errors in a subsequent analysis task. Therefore, it is important for a data analyst to know if a collected sample is trustworthy. We introduce a robustness measure called sampling robustness, which measures how robust a network is under random edge deletion with respect to sampling. We demonstrate that sampling robustness highly depends on the network properties and users can estimate sampling robustness from the obtained sample.

Access

Open Access

Included in

Engineering Commons

Share

COinS