Date of Award

Spring 5-22-2021

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Electrical Engineering and Computer Science

Advisor(s)

Reza Zafarani

Keywords

Machine Learning, Network Embedding, Network Mining, Network Sampling, Similarity, Stability

Subject Categories

Engineering

Abstract

Machine Learning on graphs has become an active research area due to the prevailing graph-structured data in the real world. Many real-world applications can be modeled with graphs. Modern application domains include web-scale social networks [26], recommender systems, knowledge graphs, and biological or protein networks. However, there are various challenges. First, the graphs generated from such applications are often large. Moreover, in some scenarios, the complete graph is not available, e.g., for privacy reasons. Thus, it becomes impractical to perform network analysis or compute various graph measures. Hence, graph sampling becomes an important task.Sampling is often the first step to handle any type of massive data. The same applies to graphs as well, which leads to many graph sampling techniques. Sampling Techniques include Node-based (e.g., Random Node Sampling), Edge-based (e.g., Random Edge Sampling) and Traversal-based (e.g., Random Walk Sampling). Graphs are often analyzed by first embedding (i.e., representing) them in some matrix/vector form with some number of dimensions. Various graph embedding methods have been developed to convert raw graph data into high dimensional vectors while preserving intrinsic graph properties [3]. The embedding methods focus on the node-level, edge-level [28], a hybrid, or at the graph level. This thesis focuses on graph-level embeddings which allows calculating similarity between two graphs. With the knowledge of embedding and sampling methods, the natural questions to ask are: 1) What is a good sampling size to ensure embeddings are similar enough to that of the original graph? 2) Do results depend on the sampling method? 3) Do they depend on the embedding method? 4) As we have embeddings, can we find some similarity between the original graph and sample? 5) How do we decide if the sample is good or not? How do we decide if the embedding is good or not? Essentially, if we have an embedding method and a sampling strategy, can we find the smallest sampling size that will give an ε-similar embedding to that of the original graph? We will try to answer the above questions in the thesis and give a new perspective on graph sampling. The experiments are conducted on graphs with thousands of edges and nodes. The datasets include graphs from social networks, autonomous systems, peer-to-peer networks, and collabo- ration networks. Two sampling methods are targeted namely - Random Node Sampling, and Random Edge Sampling. Euclidean distance is used as a similarity metric. Experiments are car- ried out on Graph2vec, and Spectral Features(SF) graph embedding methods. Univariate analysis is performed to decide a minimum sample which gives, e.g., 40% minimum sample for 80% similarity. We also design a Regression model which predicts similarity for a given sampling size and graph properties. Finally, we analyze the stability of the embedding methods, where we find that that e.g., Graph2Vec is a stable embedding method.

Access

Open Access

Available for download on Saturday, June 25, 2022

Included in

Engineering Commons

Share

COinS