Enhancing survivability and reliability in the cloud

Date of Award

June 2017

Degree Name

Doctor of Philosophy (PhD)


Electrical Engineering and Computer Science


Jian Tang


Cloud computing has evolved into an important distributed computing model, enabling infrastructure, information, and software to be used as shared resources over the network in an on-demand manner. A complex cloud computing system is a large-scale network that may include multiple Data Centers (DCs) distributed all over the world. Given the large number of connected servers across the planet, server and network failures are inevitable. Meanwhile, the power-demanding nature of data centers urges us to enhance reliability by allocating existing resources efficiently instead of simply adding more servers. It is therefore meaningful to study how to build survivable and reliable cloud computing systems while maximizing power efficiency.

In this dissertation, we first studied survivable Virtual Machine (VM) management by proposing a general optimization framework, designing a polynomial-time optimal algorithm and an efficient heuristic algorithm for the virtual link mapping and VM placement subproblems, respectively, and designing an effective algorithm to solve the two subproblems jointly. In simulation, we reduced the reserved bandwidth by at least 96% and achieved a comparable number of active servers relative to a baseline algorithm based on first fit decreasing and single shortest path.
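For readers unfamiliar with the first fit decreasing baseline named above, the following is a minimal sketch of the textbook heuristic applied to VM placement. It is not the dissertation's algorithm or its simulation code; the function name `first_fit_decreasing`, the single scalar demand per VM, and the uniform `server_capacity` are all assumptions made for illustration.

```python
def first_fit_decreasing(vm_demands, server_capacity):
    """Textbook first fit decreasing (illustrative, not the author's code).

    vm_demands: list of resource demands, one per VM (assumed scalar).
    server_capacity: capacity of every server (assumed uniform).
    Returns a list of servers, each a list of the demands placed on it.
    """
    servers = []  # demands placed on each active server
    free = []     # remaining capacity of each active server
    # Consider the largest VMs first.
    for demand in sorted(vm_demands, reverse=True):
        if demand > server_capacity:
            raise ValueError("VM demand exceeds server capacity")
        # Place the VM on the first active server that still has room.
        for i, remaining in enumerate(free):
            if demand <= remaining:
                servers[i].append(demand)
                free[i] -= demand
                break
        else:
            # No active server fits: activate a new one.
            servers.append([demand])
            free.append(server_capacity - demand)
    return servers


placement = first_fit_decreasing([4, 8, 1, 4, 2, 1], server_capacity=10)
print(len(placement), placement)
```

The number of active servers is simply `len(placement)`, which is the metric the comparison above refers to.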

We also studied the reliable VM management problem and proposed the first Deep Reinforcement Learning (DRL)-based continuous-time, event-driven resource allocation framework, which combines a deep neural network, built with an autoencoder and a novel weight-sharing structure, and an online deep Q-learning framework to handle the high-dimensional state space. Simulation results showed that, in generating reliable VM placements, our DRL-based solution reduced server power consumption by at least 47% with minor increases in job latency compared with the round-robin baseline algorithm.
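As background for the Q-learning component mentioned above, here is a minimal sketch of the standard one-step Q-learning update in tabular form. It is a deliberate simplification: the dissertation's framework is continuous-time, event-driven, and uses a deep neural network rather than a table, none of which is modeled here. The names `q_update`, `alpha` (learning rate), and `gamma` (discount factor), as well as the string state and action labels, are assumptions for illustration.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply one standard Q-learning step to the table q (illustrative).

    q: mapping from (state, action) to an estimated value.
    actions: the actions available in next_state.
    Returns the updated value of q[(state, action)].
    """
    # Value of the best action in the next state under current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target and update.
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)          # unseen (state, action) pairs start at 0.0
q[("s1", "a0")] = 2.0           # hypothetical existing estimate
new_value = q_update(q, "s0", "a0", reward=1.0, next_state="s1",
                     actions=["a0", "a1"], alpha=0.5, gamma=0.5)
print(new_value)
```

A deep Q-learning approach replaces the table `q` with a neural network that estimates the same quantity, which is what makes a high-dimensional state space tractable.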

Moreover, we studied reliable VM management in distributed DCs by formulating Virtual Server Provisioning and Selection (VSPS) as a mixed integer linear programming problem and proposing a novel optimization framework, under which we developed a polynomial-time ln(N)-approximation algorithm, along with a heuristic algorithm that jointly solves the subproblems of VSPS. Simulation results showed that our algorithms provide close-to-optimal performance and achieve a cost reduction of 25% or more compared to a baseline algorithm.
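To illustrate what a logarithmic approximation guarantee looks like in practice, the sketch below shows the classic greedy algorithm for weighted set cover, a textbook example of a polynomial-time algorithm whose cost is within a logarithmic factor of optimal. This is not the VSPS algorithm, and the abstract does not say that VSPS reduces to set cover; the function name and the toy data are hypothetical.

```python
def greedy_weighted_set_cover(universe, subsets, costs):
    """Classic greedy weighted set cover (illustrative, not the VSPS algorithm).

    universe: iterable of elements that must be covered.
    subsets: dict mapping a subset name to the set of elements it covers.
    costs: dict mapping a subset name to its positive cost.
    Returns the list of chosen subset names, in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, float("inf")
        for name, members in subsets.items():
            gain = len(uncovered & members)
            if gain == 0:
                continue  # this subset covers nothing new
            ratio = costs[name] / gain  # cost per newly covered element
            if ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


subsets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}, "D": {1, 2, 3, 4, 5}}
costs = {"A": 2, "B": 2, "C": 2, "D": 10}
print(greedy_weighted_set_cover({1, 2, 3, 4, 5}, subsets, costs))
```

Each iteration picks the subset with the lowest cost per newly covered element, which is the selection rule behind the logarithmic bound for this problem.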

Finally, we studied reliability enhancement for Distributed Stream Data Processing Systems (DSDPSs) and designed a predictive DSDPS control framework, which consists of a two-tiered Deep Recurrent Neural Network (DRNN) model that accounts for co-location interference and a dynamic grouping mechanism for bypassing misbehaving workers. In addition, we implemented and tested our framework on Storm, a well-known DSDPS, and showed that our DRNN model outperformed AutoRegressive Integrated Moving Average (ARIMA) and Support Vector Regression (SVR) in terms of prediction accuracy, and that our framework introduced only minor performance degradation when misbehaving workers exist.
