Date of Award
Doctor of Philosophy (PhD)
Electrical Engineering and Computer Science
General-purpose Distributed Stream Data Processing Systems (DSDPSs) have attracted extensive attention from industry and academia in recent years. They are capable of processing unbounded big streams of continuous data in a distributed and real (or near-real) time manner. A fundamental problem in a DSDPS is the scheduling problem, i.e., assigning threads (carrying workload) to workers/machines with the objective of minimizing average end-to-end tuple processing time (or simply tuple processing time). A widely-used solution is to distribute workload over machines in the cluster in a round-robin manner, which is obviously not efficient due to the lack of consideration for communication delay among processes/machines. A scheduling solution makes a significant impact on the average tuple processing time. However, their relationship is very subtle and complicated. It does not even seem possible to have a mathematical programming formulation for the scheduling problem if its objective is to directly minimize the average tuple processing time.
In this dissertation, we first propose a model-based approach that accurately models the correlation between a scheduling solution and its objective value (i.e. average tuple processing time) for a given scheduling solution according to the topology of the application graph and runtime statistics. A predictive scheduling algorithm is then presented, which as- signs tasks (threads) to machines under the guidance of the proposed model. This approach achieves an average of 24.9% improvement over Storm’s default scheduler. However, the model-based approach still has its limitations: the model may not be able to fully capture the features of a DSDPS; prediction may not be accurate enough; and a large amount of high-dimensional data may lead to high overhead.
To address the limitations, we develop a model-free approach that can learn to control a DSDPS from its experience rather than adopting accurate and mathematically solvable system models, just as a human learns a skill (such as cooking, driving, swimming, etc.). Recent breakthrough of Deep Reinforcement Learning (DRL) provides a promising approach for enabling effective model-free control. The proposed DRL-based model-free approach minimizes the average end-to-end tuple processing time by jointly learning the system environment via collecting very limited runtime statistics and making decisions under the guidance of powerful Deep Neural Networks (DNNs). This approach achieves great performance improvement over the current practice and the state-of-the-art model-based approach.
Moreover, there is still room for improvement for the above model-free approach: For the above model-free approach and most existing methods, a user specifies the number of threads for an application in advance without knowing much about runtime needs, which, however, remains unchanged during runtime. This could severely affect the performance of a DSDPS. Therefore, we further develop another model-free approach using DRL, EXTRA, which enables the dynamic use of a variable number of threads at runtime. It has been shown by extensive experimental results, by adding this new feature, EXTRA can achieve further performance improvement and greater flexibility on scheduling.
Li, Teng, "Efficient Online Scheduling in Distributed Stream Data Processing
Systems" (2018). Dissertations - ALL. 861.