Relational databases, Parallel join, Data skew, Sort-based algorithm, Hash-based algorithm
Join is the most important and expensive operation in relational databases. The parallel join operation is very sensitive to the presence of the data skew. In this paper, we present two new parallel join algorithms for coarse grained machines which work optimally in presence of arbitrary amount of data skew. The first algorithm is sort-based and the second is hash-based. Both of these algorithms employ a preprocessing phase (prior to the redistribution phase) to equally partition the work among the processors. The proposed algorithms have been designed for memory resident-data. However, they can be extended to disk resident-data. These algorithms are shown to be theoretically as well as practically scalable. Experimental results are provided on the IBM SP-2.
Alsabti, Khaled and Ranka, Sanjay, "Skew-Insensitive Parallel Algorithms for Relational Join" (1998). College of Engineering and Computer Science - Former Departments, Centers, Institutes and Projects. 50.