Multiprocessing, Join Processing, Data Partitioning
Multiprocessor implementations of the relational database operators have recently received considerable attention in the literature [1-4, 8, 11]. Because the complexity of implementing the relational operators rests on the inter-node communication patterns involved in an operation, and because the Join traffic patterns subsume those of the remaining relational operators, most of this research has focused on Join algorithms. To exploit parallelism effectively in bucket-based join implementations, the domain of the joining attributes must be partitioned into subranges of equal workload; that is, the processing of each subrange should require roughly the same amount of time. A skewed distribution of workload significantly hinders performance. Because relations may exhibit a non-uniform attribute value distribution, possibly resulting from a previous operation, determining the subrange boundaries a priori can result in an unbalanced workload across the processors. Performance degradation in parallel systems employing such static subrange boundaries is demonstrated by Lakshmi and Yu. That study showed that even a low degree of attribute skew results in a significant performance penalty. This paper proposes a statistical algorithm for dynamically determining the domain partitioning in bucket-based join implementations. This statistics-based approach guarantees a near-uniform processor workload. A parameterization of the sample size versus the number of tuples is developed, and a proof of the validity of the approach is discussed. A simple illustrative example is presented.
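The contrast between static and sample-based subrange boundaries can be illustrated with a minimal sketch. It is not the algorithm developed in the report: it assumes plain random sampling with sample quantiles used as boundaries, and the function names, the sample size of 400, the four buckets, and the synthetic exponentially skewed attribute values are all hypothetical choices made for illustration.

```python
import bisect
import random


def sample_boundaries(values, num_buckets, sample_size, rng):
    """Estimate subrange boundaries from a random sample of attribute values.

    Assumption: evenly spaced sample quantiles serve as the split points.
    """
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    return [sample[(i * len(sample)) // num_buckets]
            for i in range(1, num_buckets)]


def static_boundaries(lo, hi, num_buckets):
    """Split [lo, hi] into equal-width subranges, ignoring the data."""
    width = (hi - lo) / num_buckets
    return [lo + i * width for i in range(1, num_buckets)]


def bucket_counts(values, boundaries):
    """Count how many values fall into each subrange (one per processor)."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        # bisect_right places a value equal to a boundary in the higher bucket.
        counts[bisect.bisect_right(boundaries, v)] += 1
    return counts


rng = random.Random(0)
# Hypothetical skewed join attribute: most values cluster near zero.
values = [int(rng.expovariate(1 / 100.0)) for _ in range(20000)]

static = bucket_counts(values, static_boundaries(min(values), max(values), 4))
sampled = bucket_counts(values, sample_boundaries(values, 4, 400, rng))

print("static boundaries: ", static)
print("sampled boundaries:", sampled)
```

Under these assumptions, the equal-width boundaries place most tuples in the first bucket, while the sampled boundaries spread them far more evenly. How large the sample must be for a given number of tuples is the question the report's parameterization addresses; the value used here is arbitrary.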
Frieder, Ophir, "Dynamic Range Partitioning in Multiprocessor Database Implementations" (1990). Electrical Engineering and Computer Science Technical Reports. 50.