Processing theta-joins using MapReduce 论文

2011引用 238

Cloud Computing and Resource ManagementAdvanced Database Systems and QueriesAdvanced Data Storage Technologies

云计算 Advanced Data Storage Technologies Advanced Database Systems and Queries Cloud Computing and Resource Management

作者

摘要

Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study the problem of how to map arbitrary join conditions to Map and Reduce functions, i.e., a parallel infrastructure that controls data flow based on key-equality only. Our proposed join model simplifies creation of and reasoning about joins in MapReduce. Using this model, we derive a surprisingly simple randomized algorithm, called 1-Bucket-Theta, for implementing arbitrary joins (theta-joins) in a single MapReduce job. This algorithm only requires minimal statistics (input cardinality) and we provide evidence that for a variety of join problems, it is either close to optimal or the best possible option. For some of the problems where 1-Bucket-Theta is not the best choice, we show how to achieve better performance by exploiting additional input statistics. All algorithms can be made 'memory-aware', and they do not require any modifications to the MapReduce environment. Experiments show the effectiveness of our approach.

作者查看全部 (2)

Mirek Riedewald

Alper Okcan

Processing theta-joins using MapReduce 论文

摘要

作者查看全部 (2)

相关技术

相关事件

相关文章