A fuzzy or similarity join is one of the most useful data processing and analysis operations for Big Data in a general context. It combines pairs of tuples for which the distance is lower than or equal to a given threshold. The fuzzy join is used in many practical applications, but it is extremely costly in time and space, and may even not be executed on large scale datasets. Although there have been some studies to improve its performance by applying filters, a solution of an effective fuzzy filter for the join has never been conducted. In this thesis, we thus focus on optimizing fuzzy big joins using filters.
To achieve these objectives, we first apply Bloom filters to fuzzy big join operations to eliminate most non-joining elements in input datasets before sending data to actual join processing. Thus, it reduce redundant intermediate data, unnecessary comparisons and avoid the data duplication. Another important proposal is Fuzzy Filters, which are probabilistic data structures designed to represent set(s) and its similarity elements. They are used to fast detect close elements of the set in given thresholds with small false positive rate and zero false negative rate. Moreover, these filters are not only independent with thresholds and distance measure functions but also easily updated. Therefore, it can be applied to a wide range of popular problems such as deduplication, errorcorrection, data cleaning, data integration, recursive joins and stream joins. Our improvement will significantly reduce the number of redundant computations, and the related overheads such as intermediate data, and communication for the deduplication operations.
We then make analysis comparisons of the fuzzy join algorithms persuasive based on a M − C − R cost model. As a result, with using the proposed filters, the fuzzy join operations can minimize disk I/O and communication costs. Moreover, the filters based fuzzy join operations are demonstrated to be more efficient than existing solutions through experimental evaluations in Spark. Experimental comparisons of different algorithms for fuzzy joins are examined with respect to intermediate data amount, the total output amount, the total execution time, and especially task timelines.
Keywords
Fuzzy join, Similarity join, MapReduce, Fuzzy Filter
Dr (HDR) Sofian Maabout, LABRI, Bordeaux Université (Rapporteur)
Pr Bernd Amann, LIP6, LIP6, Sorbonne Université (Examinateur)
Pr Olivier Pivert, IRISA, Université de Rennes 1/ENSSAT (Examinateur)
Dr Christophe Bobineau, LIG, UGA (Examinateur)
Dr Thuong-Cang Phan, CTU (co-encadrant de thèse)
Pr Anne Laurent, LIRMM, Université de Montpellier (co-directrice de thèse)
Pr Laurent d'Orazio, IRISA, Université de Rennes 1 (co-directeur de thèse)