License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ICDT.2018.20
URN: urn:nbn:de:0030-drops-86057
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2018/8605/
Go to the corresponding LIPIcs Volume Portal


Tao, Yufei

Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space

pdf-format:
LIPIcs-ICDT-2018-20.pdf (0.5 MB)


Abstract

In entity matching classification, we are given two sets R and S of objects where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match.

Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R times S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R times S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R|+|S|+|T|, thereby achieving a large performance gain in the (typical) scenario where |T|<<|R||S|.

This paper provides evidence on the feasibility of the new direction, by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.

BibTeX - Entry

@InProceedings{tao:LIPIcs:2018:8605,
  author =	{Yufei Tao},
  title =	{{Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space}},
  booktitle =	{21st International Conference on Database Theory (ICDT 2018)},
  pages =	{20:1--20:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-063-7},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{98},
  editor =	{Benny Kimelfeld and Yael Amsterdamer},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{http://drops.dagstuhl.de/opus/volltexte/2018/8605},
  URN =		{urn:nbn:de:0030-drops-86057},
  doi =		{10.4230/LIPIcs.ICDT.2018.20},
  annote =	{Keywords: Entity Matching, Linear Programming, Range Counting, Dominance Join, Massively Parallel Computation}
}

Keywords: Entity Matching, Linear Programming, Range Counting, Dominance Join, Massively Parallel Computation
Collection: 21st International Conference on Database Theory (ICDT 2018)
Issue Date: 2018
Date of publication: 05.03.2018


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI