用户名: 密码: 验证码:
Efficient entity resolution based on subgraph cohesion
详细信息    查看全文
  • 作者:Hongzhi Wang ; Jianzhong Li ; Hong Gao
  • 关键词:Entity resolution ; Data quality ; Graph cohesion
  • 刊名:Knowledge and Information Systems
  • 出版年:2016
  • 出版时间:February 2016
  • 年:2016
  • 卷:46
  • 期:2
  • 页码:285-314
  • 全文大小:1,266 KB
  • 参考文献:1.Duplicate detection, record linkage, and identity uncertainty: datasets. http://​www.​cs.​utexas.​edu/​users/​ml/​riddle/​data.​html . Accessed 6 Oct 2013
    2.Matching (graph theory) (2012). http://​en.​wikipedia.​org/​matching_​(graph_​theory) . Accessed 15 Oct 2012
    3.DBLP (2014). http://​www.​informatik.​uni-trier.​de/​~ley/​db/​ . Accessed 15 Jan 2014
    4.Arasu A, Ré C, Suciu D (2009) Large-scale deduplication with constraints using dedupalog. In: ICDE, pp 952–963
    5.Aslam JA, Pelekhov E, Rus D (2004) The star clustering algorithm for static and dynamic information organization. J Graph Algorithms Appl 8:95–129MathSciNet CrossRef MATH
    6.Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: WWW, pp 131–140
    7.Benjelloun O, Garcia-Molina H, Menestrina D, Su Q, Whang SE, Widom J (2009) Swoosh: a generic approach to entity resolution. VLDB J 18(1):255–276CrossRef
    8.Bhattacharya I, Getoor L (2004) Iterative record linkage for cleaning and integration. In: DMKD, pp 11–18
    9.Chaudhuri S, Chen B-C, Ganti V, Kaushik R (2007) Example-driven design of efficient record matching queries. In: VLDB, pp 327–338
    10.Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In: ICDE, p 5
    11.Chaudhuri S, Sarma AD, Ganti V, Kaushik R (2007) Leveraging aggregate constraints for deduplication. In: SIGMOD conference, pp 437–448
    12.Chen Z, Kalashnikov DV, Mehrotra S (2009) Exploiting context analysis for combining multiple entity resolution systems. In: SIGMOD conference, pp 207–218
    13.Dong X, Halevy AY, Madhavan J (2005) Reference reconciliation in complex information spaces. In: SIGMOD conference, pp 85–96
    14.Duda R, Hart P, Stork D (2001) Pattern classification. Wiley, Hoboken, NJMATH
    15.Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16CrossRef
    16.Hassanzadeh O, Chiang F, Miller RJ, Lee HC (2009) Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1):1282–1293
    17.Hassanzadeh O, Miller RJ (2009) Creating probabilistic databases from duplicated data. VLDB J 18(5):1141–1166CrossRef
    18.Jiang Y, Li G, Feng J, Li W (2014) String similarity joins: an experimental evaluation. PVLDB 7(8):625–636
    19.Kalashnikov DV, Mehrotra S (2006) Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans Database Syst 31(2):716–767CrossRef
    20.Kim HS, Lee D (2010) Harra: fast iterative hashed record linkage for large-scale data collections. In: EDBT, pp 525–536
    21.Koudas N, Saha A, Srivastava D, Venkatasubramanian S (2009) Metric functional dependencies. In: ICDE, pp 1275–1278
    22.Manning CD, Schütze H (1999) Foundations of statistical natural language processing. The MIT Press, Cambridge, MAMATH
    23.Menestrina D, Whang S, Garcia-Molina H (2010) Evaluating entity resolution results. PVLDB 3(1):208–219
    24.Micali S, Vazirani VV (1980) An \(\text{ o }(\sqrt{|V|}|e|\) ) algorithm for finding maximum matching in general graphs. In: FOCS, pp 17–27
    25.Michelson M, Knoblock CA (2006) Learning blocking schemes for record linkage. In: AAAI
    26.Michelson M, Knoblock CA (2009) Mining the heterogeneous transformations between data sources to aid record linkage. In: IC-AI, pp 422–428
    27.Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: KDD, pp 269–278
    28.Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64MathSciNet CrossRef MATH
    29.Shen W, DeRose P, Vu L, Doan A, Ramakrishnan R (2007) Source-aware entity matching: a compositional approach. In: ICDE, pp 196–205
    30.Shen W, Li X, Doan A (2005) Constraint-based entity matching. In: AAAI, pp 862–867
    31.Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: SIGMOD conference, pp 219–232
    32.Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: WWW, pp 131–140
    33.Yang X, Wang B, Li C (2008) Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD conference, pp 353–364
    34.Yannakakis M (1978) Node- and edge-deletion np-complete problems. In: STOC, pp 253–264
    35.Yin X, Han J, Yu PS (2007) Object distinction: distinguishing objects with identical names. In: ICDE, pp 1242–1246
  • 作者单位:Hongzhi Wang (1)
    Jianzhong Li (1)
    Hong Gao (1)

    1. Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
  • 刊物类别:Computer Science
  • 刊物主题:Information Systems and Communication Service
    Business Information Systems
  • 出版者:Springer London
  • ISSN:0219-3116
文摘
Entity resolution has wide applications and receives considerable attentions in literature. For entity resolution, similarity functions are often used to judge whether two data objects refer to the same real-world entity. However, the similar relations determined by many commonly used similarity functions lack transitivity. This fact results in the conflict that \(A\) and \(B\) refer to the same entity and \(B\) and \(C\) refer to the same entity, but \(A\) and \(C\) do not refer to the same entity. To address this problem and make the group-wise entity resolution results consistent with pairwise entity resolution, this paper models the entity resolution problem as the partition of the vertices in a weighted graph into cohesive subgraphs, which is proven to be co-NP-complete. To solve this problem, an approximate algorithm with approximation ratio bound is proposed. For performing entity resolution on a large data set efficiently, a heuristic algorithm is developed to address this problem. In order to implement the heuristic algorithm efficiently, a similarity measure compatible with many measures in common usage is presented. With such similarity measure, indices and efficient implementations for the heuristic algorithm are proposed. Extensive experiments have been performed to verify the efficiency and effectiveness of the methods in this paper.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700