Active transfer learning of matching query results across multiple sources
  • Authors: Jie Xin ; Zhiming Cui ; Pengpeng Zhao ; Tianxu He
  • Keywords: entity resolution ; active learning ; transfer learning ; convex optimization
  • Journal: Frontiers of Computer Science in China
  • Publication year: 2015
  • Publication date: August 2015
  • Year: 2015
  • Volume: 9
  • Issue: 4
  • Pages: 595-607
  • Full-text size: 626 KB
  • Author affiliations: Jie Xin (1) (2)
    Zhiming Cui (1)
    Pengpeng Zhao (1)
    Tianxu He (1)

    1. The Institute of Intelligent Information Processing and Application, Soochow University, Suzhou, 215006, China
    2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, 215006, China
  • Journal category: Computer Science
  • Journal subject: Computer Science, general
    Chinese Library of Science
  • Publisher: Higher Education Press, co-published with Springer-Verlag GmbH
  • ISSN: 1673-7466
Abstract
Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Many algorithmic approaches have been developed, and most achieve their best performance under supervised learning. However, the prohibitive cost of labeling training data is still a huge obstacle for detecting duplicate query records from online sources. Furthermore, the unique combinations of noisy data with missing elements make ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share learned common structures of similarity scoring problems between multiple sources. Although such techniques reduce the labeling cost so that it is linear with respect to the number of sources, their random sampling strategy does not adequately handle the common sample imbalance problem. In this paper, we present a novel multi-source active transfer learning framework to jointly select fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the classifiers that are learned can be both label-economical and flexible even for imbalanced or quality-diverse sources. We compare our method with state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm can achieve impressive performance with far fewer labeled samples for record matching with numerous and varied sources.
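
The intuition of labeling the most informative samples across all sources can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not specify. It assumes plain uncertainty sampling, a logistic match scorer, and a weight vector split into a shared part plus a per-source correction as a crude stand-in for transferred knowledge. All names (`match_probability`, `select_queries`, `pools`, `shared_w`, `source_ws`) and all numbers are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def match_probability(features: Vector, shared_w: Vector, source_w: Vector) -> float:
    """Logistic match score: shared weights plus a per-source correction (assumed model)."""
    z = sum(f * (s + d) for f, s, d in zip(features, shared_w, source_w))
    return 1.0 / (1.0 + math.exp(-z))


def select_queries(
    pools: Dict[str, List[Vector]],
    shared_w: Vector,
    source_ws: Dict[str, Vector],
    budget: int,
) -> List[Tuple[str, int]]:
    """Pick the `budget` record pairs, pooled over all sources, whose
    predicted match probability is closest to 0.5 (uncertainty sampling)."""
    scored = []
    for source, pairs in pools.items():
        for idx, feats in enumerate(pairs):
            p = match_probability(feats, shared_w, source_ws[source])
            scored.append((abs(p - 0.5), source, idx))
    scored.sort()  # most uncertain pairs first
    return [(source, idx) for _, source, idx in scored[:budget]]


# Hypothetical similarity features per candidate record pair: [sim_1, sim_2, bias].
pools = {
    "A": [[0.9, 0.8, 1.0], [0.5, 0.5, 1.0], [0.1, 0.0, 1.0]],
    "B": [[0.6, 0.4, 1.0], [0.0, 0.1, 1.0]],
}
shared_w = [2.0, 2.0, -2.0]
source_ws = {"A": [0.0, 0.0, 0.0], "B": [0.5, 0.0, 0.0]}

selected = select_queries(pools, shared_w, source_ws, budget=2)
print(selected)  # -> [('A', 1), ('B', 0)]
```

Ranking all sources in one pool, instead of sampling each source at random, is what lets the labeling budget concentrate on borderline pairs even when matches are rare in some sources.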
