Data Extraction from Web Forums (Web论坛数据抽取)

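English title: Data Extraction from Web Forums
Author: Zhang Jingwei (张敬伟)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: Forum Data Extraction; User-Generated Content; Wrapper (Extraction Rules); Inductive Logic Programming
Degree year: 2012
Advisor: Zhou Aoying (周傲英)
Discipline code: 081203
Degree-granting institution: East China Normal University (华东师范大学)
Thesis submission date: 2012-11-01
Defense committee chair: Li Zhanhuai (李战怀)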

Abstract

Web 2.0 offers users a wealth of applications, and the deep participation of a huge number of users is turning the Web into an ecosystem. While presenting rich information to users, Web 2.0 also attracts massive amounts of user-contributed content, which holds great value.
     As a typical Web 2.0 application, a Web forum gives users a platform for obtaining and exchanging information. Users publish posts and comments, for example sharing product experience, exchanging reflections on life, discussing school education, or posting social news. This user-generated content reflects people's real needs and viewpoints as well as social phenomena. Extracting forum data from Web pages is therefore of both research and practical significance, since it supports applications such as commodity recommendation, expert discovery, and public opinion monitoring.
     Forum data is complex: it contains not only user-generated content but also noise such as recommendations and advertisements. In addition, forum sites differ widely in style, which makes forum data extraction more challenging. Traditional Web data extraction techniques usually target relatively regular structured data and are not well suited to forums, so efficient extraction techniques designed for forum data are needed. This thesis makes the following contributions:
     · A forum data extraction method that integrates inductive logic programming with XPath pattern learning and achieves high precision and recall. The method takes full account of the structural features of forum pages, introduces new predicates to unify logic program expressions with XPath patterns, and learns XPath patterns in a divide-and-conquer manner to describe the structural features of the target data. Finally, the learned XPath patterns are automatically transformed into an XSLT file, which stores the extracted data according to a predefined model and thereby automates forum data extraction.
     · An unsupervised forum data extraction method based on both the structural features of Web pages and the relationships between pages, which significantly raises the degree of automation. Because pages from the same forum site have similar structure, the method compares multiple pages jointly to divide each page into stable and unstable regions, and applies page-level and template-level filtering to remove most of the noise. It then introduces path accompanying distance and path similarity to compute the dependency between paths in stable regions and paths in unstable regions. This dependency determines whether a path locates target data, which enables automatic extraction of forum post content.
     · An unsupervised wrapper (extraction rule) generation method for forum data extraction, which takes full account of both Web page structure and page content to improve adaptability across different forums and to ensure that posts are extracted completely. The method has two stages and exploits three kinds of features: Web page structure, user-published posts, and the routine redundant information that forums generate themselves. In the user information stage, it locates user areas by means of the redundant information and obtains user information by finding the maximum substructure within those areas. In the post content stage, it converts the user areas into records of a relational table and uses functional dependencies between attributes to distinguish post content from noise. Finally, the paths that locate the content found in the two stages are generalized into an extraction rule represented as a regular tree.
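     The multi-page comparison behind the second method can be illustrated with a minimal Python sketch. It assumes well-formed XHTML input and a deliberately crude criterion: a path-and-text pair that occurs on every page counts as stable, and anything else counts as unstable. The function names and sample pages are hypothetical, and the page-level and template-level filtering and the path accompanying distance described in the thesis are not reproduced here.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_text_pairs(xml_source):
    """Return the set of (tag path, text) pairs for every text-bearing element."""
    root = ET.fromstring(xml_source)
    pairs = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            pairs.add((path, text))
        for child in node:
            walk(child, path)

    walk(root, "")
    return pairs


def split_stable_unstable(pages):
    """Split (path, text) pairs into those shared by all pages and the rest."""
    per_page = [path_text_pairs(page) for page in pages]
    counts = Counter(pair for pairs in per_page for pair in pairs)
    # Assumption for this sketch: "stable" means present on every page.
    stable = {pair for pair, n in counts.items() if n == len(pages)}
    unstable = set().union(*per_page) - stable
    return stable, unstable


# Two hypothetical pages from the same forum: shared navigation, different posts.
PAGE_A = ("<html><body><div><span>Forum Home</span></div>"
          "<ul><li>First post about cameras</li></ul></body></html>")
PAGE_B = ("<html><body><div><span>Forum Home</span></div>"
          "<ul><li>A reply about lenses</li></ul></body></html>")

stable, unstable = split_stable_unstable([PAGE_A, PAGE_B])
```

     In this toy input the navigation text falls into the stable set, while the two post texts, which share a path but differ in content, fall into the unstable set where candidate post content would be sought.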
     In summary, this thesis proposes three forum data extraction methods that address different needs. The first is a supervised method that learns extraction rules; it achieves good precision and recall and is suited to small-scale forum data sets. The second is an unsupervised method that extracts data directly from multiple forum pages at once without explicitly producing extraction rules, and it can handle larger data sets. The third is also unsupervised: it first learns extraction rules and then extracts data with them, balancing the automation of rule generation against extraction performance, and it can handle larger data sets than the first two methods. Experiments on real forum data show that these methods extract data effectively from a variety of forums.
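     As an illustration of rule-based extraction, the sketch below applies a hand-written rule to one page using the limited XPath subset in Python's standard `xml.etree.ElementTree`. The rule layout (`RULE`, with a record path and per-field paths), the class names, and the sample page are all assumptions made for the example. The thesis learns its rules automatically and expresses them as XSLT files or regular trees, not as a Python dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction rule: one path selecting post records,
# plus one relative path per field inside each record.
RULE = {
    "record": ".//div[@class='post']",
    "fields": {
        "author": "./span[@class='author']",
        "content": "./p",
    },
}


def apply_rule(xml_source, rule):
    """Apply a path-based rule to a well-formed page and return one dict per post."""
    root = ET.fromstring(xml_source)
    records = []
    for node in root.findall(rule["record"]):
        record = {}
        for name, path in rule["fields"].items():
            hit = node.find(path)
            record[name] = (hit.text or "").strip() if hit is not None else None
        records.append(record)
    return records


# Hypothetical forum page containing one advertisement block and two posts.
PAGE = """<html><body>
<div class="ad"><p>Buy now</p></div>
<div class="post"><span class="author">alice</span><p>Great battery life.</p></div>
<div class="post"><span class="author">bob</span><p>Mine broke in a week.</p></div>
</body></html>"""

posts = apply_rule(PAGE, RULE)
```

     Because the record path matches only elements marked as posts, the advertisement block is skipped and each post yields one record with its author and content.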

引文

[1]Alon Halevy, Poter Norvig, Fernando Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems. Mar.2009,24(2):8-12. URL http://dx.doi.org/10.1109/MIS.2009. 36
    [2]Michael J. Cafarella, Alon Halevy, Jayant Madhavan. Structured data on the web. Commun ACM. Feb.2011,54(2):72-79. URL http://doi.acm.org/10.1145/1897816.1897839
    [3]Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang. WebTables: exploring the power of tables on the web. Proc VLDB Endow. Aug.2008, 1(1):538-549. URL http://dx.doi.org/10.1145/1453856.1453916
    [4]Hazem Elmeleegy, Jayant Madhavan, Alon Halevy. Harvesting relational tables from lists on the web. Proc VLDB Endow. Aug.2009,2(1):1078-1089. URL http://dl.acm.org/ citation.cfm?id=1687627.1687749
    [5]中国互联网发展状况统计报告Tech. rep中国互联网信息中心(CNNIC),72012
    [6]Natalie. Glance, Matthew Hurst, Kamal Nigam, Matthew Siegler, Robert Stockton, Takashi Tomokiyo. Deriving marketing intelligence from online discussion. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. KDD'05, New York, NY, USA:ACM,2005,419-428. URL http://doi.acm. org/10.1145/ 1081870.1081919
    [7]Wen-tan Yih, Po-hao Chang, Wooyoung Kim. Mining Online Deal Forums for Hot Deals. Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence. WI'04, Washington, DC, USA:IEEE Computer Society,2004,384-390. URL http://dx. doi.org/10.1109/WI.2004.98
    [8]Robert Baumgartner, Georg Cottlob, Marcus Herzog. Scalable web data extraction for online market intelligence. Proc VLDB Endow. Aug.2009,2(2):1512-1523. URL http: //dl.acm.org/citation.cfm?id=1687553.1687580
    [9]Jun Zhang, Mark S. Ackerman, Lada Adainic. Expertise networks in online communities: structure and algorithms. Proceedings of the 16th international conference on World Wide Web. WWW'07, New York, NY, USA:ACM,2007,221-230. URL http://doi.acm.org/ 10.1145/1242572.1242603
    [10]Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, Yueheng Sun. Finding question-answer pairs from online forums. Proceedings of the 31st annual international ACM SIGIR. conference on Research and development in information retrieval SIGIR'08, New York, NY, USA:ACM,2008,467-474. URL http://doi.acm.org/10.1145/1390334.1390415
    [11]Lei Shi, Bai Sun, Liang Kong, Yan Zhang. Web Forum Sentiment Analysis Based on Topics. Proceedings of the 2009 Ninth IEEE International Conference on Computer and Information Technology-Volume 02. CIT'09, Washington, DC, USA:IEEE Computer Society,2009, 148-153. URL http://dx.doi.org/10.1109/CIT.2009.53
    [12]Yuan Niu, Hao Chen, Francis Hsu, Yi-Min Wang, Ming Ma. A Quantitative Study of Forum Spamming Using Context-based Analysis. NDSS. The Internet Society,2007
    [13]Christopher C. Yang, Tobun Dorbin Ng. Analyzing content development and visualizing social interactions in Web forum. ISI. IEEE,2008,25-30
    [14]Zaiqing Nie, Ji-Rong Wen, Wei-Ying Ma. Object-level Vertical Search. CIDR. www.cidrdb.org,2007,235-246
    [15]Senjuti Basu Roy, Sihem Amer-Yahia, Ashish Chawla, Gautam Das, Cong Yu. Constructing and exploring composite items. Proceedings of the 2010 ACM SIGMOD International Con-ference on Management of data. SIGMOD'10, New York, NY, USA:ACM,2010,843-854. URL http://doi.acm.org/10.1145/1807167.1807258
    [16]Rakesh Agrawal, Anastasia Ailainaki, Philip A. Bernstein, Eric A. Brewer, Michael J. Carey, Surajit Chaudhuri, Anhai Doan, Daniela Florescu, Michael J. Franklin, Hector Garcia-Molina, Johannes Gehrke, Le Gruenwald, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, Hank F. Korth, Donald Kossmann, Samuel Madden, Roger Magoulas, Beng Chin Ooi, Tim O'Reilly, Raghu Ramakrishnau, Sunita Sarawagi, Michael Stonebraker, Alexander S. Szalay, Gerhard Weikum. The Claremont report on database research. Commun ACM. Jun.2009,52(6):56-65. URL http://doi.acm.org/10.1145/ 1516046.1516062
    [17]Alon Y. Halevy. Towards an ecosystem of structured data on the web. Proceedings of the 15th International Conference on Extending Database Technology. EDBT'12, New York, NY, USA:ACM,2012,1-2. URL http://doi.acm.org/10.1145/2247596.2247597
    [18]Tim Weninger, William H. Hsu, Jiawei Han. CETR:content extraction via tag ratios. Proceedings of the 19th international conference on World wide web. WWW'10, New York, NY, USA:ACM,2010,971-980. URL http://doi.acm.org/10.1145/1772690.1772789
    [19]Fei Sun, Dandan Song, Lejian Liao. DOM based content extraction via text density. Pro-ceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. SIGIR'11, New York, NY, USA:ACM,2011,245-254. URL http://doi.acm.org/10.1145/2009916.2009952
    [20]Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, Marco Gori. Focused Crawling Using Context Graphs. Proceedings of the 26th International Conference on Very Large Data. Bases. VLDB'00, San Francisco, CA, USA:Morgan Kaufmann Publishers Inc.. 2000,527-534. URL http://dl.acm.org/citation.cfm?id=645926.671854
    [21]Sriram Raghavan, Hector Garcia-Molina. Crawling the Hidden Web. Proceedings of the 27th International Conference on Very Large Data Bases. VLDB'01, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,2001,129-138. URL http://dl.acm.org/citation. cfm?id=645927.672025
    [22]Yan Guo, Kui Li, Kai Zhang, Gang Zhang. Board Forum Crawling:A Web Crawling Method for Web Forum. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence. WI'06, Washington, DC, USA:IEEE Computer Society,2006,745-748. URL http://dx.doi.org/10.1109/WI.2006.52
    [23]Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai, Lei Zhang, Wei-Ying Ma. Exploring traversal strategy for web forum crawling. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR'08, New York, NY, USA:ACM,2008,459-466. URL http://doi.acm.org/10.1145/1390334. 1390413
    [24]Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, Lei Zhang. iRobot:an intelligent crawler for web forums. Proceedings of the 17th international conference on World Wide Web. WWW'08, New York, NY, USA:ACM,2008,447-456. URL http://doi.acm.org/10. 1145/1367497.1367558
    [25]Tianjun Fu, Ahmed Abbasi, Hsinchun Chen. A focused crawler for Dark Web forums. J Am Soc Inf Sci Technol. Jun.2010,61(6):1213-1231. URL http://dx.doi.org/10.1002/ asi.v61:6
    [26]Jingtian Jiang, Nenghai Yu, Chin-Yew Lin. FoCUS:learning to crawl web forums. Pro-ceedings of the 21st international conference companion on World Wide Web. WWW'12 Companion, New York, NY, USA:ACM,2012,33-42. URL http://doi.acm.org/10.1145/ 2187980.2187985
    [27]Ashwin Macha.nava.jjhala, Arun Shankar Iyer, Philip Bohannon, Srujana. Merugu. Collec-tive extraction from heterogeneous web lists. Proceedings of the fourth ACM international conference on Web search and data mining. WSDM'11, New York, NY, USA:ACM,2011, 445-454. URL http://doi.acm.org/10.1145/1935826.1935894
    [28]Arnaud Sahuguet, Fabien Azavant. Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F. Proceedings of the 25th International Conference on Very Large Data Bases. VLDB'99, San Francisco, CA, USA:Morgan Kaufmann Publishers Inc.,1999,738-741. URL http://dl.acm.org/citation.cfm?id=645925.671350
    [29]Arnaud Sahuguet, Fabicn Azavant. Building intelligent web applications using lightweight wrappers. Data Knowl Eng. Mar.2001,36(3):283-316. URL http://dx.doi.org/10. 1016/S0169-023X(00)00051-3
    [30]Robert Baumgartner, Sergio Flesca, Georg Gottlob. Visual Web Information Extraction with Lixto. Proceedings of the 27th International Conference on Very Large Data Bases. VLDB'01, San Francisco, CA, USA:Morgan Kaufmann Publishers Inc.,2001,119-128. URL http://dl.acm.org/citation.cfm?id=645927.672194
    [31]XWRAP:An XML-Enabled Wrapper Construction System for Web Information Sources. Proceedings of the 16th International Conference on Data Engineering. ICDE'00, Washing-ton, DC, USA:IEEE Computer Society,2000,611-. URL http://dl.acm.org/citation. cfm?id=846219.847340
    [32]Robert Baumgartner, Sergio Flesca, Georg Gottlob. The Elog Web Extraction Language. Proceedings of the Artificial Intelligence on Logic for Programming. LPAR.'01, London, UK, UK:Springer-Verlag,2001,548-560. URL http://dl.acm.org/citation.cfm?id=645710. 664471
    [33]Jussi Myllymaki. Effective Web data extraction with standard XML technologies. Proceed-ings of the 10th international conference on World Wide Web. WWW'01, New York, NY, USA:ACM,2001,689-696. URL http://doi.acm.org/10.1145/371920.372183
    [34]Wook-Shin Han, Wooseong Kwak, Hwanjo Yu. On supporting effective web extraction. Feifei Li, Mirella M. Moro, Shahrain Ghandeharizadeh, Jayant. R. Haritsa, Gerhard Weikum, Michael J. Carey, Fabio Casati, Edward Y. Chang, Ioana Manolescu, Sharad Mehrotra, Umeshwar Dayal, Vassilis J. Tsotras, (Editors) ICDE. IEEE,2010,773-775
    [35]Thomas Kistler, Hannes Marais. WebL-a programming language for the Web. Com-put Netw ISDN Syst. Apr.1998,30(1-7):259-270. URL http://dx.doi.org/10.1016/ S0169-7552(98)00018-X
    [36]Theodore Hong, Keith Clark. Using Grammatical Inference to Automate Information Ex-traction from the Web. Luc De Raedt, Arno Siebes, (Editors) Principles of Data Mining and Knowledge Discovery, Springer Berlin/Heidelberg,2001, vol.2168 of Lecture Notes in Computer Science.216-227.10.1007/3-540-44794-6.18, URL http://dx.doi.org/10. 1007/3-540-44794-6\_18
    [37]Georg Gottlob, Christoph Koch, Robert Baumgartner, Marcus Herzog, Sergio Flesca. The Lixto data, extraction project:back and forth between theory and practice. Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. PODS'04, New York, NY, USA:ACM,2004,1-12. URL http://doi.acm.org/ 10.1145/1055558.1055560
    [38]Julien Carme, Michal Ceresna, Oliver Fr?lich, Georg Gottlob, Tamir Hassan, Marcus Her-zog, Wolfgang Holzinger, Bernhard Krupl. The Lixto Project:Exploring New Frontiers of Web Data Extraction. David Bell, Jun Hong, (Editors) Flexible and Efficient Informa-tion Handling, Springer Berlin/Heidelberg,2006, vol.4042 of Lecture Notes in Computer Science.1-15.10.1007/11788911-1, URL http://dx.doi.org/10.1007/11788911\_1
    [39]Robert Baumgartner, Sergio Flesca, Georg Gottlob. Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto. Thomas Eiter, Wolfgang Faber, Miros Truszczynski, (Editors) Logic Programming and Nonmotonic Reasoning, Springer Berlin/ Heidelberg,2001, vol.2173 of Lecture Notes in Computer Science.21-41.10.1007/3-540-45402-0.2, URL http://dx.doi.org/10.1007/3-540-45402-0\_2
    [40]Nicholas Kushmerick, Daniel S. Weld, Robert B. Doorenbos. Wrapper Induction for Infor-mation Extraction. IJCAI (1). Morgan Kaufmann,1997,729-737
    [41]Ion Muslea, Steve Minton, Craig Knoblock. A hierarchical approach to wrapper induction. Proceedings of the third annual conference on Autonomous Agents. AGENTS'99, New York, NY, USA:ACM,1999,190-197. URL http://doi.acm.org/10.1145/301136.301191
    [42]Gerald Huck, Peter Fankhauser, Karl Aberer, Erich J. Neuhold. Jedi:Extracting and Syn-thesizing Information from the Web. Proceedings of the 3rd IFCIS International Conference on Cooperative Information Systems, New York City, New York, USA, August 20-22,1998, Sponsored by IFCIS, The Intn 1 Foundation on Cooperative Information Systems. IEEE Computer Society,1998,32-43
    [43]Kristina Lerman, Steven Minton. Learning the Common Structure of Data. Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence. AAAI Press,2000,609-614. URL http: //dl.acm.org/citation.cfm?id=647288.721589
    [44]Raymond Kosala, Hendrik Blocked. Instance-Based Wrapper Induction. In Proceedings of the Tenth Belgian-Dutch Conference on Machine Learning (Benelea.ru 2000.61-68
    [45]Chia-Hui Chang, Shih-Chien Kuo. OLERA:Semisupervised Web-Data Extraction with Visual Support. IEEE Intelligent Systems. Nov.2004,19(6):56-64. URL http://dx.doi. org/10.1109/MIS.2004.71
    [46]Ion Muslea, Steven Minton, Craig A. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems. Mar. 2001,4(1-2):93-114. URL http://dx.doi. org/10.1023/A:1010022931168
    [47]Alberto H. F. Laender, Berthier Ribeiro-Neto, Altigran S. da Silva. DEByE-Date extraction by example. Data. Knowl Eng. Feb.2002,40(2):121-154. URL http://dx.doi.org/10. 1016/S0169-023X(01)00047-7
    [48]Alberto H. F. Laender, Bert.hier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira. A brief survey of web data extraction tools. SIGMOD Rec. Jun.2002,31(2):84-93. URL http://doi.acm.org/10.1145/565117.565137
    [49]Nicholas Kushmerick. Wrapper induction:efficiency and expressiveness. Artif Intell. Apr. 2000,118(1-2):15-68. URL http://dx.doi.org/10.1016/S0004-3702(99) 00100-9
    [50]Sergio Flesca, Giuseppe Manco, Elio Masciari, Eugenio Rende, Andrea Tagarelli. Web wrapper induction:a brief survey. AI Commum. Apr.2004,17(2):57-61. URL http: //dl.acm.org/citation.cfm?id=1218702.1218707
    [51]Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. RoadRunner:Towards Automatic-Data Extraction from Large Web Sites. Proceedings of the 27th International Conference on Very Large Data Bases. VLDB'01, San Francisco, CA, USA:Morgan Kaufmann Publishers Inc.,2001,109-118. URL http://dl.acm.org/citation.cfm?id=645927.672370
    [52]Valter Crescenzi, Giansalvat.ore Mecca, Paolo Merialdo. Automatic Web Information Extrac-tion in the RoadRunner System. Hiroshi Arisawa, Yahiko Kambayashi, Vijay Kumar, Hein-rich Mayr, Ingrid Hunt, (Editors) Conceptual Modeling for New Information Systems Tech-nologies, Springer Berlin/Heidelberg,2002, vol.2465 of Lecture Notes in Computer Science. 264-277.10.1007/3-540-46140-X_21, URL http://dx.doi.org/10.1007/3-540-46140-X\_21
    [53]Arviud Arasu, Hector Garcia-Molina. Extracting structured data from Web pages. Proceed-ings of the 2003 ACM SIGMOD international conference on Management of da.ta. SIGMOD '03, New York, NY, USA:ACM,2003,337-348. URL http://doi.acm.org/10.1145/ 872757.872799
    [54]Chia-Hui Chang, Shao-Chen Lui. IEPAD:information extraction based on pattern discovery. Proceedings of the 10th international conference on World Wide Web. WWW'01, New York, NY, USA:ACM,2001,681-688. URL http://doi.acm.org/10.1145/371920.372182
    [55]Dan Gusfield. Algorithms on strings, trees, and.sequences:computer science and computa-tional biology. New York, NY, USA:Cambridge University Press,1997
    [56]Bing Liu, Robert Grossman, Yanhong Zhai. Mining data records in Web pages. Pro-ceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. KDD'03, New York, NY, USA:ACM,2003,601-606. URL http: //doi.acm.org/10.1145/956750.956826
    [57]Yanhong Zhai, Bing Liu. Web data extraction based on partial tree alignment. Proceedings of the 14th international conference on World Wide Web. WWW'05, New York, NY, USA: ACM,2005,76-85. URL http://doi.acm.org/10.1145/1060745.1060761
    [58]Bing Liu, Yauhong Zhai. NET:a system for extracting web data from flat and nested data, records. Proceedings of the 6th international conference on Web Information Systems Engineering. WISE'05, Berlin, Heidelberg:Springer-Verlag,2005,487-495. URL http: //dx.doi.org/10.1007/11581062_39
    [59]Justin Park, Denilson Barbosa. Adaptive record extraction from web pages. Proceedings of the 16th international conference on World Wide Web. WWW'07, New York, NY, USA: ACM,2007,1335-1336. URL http://doi.acm.org/10.1145/1242572.1242838
    [60]D. C. Reis, P. B. Golgher, A. S. Silva, A. F. Laender. Automatic web news extraction using tree edit distance. Proceedings of the 13th international conference on World Wide Web. WWW'04, New York, NY, USA:ACM,2004,502-511. URL http://doi.acm.org/10. 1145/988672.988740
    [61]Yeonjung Kim, Jeahyun Park, Taehwan Kim, Joongmin Choi. Web Information Extrac-tion by HTML Tree Edit Distance Matching. Convergence Information Technology,2007. International Conference on.2007,2455-2460
    [62]Nilesh Dalvi, Philip Bohannon, Fei Sha. Robust, web extraction:an approach based on a probabilistic tree-edit model. Proceedings of the 2009 ACM SIGMOD International Con-ference on Management of data. SIGMOD'09, New York, NY, USA:ACM,2009,335-348. URL http://doi.acm.org/10.1145/1559845.1559882
    [63]Boris Chidlovskii. Automatic repairing of web wrappers. Proceedings of the 3rd international workshop on Web information and data management. WIDM'01, New York, NY, USA: ACM,2001,24-30. URL http://doi.acm.org/10.1145/502932.502938
    [64]Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, Wei-Ying Ma.2D Conditional Random Fields for Web informa.tion extraction. Proceedings of the 22nd international conference on Machine learning. ICML'05, New York, NY, USA:ACM,2005,1044-1051. URL http: //doi.acm.org/10.1145/1102351.1102483
    [65]Jun Zhu, Zaiqing Nie, Ji rong Wen. Simultaneous record detection and attribute labeling in web data extraction. In Proc. of the ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD'06.2006,494-503
    [66]Susan Mengel, Yaoquin Jing. Extracting Structured Data from Web Pages with Maximum Entropy Segmental Markov Model. Proceedings of the 10th International Conference on Web Information Systems Engineering. WISE'09, Berlin, Heidelberg:Springer-Verlag,2009, 219-226. URL http://dx.doi.org/10.1007/978-3-642-04409-0_25
    [67]Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen. Dynamic. Hierarchical Markov Random Fields for Integrated Web Data Extraction. J Mach Learn Res. Jun.2008,9:1583-1614. URL http://dl.acm.org/citation.cfm?id=1390681.1442784
    [68]Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen. Dynamic hierarchical Markov random fields and their application to web data, extraction. Proceedings of the 24th international conference on Machine learning. ICML'07, New York, NY, USA:ACM,2007,1175-1182. URL http://doi.acm.org/10.1145/1273496.1273644
    [69]Boris Chidlovskii. Information Extraction from Tree Documents by Learning Subtree De-limiters. In:Proc. IIWeb'03.2003,3-8
    [70]Shuyi Zheng, Ruihua Song, Ji-Rong Wen, Di Wu. Joint optimization of wrapper generation and template detection. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. KDD'07, New York, NY, USA:ACM,2007,894-902. URL http://doi.acm.org/10.1145/1281192.1281287
    [71]Hongkun Zhao, Weiyi Meng, Zonghuan Wu, Vijay Raghavan, Clement Yu. Fully automatic wrapper generation for search engines. Proceedings of the 14th international conference on World Wide Web. WWW'05, New York, NY, USA:ACM,2005,66-75. URL http: //doi.acm.org/10.1145/1060745.1060760
    [72]Rahul Gupta, Sunita Sarawa.gi. Answering table augmentation queries from unstructured lists on the web. Proc VLDB Endow. Aug.2009,2(1):289-300. URL http://dl.acm.org/ citation.cfm?id=1687627.1687661
    [73]http://lucene.apache.org/
    [74]Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Kriipl, Bernhard Pollak. Towards domain-independent information extraction from web tables. Proceedings of the 16th international conference on World Wide Web. WWW'07, New York, NY, USA:ACM, 2007,71-80. URL http://doi.acm.org/10.1145/1242572.1242583
    [75]M.S. Amin, H. Jamil. Fast Wrap:An efficient wrapper for tabular data extraction from the web. Information Reuse Integration,2009. IRI'09. IEEE International Conference on.2009, 354-359
    [76]Bernhard Kriipl, Marcus Herzog, Wolfgang Gatterbauer. Using visual cues for extraction of tabular data from arbitrary HTML documents. Special interest tracks and posters of the 14th international conference on World Wide Web. WWW'05, New York, NY, USA:ACM, 2005,1000-1001. URL http://doi.acm.org/10.1145/1062745.1062838
    [77]Manuel Alvarez, Alberto Pan, Juan Raposo, Fernando Bellas, Fidel Cacheda. Extracting lists of data records from semi-structured web pages. Data Knowl Eng. Feb.2008,64(2):491-509. URL http://dx.doi.org/10.1016/j.datak.2007.10.002
    [78]Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti. Redundancy-driven web data extraction and integration. Procceedings of the 13th Inter-national Workshop on the Web and Databases. WebDB'10, New York, NY, USA:ACM, 2010,7:1-7:6. URL http://doi.acm.org/10.1145/1859127.1859137
    [79]Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti. Wrapper Generation for Overlapping Web Sources. Olivier Boissier, Boualem Benatallah, Mike P. Papazoglou, Zbig-niew W. Ras, Mohand-Said Hacid, (Editors) Web Intelligence. IEEE Computer Society, 2011,32-35
    [80]Rahul Gupta, Sunita Sarawagi. Joint training for open-domain extraction on the web: exploiting overlap when supervision is limited. Proceedings of the fourth ACM international conference on Web search and data mining. WSDM'11, New York, NY, USA:ACM,2011, 217-226. URL http://doi.acm.org/10.1145/1935826.1935868
    [81]Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti. Automat-ically building probabilistic databases from the web. Proceedings of the 20th international conference companion on World wide web. WWW'11, New York, NY, USA:ACM,2011, 185-188. URL http://doi.acm.org/10.1145/1963192.1963285
    [82]Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti. Exploiting information redundancy to wring out structured data from the web. Proceedings of the 19th international conference on World wide web. WWW'10, New York, NY, USA:ACM,2010, 1063-1064. URL http://doi.acm.org/10.1145/1772690.1772805
    [83]Nilesh Dalvi, Ravi Kumar, Mohamed Soliman. Automatic wrappers for large scale web extraction. Proc VLDB Endow. Jan.2011,4(4):219-230. URL http://dl.acm.org/ citation.cfm?id=1938545.1938547
    [84]Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Yago:a core of semantic knowledge unifying WordNet and Wikipedia. Proceedings of the 16th international conference on World Wide Web. WWW'07, New York, NY, USA:ACM,2007,697-706. URL http://doi.acm. org/10.1145/1242572.1242667
    [85]Nora Derouiche, Bogdan Cautis, Talel Abdessalem. Automatic Extraction of Structured Web Data with Domain Knowledge. Proceedings of the 2012 IEEE 28th International Conference on Data Engineering. ICDE'12, Washington, DC, USA:IEEE Computer Society,2012,726-737. URL http://dx.doi.org/10.1109/ICDE.2012.90
    [86]Talel Abdessalem, Bogdan Cautis, Nora Derouiche. ObjectRunner:lightweight, targeted extraction and querying of structured web data. Proc VLDB Endow. Sep.2010,3(1-2):1585-1588. URL http://dl.acm.org/citation.cfm?id=1920841.1921045
    [87]Deepayan Chakrabarti, Rupesh Mehta. The paths more taken:matching DOM trees to search logs for accurate webpage clustering. Proceedings of the 19th international conference on World wide web. WWW'10, New York, NY, USA:ACM,2010,211-220. URL http: //doi.acm.org/10.1145/1772690.1772713
    [88]Pankaj Gulhane, Rajeev Rastogi, Srinivasan H. Sengamedu, Ashwin Tengli. Exploiting content redundancy for web information extraction. Proceedings of the 19th international conference on World wide web. WWW'10, New York, NY, USA:ACM,2010,1105-1106. URL http://doi.acm.org/10.1145/1772690.1772826
    [89]Qiang Hao, Rui Cai, Yanwei Pang, Lei Zhang. From one tree to a forest:a unified solution for structured web data extraction. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. SIGIR'11, New York, NY, USA:ACM,2011,775-784. URL http://doi.acm.org/10.1145/2009916.2010020
    [90]Andrew Carlson, Charles Schafer. Bootstrapping Information Extraction from Semi-structured Web Pages. Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases-Part I. ECML PKDD'08, Berlin, Heidelberg: Springer-Verlag,2008,195-210. URL http://dx.doi.org/10.1007/978-3-540-87479-9_31
    [91]Gengxin Miao, Junichi Tatemura, Wang-Pin Hsiung, Arsany Sawires, Louise E. Moser. Extracting data records from the web using tag path clustering. Proceedings of the 18th international conference on World wide web. WWW'09, New York, NY, USA:ACM,2009, 981-990. URL http://doi.acm.org/10.1145/1526709.1526841
    [92]F. Ashraf, T. Ozyer, R. Alhajj. Employing Clustering Techniques for Automatic Information Extraction From HTML Documents. Systems, Man, and Cybernetics, Part C:Applications and Reviews, IEEE Transactions on. sept.2008,38(5):660-673
    [93]Oren Etzioni, Michele Banko, Stephen Soderland, Daniel S. Weld. Open information extrac-tion from the web. Commun ACM. Dec.2008,51(12):68-74. URL http://doi.acm.org/ 10.1145/1409360.1409378
    [94]Arpita Ghosh, Preston McAfee. Incentivizing high-quality user-generated content. Proceed-ings of the 20th international conference on World wide web. WWW'11, New York, NY, USA:ACM,2011,137-146. URL http://doi.acm.org/10.1145/1963405.1963428
    [95]Pankaj Gulhane. Amit Madaan, Rupesh Mehta, Jeyashankher Ramamirtham, Rajeev Ras-togi, Sandeep Satpal, Srinivasan H. Sengamedu, Ashwin Tengli, Charu Tiwari. Web-scale information extraction with vertex. Proceedings of the 2011 IEEE 27th International Con-ference on Data Engineering. ICDE'11, Washington, DC, USA:IEEE Computer Society, 2011,1209-1220. URL http://dx.doi.org/10.1109/ICDE.2011.5767842
    [96]Shuyi Zheng, Ruihua Song, Ji-Rong Wen, C. Lee Giles. Efficient, record-level wrapper induc-tion. Proceedings of the 18th ACM conference on Information and knowledge management. CIKM'09, New York, NY, USA:ACM,2009,47-56. URL http://doi.acm.org/10.1145/ 1645953.1645962
    [97]Mohammed Kayed, Chia-Hui Chang. FiVaTech:Page-Level Web Data Extraction from Template Pages. IEEE Trans Knowl Data Eng.2010,22(2):249-263
    [98]Shui-Lung Chuang, K.C.-C. Chang, ChengXiang Zhai. Collaborative Wrapping:A Turbo Framework for Web Data Extraction. Data Engineering,2007. ICDE 2007. IEEE 23rd International Conference on.2007,1261-1262
    [99]Jochen Kranzdorf, Andrew Jon Sellers, Giovanni Grasso, Christian Schallhart, Tim Furche. Visual OXPath:robust wrapping by example. Alain Mille, Fabien L. Gandon, Jacques Misselis, Michael Rabinovich, Steffen Staab, (Editors) WWW (Companion Volume). ACM, 2012,369-372
    [100]Andrew Jon Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart. Taking the OXPath down the deep web. Anastasia Ailamaki, Sihem Amer-Yahia, Jignesh M. Patel, Tore Risch, Pierre Senellart, Julia Stoyanovich, (Editors) EDBT. ACM,2011, 542-545
    [101]Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers. Exploring the web with OXPath. Roberto De Virgilio, Devis Bianchini, Valeria De Antonellis, Kjell Orsborn, Silvia Stefanova, (Editors) EDBT/ICDT Workshop on Linked Web Data Management. ACM,2011,28-29
    [102]Andrew Jon Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart. OXPath:little language, little memory, great value. Srinivasan et al. [125],261-264
    [103]Andrew Jon Sellers. The OXPath to success in the deep web. Srinivasan et al. [125],409-414
    [104]Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers. OXPath:A Language for Scalable, Memory-efficient Data Extraction from Web Applications. PVLDB.2011,4(11):1016-1027
    [105]Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, Wei-Ying Ma. Incorporating site-level knowledge to extract structured data from web forums. Proceedings of the 18th international conference on World wide web. WWW'09, New York, NY, USA:ACM,2009, 181-190. URL http://doi.acm.org/10.1145/1526709.1526735
    [106]Xinying Song, Jing Liu, Yunbo Cao, Chin-Yew Lin, Hsiao-Wuen Hon. Automatic extraction of web data records containing user-generated content. Proceedings of the 19th ACM international conference on Information and knowledge management. CIKM'10, New York, NY, USA:ACM,2010,39-48. URL http://doi.acm.org/10.1145/1871437.1871447
    [107]Serge Abiteboul, Meghyn Bienvenu, Alban Galland, Emilien Antoine. A rule-based language for web data management. Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. PODS'11, New York, NY, USA:ACM,2011, 293-304. URL http://doi.acm.org/10.1145/1989284.1989320
    [108]Joseph M. Hellerstein. The declarative imperative:experiences and conjectures in distributed logic. SIGMOD Rec. Sep.2010,39(1):5-19. URL http://doi.acm.org/10.1145/1860702.1860704
    [109]Stephen Muggleton, Luc De Raedt. Inductive Logic Programming:Theory and Methods. Journal of Logic Programming.1994,19(20):629-679
    [110]Luc De Raedt, Kristian Kersting. Probabilistic inductive logic programming. Berlin, Heidelberg:Springer-Verlag,2008.1-27. URL http://dl.acm.org/citation.cfm?id=1793956.1793958
    [111]Luc De Raedt. Logical and Relational Learning. Gerson Zaverucha, Augusto da Costa, (Editors) Advances in Artificial Intelligence-SBIA 2008, Springer Berlin/Heidelberg, 2008, vol.5249 of Lecture Notes in Computer Science.1-1. 10.1007/978-3-540-88190-2_1, URL http://dx.doi.org/10.1007/978-3-540-88190-2_1
    [112]J. R. Quinlan. Learning Logical Definitions from Relations. Mach Learn. Sep.1990,5(3):239-266. URL http://dx.doi.org/10.1023/A:1022699322624
    [113]Stephen Muggleton. Inverse entailment and Progol. New Generation Computing.1995, 13:245-286. 10.1007/BF03037227, URL http://dx.doi.org/10.1007/BF03037227
    [114]J. Ross Quinlan, R. Mike Cameron-Jones. Induction of Logic Programs:FOIL and Related Systems. New Generation Comput.1995,13(3&4):287-312
    [115]Niels Landwehr, Kristian Kersting, Luc De Raedt. nFOIL:Integrating Naive Bayes and FOIL. Manuela M. Veloso, Subbarao Kambhampati, (Editors) AAAI. AAAI Press/The MIT Press,2005,795-800
    [116]Niels Landwehr, Kristian Kersting, Luc De Raedt. Integrating Naive Bayes and FOIL. J Mach Learn Res. May 2007,8:481-507. URL http://dl.acm.org/citation.cfm?id=1248659.1248677
    [117]Stephen Muggleton. Bayesian inductive logic programming. Proceedings of the seventh annual conference on Computational learning theory. COLT'94, New York, NY, USA:ACM, 1994,3-11. URL http://doi.acm.org/10.1145/180139.178095
    [118]Kristian Kersting, Luc De Raedt. Towards Combining Inductive Logic Programming with Bayesian Networks. Proceedings of the 11th International Conference on Inductive Logic Programming. ILP'01, London, UK:Springer-Verlag,2001,118-131. URL http://dl.acm.org/citation.cfm?id=648001.742956
    [119]Houssam Nassif, Hassan Al-Ali, Sawsan Khuri, Walid Keirouz, David Page. An inductive logic programming approach to validate Hexose binding biochemical knowledge. Proceedings of the 19th international conference on Inductive logic programming. ILP'09, Berlin, Heidelberg:Springer-Verlag,2010,149-165. URL http://dl.acm.org/citation.cfm?id=1893538.1893552
    [120]Tuan Tran, Kenji Satou, Tu Ho. Using Inductive Logic Programming for Predicting Protein-Protein Interactions from Multiple Genomic Data. Alipio Jorge, Luis Torgo, Pavel Brazdil, Rui Camacho, João Gama, (Editors) Knowledge Discovery in Databases:PKDD 2005, Springer Berlin/Heidelberg,2005, vol.3721 of Lecture Notes in Computer Science.321-330. 10.1007/11564126_33, URL http://dx.doi.org/10.1007/11564126_33
    [121]Costin Badica, Amelia Badica, Elvira Popescu. Tuples extraction from HTML using logic wrappers and inductive logic programming. Proceedings of the Third international conference on Advances in Web Intelligence. AWIC'05, Berlin, Heidelberg:Springer-Verlag,2005, 44-50. URL http://dx.doi.org/10.1007/11495772_8
    [122]Ganesh Ramakrishnan, Sachindra Joshi, Sreeram Balakrishnan, Ashwin Srinivasan. Using ILP to construct features for information extraction from semi-structured text. Proceedings of the 17th international conference on Inductive logic programming. ILP'07, Berlin, Heidelberg:Springer-Verlag,2008,211-224. URL http://dl.acm.org/citation.cfm?id=1793494.1793519
    [123]Can Zhang, Jingwei Zhang. InForCE:Forum data crawling with information extraction. Universal Communication Symposium (IUCS),2010 4th International.2010,367-373
    [124]Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, Khaled F. Shaalan. A Survey of Web Information Extraction Systems. IEEE Trans on Knowl and Data Eng. Oct.2006, 18(10):1411-1428. URL http://dx.doi.org/10.1109/TKDE.2006.152
    [125]Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino, Ravi Kumar, (Editors). Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28-April 1,2011 (Companion Volume). ACM, 2011
