Research on Automatic Metadata Construction Based on Semantic Annotation and Related Technologies
Abstract
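Metadata has become an important response to the many problems raised by the "explosion" of networked information, and it is now widely used in services such as information retrieval, information integration, and information sharing. Without question, the quality of the metadata itself determines whether the services built on it ultimately succeed or fail. To improve that quality, academia and industry have pursued three main lines of work. The first is the development of metadata quality standards: a unified standard helps ensure consistency and completeness and enables standardized interoperation, a point on which researchers broadly agree. The second is the improvement of methods for constructing and managing metadata, where a great deal of work has been done on schema discovery, schema transformation, control strategies, and management mechanisms. The third is metadata quality assessment, where discussion has centered on evaluation indicator systems, evaluation methods, and evaluation use cases. The existing literature, however, mostly assumes that metadata is created manually and concentrates on making creation tools effective and convenient. Seen from the perspectives of both creators and users, this causes problems. Creators facing large numbers of varied data sets must spend considerable effort studying each data set until they understand its content thoroughly, which is tedious and burdensome work; moreover, differences in how individual creators understand the data lead to ambiguity in the resulting metadata. Users, for their part, must correctly understand the predefined metadata; otherwise a cognitive "gap" opens between creators and users, and users cannot query effectively for the information they need.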
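     To address these problems and build high-quality metadata services, this thesis first proposes a method for constructing metadata from semantic annotations, which automatically generates metadata from the semantic annotation information already present in a data set. Besides improving construction efficiency, the method draws on the idea of knowledge sharing: it explores whether the multiple perspectives conveyed by semantic annotations can remove the subjective cognitive "gap", and it studies metadata identification strategies tailored to different structural views. On this basis, the thesis examines semantic heterogeneity among metadata schemas and proposes a schema matching method that supports their semantic integration. To verify the applicability of the approach and assess the quality of the resulting metadata, it further proposes a metadata query method that improves precision while limiting the loss of targets caused by low recall. Given the particular value of archival information resources and their important place among basic information resources [1], both the experimental design and the choice of test data sets target this domain. Specifically, the main results are as follows: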
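     (1) Building on an analysis of the two main classes of metadata extraction methods, template-based and machine-learning-based, the thesis proposes an automatic metadata construction method (SAMC). SAMC overcomes the shortcomings of both classes: it uses existing semantic annotation information to identify and locate metadata effectively, and it combines statistical theory with the structural features and visual layout features of the information, which underpins its performance. The metadata it constructs is therefore more precise and more expressive, and meets the requirements for building high-quality metadata.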
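     (2) Algorithms are proposed for identifying metadata under different layout patterns. To make the generated metadata more practicable, the thesis takes into account differences in the structural views of semantic annotation information. It focuses on the distinguishing features that the annotations exhibit under sequence patterns such as summary-detail, progressive, and composite distributions, and designs a metadata identification algorithm for each. The algorithms exploit the hierarchy of tree data structures, the ordering of linear data structures, and the frequency with which information occurs, which yields good identification quality and performance.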
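     (3) A schema matching method (PISMatching) is proposed that supports attribute-level semantic integration of metadata. Compared with related work, this research faces a new problem whose purpose is to enrich the semantic information of metadata schemas and whose task is to merge metadata schemas from multiple data sources. The thesis combines ontologies, thesauri, and concept similarity computation so as to integrate their respective strengths, and the result compares favorably in ease of implementation, complexity, and semantic strength. The ontology supplies strong domain context that improves matching accuracy. The concept similarity method, based on association of related information and probability statistics, provides a new measure for schema matching: it can discover positively correlated attributes to obtain potential attribute groups, and it also retains groups of synonymous attributes. In its design, PISMatching emphasizes ranking candidates by degree of match rather than computing the differences between scores, which is more meaningful in practice; it also emphasizes capturing whatever information is available for matching and reduces dependence on any particular matching pattern, which makes the results more flexible, more extensible, and more widely usable.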
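     (4) A metadata query method (MFCQuery) is proposed that measures relevance using field context information. To improve on the precision and recall of traditional metadata queries, MFCQuery extends them in two ways. First, it uses the vector space model to build a relevance matrix between the user's query and the context information of each metadata field, and infers the user's real query intent from how strongly each field context correlates with the query, which improves recall. Second, because some users lack the background knowledge needed to specify the metadata fields to query, it matches the most relevant target field for them and uses it to restrict the query, which improves precision. The method keeps the high precision of traditional queries while further raising recall.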
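     (5) The evaluation criteria for metadata are refined. The overall aim of the work is to improve metadata quality effectively so that metadata can play a greater role in specific application domains, and archival information resources were chosen as the target domain for the experiments. The author considers that the final quality of metadata cannot be captured solely by the classic information technology indicators of recall and precision. The thesis therefore refines the individual evaluation indicators and evaluates and compares objects with different characteristics separately, so that the effect of each method on metadata quality shows up at a finer level of detail.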
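     In summary, the thesis uses rule-based, statistical, and probabilistic methods to study metadata technologies in depth from each of the aspects above. It resolves key problems in metadata construction and improves the precision and recall of the generated metadata; it strengthens the ability to integrate metadata schemas that differ in format and change over time; and it improves the performance of user-initiated queries, raising precision as well as recall. A series of research results were obtained in the course of this work.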
To address the many problems of the network information "explosion", metadata has been widely adopted as an important method in information retrieval, information integration, information sharing, and similar services. There is no question that the quality of the metadata itself determines the ultimate success or failure of metadata application services. To improve metadata quality, academia and industry have carried out a great deal of research, mainly in the following areas. First, setting metadata standards: establishing a unified metadata standard effectively ensures consistency and integrity and enables standardized interoperation, a point widely accepted by researchers. Second, improving the methods for constructing and managing metadata, which is another way to raise metadata quality; extensive work has been carried out on metadata schema discovery, schema transformation, control strategies, administration mechanisms, and many other aspects. Third, studying metadata quality assessment, where academic discussion has focused on evaluation indicators, evaluation methods, and evaluation use cases. From the current literature, we found that existing research mostly starts from the manual creation of metadata and considers the effectiveness and convenience of creation tools. Considering both the creators and the users of metadata, however, this gives rise to problems such as the following. Creators, faced with a large number of diverse data sets, need to spend effort studying the contents of the data sets until they understand them deeply. This is cumbersome and heavy work; in addition, different creators understand the data differently, which can lead to ambiguity in the metadata. Users need a correct understanding of the predefined metadata; otherwise a knowledge "gap" arises between creators and users, and users cannot effectively query for the information they need.
     To solve these problems and build high-quality metadata services, this thesis first presents a method that builds metadata from semantic annotation, automatically generating metadata from the semantic annotations already present in the data sets. The method builds metadata efficiently and borrows the idea of knowledge sharing: it explores the feasibility of eliminating the subjective "gap" in perception by using the multiple perspectives that semantic annotations convey, and it studies metadata identification strategies for different structural views. On this basis, the thesis further studies the semantic heterogeneity of metadata schemas and proposes a schema matching method for their semantic integration. To validate the applicability of the approach and assess metadata quality, the thesis proposes a metadata query method that effectively improves precision and limits the loss of results caused by low recall. In view of the unique value of archival information resources and their important position among basic information resources [1], the experimental design and the test data sets are both drawn from this field. Specifically, our studies cover the following aspects:
     (1) We propose SAMC, a method for automatically constructing metadata, based on an analysis of the two main classes of metadata extraction methods: template-based and machine-learning-based. SAMC overcomes the shortcomings of both: it effectively identifies metadata from existing semantic annotations, and it combines statistical theory with the structural features and visual layout characteristics of the information, which supports its performance. As a result, the metadata it builds has higher precision and greater expressive power, and it meets the requirements for building high-quality metadata.
     (2) We propose algorithms for identifying metadata under different layout patterns. To improve the feasibility of the method, this thesis considers differences in structural views, focuses on the distinguishing characteristics shown by summary-detail, progressive, and composite sequence patterns, and designs a corresponding metadata identification algorithm for each. The algorithms use the hierarchy of tree structures, the order of linear structures, and the frequency distribution of information, and they achieve good identification results.
     (3) We propose PISMatching, a schema matching method that supports attribute-level semantic integration of metadata schemas. Compared with related work, this research addresses a new problem whose purpose is to enrich the semantics of metadata schemas and whose task is to merge metadata schemas from multiple data sources. The thesis combines an ontology with a thesaurus and concept similarity so as to integrate their respective advantages, and the combination performs better in ease of implementation, complexity, and semantic richness. The ontology provides strong domain context for improving matching accuracy. Concept similarity, based on related information and probability, provides a new metric for schema matching: it can uncover positively correlated attributes to obtain potential attribute groups, and it also preserves groups of synonymous attributes. In its concrete design, PISMatching pays more attention to the rank order of matches than to the gaps between computed values, which is more meaningful in practice; it also pays more attention to capturing the available information and reduces dependence on any specific matching pattern, which makes the results more flexible, more scalable, and more widely useful.
     (4) We propose MFCQuery, a metadata query method that measures relevance against field context. To improve on the precision and recall of traditional methods, MFCQuery extends them in two ways. First, it uses the vector space model to build a similarity matrix between the user query and the metadata field contexts, and determines the real query intent from the similarity between each field context and the query, which improves recall. Second, since some users cannot specify the necessary metadata fields in their queries, possibly for lack of background knowledge, it matches the most relevant target field and uses it to restrict the query, which improves precision. The method preserves the high precision of traditional queries while further enhancing recall.
     (5) We refine the evaluation of metadata. The overall aim of this work is to improve the quality of metadata so that it can play a greater role in specific applications, and the archival information domain was selected as the target application for our experiments. We consider that metadata quality cannot be reflected simply by classic information technology indicators such as recall and precision. The thesis therefore refines the evaluation indicators and evaluates objects with different characteristics separately, which shows the impact of different methods on metadata quality at a more detailed level.
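     As an illustration of the field-matching step in item (4), the following minimal Python sketch ranks metadata fields by the cosine similarity between a user query and a short context description for each field, using raw term-frequency vectors in the vector space model. It is not the MFCQuery implementation: the field names, the context strings, the whitespace tokenizer, and the function names `tf_vector`, `cosine`, and `rank_fields` are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector for lower-cased, whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_fields(query, field_contexts):
    """Return (field, score) pairs, most similar field context first."""
    q = tf_vector(query)
    scores = {name: cosine(q, tf_vector(ctx)) for name, ctx in field_contexts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical field contexts for an archival record (illustrative only).
FIELD_CONTEXTS = {
    "title": "title name heading subject of the archive record",
    "creator": "creator author agency department that produced the record",
    "date": "date year month day when the record was created",
}

ranked = rank_fields("records produced by the finance department", FIELD_CONTEXTS)
best_field = ranked[0][0]  # the field that would be used to restrict the query
print(ranked)
```

     A query that names no field can then be restricted to `best_field`, while the full ranking gives one row of the query-to-field similarity matrix. A realistic system would likely weight terms (for example with TF-IDF) and use a proper tokenizer; raw counts are used here only to keep the sketch short.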
     In summary, this thesis studies metadata technologies in depth from the aspects above, using rule-based, statistical, probabilistic, and other methods. It addresses key issues in metadata construction and improves the precision and recall of the generated metadata; it enhances the ability to integrate metadata schemas that differ in format and change over time; and it improves the performance of users' active queries, raising precision as well as recall. These efforts produced a series of research results.
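     The idea of reporting precision and recall separately for objects with different characteristics can be sketched as follows. This is only one possible reading: grouping by metadata field, the `(record_id, field, value)` tuple layout, the sample data, and the function names `precision_recall` and `per_field_scores` are assumptions for illustration, not the evaluation scheme used in the thesis.

```python
from collections import defaultdict


def precision_recall(extracted, gold):
    """Precision and recall of a set of extracted items against a gold set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def per_field_scores(extracted, gold):
    """Compute precision and recall separately for each metadata field."""
    extracted_by_field = defaultdict(set)
    gold_by_field = defaultdict(set)
    for item in extracted:
        extracted_by_field[item[1]].add(item)
    for item in gold:
        gold_by_field[item[1]].add(item)
    fields = sorted(set(extracted_by_field) | set(gold_by_field))
    return {f: precision_recall(extracted_by_field[f], gold_by_field[f]) for f in fields}


# Hypothetical gold-standard and extracted metadata: (record_id, field, value).
gold = {
    (1, "title", "Annual report"), (1, "date", "1994"),
    (2, "title", "Budget memo"), (2, "date", "1995"),
}
extracted = {
    (1, "title", "Annual report"), (1, "date", "1994"),
    (2, "title", "Budget"), (2, "date", "1995"),
}

overall = precision_recall(extracted, gold)
per_field = per_field_scores(extracted, gold)
print(overall, per_field)
```

     In this toy data the overall scores hide the fact that all of the errors come from the `title` field, which is the kind of difference a finer-grained breakdown is meant to expose.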
References
[1]Yan Hai. Development and Utilization of Archival Information Resources [M]. Wuhan: Wuhan University Press, 2004. (in Chinese)
    [2]Moen, W.E., Stewart, E.L. and McClure, C.R. Assessing metadata quality: findings and methodological considerations from an evaluation of the U.S. Government Information Locator Service (GILS). In IEEE International Forum on Research and Technology Advances in Digital Libraries, ADL'98 Proceedings, Santa Barbara, California, 1998.
    [3]Bruce, T.R., Hillmann, D.I. The Continuum of Metadata Quality:Defining, Expressing, Exploiting. In Metadata in Practice, American Library Association,2004: 238-256.
    [4]Reggie-Metadata Editor. http://metadata.net/dstc/
    [5]DcDot. http://www.ukoln.ac.uk/cgi-bin/dcdot.pl
    [6]Miroslav.B, Petr.K, Martin S. DML-CZ Metadata Editor Content Creation System for Digital Libraries.
    [7]D.Maynard, K. Bontcheva, H. Saggion, H. Cunningham, O. Hamza. Using a Text Engineering Framework to Build an Extendable and Portable IE-based Summarisation System. Proceedings of the ACL Workshop on Text Summarisation, Philadelphia, July 2002.
    [8]C. Mooers. "Application of random codes to the gathering of statistical information". Bulletin 31, Zator Co, Cambridge, Mass,1949
    [9]Qi Yanli, Zhao Danqun. Introduction to Information Retrieval. Peking University Press, pp. 3-4, 2006. (in Chinese)
    [10]Chidlovskii B. Wrapping web information providers by transducer induction. In: Raedt L, Flach P, eds. Proc. of the 12th European Conf. on Machine Learning (ECML 2001). LNCS 2167, Heidelberg: Springer-Verlag, 2001. 61-72.
    [11]Mao, S., Kim, J.W., Thoma, G.R.:A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials. In:Dial 2004. Proceedings of the First international Workshop on Document Image Analysis For Libraries, vol.225, IEEE Computer Society, Los Alamitos (2004)
    [12]Yin P, Zhang M, Deng ZH, Yang DQ. Metadata extraction from bibliographies using bigram HMM. In:Chen Z, Chen H, Miao Q, Fu Y, Fox E, Lim E, eds. Proc. of the Int'l Conf. of Asian Digital Libraries (ICADL 2004). LNCS 3334, Heidelberg: Springer-Verlag,2004.310-319.
    [13]Borkar VR, Deshmukh K, Sarawagi S. Automatic segmentation of text into structured records. In:Aref WG, ed. Proc. of the ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD 2001). New York:ACM Press,2001.175-186.
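    [14]Han H, Giles CL, Manavoglu E, Zha HY, Zhang ZY, Fox EA. Automatic document metadata extraction using support vector machines. In: Proc. of the ACM/IEEE Joint Conf. on Digital Libraries (JCDL 2003). New York: ACM Press, 2003. 37-48.
    [15]Han Limin, Wu Xinning. Retrospect and prospect of archive use in the new era: an analysis report on 15 years (1980-1994) of archive use at the Zhejiang Provincial Archives [J]. Archives Science Study (1996 supplement). 26-29. (in Chinese)
    [16]Zhang Xiaolin, ed. Metadata Research and Applications. Beijing: Beijing Library Press, 2002. (in Chinese)
    [17]Li Langda. A preliminary study of metadata. Information Science. 2001, 6, 19(6): 605. (in Chinese)
    [18]Wang Songlin. Metadata and related reflections. Journal of the China Society for Scientific and Technical Information, 2002. 21(4): 465-469. (in Chinese)
    [19]Current status and development of Chinese metadata research. http://www.ibnet.sh.cn/dcchina/jfz.htm (in Chinese)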
    [20]Lorcan Dempsey, Rachel Heery. Specification for Resource Description Methods. Part I. a Review of Metadata:A Survey of Current Resource Description Formats. http://www.ukoln.ac.uk/metadata/desire/overview/rev_ti.htm
    [21]Wu Jianzhong. DC Metadata [M]. Shanghai: Shanghai Scientific and Technological Literature Press, 2000: 40. (in Chinese)
    [22]Kent, J.-P.& Schuerhoff, M.(1996). Some Thoughts about a Metadata Management System, paper presented to InterCASIC, November 1996; Voorburg, The Netherlands:Statistics Netherlands.
    [23]Emily A Hicks, Jody Perkins, and Margaret Beecher Maurer, "Application Profile Development for Consortial Digital Libraries," Library Resources and Technical Services 51, no.2 (April 2007).
    [24]Gruber T R. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition,1993,5:199-220.
    [25]Borst W N. Construction of Engineering Ontologies for Knowledge Sharing and Reuse. PhD thesis, University of Twente, Enschede, 1997.
    [26]Studer R, Benjamins V R, Fensel D. Knowledge Engineering: Principles and Methods. Data and Knowledge Engineering, 1998, 25(1-2): 161-197.
    [27]Guarino N. Semantic Matching:Formal Ontological Distinctions for Information Organization, Extraction, and Integration. In:Pazienza MT,eds. Information Extraction:A Multidisciplinary Approach to an Emerging Information Technology,Springer Verlag,1997,139-170
    [28]CARDIE C. Empirical methods in information extraction [J]. AI Magazine,1997, 18(4):65-78.
    [29]Pavel Shvaiko, Jerome Euzenat.A Survey of Schema-based Matching Approaches. Journal on Data Semantics IV, LNCS3730,2005, pp.146-171.
    [30]Tim Berners-Lee, James A. Hendler. The semantic web [J]. Scientific American, 2001, 284(5): 34-42.
    [31]Merholz P. Metadata for Masses,2004[EB/OL]. [2008-04-24]. http://adaptivepath.com/publications/essays/archives/000361.php.
    [32]Brin,S. Extracting Patterns and Relations from the World Wide Web. In WebDB Workshop at 6th International Conference on Extending Database Technology.1998.
    [33]H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE:A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002.
    [34]Zhang M, Yang DQ, Deng ZH, Feng Y, Wang WQ, Zhao PX, Wu S, Wang SA, Tang SW. PKUSpace:A collaborative platform for scientific researching. In:Liu WY, Shi YC, Li Q, eds. Proc of the Int'l Conf. of Web-based Learning (ICWL 2004). LNCS 3143, Heidelberg:Springer-Verlag,2004.120-127.
    [35]Klink S, Dengel A, Kieninger T. Rule-Based document structure understanding with a fuzzy combination of layout and textual features. Int'l Journal on Document Analysis and Recognition,2001,4(1):18-26.
    [36]Kim J, Le DX, Thoma GR. Automated labeling algorithms for biomedical document images. In: Proc. of the 7th World Multiconference on Systemics, Cybernetics and Informatics. Orlando: IIIS, 2003. 352-357.
    [37]Hitchcock S, Carr L, Jiao Z, Bergmark D, Hall W, Lagoze C, Harnad S. Developing services for open eprint archives:Globalisation, integration and the impact of links. In:Proc. of the 5th ACM Conf. on Digital Libraries (ACMDL 2000). New York:ACM Press,2000.143-151.
    [38]Bikel DM, Miller S, Schwartz R, Weischedel R. Nymble:A high performance learning name finder. In:Proc. of the 5th Conf. on Applied Natural Language Processing (ANLC'97). San Francisco:Morgan Kaufmann Publishers,1997. 194-201.
    [39]McCallum A, Freitag D, Pereira F. Maximum entropy Markov models for information extraction and segmentation. In:Langley P, ed. Proc. of the Int'l Conf. on Machine Learning (ICML 2000). San Francisco:Morgan Kaufmann Publishers,2000. 591-598.
    [40]Seymore K, McCallum A, Rosenfeld R. Learning hidden Markov model structure for information extraction. In: Califf ME, Freitag D, Kushmerick N, Muslea I, eds. Proc. of the AAAI'99 Workshop on Machine Learning for Information Extraction. Cambridge: MIT Press, 1999. 37-42.
    [41]Stitson MO, Weston JAE, Gammerman A, Vovk V, Vapnik V. Theory of support vector machines. Technical Report, CSD-TR-96-17, London:University of London, 1996.
    [42]Lafferty J, McCallum A, Pereira F. Conditional random fields:Probabilistic models for segmenting and labeling sequence data. In:Brodley C, Danyluk A, eds. Proc. of the Int'l Conf. on Machine Learning (ICML 2001). San Francisco:Morgan Kaufmann Publishers,2001.282-289.
    [43]Peng F, McCallum A. Accurate information extraction from research papers using conditional random fields. In:Dumais S, Marcu D, Roukos S, eds. Proc. of the Human Language Technology Conf. and North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2004). New York:ACM Press,2004. 329-336.
    [44]F.Ciravegna, D.Petrelli, User Involvement in Adaptive Information Extraction: Position Paper, In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence(IJCAI-01), August,2001
    [45]Leek T.R., Information extraction using hidden Markov models. Master's thesis, UC San Diego,1997.
    [46]M. E. Califf and R. J. Mooney, "Bottom-up relational learning of pattern matching rules for information extraction," Journal of Mach. Learn. Res., vol.4, pp. 177-210,2003.
    [47]Uschold, M. Where are the semantics in the semantic web?.AI Magazine,2005, 24(3):25-36
    [48]Sheth, A., Ramakrishnan, C., and Thomas, C. Semantics for the semantic web: The implicit, the formal and the powerful. Journal on Semantic Web&Information Systems,2005,1(1):1-18
    [49]Jack. A Cognitive Analysis of Tagging[EB/OL]. [2008-05-16]. http://blog.jackvinson.com/archives/2005/10/01/a_cognitive_analysis_of_tagging.html
    [50]Steels L. Collaborative Tagging as Distributed Cognition[J]. Pragmatics and Cognition,2006,14(2):287-292.
    [51]Veres C. Concept Modeling by the Masses:Folksonomy Structure and Interoperability [J]. Lecture Notes in Computer Science,2006:325-338
    [52]Zauder K, Lazic JL, Zorica MB. Collaborative Tagging Supported Knowledge Discovery[C]. In: Proceedings of the ITI 2007 29th International Conference on Information Technology Interfaces. New York: IEEE Press, 2007: 437-442.
    [53]Macgregor G, McCulloch E. Collaborative Tagging as a Knowledge Organisation and Resource Discovery Tool[J/OL]. Library Review, 2006(55): 291-30. [2008-04-24]. http://eprints.rclis.org/archive/00005703/.
    [54]Long Jie. Research on tag-based free classification on the Internet [D]. Beijing: Peking University, 2007. (in Chinese)
    [55]Kip P M, Campbell G. Patterns and Inconsistencies in Collaborative Tagging Systems: An Examination of Tagging Practices [EB/OL]. [2008-04-24]. http://eprints.rclis.org/archive/00008315/.
    [56]Maria Vargas-Vera, Enrico Motta, et al. MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Mark-up. Semantic Authoring, Annotation & Knowledge Markup Workshop, EKAW 2002, September 30, 2002.
    [57]Siegfried Handschuh, Steffen Staab. Authoring and Annotation of Web Pages in CREAM. In Proc. of WWW 2002, Honolulu, Hawaii, USA, May 7-11, 2002.
    [58]F.Ciravegna, D.Petrelli. User Involvement in Adaptive Information Extraction: Position Paper. In Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-01), August,2001
    [59]Zauder K, Lazic JL, Zorica MB. Collaborative Tagging Supported Knowledge Discovery[C]. In:Proceedings of the ITI 2007 29th International Conference on Information Technology Interfaces. New York:IEEE Press,2007:437-442.
    [60]Santos-Neto E, Ripeanu M, Iamnitchi A. Tracking Usage in Collaborative Tagging Communities [EB/OL]. [2008-04-24]. http://www.csee.usf.edu/-anda/papers/CAMA07_ready_v2.pdf.
    [61]Fengyan Fengyu zhi IT Luopan (blog). The history of tags and an analysis of why tags became popular [EB/OL]. [2008-07-05]. http://www.kuangfeng.cn/blog/? P=92 (in Chinese)
    [62]Elke Michlmayr, Steve Cayzer. Learning User Profile from Tagging Data and Leveraging them for Personal (ized) Information Access. In Proc. of WWW 2007, May 8-12, 2007, Banff, Canada.
    [63]Shen Jie, Zhu Yan, Zhang Hui, et al. A Content-based Algorithm for Blog Ranking. International Conference on Internet Computing in Science and Engineering, pp. 19-22, November 2008.
    [64]G. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge, Massachusetts, 1949.
    [65]Robin Dhamankar, Yoonkyong Lee, AnHai Doan, et al. iMAP: Discovering Complex Semantic Matches between Database Schemas[A]. In: Proceedings of the ACM SIGMOD Conference on Management of Data[C]. Paris, France. 2004: 383-394.
    [66]Bin He, Kevin Chen-Chuan Chang. Automatic Complex Schema Matching across Web Query Interfaces: A Correlation Mining Approach[J]. ACM Transactions on Database Systems. 2006, 1(1): 1-45.
    [67]Xiaofeng Meng, Dongdong Hu, Haiyan Wang, et al. Schema-Guided Wrapper Maintenance for Web-Data Extraction [A]. In Proceedings of the 5th ACM International Workshop on Web Information and Data Management[C]. New Orleans, Louisiana, USA. 2003: 1-8.
    [68]Amit Sheth and James Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3): 183-236, September 1990.
    [69]Stephen Hayne and Sudha Ram. Multi-user view integration system (MUVIS): An expert system for view integration. In Proceedings of the 6th International Conference on Data Engineering, pages 402-409. IEEE, February 1990.
    [70]Salton G. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Publishing Company, Inc. 1989.
    [71]Miller GA. WordNet: A lexical database for English. Communications of the ACM, pp. 39-41, 1995.
    [72]S. Navathe and Peter Buneman. Integrating user views in database design. Computers, 19(1): 50-62, January 1986.
    [73]Wen-Syan Li and Chris Clifton. Using field specifications to determine attribute equivalence in heterogeneous databases. In Third International Workshop on Research Issues on Data Engineering: Interoperability in Multidatabase Systems, pages 174-177, Vienna, Austria, April 18-20, 1993. IEEE.
    [74]Doan A, Domingos P, Halevy AY. Reconciling Schemas of Disparate Data Sources:A machine-Learning Approach. In SIGMOD,2001
    [75]S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding:A Versatile Graph Matching Algorithm and Its Application to Schema Matching. In Proceedings of the 18th International Conference on Data Engineering, pages 117-128, San Jose, CA, USA, Mar.2002.
    [76]Jaewoo Kang, Jeffrey F. Naughton. On schema matching with opaque column names and data values. In ACM SIGMOD 2003, California. Pages: 205-216.
    [77]Miller, R.J. et al. The Clio Project:Managing Heterogeneity. SIGMOD Record 30(1),78-83,2001
    [78]Wen-Syan Li and Chris Clifton. Semantic integration in heterogeneous databases using neural networks.Proceedings of the 20th VLDB Conference Santiago, Chile, 1994.
    [79]P. Scheuermann, Wen-Syan Li, Chris Clifton. Multidatabase Query Processing with Uncertainty in Global Keys and Attribute Values. Journal of the American Society for Information Science. 49(3): 283-301, 1998.
    [80]M. Lenzerini. Data integration:A theoretical perspective. In Proc. PODS'02, pages 233-246. ACM,2002.
    [81]Sonia Bergamaschi, Silvana Castano. A Semantic Approach to Information Integration: the MOMIS Project[DB/OL]. http://www.sbgroup.unimo.it/prototip /paper/iceisOl.pdf.
    [82]Li Xu, David W Embley. Combining the Best of Global-as-View and Local-as-View for Data Integration [EB/OL]. http:/www.deg.byu.edu/papers/PODS. integration.pdf,2004-10.
    [83]Li Xu, David W Embley. Discovering direct and indirect matches for schema element[C]. Proceedings of the 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003).2003.
    [84]Cf.T.R.Gruber. A Translation Approach to Portable ontologies. Knowledge Acquisition,5(2),1993:199-220
    [85]Ye Yiqian, He Cundao, Liang Ningjian. General Psychology (2nd revised edition). Shanghai: East China Normal University Press, 2004. (in Chinese)
    [86]Dekang Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA, 1998: 296-304.
    [87]Ana GM, Filippo M, Fulya E, Heather R, Alessandro V. Algorithmic Computation and Approximation of Semantic Similarity. World Wide Web. 2006, (9): 1413-1417.
    [88]Do HH, Rahm E. COMA - A system for flexible combination of schema matching approaches. In: Proc. of the 28th Int'l Conf. on Very Large Data Bases (VLDB 2002). Hong Kong, 2002. 610-621.
    [89]Madhavan J, Bernstein P, Rahm E. Generic schema matching with Cupid. In: Proc. of the 27th Int'l Conf. on Very Large Data Bases (VLDB 2001). Rome: Morgan Kaufmann Publishers, Inc., 2001. 49-58.
    [90]Miller R, Haas L, Hernandez MA. Schema mapping as query discovery. In: Abbadi AE, Brodie ML, Chakravarthy S, Dayal U, Kamel N, Schlageter G, Whang KY, eds. Proc. of the 26th Int'l Conf. on Very Large Data Bases (VLDB 2000). Cairo: Morgan Kaufmann Publishers, Inc., 2000. 77-88.
    [91]Yuan, W. (1997). End-user searching behavior in information retrieval:A longitudinal study. Journal of the American Society for Information Science,48(3), 218-234.
    [92]Wolfram, D.,& Dimitroff, A. (1998). Hypertext vs. Boolean-based searching in a bibliographic database environment:A direct comparison of searcher performance. Information Processing & Management,34(6),669-679.
    [93]Zhang, X.,& Chignell, M. (2001). Assessment of the effects of user characteristics on mental models of information retrieval systems. Journal of the American Society for Information Science,52(6),445-459.
    [94]G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM, 1975. 18(11): 613-620.
    [95]W. Bruce Croft, Howard R. Turtle, and David D. Lewis. The use of phrases and structured queries in information retrieval. Proc. of ACM SIGIR, pages 32-45, October 1991.
    [96]Lisa F. Rau and Paul S. Jacobs. Creating segmented databases from free text for text retrieval. Proc. of ACM SIGIR, pages 337-346, October 1991.
    [97]Salton G. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Publishing Company, Inc. 1989.
    [98]G. G Lee, J. Seo, S. Lee, H. Jung, B.-H. Cho, C. Lee, B.-K. Kwak, J. Cha, D. Kim, J. An, H. Kim, and K. Kim, SiteQ:Engineering high performance QA system using lexicosemantic pattern matching and shallow NLP, Proceedings of TREC-10, 2001, pp.442-451.
    [99]A. Kiryakov et al, Semantic Annotation, Indexing and Retrieval. In Proc.of ISWC'2003, pp.484-499, Florida, Oct.2003
    [100]Landauer, T. K., Foltz, P. W.,& Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes,25,259-284.
    [101]Qin Bing, Liu Ting, Zhang Yu, Li Sheng. Research on Multi-Document Summarization Based on Latent Semantic Indexing. Journal of Harbin Institute of Technology,2005,12(1):91-94
    [102]Rong Zhao and W.I Grosky. Narrowing the semantic gap-improved text-based web document retrieval using visual features. IEEE Transactions on Multimedia, 4:189-200,2002.
    [103]Ding Guodong, Bai Shuo, Wang Bin. A query expansion method based on local co-occurrence [J]. Journal of Chinese Information Processing, 2006, 20(3): 84-91. (in Chinese)
    [104]Zhang Huaping. Research on shallow language analysis and sentence-level novelty detection [D]. Beijing: Graduate School of the Chinese Academy of Sciences, 2005. (in Chinese)
    [105]Zuo Jiali, Wang Mingwen, Wang Xi. An extended information retrieval model based on Markov networks. Journal of Tsinghua University (Science and Technology), 2005, vol.45, No.51: 1847-1852. (in Chinese)
    [106]McHugh J, Abiteboul S, Goldman R, et al. Lore: A Database Management System for Semistructured Data. SIGMOD Record (ACM Special Interest Group on Management of Data), 1997, 26(3): 54-66.
    [107]Quass D, Widom J, Goldman R, et al. LORE: A Lightweight Object REpository for semistructured data. Proceedings of ACM SIGMOD International Conference on Management of Data, 1996. 549.
    [108]McHugh J, Widom J. Indexing semistructured data. [Technical Report]. Stanford University, 1998.
    [109]Guo, L., Shao, F., Botev, C., Shanmugasundaram, J.: XRANK: Ranked keyword search over XML documents. SIGMOD 16-27 (2003)
    [110]Shurug Al-Khalifa, Cong Yu, H.V. Jagadish. Querying Structured Text in an XML Database. In SIGMOD 2003.
    [111]N. Fuhr and K. Großjohann. XIRQL: A Query Language for Information Retrieval in XML Documents. In Proceedings of the 24th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
    [112]Sihem Amer-Yahia, Chavdar Botev, Jayavel. TeXQuery: A Full Text Search Extension to XQuery. In WWW 2004.
    [113]Daniela Florescu, Donald Kossmann, Ioana Manolescu. Integrating Keyword Search into XML Query Processing. In WWW 2000.
    [114]Turtle, H, R., Croft, W, B.:Inference networks for document retrieval. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Brussels, Belgium, September (1990) 1-24.
    [115]Guy, M. Powell, A. and Day, A. Improving the Quality of Metadata in Eprint Archives. Ariadne 38, January 2004. http://www.ariadne.ac.uk/issue38/guy/
    [116]Witten, I.H. and Bainbridge, D. How to Build a Digital Library. Morgan Kaufmann, San Francisco, CA.2003.
    [117]Humphreys J. B. K. PhraseRate:An HTML Keyphrase Extractor. Technical report, University of California, Riverside. June 2002. http://infomine.ucr.edu/
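    [118]Yang Xuemei, Dong Yisheng, Wang Yongli, et al. Schema mapping techniques in heterogeneous data source integration [J]. Computer Science, 2006, 33(7), pp. 87-91. (in Chinese)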