Research on an Ontology-Based Text Classification Model

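English title: Research on a Concept Vector Model of Documents Based on Ontology
Author: Deng Shuang (邓爽)
Thesis level: Master's
Discipline: Computer Application Technology
Chinese keywords (translated): Text classification ; Ontology ; Concept hierarchy ; Feature selection ; Concept space model ; SVM ; KNN
English keywords: Text classification ; Ontology ; Concept hierarchy ; Feature selection ; Concept vector model (CVM) ; SVM ; KNN
Degree year: 2007
Advisor: Peng Hong (彭宏)
Discipline code: 081203
Degree-granting institution: Xihua University (西华大学)
Submission date: 2007-05-01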

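Abstract

Over the past decade or so, techniques for automatically assigning texts to predefined categories have advanced considerably, driven mainly by the rapid growth in the number of documents stored in digital form and the resulting need to organize them effectively for use. The process relies chiefly on machine learning: a classifier is built automatically by learning the characteristics of each category from a previously constructed training corpus.
     Traditional text classification methods all represent text with the vector space model and use keywords as features. However, the vector space representation rests on what the thesis calls the Bayesian assumption, namely that words have no semantic connection to one another. In real text, word usage is often related, for example through synonymy and hypernym-hyponym relations. Moreover, although keywords are a simple and intuitive way to express the features of a text, they have inherent limitations, chiefly two: they carry too little category information, and the dimensionality is so high that the data become sparse. Using feature strings as category features alleviates the first problem to some extent but further aggravates data sparsity. The second problem is usually addressed by dimensionality reduction, which removes features that have little or no effect on the classification result and represents the text with the remaining ones. Depending on the nature of the resulting features, dimensionality reduction methods fall into (1) feature selection, where the result is a subset of the original features, and (2) feature extraction, where it is not. Concept-based text classification uses concepts as features and maps features from word space to concept space, so that several synonyms correspond to one concept while a polysemous word is mapped to different concepts in different contexts. This increases the cohesion of the features, overcomes the shortcomings of keyword-based classification, and improves classification accuracy.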
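The mapping from word space to concept space described above can be illustrated with a small sketch. The lexicon `WORD_TO_CONCEPT` and the concept labels are hypothetical and stand in for an ontology lookup; the sketch assigns each word a single concept and does not attempt the context-dependent handling of polysemous words that the abstract mentions.

```python
from collections import Counter

# Hypothetical toy lexicon: each surface word maps to one concept label.
# A real system would obtain this mapping from an ontology or thesaurus.
WORD_TO_CONCEPT = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "truck": "VEHICLE",
    "doctor": "PHYSICIAN",
    "physician": "PHYSICIAN",
}

def word_features(tokens):
    """Bag-of-words counts: one dimension per distinct word."""
    return Counter(tokens)

def concept_features(tokens):
    """Bag-of-concepts counts: words sharing a concept share a dimension.
    Words missing from the lexicon are simply dropped in this sketch."""
    return Counter(WORD_TO_CONCEPT[t] for t in tokens if t in WORD_TO_CONCEPT)

tokens = ["car", "automobile", "truck", "doctor", "physician"]
words = word_features(tokens)
concepts = concept_features(tokens)
print(len(words), "word dimensions ->", len(concepts), "concept dimensions")
```

Here five distinct words collapse into two concept dimensions, which is the sense in which synonyms correspond to one concept and the feature space becomes denser.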
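     The research work of this thesis mainly covers the following aspects:
     1. An ontology-based text classification model is established.
     2. A method for obtaining concept features from an ontology is proposed.
     3. Concept space is used in place of word space, corresponding methods for computing weights and similarity are proposed, and a concept vector space model is established.
     4. The K-nearest-neighbor method and the support vector machine classifier are discussed, and the idea of the concept vector space model is applied to both classifiers.
     5. Simulation experiments for the new method are presented. The results show that, compared with keyword-based text classification, concept-based text classification holds a clear advantage in precision, recall, and F1 measure.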
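For readers unfamiliar with the three evaluation measures named in item 5, the following sketch computes them for a single category using the standard definitions. The function name and the sample labels are illustrative only and are not taken from the thesis or its experiments.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and F1 for one category, treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up labels for illustration: 2 true positives, 1 false positive, 1 false negative.
truth = ["sports", "sports", "sports", "finance"]
guess = ["sports", "sports", "finance", "sports"]
p, r, f1 = precision_recall_f1(truth, guess, positive="sports")
print(p, r, f1)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.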
The automated classification of texts into pre-specified categories has progressed rapidly over the last ten years, owing to the increased availability of documents in digital form and the ensuing need to organize them. Machine learning techniques are used in this process to build a classifier automatically by learning the characteristics of the categories from a set of previously classified documents.
     The vector space model (VSM) is a conventional text classification model that represents documents as vectors in a multidimensional space. Once keywords have been extracted from a document collection, each document is represented as a vector of weighted keyword frequencies. In the traditional VSM, relevance is judged on the basic assumption that documents are related to each other only if they share keywords. The difficulty is that, on the one hand, most keywords have multiple meanings and, on the other hand, some concepts can be described by more than one keyword. In addition, traditional text categorization uses the keywords occurring in a document to determine its class, and this has two main flaws: keywords carry little category information, and the high dimensionality causes data sparsity. Phrases can be used to relieve the first problem, but they aggravate the second. The usual remedy for the second problem is dimensionality reduction (DR), which removes features that have little or no effect and represents the text with the remaining ones. According to the nature of the resulting terms, DR is of two types: (1) term selection, where the resulting terms are a subset of the original terms; and (2) term extraction, where they are not. Concept-based text categorization (TC) uses concepts rather than keywords as feature items and takes into account the hypernymy-hyponymy relations between synonym sets. This approach retains most of the text's information and addresses both problems at once.
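The shared-keyword assumption of the traditional VSM can be made concrete with a minimal sketch. It uses raw term frequencies and cosine similarity, which is one common choice and not necessarily the weighting used in the thesis; the sample documents are invented.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Represent a document as a sparse vector of raw keyword frequencies."""
    return Counter(tokens)

def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_a = tf_vector(["car", "engine"])
doc_b = tf_vector(["automobile", "motor"])  # same topic, different words
doc_c = tf_vector(["car", "motor"])         # shares one keyword with doc_a

print(cosine(doc_a, doc_b))  # no shared keywords, so similarity is zero
print(cosine(doc_a, doc_c))
```

Documents `doc_a` and `doc_b` describe the same thing with synonyms yet score zero, which is the weakness that a concept-based representation is meant to address.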
     The main contributions of this thesis are as follows:
     1. We establish an ontology-based text categorization model.
     2. We propose an ontology-based method for obtaining concepts.
     3. Keywords are matched against the attribute terms of the concepts in the given ontology, and exact matches are required. A weight for each concept is defined from the number of matching terms. We also consider applying the proposed theory to computing the degree of similarity between documents within a fixed domain. Together these constitute the concept vector model.
     4. We introduce KNN and SVM and implement them for the proposed document classification.
     We tested the proposed model empirically on documents to demonstrate the general applicability of the method. The experimental results show that a domain ontology can be incorporated to assist document classification. For some data sets, the concept vector model (CVM) is more effective than the term-based vector space model (VSM). Moreover, a performance comparison of SVM and KNN on the CVM shows that SVM outperforms KNN, and SVM training is thus performed over the reduced training set.
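The concept weighting in item 3 and the KNN classifier in item 4 can be sketched together as follows. Only the idea of weighting a concept by its number of exact keyword matches comes from the abstract; the toy `ONTOLOGY`, the cosine similarity, the majority vote, and the sample documents are all assumptions made for illustration, and the thesis may define weights and similarity differently.

```python
import math
from collections import Counter

# Hypothetical domain ontology: concept -> set of attribute terms.
ONTOLOGY = {
    "Vehicle": {"car", "truck", "engine", "wheel"},
    "Medicine": {"doctor", "drug", "hospital", "patient"},
}

def concept_vector(keywords):
    """Weight each concept by how many keywords exactly match its attribute terms."""
    return {c: sum(1 for k in keywords if k in terms) for c, terms in ONTOLOGY.items()}

def cosine(u, v):
    """Cosine similarity between two concept vectors (assumed similarity measure)."""
    dot = sum(u[c] * v.get(c, 0) for c in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(query, training, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented training documents, already reduced to keyword lists.
training = [
    (concept_vector(["car", "engine", "wheel"]), "auto"),
    (concept_vector(["truck", "wheel"]), "auto"),
    (concept_vector(["doctor", "patient", "drug"]), "health"),
    (concept_vector(["hospital", "patient"]), "health"),
]
query = concept_vector(["truck", "engine", "road"])  # "road" matches no concept
predicted = knn_predict(query, training, k=3)
print(query, predicted)
```

Because every document is reduced to one weight per concept, the vectors have as many dimensions as the ontology has concepts, far fewer than a keyword vocabulary would produce.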
