Research on Chinese Text Categorization Based on Concept Features
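Abstract
Text categorization (TC) is the process of assigning a natural-language text to one of a set of predefined categories according to its topic. It is a foundational task in natural language processing and has been an active research topic in recent years.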
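Concept-based categorization maps keywords into a concept space and uses concepts as the classification features. Several synonyms then correspond to a single concept, while a polysemous word is mapped to different concepts in different contexts. This makes the features more cohesive, overcomes the shortcomings of keyword-based categorization, and improves classification accuracy.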
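The mapping described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon `CONCEPTS`, the context cues `CUES`, and the overlap-based disambiguation rule are hypothetical stand-ins, not the knowledge base or the procedure used in this work.

```python
from collections import Counter

# Hypothetical toy lexicon: each word lists its candidate concepts.
# A real system would draw these from a knowledge base such as HowNet.
CONCEPTS = {
    "computer": ["COMPUTER"],
    "PC": ["COMPUTER"],            # synonym of "computer": same concept
    "apple": ["FRUIT", "COMPANY"],  # polysemous word: two candidate concepts
    "juice": ["DRINK"],
    "iPhone": ["PHONE"],
}

# Hypothetical context cues used to choose a concept for ambiguous words.
CUES = {
    "FRUIT": {"juice", "eat", "sweet"},
    "COMPANY": {"iPhone", "PC", "computer"},
}


def map_to_concepts(tokens):
    """Map tokens to concepts and return the concept counts for one document."""
    context = set(tokens)
    concepts = []
    for tok in tokens:
        candidates = CONCEPTS.get(tok)
        if not candidates:
            continue  # words missing from the lexicon are dropped in this sketch
        # Pick the candidate whose cue words overlap the document context most;
        # ties fall back to the first candidate listed.
        best = max(candidates, key=lambda c: len(CUES.get(c, set()) & context))
        concepts.append(best)
    return Counter(concepts)
```

In this sketch, `map_to_concepts(["computer", "PC"])` counts the concept `COMPUTER` twice, and `"apple"` resolves to `FRUIT` next to `"juice"` but to `COMPANY` next to `"iPhone"`.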
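Using the HowNet (知网) knowledge system, this paper maps features from the keyword space to the concept space and implements a concept-based text categorization system. Three new ideas are proposed in the course of building the system. First, alongside the concept mapping of ordinary keywords, it observes that proper nouns also exhibit polysemy (one name, several meanings) and synonymy (several names, one meaning), and it resolves the concept mapping of proper nouns with the help of a proper-noun dictionary. Second, it modifies the traditional TFIDF algorithm for computing the weights of concept features, and the modified method gave fairly good results in experiments. Third, it compares several common text categorization algorithms, analyzes why the naive Bayes algorithm performs poorly on small sample sets, and adjusts the algorithm to obtain better results.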
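For orientation, the traditional TFIDF weighting that serves as the starting point can be sketched over concept features as follows. This is only the textbook baseline, weight = tf × log(N / df). The modified weighting proposed in this work is not specified in the abstract and is not reproduced here. The function name `tfidf` and the input layout are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute baseline TF-IDF weights.

    docs: list of documents, each a list of concept labels.
    Returns one {concept: weight} dict per document.
    """
    n = len(docs)
    # Document frequency: number of documents in which each concept occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency of each concept in this document
        weights.append({c: tf[c] * math.log(n / df[c]) for c in tf})
    return weights
```

A concept that occurs in every document gets weight 0 under this baseline, which is one reason weighting schemes are often adjusted in practice.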
Text categorization is the procedure by which a computer automatically assigns a text to categories that have been defined in advance. It is a hot topic in our research area and a basic task in natural language processing. Concept-based text categorization uses concepts instead of terms as features, so that terms with the same meaning are linked to the same concept, and a term that has different meanings in different areas is divided into several different concepts. Concepts are therefore better features than terms, and the method overcomes the weakness of using terms as features directly. In this paper we use the HowNet knowledge system to map features from the keyword space to the concept space. In our text categorization system we put forward three new ideas. First, we found that the names of people and institutions also have synonym relations, and that some names appear in varied areas, so these names need to be mapped to the concept space too. Second, we improved the weight computation method and show that it works better than the standard TFIDF method. Third, we modified the naive Bayes algorithm and obtained better results than before.
