Research on Chinese Personal Name Recognition Based on the Conditional Random Fields Model
Abstract
Chinese personal name recognition (CPNR) is a key subtask of Chinese named entity recognition (NER) and is widely used in information retrieval, information extraction, machine translation, and related fields. Personal names account for a large share of named entities, and because their structure is complex and their forms are diverse, recognizing them has long been a difficult problem in Chinese information processing.
     Building on previous work, this paper uses the conditional random fields (CRFs) model together with discourse-level information to recognize Chinese personal names. The main work and features of the paper are as follows:
     (1) The CRFs model is described in detail and compared with other machine-learning models. CRFs are among the stronger conditional probability models currently available: they relax the independence assumptions made by generative models and avoid the label bias problem of directed graphical models, while retaining the advantages of both.
     (2) A personal name may occur several times in the same document, each time in a different context. Occurrences with rich context are readily recalled by the model, whereas occurrences with little context may be missed. Using discourse information, the names recognized by the CRFs model are collected into a name dictionary, which drives a second recognition pass to further improve the performance of Chinese personal name recognition.
     The approach also applies to other named entities such as Chinese location names and organization names, and the experimental results show that the proposed method is effective.
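The second recognition pass described in item (2) can be illustrated with a minimal sketch. The abstract does not say how the dictionary lookup is implemented, so everything below is an assumption made for illustration: the function name `second_pass`, the representation of first-pass CRF output as character-offset spans, and the longest-name-first exact string matching are all hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # [start, end) character offsets into the document


def second_pass(text: str, first_pass: List[Span]) -> List[Span]:
    """Propagate first-pass names to untagged occurrences in the same document.

    `first_pass` stands in for the spans a CRF tagger returned (hypothetical
    input format; the tagger itself is not modelled here).
    """
    # Document-level name dictionary built from the first-pass results.
    names: Set[str] = {text[s:e] for s, e in first_pass if e > s}

    # Mark characters that already belong to a recognized name.
    covered = [False] * len(text)
    for s, e in first_pass:
        for i in range(s, e):
            covered[i] = True

    result = list(first_pass)
    # Try longer names first so a full name wins over a shorter one inside it.
    for name in sorted(names, key=len, reverse=True):
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            if not any(covered[start:end]):
                result.append((start, end))
                for i in range(start, end):
                    covered[i] = True
            start = text.find(name, start + 1)
    return sorted(result)


if __name__ == "__main__":
    doc = "王小明说他很忙。后来王小明走了。"
    # Suppose the first pass found only the first occurrence of the name.
    print(second_pass(doc, [(0, 3)]))
```

Exact string matching of this kind can introduce false positives when a dictionary name coincides with an ordinary character sequence, so a real system would likely filter or rescore the candidates it proposes.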
