A Reference Record Analysis Algorithm Based on Conditional Random Fields
Abstract
With the rapid development of the Internet, web pages have become a main source of information. To help publishing organizations promptly find the references they need among large numbers of web pages, an algorithm is needed that automatically extracts reference information from HTML pages. To this end, this paper designs a reference record analysis algorithm based on conditional random fields. First, a segmentation algorithm for the document object tree is designed; it uses segmentation markers to divide the page data into independent blocks, each consisting of a sequence of tags and text. Next, these sequences serve as the feature vectors of a conditional random field model, from which a reference information labeling model is built. Finally, a heuristic algorithm is designed to extract reference information data from the labeling model, and its effectiveness is verified experimentally.
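The segmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the segmentation markers or the features used, so the set `SEPARATOR_TAGS`, the class `BlockSegmenter`, and the helpers `segment` and `token_features` are all hypothetical names chosen for illustration. The sketch uses only Python's standard `html.parser` module and shows how a page could be cut into blocks of tag and text tokens, with a per-token feature dictionary of the general kind that CRF toolkits often take as input.

```python
from html.parser import HTMLParser

# Assumption: these tags act as segmentation markers between records.
# The abstract does not say which markers the authors actually use.
SEPARATOR_TAGS = {"li", "p", "tr", "br", "div"}


class BlockSegmenter(HTMLParser):
    """Split a page into blocks, each a sequence of (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks
        self._current = []  # tokens of the block being built

    def _flush(self):
        # Close the current block, keeping it only if it contains text.
        if any(kind == "text" for kind, _ in self._current):
            self.blocks.append(self._current)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SEPARATOR_TAGS:
            self._flush()
        else:
            self._current.append(("tag", tag))

    def handle_endtag(self, tag):
        if tag in SEPARATOR_TAGS:
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._current.append(("text", text))


def segment(html):
    """Return the list of token blocks found in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing block that no separator closed
    return parser.blocks


def token_features(block, i):
    """Hypothetical feature dict for token i of a block."""
    kind, value = block[i]
    return {
        "kind": kind,
        "value": value if kind == "tag" else "",
        "has_digit": any(ch.isdigit() for ch in value),
        "prev_kind": block[i - 1][0] if i > 0 else "BOS",
        "next_kind": block[i + 1][0] if i < len(block) - 1 else "EOS",
    }


if __name__ == "__main__":
    page = (
        "<ul><li><b>Li Hang</b>. Statistical learning method, 2012.</li>"
        "<li><i>Zhang Li</i>. ACTA EDITOLOGICA, 2005.</li></ul>"
    )
    for block in segment(page):
        print([token_features(block, i) for i in range(len(block))])
```

In a full pipeline, the feature sequences of each block would be passed to a trained CRF model that assigns a label (for example author, title, or year) to every token, and a heuristic pass would then assemble the labeled tokens into reference records. Those two stages are omitted here because the abstract gives no detail about them.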
