Research and Implementation of a Lucene-Based Search Engine
Abstract
As information technology has advanced, the Internet has grown rapidly, and the search engine has become the tool people use most often online, indispensable for everyday study, work, and leisure. Everyone knows that a search engine can quickly find the material or information one is looking for, but what exactly is a search engine? On the Internet the term usually refers to a full-text search engine: a system that collects from several billion to more than ten billion web pages, indexes every word (that is, every keyword) in those pages, and builds an index database. When a user searches for a keyword, every page whose content contains that keyword is returned as a search result. A complex ranking algorithm then sorts the results, and they are presented to the user in descending order of relevance to the search keyword.
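The indexing and lookup described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's API and not the thesis's implementation: the names `build_index` and `search`, the sample pages, and the use of raw term frequency as a stand-in for the "complex ranking algorithm" are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Map each term to {page_id: number of occurrences in that page}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        # Whitespace tokenization is an assumption; real engines use analyzers.
        for term, count in Counter(text.lower().split()).items():
            index[term][page_id] = count
    return index


def search(index, keyword):
    """Return ids of pages containing the keyword, most occurrences first."""
    postings = index.get(keyword.lower(), {})
    # Term frequency stands in for relevance; ties are broken by page id.
    return sorted(postings, key=lambda pid: (-postings[pid], pid))


# Hypothetical sample pages.
pages = {
    "a": "lucene is a search library",
    "b": "a search engine ranks search results",
    "c": "chinese word segmentation",
}
index = build_index(pages)
print(search(index, "search"))  # ['b', 'a']
```

Page "b" ranks first because it contains the keyword twice; a production engine would replace this count with a scoring model that also weighs how rare the term is across the collection.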
     This thesis first reviews the current state of search engine development. Since the 1990s, as Internet-based informatization has progressed, finding needed information among vast network resources has become increasingly difficult, and most people rely heavily on search engines to obtain useful information. As the most typical web information retrieval technology, search engine technology therefore directly affects the quality of the information people obtain. The thesis next introduces the characteristics and classification of search engines, discusses their working principles and related technologies such as web robots, and analyzes the system architecture of Google, a mainstream search engine. On this basis it describes the open-source project Lucene: its history, applications, characteristics, system structure, and index file format. It then studies several key search engine technologies. Because pages on web sites are updated frequently, many become outdated or cease to exist over time; by analyzing how a web robot fetches pages, the thesis proposes an incremental page change model for web robots. Finally, it analyzes common Chinese word segmentation algorithms, segmentation ambiguity, and unregistered (out-of-vocabulary) words.
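To make the segmentation issues mentioned above concrete, here is a small sketch of forward maximum matching, one widely used dictionary-based approach to Chinese word segmentation. The abstract does not say which algorithms the thesis covers, so the choice of this algorithm is an assumption, as are the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation against a word dictionary."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so unregistered
            # (out-of-vocabulary) text is emitted character by character.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dictionary))  # ['研究生', '命', '起源']
```

The output shows a segmentation ambiguity: the greedy match takes 研究生 ("graduate student") and strands 命, whereas the intended reading is 研究 / 生命 / 起源 ("study the origin of life"). Resolving cases like this is what disambiguation methods aim at.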
