Research and Implementation on User-Oriented Information Filtering

Original title (Chinese): 面向用户的信息过滤研究与实现
Author: Wu Fuying (吴福英)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: Information Filtering ; Information Retrieval ; User Interest Model ; Model Update ; Personalization
Degree year: 2004
Advisor: Wang Mingwen (王明文)
Discipline code: 081202
Degree-granting institution: Jiangxi Normal University (江西师范大学)
Thesis submission date: 2004-05-01

Abstract

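The development of computer technology, and of the Internet and networking technology in particular, has greatly advanced the spread of the World Wide Web (WWW). The WWW has profoundly changed how people live and think, and the Internet has become an indispensable source of information. However, the highly disordered and unmanageable nature of Internet information resources creates great difficulty for the people who use them. Because today's search engines are general-purpose, they struggle to satisfy the queries of users with different backgrounds, different goals, and different needs at different times.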
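    Every user's information needs are personalized. These needs are relatively stable and long-lived, yet they still change gradually over time. Users therefore naturally expect information service software that can adapt to their personalized information needs.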
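    To address this specific need for personalized information services, the goal of this thesis is to develop a tool that retrieves information resources on the WWW more accurately. The main idea is to filter information (obtained from currently popular search engines) using a user interest model and the user's background knowledge, present the filtered information to the user, and then revise the user interest model according to the user's relevance feedback so as to improve satisfaction with the information delivered. The main contributions of this thesis are: (1) a prototype system, implemented collaboratively; (2) segmentation of Chinese text using the maximum probability algorithm; (3) a scheme for describing user interest categories, in which a user's interests are described as an n-tuple C=(C_1, C_2, …, C_n), where each C_i represents one interest direction of the user, C_i=(I_p, I_n, I_q); I_p={(t_1, w_1), …, (t_n, w_n)} and I_n={(t_1, w_1), …, (t_n, w_n)} are the feature vectors of the relevant and non-relevant web page sets for that interest direction, which we call the attraction factor and the rejection factor respectively, and I_q={(t_1, w_1), …, (t_k, w_k)} is the keyword vector of the interest direction, which we call the query; (4) representation of web documents with the Vector Space Model, followed by filtering of the web documents with a linear filter; (5) user relevance feedback, used to modify and update the user interest model dynamically.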
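The interest representation C_i=(I_p, I_n, I_q) and the linear filter can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact form of the filter, so the scoring rule (query and attraction factor add to the score, rejection factor subtracts from it), the threshold value, and the names `InterestCategory`, `score`, and `accept` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A sparse term-weight vector, i.e. a set of (t_i, w_i) pairs.
Vector = Dict[str, float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two sparse term-weight vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


@dataclass
class InterestCategory:
    """One interest direction C_i = (I_p, I_n, I_q)."""
    attraction: Vector = field(default_factory=dict)  # I_p: from relevant pages
    rejection: Vector = field(default_factory=dict)   # I_n: from non-relevant pages
    query: Vector = field(default_factory=dict)       # I_q: interest keywords

    def score(self, doc: Vector) -> float:
        # Assumed linear filter: the score is linear in the document vector.
        # The thesis may weight or normalise these three terms differently.
        return dot(doc, self.query) + dot(doc, self.attraction) - dot(doc, self.rejection)


def accept(profile: List[InterestCategory], doc: Vector, threshold: float = 0.5) -> bool:
    """Pass a document if any interest category scores it at or above the threshold."""
    return any(category.score(doc) >= threshold for category in profile)
```

A user profile C is then simply a list of `InterestCategory` objects, one per interest direction, and a web page is shown to the user when `accept` returns `True`.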
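    The novel contribution of this thesis is a scheme for describing user interests, together with the revision and updating of that description through relevance feedback. Further research includes: first, other methods of selecting the attraction and rejection factors (I_p, I_n) could be tried, which may improve the precision of web page filtering; second, the parameters α, β, γ in the user interest update model, which act as user interest learning rates, require further experiments so that they adapt better to the curve along which the user's interests change.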
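The abstract does not state the update formula that uses α, β, γ. As a hedged illustration only, the sketch below substitutes the standard Rocchio-style relevance feedback rule, in which α weights the old profile vector, β the centroid of pages the user marked relevant, and γ the centroid of pages marked non-relevant. The function names and the default parameter values are assumptions, not the thesis's own.

```python
from typing import Dict, Iterable

Vector = Dict[str, float]  # term -> weight


def centroid(docs: Iterable[Vector]) -> Vector:
    """Average of a collection of sparse term-weight vectors."""
    docs = list(docs)
    if not docs:
        return {}
    total: Vector = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in total.items()}


def update_profile(
    old: Vector,
    relevant: Iterable[Vector],
    non_relevant: Iterable[Vector],
    alpha: float = 1.0,   # illustrative value, not taken from the thesis
    beta: float = 0.5,    # illustrative value, not taken from the thesis
    gamma: float = 0.25,  # illustrative value, not taken from the thesis
) -> Vector:
    """Rocchio-style update: alpha*old + beta*centroid(rel) - gamma*centroid(nonrel)."""
    positive = centroid(relevant)
    negative = centroid(non_relevant)
    updated: Vector = {}
    for term in set(old) | set(positive) | set(negative):
        weight = (
            alpha * old.get(term, 0.0)
            + beta * positive.get(term, 0.0)
            - gamma * negative.get(term, 0.0)
        )
        if weight > 0.0:  # drop terms whose weight is no longer positive
            updated[term] = weight
    return updated
```

Larger β and γ make the profile follow recent feedback more quickly, while a larger α keeps it closer to the interests learned earlier, which is the trade-off the further-work item on tuning the learning rates refers to.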
Abstract (author's English version)

To date, the World Wide Web (WWW) has grown into a large hyperlinked corpus with more than 800 million pages and 5,600 million hyperlinks. The Web contains a rich and dynamic collection of hyperlink information and of web page access and usage information, which provides rich sources of data. However, information on the Internet is disordered, and making the Internet easier to use is a real challenge. Traditional information retrieval technologies satisfy users' general needs, that is, the characteristics they have in common, but current search engines cannot satisfy a user's specific needs.
    Because every user has specific needs that are relatively stable but change slowly over time, personalized information services based on the user's interests will play an increasingly important role on the Internet.
    To serve these user-specific information needs, we aim to implement a tool that obtains information from the Internet accurately. We filter the information (which comes from the Google and Baidu search engines) based on the user's profiles, submit the filtered results to the user, and then update the user's profiles dynamically based on the user's feedback.
    The main research contributions are:
    (1) Implementing a prototype system for personalized information filtering;
    (2) Using the maximum probability method to segment Chinese documents;
    (3) Giving a scheme to represent the user's interest categories: a user's interest categories form an n-tuple C=(C1, C2, ..., Cn), where each Ci represents one interest category of the user, Ci=(Ip, In, Iq); Ip={(t1, w1), ..., (tn, wn)} captures the user's positive profile, which we call the attraction factor; In={(t1, w1), ..., (tn, wn)} captures the user's negative profile, which we call the rejection factor; Iq={(t1, w1), ..., (tk, wk)} captures the user's query keywords, which we call the query;
    (4) Filtering web documents, represented in the Vector Space Model, according to the adaptive user model;
    (5) Modifying the user's profiles based on the user's relevance feedback.
    This thesis describes a scheme to represent a user's interest categories, updates the user's profiles according to the user's relevance feedback, and implements a prototype system.
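Maximum probability segmentation of Chinese text, named in contribution (2), can be sketched as a dynamic program that picks the split whose words have the highest product of unigram probabilities. The sketch below is a minimal illustration and not the thesis's implementation: the function name `segment`, the fallback probability for unknown single characters, the maximum word length, and the toy dictionary probabilities in the usage note are all assumptions.

```python
from math import log
from typing import Dict, List


def segment(
    text: str,
    word_prob: Dict[str, float],
    max_len: int = 4,
    unknown_prob: float = 1e-8,
) -> List[str]:
    """Split text into the word sequence with the highest product of word probabilities."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob:
                prob = word_prob[word]
            elif len(word) == 1:
                prob = unknown_prob   # assumed fallback so every text can be split
            else:
                continue              # multi-character strings must be dictionary words
            candidate = best[start] + log(prob)
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words: List[str] = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]
```

With a toy dictionary such as `{"研究": 0.02, "研究生": 0.005, "生命": 0.01, "命": 0.001, "起源": 0.005}`, calling `segment("研究生命起源", ...)` prefers 研究 / 生命 / 起源 over 研究生 / 命 / 起源, because the first split has the larger probability product.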
