Research and Implementation of Mining Customers' Reviews (客户评价挖掘算法研究与实现)

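English title: Research and Implementation of Mining Customers' Reviews
Author: Liu Ning (刘宁)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 文本挖掘 ; 关联规则 ; 观点词提取 ; 极性识别
English keywords: Text mining ; Association rules ; Opinion word extraction ; Polarity identification
Degree year: 2009
Advisor: Zhou Chunguang (周春光)
Discipline code: 081203
Degree-granting institution: Jilin University (吉林大学)
Thesis submission date: 2009-04-01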

Abstract

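Business intelligence is the process of turning the data stored in various business information systems into useful information. Data mining is an important means of carrying out this process: it extracts implicit, previously unknown and potentially useful information from large databases or data warehouses. Association rule mining in particular can uncover important latent information in a database and has broad application prospects in the commercial domain, so combining association rule mining with business applications has long been an active and closely watched research area.
     This thesis starts from customer reviews of a given product and applies a word segmentation module to segment the raw reviews and tag the attributes of the resulting words. It then applies association rule mining to find the frequent features in the reviews and, taking those frequent features as the basis, mines the opinion words about each frequent feature and their orientation, judges how satisfied customers are with the product, and so provides manufacturers with valuable information. While applying association rule mining, the thesis also makes a preliminary study of the technique and implements some of its algorithms, chiefly the Apriori algorithm and the FP-tree frequent-set algorithm, with corresponding improvements and comparisons. In particular, it improves the data-reading process of the Apriori algorithm, raising the algorithm's running speed without sacrificing quality.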
With the rapid development of e-commerce, online shopping is now familiar to many people, and more and more of them can buy the goods they want without leaving home. To serve consumers better and improve their shopping experience, many merchants provide a review platform on their websites, so that consumers can promptly feed their comments back to merchants and to potential buyers. However, as the number of product reviews grows exponentially, reading all of them in order to reach a decision becomes very difficult, so an effective method for mining product reviews is urgently needed.
     The development of business intelligence has laid a solid foundation for the spread of e-commerce. Business intelligence refers to the technology that turns the data stored in various commercial information systems into useful information. It lets business users query and analyze databases to identify the key factors that influence commercial activity, and ultimately to make better and more rational decisions and business strategies, so that an enterprise can gain the greatest possible competitive advantage in a rapidly changing, competitive market. Online analytical processing and data mining are tools that help companies achieve this goal at different levels.
     The work in this thesis belongs to the research and application of data mining. Data mining (DM) refers to extracting knowledge of interest from large databases or data warehouses; such knowledge is implicit, previously unknown and potentially useful information. It draws on the theory and techniques of databases, artificial intelligence, machine learning, statistics and other fields, and is a new area of database research with promising application value. Because data mining tools can analyze data in depth and predict future trends and behavior, they are an important component of business intelligence. Association rule mining has long been a hot topic and a focus of data mining research. An association rule describes the interdependence between one item and other items, and mining association rules can generally be divided into two steps:
     1. Find all frequent itemsets, that is, all itemsets whose support is greater than or equal to the minimum support threshold;
     2. From the frequent itemsets, generate the association rules that satisfy the minimum confidence threshold.
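The two steps above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration of the general Apriori idea, not the thesis's implementation; the function names `frequent_itemsets` and `association_rules` and the toy transactions are assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Step 1: return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that occurs anywhere.
    current = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-candidates and
        # prune any candidate that has an infrequent k-subset.
        k += 1
        current = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                current.add(union)
    return result


def association_rules(supports, min_confidence):
    """Step 2: return (antecedent, consequent, confidence) for rules meeting min_confidence."""
    rules = []
    for itemset, support in supports.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for lhs in combinations(itemset, size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is frequent, so it is in `supports`.
                confidence = support / supports[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Hypothetical toy data: each transaction is the set of terms in one review.
reviews = [
    {"screen", "battery"},
    {"screen", "price"},
    {"screen", "battery", "price"},
    {"battery"},
]
supports = frequent_itemsets(reviews, min_support=0.5)
rules = association_rules(supports, min_confidence=0.9)
```

Support here is the fraction of transactions containing an itemset, and confidence is the support of the whole itemset divided by the support of the antecedent.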
     Customer review mining is an emerging research area whose theory and techniques are still far from mature. This thesis combines Chinese word segmentation, text mining and association rule mining. It first introduces text mining from the standpoint of its application to reviews, describes current text mining techniques in detail, and presents the text classification process, namely: quantifying the text, training the classifier, classifying, and evaluating the classification results. It then describes four text classification techniques in detail (the simple vector distance classifier, KNN, Naive Bayes and the support vector machine) and compares their advantages and disadvantages.
     Building on these text mining concepts and techniques, the thesis focuses on association rule mining and on implementing the customer review mining algorithm. For association rule mining, both the Apriori algorithm and the FP-Growth algorithm are implemented. During implementation the input process of the Apriori algorithm was optimized, and experimental comparison shows that this significantly improves Apriori's running efficiency. The Apriori algorithm, before and after optimization, and the FP-Growth implementation were then compared in detail on running time, accuracy and coverage; on the basis of these experiments, FP-Growth, which performed better, was selected as the mining algorithm for the customer review mining module.
     The other focus of this thesis, and its ultimate goal, is the implementation of the customer review mining algorithm. To mine customer reviews, the raw reviews must first be split, that is, the original review sentences must be segmented into words. Most existing segmentation techniques target English, while Chinese word segmentation is still developing rapidly. This thesis uses ICTCLAS, the Chinese lexical analysis system of the Chinese Academy of Sciences, as its Chinese word segmentation module. ICTCLAS segments at 996 KB/s on a single machine with a segmentation accuracy of 98.45%; its API does not exceed 200 KB, and its dictionary data is under 3 MB when compressed. The thesis describes it as the best Chinese lexical analyzer available at the time. The demonstration on the specific reviews used in this thesis also shows that ICTCLAS handles the Chinese word segmentation task very well.
     With Chinese word segmentation in place, the thesis studies and implements the three main steps of the customer review mining algorithm: frequent feature identification, opinion word extraction and polarity identification, and polarity identification of review sentences. Frequent feature identification finds, in the review sentences, the product attributes that consumers care about most; this step uses the FP-Growth algorithm. A feature may be a single term or a phrase composed of several terms; here a feature is defined as a phrase of at most three terms. Single terms and noun phrases whose terms are not separated are identified in a first pass of association rule mining, and phrases whose terms are separated are found in a second pass. Once the candidate set of frequent features has been identified, it is pruned to reduce the algorithm's error.
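As a rough illustration of the first pass described above, the following Python sketch collects single nouns and runs of adjacent nouns of up to three terms from part-of-speech-tagged sentences and keeps those that occur in enough sentences. It uses plain counting in place of FP-Growth and ignores phrases whose terms are separated. The tag prefix `"n"` for nouns, the function names and the English sample tokens are assumptions for the example, not details taken from the thesis.

```python
from collections import Counter

MAX_TERMS = 3  # a feature is limited to at most three terms


def candidate_features(tagged_sentence, noun_tag="n"):
    """Collect single nouns and runs of adjacent nouns (up to MAX_TERMS) in one sentence."""
    words = [word for word, _ in tagged_sentence]
    is_noun = [tag.startswith(noun_tag) for _, tag in tagged_sentence]
    found = set()
    for start in range(len(words)):
        for length in range(1, MAX_TERMS + 1):
            end = start + length
            # Stop extending as soon as the run leaves the sentence or hits a non-noun.
            if end > len(words) or not all(is_noun[start:end]):
                break
            found.add(tuple(words[start:end]))
    return found


def frequent_features(tagged_sentences, min_support):
    """Keep candidates whose sentence-level support reaches min_support."""
    counts = Counter()
    for sentence in tagged_sentences:
        # A set per sentence, so each candidate counts once per sentence.
        counts.update(candidate_features(sentence))
    n = len(tagged_sentences)
    return {feat: c / n for feat, c in counts.items() if c / n >= min_support}


# Hypothetical tagged input: (word, part-of-speech tag) pairs.
sentences = [
    [("battery", "n"), ("life", "n"), ("is", "v"), ("long", "a")],
    [("battery", "n"), ("life", "n"), ("short", "a")],
    [("screen", "n"), ("is", "v"), ("bright", "a")],
]
features = frequent_features(sentences, min_support=0.5)
```

Pruning of the resulting candidate set, which the abstract mentions, is left out of this sketch.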
     Extracting opinion words and determining their polarity is the key to whether the customer review mining algorithm succeeds or fails. An opinion word is a word or phrase with which a reviewer expresses a view (positive or negative) of a product feature. Determining the polarity of an opinion word means judging whether the word is commendatory or derogatory, and it is the most important prerequisite for determining the semantic orientation of a review sentence. This thesis selects, as the opinion word of a frequent feature, the descriptive word that is nearest to the feature and follows it, and found in practice that this method is relatively accurate. To determine polarity, opinion words are divided into two classes, commendatory and derogatory, and a table "dic" is created in SQL Server 2000. The table has two attributes, "word" and "pos", which store the opinion word and its orientation respectively. The numbers "1" and "-1" represent commendatory and derogatory orientation. To determine the orientation of an opinion word, the table "dic" is consulted first. If the word is found, its stored orientation is used. If it is not found, manual identification is required: if the user knows the word's orientation, the word is added to "dic" and marked with that orientation; otherwise, if its orientation cannot be determined, the word is discarded.
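The lookup procedure above can be sketched as follows. The thesis stores the table in SQL Server 2000; this example substitutes Python's built-in `sqlite3` module with an in-memory database so that it is self-contained. The table and column names ("dic", "word", "pos") come from the abstract, while the function names and the `ask_user` callback that stands in for manual identification are assumptions.

```python
import sqlite3


def open_dictionary():
    """Create an in-memory stand-in for the "dic" table (word, pos)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dic (word TEXT PRIMARY KEY, pos INTEGER NOT NULL)")
    return conn


def polarity(conn, word, ask_user=None):
    """Return 1 (commendatory), -1 (derogatory), or None if the word is discarded."""
    row = conn.execute("SELECT pos FROM dic WHERE word = ?", (word,)).fetchone()
    if row is not None:
        return row[0]  # found in the table: use the stored orientation
    # Not found: fall back to manual identification (hypothetical callback).
    label = ask_user(word) if ask_user else None
    if label in (1, -1):
        # The user knows the orientation: remember it for later lookups.
        conn.execute("INSERT INTO dic (word, pos) VALUES (?, ?)", (word, label))
        return label
    return None  # orientation cannot be determined: discard the word


conn = open_dictionary()
conn.execute("INSERT INTO dic (word, pos) VALUES (?, ?)", ("good", 1))
known = polarity(conn, "good")
learned = polarity(conn, "awful", ask_user=lambda w: -1)
```

After the second call, "awful" is stored in the table, so later lookups no longer need the callback.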
     The final chapter implements polarity determination for review sentences. Like the polarity of an opinion word, the polarity of a sentence is a judgment of whether the review sentence is commendatory or derogatory. Review sentences are divided into three classes according to the opinion words they contain: sentences in which the majority of opinion words share one polarity; sentences that contain equal numbers of commendatory and derogatory opinion words; and all cases other than these two. For the first class, the polarity of the review sentence is determined from the algebraic sum of the polarities of its opinion words. For the second class, the effective opinion words of all frequent features in the review are found, and the sentence polarity is determined from these effective opinion words. For the third class, the sentence is tagged with the polarity of the preceding review sentence.
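One way to read the three-way rule above is sketched below in Python. It is an interpretation under stated assumptions rather than the thesis's code: opinion word polarities are given as lists of 1 and -1, "effective opinion words" are passed in as a separate list, a tie among them falls through to the third case, and the third case reuses the polarity of the previous sentence. The function and parameter names are hypothetical.

```python
def sign(x):
    """Return 1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def sentence_polarity(opinion_polarities, effective_polarities=(), previous_polarity=0):
    """Classify one review sentence as 1 (commendatory), -1 (derogatory) or 0 (unknown)."""
    total = sum(opinion_polarities)
    # Class 1: one polarity is in the majority, so the algebraic sum decides.
    if total != 0:
        return sign(total)
    # Class 2: equal numbers of commendatory and derogatory opinion words;
    # decide from the effective opinion words of the frequent features.
    if opinion_polarities:
        effective_total = sum(effective_polarities)
        if effective_total != 0:
            return sign(effective_total)
    # Class 3: everything else inherits the polarity of the previous sentence.
    return previous_polarity


majority = sentence_polarity([1, 1, -1])
tie_broken = sentence_polarity([1, -1], effective_polarities=[-1])
inherited = sentence_polarity([], previous_polarity=1)
```

Returning 0 when nothing is known keeps the function total; how the thesis treats that case is not stated in the abstract.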
     This thesis implements an improved customer review mining algorithm. As e-commerce develops rapidly, the number of online transactions will keep increasing and the number of online product reviews will grow with it, so customer review mining will remain a hot topic in text mining for some time to come. With the continued improvement of existing mining algorithms and the appearance of new ones, text mining will certainly be able to provide consumers with reliable and convenient services and contribute to e-commerce.

