Abstract
For large-scale, high-dimensional, unstructured text data, plain K-Means clustering performs poorly and tends to become trapped in local optima. This paper improves the particle swarm optimization (PSO) algorithm by introducing a nonlinear, dynamically adjusted inertia weight. The improved PSO is then combined with K-Means, which has strong local search ability, to form a text clustering algorithm based on improved PSO and K-Means (MPK-Clusters). An experimental comparison of three algorithms shows that the new algorithm outperforms the other two in accuracy, recall, and F-measure, and yields better text clustering results.
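The hybrid idea described above can be illustrated with a minimal sketch: PSO searches globally for a good set of centroids, and K-Means then refines that set locally. This is not the paper's implementation. The abstract does not give the inertia-weight formula, so the quadratic decay in `nonlinear_inertia`, the bounds 0.9 and 0.4, the acceleration constants `c1` and `c2`, and all function names are assumptions chosen for illustration. The sketch also works on plain numeric vectors rather than on text features.

```python
import random


def nonlinear_inertia(t, t_max, w_max=0.9, w_min=0.4):
    """Inertia weight decaying nonlinearly from w_max to w_min.

    Assumption: a quadratic schedule; the paper's own formula is not given.
    """
    return w_min + (w_max - w_min) * (1.0 - t / t_max) ** 2


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(centroids, points):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def pso_centroids(points, k, n_particles=10, t_max=30, c1=2.0, c2=2.0, seed=0):
    """Global search: each particle is one candidate set of k centroids."""
    rng = random.Random(seed)
    # Seed every particle with k distinct data points.
    pos = [[list(p) for p in rng.sample(points, k)] for _ in range(n_particles)]
    vel = [[[0.0] * len(c) for c in particle] for particle in pos]
    pbest = [[c[:] for c in particle] for particle in pos]
    pbest_cost = [sse(particle, points) for particle in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = [c[:] for c in pbest[g]], pbest_cost[g]
    for t in range(t_max):
        w = nonlinear_inertia(t, t_max)
        for i in range(n_particles):
            for j in range(k):
                for d in range(len(pos[i][j])):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            cost = sse(pos[i], points)
            if cost < pbest_cost[i]:
                pbest[i], pbest_cost[i] = [c[:] for c in pos[i]], cost
                # The global best is updated as soon as a particle improves on it.
                if cost < gbest_cost:
                    gbest, gbest_cost = [c[:] for c in pos[i]], cost
    return gbest


def kmeans_refine(points, centroids, n_iter=20):
    """Local search: standard K-Means (Lloyd) iterations from given centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)
```

A typical call would be `start = pso_centroids(points, k)` followed by `centroids, labels = kmeans_refine(points, start)`, where `points` is a list of equal-length numeric vectors. A larger inertia weight early on favors exploration, and a smaller one later favors convergence, which is the usual motivation for a decaying schedule.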