基于网格的MST数据流聚类算法研究

英文题名：Research on Grid-based MST Data Stream Clustering Algorithm
作者：王宪鹏
论文级别：硕士
学科专业名称：计算机系统结构
中文关键词：聚类分析 ; 数据流 ; 网格 ; 最小生成树
英文关键词：clustering analysis ; data stream ; grid ; minimum spanning tree
学位年度：2009
导师：张健沛
学科代码：081201
学位授予单位：哈尔滨工程大学
论文提交日期：2009-03-02

摘要

聚类分析是数据挖掘领域中一个非常重要的研究方向。近年来,随着信息技术的高速发展出现了一种应用日益广泛的动态流数据—数据流。数据流不同于传统的存储在磁盘上的静态数据,它是高速的、连续的、动态的、快速变化的、海量的数据集合,由此对它的访问只能是顺序的、一次或有限次的。数据流的这些特性既给数据流的挖掘带来了极大的困难,也给数据流的聚类算法提出了更高的要求。在当前的数据挖掘领域中,数据流已经成为一个研究热点,同时数据流聚类分析也成为聚类研究的一个重要方向。
     本文首先介绍了数据流挖掘的相关理论与技术,结合流数据与传统的静态数据的不同分析了数据流的特点。同时对传统聚类算法与数据流聚类算法进行了研究和对比,分析了算法的优势与不足,阐述了数据流聚类算法的特点及其与传统聚类算法的不同。然后介绍了用于聚类算法的网格划分方法及其在聚类分析中的作用,并对基于网格的聚类算法进行了研究与分析。在此基础上给出了一种新的数据流聚类算法—GTSClu算法,该算法是基于网格的最小生成树(MST)数据流聚类算法,算法分为在线处理与离线聚类两部分,并运用了网格与最小生成树技术。在线部分通过均匀网格划分数据空间以获取数据流的信息,离线部分将网格空间拆分为不均匀的网格结构,并利用最小生成树技术对在线获得的信息进行聚类。GTSClu算法可以有效排除噪声数据发现任意形状的聚类,有效提高了聚类效率和质量。
     实验结果表明,GTSClu算法能够发现任意形状的聚类,对数据的输入顺序不敏感,而且网格拆分技术的采用使其能够有效分离出噪声数据具有很高的聚类精度和处理效率,适合处理大规模的数据流。
Clustering analysis is an important research topic in the field of data mining. With fast development of information technology these years, there exists more widely used dynamic flow data—data stream. Being different from traditional static data stored in disc, data stream is a high speed and massive set which is continuous, dynamic and fast changing thus the visit to it can only be ordinal, one time or several limited times. These characteristics of data stream bring great difficulty to data mining and much higher requirements of clustering algorithm for data stream. In the field of data mining, data stream has been a research focus recently; meanwhile, data stream clustering analysis has become an important topic in clustering research.
     This thesis first introduces the theories and techniques of data stream mining, and gives an analysis of characteristics of data stream on the combination of streaming data and traditional data. Moreover, comparison and research are made between existing traditional clustering algorithm and data stream clustering algorithm and get advantage and deficiency between them. Then it presents the grid dividing method used in clustering algorithm and its effect in clustering analysis as well as the research and analysis of grid-based clustering algorithm. On this basis, a new kind of clustering algorithm for data stream—GTSClu algorithm is proposed. It is the minimum spanning tree data stream clustering algorithm based on grid, which is divided into online processing and offline clustering, combining with grid and minimum spanning tree techniques. The online part acquires data stream information by means of even grid dividing data space while offline part carries out clustering analysis on the information obtained online through dividing the grid space into uneven grid and the minimum spanning tree technique. Thus, GTSClu algorithm can not only find clusters with arbitrary shape and amount, but also deal with noise data effectively, the efficiency and quality of clustering is improved.
     The experimental results show that GTSClu algorithm is capable of discovering arbitrary shape of cluster and is insensitive to the sequence of data. Furthermore, the application of grid dividing technique can make the algorithm have high efficiency and accuracy of clustering, and it can distinguish noise data effectively. The algorithm is suited to large-scale data stream processing.

引文

[1] S. Muthukrishnan, R. Shah, J. Vitter. Mining Deviants in Time Series Data Streams. Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDM'04), Santorini Island, Greece,2004: 41-50P

    [2] Ghoting, M. Otey, S. Parthasarathy. LOADED: Link-based Outlier and Anomaly Detection in Evolving Data Sets, Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), Brighton, UK, 2004:387-390P

    [3] Chang-Tien Lu, Lily R. Liang. Wavelet Fuzzy Classification for Detecting and Tracking Region Outliers in Meteorological Data. Proceedings of the Twelfth ACM International Symposium on Advances in Geographic Information Systems(GIS'O4), Washington, USA, 2004: 258-265P

    [4] J. Ma, S. Perkins. Time-series Novelty Detection Using One-class Support Vector Machines. Proceedings of the International Joint Conference on Neural Networks, Portland, OR, United States, 2003: 1741-1745P

    [5] Fang Chu, Yizhou Wang, Carlo Zaniolo. An Adaptive Learning Approach for Noisy Data Streams. Proceedings of the Fourth IEEE International Conference on Data Mining(ICDM'O4), Brighton, UK, 2004: 351-354P

    [6] Yang Yidong, Sun Zhihui, Zhang Jing. Finding outliers in distributed data streams based on kernel density estimation, Computer Research and Development, 2005, 42(9): 1498-1504P

    [7] V. Hautamaki, I. Karkkainen, P. Franti, Outlier Detection Using k-Nearest Neighbour Graph. Proceeding of the 17th International Conference on Pattern Recognition (ICPRC'04), Cambridge, UK, 2004: 430-433P

    [8] S. Bay, M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,Washington,USA,2003:29-38P
    [9]S.Papadimitriou,H.Kitagawa,P.B.Gibbons.LOCI:Fast outlier detection using the local correlation integral.Proceedings of the 19th International Conference on Data Engineering(ICDE'03),Bangalore,India,2003:315-326P
    [10]S.Shekhar,Chang-Tien Lu,Pusheng Zhang.Detecting Graph-Based Spatial Outliers:Algorithms and Applications(A Summary of Results).Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining,San Francisco USA,2001:371-376P
    [11]T.Zhang,R.Ramakrishnan,M.Livny,BIRCH:An Efficient Data Clustering Method of Very Lager Database.In:Proc.of 1996ACM-SIGMOD Int.Conf.Management of Data(SIGMOD'96),1996:103-114P
    [12]Cao F,Estery M,Qian W.Density-based Clustering over an Evolving Data Stream with Noise.In:Proceedings of the 2006 SIAM Conference on Data Mining(SDM'2006),2006:328-339P
    [13]朱蔚恒,印鉴,谢益煌.基于数据流的任意形状聚类算法.软件学报.2006,17(3):380-387页
    [14]单世民,张宁等.基于网格和密度的簇边缘精度增强聚类算法.计算机工程与应用,2008,44(23):143-146页
    [15]C.Jin,W.Qian,C.Sha,J.X.Yu,A.Zhou.Dynamically maintaining frequent item over a data stream.In Proceedings of the 12th ACM Conference on information and Knowledge Management,New Orleans,LA,USA,2003:287-294P
    [16]A.Zhou,Z.Cai,L.Wei,W.Qian.M-Kernel Merging:Towards Density Estimation over Data Streams.In Proceedings of 8th International Conference on Data base Systems for Advanced Applications (DASFAA'03),Tokyo,Japan,IEEE,2003:285-292P
    [17]张冬冬,李建中,王伟平,郭龙江.数据流历史数据的存储与聚集查询处理算法.软件学报.2005,16(12):2089-2098页
    [18]B.Babcock,S.Babu,M.Datar,R.Motwani,J.Widom.Models and Issues in Data Stream Systems.In Proceedings of the 21st ACM Symposium on Principles of Data base Systems,2002:1-16P
    [19]陆亿红.基于聚类的数据流挖掘技术的分析与研究.浙江工业大学学报.2007,35(3):289-291页
    [20]Han J,Kamber M.Data Mining:Concepts and Techniques(Second Edition).Morgan Kaufmann,Elsevier Inc,2006:467-589P
    [21]Yang Q,Wu X.10 Challenging Problems in Data Mining Research.International Journal of Information Technology & Decision Making.World Scientific Publishing Company,2006,5(4):597-604P
    [22]A.K Jain,M.N Murty,P.J Flynn.Data Clustering:A Review.ACM Computing Surveys,1999,31(3):264-323P
    [23]Krista,Rizman alik.An efficient k-means clustering algorithm.Pattern Recognition Letters archive,2.008,29(9):1385-1391P
    [24]S.Guha,R.Rastogi,K.Shim.CURE:An efficient clustering algorithm for large database.Information Systems.2001,26(1):35-58P
    [25]W.Wang,J.Yang,R.Muntz.STING:A statistical information grid approach to spatial data mining.Proc.1997 Int.Conf.Very Large Data Bases(VLDB'97),1997:186-195P
    [26]Pizzuti Clara,Talia Domenico.P-Auto Class:Scalable parallel clustering for mining large data sets.IEEE Trans,on Knowledge and Data Engineering,2003,15(3):629-641P
    [27]孙玉芬,卢炎生.流数据挖掘综述.计算机科学.2007,34(1):2-3页
    [28]蒋盛益,李庆华,李新.数据流挖掘算法研究综述.计算机工程与设计.2005,26(5):1130-1132页
    [29]S.Guha,N.Mishra,R.Motwani,L.Ocallaghan.Clustering data Streams In Proc.of the 2000 Annual SymP.on Foundations of Computer science,2000:359-366P
    [30]Han J,Kamber M.Data Mining:Concepts and Techniques(Second Edition).Morgan Kaufmann,Elsevier Inc,2006:467-589P
    [31]C.Aggarwal,J.Han,J.Wang,P.S.Yu.A Framework for Clustering Evolving Data Streams.In Proceedings of 29th International Conference on Very Large DataBases(VLDB'03),Berlin,Germany,2003:81-92P
    [32]刘青宝,戴超凡,邓苏,张维明.基于网格的数据流聚类算法.计算机科学.2007,34(3):159-161页
    [33]孙玉芬.基于网格的聚类算法研究.华中科技大学博士学位论文.2006:16-26页
    [34]Han J,Kamber M.Data Mining:Concepts and Techniques.Morgan Kaufmann Publishers,2001:224-226P
    [35]R.Agrawal,J.Gehrke,D.Gunopulos,P.Raghavan.Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications.Proc of the ACM SIGMOD Int'1 Conference on Management of Data,Seattle,Washington,1998:94-105P
    [36]Harsha Nagesh,Sanjay Goil,Alok Choudhary.Adaptive Grids for Clustering Massive Data Sets.SIAM Conference on Data Mining,2001:3-5P
    [37]Alexander Hinneburg,Daniel A.Keim.Optimal Grid-Clustering:Towards breaking the curse of Dimensionality in high-dimensional Clustering.Proceedings of the 25th VLDB Conference,Edinburgh,1999:506-517P
    [38]C.H.Cheng,A.W.Fu,Y.Zhang.Entropy-based subspace clustering for mining numerical data.In Proceedings of the Fifth ACM SIGKDD international conference on Knowledge discovery and data mining,ACMPress,1999:84-93P
    [39]Ji Zhang,Wynne Hsu,Mong Li Lee.Clustering in dynamic spatial databases.Journal of intelligent information systems,2005,24(1):5-27P
    [40]John T.Rickard,Ronald R.Yager,Wendy Miller.Mounmain clustering on non-uniform grids using P-tree.Fuzzy optimization and decision making,2005(4):87-102P
    [41]邱保志,沈钧毅.基于网格技术的高精度聚类算法.计算机工程.2006,32(3):12-13页
    [42]何勇等.基于动态网格的数据流聚类分析.计算机应用研究.2008,25(11):2-4页
    [43]严蔚敏,吴伟民.数据机构.第一版.北京:清华大学出版社,1997:173-175页
    [44]B.Chazelle,A Minimum Spanning Tree Algorithm with Inverse-Ackerman Type Complexity,J.ACM,2000(47):1028-1047P
    [45]Chih-Ming Hsu,Ming-Syan Chen.Subspace clustering of high dimensional spatial data with noises.Heidelberg,Germany:Springer,2004:31-40P
    [46]刘敏娟,柴玉梅.基于相似度的网格聚类算法.计算机工程与应用.2007,43(7):199-201页