流数据上的可置换聚类研究

英文题名：Alternative Clustering on Stream Data
作者：张婧媛
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：流数据 ; 聚类 ; 可置换
英文关键词：Data Stream ; Alternative Clustering ; AltStream
学位年度：2011
导师：江贺
学科代码：081203
学位授予单位：大连理工大学
论文提交日期：2011-09-01

摘要

随着硬件技术的发展,海量流数据迅速增长,成为近年来的研究热点。流数据聚类作为流挖掘的一项基本任务,已经有了大量的研究和应用。但目前已有的方法都只是在某一特定的时刻反馈给用户单一的聚类结果,然而实际应用中,流数据可以从不同的角度去观察理解,较为合理的作法是反馈给用户多个可置换聚类结果。
     本文从流数据的角度出发,设计新的适应于流数据的动态可置换聚类方法,让人们可以多方面地探索流数据。该算法命名为AltStream,由在线和离线两部分组成。它的目标是在给定的数据流中找到两个高质量的,同时相似度较低的宏观可置换聚类结果。因此,在算法的在线部分,会同时保持两组可置换的统计信息从而使得流数据不断向相异的方向变化。为保持微簇间的可置换性,我们提出一种新的度量方法SOBD测度,针对两个聚类间包含不同数据点的情况,近似地评估它们的相似性。当用户需要两个可置换宏观聚类结果时,离线部分启动。首先会根据时间区域得到两组相应的微簇集,再根据已知的簇个数在第一组微簇集上用一种无监督的算法dec-kmeans获取两个不相关的宏观簇。其中质量较好的会作为第一个最终结果返回给用户,而另外一个质量稍差的宏观簇簇心则被抽取出来作为一种半监督信息来引导第二组微簇集,运用基于权重的k-means算法从而得到第二个宏观聚类结果。
     在真实数据集上的大量实验结果表明,我们的新算法无论在质量上,还是在相异度上,都优于其它一些对比算法。由此可见,该方法将在文本数据流,信用卡交易处理流,网络日志和网络页面点击流等诸多实际应用中指导用户更好地分析数据。
In recent years, data streams have attracted a lot of research interests. As an essential task in mining data streams, stream clustering has become a hot topic in this area. These algorithms usually produce only one single clustering within a certain time period. However, data streams can be usually interpreted in multiple perspectives and alternative clusterings are preferred in many real world applications.
     In this paper, we issue the new problem of alternative stream clustering, which aims to find two high quality and dissimilar macro-clusterings in a given data stream. We propose a new algorithm named AltStream consisting of two components. The online component of AltStream simultaneously maintains two alternative groups of micro-clusters which are used to record the statistical information about the evolving stream. During the online procedure, we develop a new method, the SOBD measure to approximately evaluate the dissimilarity between two clusterings containing some distinct data points from each other. When the users request to find two alternative macro-clusterings, the offline component is then invoked. After the two sets of micro-clusters are returned with respect to the specified time horizon and the number of clusters, an unsupervised alternative clustering algorithm, namely dec-kmeans, is then employed in the offline component to find two alternative macro-clusterings over one set of micro-clusters. The one with better quality is outputted as the first resulting macro-clustering, whereas the centroids of the other macro-clustering are extracted as the semi-supervised information. Under the guideline of these centroids, the second resulting macro-clustering is created by a weighted k-means algorithm.
     Experimental results on real world streams illustrate that our new algorithm performs better than some comparative methods, in terms of both quality and dissimilarity. Therefore, AltStream would be widely used in text stream, creadit card transaction flows, web logs and web page click streams, etc. In each real-world application, it would be important for the users to explore data streams in various aspects.

引文

[1]C. C. Aggarwal. Data Streams:Models and Algorithms[M]. Springer, Berlin,2007.
    [2]C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Clustering Evolving Data Streams[C]. Proc. of VLDB,2003.
    [3]C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Projected Clustering of High Dimensional Data Streams[C]. Proc. of VLDB,2004.
    [4]A. Zhou, F. Cao, W. Qian, and C. Jin. Tracking Clusters in Evolving Data Streams over Sliding Windows[J]. Knowledge and Information Systems,2008,15(2):181-214.
    [5]F. Cao, M. Ester, W. Qian, and A. Zhou. Density-Based Clustering over an Evolving Data Stream with Noise[C]. Proc. of SDM,2006.
    [6]S. Guha, N. Mishra, R. Motwani, and L.O' Callaghan. Clustering Data Stream[C]. Proc. of FOCS,2000.
    [7]M. Charikar, L O' Callaghan, and R. Panigrahy. Better Streaming Algorithms for Clustering Problems[C]. Proc. of STOC,30-39,2003.
    [8]P. Domingos and G. Hulten. Mining High-Speed Data Streams[C]. Proc. of KDD,2000.
    [9]S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L.O' Callaghan. Clustering Data Streams:Theory and Practice[J]. IEEE Transactions on Knowledge and Data Engineering,2003,15(3):515-528.
    [10]P. Domingos and G. Hulten. A general method for scaling up machine learning algorithms and its application to clustering[C]. Proc.18th International Conf. on Machine Learning,106-113,2001.
    [11]B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom. Models and Issues in Data Stream Systems[C]. ACM PODS Conference,2002.
    [12]J. Feigenbaum, M. Strauss, and M. Viswanathan. Testing and Spot-Checking of Data Streams[C]. ACM SODA Conference,2000.
    [13]L.O' Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. Streaming-Data Algorithms for High-Quality Clustering[C]. Proc. of ICDE,2002.
    [14]E. Bae and J. Bailey. COALA:A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity[C]. Proc. of ICDM,2006.
    [15]I. Davidson and Z. Qi. Finding Alternative Clusterings Using Constraints[C]. Proc. of ICDM,773-778,2008.
    [16]X. H. Dang and J. Bailey. A Hierarchical Information Theoretic Technique for the Discovery of Non Linear Alternative Clusterings[C]. Proc. of KDD,2010.
    [17]X. H. Dang and J. Bailey. Generation of Alternative Clusterings Using the CAMI Approach[C]. Proc. of SDM,2010.
    [18]D. Gondek and T. Hofmann. Conditional Information Bottleneck Clustering[C]. Proc. of ICDM,2003.
    [19]Y. Cui, X. Z. Fern, and J. G. Dy. Non-Redundant Multi-View Clustering via Orthogonalization[C]. Proc. of ICDM,133-142,2007.
    [20]P. Jain, R. Meka, and I. S. Dhillon. Simultaneous Unsupervised Learning of Disparate Clusterings[C]. Proc. of SDM,2008.
    [21]E. Bae, J. Bailey and G. Dong. A Clustering Comparison Measure Using Density Profiles and its Application to the Discovery of Alternate Clusterings[J]. Data Mining and Knowledge Discovery,2010,21(3):427-471.
    [22]W. M. Rand. Objective Criteria for the Evaluation of Clustering Methods[J]. J. Am. Stat. Assoc.,1971,66(336):846-850.
    [23]A. Patrikainen. Methods for Comparing Subspace Clusterings [J]. IEEE Transactions on Knowledge and Data Engineering,2005,18(2):902-916.
    [24]B. Larsen and C. Aone. Fast and Effective Text Mining Using Linear-Time Document Clustering[C]. Proc. of ICDM,16-22,1999.
    [25]X. V. Nguyen and J. Epps. minCEntropy:A Novel Information Theoretic Approach for the Generation of Alternative Clusterings[C]. Proc. of ICDM,521-530,2010.
    [26]Han J, Kamber M. Data Mining:Concepts and Techniques[M]. Morgan Kaufmann Publishers,2001.
    [27]Lukasz Golab, M. Tamerozsu. Issues in data stream management [J]. ACM SIGMOD Record, 2003,32(2):5-14.
    [28]Daniel Barbara. Requirements for Clustering Data Streams[J]. ACM SIGKDD Explorations Newsletter,2002,3(2):23-27.
    [29]朱蔚恒,印鉴,谢益煌.基于数据流的任意形状聚类算法[J].软件学报,2006,17(3)：379-387.
    [30]Strehl, A., Ghosh, J. Cluster Ensembles-A Knowledge Reuse Framework for Combining Multiple Partitions[J]. Jour, on Machine Learning,2002,3(2):583-617.
    [31]范明,范宏建等译.数据挖掘导论[M].北京：人民邮电出版社,2006.
    [32]Bae E, Bailey J, Dong G. Clustering similarity comparison using density profiles[C]. Australian joint conference on artificial intelligence,342-351,2006.
    [33]Caruana R, Elhawary M, Nguyen N, Smith C. Meta clustering[C]. International conference on data mining,107-118,2006.
    [34]H.-P. Kriegel, P. Kroger, and A. Zimek. Clustering high dimensional data:A survey on subspace clustering, pattern-based clustering, and correlation clustering[J]. ACM Transactions on Knowledge Discovery from Data (TKDD),2009,3(1):1-58.
    [35]L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review[J]. SIGKDD Explorations,2004,6(1):90-105.
    [36]Z. Qi and I. Davidson. A Principled and Flexible Framework for Finding Alternative Clusterings[C]. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD),2009.
    [37]C. Domeniconi and M. Al-Razgan. Weighted cluster ensembles:Methods and analysis[J]. ACM Transactions on Knowledge Discovery from Data (TKDD),2009,2(4): 1-40.