Research on Clustering Algorithms Based on Mixture Models
Abstract
Clustering is a method for dividing a data set into smaller, more homogeneous subgroups, or clusters, when no information other than the observed values is available. In recent years, mixture models have served as a basis for cluster analysis, and finite mixture models in particular have become an increasingly important tool in multivariate statistics. Approaches to density estimation and clustering based on Gaussian mixture models have shown good performance in many applications. In this approach, the data are viewed as coming from a mixture of probability distributions, each representing a different cluster. This thesis proposes a new clustering method based on multivariate Gaussian mixtures in which the maximum likelihood estimate (MLE) is replaced by a maximum a posteriori (MAP) estimate, which prevents the covariance matrices from becoming singular during the iterations. At the same time, an improved Bayesian information criterion (BIC) is computed jointly with parameter estimation, which enlarges the search space for model selection.

This thesis is organized as follows. Chapter 1 surveys the state of the art in cluster analysis. Chapter 2 introduces the basic concepts of finite mixture models and the EM algorithm. Chapter 3 presents a model-based clustering method built on Gaussian mixtures, including estimation of the number of components and of their structure. Chapter 4 proposes an unsupervised clustering algorithm based on the MAP estimate that not only keeps the covariance matrices from becoming singular but also performs well at model selection.
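To make the methodology concrete, the sketch below (Python with NumPy; an illustration, not the thesis's actual implementation) runs EM for a full-covariance Gaussian mixture but replaces the MLE covariance update with a MAP update that shrinks each component's scatter matrix toward a scaled identity. The inverse-Wishart-style prior and the hyperparameters `prior_scale` and `prior_df` are assumptions made for illustration; the thesis's exact prior and its improved BIC may differ. The Schwarz BIC is computed alongside estimation so the number of components can be chosen by a simple search.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log-density of N(mu, Sigma) at each row of X, via a Cholesky factor."""
    d = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, (X - mu).T)            # solve L z = (x - mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2 * np.pi) + logdet + np.sum(z * z, axis=0))

def fit_gmm_map(X, K, n_iter=200, prior_scale=0.1, prior_df=2.0, seed=0):
    """EM for a K-component Gaussian mixture with a MAP covariance update.

    prior_scale and prior_df are illustrative hyperparameters of an assumed
    inverse-Wishart-style prior, not values taken from the thesis.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)]       # initialize means at data points
    Sigma = np.tile(np.cov(X.T) + 1e-6 * np.eye(d), (K, 1, 1))
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities, with log-sum-exp for numerical stability.
        logp = np.stack([np.log(w[k]) + log_gauss(X, mu[k], Sigma[k])
                         for k in range(K)])
        m = logp.max(axis=0)
        loglik = np.sum(m + np.log(np.sum(np.exp(logp - m), axis=0)))
        r = np.exp(logp - m)
        r /= r.sum(axis=0)
        # M-step: weights and means as in plain EM; covariances by MAP.
        Nk = r.sum(axis=1)
        w = Nk / n
        for k in range(K):
            mu[k] = r[k] @ X / Nk[k]
            Xc = X - mu[k]
            Sk = (r[k] * Xc.T) @ Xc               # weighted scatter matrix
            # MAP update: the prior adds prior_scale * I to the scatter and
            # prior_df pseudo-counts to the denominator, so Sigma[k] stays
            # full rank even when Nk[k] is tiny -- this avoids singularity.
            Sigma[k] = (Sk + prior_scale * np.eye(d)) / (Nk[k] + prior_df)
    # Schwarz BIC (smaller is better): -2 log L + p log n, with p the number
    # of free parameters of a full-covariance K-component mixture; loglik is
    # taken from the final E-step.
    p = (K - 1) + K * d + K * d * (d + 1) // 2
    bic = -2.0 * loglik + p * np.log(n)
    return w, mu, Sigma, bic

# Model selection by BIC: fit over a range of K and keep the minimizer.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in ([0, 0], [3, 3], [0, 4])])
bics = {K: fit_gmm_map(X, K)[3] for K in range(1, 7)}
print("BIC per K:", bics, "-> chosen K:", min(bics, key=bics.get))
```

Because the prior contributes pseudo-observations to every covariance update, a component that momentarily claims very few points still receives a full-rank covariance matrix; under the plain MLE update the same situation drives the covariance toward singularity and the likelihood toward infinity, which is exactly the failure mode the MAP estimate is meant to avoid.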
