面向差分隐私保护的聚类算法

英文篇名：Clustering Algorithm in Differential Privacy Preserving
作者：胡闯 ; 杨庚 ; 白云璐
英文作者：HU Chuang;YANG Geng;BAI Yun-lu;College of Computer Science,Nanjing University of Posts and Telecommunications;Jiangsu Key Laboratory of Big Data Security &Intelligent Processing;College of Information Technology,Nanjing University of Chinese Medicine;
关键词：差分隐私 ; k-均值 ; 聚类算法 ; 隐私保护
英文关键词：Differential privacy;;k-means;;Clustering algorithms;;Privacy preserving
中文刊名：JSJA
英文刊名：Computer Science
机构：南京邮电大学计算机学院;江苏省大数据安全与智能处理重点实验室;南京中医药大学信息技术学院;
出版日期：2019-02-15
出版单位：计算机科学
年：2019
期：v.46
基金：国家自然科学基金项目(61572263);; 江苏省自然科学基金政策引导类计划——前瞻性联合研究项目(2016ZS04)资助
语种：中文;
页：JSJA201902023
页数：7
CN：02
ISSN：50-1075/TP
分类号：129-135

摘要

大数据时代的数据挖掘技术在研究和应用等领域取得了较大发展,但大量敏感信息披露给用户带来了众多威胁和损失。因此,在聚类分析过程中如何保护数据隐私成为数据挖掘和数据隐私保护领域的热点问题。传统差分隐私保护k-means算法对其初始中心点的选择较为敏感,而且在聚簇个数k值的选择上存在一定的盲目性,降低了聚类结果的可用性。为了进一步提高差分隐私k-means聚类方法聚类结果的可用性,研究并提出一种新的基于差分隐私的DPk-means-up聚类算法,同时进行了理论分析和比较实验。理论分析表明,该算法满足ε-差分隐私,可适用于不同规模和不同维度的数据集。此外,实验结果表明,在相同隐私保护级别下,与其他差分隐私k-means聚类方法相比,所提算法有效提高了聚类的可用性。
Data mining has made great progress in the field of research and application of big data,but sensitive information disclosure could bring users many threats and losses.Therefore,how to protect data privacy in clustering analysis has become a hot issue in data mining and data privacy protection.Traditional differential privacy k-means is sensitive to the selection of its initial centers,and it has a certain blindness in the selection of cluster number k,which reduces the availability of clustering results.To improve the availability of clustering results of differential privacy k-means clustering,this paper presented a new DPk-means-up clustering algorithm based on differential privacy and carried out theoretical analysis and comparison experiment.Theoretical analysis shows that the algorithm satisfiesε-differential privacy,and can be applied to data sets with different sizes and dimensions.In addition,experimental results indicate that the proposed algorithm improves clustering availability than other differential privacy k-means clustering methods at the same level of privacy preserve.

引文

[1] MADDEN S,FRANKLIN M J,HELLERSTEIN J M.TAG:a tiny aggregation service for ad-hoc sensor networks[C]∥Proceedings of the 5th Symposium on Operating Systems Design and Implementation.New York,USA,2002:131-146.
    [2] MACHANAVAJJHALA A,GEHRKE J,KIFER D,et al.l-diversity:Privacy beyond k-anonymity[C]∥Proceedings of the22nd International Conference on Data Engineering.IEEE,2006:24-24.
    [3] GANTA S R,KASIVISWANATHAN S P,SMITH A.Composition attacks and auxiliary information in data privacy[C]∥Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2008:265-273.
    [4] BLUM A,DWORK C,MCSHERRY F,et al.Practical privacy:the SuLQ framework[C]∥Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.ACM,2005:128-138.
    [5] LI Y,HAO Z,WEN W,et al.Research on differential privacy preserving k-means clustering[J].Computer Science,2013,40(3):287-290.
    [6] YU Q,LUO Y,CHEN C,et al.Outlier-eliminated k-means clustering algorithm based on differential privacy preservation[J].Applied Intelligence,2016,45(4):1179-1191.
    [7] REN J,XIONG J,YAO Z,et al.DPLK-means:a novel differential privacy k-means mechanism[C]∥2017IEEE Second International Conference on Data Science in Cyberspace(DSC).IEEE,2017:133-139.
    [8] DWORK C.Differential privacy[C]∥Proceedings of the 33rd International Conference on Automata,Languages and Programming-Volume Part II.Springer,Berlin,Heidelberg,2006:1-19.
    [9] DWORK C,MCSHERRY F,NISSIM K,et al.Calibrating noise to sensitivity in private data analysis[C]∥Theory of Cryptography Conference(TCC).2006:265-284.
    [10]DWORK C,LEI J.Differential privacy and robust statistics[C]∥Proceedings of the 41st annual ACM Symposium on Theory of Computing.ACM,2009:371-380.
    [11]DWORK C.A firm foundation for private data analysis[J].Communications of the ACM,2011,54(1):86-95.
    [12]DWORK C.Differential privacy[M]∥Encyclopedia of Cryptography and Security.Springer US,2011:338-340.
    [13]LI N,QARDAJI W,SU D.Provably Private Data Anonymization:Or,k-Anonymity Meets Differential Privacy[J/OL].Corr,2010,abs/1101.2604:32-33.
    [14]LI C,HAY M,RASTOGI V,et al.Optimizing linear counting queries under differential privacy[C]∥Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.ACM,2010:123-134.
    [15]XIAO X,WANG G,GEHRKE J.Differential privacy via wavelet transforms[J].IEEE Transactions on Knowledge and Data Engineering,2011,23(8):1200-1214.
    [16]HAY M,RASTOGI V,MIKLAU G,et al.Boosting the accuracy of differentially private histograms through consistency[J].Proceedings of the VLDB Endowment,2010,3(1-2):1021-1032.
    [17]HARDT M,TALWAR K.On the geometry of differential privacy[C]∥Proceedings of the 42nd ACM Symposium on Theory of Computing.ACM,2010:705-714.
    [18]HARDT M,ROTHBLUM G N.A multiplicative weights mechanism for privacy-preserving data analysis[C]∥Proceedings of the 51st Annual IEEE Symposium.IEEE,2010:61-70.
    [19]NISSIM K,RASKHODNIKOVA S,SMITH A.Smooth sensitivity and sampling in private data analysis[C]∥Proceedings of the 39th annual ACM Symposium on Theory of Computing.ACM,2007:75-84.
    [20]SU D,CAO J,LI N,et al.Differentially private k-means clustering and a hybrid approach to private optimization[J].ACM Transactions on Privacy and Security(TOPS),2017,20(4):1-33.
    [21]DWORK C.Differential privacy:A survey of results[C]∥International Conference on Theory and Applications of Models of Computation.Springer,Berlin,Heidelberg,2008:1-19.
    [22]MCSHERRY F D.Privacy integrated queries:an extensible platform for privacy-preserving data analysis[C]∥Proceedings of the 2009ACM SIGMOD International Conference on Management of data.ACM,2009:19-30.
    [23]ARTHUR D,VASSILVITSKII S.k-means++:The advantages of careful seeding[C]∥Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms.Society for Industrial and Applied Mathematics,2007:1027-1035.
    [24]CHEN R,ACS G,CASTELLUCCIA C.Differentially private sequential data publication via variable-length n-grams[C]∥Proceedings of the 2012 ACM Conference on Computer and Communications Security.ACM,2012:638-649.
    [25]DWORK C.Differential privacy in new settings[C]∥Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms.Society for Industrial and Applied Mathema-tics,2010:174-183.
    [26]FRNTI P.Clustering datasets[OL].http://cs.joensuu.fi/sipu/datasets.
    [27]NGUYEN H H,KIM J,KIM Y.Differential privacy in practice[J].Journal of Computing Science and Engineering,2013,7(3):177-186.
    [28]JIANG H,YI S,LI J,et al.Ant clustering algorithm with Kharmonic means clustering[J].Expert Systems with Applications,2010,37(12):8679-8684.
    [29]FANG Y J,ZHU J Z,ZHOU W,et al.A survey on data mining privacy protection algorithms[J].Netinfo Security,2017(2):6-11.(in Chinese)方跃坚,朱锦钟,周文,等.数据挖掘隐私保护算法研究综述[J].信息网络安全,2017(2):6-11.
    [30]ZHANG F X,JIANG C H.Research on query on privacy anonymity algorithm based on grid clustering[J].Netinfo Security,2015(8):53-58.(in Chinese)张付霞,蒋朝惠.一种基于网格聚类的查询隐私匿名算法研究[J].信息网络安全,2015(8):53-58.