Survey of Distributed Machine Learning Platforms and Algorithms


Title: Survey of Distributed Machine Learning Platforms and Algorithms (分布式机器学习平台与算法综述)
Authors: SHU Na (舒娜); LIU Bo (刘波); LIN Wei-wei (林伟伟); LI Peng-fei (李鹏飞)
Affiliations: School of Computer, South China Normal University; School of Computer Science and Engineering, South China University of Technology
Keywords: big data; distributed machine learning; machine learning; algorithm analysis; parallel computing
Journal: Computer Science (计算机科学), journal code JSJA
Publication date: 2019-03-15
Year: 2019
Volume: 46
Issue: 03
Pages: 15-24 (10 pages)
CN: 50-1075/TP
Document ID: JSJA201903002
Funding: National Natural Science Foundation of China (61772205); Science and Technology Planning Project of Guangdong Province (2017B010126002, 2017A010101008, 2017A010101014, 2017B090901061, 2016B090918021, 2016A010101018, 2016A010119171); Science and Technology Planning Project of Nansha District, Guangzhou (2017GJ001)
Language: Chinese

Abstract

Distributed machine learning deploys tasks with large data volumes and heavy computation across multiple machines. Its core idea is "divide and conquer", which speeds up large-scale computation and reduces overhead. As one of the most important research areas in machine learning, it has attracted wide attention from researchers. Given its research significance and practical value, this paper systematically surveys the mainstream distributed machine learning platforms Spark, MXNet, Petuum, TensorFlow and PyTorch, and summarizes and compares their characteristics from several perspectives. It then explains how machine learning algorithms are distributed, in terms of data parallelism and model parallelism, and outlines three distributed computing models: the bulk synchronous parallel model, the asynchronous parallel model, and the delayed asynchronous parallel model. Finally, it discusses future research directions in five areas: platform performance improvement, algorithm optimization, model communication, scalability of algorithms under large-scale computation, and fault tolerance of models in distributed environments.
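
As an illustration of the data-parallel, bulk synchronous scheme named in the abstract, the following is a minimal single-process sketch. It is not code from the paper. The model (one-dimensional linear regression), the three simulated workers, and every name in it (`local_gradient`, `bsp_train`, `shards`) are assumptions chosen for brevity. Each worker holds one shard of the data, computes a gradient on it, and the update is applied only after all gradients have arrived, which is the barrier that distinguishes bulk synchronous from asynchronous training.

```python
# Minimal sketch of data parallelism under a bulk synchronous parallel (BSP) schedule.
# The model, the shard layout, and all names are illustrative assumptions.

def local_gradient(w, b, shard):
    """Mean-squared-error gradient of y = w*x + b on one worker's data shard."""
    n = len(shard)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def bsp_train(shards, steps=500, lr=0.05):
    """Run gradient descent where each step waits for every worker's gradient."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Workers are simulated sequentially; in a real system these run in parallel.
        grads = [local_gradient(w, b, s) for s in shards]
        # Synchronization barrier: average all gradients before updating the model.
        gw = sum(g[0] for g in grads) / len(grads)
        gb = sum(g[1] for g in grads) / len(grads)
        w -= lr * gw
        b -= lr * gb
    return w, b


# Synthetic data drawn from y = 3x + 1, split evenly across three simulated workers.
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
shards = [data[i::3] for i in range(3)]
w, b = bsp_train(shards)
print(round(w, 3), round(b, 3))
```

Because the shards are the same size, the averaged gradient equals the gradient over the full data set, so this run behaves like ordinary gradient descent. An asynchronous variant would let each worker apply its gradient as soon as it is ready, without waiting at the barrier.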

