Soft error resilience in Big Data kernels through modular analysis
  • Authors: Sui Chen; Greg Bronevetsky; Lu Peng; Bin Li; Xin Fu
  • Keywords: Soft faults; High-performance computing; Numerical errors; Fault resilience; Big data
  • Journal: The Journal of Supercomputing
  • Publication year: 2016
  • Publication date: April 2016
  • Volume: 72
  • Issue: 4
  • Pages: 1570-1596
  • Full-text size: 2,411 KB
  • Author affiliations: Sui Chen (1)
    Greg Bronevetsky (2)
    Lu Peng (1)
    Bin Li (3)
    Xin Fu (4)

    1. Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, USA
    2. Lawrence Livermore National Laboratory, Livermore, USA
    3. Department of Experimental Statistics, Louisiana State University, Baton Rouge, USA
    4. Department of Electrical and Computer Engineering, University of Houston, Houston, USA
  • Journal category: Computer Science
  • Journal subjects: Programming Languages, Compilers and Interpreters
    Processor Architectures
    Computer Science, general
  • Publisher: Springer Netherlands
  • ISSN: 1573-0484
Abstract
Shrinking feature sizes and operating voltages are making processor circuits increasingly vulnerable to soft faults, which calls for fault resilience techniques at both the software and hardware levels in the big data context. To assist software developers in writing fault-resilient big data applications, we propose the tool ErrorSight, which helps them focus their efforts on the code regions and data structures that are most vulnerable to soft errors, understand how numerical errors propagate through the program, and apply fault resilience techniques effectively. ErrorSight achieves this by efficiently generating error profiles that leverage the predictive power of the Boosted Regression Tree model. We use four big data kernels to illustrate the modular analysis mechanism of ErrorSight and show its usefulness in developing numerical fault resilience for Big Data.
