Fatman: Building Reliable Archival Storage Based on Low-Cost Volunteer Resources

Authors: An Qin (1)
Dian-Ming Hu (1)
Jun Liu (1)
Wen-Jun Yang (1)
Dai Tan (1)

1. Baidu Inc., Beijing 100193, China
Keywords: volunteer storage; failure prediction; failure recovery; reliability; archival storage
Journal: Journal of Computer Science and Technology
Publication year: 2015
Publication date: March 2015
Volume: 30
Issue: 2
Pages: 273-282
Full-text size: 679 KB
Journal category: Computer Science
Journal subjects: Computer Science, general
Software Engineering
Theory of Computation
Data Structures, Cryptology and Information Theory
Artificial Intelligence and Robotics
Information Systems Applications and The Internet
Chinese Library of Science
Publisher: Springer Boston
ISSN: 1860-4749

Abstract

We present Fatman, an enterprise-scale archival storage system built on storage volunteered by under-utilized web servers, typically deployed across thousands of nodes with spare disk capacity. Fatman is designed to raise the utilization of existing storage resources and to cut hardware purchase costs. The design addresses two main concerns: maximizing the resource utilization of volunteer nodes without violating service level objectives (SLOs), and minimizing cost without reducing the availability of the archival system. Fatman has been deployed on tens of thousands of server nodes across several datacenters, where it provides more than 100 PB of storage capacity and serves dozens of internal mass-data applications. The system consolidates storage quotas efficiently through strong isolation and budget limitation, so that nodes can contribute as much capacity as possible without degrading host-level SLOs. It also improves data reliability through fault-aware data management, which applies disk failure prediction to lower the cost of failure recovery; on a real production workload this reduces the mean time to repair (MTTR) by 76.3% and the file crash ratio by 35%.
