Design principles and techniques for enabling reboot-based administration in a persistent state storage system.
Details
  • Author: Huang, Andrew C.
  • Degree: Doctor
  • Year: 2005
  • Advisor: Fox, Armando
  • Institution: Stanford University
  • Major: Computer Science
  • ISBN: 0542087065
  • CBH: 3171798
  • Country: USA
  • Language: English
  • FileSize: 18854506
  • Pages: 159
Abstract
Managing large computer installations is complex and expensive. It is already the case that system administration accounts for a majority of the total cost of ownership, often costing 5 to 10 times more than the purchase price of hardware and software. Since long-running trends of decreasing hardware costs and increasing system complexity exacerbate the problem, ease of management will continue to be a critical system design challenge in the future.

In this thesis, I focus on management of persistent state in Internet-scale systems. These systems are characterized by their large scale and dynamically changing environment. At the scale of hundreds or thousands of nodes, node failures are the common case, which makes failure handling an important area of administration on which to focus. When changes in workload or system environment are frequent, system evolution tasks such as scaling are also common tasks administrators must handle. Finally, since Internet applications serve requests globally for fractions of a penny per access, mechanisms for dealing with scale and change must meet the 24 x 7 availability and low cost requirements.

The approach I take to simplifying state management is to first design the system to have low-cost reboot-based recovery. The properties of "cheap" recovery are that data remains available and data consistency is maintained throughout failure, failover, and recovery. Instead of affecting availability or consistency, failure and recovery manifest as minimal performance degradation that is predictable and bounded. With cheap recovery, system administration can be simplified in two ways. First, cheap recovery simplifies failure detection by lowering the cost of acting on false positives, which in turn enables the use of statistical techniques to turn hard-to-catch failures, such as node degradation, into failure followed by recovery. Second, cheap recovery can be used to cast system evolution tasks like online data repartitioning into failure plus recovery to achieve zero-downtime incremental scaling. These low-cost failure handling and system evolution mechanisms make it possible for the system to be continuously self-adjusting, a key property of self-managing systems.
