PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system
  • Authors: Jun-Sung Kim ; Kyu-Young Whang ; Hyuk-Yoon Kwon ; Il-Yeol Song
  • Keywords: Big data analytics ; MapReduce ; DBMS ; Distributed file system ; Integration ; HadoopDB
  • Journal: World Wide Web
  • Publication year: 2016
  • Publication date: May 2016
  • Volume: 19
  • Issue: 3
  • Pages: 299-322
  • Full-text size: 1,912 KB
  • Author affiliations: Jun-Sung Kim (1)
    Kyu-Young Whang (1)
    Hyuk-Yoon Kwon (1)
    Il-Yeol Song (2)

    1. Department of Computer Science, KAIST, Daejeon, Korea
    2. College of Computing & Informatics, Drexel University, Philadelphia, USA
  • Journal category: Computer Science
  • Journal subjects: Information Systems Applications and The Internet
    Database Management
    Operating Systems
  • Publisher: Springer Netherlands
  • ISSN: 1573-1413
Abstract
There has been a lot of research on MapReduce for big data analytics. This new class of systems sacrifices DBMS functionality such as query languages, schemas, or indexes in order to maximize scalability and parallelism. However, since the rich functionality of the DBMS is also considered important for big data analytics, there have been many efforts to support DBMS functionality in MapReduce. HadoopDB is the only work that directly utilizes the DBMS for big data analytics in the MapReduce framework, taking advantage of both the DBMS and MapReduce. However, HadoopDB does not support sharability for the entire data since it stores the data across multiple nodes in a shared-nothing manner, i.e., it partitions a job into multiple tasks where each task is assigned to a fragment of data. Due to this limitation, HadoopDB cannot effectively process queries that require internode communication. That is, HadoopDB needs to re-load the entire data to process some queries (e.g., 2-way joins) or cannot support some complex queries (e.g., 3-way joins). In this paper, we propose a new notion of the DFS-integrated DBMS, where a DBMS is tightly integrated with the distributed file system (DFS). By using the DFS-integrated DBMS, we can obtain sharability of the entire data. That is, a DBMS process in the system can access any data since multiple DBMSs run on an integrated storage system in the DFS. To process big data analytics in parallel, our approach uses the MapReduce framework on top of a DFS-integrated DBMS. We call this framework PARADISE. In PARADISE, we employ a job splitting method that logically splits a job based on the predicate in the integrated storage system. This contrasts with physical splitting in HadoopDB. We also propose the notion of locality mapping for further optimization of logical splitting. We show that PARADISE effectively overcomes the drawbacks of HadoopDB by identifying the following strengths. (1) Its amortized query processing performance is significantly faster (by up to 6.41 times) since it obviates the need to re-load data, which HadoopDB requires. (2) It supports query types more complex than the ones supported by HadoopDB.
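
The contrast between logical and physical splitting can be illustrated with a minimal Python sketch. It is not the PARADISE implementation: the names `Task`, `split_logically`, and `run_task`, the integer key range, and the in-memory list standing in for the integrated storage are all assumptions made for illustration. The idea shown is that a job is divided by rewriting its predicate into disjoint sub-predicates, while every task reads from the same shared data rather than from a private fragment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Task:
    """Hypothetical map task covering the half-open key range [low, high)."""
    low: int
    high: int

    def predicate(self, column: str) -> str:
        # Sub-predicate that a DBMS process could evaluate on the shared storage.
        return f"{column} >= {self.low} AND {column} < {self.high}"


def split_logically(low: int, high: int, num_tasks: int) -> List[Task]:
    """Split the predicate range [low, high) into contiguous, disjoint tasks.

    Only the predicate is divided; the data itself is not repartitioned.
    """
    if num_tasks <= 0 or high <= low:
        raise ValueError("need num_tasks > 0 and low < high")
    span = high - low
    bounds = [low + (span * i) // num_tasks for i in range(num_tasks + 1)]
    # Drop empty ranges, which occur when num_tasks exceeds the span.
    return [Task(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]


def run_task(shared_rows: List[Dict[str, int]], task: Task, column: str) -> List[Dict[str, int]]:
    """Evaluate one task against the shared data (assumed visible to all tasks)."""
    return [row for row in shared_rows if task.low <= row[column] < task.high]


if __name__ == "__main__":
    rows = [{"id": i} for i in range(100)]   # stand-in for the integrated storage
    for task in split_logically(0, 100, 4):
        print(task.predicate("id"), "->", len(run_task(rows, task, "id")), "rows")
```

Under physical splitting, by contrast, each task would be bound to the fragment stored on its own node, so a query needing rows from another fragment would require moving or re-loading data.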
