PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system

Authors: Jun-Sung Kim; Kyu-Young Whang; Hyuk-Yoon Kwon; Il-Yeol Song
Keywords: Big data analytics; MapReduce; DBMS; Distributed file system; Integration; HadoopDB
Journal: World Wide Web
Publication year: 2016
Publication date: May 2016
Volume: 19
Issue: 3
Pages: 299-322
Full-text size: 1,912 KB
Affiliations: Jun-Sung Kim (1); Kyu-Young Whang (1); Hyuk-Yoon Kwon (1); Il-Yeol Song (2)

1. Department of Computer Science, KAIST, Daejeon, Korea
2. College of Computing & Informatics, Drexel University, Philadelphia, USA
Journal category: Computer Science
Journal subjects: Information Systems Applications and The Internet; Database Management; Operating Systems
Publisher: Springer Netherlands
ISSN: 1573-1413

Abstract

There has been a lot of research on MapReduce for big data analytics. This new class of systems sacrifices DBMS functionality such as query languages, schemas, or indexes in order to maximize scalability and parallelism. However, because the rich functionality of the DBMS is also considered important for big data analytics, there have been many efforts to support DBMS functionality in MapReduce. HadoopDB is the only work that directly utilizes the DBMS for big data analytics in the MapReduce framework, taking advantage of both the DBMS and MapReduce. However, HadoopDB does not support sharability for the entire data since it stores the data on multiple nodes in a shared-nothing manner; i.e., it partitions a job into multiple tasks where each task is assigned to a fragment of data. Due to this limitation, HadoopDB cannot effectively process queries that require internode communication. That is, HadoopDB needs to re-load the entire data to process some queries (e.g., 2-way joins) or cannot support some complex queries (e.g., 3-way joins). In this paper, we propose a new notion of the DFS-integrated DBMS where a DBMS is tightly integrated with the distributed file system (DFS). By using the DFS-integrated DBMS, we can obtain sharability of the entire data. That is, a DBMS process in the system can access any data since multiple DBMSs run on an integrated storage system in the DFS. To process big data analytics in parallel, our approach uses the MapReduce framework on top of a DFS-integrated DBMS. We call this framework PARADISE. In PARADISE, we employ a job splitting method that logically splits a job based on the predicate in the integrated storage system. This contrasts with physical splitting in HadoopDB. We also propose the notion of locality mapping for further optimization of logical splitting. We show that PARADISE effectively overcomes the drawbacks of HadoopDB by identifying the following strengths. (1) It has significantly faster (by up to 6.41 times) amortized query processing performance since it obviates the need to re-load data required in HadoopDB. (2) It supports query types more complex than the ones supported by HadoopDB.
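
The abstract describes logical job splitting only at a high level, so the following Python sketch is an illustration under stated assumptions rather than the method used in PARADISE. It assumes a job is a scan over an integer key range and that each map task receives a sub-range predicate to evaluate against the shared storage. The names `SubQuery` and `split_by_predicate`, and the table and column names in the usage note, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SubQuery:
    """One task's share of a job, expressed as a key-range predicate (hypothetical)."""
    task_id: int
    low: int    # inclusive lower bound of the key range
    high: int   # exclusive upper bound of the key range

    def to_sql(self, table: str, key: str) -> str:
        # Every task queries the same shared table; only the predicate differs.
        return (f"SELECT * FROM {table} "
                f"WHERE {key} >= {self.low} AND {key} < {self.high}")


def split_by_predicate(low: int, high: int, num_tasks: int) -> list[SubQuery]:
    """Split the key range [low, high) into num_tasks contiguous sub-ranges.

    This is an assumed, simplified form of logical splitting: no data is
    moved or re-partitioned, only the predicate is divided among tasks.
    """
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    if high < low:
        raise ValueError("high must not be smaller than low")

    base, extra = divmod(high - low, num_tasks)
    subqueries = []
    start = low
    for task_id in range(num_tasks):
        # The first `extra` tasks take one additional key each.
        size = base + (1 if task_id < extra else 0)
        subqueries.append(SubQuery(task_id, start, start + size))
        start += size
    return subqueries
```

For example, `split_by_predicate(0, 10, 3)` yields three sub-ranges that together cover `[0, 10)` without overlap, and `to_sql("orders", "id")` on each one produces the per-task query text. Physical splitting, by contrast, would tie each task to the data fragment stored on its own node.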
