BIG DATA PROCESSING WITH APACHE SPARK

  • Quy Quang Tran, University of Information and Communication Technology (ICTU), Thai Nguyen University, Vietnam
  • Binh Duc Nguyen, University of Information and Communication Technology (ICTU), Thai Nguyen University, Vietnam
  • Linh Thi Thuy Nguyen, Lao Cai College, Vietnam
  • Oanh Thi Thu Nguyen, Thai Nguyen University, Vietnam
Keywords: Apache Spark, Big Data, distributed computing, R language

Abstract

With the exponential growth of information, it is no surprise that the present period of history is known as the Information Age. The rapid growth of data has created challenges for storage and processing technology. This article discusses Apache Spark, an ecosystem that integrates many Big Data processing technologies, including machine learning libraries and data storage platforms. Apache Spark is an open-source framework for distributed data processing: it loads data in memory and performs analytical operations on data of any size, with efficient support for popular programming languages such as Java, Scala, R, and Python. The article aims to compare the computing power of Spark with that of Hadoop and to show how to connect Spark with popular data processing tools such as the R language.
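As a brief illustration of the Spark–R connection mentioned in the abstract, the sketch below (not taken from the article itself) uses the sparklyr package to open a local Spark session, copy R's built-in mtcars data frame into Spark, and run an aggregation on the cluster; the local master setting and the choice of mtcars are illustrative assumptions only.

  # Minimal sparklyr sketch: connect R to a local Spark instance (assumed setup)
  library(sparklyr)
  library(dplyr)

  sc <- spark_connect(master = "local")        # start a local Spark session
  mtcars_tbl <- copy_to(sc, mtcars, "mtcars")  # copy the mtcars data frame into Spark

  # dplyr verbs are translated to Spark SQL and executed on Spark, not in R
  mtcars_tbl %>%
    group_by(cyl) %>%
    summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
    collect()                                  # bring the small result back into R

  spark_disconnect(sc)                         # close the Spark connection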


Published
20-July-2023
How to Cite
Tran Q, Nguyen B, Nguyen L, Nguyen O. BIG DATA PROCESSING WITH APACHE SPARK. journal [Internet]. 20 Jul. 2023 [cited 22 Dec. 2024];13(6). Available from: https://journal.tvu.edu.vn/tvujs_old/index.php/journal/article/view/2099