BIG DATA PROCESSING WITH APACHE SPARK

Quy Quang Tran; Binh Duc Nguyen; Linh Thi Thuy Nguyen; Oanh Thi Thu Nguyen

doi:10.35382/tvujs.13.6.2023.2099

Published: Jul 20, 2023

Keywords:

Apache Spark, Big Data, distributed computing, R language

Quy Quang Tran

University of Information and Communication Technology (ICTU), Thai Nguyen University, Vietnam

Binh Duc Nguyen

University of Information and Communication Technology (ICTU), Thai Nguyen University, Vietnam

Linh Thi Thuy Nguyen

Lao Cai College, Vietnam

Oanh Thi Thu Nguyen

Thai Nguyen University, Vietnam

Abstract

With the exponential growth of information, it is no surprise that the current period is known as the Information Age. The rapid growth of data poses challenges for storage and processing technology. This article examines Apache Spark, an ecosystem that integrates many technologies for Big Data processing, including machine learning libraries and data storage platforms. Apache Spark is an open-source engine for distributed data processing: it loads data into memory and runs analytical operations on data of any size, with efficient support for popular programming languages such as Java, Scala, R, and Python. The article compares the computing power of Spark with that of Hadoop and shows how to connect Spark with popular data processing tools such as the R language.

How to Cite

1. Tran Q, Nguyen B, Nguyen L, Nguyen O. BIG DATA PROCESSING WITH APACHE SPARK. Tra Vinh University Journal of Science [Internet]. 20 Jul 2023 [cited 19 May 2024];13(6). Available from: https://journal.tvu.edu.vn/index.php/journal/article/view/2099

Issue

Tra Vinh University Journal of Science, Vol. 13, Special Issue (2023)

Section

Articles
