Optimizing Big Data Workflows: A Comparative Analysis of Spark Compression Codecs

Authors

  • Ahmed Elgalb, Independent Researcher, WA, United States
  • George Samaan, Independent Researcher, Tennessee, United States

Keywords

Big Data, Apache Spark, Compression

Abstract

As data generated by business, science, and user activity has exploded in recent years, demand for efficient data-processing frameworks such as Apache Spark has grown with it. Although Spark distributes large computations across clusters of commodity hardware, efficient storage and communication remain a central challenge. Data compression is the most popular approach to mitigating this issue: by minimizing the size of data on disk and in motion, compression speeds up I/O, reduces network traffic, and lowers storage costs. But with many compression codecs available, each with its own trade-offs in speed, compression ratio, and resource overhead, practitioners and researchers find it difficult to make an informed decision for a given use case.
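
For context on how these trade-offs surface in practice, codec selection in Spark is a configuration choice rather than a code change. The short PySpark sketch below is ours, not the paper's; the file paths are illustrative, but the two configuration keys are standard Spark settings: spark.io.compression.codec governs Spark-internal data (shuffle, spills), and spark.sql.parquet.compression.codec governs Parquet output.

    # Codec selection in Spark is configuration-driven; paths here are illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("codec-selection-demo")
        # Codec for Spark-internal data (shuffle, spills): lz4 (default), lzf, snappy, or zstd.
        .config("spark.io.compression.codec", "zstd")
        .getOrCreate()
    )

    # Codec for Parquet files written by Spark SQL: snappy (default), gzip, lz4, zstd, or none.
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

    df = spark.read.csv("flights.csv", header=True, inferSchema=True)  # hypothetical input
    df.write.mode("overwrite").parquet("flights_gzip.parquet")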

This paper explores four popular Spark compression codecs (Snappy, LZ4, ZSTD, and Gzip) and analyzes their storage and computation performance. Our comparative analysis draws on two real-world datasets: an airline flight dataset and a web logs dataset. Our test workloads include aggregation, multi-column joins, and iterative machine learning computations. We examine compression ratio, compression/decompression time, job completion time, and resource consumption, and we provide guidance to help developers weigh the trade-offs between storage and computation speed. We further discuss how the underlying nature of the datasets (structural regularity, repetition of values, irregular text) may affect the choice of codec. Our results indicate that Gzip typically achieves the best compression ratio at the cost of speed, while Snappy and LZ4 are faster; ZSTD offers a middle ground, balancing speed and ratio in many situations. We present our results as a detailed roadmap to help researchers and engineers make their Big Data pipelines more efficient.
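
To make the methodology concrete, the following is a minimal sketch, under our own assumptions, of how such a comparison can be staged in PySpark: write the same DataFrame once per codec (write time as a proxy for compression cost), then re-read and aggregate (read time folds in decompression cost). The column names (Carrier, ArrDelay), the input path, and the local-filesystem size check are hypothetical, not taken from the paper.

    import subprocess
    import time

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("codec-benchmark-sketch").getOrCreate()
    df = spark.read.csv("flights.csv", header=True, inferSchema=True)  # hypothetical input

    def size_on_disk(path):
        # Local-filesystem size (GNU du); an HDFS deployment would use `hdfs dfs -du -s`.
        return int(subprocess.check_output(["du", "-sb", path]).split()[0])

    results = {}
    for codec in ["snappy", "lz4", "zstd", "gzip"]:
        spark.conf.set("spark.sql.parquet.compression.codec", codec)
        path = f"flights_{codec}.parquet"

        t0 = time.time()
        df.write.mode("overwrite").parquet(path)  # write time ~ compression cost
        write_s = time.time() - t0

        t0 = time.time()
        # Re-read and aggregate: read time folds in decompression cost.
        spark.read.parquet(path).groupBy("Carrier").avg("ArrDelay").collect()
        read_s = time.time() - t0

        results[codec] = {"write_s": write_s, "read_s": read_s, "bytes": size_on_disk(path)}

    for codec, r in results.items():
        print(codec, r)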

Published

19-07-2023

How to Cite

[1] A. Elgalb and G. Samaan, “Optimizing Big Data Workflows: A Comparative Analysis of Spark Compression Codecs”, Distrib Learn Broad Appl Sci Res, vol. 9, pp. 539–558, Jul. 2023, Accessed: Jan. 07, 2025. [Online]. Available: https://dlabi.org/index.php/journal/article/view/299
