Optimizing Big Data Workflows: A Comparative Analysis of Spark Compression Codecs
Keywords:
Big Data, Apache Spark, Compression
Abstract
As the data generated by business, science, and user activity has exploded in recent years, demand for efficient data processing frameworks such as Apache Spark has risen. Although Spark enables large computations over clusters of inexpensive hardware, efficient storage and communication remain key challenges. Data compression is the most popular approach to mitigating them. By minimising the size of data on disk and in motion, compression speeds up I/O, reduces network traffic, and lowers storage costs. However, many compression codecs are available, each with its own trade-offs in speed, compression ratio, and resource overhead, so practitioners and researchers find it difficult to make an informed choice for a given use case.
This paper evaluates four popular Spark compression codecs (Snappy, LZ4, ZSTD, and Gzip) and analyzes their storage and computation performance. Our comparative analysis uses two real-world datasets: an airline flight dataset and a web logs dataset. Our test workloads include aggregation, multi-column joins, and iterative machine learning computations. We measure compression ratio, compression and decompression time, job completion time, and resource consumption, and we provide guidance to help developers weigh the trade-off between storage footprint and computation speed. We further discuss how the underlying nature of a dataset (structural regularity, repetition of values, irregular text) may affect the choice of codec. Our results indicate that Gzip typically achieves the best compression ratio at the cost of speed, while Snappy and LZ4 are faster. ZSTD offers a middle ground, balancing speed and ratio in many situations. We present our results as a detailed roadmap for researchers and engineers to help their Big Data pipelines run more efficiently.
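The metrics named above (compression ratio, compression time, decompression time) can be illustrated with a small sketch. It is not the paper's benchmark: it uses only Python's standard `zlib` module (DEFLATE, the algorithm behind Gzip) at a fast and a high-effort level, because Snappy, LZ4, and ZSTD require third-party packages. The sample data, the `measure` helper, and the chosen levels are all assumptions made for illustration.

```python
import os
import time
import zlib


def measure(data: bytes, level: int) -> dict:
    """Compress and decompress `data` once; report ratio and wall-clock timings."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_s = time.perf_counter() - start

    if restored != data:
        raise ValueError("round trip changed the data")
    return {
        "ratio": len(data) / len(compressed),  # original size / compressed size
        "compress_s": compress_s,
        "decompress_s": decompress_s,
    }


# Hypothetical stand-ins for the data characteristics the abstract mentions.
regular_rows = b"2022-10-01,AA,JFK,LAX,on_time\n" * 20_000  # repetitive, structured
irregular_bytes = os.urandom(len(regular_rows))              # no repetition to exploit

results = {
    (name, level): measure(data, level)
    for name, data in (("regular", regular_rows), ("irregular", irregular_bytes))
    for level in (1, 9)  # 1 = fastest, 9 = highest effort
}

for (name, level), r in results.items():
    print(
        f"{name:9s} level={level} ratio={r['ratio']:.2f} "
        f"compress={r['compress_s'] * 1e3:.2f} ms "
        f"decompress={r['decompress_s'] * 1e3:.2f} ms"
    )
```

Single-shot timings like these are noisy; a real comparison would repeat each measurement and, as in the paper, run inside Spark jobs rather than on in-memory byte strings.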
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
License Terms
Ownership and Licensing:
Authors of research papers submitted to Distributed Learning and Broad Applications in Scientific Research retain the copyright of their work while granting the journal certain rights. Authors maintain ownership of the copyright and have granted the journal a right of first publication. Simultaneously, authors agree to license their research papers under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.