Apache Spark: The Future Beyond MapReduce

Authors

  • Naresh Dulam, Vice President, Sr Lead Software Engineer, JP Morgan Chase, USA

Keywords:

Apache Spark, MapReduce, Hadoop

Abstract

Apache Spark has emerged as a powerful alternative to the traditional MapReduce paradigm for processing and analyzing large-scale data. Designed to address the limitations of MapReduce, Spark offers a unified platform for batch and stream processing, enabling faster data processing and real-time analytics. By keeping intermediate data in memory where possible, rather than writing it to disk between processing stages, Spark significantly reduces data retrieval and processing time, which makes it a preferred choice for data-intensive applications. Its rich APIs and support for several programming languages, including Java, Scala, and Python, let developers build complex data workflows quickly. Spark also integrates with existing Hadoop ecosystems, so organizations can keep their investments in Hadoop while moving to a more agile data processing framework. The surrounding ecosystem, which includes libraries for machine learning, graph processing, and SQL, extends Spark beyond simple data processing to advanced analytics. As organizations increasingly adopt big data technologies, Spark stands out as a robust solution that improves performance and simplifies development. With a growing community and continuous advancements, Apache Spark is more than a trend: it represents the future of data processing and a catalyst for efficiency and innovation in the era of big data.

Published

29-12-2015

How to Cite

[1] Naresh Dulam, “Apache Spark: The Future Beyond MapReduce”, Distrib Learn Broad Appl Sci Res, vol. 1, pp. 136–156, Dec. 2015, Accessed: Jan. 09, 2025. [Online]. Available: https://dlabi.org/index.php/journal/article/view/211
