Apache Spark: The Future Beyond MapReduce

Naresh Dulam

Authors

Naresh Dulam, Vice President, Sr Lead Software Engineer, JP Morgan Chase, USA

Keywords:

Apache Spark, MapReduce, Hadoop

Abstract

Apache Spark has emerged as a powerful alternative to the traditional MapReduce paradigm for processing and analyzing large-scale data. Designed to address the limitations of MapReduce, Spark offers a unified platform for batch and stream processing, allowing for faster data processing and real-time analytics. By leveraging in-memory computation, Spark significantly reduces data retrieval and processing time, making it a preferred choice for data-intensive applications. Its rich APIs and support for several programming languages, including Java, Scala, and Python, let developers build complex data workflows quickly. Spark also integrates with existing Hadoop ecosystems, so organizations can keep using their Hadoop investments while moving to a more agile data processing framework. The ecosystem surrounding Spark, including libraries for machine learning, graph processing, and SQL, extends its functionality beyond simple data processing and enables advanced analytics. As organizations increasingly adopt big data technologies, Spark stands out as a robust solution that improves performance and simplifies development. With its growing community and continuous advancements, Apache Spark is not just a trend; it represents the future of data processing, paving the way for innovative applications and transformative business solutions. Spark has the potential to redefine the data landscape and to act as a catalyst for efficiency and innovation in the era of big data.

Published

29-12-2015

Issue

Vol. 1 (2015): Distributed Learning and Broad Applications in Scientific Research

Section

Articles

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

License Terms

Ownership and Licensing:

Authors of research papers submitted to Distributed Learning and Broad Applications in Scientific Research retain the copyright of their work while granting the journal certain rights. Authors maintain ownership of the copyright and have granted the journal a right of first publication. Simultaneously, authors agree to license their research papers under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.

License Permissions:

Under the CC BY-NC-SA 4.0 License, others are permitted to share and adapt the work, as long as proper attribution is given to the authors and acknowledgement is made of the initial publication in the journal. This license allows for the broad dissemination and utilization of research papers.

Additional Distribution Arrangements:

Authors are free to enter into separate contractual arrangements for the non-exclusive distribution of the journal's published version of the work. This may include posting the work to institutional repositories, publishing it in journals or books, or other forms of dissemination. In such cases, authors are requested to acknowledge the initial publication of the work in this journal.

Online Posting:

Authors are encouraged to share their work online, including in institutional repositories, disciplinary repositories, or on their personal websites. This permission applies both prior to and during the submission process to the journal. Online sharing enhances the visibility and accessibility of the research papers.

Responsibility and Liability:

Authors are responsible for ensuring that their research papers do not infringe upon the copyright, privacy, or other rights of any third party. Scientific Research Canada disclaims any liability or responsibility for any copyright infringement or violation of third-party rights in the research papers.

If you have any questions or concerns regarding these license terms, please contact us at editor@dlabi.org.

How to Cite

[1] Naresh Dulam, “Apache Spark: The Future Beyond MapReduce”, Distrib Learn Broad Appl Sci Res, vol. 1, pp. 136–156, Dec. 2015, Accessed: Jan. 09, 2025. [Online]. Available: https://dlabi.org/index.php/journal/article/view/211
