What is a Modern Data Pipeline and Why is it Important?

Authors

  • Sairamesh Konidala, Vice President at JPMorgan Chase, USA

Keywords:

Data Pipeline, ETL

Abstract

A modern data pipeline is a system designed to efficiently manage the movement, transformation, and analysis of data, helping organizations turn raw information into valuable insights. Unlike traditional pipelines, modern data pipelines are built to handle the increasing volume, velocity, and variety of data generated by today’s digital world. These pipelines typically integrate data collection, cleansing, transformation, storage, and analysis tools, often incorporating cloud technologies, real-time processing, and automation. A modern data pipeline is essential because it streamlines the flow of information and ensures that data is accessible, reliable, and ready for decision-making. Businesses rely on timely and accurate data to stay competitive, improve customer experiences, and drive innovation. By automating the movement and processing of data, modern pipelines reduce manual effort, minimize errors, and shorten the time it takes to generate insights. They also allow companies to handle diverse data sources, from social media streams and sensor data to customer databases and transaction records. In a world where data-driven decision-making is critical, a well-designed data pipeline ensures organizations can keep up with the speed of change. It empowers teams to focus more on analysis and strategy rather than getting bogged down by the complexities of data handling. Ultimately, modern data pipelines create a seamless data flow, allowing businesses to harness the power of information efficiently and consistently.
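The stages the abstract names (collection, cleansing, transformation, and loading into storage for analysis) can be sketched as a minimal pipeline. This is an illustrative example only, not from the article; the function names (`extract`, `cleanse`, `transform`, `load`) and the sample records are assumptions, and a production pipeline would read from real sources and write to a warehouse or data lake.

```python
# Minimal ETL sketch: extract -> cleanse -> transform -> load.
# All names and sample data are hypothetical, for illustration only.

def extract():
    # Stand-in for collecting raw records from a source (API, file, stream).
    return [
        {"id": "1", "amount": "19.99", "country": "us"},
        {"id": "2", "amount": "", "country": "de"},   # missing amount
        {"id": "3", "amount": "5.00", "country": "fr"},
    ]

def cleanse(records):
    # Drop records missing required fields, reducing manual error-handling later.
    return [r for r in records if r["amount"]]

def transform(record):
    # Cast types and normalize values so downstream analysis is consistent.
    return {
        "id": int(record["id"]),
        "amount": float(record["amount"]),
        "country": record["country"].upper(),
    }

def load(records, sink):
    # Stand-in for writing to a warehouse or data lake.
    sink.extend(records)

warehouse = []
load([transform(r) for r in cleanse(extract())], warehouse)
print(warehouse)
```

In a real deployment each stage would typically run as an automated, scheduled, or streaming step (e.g. under an orchestrator), which is what lets the pipeline scale with data volume and velocity.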




Published

12-12-2016

How to Cite

[1]
Sairamesh Konidala, “What is a Modern Data Pipeline and Why is it Important?”, Distrib Learn Broad Appl Sci Res, vol. 2, pp. 95–111, Dec. 2016, Accessed: Jan. 30, 2025. [Online]. Available: https://dlabi.org/index.php/journal/article/view/280
