Cloud-Based Data Pipelines: Design, Implementation and Example

Authors

  • Sairamesh Konidala, Vice President at JPMorgan Chase & Co., USA

Keywords:

Cloud computing, big data

Abstract

Cloud-based data pipelines have become essential for handling large volumes of data in modern data-driven organizations. These pipelines collect, transform, and move data across cloud services and systems, enabling efficient processing and analytics at scale. Designing a robust cloud-based data pipeline requires understanding the needs of the business, the nature of the data, and the services available on platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. Implementation typically combines ingestion tools, transformation processes, orchestration services, and storage solutions. Scalability, fault tolerance, latency, and security must be considered during design so that data keeps flowing during system failures and peak loads. For example, a company might gather data from IoT devices, transform it for real-time analytics, and store it in a cloud data warehouse for reporting and machine learning. The flexibility and cost-efficiency of cloud platforms let organizations streamline their data workflows, obtain real-time insights, reduce infrastructure management overhead, and improve decision-making. As cloud services continue to evolve, cloud-based data pipelines offer considerable potential for improving business agility and scalability, provided that data security, compliance, and cost management are addressed carefully. Effective design and implementation of cloud-based data pipelines allow companies to make full use of their data, supporting innovation and competitive advantage in an increasingly data-centric world.
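
The ingestion, transformation, and storage stages described in the abstract can be illustrated with a minimal Python sketch of the IoT scenario. This is not the article's implementation: the `Reading` type, the field names `device_id` and `temperature_c`, the `device_avg` table, and the use of an in-memory SQLite database as a stand-in for a cloud data warehouse are all assumptions made for illustration. Orchestration is reduced to a single function call.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Reading:
    """Hypothetical IoT reading; field names are illustrative assumptions."""
    device_id: str
    temperature_c: float


def ingest(raw_events: Iterable[dict]) -> Iterator[Reading]:
    """Ingestion stage: parse raw events and skip malformed ones."""
    for event in raw_events:
        try:
            yield Reading(str(event["device_id"]), float(event["temperature_c"]))
        except (KeyError, TypeError, ValueError):
            # A production pipeline would likely route these to a dead-letter store.
            continue


def transform(readings: Iterable[Reading]) -> dict[str, float]:
    """Transformation stage: average temperature per device."""
    by_device: dict[str, list[float]] = {}
    for reading in readings:
        by_device.setdefault(reading.device_id, []).append(reading.temperature_c)
    return {device: mean(values) for device, values in by_device.items()}


def store(conn: sqlite3.Connection, averages: dict[str, float]) -> None:
    """Storage stage: upsert results into a table (SQLite stands in for a warehouse)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS device_avg "
        "(device_id TEXT PRIMARY KEY, avg_temp_c REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO device_avg VALUES (?, ?)", averages.items())
    conn.commit()


def run_pipeline(raw_events: Iterable[dict], conn: sqlite3.Connection) -> None:
    """Orchestration, reduced here to chaining the three stages in order."""
    store(conn, transform(ingest(raw_events)))


if __name__ == "__main__":
    events = [
        {"device_id": "sensor-a", "temperature_c": 20.0},
        {"device_id": "sensor-a", "temperature_c": 22.0},
        {"device_id": "sensor-b", "temperature_c": "not-a-number"},  # dropped at ingestion
    ]
    connection = sqlite3.connect(":memory:")
    run_pipeline(events, connection)
    print(connection.execute("SELECT * FROM device_avg").fetchall())
```

Keeping each stage a separate function mirrors the separation of concerns the abstract describes: in a real deployment each stage could be replaced by a managed cloud service without changing the others.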

Published

15-05-2019

How to Cite

[1] Sairamesh Konidala, “Cloud-Based Data Pipelines: Design, Implementation and Example”, Distrib Learn Broad Appl Sci Res, vol. 5, pp. 1586–1603, May 2019, Accessed: Jan. 05, 2025. [Online]. Available: https://dlabi.org/index.php/journal/article/view/283
