Cloud-Based Data Pipelines: Design, Implementation and Example

Authors

Sairamesh Konidala, Vice President at JPMorgan & Chase, USA

Keywords:

Cloud computing, big data

Abstract

Cloud-based data pipelines have become essential for handling large volumes of data in modern data-driven organizations. These pipelines collect, transform, and move data across cloud services and systems, enabling efficient processing and analytics at scale. Designing a robust cloud-based data pipeline requires understanding the needs of the business, the nature of the data, and the services available on platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. Implementation typically combines ingestion tools, transformation processes, orchestration services, and storage solutions. Scalability, fault tolerance, latency, and security must be considered during design so that data keeps flowing during system failures and peak loads. For example, a company might gather data from IoT devices, transform it for real-time analytics, and store it in a cloud data warehouse for reporting and machine learning. With the flexibility and cost-efficiency of cloud platforms, organizations can streamline their data workflows, gain real-time insights, reduce infrastructure management overhead, and improve decision-making. As cloud services continue to evolve, cloud-based data pipelines offer considerable potential for improving business agility and scalability, but data security, compliance, and cost management must be addressed carefully. Effective design and implementation of cloud-based data pipelines allow companies to make full use of their data, supporting innovation and competitive advantage in an increasingly data-centric world.
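The IoT example in the abstract can be illustrated with a minimal sketch of the ingest, transform, and load stages. This is not the author's implementation: the `Reading` type, the simulated sensor values, the Fahrenheit-to-Celsius transformation, and the `with_retry` helper are all hypothetical, and an in-memory SQLite database stands in for a cloud data warehouse so that the example runs without any cloud account.

```python
from __future__ import annotations

import sqlite3
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Reading:
    device_id: str
    temp_f: float  # raw temperature in Fahrenheit (hypothetical field)


def ingest() -> Iterator[Reading]:
    """Stand-in for an ingestion service: yields simulated IoT readings."""
    yield Reading("sensor-1", 68.0)
    yield Reading("sensor-2", 212.0)
    yield Reading("sensor-1", 32.0)


def transform(readings: Iterable[Reading]) -> Iterator[tuple[str, float]]:
    """Convert each reading to Celsius, rounded to two decimals."""
    for r in readings:
        yield r.device_id, round((r.temp_f - 32.0) * 5.0 / 9.0, 2)


def with_retry(action: Callable[[], None], attempts: int = 3,
               delay_s: float = 0.0) -> None:
    """Run action, retrying on database errors: a very simple form of
    fault tolerance (assumed policy, not taken from the article)."""
    for attempt in range(1, attempts + 1):
        try:
            action()
            return
        except sqlite3.OperationalError:
            if attempt == attempts:
                raise
            time.sleep(delay_s)


def load(conn: sqlite3.Connection, rows: list[tuple[str, float]]) -> None:
    """Write rows to the stand-in warehouse table in one transaction.
    A failed attempt is rolled back, so a retry does not duplicate rows."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temp_c REAL)"
        )
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)


def run_pipeline(conn: sqlite3.Connection) -> int:
    """Run ingest -> transform -> load and return the number of rows loaded."""
    rows = list(transform(ingest()))
    with_retry(lambda: load(conn, rows))
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse
loaded = run_pipeline(conn)
avg_by_device = dict(conn.execute(
    "SELECT device_id, AVG(temp_c) FROM readings "
    "GROUP BY device_id ORDER BY device_id"
).fetchall())
```

In a real deployment each stage would likely be a separate managed service coordinated by an orchestrator. The sketch keeps them as plain functions only to show the stage boundaries and where a retry policy might sit.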

Published

15-05-2019

Issue

Vol. 5 (2019): Distributed Learning and Broad Applications in Scientific Research

Section

Articles

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
