Cloud-Based Data Pipelines: Design, Implementation and Example
Keywords:
Cloud computing, big data
Abstract
Cloud-based data pipelines have become essential for handling large volumes of data in modern data-driven organizations. These pipelines collect, transform, and move data across cloud services and systems, enabling efficient data processing and analytics at scale. Designing a robust cloud-based data pipeline requires understanding the needs of the business, the nature of the data, and the services available on platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. Implementation typically combines ingestion tools, transformation processes, orchestration services, and storage solutions. Scalability, fault tolerance, latency, and security must be considered during design so that data keeps flowing during system failures and peak loads. For example, a company might gather data from IoT devices, transform it for real-time analytics, and store it in a cloud data warehouse for further reporting and machine learning tasks. With the flexibility and cost-efficiency of cloud platforms, organizations can streamline their data workflows, gain real-time insights, reduce infrastructure management overhead, and improve decision-making. As cloud services continue to evolve, cloud-based data pipelines offer considerable potential for improving business agility and scalability. However, data security, compliance, and cost management must be addressed carefully. Effective design and implementation of cloud-based data pipelines allow companies to make full use of their data, supporting innovation and competitive advantage in an increasingly data-centric world.
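The IoT example in the abstract can be illustrated with a minimal sketch of the three stages it names: ingestion, transformation, and storage. This is not the authors' implementation. The `Reading` type, the event field names (`device_id`, `temperature_c`), the per-device averaging, and the plain dictionary standing in for a cloud data warehouse are all assumptions made for illustration; a real pipeline would replace them with managed cloud services.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Reading:
    """One sensor reading (hypothetical schema, assumed for this sketch)."""
    device_id: str
    temperature_c: float


def ingest(raw_events):
    """Ingestion: parse raw events and skip malformed ones instead of failing."""
    readings = []
    for event in raw_events:
        try:
            readings.append(
                Reading(str(event["device_id"]), float(event["temperature_c"]))
            )
        except (KeyError, TypeError, ValueError):
            # Simple fault tolerance: drop events that cannot be parsed.
            continue
    return readings


def transform(readings):
    """Transformation: average temperature per device (an assumed metric)."""
    by_device = {}
    for reading in readings:
        by_device.setdefault(reading.device_id, []).append(reading.temperature_c)
    return {device: round(mean(values), 2) for device, values in by_device.items()}


def load(aggregates, warehouse):
    """Storage: write results into a dict that stands in for a warehouse table."""
    warehouse.update(aggregates)


def run_pipeline(raw_events, warehouse):
    """Orchestration: run the three stages in order and return the warehouse."""
    load(transform(ingest(raw_events)), warehouse)
    return warehouse
```

Keeping each stage as a separate function mirrors the separation between ingestion, transformation, and storage services described above, so any one stage could be swapped for a cloud-hosted equivalent without changing the others.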
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
License Terms
Ownership and Licensing:
Authors of research papers submitted to Distributed Learning and Broad Applications in Scientific Research retain the copyright of their work while granting the journal certain rights. Authors maintain ownership of the copyright and have granted the journal a right of first publication. Simultaneously, authors agree to license their research papers under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.
License Permissions:
Under the CC BY-NC-SA 4.0 License, others are permitted to share and adapt the work for non-commercial purposes, as long as proper attribution is given to the authors, acknowledgement is made of the initial publication in the journal, and any adaptations are distributed under the same license. This license allows for the broad dissemination and use of research papers.
Additional Distribution Arrangements:
Authors are free to enter into separate contractual arrangements for the non-exclusive distribution of the journal's published version of the work. This may include posting the work to institutional repositories, publishing it in journals or books, or other forms of dissemination. In such cases, authors are requested to acknowledge the initial publication of the work in this journal.
Online Posting:
Authors are encouraged to share their work online, including in institutional repositories, disciplinary repositories, or on their personal websites. This permission applies both prior to and during the submission process to the journal. Online sharing enhances the visibility and accessibility of the research papers.
Responsibility and Liability:
Authors are responsible for ensuring that their research papers do not infringe upon the copyright, privacy, or other rights of any third party. Scientific Research Canada disclaims any liability or responsibility for any copyright infringement or violation of third-party rights in the research papers.
If you have any questions or concerns regarding these license terms, please contact us at editor@dlabi.org.