Leveraging AI for Proactive Fault Detection in Amazon EKS Clusters

Authors

  • Babulal Shaik, Cloud Solutions Architect at Amazon Web Services, USA

Keywords:

Amazon EKS, Kubernetes

Abstract

In cloud-native environments such as Amazon EKS, ensuring high availability and minimizing downtime are critical to application performance and user satisfaction. This paper proposes a machine learning-based approach to proactively detecting and preventing faults in Amazon Elastic Kubernetes Service (EKS) clusters. By monitoring key metrics such as pod performance, node health, and network conditions, the model aims to identify early signs of issues that could lead to service disruption. The system trains predictive models on historical performance data so that faults can be anticipated before they escalate into critical problems. By analyzing patterns in resource utilization, system errors, and network latency, the model provides real-time alerts and automated remediation strategies. This proactive approach enhances the reliability and stability of EKS clusters and reduces operational overhead by allowing teams to address issues before they affect end users. The research aims to demonstrate the potential of integrating AI and machine learning into the operational workflows of Kubernetes-based environments, improving both performance and resilience.
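The abstract does not specify which model is used, so the following is only a minimal sketch of the general idea of flagging a metric sample that deviates sharply from recent history. It swaps in a simple rolling z-score check in place of a trained predictive model; the class name `RollingZScoreDetector`, the helper `detect_faults`, the window and threshold values, and the sample latency series are all hypothetical and chosen purely for illustration.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScoreDetector:
    """Flags a sample that deviates strongly from a sliding window of history.

    Illustrative stand-in only: the paper's actual model is not described
    in the abstract, and all parameter values here are assumptions.
    """

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # most recent metric samples
        self.threshold = threshold           # z-score above which we alert
        self.min_samples = min_samples       # warm-up before alerting

    def observe(self, value):
        """Return True if `value` is anomalous, then record it in the window."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # A zero sigma means the window is constant; skip to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.history.append(value)
        return anomalous


def detect_faults(samples, **kwargs):
    """Return the indices of samples flagged as anomalous."""
    detector = RollingZScoreDetector(**kwargs)
    return [i for i, v in enumerate(samples) if detector.observe(v)]


# Hypothetical pod network latency readings in milliseconds.
latency_ms = [20, 22, 19, 21, 20, 23, 21, 20, 22, 250, 21, 20]
alerts = detect_faults(latency_ms)
print(alerts)
```

In a real cluster the samples would come from a metrics pipeline rather than a hard-coded list, and an alert would feed a notification or remediation step; both are outside the scope of this sketch.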

Published

07-03-2020

How to Cite

[1] Babulal Shaik, “Leveraging AI for Proactive Fault Detection in Amazon EKS Clusters”, Distrib Learn Broad Appl Sci Res, vol. 6, pp. 894–909, Mar. 2020, Accessed: Jan. 03, 2025. [Online]. Available: https://dlabi.org/index.php/journal/article/view/262