AI-POWERED OPERATIONAL INTELLIGENCE FOR MANAGING HIGH-SCALE CLOUD-NATIVE DISTRIBUTED SYSTEMS

Venkatramana Reddy Panyala

doi:10.15662/a7931h79

Authors

Venkatramana Reddy Panyala, Production Engineer, Yahoo, USA

DOI:

https://doi.org/10.15662/a7931h79

Keywords:

Cloud-native systems, operational intelligence, anomaly detection, distributed systems, machine learning, observability, auto-remediation, microservices, AIOps, DevOps

Abstract

The rapid growth of cloud-native architectures and microservice-based distributed systems has introduced a level of complexity that traditional monitoring systems and manual incident handling cannot manage effectively. This paper introduces a novel operational intelligence framework that combines artificial intelligence, stream processing, and auto-remediation to manage high-scale cloud-native distributed systems. The framework comprises four capabilities: anomaly detection, predictive analysis, smart alerting, and auto-remediation. Using the data produced by containers, service mesh technologies, and cloud infrastructure, the system applies machine learning, stream processing, and autonomous actions to deliver operational intelligence with minimal human intervention. The proposed architecture is vendor-neutral and extensible, which makes it compatible with popular observability products. The paper covers the system architecture, data flows, AI/ML model development, and deployment issues, with a particular focus on how operational intelligence is applied in DevOps and SRE practice.
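The detect-then-remediate loop summarized in the abstract can be illustrated with a minimal sketch. The abstract does not say which models or remediation actions the framework uses, so everything here is an assumption made for illustration: a rolling z-score stands in for the anomaly detection model, a plain list of latency samples stands in for the telemetry stream, and the names RollingZScoreDetector, run_pipeline, and restart_action are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Callable, Deque, Iterable, List


class RollingZScoreDetector:
    """Flags a sample as anomalous when it lies far from the recent mean.

    A stand-in for the paper's anomaly detection component (assumption).
    """

    def __init__(self, window: int = 30, threshold: float = 3.0,
                 min_samples: int = 5) -> None:
        self.window: Deque[float] = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        is_anomaly = False
        # Only score once enough history exists to estimate a baseline.
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        # Keep anomalous samples out of the baseline so one spike
        # does not inflate the variance for later samples.
        if not is_anomaly:
            self.window.append(value)
        return is_anomaly


def run_pipeline(samples: Iterable[float],
                 detector: RollingZScoreDetector,
                 remediate: Callable[[int, float], str]) -> List[str]:
    """Feeds samples through the detector and calls the hook on anomalies."""
    actions: List[str] = []
    for index, value in enumerate(samples):
        if detector.observe(value):
            actions.append(remediate(index, value))
    return actions


def restart_action(index: int, value: float) -> str:
    # Hypothetical remediation hook; a real system would call an
    # orchestrator API here instead of returning a description.
    return f"restart-pod (sample {index}, latency {value:.0f} ms)"


# Simulated request latencies in milliseconds with one spike.
latencies = [100, 102, 98, 101, 99, 103, 100, 97, 500, 101]
actions = run_pipeline(latencies, RollingZScoreDetector(window=8), restart_action)
print(actions)
```

Passing the remediation step in as a callback keeps detection and action decoupled, which loosely mirrors the vendor-neutral, extensible design the abstract describes. A production system would likely add alert deduplication and human approval gates before acting.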
