Automating Incident Triage and Root Cause Intelligence Through Large Language Model–Driven Correlation of System Logs and Operational Metrics in Large-Scale Distributed Environments
DOI:
https://doi.org/10.15662/IJEETR.2023.0506023
Keywords:
Automated incident triage, root cause intelligence, large language models, log analysis, metric correlation, distributed systems, observability engineering, reliability engineering, site reliability engineering, incident management automation, semantic log interpretation, operational metrics analysis, anomaly detection, fault diagnosis, system observability, intelligent monitoring, production system reliability, incident response optimization, machine intelligence for operations, enterprise scale systems, context aware diagnostics, log and metric fusion
Abstract
The growing scale and complexity of distributed computing environments make timely incident triage and accurate root cause identification harder, because operators must reason across high-volume system logs and heterogeneous operational metrics under severe time constraints. This work asks how incident response can be transformed from manual, heuristic-driven practice into an intelligent, automated process capable of semantic understanding and contextual reasoning. The objective is to investigate how large language model-driven analysis can be systematically applied to correlate logs and metrics for reliable incident triage in production-scale systems. A mixed-method approach is adopted, combining architectural design, qualitative analysis of operational workflows, and quantitative evaluation of diagnostic efficiency across representative enterprise scenarios. The proposed framework introduces a correlation pipeline that uses language model-based contextual abstraction to unify unstructured log streams and structured metrics into coherent incident narratives. Compared with traditional rule-based and statistical techniques, the observed patterns suggest substantial reductions in triage time, improved diagnostic precision, and a lower cognitive burden on reliability engineers. The findings indicate that language model-driven reasoning enables a shift from reactive alert handling toward proactive root cause intelligence. The primary contribution is a principled foundation for integrating large language models into observability and incident management systems, bridging academic advances in machine intelligence with real-world operational demands. The study concludes that automated, semantics-aware triage is a critical advancement for scalable reliability engineering, with significant implications for future research and enterprise operations in large-scale distributed environments.
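The abstract does not specify how the correlation pipeline is implemented, so the following is only a minimal sketch of the general idea of fusing logs and metrics into text that a language model could then reason over. Everything in it is an assumption made for illustration: the `LogEvent` and `MetricPoint` types, the z-score threshold used as a stand-in anomaly detector, the 60-second correlation window, and the sample data. The language model call itself is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, pstdev


@dataclass
class LogEvent:  # hypothetical record type, not from the paper
    timestamp: datetime
    service: str
    message: str


@dataclass
class MetricPoint:  # hypothetical record type, not from the paper
    timestamp: datetime
    service: str
    name: str
    value: float


def metric_anomalies(points, z_threshold=2.0):
    """Flag points whose z-score within their (service, metric) series
    exceeds z_threshold. A simple stand-in for a real anomaly detector."""
    series = {}
    for p in points:
        series.setdefault((p.service, p.name), []).append(p)
    flagged = []
    for pts in series.values():
        values = [p.value for p in pts]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # a constant series has no outliers
        flagged.extend(p for p in pts if abs(p.value - mu) / sigma > z_threshold)
    return sorted(flagged, key=lambda p: p.timestamp)


def correlate(logs, anomalies, window=timedelta(seconds=60)):
    """Pair each anomaly with same-service log events inside the time window."""
    return [
        (a, [e for e in logs
             if e.service == a.service and abs(e.timestamp - a.timestamp) <= window])
        for a in anomalies
    ]


def build_incident_context(pairs):
    """Render correlated evidence as plain text suitable for a model prompt."""
    lines = []
    for anomaly, events in pairs:
        lines.append(f"[{anomaly.timestamp.isoformat()}] {anomaly.service} "
                     f"{anomaly.name}={anomaly.value}")
        for e in events:
            lines.append(f"  log {e.timestamp.isoformat()} {e.message}")
    return "\n".join(lines)


# Made-up sample data: one latency spike and three log lines.
base = datetime(2023, 1, 1, 12, 0, 0)
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 900]
metrics = [MetricPoint(base + timedelta(minutes=i), "checkout", "latency_ms", float(v))
           for i, v in enumerate(latencies)]
logs = [
    LogEvent(base + timedelta(minutes=2), "checkout", "request ok"),
    LogEvent(base + timedelta(minutes=8, seconds=40), "checkout",
             "connection pool exhausted"),
    LogEvent(base + timedelta(minutes=9), "search", "upstream timeout"),
]

anomalies = metric_anomalies(metrics)
pairs = correlate(logs, anomalies)
context = build_incident_context(pairs)
print(context)
```

In this sketch the resulting `context` string would be embedded in a prompt asking a language model to summarize the incident and propose likely causes. The choice of time-window joining per service is only one possible correlation strategy; trace identifiers or dependency graphs could serve the same role where they are available.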





