AI-Augmented Data Engineering for Real-Time Fraud Detection in Digital Banking
DOI:
https://doi.org/10.15662/IJEETR.2024.0606018
Keywords:
Real-Time Fraud Detection in Digital Banking, Streaming Data Architectures, AI-Augmented Data Engineering, Fraud Pattern Drift Detection, Class Imbalance Mitigation, Hybrid Fraud Detection Models, Supervised Learning for Fraud Analytics, Unsupervised Anomaly Detection, Isolation and One-Class Models, Feature Engineering for Transactions, Label Generation Pipelines, Scalable Fraud Detection Systems, Low-Latency Transaction Monitoring, Data Quality and Provenance Controls, Transfer Learning in Fraud Models, Parallel Model Deployment Pipelines, Governed Data Environments, Privacy-Compliant Fraud Analytics, Event-Driven Banking Security, End-to-End Fraud Detection Architecture.
Abstract
Real-time fraud detection in digital banking is ineffectual for the majority of fraudulent transactions. Detection latencies often exceed fraud initiation delays, data sources and processing frameworks are limited, scalability to transaction volume is lacking, drift in fraud patterns is inadequately addressed, class imbalance in training data is prevalent, and data quality issues are not comprehensively addressed in supporting work. A review of real-time fraud detection capabilities across the digital banking sector outlines these shortcomings, and a data engineering view identifies a set of enabling foundations and an end-to-end fraud detection system architecture. Data ingestion and streaming architectures, data quality processes, and data provenance controls provide the bedrock of a real-time data analytics capability. Supervised learning is supported by label generation, feature engineering, model selection, and transfer learning, while unsupervised anomaly detection is deployed using clustering, isolation, and one-class models. The underlying streaming framework allows multiple detectors to be combined in a hybrid model. Scalability at the level of the fraud detection model is addressed with a data pipeline that runs multiple detection models in parallel.
A data engineering and machine learning system design is presented, offering insights into the implementation and deployment of AI-augmented data engineering for real-time fraud detection. The design emphasizes the quality and completeness of data, and alignment with data governance, access controls, and data privacy legislation and objectives, in order to deliver a fast, accurate, and easily understood model suitable for deployment into a governed data environment for streaming processing.
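The hybrid arrangement described in the abstract, in which several detectors score each streamed transaction and their outputs are combined, can be illustrated with a minimal sketch. It is not the authors' implementation: the class names (`Transaction`, `VelocityDetector`, `AmountAnomalyDetector`, `HybridDetector`), the window sizes, weights, and threshold are all hypothetical, and a simple per-account z-score stands in for the isolation and one-class models named in the abstract so that the example needs only the Python standard library.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Deque, Dict, List, Tuple


@dataclass(frozen=True)
class Transaction:
    account: str
    amount: float
    timestamp: float  # seconds since an arbitrary epoch


class VelocityDetector:
    """Scores how many transactions an account made inside a sliding window."""

    def __init__(self, window_s: float = 60.0, max_count: int = 3) -> None:
        self.window_s = window_s
        self.max_count = max_count
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def score(self, tx: Transaction) -> float:
        times = self._seen[tx.account]
        times.append(tx.timestamp)
        # Drop events that have fallen out of the window.
        while tx.timestamp - times[0] > self.window_s:
            times.popleft()
        # 0.0 for a lone transaction, 1.0 once max_count extra ones arrive.
        return min(1.0, (len(times) - 1) / self.max_count)


class AmountAnomalyDetector:
    """Unsupervised stand-in: z-score of the amount against account history."""

    def __init__(self, history: int = 50, min_history: int = 5,
                 z_cap: float = 4.0) -> None:
        self.min_history = min_history
        self.z_cap = z_cap
        self._amounts: Dict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=history))

    def score(self, tx: Transaction) -> float:
        past = self._amounts[tx.account]
        result = 0.0
        if len(past) >= self.min_history:
            spread = pstdev(past)
            if spread > 0:
                z = abs(tx.amount - mean(past)) / spread
                result = min(1.0, z / self.z_cap)
        past.append(tx.amount)  # update history after scoring
        return result


class HybridDetector:
    """Weighted average of detector scores, compared with a threshold."""

    def __init__(self, detectors: List, weights: List[float],
                 threshold: float = 0.4) -> None:
        if len(detectors) != len(weights) or sum(weights) <= 0:
            raise ValueError("need one positive-sum weight per detector")
        self.detectors = detectors
        self.weights = weights
        self.threshold = threshold

    def process(self, tx: Transaction) -> Tuple[float, bool]:
        # Each detector sees every event exactly once, in stream order.
        weighted = sum(w * d.score(tx)
                       for d, w in zip(self.detectors, self.weights))
        combined = weighted / sum(self.weights)
        return combined, combined >= self.threshold


if __name__ == "__main__":
    hybrid = HybridDetector(
        [VelocityDetector(), AmountAnomalyDetector()], [0.5, 0.5])
    for i, amount in enumerate([20, 21, 20, 19, 20, 20, 5000]):
        tx = Transaction("A", float(amount), i * 3600.0)
        print(tx.amount, hybrid.process(tx))
```

Because every detector exposes the same `score` method, further detectors (for example a trained supervised classifier) could be added to the list without changing the combining logic. In a deployed system the detectors would more plausibly run as parallel consumers of the same stream rather than in a single loop as shown here.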