Authors:
Reddy Srikanth Madhuranthakam
Addresses:
Department of AI in DevSecOps-FAMC, Citizens Bank, Texas, United States of America.
With the world becoming increasingly data-driven, actionable insights and decisions depend ever more on real-time analytics. However, processing huge volumes of data in real time requires robust, scalable, and efficient data engineering pipelines. This paper describes the design, development, and optimization of scalable data engineering pipelines for real-time analytics in big data environments. It covers ingestion, processing, storage, and visualization, along with their interactions within a distributed computing setup. Best practices are presented for handling high-velocity data streams while maintaining fault tolerance and data consistency. Using an empirical approach, we design a system architecture that combines real-time data processing frameworks such as Apache Kafka and Apache Flink with cloud-based storage solutions to achieve seamless scale-out of the data processing task. Alongside the design, performance evaluation results compare strategies for carrying out real-time analytics with respect to scalability, throughput, and latency. The results indicate that the proposed pipeline substantially improves processing efficiency and is well-suited for large-scale, real-time analytics across numerous industries.
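As a toy illustration of the windowed stream aggregation at the heart of such a pipeline, the following pure-Python sketch (an illustrative example of ours, not code from the paper) mimics a Flink-style tumbling-window count over a Kafka-like stream of timestamped events; the function name and event schema are assumptions for demonstration only:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_s):
    """Group (timestamp_s, key) events into tumbling windows and count per key.

    A pure-Python stand-in for the kind of windowed aggregation a stream
    processor such as Apache Flink performs on events consumed from a
    Kafka topic.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        # Align each event to the start of its tumbling window.
        window_start = (ts // window_size_s) * window_size_s
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Simulated high-velocity click stream: (epoch seconds, user id)
stream = [(0, "a"), (3, "b"), (4, "a"), (11, "a"), (12, "b"), (19, "b")]
print(tumbling_window_counts(stream, 10))
# {0: {'a': 2, 'b': 1}, 10: {'a': 1, 'b': 2}}
```

In a production deployment, the in-memory dictionary would be replaced by the stream processor's managed, checkpointed state so that window contents survive failures, which is what gives the pipeline its fault tolerance.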
Keywords: Data Engineering Pipelines; Real-Time Analytics; Big Data; Distributed Computing; Data Processing Frameworks; Real-Time Data Processing; Storage and Visualization; Decision Support.
Received on: 12/04/2024, Revised on: 26/06/2024, Accepted on: 19/08/2024, Published on: 09/09/2024
DOI: 10.69888/FTSCS.2024.000262
FMDB Transactions on Sustainable Computing Systems, 2024 Vol. 2 No. 3, Pages: 154-166