Big Data Pipeline for Analytics

Overview

HSC’s Big Data Pipeline for analytics is a high-speed distributed architecture. It is designed to serve as the core of any system that needs high-speed processing and real-time analytics of millions of transactions without data loss.

Challenges in Data Processing

Each data processing stage has its own intricacies, and several tools are available for each. The challenge lies in identifying the right tool for each stage and then integrating the chosen tools into a single data processing framework. The framework must address the following high-level requirements:

It must be possible to integrate a new data source.
The framework must allow the processing infrastructure to scale dynamically with the incoming data rate while honouring the associated SLAs.
The generated information may be stored in a variety of places, such as databases and local or network file systems, or passed on to another system.
The framework must have low latency and high throughput.
The framework must monitor the various processing components and handle failures gracefully.
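
The scaling requirement above can be illustrated with a little arithmetic. The following is a minimal sketch, not HSC's implementation: the function name `workers_needed`, the headroom factor, and the worker limits are all hypothetical values chosen for illustration. It assumes each worker sustains a known, fixed processing rate.

```python
import math


def workers_needed(incoming_rate: float, per_worker_rate: float,
                   min_workers: int = 1, max_workers: int = 64,
                   headroom: float = 0.2) -> int:
    """Estimate how many workers keep up with the incoming data rate.

    All parameters are illustrative assumptions:
    incoming_rate   -- records per second arriving at the pipeline
    per_worker_rate -- records per second one worker can sustain
    headroom        -- spare capacity fraction kept to protect the SLA
    """
    if per_worker_rate <= 0:
        raise ValueError("per_worker_rate must be positive")
    if incoming_rate < 0:
        raise ValueError("incoming_rate must not be negative")
    # Capacity required, including the safety margin.
    required = math.ceil(incoming_rate * (1 + headroom) / per_worker_rate)
    # Clamp to the configured cluster limits.
    return max(min_workers, min(max_workers, required))


if __name__ == "__main__":
    # 10,000 records/s with workers that handle 2,500 records/s each.
    print(workers_needed(10_000, 2_500))
```

In practice the per-worker rate would have to be measured, and a real autoscaler would usually smooth the incoming rate over a time window to avoid reacting to short spikes.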

Background on Data Processing

Data processing primarily involves the following four stages:

Extract, Transform and Load (ETL): It should be possible to ingest data from existing data sources. A data source may push data to the framework, or the framework may pull data from it. Incoming data may be cleansed or transformed before processing.

Data Processing: Incoming data shall be processed as per the business objectives to generate information.

Data Storage: Generated information may be stored using persistent/volatile storage. Other systems may pull the information from the storage.

Data Visualization: Generated information may be displayed on some dashboards.
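
To make the four stages concrete, here is a minimal, single-process sketch in plain Python. It is only an illustration of how the stages hand data to each other; the record fields (`account`, `amount`), the function names, and the dictionary used as storage are assumptions made for this example and are not part of the framework.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def extract_transform(raw_lines: Iterable[str]) -> Iterator[dict]:
    """ETL: parse incoming JSON lines, cleanse them, and normalise fields."""
    for line in raw_lines:
        try:
            record = json.loads(line)
            account = str(record["account"]).strip().lower()
            amount = float(record["amount"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # cleansing: skip malformed or incomplete records
        yield {"account": account, "amount": amount}


def process(records: Iterable[dict]) -> Dict[str, float]:
    """Data processing: derive information (here, a total per account)."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["account"]] += record["amount"]
    return dict(totals)


def store(information: Dict[str, float], storage: Dict[str, float]) -> None:
    """Data storage: persist results (a dict stands in for a real store)."""
    storage.update(information)


def visualize(storage: Dict[str, float]) -> List[str]:
    """Data visualization: render stored results as dashboard rows."""
    return [f"{account}: {total:.2f}" for account, total in sorted(storage.items())]


if __name__ == "__main__":
    raw = [
        '{"account": " A1 ", "amount": 10}',
        "not json",
        '{"account": "a1", "amount": 2.5}',
        '{"account": "b2", "amount": 1}',
    ]
    storage: Dict[str, float] = {}
    store(process(extract_transform(raw)), storage)
    for row in visualize(storage):
        print(row)
```

In a distributed deployment each stage would run as a separate component, but the hand-off between stages follows the same order.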

Features

The framework uses Kafka messaging for storing incoming data.

Spark is used as the data processing framework.

All the incoming data may be stored in HDFS as parquet files.

Redis is used as the distributed caching framework for storing frequently used application data.

Processed data may be stored in a medium of choice by writing a custom DAO (data access object) layer.

System's components and KPIs (like maximum incoming data rate) are monitored using Prometheus.

The framework can be hosted locally or in the cloud.

Docker containers and stack services are used to expedite deployment and streamline monitoring.

Any datastore can be used, because all interaction with storage happens through a data abstraction layer.

The framework is highly customizable and tunable for the specific use case at hand.
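
The custom DAO layer and data abstraction layer mentioned above could look roughly like the following. This is a hedged sketch in plain Python and not the framework's actual interface: the class names (`ResultDao`, `InMemoryDao`, `JsonFileDao`), the `save`/`load` methods, and the helper `publish_result` are hypothetical. Keys are assumed to be simple, file-safe names.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class ResultDao(ABC):
    """Hypothetical abstraction: processing code talks only to this interface."""

    @abstractmethod
    def save(self, key: str, value: dict) -> None:
        """Persist one processed result under the given key."""

    @abstractmethod
    def load(self, key: str) -> Optional[dict]:
        """Return the stored result, or None if the key is unknown."""


class InMemoryDao(ResultDao):
    """Volatile storage, useful for tests or short-lived results."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def save(self, key: str, value: dict) -> None:
        self._rows[key] = dict(value)  # copy so later mutation does not leak in

    def load(self, key: str) -> Optional[dict]:
        row = self._rows.get(key)
        return dict(row) if row is not None else None


class JsonFileDao(ResultDao):
    """Persistent storage on a local or network file system, one file per key."""

    def __init__(self, directory: str) -> None:
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self._dir / f"{key}.json"

    def save(self, key: str, value: dict) -> None:
        self._path(key).write_text(json.dumps(value), encoding="utf-8")

    def load(self, key: str) -> Optional[dict]:
        path = self._path(key)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))


def publish_result(dao: ResultDao, key: str, value: dict) -> None:
    """Processing code depends only on ResultDao, never on a concrete store."""
    dao.save(key, value)
```

Swapping the datastore then means passing a different `ResultDao` implementation to the processing code, for example `publish_result(InMemoryDao(), "daily_totals", {"a1": 12.5})`, without changing the processing logic itself.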

Use Cases

Many organizations want a centralized data processing framework. All enterprise data goes into this pipeline and is processed according to pre-configured rules. Access to both raw and processed data must be controlled at all times. This pattern enables closer coordination among different teams and departments within the same enterprise. It also allows the enterprise to share the data processing infrastructure across teams, which brings down the overall cost.

HSC’s data processing framework is well suited to such large-scale enterprise data management needs. The framework ingests data and processes it in real time. Batch jobs can be executed to generate time-consuming, non-real-time reports. Historical data can be archived on HDFS/S3 and retrieved as and when needed. The framework uses Kafka to coordinate the different data processing jobs.
