Data Pipeline Architecture

Narendra Wicaksono
3 min read · Mar 14, 2021

A data pipeline architecture is the arrangement of components that extract, regulate, and route data to the systems where it can yield valuable insights. A data pipeline consists of three key elements: a source, one or more processing steps, and a destination. In some pipelines the destination is called a sink. Data pipelines enable the flow of data from an application to a data warehouse, from a data lake to an analytics database, or into a payment processing system, for example. A pipeline may also have the same source and sink, in which case it exists purely to modify the data set. Any time data is processed between point A and point B (or points B, C, and D), there is a data pipeline between those points.

Big Data Architecture

Data Ingestion

Data ingestion is the process of moving data from one or more sources to a destination where it can be stored and further analyzed. The data might arrive in different formats and come from various sources, including RDBMSs, other types of databases, S3 buckets, CSV files, or streams. Because the data comes from different places, it needs to be cleansed and transformed so that it can be analyzed together with data from other sources. Otherwise, your data is like a bunch of puzzle pieces that don't fit together.

Software: Python
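
As a rough illustration, the sketch below ingests a day of transactions from a relational database and lands them as Parquet in a raw storage zone. The connection string, table, column names, and S3 path are placeholders, and writing directly to S3 assumes s3fs is installed.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and path -- adjust for your environment
source_engine = create_engine("postgresql://user:password@source-db:5432/sales")
raw_path = "s3://raw-zone/transactions/2021-03-14.parquet"  # assumes s3fs is installed

# Extract one day's rows from the transactional RDBMS
transactions = pd.read_sql(
    "SELECT * FROM transactions WHERE created_at::date = '2021-03-14'",
    source_engine,
)

# Light cleansing so this source lines up with the others
transactions["amount"] = transactions["amount"].astype(float)
transactions = transactions.dropna(subset=["customer_id"])

# Land the cleansed data in the raw zone for downstream processing
transactions.to_parquet(raw_path, index=False)
```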

Scheduler

A scheduler determines when the data should be ingested and processed by the other components of the pipeline.

Software: Apache Airflow
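
As a minimal sketch, an Airflow DAG like the one below could run the ingestion step once a day. The DAG id, schedule, and the ingest_transactions callable are illustrative, not part of any specific setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_transactions():
    # Placeholder for the ingestion logic shown earlier
    pass


with DAG(
    dag_id="daily_ingestion",        # illustrative name
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",      # run the pipeline once per day
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="ingest_transactions",
        python_callable=ingest_transactions,
    )
```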

Data Processing

Software: Spark, Google BigQuery
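
For example, a PySpark job along these lines could aggregate the raw transactions into a curated table. The paths and column names are assumptions for illustration only.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_aggregation").getOrCreate()

# Read the raw data landed by the ingestion step (path is illustrative)
transactions = spark.read.parquet("s3a://raw-zone/transactions/")

# Aggregate spend per customer per day
daily_spend = (
    transactions
    .groupBy("customer_id", F.to_date("transaction_ts").alias("date"))
    .agg(F.sum("amount").alias("total_spend"))
)

# Write the curated result back to the lake
daily_spend.write.mode("overwrite").parquet("s3a://curated-zone/daily_spend/")
```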

Data Storage (Data Lake vs Data Warehouse)

A data warehouse contains data extracted from transactional systems, typically quantitative metrics together with their attributes. The data is cleaned and transformed before it is loaded. A data warehouse is ideal for operational users because it is well structured and easy to use and understand.

A data lake is a large storage repository that holds data in its original (raw) format until it is needed. A data lake is ideal for users who perform deep analysis, such as data scientists who need advanced analytical capabilities like predictive modeling and statistical analysis. Storing data in big data technologies is also relatively inexpensive compared with storing it in a data warehouse.

Data storage software: Hive, Hadoop (open source), Parquet (open source), Google BigQuery, Google Cloud Storage, Amazon S3, Amazon Redshift
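
To illustrate the lake-to-warehouse hand-off, the sketch below loads curated Parquet files from Cloud Storage into a BigQuery table. The bucket, project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder locations: curated Parquet files in the lake, a warehouse table as destination
source_uri = "gs://curated-zone/daily_spend/*.parquet"
table_id = "my-project.analytics.daily_spend"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Load the Parquet files into the warehouse table and wait for completion
load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()
```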

Query Engine

Software: Impala, Presto, Dremio, Google BigQuery, Amazon Athena
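
As a rough example, analysts could then query the warehouse table directly. The snippet below uses the BigQuery Python client; the project, dataset, and column names carry over from the earlier (assumed) examples.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative query: top customers by spend over the last month
query = """
    SELECT customer_id, SUM(total_spend) AS monthly_spend
    FROM `my-project.analytics.daily_spend`
    WHERE date >= '2021-02-14'
    GROUP BY customer_id
    ORDER BY monthly_spend DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.customer_id, row.monthly_spend)
```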

A Case Study in Banking

There are many ways big data is used in banking. Some examples are:

  • Customer spending patterns discovery
  • Customer segmentation and profiling
  • Product cross-selling
  • Fraud prevention

Banking data can be acquired from multiple sources, such as ATMs and mobile-banking apps. The acquired and processed data can then serve multiple purposes. Business intelligence and data analysts might create visualizations and build dashboards for decision-makers, while data scientists and machine learning engineers might build predictive models using clustering, classification, or other algorithms.
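
As a small sketch of the segmentation idea, k-means clustering on a few per-customer features might look like this. The feature names and values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up per-customer features: [avg_monthly_spend, atm_withdrawals, mobile_logins]
features = np.array([
    [1200.0, 4, 30],
    [300.0, 10, 2],
    [2500.0, 1, 45],
    [150.0, 12, 1],
    [1800.0, 2, 38],
])

# Scale the features so no single one dominates the distance metric
scaled = StandardScaler().fit_transform(features)

# Group customers into two illustrative segments
segments = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(scaled)
print(segments)  # cluster label per customer
```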

Big data has become a very important tool for businesses seeking a competitive advantage. The technology is evolving quickly, and there is still much room for growth.
