What Is a Data Pipeline? (+ How to Build One)

Written by Coursera Staff

Learn more about data pipeline architecture, tools, and design.

[Featured image] A business intelligence analyst builds a data pipeline and dashboard for a business.

A data pipeline is a method of ingesting raw data from its source and moving it to its destination. Modern data pipelines include both tools and processes. They are necessary because raw data usually must be prepared before it can be used. The type of data pipeline an organization uses depends on factors such as business requirements and the volume of data involved.

Data pipeline vs. ETL pipeline

Data pipeline is a broad term encompassing any process that moves data from one system to another. Extract, transform, load (ETL) pipelines are a type of data pipeline that processes discrete batches of data for a specific purpose. Transformation may or may not be involved in other data pipelines, but it is always present in the ETL process.
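To make the distinction concrete, here is a minimal sketch of an ETL-style pipeline in Python. The source file name, column names, and SQLite destination are illustrative assumptions, not part of any particular product.

```python
import csv
import sqlite3


def extract(path):
    """Extract: read raw rows from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows):
    """Transform: convert types and drop incomplete records."""
    cleaned = []
    for row in rows:
        if not row.get("amount"):
            continue  # skip records missing a required field
        cleaned.append((row["order_id"], float(row["amount"])))
    return cleaned


def load(records, db_path="warehouse.db"):
    """Load: write prepared records into the destination table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", records)


if __name__ == "__main__":
    # Transformation sits between extraction and loading, which is what makes this ETL.
    load(transform(extract("orders.csv")))
```

A pipeline that simply copied orders.csv into storage without the transform step would still be a data pipeline, but not an ETL pipeline.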


Types of data pipelines

  • Real-time data pipeline. Real-time analytics use cases, such as financial monitoring, require this type of data pipeline. Real-time data pipeline architecture is built to process large numbers of events with very low latency, so insights are available almost as soon as the data is generated.

  • Open-source data pipeline. Open-source pipelines are free for public use, although certain features may not be available. This cost-effective approach to data pipelining is often used by small businesses and individuals who need data management.

  • Cloud data pipeline. This type of data pipeline is cloud-based. In other words, data is managed and processed via the internet rather than on local servers. 

  • Streaming data pipeline. Streaming pipelines are among the most commonly used data pipelines. They can ingest both unstructured and structured data from various sources. 

  • Batch data pipeline. Batch processing pipelines are common, especially among organizations that manage large volumes of data. Batch-based processing is slower because large amounts of data are processed together at scheduled intervals, but it requires little user interaction once the jobs are set up (the sketch after this list contrasts batch and streaming processing).
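The difference between streaming and batch processing can be shown with a short sketch. The event source, batch size, and delays below are hypothetical choices for illustration only.

```python
import time
from typing import Dict, Iterator, List


def event_source() -> Iterator[Dict]:
    """A stand-in for a real source such as a message queue or log stream."""
    for i in range(10):
        yield {"event_id": i, "value": i * 10}
        time.sleep(0.1)  # simulate events arriving over time


def process_stream(events: Iterator[Dict]) -> None:
    """Streaming: handle each event as soon as it arrives."""
    for event in events:
        print("processed immediately:", event["event_id"])


def process_batch(events: Iterator[Dict], batch_size: int = 5) -> None:
    """Batch: accumulate events and process them together on a schedule."""
    batch: List[Dict] = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            total = sum(e["value"] for e in batch)
            print(f"processed batch of {len(batch)}, total value {total}")
            batch.clear()
    if batch:  # flush any events left over at the end
        print(f"processed final batch of {len(batch)}")


process_stream(event_source())
process_batch(event_source())
```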

Data pipeline example

The Amazon Web Services (AWS) Data Pipeline is a web service designed to help users manage data processing and movement. It can be used with on-premises data sources as well as AWS services. If you want to practice working with AWS data analytics tools, consider taking the online, beginner-friendly course Getting Started with Data Analytics on AWS. In as little as 3 hours, you’ll gain key data analytics skills with industry experts. For example, you'll learn how to perform descriptive data analytics in the cloud and explain different types of data analyses.
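As a rough illustration of how such a service is driven programmatically, the sketch below uses the AWS SDK for Python (boto3). The region, pipeline name, and unique ID are hypothetical, and a real pipeline would also need a pipeline definition describing its data nodes, activities, and schedule before activation.

```python
import boto3

# Hypothetical region; shown only to illustrate the workflow.
client = boto3.client("datapipeline", region_name="us-east-1")

# Register a pipeline; uniqueId guards against accidental duplicates.
created = client.create_pipeline(
    name="daily-sales-pipeline",
    uniqueId="daily-sales-pipeline-v1",
    description="Example batch pipeline",
)
pipeline_id = created["pipelineId"]

# In a real setup, call put_pipeline_definition() here to describe the
# pipeline's data nodes, activities, and schedule before activating it.

client.activate_pipeline(pipelineId=pipeline_id)
```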

Data pipeline architecture

There are two ways to visualize data pipeline architecture. Let's begin with the conceptual process or workflow. 

First, a data pipeline begins where the data is generated and stored. This can be a single source or multiple sources, depending on the type of pipeline. It can be in any format, including raw, structured, and unstructured. 

Next, data is moved to the location where it will undergo processing and preparation, such as an ETL tool. Processing actions depend on business objectives and analytical requirements.

Finally, the data pipeline ends with analysis. During this phase, data is moved into a data management system, where it can be used for valuable insights such as business intelligence (BI).
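As a small illustration of this analysis stage, the query below reads from the destination table loaded in the earlier ETL sketch (a hypothetical orders table in a local SQLite file) and produces a simple BI-style summary.

```python
import sqlite3

# Assumes the hypothetical warehouse.db / orders table from the ETL sketch above.
with sqlite3.connect("warehouse.db") as conn:
    rows = conn.execute(
        "SELECT COUNT(*) AS order_count, ROUND(SUM(amount), 2) AS revenue FROM orders"
    ).fetchall()

for order_count, revenue in rows:
    print(f"orders: {order_count}, total revenue: {revenue}")
```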

Data pipeline architecture example

The second way to visualize data pipeline architecture is at the platform level. Platform implementations can be customized to fit specific analytical requirements. Here is an example of a data pipeline's platform architecture from Google Cloud documentation:

A Batch ETL Pipeline in GCP: The source might be files that need to be ingested into the analytics business intelligence (BI) engine. Cloud Storage is the data transfer medium inside GCP, and Dataflow is then used to load the data into the target BigQuery storage.

In the above example, the data pipeline begins at the source (files), and the data then moves to storage in the cloud. Next, it is transferred to Dataflow for processing and preparation. Finally, it enters the target database, Google BigQuery, for analysis.
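A pipeline like this is typically written with the Apache Beam SDK, which Dataflow executes. The sketch below is a minimal Beam version of the same flow; the project, bucket, file, table, and schema names are hypothetical placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_row(line):
    """Split a CSV line into the columns the target table expects."""
    order_id, amount = line.split(",")
    return {"order_id": order_id, "amount": float(amount)}


# Hypothetical project, region, and bucket names.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read from Cloud Storage" >> beam.io.ReadFromText(
            "gs://my-bucket/orders.csv", skip_header_lines=1
        )
        | "Parse rows" >> beam.Map(parse_row)
        | "Load into BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.orders",
            schema="order_id:STRING,amount:FLOAT",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```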

How to build a data pipeline

Before you start planning your data pipeline architecture, it's important to identify essential elements like purpose and scalability needs. A few things to keep in mind when planning your data pipeline include:

  • Analytical requirements. Think about what type of insights you want to gain from your data at the end of the pipeline. Will you use it for machine learning (ML), business intelligence (BI), or something else?

  • Volume. Consider how much data you will be managing and whether that amount could change over time. 

  • Data types. Data pipeline solutions may have limitations based on data types. Identify the types of data you'll be working with (structured, streaming, raw).

1. Determine which type of data pipeline you need to use.

First, outline your needs, business goals, or target database requirements. You can use the list above to determine which type of data pipeline to use. For example, if you need to manage large amounts of data, you may need to build a batch data pipeline. Organizations needing real-time processing for their insights may benefit from stream processing instead.   

2. Select your data pipeline tools.

There are many different data pipeline tools on the market. You can use a solution that includes end-to-end (entire process) pipeline management or combine individual tools for a hybrid, personalized solution. For example, when building a cloud data pipeline, you may need to combine cloud services (like storage) with an ETL tool that preps data for transfer to your target destination. 
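For instance, a hybrid setup might pair a managed cloud storage service with a lightweight transform step you run yourself. The sketch below is one illustrative combination, assuming a hypothetical S3 bucket and object name; it is not tied to any vendor's end-to-end product.

```python
import csv
import io

import boto3

# Hypothetical bucket and object key; replace with your own source.
s3 = boto3.client("s3")
response = s3.get_object(Bucket="raw-data-bucket", Key="exports/orders.csv")
raw_text = response["Body"].read().decode("utf-8")

# A small local "ETL tool": parse the file and keep only complete records.
prepared = [
    row for row in csv.DictReader(io.StringIO(raw_text)) if row.get("amount")
]

print(f"prepared {len(prepared)} records for loading into the target destination")
```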

3. Implement your data pipeline design.

After implementing your design, it's essential to plan for maintenance, scaling, and continued improvement. Make sure to consider information security (InfoSec) in your design to protect sensitive data as it moves through the pipeline. Often, companies employ data engineers and architects to oversee data pipeline system planning, implementation, and monitoring. 
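One common way to implement such a design is to express the pipeline as a scheduled workflow in an orchestrator. The sketch below uses Apache Airflow; the DAG name, schedule, and task bodies are illustrative assumptions, and it assumes Airflow 2.4 or later for the schedule argument.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the source")  # placeholder task body


def transform():
    print("clean and prepare the data")  # placeholder task body


def load():
    print("write prepared data to the destination")  # placeholder task body


# Hypothetical DAG name and schedule.
with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run the three stages in order: extract, then transform, then load.
    extract_task >> transform_task >> load_task
```

An orchestrator like this also helps with the maintenance and monitoring mentioned above, since failed runs can be retried and inspected centrally.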

Learn more about building a data pipeline with Coursera

You can compare methods of converting raw data into data that is ready for analytical use with IBM’s beginner-friendly online course, ETL and Data Pipelines with Shell, Airflow, and Kafka. More advanced learners may consider building a data pipeline while earning the Google Business Intelligence Professional Certificate, a 100 percent online program.


This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.