Azure Data Factory (ADF)
The purpose of this story is to give an introduction to Azure Data Factory (ADF). There is plenty of material available on MS portals and other informative websites; here I would like to share how I started working on ADF, the initial anxiety of creating flows with it and, especially when you have an existing legacy pipeline with dependencies on source and target systems, what you can do differently to improve the process and how ADF's modules help.
What is Azure Data Factory?
Azure Data Factory launched in mid-2016 and has since matured in the market. In other words, it is a hybrid data integration service that enables ETL (Extract, Transform and Load) at scale.
MS Definition: A fully managed, serverless data integration solution for ingesting, wrangling and transforming data at scale.
What I felt using Azure Data Factory is that it makes building a data flow or ETL process very quick and easy, with less effort. In fact, at some scale it is code-free: you construct the ETL process in an intuitive visual environment. My purpose is to share my experience with those who especially want to kick off a project using ADF.
Data Factory Components:
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. An activity is a processing step that acts on your data, and its output can be used by other activities. For instance, if you have a simple task to copy data from Blob storage to SQL Server, you can use the Copy Data activity to connect the source and target datasets: the source dataset is connected to Azure Storage using a linked service, and similarly the target dataset is connected to Azure SQL Server using another linked service. As another example, if a transformation is involved, we can use the Data Flow activity to process and transform the data from one format to another. An activity can take zero or more input datasets and produce one or more output datasets.
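To make this concrete, here is a minimal sketch of what such a pipeline definition might look like in JSON. The pipeline, activity and dataset names are made up for illustration, and non-essential properties are omitted; the datasets referenced here, and the linked services behind them, are what we create in the steps below.

```json
{
    "name": "CopyBlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "BlobInputDataset", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "SqlOutputDataset", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "AzureSqlSink" }
                }
            }
        ]
    }
}
```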
To understand this practically, we need to understand the end-to-end relationship and usage of the Azure Data Factory components.
Steps to follow while creating a pipeline (assuming we have Azure Storage, a database and the other source and target systems in place):
- First, we need to create linked services for source and target.
- Create the datasets for source and target.
- Create a pipeline.
- Now, create the required activities within the pipeline.
- If the pipeline requires transformations, you may also create a Data Flow.
- Beyond this, you can create new Azure components or reuse existing ones with a more dynamic, parameterised implementation (see the sketch after this list).
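As a taste of the "dynamic implementation" mentioned in the last step, a dataset can be parameterised so that one definition is reused for many files. Below is a rough sketch; the dataset, parameter and linked service names are hypothetical, and linked services and datasets themselves are explained further down.

```json
{
    "name": "ParameterisedBlobDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "MyBlobLinkedService",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "fileName": { "type": "String" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": { "value": "@dataset().fileName", "type": "Expression" }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```

A pipeline can then pass a different fileName value on every run instead of cloning the dataset.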
Please note that Azure Data Factory also lets you reuse existing custom code (services developed in other languages such as C#) and existing SSIS packages.
Azure Data Factory has three groupings of activities: data movement activities, data transformation activities and control activities.
ADF supports the following file formats:
- Avro format
- Binary format
- Delimited text format
- Excel format
- JSON format
- ORC format
- Parquet format
- XML format
Linked Services:
For an activity, we need input and output dataset connectivity, and to connect those datasets to the underlying data stores we need linked services. In other words, before creating the datasets we first need to create linked services for the source and target data stores.
In layman's terms, a linked service is like a connection string: it defines the connection information that Data Factory needs to connect to internal and external resources.
To get a better understanding of linked services, let's take an example:
To copy data from Azure Blob storage → Azure SQL Server
- We need to create two linked services: one for the input dataset (Azure Blob Storage) and another for the output dataset (Azure SQL Server).
- Then, we need to create two datasets: 1) Azure Blob Storage, 2) Azure SQL Server.
- Both linked services contain the connection strings that Data Factory uses at runtime to connect to the source Azure Storage account and the target Azure SQL Database.
Data Factory saves everything in JSON format, so you can see exactly what is going to be saved in the repository.
AzureDataStorage_linkedsvc1 — connection information to connect to the source Azure Storage store.
SQLDataBase_linkedsvc2 — connection information to connect to the target Azure SQL Database.
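Roughly, the JSON that Data Factory saves for these two linked services looks like the following sketch; the connection strings are placeholders, not real values.

```json
{
    "name": "AzureDataStorage_linkedsvc1",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<key>"
        }
    }
}
```

```json
{
    "name": "SQLDataBase_linkedsvc2",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<db>;User ID=<user>;Password=<password>"
        }
    }
}
```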
Datasets:
A dataset is a named view of data that references the source and target data we want to use as the input and output of an activity.
Here, I used: 1) dataset_BlobStorage and 2) dataset_SQLDatabase1.
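For reference, here is a rough sketch of how these two datasets might be defined in JSON, each pointing at its linked service. The container, file and table names are made up for illustration, and the blob dataset is assumed to hold delimited text.

```json
{
    "name": "dataset_BlobStorage",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureDataStorage_linkedsvc1",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "customers.csv"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```

```json
{
    "name": "dataset_SQLDatabase1",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "SQLDataBase_linkedsvc2",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "schema": "dbo",
            "table": "Customers"
        }
    }
}
```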
Hopefully this gives you a clearer understanding of the relationship between datasets and linked services.
Activity:
An activity is a task that we want to perform (as in the example above: copying data from Azure Blob storage → Azure SQL Server).
ADF provides several activities in the code-free environment, such as Copy Data, Data Flow, Lookup, Get Metadata, ForEach, If Condition, Execute Pipeline, Stored Procedure and Web.
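To illustrate how one activity's output can feed another, here is a rough sketch of a Lookup activity followed by a Set Variable activity; the pipeline name, query and variable are made up for illustration.

```json
{
    "name": "LookupThenSetVariable",
    "properties": {
        "variables": {
            "rowCount": { "type": "String" }
        },
        "activities": [
            {
                "name": "LookupRowCount",
                "type": "Lookup",
                "typeProperties": {
                    "source": {
                        "type": "AzureSqlSource",
                        "sqlReaderQuery": "SELECT COUNT(*) AS cnt FROM dbo.Customers"
                    },
                    "dataset": {
                        "referenceName": "dataset_SQLDatabase1",
                        "type": "DatasetReference"
                    }
                }
            },
            {
                "name": "StoreRowCount",
                "type": "SetVariable",
                "dependsOn": [
                    { "activity": "LookupRowCount", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "variableName": "rowCount",
                    "value": {
                        "value": "@string(activity('LookupRowCount').output.firstRow.cnt)",
                        "type": "Expression"
                    }
                }
            }
        ]
    }
}
```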
Now that we understand the usage of linked services, datasets and, most importantly, the role of an activity in a pipeline, let's look at Data Flow, followed by the Trigger and Debug features of ADF for testing the pipeline.
Dataflow:
Azure Data Flow is a "drag and drop" solution (don't hate it yet) that gives the user, with no coding required, a visual representation of the data "flow" and the transformations being done. As usual when working in Azure, you create your linked services to define where the data is coming from and where it is going.
(MS Definition)
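In the pipeline JSON, a mapping data flow is invoked through an Execute Data Flow activity that simply references the data flow by name. Below is a minimal sketch; the activity name, data flow name and compute sizing are illustrative.

```json
{
    "name": "TransformCustomerData",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataFlow": {
            "referenceName": "dataflow_CleanCustomers",
            "type": "DataFlowReference"
        },
        "compute": {
            "coreCount": 8,
            "computeType": "General"
        }
    }
}
```

The transformation logic itself (sources, derived columns, sinks and so on) lives in the data flow definition that you build visually.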
Debug:
Azure Data Factory allows you to debug a pipeline until you reach a particular activity on the pipeline canvas. Put a breakpoint on the activity up to which you want to test, and select Debug. Data Factory ensures that the test runs only until the breakpoint activity on the pipeline canvas.
(MS Definition)
Trigger/Schedule/Execution/Monitor:
Each pipeline run has a unique run ID (a GUID). We can run a pipeline either manually or by using a trigger. We can create a trigger that runs a pipeline on a wall-clock schedule: hourly, daily, weekly, monthly, or in a time-series (tumbling window) manner. A pipeline can also execute another pipeline. We can monitor the execution of all the activities in a pipeline from the Monitor view.
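For example, a schedule trigger that runs a pipeline once a day might be saved as JSON roughly like this; the trigger name, start time and pipeline name are illustrative, reusing the hypothetical pipeline from the earlier sketch.

```json
{
    "name": "trigger_DailyRun",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2021-01-01T02:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyBlobToSqlPipeline",
                    "type": "PipelineReference"
                },
                "parameters": {}
            }
        ]
    }
}
```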
Summary
This article should make it easier for you to start using Azure Data Factory and give you a better understanding of its components, so you can begin with a basic grasp of the ADF services required to fulfil your project requirements.
Please share your feedback so I can make this content better in the future.