Azure Data Factory

22/11/2024
To stay competitive, businesses must seamlessly move, transform, and manage data across various platforms. Azure Data Factory (ADF) is a robust cloud-based data integration service designed to meet these needs. ADF enables you to create, schedule, and orchestrate data workflows at scale, simplifying the process of extracting, transforming, and loading (ETL) data.

This guide presents a comprehensive overview of ADF, covering architecture, operations, performance tuning, cost optimization, CI/CD integration, Azure Functions integration, real-world use cases, monitoring, troubleshooting, pros and cons, and a comparison with Microsoft Fabric.

Key Concepts

  • Pipeline – A workflow containing interconnected activities
  • Activity – A task inside a pipeline (Copy, Lookup, Stored Procedure, Web, etc.)
  • Dataset – Describes the structure of input and output data
  • Linked Service – Defines the connection configuration to a data source
  • Integration Runtime (IR) – The compute engine that executes pipeline activities
  • Triggers – Define when or how a pipeline runs
  • Monitoring – Allows real-time tracking of executions
  • CI/CD – Integrates with Git, Azure DevOps, or GitHub for deployments
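
To see how these concepts relate, the sketch below builds the JSON definition ADF stores for a minimal pipeline containing a single Copy activity that reads one dataset and writes another. All names (`DemoPipeline`, `SrcCsv`, `SinkParquet`) are placeholders, and only the structural fields are shown.

```python
import json

def copy_pipeline(name: str, src_dataset: str, sink_dataset: str) -> dict:
    """A pipeline (workflow) holding one Copy activity that reads from a
    source dataset and writes to a sink dataset."""
    return {
        "name": name,
        "properties": {
            "activities": [
                {
                    "name": "CopyData",
                    "type": "Copy",  # the Activity
                    "inputs": [{"referenceName": src_dataset,
                                "type": "DatasetReference"}],   # input Dataset
                    "outputs": [{"referenceName": sink_dataset,
                                 "type": "DatasetReference"}],  # output Dataset
                }
            ]
        },
    }

print(json.dumps(copy_pipeline("DemoPipeline", "SrcCsv", "SinkParquet"), indent=2))
```

The datasets referenced here would in turn point to Linked Services, which hold the actual connection details.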

What Is Azure Data Factory?

Azure Data Factory is a fully managed, cloud-based data integration service that supports building ETL and ELT data pipelines. It provides more than 90 connectors and is commonly used for:

  • Moving data across cloud and on-premises systems
  • Building automated workflows
  • Batch and near real-time data processing
  • Integrating with Azure Data Lake, Databricks, SQL, and APIs

Architecture and How It Works

Core Components

  • Pipelines – Define orchestration logic
  • Activities – Represent actions within a pipeline
  • Integration Runtime (IR)
    • Azure Integration Runtime
    • Self-hosted Integration Runtime
    • SSIS Integration Runtime
  • Datasets and Linked Services

Execution Flow

  1. A pipeline is created with defined activities
  2. The Integration Runtime executes those activities
  3. Pipelines interact with data sources using Linked Services
  4. Monitoring tracks execution and logs activity-level details
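
This flow can also be driven programmatically. The sketch below triggers a pipeline run with the `azure-mgmt-datafactory` SDK; the resource group, factory, pipeline name, and parameter names are placeholders you would replace with your own.

```python
def build_run_parameters(window_start: str, window_end: str) -> dict:
    """Parameters handed to the pipeline run; the keys must match the
    parameters declared on the pipeline itself."""
    return {"windowStart": window_start, "windowEnd": window_end}

def trigger_pipeline(subscription_id: str) -> str:
    # Imported lazily so the helper above works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)
    run = client.pipelines.create_run(
        resource_group_name="my-rg",        # placeholder resource group
        factory_name="my-factory",          # placeholder factory name
        pipeline_name="CopyCrmPipeline",    # placeholder pipeline name
        parameters=build_run_parameters("2024-11-22T00:00:00Z",
                                        "2024-11-22T01:00:00Z"),
    )
    return run.run_id  # poll client.pipeline_runs.get(...) with this id
```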

Performance and Cost Optimization

Performance Optimization

  • Use partitioning and parallelism for large Copy activities
  • Enable auto-scaling using Azure Integration Runtime
  • Optimize Mapping Data Flow by tuning cluster size, caching, and partitioning
  • Combine small files into larger batches to reduce overhead

Cost Optimization

  • Reduce unnecessary parallel pipeline runs
  • Fine-tune Data Flow operations to shorten cluster runtime
  • Disable clusters when idle
  • Use Self-hosted Integration Runtime appropriately to avoid cloud compute charges

CI/CD and Azure Functions Integration

CI/CD for Azure Data Factory

ADF integrates with Git-based systems such as Azure DevOps and GitHub to support:

  • Development using feature branches
  • Publishing to the adf_publish branch
  • Deployment across Dev, Test, and Prod environments using pipelines or GitHub Actions

Azure Functions Integration

ADF can call Azure Functions to:

  • Execute custom business logic
  • Validate or clean data
  • Generate metadata dynamically
  • Trigger internal APIs or microservices
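
As an illustration of the validate-or-clean case, here is a minimal sketch of the core logic such a function might run. The cleaning rule (trim strings, drop empty fields) is invented for the example; in a real Function app, `handle_request` would be wrapped in the standard `main(req: func.HttpRequest)` entry point.

```python
import json

def clean_record(record: dict) -> dict:
    """Trim string fields and drop empty values; a stand-in for real rules."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in (None, ""):
            cleaned[key] = value
    return cleaned

def handle_request(body: str) -> str:
    """Core logic of an HTTP-triggered Azure Function: ADF's Azure Function
    activity would POST a JSON array of records and read back the result."""
    records = json.loads(body)
    cleaned = [clean_record(r) for r in records]
    return json.dumps({"count": len(cleaned), "records": cleaned})
```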

Real-World Use Case

Use Case: Syncing CRM Data into Data Lake

  • Source: Dynamics 365 or Salesforce
  • Destination: Azure Data Lake Gen2

Pipeline Steps:

  1. Trigger runs every hour
  2. ADF calls the CRM API to fetch incremental data
  3. Data is stored in the Data Lake
  4. A notification is sent to Teams upon success
  5. Databricks transforms the data afterward

Monitoring and Troubleshooting

Azure Data Factory provides monitoring features including:

  • Pipeline run history
  • Input and output details for each activity
  • Error messages with stack traces
  • Retry, rerun, and debug options through the UI
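
The same run history is available programmatically. A minimal sketch using the `azure-mgmt-datafactory` SDK, querying the last 24 hours of runs and counting them by status (resource names are placeholders):

```python
from collections import Counter

def summarize_runs(runs: list) -> dict:
    """Count pipeline runs by status (Succeeded, Failed, InProgress, ...)."""
    return dict(Counter(run["status"] for run in runs))

def query_recent_runs(subscription_id: str, resource_group: str, factory: str) -> dict:
    # Imported lazily so summarize_runs works without the Azure SDK installed.
    from datetime import datetime, timedelta, timezone
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import RunFilterParameters

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)
    now = datetime.now(timezone.utc)
    filters = RunFilterParameters(last_updated_after=now - timedelta(days=1),
                                  last_updated_before=now)
    result = client.pipeline_runs.query_by_factory(resource_group, factory, filters)
    return summarize_runs([{"status": run.status} for run in result.value])
```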

Common Issues

  • Invalid Linked Service credentials
  • API request timeouts
  • Self-hosted Integration Runtime offline
  • Slow Mapping Data Flow cluster initialization

Pros and Cons

Pros

  • Low-code and beginner-friendly
  • Wide data connectivity support
  • Strong CI/CD integration
  • Excellent orchestration capabilities
  • Tight integration with the Azure ecosystem

Cons

  • Mapping Data Flow can be expensive
  • Debug mode may be slow for complex pipelines
  • Self-hosted Integration Runtime can have performance limits
  • Dependency management may become complex

Comparison with Microsoft Fabric

Feature          Azure Data Factory                             Microsoft Fabric
---------------  ---------------------------------------------  -----------------------------
Purpose          ETL / ELT                                      All-in-one analytics platform
Storage          Data Lake                                      OneLake
Compute          Integration Runtime / Data Flow / Databricks   Spark / Pipelines
BI Integration   No native BI                                   Built-in Power BI
AI Assistance    None                                           Copilot
Best For         Complex ETL and orchestration                  Modern unified analytics

Implementation Example

Calling an Azure Function with Python
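
The sketch below shows, in plain Python, what ADF's Azure Function activity does under the hood: an HTTPS POST to the function's standard endpoint, authenticated with a function key. The app name, function name, and key are placeholders.

```python
import json
import urllib.request

def build_function_url(app_name: str, function_name: str, key: str) -> str:
    """Standard Azure Functions HTTP endpoint, authenticated by function key."""
    return f"https://{app_name}.azurewebsites.net/api/{function_name}?code={key}"

def call_function(app_name: str, function_name: str, key: str, payload: dict) -> dict:
    """POST a JSON payload to the function and decode the JSON reply,
    which is essentially what the Azure Function activity does for you."""
    req = urllib.request.Request(
        build_function_url(app_name, function_name, key),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example with placeholder names:
# call_function("my-func-app", "CleanCrmData", "<function-key>", {"rows": 100})
```

Inside a pipeline, the same call is configured declaratively on the Azure Function activity, so no client code is required.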
