Azure Data Factory
22/11/2024
To stay competitive, businesses must seamlessly move, transform, and manage data across various platforms. Azure Data Factory (ADF) is a robust cloud-based data integration service designed to meet these needs.
ADF enables you to create, schedule, and orchestrate data workflows at scale, simplifying the process of extracting, transforming, and loading (ETL) data. This guide presents a comprehensive overview of ADF, covering its architecture, core concepts, performance tuning, cost optimization, CI/CD and Azure Functions integration, a real-world use case, monitoring and troubleshooting, pros and cons, and a comparison with Microsoft Fabric.

Key Concepts
- Pipeline – A workflow containing interconnected activities
- Activity – A task inside a pipeline (Copy, Lookup, Stored Procedure, Web, etc.)
- Dataset – Describes the structure of input and output data
- Linked Service – Defines the connection configuration to a data source
- Integration Runtime (IR) – The compute engine that executes pipeline activities
- Triggers – Define when or how a pipeline runs
- Monitoring – Allows real-time tracking of executions
- CI/CD – Integrates with Git, Azure DevOps, or GitHub for deployments
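These concepts all surface as JSON resources in ADF. A minimal, illustrative pipeline definition (all names here are placeholders, not required values) ties a Copy activity to referenced datasets:

```json
{
  "name": "CopyCrmToLake",
  "properties": {
    "activities": [
      {
        "name": "CopyCrmData",
        "type": "Copy",
        "inputs": [
          { "referenceName": "CrmSourceDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "LakeSinkDataset", "type": "DatasetReference" }
        ]
      }
    ]
  }
}
```

Each dataset referenced above would in turn point at a Linked Service holding the connection details.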
What Is Azure Data Factory?
Azure Data Factory is a fully managed, cloud-based data integration service that supports building ETL and ELT data pipelines. It provides more than 90 connectors and is commonly used for:
- Moving data across cloud and on-premises systems
- Building automated workflows
- Batch and near real-time data processing
- Integrating with Azure Data Lake, Databricks, SQL, and APIs
Architecture and How It Works
Core Components
- Pipelines – Define orchestration logic
- Activities – Represent actions within a pipeline
- Integration Runtime (IR) – Supplies the compute that runs activities
- Datasets and Linked Services – Describe data structure and connection details
Execution Flow
- A pipeline is created with defined activities
- The Integration Runtime executes those activities
- Pipelines interact with data sources using Linked Services
- Monitoring tracks execution and logs activity-level details
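The execution flow above can be sketched in plain Python as a toy orchestrator (this is a conceptual model, not the ADF runtime): activities run in order, each one's output feeds shared pipeline state, and a per-activity run record is kept, mirroring ADF's activity-level monitoring.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Activity:
    name: str
    run: Callable[[dict], dict]  # takes pipeline state, returns its outputs

@dataclass
class Pipeline:
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self) -> Dict[str, dict]:
        """Run activities in order, recording per-activity outputs
        (mirrors ADF's activity-level run logging)."""
        state: dict = {}
        runs: Dict[str, dict] = {}
        for activity in self.activities:
            output = activity.run(state)
            state.update(output)
            runs[activity.name] = {"status": "Succeeded", "output": output}
        return runs

# Example: a two-activity pipeline (lookup a watermark, then copy)
pipeline = Pipeline("Demo", [
    Activity("LookupWatermark", lambda s: {"watermark": "2024-11-21"}),
    Activity("CopyData", lambda s: {"rows_copied": 42, "since": s["watermark"]}),
])
runs = pipeline.execute()
print(runs["CopyData"]["output"])
```

In real ADF the Integration Runtime provides the compute and Linked Services provide the connections; here both are collapsed into plain callables for clarity.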
Performance and Cost Optimization
Performance Optimization
- Use partitioning and parallelism for large Copy activities
- Enable auto-scaling using Azure Integration Runtime
- Optimize Mapping Data Flow by tuning cluster size, caching, and partitioning
- Combine small files into larger batches to reduce overhead
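The last point, combining small files into larger batches, can be sketched as a simple greedy planner (a minimal illustration; ADF's Copy activity exposes equivalent behavior through settings such as file batching and parallel copies):

```python
from typing import Iterable, List, Tuple

def plan_batches(files: Iterable[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily group small files into batches near target_bytes,
    so a Copy activity moves fewer, larger objects."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Start a new batch once adding this file would exceed the target
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

files = [("a.csv", 40), ("b.csv", 40), ("c.csv", 40), ("d.csv", 90)]
batches = plan_batches(files, target_bytes=100)
print(batches)  # [['a.csv', 'b.csv'], ['c.csv'], ['d.csv']]
```

Fewer, larger objects reduce per-file overhead (listing, connection setup, transaction costs) on both source and sink.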
Cost Optimization
- Reduce unnecessary parallel pipeline runs
- Fine-tune Data Flow operations to shorten cluster runtime
- Shut down Data Flow clusters when idle (for example, by setting a short time-to-live)
- Use Self-hosted Integration Runtime appropriately to avoid cloud compute charges
CI/CD and Azure Functions Integration
CI/CD for Azure Data Factory
ADF integrates with Git-based systems such as Azure DevOps and GitHub to support:
- Development using feature branches
- Publishing to the adf_publish branch
- Deployment across Dev, Test, and Prod environments using pipelines or GitHub Actions
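A deployment pipeline over the publish output can be sketched as a GitHub Actions workflow (illustrative only: resource group, factory folder, and secret names are placeholders for your environment) that deploys the ARM template ADF writes to the adf_publish branch:

```yaml
name: deploy-adf
on:
  push:
    branches: [adf_publish]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Deploy factory ARM template
        run: |
          az deployment group create \
            --resource-group my-rg \
            --template-file MyFactory/ARMTemplateForFactory.json \
            --parameters @MyFactory/ARMTemplateParametersForFactory.json
```

Dev, Test, and Prod deployments typically reuse the same template with different parameter files per environment.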
Azure Functions Integration
ADF can call Azure Functions to:
- Execute custom business logic
- Validate or clean data
- Generate metadata dynamically
- Trigger internal APIs or microservices
Real-World Use Case
Use Case: Syncing CRM Data into Data Lake
- Source: Dynamics 365 or Salesforce
- Destination: Azure Data Lake Gen2
Pipeline Steps:
- Trigger runs every hour
- ADF calls the CRM API to fetch incremental data
- Data is stored in the Data Lake
- A notification is sent to Teams upon success
- Databricks transforms the data afterward
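The landing step can be made concrete with a small Python helper that derives a date-partitioned Data Lake path for each hourly run (the folder layout is illustrative, not an ADF requirement; in ADF this is typically built with pipeline expressions over the trigger time):

```python
from datetime import datetime, timezone

def lake_path(entity: str, run_time: datetime) -> str:
    """Partitioned Data Lake path for an hourly incremental load.
    Partitioning by run time keeps each hourly batch addressable
    for downstream Databricks transformations."""
    return (f"raw/{entity}/year={run_time:%Y}/month={run_time:%m}/"
            f"day={run_time:%d}/hour={run_time:%H}/{entity}.json")

run_time = datetime(2024, 11, 22, 9, 0, tzinfo=timezone.utc)
path = lake_path("accounts", run_time)
print(path)  # raw/accounts/year=2024/month=11/day=22/hour=09/accounts.json
```

Storing each incremental pull under its own time partition also makes reruns idempotent: rerunning an hour simply overwrites that hour's folder.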
Monitoring and Troubleshooting
Azure Data Factory provides monitoring features including:
- Pipeline run history
- Input and output details for each activity
- Error messages with stack traces
- Retry, rerun, and debug options through the UI
Common Issues
- Invalid Linked Service credentials
- API request timeouts
- Self-hosted Integration Runtime offline
- Slow Mapping Data Flow cluster initialization
Pros and Cons
Pros
- Low-code and beginner-friendly
- Wide data connectivity support
- Strong CI/CD integration
- Excellent orchestration capabilities
- Tight integration with the Azure ecosystem
Cons
- Mapping Data Flow can be expensive
- Debug mode may be slow for complex pipelines
- Self-hosted Integration Runtime can have performance limits
- Dependency management may become complex
Comparison with Microsoft Fabric
| Feature | Azure Data Factory | Microsoft Fabric |
| --- | --- | --- |
| Purpose | ETL / ELT | All-in-one analytics platform |
| Storage | Data Lake | OneLake |
| Compute | Integration Runtime / Data Flow / Databricks | Spark / Pipelines |
| BI Integration | No native BI | Built-in Power BI |
| AI Assistance | None | Copilot |
| Best For | Complex ETL and orchestration | Modern unified analytics |
Implementation Example
Calling an Azure Function with Python
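A minimal sketch using only the standard library (the endpoint, function name, and key below are placeholders; ADF's built-in Azure Function activity performs the equivalent HTTP call natively). HTTP-triggered Azure Functions accept the function key as a `code` query parameter:

```python
import json
import urllib.request

def function_url(base_url: str, function_name: str, function_key: str) -> str:
    """Build the URL of an HTTP-triggered Azure Function,
    passing the function key via the standard 'code' parameter."""
    return f"{base_url}/api/{function_name}?code={function_key}"

def call_azure_function(base_url: str, function_name: str,
                        function_key: str, payload: dict) -> dict:
    """POST a JSON payload to the function and decode its JSON response."""
    request = urllib.request.Request(
        function_url(base_url, function_name, function_key),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

# Hypothetical usage (requires a deployed function app):
# result = call_azure_function("https://myapp.azurewebsites.net",
#                              "CleanData", "<function-key>",
#                              {"rows": [1, 2, 3]})
```

The same pattern covers the integration scenarios listed earlier: validation, metadata generation, or triggering internal APIs are all just different payloads against different functions.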
