Azure Data Factory Control of a Virtual Machine
Introduction
In my work on a health-data project, we use Azure Data Factory (ADF) to drive our data flow from raw ingestion to polished, display-ready analysis.
ADF performs many operations with its built-in graphical Activities. For more complex operations, you can call a Databricks notebook written in PySpark or SQL. But sometimes even that is not enough, and you need a separate virtual machine to perform an operation. These instructions explain how to do that.
The core mechanism used here is Azure Automation. The target virtual machine is registered in an Automation "Hybrid Worker Group". An Automation "runbook", written in PowerShell, is pushed to the target machine for execution. Data Factory starts the job by calling an HTTP webhook attached to the runbook.
(This document assumes that the virtual machine is a Windows VM in Azure. In theory, this technique should work for any target machine, even outside of Azure in another cloud or on-prem. Also, in theory, the target machine can be Linux. I have not tested either of these variations.)
Create an Automation Runbook and Webhook
Azure Automation is useful in many areas, including DevOps tasks such as creating and managing computing infrastructure. In our case, we will take advantage of Automation's ability to run a given script on a given machine.
First, create and test an Automation runbook and webhook as explained in my earlier article.
Add the Webhook to a Data Factory Pipeline
In a Data Factory pipeline, add the activity named Web (not WebHook), found under the General category. In the activity's Settings, set URL to the webhook URL, set Method to POST, and supply a JSON body (a POST from the Web activity needs a body; an empty object {} will do if the runbook takes no parameters).
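Before wiring the webhook into a pipeline, it can help to exercise it by hand. Below is a minimal sketch, using only Python's standard library, of the same POST the Web activity will send. The URL and the Message parameter are placeholders for your own webhook token and runbook parameters:

```python
import json
import urllib.request

def build_webhook_request(webhook_url: str, parameters: dict) -> urllib.request.Request:
    """Build the same POST that the ADF Web activity sends to the webhook."""
    body = json.dumps(parameters).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL: use the one generated when you created the webhook.
req = build_webhook_request(
    "https://example.webhook.azure-automation.net/webhooks?token=REDACTED",
    {"Message": "hello from a manual test"},
)

# To actually fire the webhook (needs the real URL and network access):
#   with urllib.request.urlopen(req) as response:
#       print(json.loads(response.read()))  # the body lists the queued job IDs
```

Keeping the request construction in its own function makes it easy to test the payload without touching the network.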
Validate, publish and trigger your pipeline.
Go to the Monitor panel of Data Factory. You should see the run for this pipeline. Examine the Web activity's output; it contains the Job ID of the queued Automation job.
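The webhook replies with a small JSON document listing the queued job's ID(s), which is what the Web activity surfaces as its output. A sketch of pulling the ID out of that body (the GUID shown is a made-up example):

```python
import json

# Example of the response body an Automation webhook returns.
web_activity_output = '{"JobIds": ["f18b0e36-1234-5678-9abc-000000000000"]}'

def extract_job_id(output_json: str) -> str:
    """Pull the first queued job ID out of the webhook response body."""
    job_ids = json.loads(output_json)["JobIds"]
    if not job_ids:
        raise ValueError("webhook response contained no job IDs")
    return job_ids[0]

print(extract_job_id(web_activity_output))
```

Inside the pipeline itself, the same value can be reached with an expression along the lines of @activity('Web1').output.JobIds[0], assuming the Web activity is named Web1.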
Go to the Automation portal. Examine this runbook and its list of recent jobs. You should see the job launched from Data Factory. Click into the job and look at the output. You should see confirmation that this job was initiated from Data Factory and executed on the virtual machine.
Share Data Between ADF and Target Machine
There are several ways to pass data between ADF and the target machine:
- Pass input values in the JSON body of the webhook call; the runbook receives them through its $WebhookData parameter.
- Exchange files through Azure Blob Storage, which both ADF and the virtual machine can read and write.
- Have the runbook write results to a database or table that a later pipeline activity reads.
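One common approach, passing input values in the webhook's JSON body, can be sketched end to end. The InputPath and RunMode names below are hypothetical; Automation hands the raw body to the runbook inside its $WebhookData parameter (in PowerShell, $WebhookData.RequestBody, typically parsed with ConvertFrom-Json):

```python
import json

# What the ADF Web activity sends as the request body (hypothetical names):
adf_body = {"InputPath": "raw/2024-01-01", "RunMode": "full"}

# Azure Automation delivers the raw body to the runbook via WebhookData.
# Simulate that hand-off here:
request_body = json.dumps(adf_body)

# What the runbook does with it (the PowerShell equivalent is ConvertFrom-Json):
parameters = json.loads(request_body)
print(parameters["InputPath"])
```

The key point is that the body travels as an opaque string, so both sides must agree on the JSON field names.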
Track the Runbook Job
The Web activity in Data Factory launches a runbook job and then immediately returns control to the pipeline. It does not wait for the job to finish.
To track the status of a runbook job, there are several options:
- Use ADF's WebHook activity instead of Web. ADF adds a callBackUri property to the request body, and the pipeline waits until the runbook posts to that URI (or the activity times out).
- Poll the Azure Automation REST API for the job's status, for example from an Until loop in the pipeline.
- Watch the job manually in the Automation account's Jobs list in the portal.
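One option is to poll the job through the Azure Automation REST API, for example from an Until loop in the pipeline. A minimal sketch of the polling pieces follows; the api-version shown is an assumption on my part (check the current one), the helper names are mine, and a real call also needs an Azure AD bearer token:

```python
# Statuses after which an Automation job will not progress on its own.
# (A Suspended job can be resumed, but it needs intervention, so we stop polling.)
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped", "Suspended"}

def job_status_url(subscription: str, resource_group: str, account: str, job_id: str) -> str:
    """URL of the Automation REST API endpoint that reports a job's status."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Automation"
        f"/automationAccounts/{account}"
        f"/jobs/{job_id}?api-version=2019-06-01"
    )

def is_finished(status: str) -> bool:
    """True once the job has reached a state worth acting on."""
    return status in TERMINAL_STATUSES

# To poll for real: GET the URL with an "Authorization: Bearer <token>" header
# (urllib.request works fine) and read properties.status from the JSON response,
# sleeping between attempts until is_finished(status) is true.
```

An Until loop in ADF can apply the same logic with a Web activity for the GET and an expression on the returned status.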