Azure Data Factory Control of a Virtual Machine

Introduction

In my work for a health-data project, we are using Azure Data Factory (ADF) to drive our data flow from raw ingestion to polished, display-ready analysis.

ADF handles many operations with its built-in graphical Activities. For more complex work, you can call a Databricks notebook written in PySpark or SQL. But sometimes even that is not enough, and you need a separate virtual machine to perform an operation. These instructions explain how to drive such a machine from ADF.

The core mechanism used here is Azure Automation. The target virtual machine is registered in an Automation Hybrid Worker group. An Automation runbook, written in PowerShell, executes on that target machine. Data Factory starts the job by calling an HTTP webhook attached to the runbook.

(This document assumes that the virtual machine is a Windows VM in Azure. In theory, this technique should work for any target machine, even outside of Azure in another cloud or on-prem. Also, in theory, the target machine can be Linux. I have not tested either of these variations.)

Create an Automation Runbook and Webhook

Azure Automation is useful in many areas, including DevOps tasks such as creating and managing computing infrastructure. In our case, we take advantage of Automation's ability to run a given command on a given machine.

First, create and test an Automation runbook and webhook as explained in my earlier article, Creating an Azure Automation Runbook and Webhook to a Virtual Machine.
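The earlier article covers the details, but for orientation, a webhook-driven runbook has roughly the shape below. This is a minimal sketch; the runbook name Invoke-VmTask is hypothetical, and $WebhookData is the standard parameter Azure Automation passes to a runbook started from a webhook.

    # Invoke-VmTask.ps1 -- minimal runbook sketch (hypothetical name).
    # Azure Automation passes a single object parameter, conventionally
    # named WebhookData, when the runbook is started from a webhook.
    Param(
        [Parameter(Mandatory = $false)]
        [object] $WebhookData
    )

    if ($WebhookData) {
        # Echo what the caller (for example, Data Factory) sent.
        Write-Output "Request body: $($WebhookData.RequestBody)"
    }

    # The actual work to perform on the target machine goes here.
    Write-Output "Running on $env:COMPUTERNAME"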

Add the Webhook to a Data Factory Pipeline

In a Data Factory pipeline, use the activity named Web (not Webhook), found under the General category. Configure the Settings of the Web activity as follows:

  • The URL is the secret URL that you saved when creating the webhook.
  • The method is POST.
  • Create a plain text header stating that this invocation is from Data Factory, which will help with testing.
  • Add a body as a JSON object (a sample appears after this list).

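For example, the header and body might look like the following. The header name (Source) and the body fields are arbitrary illustrations; your runbook only has to agree with whatever names you choose.

    Header:  Source: DataFactory

    Body:
    {
        "Caller": "DataFactory",
        "Message": "Test invocation from the ADF pipeline"
    }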

Validate, publish and trigger your pipeline.

Go to the Monitor panel of Data Factory. You should see the pipeline run. Examine the output of the Web activity, and you should see the Job ID of the queued Automation job.
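An Azure Automation webhook replies with the ID of the job it queued, so the Web activity output should include something along these lines (the GUID below is a placeholder):

    {
        "JobIds": [ "f3d1a1f0-1111-2222-3333-444444444444" ]
    }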

Go to the Automation portal. Examine this runbook and its list of recent jobs. You should see the job launched from Data Factory. Click into the job and look at the output. You should see confirmation that this job was initiated from Data Factory and executed on the virtual machine.

Share Data Between ADF and Target Machine

There are several ways to pass data between ADF and the target machine:

  • The Body field of an HTTP POST can hold at least 1 MB of data, so the Body can carry the full input data if it is known to ADF and not too unwieldy.
  • The Body of the POST can contain strings pointing to an Azure Storage location, plus a SAS token that the VM can use for authentication. This method has the advantage that the target machine can return output by placing it in another Storage location specified in the Body field (see the sketch after this list).
  • Create a shared drive implemented with Azure Files (docs.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-windows) that is visible to both ADF and the target machine.
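As a rough illustration of the second option, here is a sketch of the runbook side. The field names InputBlobUrl, OutputBlobUrl, and SasToken are hypothetical; they simply have to match whatever the Data Factory pipeline puts in the Body.

    # Sketch of a runbook that reads input from, and writes output to,
    # Blob Storage locations supplied in the webhook body.
    Param([object] $WebhookData)

    # Field names below are hypothetical examples.
    $params = $WebhookData.RequestBody | ConvertFrom-Json

    # Download the input file, using the SAS token for authentication.
    Invoke-WebRequest -Uri "$($params.InputBlobUrl)?$($params.SasToken)" `
        -OutFile "C:\Temp\input.csv"

    # ... process C:\Temp\input.csv and produce C:\Temp\output.csv ...

    # Upload the result to the output location named in the body.
    Invoke-RestMethod -Method Put `
        -Uri "$($params.OutputBlobUrl)?$($params.SasToken)" `
        -InFile "C:\Temp\output.csv" `
        -Headers @{ "x-ms-blob-type" = "BlockBlob" }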

Track the Runbook Job

The Web activity in Data Factory launches the runbook job and then immediately returns control to the pipeline. It does not wait for the job to finish.

To track the status of a runbook job, there are several options:

  • Use the Job ID and the PowerShell command Get-AzAutomationJob (docs.microsoft.com/en-us/powershell/module/az.automation/get-azautomationjob) to get the job's status (a short example follows this list).
  • Use an Azure blob location that receives a file from the runbook when it is finished. Define a separate pipeline in ADF that triggers on the presence of a new file in this location. The content of the file can be success/failure/warnings from the job, and the pipeline can do whatever post-processing you want. Be sure to register the Event Grid resource provider in your Azure subscription, since ADF blob event triggers use this service under the covers.
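For the first option, a quick status check from a PowerShell session (or from another runbook) might look like this. The resource group, Automation account, and Job ID below are placeholders.

    # Look up the Automation job by the Job ID that was returned to ADF.
    $job = Get-AzAutomationJob `
        -ResourceGroupName "my-resource-group" `
        -AutomationAccountName "my-automation-account" `
        -Id "f3d1a1f0-1111-2222-3333-444444444444"

    # Status will be New, Running, Completed, Failed, and so on.
    $job.Status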

