Azure Data Factory Control of a Virtual Machine

Introduction

In my work for a health-data project, we are using Azure Data Factory (ADF) to drive our data flow from raw ingestion to polished, display-ready analysis.

ADF handles many operations with its built-in graphical Activities. For more complex work, you can call a Databricks notebook written in PySpark or SQL. But sometimes even that is not enough, and you need a separate virtual machine to perform an operation. These instructions explain how to drive such a machine from ADF.

The core mechanism used here is Azure Automation. The target virtual machine is registered in an Automation Hybrid Worker group. An Automation runbook, written in PowerShell, executes on that target machine. Data Factory starts the job by calling an HTTP webhook attached to the runbook.

(This document assumes that the virtual machine is a Windows VM in Azure. In theory, this technique should work for any target machine, even outside of Azure in another cloud or on-prem. Also, in theory, the target machine can be Linux. I have not tested either of these variations.)

Create an Automation Runbook and Webhook

Azure Automation is useful in many areas, including DevOps tasks such as creating and managing computing infrastructure. In our case, we take advantage of Automation's ability to run a given command on a given machine.

First, create and test an Automation runbook and webhook as explained in my earlier article, Creating an Azure Automation Runbook and Webhook to a Virtual Machine.
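The earlier article covers the details, but for orientation, a webhook-driven runbook has roughly the shape below. This is a minimal sketch; the runbook name Invoke-VmTask is hypothetical, and $WebhookData is the standard parameter Azure Automation passes to a runbook started from a webhook.

    # Invoke-VmTask.ps1 -- minimal runbook sketch (hypothetical name).
    # Azure Automation passes a single object parameter, conventionally
    # named WebhookData, when the runbook is started from a webhook.
    Param(
        [Parameter(Mandatory = $false)]
        [object] $WebhookData
    )

    if ($WebhookData) {
        # Echo what the caller (for example, Data Factory) sent.
        Write-Output "Request body: $($WebhookData.RequestBody)"
    }

    # The actual work to perform on the target machine goes here.
    Write-Output "Running on $env:COMPUTERNAME"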

Add the Webhook to a Data Factory Pipeline

In a Data Factory pipeline, use the activity named Web (not Webhook), found under the General category. Configure the Settings of the Web activity as follows:

  • The URL is the secret URL that you saved when creating the webhook.
  • The method is POST.
  • Create a plain text header stating that this invocation is from Data Factory, which will help with testing.
  • Add a body as a JSON object (a sample appears after this list).

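For example, the header and body might look like the following. The header name (Source) and the body fields are arbitrary illustrations; your runbook only has to agree with whatever names you choose.

    Header:  Source: DataFactory

    Body:
    {
        "Caller": "DataFactory",
        "Message": "Test invocation from the ADF pipeline"
    }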

Validate, publish and trigger your pipeline.

Go to the Monitor panel of Data Factory. You should see the pipeline run. Examine the output of the Web activity, and you should see the Job ID of the queued Automation job.
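An Azure Automation webhook replies with the ID of the job it queued, so the Web activity output should include something along these lines (the GUID below is a placeholder):

    {
        "JobIds": [ "f3d1a1f0-1111-2222-3333-444444444444" ]
    }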

Go to the Automation portal. Examine this runbook and its list of recent jobs. You should see the job launched from Data Factory. Click into the job and look at the output. You should see confirmation that this job was initiated from Data Factory and executed on the virtual machine.

Share Data Between ADF and Target Machine

There are several ways to pass data between ADF and the target machine:

  • The Body field of an HTTP POST can hold at least 1 MB of data, so the Body can carry the full input data if it is known to ADF and not too unwieldy.
  • The Body of the POST can contain strings pointing to an Azure Storage location, plus a SAS token that the VM can use for authentication. This method has the advantage that the target machine can return output by placing it in another Storage location specified in the Body field (see the sketch after this list).
  • Create a shared drive implemented with Azure Files (docs.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-windows) that is visible to both ADF and the target machine.
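As a rough illustration of the second option, here is a sketch of the runbook side. The field names InputBlobUrl, OutputBlobUrl, and SasToken are hypothetical; they simply have to match whatever the Data Factory pipeline puts in the Body.

    # Sketch of a runbook that reads input from, and writes output to,
    # Blob Storage locations supplied in the webhook body.
    Param([object] $WebhookData)

    # Field names below are hypothetical examples.
    $params = $WebhookData.RequestBody | ConvertFrom-Json

    # Download the input file, using the SAS token for authentication.
    Invoke-WebRequest -Uri "$($params.InputBlobUrl)?$($params.SasToken)" `
        -OutFile "C:\Temp\input.csv"

    # ... process C:\Temp\input.csv and produce C:\Temp\output.csv ...

    # Upload the result to the output location named in the body.
    Invoke-RestMethod -Method Put `
        -Uri "$($params.OutputBlobUrl)?$($params.SasToken)" `
        -InFile "C:\Temp\output.csv" `
        -Headers @{ "x-ms-blob-type" = "BlockBlob" }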

Track the Runbook Job

The Web activity in Data Factory launches the runbook job and then immediately returns control to the pipeline. It does not wait for the job to finish.

To track the status of a runbook job, there are several options:

  • Use the Job ID and the PowerShell command Get-AzAutomationJob (docs.microsoft.com/en-us/powershell/module/az.automation/get-azautomationjob) to get the job's status (a short example follows this list).
  • Use an Azure blob location that receives a file from the runbook when it is finished. Define a separate pipeline in ADF that triggers on the presence of a new file in this location. The content of the file can be success/failure/warnings from the job, and the pipeline can do whatever post-processing you want. Be sure to register the Event Grid resource provider in your Azure subscription, since ADF blob event triggers use this service under the covers.
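For the first option, a quick status check from a PowerShell session (or from another runbook) might look like this. The resource group, Automation account, and Job ID below are placeholders.

    # Look up the Automation job by the Job ID that was returned to ADF.
    $job = Get-AzAutomationJob `
        -ResourceGroupName "my-resource-group" `
        -AutomationAccountName "my-automation-account" `
        -Id "f3d1a1f0-1111-2222-3333-444444444444"

    # Status will be New, Running, Completed, Failed, and so on.
    $job.Status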

