Learning from Struggles in a Python Automation Process
It has been quite some time since my last story. This time, I would like to share the technical problems I ran into while implementing Apache Airflow for automation, and how I got the system up and running in an Ubuntu environment. Thank you for reading my article 😊
The main task in my work is to process data into the repository database inside the IT & Data Management department. To do this, I set up Python scripts to run automatically in a Windows environment: the code was wrapped in a bat file and registered in Task Scheduler to run at a desired time or repeatedly at specific intervals. So far, this has run great and helped with my daily office tasks.
The only problem is that the machine needs administrator permission and offers limited access for monitoring the process. Therefore, in 2023, I decided to migrate the process from Task Scheduler (Windows) to Apache Airflow (Ubuntu). But, as it turns out, the configuration is not as simple as what I found online.
Here are the references I used to implement Apache Airflow:
My problem appeared when the script tried to connect and insert into the SQL Server database. Several Ubuntu libraries must be prepared to support the Airflow requirements. After following several sets of instructions, below are the steps I compiled to run Airflow with the SQL Server database connection enabled:
Please note that this setup works on Ubuntu 18.04 LTS with Python 3.8.15, and assumes Apache Airflow is already installed on the system.
Library Installation
Through trial and error, I found that these libraries enable access to Microsoft SQL Server from Apache Airflow.
# Ubuntu Library
# Add the Microsoft signing key, then register the package repository
curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list | sudo tee /etc/apt/sources.list.d/msprod.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get install -y mssql-tools unixodbc-dev
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql17
sudo apt-get install -y tdsodbc
# Python Library
pip3 install apache-airflow-providers-odbc
pip3 install apache-airflow-providers-microsoft-mssql
pip3 install mysql-connector
pip3 install pymssql
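As a quick sanity check before touching Airflow, a short pyodbc snippet can confirm that ODBC Driver 17 is visible to Python. This is only a minimal sketch; the server, database, and credentials below are placeholders.
# Sanity check: list the installed ODBC drivers and run a trivial query
import pyodbc

print(pyodbc.drivers())  # 'ODBC Driver 17 for SQL Server' should appear in this list

conn = pyodbc.connect("Driver=ODBC Driver 17 for SQL Server;"
                      "Server=<ip_server>;"
                      "Database=<database>;"
                      "UID=<username>;"
                      "PWD=<password>;")
cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")  # trivial query to prove the connection works
print(cursor.fetchone()[0])
cursor.close()
conn.close()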
Apache Airflow Setup
Once the libraries are installed, we must restart the Apache Airflow webserver and upgrade its database (the Python virtual environment should be activated first).
# To activate or restart Apache Airflow Webserver
airflow webserver
# To refresh Apache Airflow database
airflow db upgrade
After the setup is finished, Airflow can now be monitored through the browser, and the view from my monitor is as follows:
It took me several days to figure out the issue with my Apache Airflow setup. One specific reason the SQL Server connection was complicated is my lack of experience setting things up in a Linux environment. But it is a relief that the configuration now runs successfully. Below is a sample of the DAG code I use in Airflow:
from datetime import timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
import pendulum

default_args = {
    'owner': 'isquare',
    'retries': 5,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    default_args=default_args,
    dag_id='isquare_compressor',
    description='Data extraction Compressor from Field',
    start_date=pendulum.datetime(2023, 3, 10, tz='Asia/Jakarta'),
    schedule_interval='0 */2 * * *'
) as dag:

    import os
    import pandas as pd
    import time
    import pyodbc
    from datetime import date

    def compressor():
        # Connect to the SQL Server repository (credentials are placeholders)
        conn = pyodbc.connect("Driver=ODBC Driver 17 for SQL Server;"
                              "Server=<ip_server>;"
                              "Database=<database>;"
                              "UID=<username>;"
                              "PWD=<password>;")
        cursor = conn.cursor()
        try:
            cursor.execute("<query>")  # placeholder for the actual INSERT statement
            conn.commit()
            print("Record inserted successfully into tables")
        except pyodbc.Error as error:
            conn.rollback()
            print("Failed inserting record into tables {}".format(error))
        finally:
            cursor.close()
            conn.close()
            print("SQL Server connection is closed")

    task_comp = PythonOperator(
        task_id='compressor',
        python_callable=compressor)
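Since apache-airflow-providers-microsoft-mssql is installed anyway, the hard-coded pyodbc connection string could also be replaced with a connection stored in Airflow itself. The sketch below assumes a connection with the hypothetical ID mssql_repo has been created in the Airflow UI (Admin > Connections); only the python_callable would change, the scheduling stays the same.
# Alternative sketch: pull SQL Server credentials from an Airflow connection
# named 'mssql_repo' (hypothetical ID) instead of hard-coding them in the DAG
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

def compressor_with_hook():
    hook = MsSqlHook(mssql_conn_id='mssql_repo')
    conn = hook.get_conn()          # connection built from the stored credentials
    cursor = conn.cursor()
    try:
        cursor.execute("<query>")   # placeholder for the INSERT statement
        conn.commit()
        print("Record inserted successfully into tables")
    except Exception as error:
        conn.rollback()
        print("Failed inserting record into tables {}".format(error))
    finally:
        cursor.close()
        conn.close()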
Currently, the migration remains in progress. Each scheduled task will be migrated to the Airflow environment. Hopefully, this implementation will enable centralized data processing and monitoring going forward.