Learn from Struggles in Python Automation Process
Photo by Digital Buggu: https://www.pexels.com/photo/colorful-toothed-wheels-171198/

It has been quite some time since I last had the chance to write a story. This time, I would like to share the technical problems I faced while implementing Apache Airflow for automation, and how I got the system up and running in an Ubuntu environment. Thank you for reading my article 😊


The main task in my work is processing data into the repository database of the IT & Data Management department. To do it, I set up Python code to run automatically in the Windows environment. The approach was to wrap the code in a .bat file and create a Task Scheduler task, either at a desired time or repeating at a specific interval, as sketched below. So far, it has run great and helps with my daily office tasks.
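
For context, this is roughly what the Windows side looks like. The file names, paths, and task name below are assumptions for illustration, not my actual configuration:

@echo off
rem run_etl.bat -- hypothetical wrapper around the Python job (paths are assumptions)
cd /d C:\jobs
python process_data.py >> etl.log 2>&1

rem Register the wrapper in Task Scheduler, repeating every 2 hours (illustrative values)
schtasks /Create /SC HOURLY /MO 2 /TN "DataProcessing" /TR "C:\jobs\run_etl.bat"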

[Image: Current data processing service run using Python code]

The only problem is that the machine requires administrator permission and offers limited access for monitoring the process. Therefore, in 2023, I decided to migrate the process from Task Scheduler (Windows) to Apache Airflow (Ubuntu). But it turned out the configuration was not as simple as the guides I found online suggested.

Here are the references I used to implement Apache Airflow:

  1. Author: Coder2j - Link
  2. Apache Documentation: Link
  3. Author: Zubair Ahmed - Link
  4. Author: Mukesh Kumar - Link

My problem arose when the script ran to connect and insert into the SQL Server database. Several Ubuntu libraries must be installed to satisfy Airflow's requirements. After following several instructions, below are the compiled steps I prepared to run Airflow with the SQL Server database connection enabled.

Please note that this setup works on Ubuntu 18.04 LTS with Python 3.8.15, and assumes Apache Airflow is already installed on the system.

Libraries Installation

Through trial and error, I found that these libraries enable access to Microsoft SQL Server from Apache Airflow.

# Ubuntu Library
# Register the Microsoft package repository and its signing key
curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list | sudo tee /etc/apt/sources.list.d/msprod.list
sudo apt-get update
# ACCEPT_EULA is required by the Microsoft packages
sudo ACCEPT_EULA=Y apt-get install -y mssql-tools unixodbc-dev
sudo ACCEPT_EULA=Y apt-get install -y msodbcsql17
sudo apt-get install -y tdsodbc

# Python Library
pip3 install apache-airflow-providers-odbc
pip3 install apache-airflow-providers-microsoft-mssql
pip3 install mysql-connector
pip3 install pymssql
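
With the driver stack in place, a quick sanity check is to confirm that pyodbc can actually see the Microsoft driver. A minimal sketch, run from the same virtual environment:

import pyodbc

# The list should include 'ODBC Driver 17 for SQL Server' if msodbcsql17 installed correctly
print(pyodbc.drivers())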

Apache Airflow Setup

Once the libraries are installed, we must restart the Apache Airflow webserver and upgrade its metadata database (the Python virtual environment should be activated).

# To start or restart the Apache Airflow webserver
airflow webserver

# To upgrade the Apache Airflow metadata database
airflow db upgrade
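
Note that the webserver alone only serves the UI; for DAGs to actually run on their schedule, the scheduler process also has to be running:

# To trigger DAG runs on schedule, the scheduler must also be active
airflow scheduler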

After the setup is finished, Airflow can now be monitored through the browser; the view from my monitor is as follows:

[Image: Apache Airflow DAGs accessed through the browser]

It took me several days to figure out the issue with my Apache Airflow setup. One reason the SQL Server connection was complicated is my limited experience with setup in the Linux environment. But it is a relief that the configuration now runs successfully. Here is a sample of the code I use in Airflow:

from datetime import timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
import pendulum


default_args = {
    'owner': 'isquare',
    'retries': 5,
    'retry_delay': timedelta(minutes=5)
}


with DAG(
    default_args=default_args,
    dag_id='isquare_compressor',
    description='Data extraction Compressor from Field',
    start_date=pendulum.datetime(2023, 3, 10, tz='Asia/Jakarta'),
    schedule_interval='0 */2 * * *'  # run every two hours
) as dag:
    # Helper imports used by the task (kept inside the DAG file, as in my setup)
    import os
    import pandas as pd
    import time
    import pyodbc
    from datetime import date

    def compressor():
        # Connect to SQL Server through the ODBC driver installed earlier
        conn = pyodbc.connect("Driver=ODBC Driver 17 for SQL Server;"
                              "Server=<ip_server>;"
                              "Database=<database>;"
                              "UID=<username>;"
                              "PWD=<password>;")
        cursor = conn.cursor()

        try:
            cursor.execute(<query>)

            conn.commit()

            print("Record inserted successfully into tables")

        except pyodbc.Error as error:
            conn.rollback()
            print("Failed inserting record into tables {}".format(error))

        finally:
            cursor.close()
            conn.close()
            print("SQL Server connection is closed")

    task_comp = PythonOperator(
        task_id='compressor',
        python_callable=compressor)
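
Before waiting for the two-hour schedule, the task can be exercised directly from the CLI. A minimal smoke test, using the dag_id and task_id from the sample above (airflow tasks test runs a single task instance without recording state in the metadata database):

# Verify the DAG file is parsed and the task runs end to end
airflow dags list
airflow tasks test isquare_compressor compressor 2023-03-10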

Currently, the migration remains in progress; each scheduled job will be migrated to the Airflow environment. Hopefully, this implementation will enable centralized data processing and monitoring going forward.
