Reading and Parsing XML Files in Databricks

Working with XML files and converting them into a structured DataFrame for downstream use is a common requirement when dealing with structured data formats, especially in legacy systems.

Scenario

Imagine you're working with an insurance dataset stored in XML format that contains details such as audit periods, effective dates, vehicle types, and more. Our goal is to extract specific fields from the XML file and create a structured DataFrame for further analysis.
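For illustration, a single record in such a file might look like this (the tag names match the ones we parse below; the root element and all values are hypothetical):

<sessions>
    <session>
        <AuditPeriod>2023-Q1</AuditPeriod>
        <EffectiveDate>2023-01-15</EffectiveDate>
        <ChangeDate>2023-01-10</ChangeDate>
        <QuoteNumber>Q-1001</QuoteNumber>
        <State>TX</State>
        <sValue>Truck</sValue>
        <UserUnderwriterID>UW-042</UserUnderwriterID>
        <Name>Acme Trucking LLC</Name>
    </session>
</sessions>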

Here’s a step-by-step guide on how to handle XML files in Databricks.

Step 1: Import Required Libraries

First, we need to import the necessary libraries for parsing the XML data and working with PySpark.

from bs4 import BeautifulSoup
from pyspark.sql.types import StructType, StructField, StringType        

In this case, we're using BeautifulSoup from the bs4 library to parse the XML data, and the pyspark.sql.types module to define the schema for our DataFrame.
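If these libraries aren't already available on your cluster, you can install them from a notebook cell. Note that BeautifulSoup's "xml" parser relies on lxml under the hood:

%pip install beautifulsoup4 lxml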

Step 2: Read and Parse the XML File

Next, we read the XML file from the specified path and use BeautifulSoup to parse its contents.

xml_file_path = "file_path"

# Read the raw XML content from the file
with open(xml_file_path, 'r') as f:
    xml_content = f.read()

# Parse the XML string into a navigable BeautifulSoup object
soup = BeautifulSoup(xml_content, "xml")

Here, the BeautifulSoup object allows us to work with the XML file as a structured document. This enables us to search for specific elements using methods like find() and find_all().
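For example, with the sample record shown earlier (soup is the parsed BeautifulSoup object):

# find() returns the first matching tag (or None); find_all() returns a list of all matches
first_session = soup.find('session')
all_sessions = soup.find_all('session')
print(first_session.find('State').text)   # "TX" for the sample record above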

Step 3: Extract Data from XML Tags

First, find the tag under which the actual data resides. In my case, each record lives in a session tag.

We loop through the session tags in the XML and extract important fields such as AuditPeriod, EffectiveDate, State, and others. The extracted data is stored in a list of dictionaries, which will be used to create the DataFrame.

sessions = soup.find_all('session')
row_data = []

for session in sessions:
    # Extract each field's text, falling back to None when the tag is missing
    audit_period = session.find('AuditPeriod').text if session.find('AuditPeriod') else None
    effective_date = session.find('EffectiveDate').text if session.find('EffectiveDate') else None
    year = effective_date.split('-')[0] if effective_date else None
    quote_date = session.find('ChangeDate').text if session.find('ChangeDate') else None
    quote = session.find('QuoteNumber').text if session.find('QuoteNumber') else None
    seq = '1'  # constant sequence number, kept as a string to match the schema below
    state = session.find('State').text if session.find('State') else None
    type_of_vehicle = session.find('sValue').text if session.find('sValue') else None
    uw = session.find('UserUnderwriterID').text if session.find('UserUnderwriterID') else None
    insured = session.find('Name').text if session.find('Name') else None

    row = {
        'AuditPeriod': audit_period,
        'Year': year,
        'EffectiveDate': effective_date,
        'QuoteDate': quote_date,
        'Quote': quote,
        'seq': seq,
        'state': state,
        'TypeOfVehicle': type_of_vehicle,
        'UW': uw,
        'Insured': insured
    }
    row_data.append(row)

Here, we use the .find() method to extract the text of each tag and handle the case where a tag might be missing by using a conditional expression (else None). The extracted data is appended to the row_data list.
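As a side note, calling find() twice per field works but is repetitive; a small helper (not part of the original code, just a sketch) can tidy this up:

def get_text(parent, tag_name):
    """Return the text of the first matching child tag, or None if it is absent."""
    element = parent.find(tag_name)
    return element.text if element else None

# Each extraction in the loop then becomes a one-liner, for example:
# audit_period = get_text(session, 'AuditPeriod')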

Step 4: Define the DataFrame Schema

To ensure our data is structured properly in the resulting DataFrame, we define a schema that maps the fields we extracted from the XML to appropriate Spark data types.

schema = StructType([
    StructField("AuditPeriod", StringType(), True),
    StructField("Year", StringType(), True),
    StructField("EffectiveDate", StringType(), True),
    StructField("QuoteDate", StringType(), True),
    StructField("Quote", StringType(), True),
    StructField("seq", StringType(), True),
    StructField("state", StringType(), True),
    StructField("TypeOfVehicle", StringType(), True),
    StructField("UW", StringType(), True),
    StructField("Insured", StringType(), True)
])        

The schema includes all the fields we parsed from the XML, specifying that each is of type StringType() and allowing null values (True).

Step 5: Create and Display the DataFrame

Finally, we convert the list of dictionaries (row_data) into a PySpark DataFrame using the schema defined earlier and display the DataFrame.

df = spark.createDataFrame(row_data, schema)
display(df)        

This will generate a well-structured DataFrame, making it easy to perform further analysis, transformations, or export the data as needed.
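For example, since every column is a string, a typical next step is casting columns to proper types, such as parsing EffectiveDate into a date (a sketch, assuming the date strings follow the yyyy-MM-dd format seen in Step 3):

from pyspark.sql.functions import col, to_date

typed_df = (
    df.withColumn("EffectiveDate", to_date(col("EffectiveDate"), "yyyy-MM-dd"))
      .withColumn("Year", col("Year").cast("int"))
)
typed_df.printSchema()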

Output

[Image: display(df) output showing the resulting DataFrame]

Conclusion

Handling XML files in Databricks can be made simple with the help of BeautifulSoup for parsing and PySpark for structuring the data. By following this approach, you can easily read, process, and transform XML data into a format that is ready for downstream use cases such as analytics, machine learning, or reporting.

This approach also handles nested XML structures, making it a useful tool for data engineers working with legacy systems. One note on scale: the parsing step itself runs on the driver, so very large files are constrained by driver memory.

Note:

I've used BeautifulSoup for XML parsing, but you can also use the xml.etree.ElementTree library for similar tasks. Additionally, I chose to use a PySpark DataFrame, assuming the dataset will be large. If you're working with smaller datasets, you can opt for a pandas DataFrame instead for faster development and simpler workflows.
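For reference, the same extraction with xml.etree.ElementTree might look like this (a minimal sketch; it assumes no XML namespaces are in play):

import xml.etree.ElementTree as ET

tree = ET.parse(xml_file_path)
root = tree.getroot()

row_data = []
for session in root.iter('session'):   # iter() finds session elements at any depth
    row_data.append({
        'AuditPeriod': session.findtext('AuditPeriod'),   # findtext() returns None if the tag is missing
        'EffectiveDate': session.findtext('EffectiveDate'),
        # ...the remaining fields follow the same pattern as in Step 3
    })

And for smaller datasets, the same row_data list can be loaded straight into pandas with pd.DataFrame(row_data).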

Happy Learning!

#Databricks #XMLProcessing #PySpark #BeautifulSoup #DataParsing #ETL #DataTransformation #TechBlog
