StructType in Apache Spark

StructType is one of Apache Spark's complex data types. It stores a sequence of values in a structured way: every value has its own name and type and can be read and written individually.

To create a Column of StructType we use the functions struct() or named_struct().

In the parentheses we list the names of the DataFrame columns from which the structure will be built.

Note that the returned value has type Column.

StructType in Apache Spark - create struct column

In Spark we can't instantiate a value of a data type on its own, outside of an RDD or DataFrame. So to materialize the struct we create a DataFrame and add the struct column to it.

StructType in Apache Spark – create DataFrame

It is also possible to put literals, and expressions built from columns and literals, into the struct.

StructType in Apache Spark – add literals and expressions to struct

Notice that we included the field "f1" in the struct and it retained its name.

If we want to include it in the struct under a different name, we use the function named_struct().

This function is available only from Spark 3.5.

import pyspark.sql.functions as f

# arguments alternate: a field-name literal, then the source column
struct_col = f.named_struct(
    f.lit('f1_renamed'), 'f1',
    f.lit('f2_renamed'), 'f2',
    f.lit('f3_renamed'), 'f3')
StructType in Apache Spark – named_struct

So how do we use it?

If we want to address a part of a struct column we use "dot" notation:

StructType in Apache Spark – address part of struct column

We can also address a field not only in a DataFrame but even in a standalone Column!

The Column type has a method getField().

For example, we can build a new column from a struct without ever adding the struct itself to the DataFrame, and add only the new column:

StructType in Apache Spark – getField() method

To add or modify a single field in a struct column we use the Column method withField():

StructType in Apache Spark – withField() method

You may ask: so what is the purpose of a struct? Is there any difference between separate columns and columns packed in a struct?

Well, here are my five points:

  1. If we join two or more DataFrames with many columns and still want to see each column's source, we can group the columns into structs, one per source DataFrame.
  2. If we receive a DataFrame with many columns from an external data source, we can organize them with structs.
  3. We can put columns with the same name into different structs without the risk of ambiguity when selecting.
  4. We can select and drop struct columns in bulk, without listing every nested column.
  5. Finally, your customer may simply want a structured DataFrame.

Here is the link to the code on GitHub:

https://github.com/SergeySenigov/Apache_Spark_Demos/blob/218a1f4cbcf4b98061e566c47f7324492044523a/StructType.ipynb
