StructType in Apache Spark
StructType is one of the complex data types in Apache Spark. It stores a sequence of values in a structured way: every value has its own name and type and can be read and written individually.
To create a Column of StructType we use the functions struct() or named_struct(). As arguments we pass the list of DataFrame column names from which the structure will be built. Pay attention: the type of the returned value is Column.
In Spark we can’t instantiate a value of a data type without an RDD/DataFrame, so to materialize the struct we create a DataFrame and add the struct column to it.
It is also possible to put literals and expressions built from columns and literals into the struct.
Notice that we have included the field "f1" in the struct and it has retained its name. Should we want to include it in the struct under a different name, we use the function named_struct(). This function is available only from Spark 3.5 on.
struct_col = f.named_struct(
    f.lit('f1_renamed'), 'f1',
    f.lit('f2_renamed'), 'f2',
    f.lit('f3_renamed'), 'f3')
So how to use it?
If we want to address a field inside a struct column we use "dot" notation:
We can also address a field not only in a DataFrame but even on a Column! The Column type has a method getField().
For example, we can build a new column from a struct without ever adding the struct itself to the DataFrame, adding only the new column:
To add or modify a single field in a struct column we use the Column method withField():
You may ask: so what is the purpose of a struct? Is there any difference between keeping separate columns and packing them into a struct?
Well, here are my five points:
Here is the link to the code on GitHub