StructType in Apache Spark

StructType is one of Apache Spark's complex data types. It stores a sequence of values in a structured way: every value has its own name and type and can be read and written individually.

To create a Column of StructType we use the functions struct() or named_struct().

In the parentheses we list the names of the DataFrame columns from which the structure will be built.

Note that the returned value has type Column.

StructType in Apache Spark - create struct column

In Spark we can't instantiate a value of a data type on its own, outside of an RDD or DataFrame. So to materialize the struct we create a DataFrame and add the struct column to it.

StructType in Apache Spark – create DataFrame

It is also possible to put literals, and expressions built from columns and literals, into the struct.

StructType in Apache Spark – add literals and expressions to struct

Notice that we included the field "f1" in the struct and it retained its name.

If we want to include it in the struct under a different name, we use the function named_struct().

This function is available only from Spark 3.5.

import pyspark.sql.functions as f

# arguments alternate: a field-name literal, then the source column
struct_col = f.named_struct(
    f.lit('f1_renamed'), 'f1',
    f.lit('f2_renamed'), 'f2',
    f.lit('f3_renamed'), 'f3')
StructType in Apache Spark – named_struct

So how do we use it?

If we want to address a part of a struct column we use "dot" notation:

StructType in Apache Spark – address part of struct column

We can also address a field not only in a DataFrame but even in a standalone Column!

The Column type has a method getField().

For example, we can build a new column from a struct without ever adding the struct itself to the DataFrame, and add only the new column:

StructType in Apache Spark – getField() method

To add or modify a single field in a struct column we use the Column method withField():

StructType in Apache Spark – withField() method

You may ask: so what is the purpose of a struct? Is there any difference between separate columns and columns packed in a struct?

Well, here are my five points:

  1. If we join two or more DataFrames with many columns and still want to see each column's source, we can group the columns into structs, one per source DataFrame.
  2. If we receive a DataFrame with many columns from an external data source, we can organize them with structs.
  3. We can put columns with the same name into different structs without the risk of ambiguity when selecting.
  4. We can select and drop struct columns in bulk, without listing every nested column.
  5. Finally, your customer may simply want a structured DataFrame.

Here is the link to the code on GitHub:

https://github.com/SergeySenigov/Apache_Spark_Demos/blob/218a1f4cbcf4b98061e566c47f7324492044523a/StructType.ipynb
