How to do simple reporting with Excel sheets using Apache Spark and Scala
Spark Data Can Be Published as an Excel Sheet
Question: Can Spark data be published as an Excel sheet?
Answer: Yes. Spark data can be written out as an Excel sheet using the crealytics/spark-excel plugin. The library builds on Apache POI and provides a Spark data source for reading and writing Excel files with Spark SQL DataFrames.
Example Implementation (Scala with Spark)
Credits for the example: Jacek Laskowski's Mastering Spark SQL
Dependencies: add the following properties and dependency to your Maven pom.xml:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.12.12</scala.version>
    <spark.version>3.0.0</spark.version>
</properties>

<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.12</artifactId>
    <version>0.13.6</version>
</dependency>
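If the build uses sbt instead of Maven, the equivalent coordinates would look like the following (a sketch assuming Spark itself is declared the usual way as a provided dependency; versions mirror the properties above):

```scala
// build.sbt -- sbt equivalent of the Maven snippet above
scalaVersion := "2.12.12"

libraryDependencies ++= Seq(
  // Spark itself, typically marked Provided when submitting to a cluster
  "org.apache.spark" %% "spark-sql"   % "3.0.0" % Provided,
  // The spark-excel data source used throughout this article
  "com.crealytics"   %% "spark-excel" % "0.13.6"
)
```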
package com.examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._

object SparkExcelReport extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]")
    .getOrCreate()

  import spark.implicits._

  // Sample employee data as a multi-line CSV string.
  // Note: csvData is not used by the sales report below; it is kept only to
  // show how inline CSV can be turned into a Dataset[String].
  val csvString =
    """ID,FIRST_NAME,LAST_NAME,DESIGNATION,DEPARTMENT,SALARY
      |1001,Ram,Ghadiyaram,Director of Sales,Sales,30000
      |1002,Ravi,Rangasamy,Marketing Manager,Sales,25000
      |1003,Ramesh,Rangasamy,Assistant Manager,Sales,25000
      |1004,Prem,Sure,Account Coordinator,Account,15000
      |1005,Phani,G,Accountant II,Account,20000
      |1006,Krishna,G,Account Coordinator,Account,15000
      |1007,Rakesh,Krishnamurthy,Assistant Manager,Sales,25000
      |1008,Gally,Johnson,Manager,Account,28000
      |1009,Richard,Grill,Account Coordinator,Account,12000
      |1010,Sofia,Ketty,Sales Coordinator,Sales,20000""".stripMargin

  // Convert the multi-line string to a Dataset[String]
  val csvData: Dataset[String] = spark.sparkContext
    .parallelize(csvString.split("\n").toSeq)
    .toDS()

  val sales = Seq(
    ("Dallas", 2016, 100d),
    ("Dallas", 2017, 120d),
    ("Sanjose", 2017, 200d),
    ("Plano", 2015, 50d),
    ("Plano", 2016, 50d),
    ("Newyork", 2016, 150d),
    ("Toronto", 2017, 50d)
  ).toDF("city", "year", "saleAmount")
  sales.printSchema()

  logger.info(
    """
      |The rollup multi-dimensional aggregate operator is an extension of the groupBy
      |operator that calculates subtotals and a grand total across n + 1 groups of
      |dimensions (the rolled-up columns become null, i.e. undefined, in the subtotal
      |and grand-total rows).
      |rollup is commonly used for analysis over hierarchical data, e.g. total salary
      |by department, by division, and company-wide.
    """.stripMargin)

  logger.info("""Using rollup("city", "year")""")
  val first = sales
    .rollup("city", "year")
    .agg(sum("saleAmount") as "saleAmount")
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  first.show()
  first.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'RollupCityYear'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  logger.info("""Using groupBy("city", "year")""")
  // The rollup above is semantically equivalent to the union of the
  // following three aggregations (second, third and fourth).
  val second = sales
    .groupBy("city", "year") // <-- subtotals (city, year)
    .agg(sum("saleAmount") as "saleAmount")
  second.show()
  second.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCityYearSheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  logger.info("""Using groupBy("city")""")
  val third = sales
    .groupBy("city") // <-- subtotals (city)
    .agg(sum("saleAmount") as "saleAmount")
    .select($"city", lit(null) as "year", $"saleAmount") // <-- year is null
  third.show()
  third.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCitySheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val fourth = sales
    .groupBy() // <-- grand total
    .agg(sum("saleAmount") as "saleAmount")
    .select(lit(null) as "city", lit(null) as "year", $"saleAmount") // <-- city and year are null
  fourth.show()
  fourth.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'groupBySheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val finalDF = second
    .union(third)
    .union(fourth)
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  finalDF.show()
  finalDF.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'unionSheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  spark.stop()
}
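The equivalence the code relies on, that rollup("city", "year") is the union of three grouping sets, can be sketched without Spark using plain Scala collections (a minimal sketch; the tuples mirror the sales Seq above, and the Maps stand in for the DataFrame rows where rolled-up columns would be null):

```scala
// Plain-Scala sketch of what rollup("city", "year") computes: the union of
// three grouping sets -- (city, year), (city), and () -- where the rolled-up
// columns appear as null in the Spark output.
val sales = Seq(
  ("Dallas", 2016, 100.0), ("Dallas", 2017, 120.0),
  ("Sanjose", 2017, 200.0), ("Plano", 2015, 50.0),
  ("Plano", 2016, 50.0), ("Newyork", 2016, 150.0),
  ("Toronto", 2017, 50.0))

// Grouping set (city, year): same rows as groupBy("city", "year")
val byCityYear: Map[(String, Int), Double] =
  sales.groupBy(r => (r._1, r._2)).map { case (k, rows) => k -> rows.map(_._3).sum }

// Grouping set (city): year would be null in the Spark rollup output
val byCity: Map[String, Double] =
  sales.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._3).sum }

// Grouping set (): the single grand-total row where city and year are both null
val grandTotal: Double = sales.map(_._3).sum

println(byCity("Dallas")) // 220.0
println(grandTotal)       // 720.0
```

These are exactly the values that appear in the rollup output below: the Dallas subtotal row (220.0) and the all-null grand-total row (720.0).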
Output:
Schema:
root
|-- city: string (nullable = true)
|-- year: integer (nullable = false)
|-- saleAmount: double (nullable = false)
Rollup ("city", "year"), subtotals plus grand total:
+-------+----+----------+
| city|year|saleAmount|
+-------+----+----------+
|Toronto|2017| 50.0|
|Toronto|null| 50.0|
|Sanjose|2017| 200.0|
|Sanjose|null| 200.0|
| Plano|2015| 50.0|
| Plano|2016| 50.0|
| Plano|null| 100.0|
|Newyork|2016| 150.0|
|Newyork|null| 150.0|
| Dallas|2016| 100.0|
| Dallas|2017| 120.0|
| Dallas|null| 220.0|
| null|null| 720.0|
+-------+----+----------+
groupBy("city", "year") subtotals:
+-------+----+----------+
| city|year|saleAmount|
+-------+----+----------+
|Toronto|2017| 50.0|
|Sanjose|2017| 200.0|
| Dallas|2017| 120.0|
| Plano|2015| 50.0|
|Newyork|2016| 150.0|
| Dallas|2016| 100.0|
| Plano|2016| 50.0|
+-------+----+----------+
groupBy("city") subtotals (year is null):
+-------+----+----------+
| city|year|saleAmount|
+-------+----+----------+
| Dallas|null| 220.0|
| Plano|null| 100.0|
|Newyork|null| 150.0|
|Toronto|null| 50.0|
|Sanjose|null| 200.0|
+-------+----+----------+
Grand total (city and year are null):
+----+----+----------+
|city|year|saleAmount|
+----+----+----------+
|null|null| 720.0|
+----+----+----------+
Union of the three aggregations (matches the rollup):
+-------+----+----------+
| city|year|saleAmount|
+-------+----+----------+
|Toronto|2017| 50.0|
|Toronto|null| 50.0|
|Sanjose|2017| 200.0|
|Sanjose|null| 200.0|
| Plano|2015| 50.0|
| Plano|2016| 50.0|
| Plano|null| 100.0|
|Newyork|2016| 150.0|
|Newyork|null| 150.0|
| Dallas|2016| 100.0|
| Dallas|2017| 120.0|
| Dallas|null| 220.0|
| null|null| 720.0|
+-------+----+----------+
Conclusion: We have generated a multi-sheet Excel report directly from Spark DataFrames. This approach works well for applications that need quick, ready-to-share reports without a separate reporting tool. If you found this article useful, please like and share it.