How to do simple reporting with Excel sheets using Apache Spark and Scala

Question: Can Spark data be published as an Excel sheet?

Answer: Yes. Spark data can be published as an Excel sheet using a simple Spark plugin, crealytics/spark-excel. This library, which is built on Apache POI, allows seamless reading and writing of Excel files from Apache Spark, particularly with Spark SQL and DataFrames.

Example Implementation (Scala with Spark)

Credits for the example: Jacek Laskowski's Mastering Spark SQL

Dependencies: Ensure the following Maven dependency is added to your project:

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.12.12</scala.version>
        <spark.version>3.0.0</spark.version>
    </properties>

    <dependency>
        <groupId>com.crealytics</groupId>
        <artifactId>spark-excel_2.12</artifactId>
        <version>0.13.6</version>
    </dependency>
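For sbt users, the equivalent of the Maven snippet above (same artifact and version) would be a single line in build.sbt:

```scala
// sbt equivalent of the Maven dependency shown above
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.6"
```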

package com.examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._

object SparkExcelReport extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  import spark.implicits._
  val sales = Seq(
    ("Dallas", 2016, 100d),
    ("Dallas", 2017, 120d),
    ("Sanjose", 2017, 200d),
    ("Plano", 2015, 50d),
    ("Plano", 2016, 50d),
    ("Newyork", 2016, 150d),
    ("Toronto", 2017, 50d)

  ).toDF("city", "year", "saleAmount")
  sales.printSchema()

  logger.info(
    """
      |
      | rollup is a multi-dimensional aggregate operator, an extension of the groupBy operator,
      | that calculates subtotals and a grand total across the specified n + 1 grouping levels
      | (n being the number of grouping columns; at the higher levels the grouped
      | column values become null, i.e. undefined).
      | rollup is commonly used for analysis over hierarchical data, e.g. total salary by department, division, and company-wide.
    """.stripMargin)
  logger.info("Using Rollup (\"city\", \"year\")")
  val first = sales
    .rollup("city", "year")
    .agg(sum("saleAmount") as "saleAmount")
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  first.show
  first.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'RollupCityYear'!A1:F35")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("header", "true")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")
  logger.info("Using .groupBy(\"city\", \"year\")")

  // The above query is semantically equivalent to the following
  val second = sales
    .groupBy("city", "year") // <-- subtotals (city, year)
    .agg(sum("saleAmount") as "saleAmount")

  second.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCityYearSheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")
  second.show
  logger.info("Using group by (\"city\")")

  val third = sales
    .groupBy("city") // <-- subtotals (city)
    .agg(sum("saleAmount") as "saleAmount")
    .select($"city", lit(null) as "year", $"saleAmount") // <-- year is null
  third.show
  third.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCitySheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val fourth = sales
    .groupBy() // <-- grand total
    .agg(sum("saleAmount") as "saleAmount")
    .select(lit(null) as "city", lit(null) as "year", $"saleAmount") // <-- city and year are null

  fourth.show

  fourth.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'groupBySheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val finalDF = second
    .union(third)
    .union(fourth)
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  finalDF.show
  finalDF.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'unionSheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

}
        


Output: the schema first, then the rollup result, the three groupBy queries (subtotals by city and year, by city alone, and the grand total), and finally their union:

Schema:

root
 |-- city: string (nullable = true)
 |-- year: integer (nullable = false)
 |-- saleAmount: double (nullable = false)

+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Toronto|null|      50.0|
|Sanjose|2017|     200.0|
|Sanjose|null|     200.0|
|  Plano|2015|      50.0|
|  Plano|2016|      50.0|
|  Plano|null|     100.0|
|Newyork|2016|     150.0|
|Newyork|null|     150.0|
| Dallas|2016|     100.0|
| Dallas|2017|     120.0|
| Dallas|null|     220.0|
|   null|null|     720.0|
+-------+----+----------+
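The 13 rows above are the 7 (city, year) subtotals, the 5 per-city subtotals, and 1 grand total. As a sanity check, that arithmetic can be sketched in plain Scala over the same sales tuples (the object and value names here, such as RollupSketch, are illustrative and not part of the article's code):

```scala
// Plain-Scala sketch of the grouping levels rollup("city", "year") computes:
// subtotals per (city, year), subtotals per city, and one grand total.
object RollupSketch {
  val sales: Seq[(String, Int, Double)] = Seq(
    ("Dallas", 2016, 100.0), ("Dallas", 2017, 120.0),
    ("Sanjose", 2017, 200.0), ("Plano", 2015, 50.0),
    ("Plano", 2016, 50.0), ("Newyork", 2016, 150.0),
    ("Toronto", 2017, 50.0))

  // Level 0: subtotals by (city, year)
  val byCityYear: Map[(String, Int), Double] =
    sales.groupBy(r => (r._1, r._2)).map { case (k, rows) => k -> rows.map(_._3).sum }

  // Level 1: subtotals by city (year becomes null in the rollup output)
  val byCity: Map[String, Double] =
    sales.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._3).sum }

  // Level 2: grand total (city and year are both null in the rollup output)
  val grandTotal: Double = sales.map(_._3).sum

  def main(args: Array[String]): Unit = {
    println(s"Dallas subtotal = ${byCity("Dallas")}")                   // 220.0
    println(s"Grand total     = $grandTotal")                           // 720.0
    println(s"Rollup rows     = ${byCityYear.size + byCity.size + 1}")  // 7 + 5 + 1 = 13
  }
}
```

This mirrors the rollup table: Dallas totals 220.0, the grand total is 720.0, and the row count is 13.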

+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Sanjose|2017|     200.0|
| Dallas|2017|     120.0|
|  Plano|2015|      50.0|
|Newyork|2016|     150.0|
| Dallas|2016|     100.0|
|  Plano|2016|      50.0|
+-------+----+----------+

+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
| Dallas|null|     220.0|
|  Plano|null|     100.0|
|Newyork|null|     150.0|
|Toronto|null|      50.0|
|Sanjose|null|     200.0|
+-------+----+----------+

+----+----+----------+
|city|year|saleAmount|
+----+----+----------+
|null|null|     720.0|
+----+----+----------+

+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Toronto|null|      50.0|
|Sanjose|2017|     200.0|
|Sanjose|null|     200.0|
|  Plano|2015|      50.0|
|  Plano|2016|      50.0|
|  Plano|null|     100.0|
|Newyork|2016|     150.0|
|Newyork|null|     150.0|
| Dallas|2016|     100.0|
| Dallas|2017|     120.0|
| Dallas|null|     220.0|
|   null|null|     720.0|
+-------+----+----------+
        


Conclusion: We have generated our Excel reports. This approach works well for applications that need reports instantly. If you found this article useful, please don't forget to like it.
