How to do simple reporting with Excel sheets using Apache Spark and Scala
Spark Data Can Be Published as an Excel Sheet
Question: Can Spark data be published as an Excel sheet?
Answer: Yes. Spark data can be written out as an Excel sheet using the crealytics/spark-excel plugin. The library builds on Apache POI and provides a Spark data source for reading and writing Excel files with Spark SQL DataFrames.
Example Implementation (Scala with Spark)
Credits for the example: Jacek Laskowski's Mastering Spark SQL
Dependencies: add the following properties and dependency to your Maven pom.xml:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.12.12</scala.version>
    <spark.version>3.0.0</spark.version>
</properties>

<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.12</artifactId>
    <version>0.13.6</version>
</dependency>
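If the build uses sbt instead of Maven, the equivalent coordinates would look like the following (a sketch assuming Spark itself is declared the usual way as a provided dependency; versions mirror the properties above):

```scala
// build.sbt -- sbt equivalent of the Maven snippet above
scalaVersion := "2.12.12"

libraryDependencies ++= Seq(
  // Spark itself, typically marked Provided when submitting to a cluster
  "org.apache.spark" %% "spark-sql"   % "3.0.0" % Provided,
  // The spark-excel data source used throughout this article
  "com.crealytics"   %% "spark-excel" % "0.13.6"
)
```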
package com.examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._

object SparkExcelReport extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]")
    .getOrCreate()

  import spark.implicits._

  // Sample employee data as a multi-line CSV string.
  // Note: csvData is not used by the sales report below; it is kept only to
  // show how inline CSV can be turned into a Dataset[String].
  val csvString =
    """ID,FIRST_NAME,LAST_NAME,DESIGNATION,DEPARTMENT,SALARY
      |1001,Ram,Ghadiyaram,Director of Sales,Sales,30000
      |1002,Ravi,Rangasamy,Marketing Manager,Sales,25000
      |1003,Ramesh,Rangasamy,Assistant Manager,Sales,25000
      |1004,Prem,Sure,Account Coordinator,Account,15000
      |1005,Phani,G,Accountant II,Account,20000
      |1006,Krishna,G,Account Coordinator,Account,15000
      |1007,Rakesh,Krishnamurthy,Assistant Manager,Sales,25000
      |1008,Gally,Johnson,Manager,Account,28000
      |1009,Richard,Grill,Account Coordinator,Account,12000
      |1010,Sofia,Ketty,Sales Coordinator,Sales,20000""".stripMargin

  // Convert the multi-line string to a Dataset[String]
  val csvData: Dataset[String] = spark.sparkContext
    .parallelize(csvString.split("\n").toSeq)
    .toDS()

  val sales = Seq(
    ("Dallas", 2016, 100d),
    ("Dallas", 2017, 120d),
    ("Sanjose", 2017, 200d),
    ("Plano", 2015, 50d),
    ("Plano", 2016, 50d),
    ("Newyork", 2016, 150d),
    ("Toronto", 2017, 50d)
  ).toDF("city", "year", "saleAmount")
  sales.printSchema()

  logger.info(
    """
      |The rollup multi-dimensional aggregate operator is an extension of the groupBy
      |operator that calculates subtotals and a grand total across n + 1 groups of
      |dimensions (the rolled-up columns become null, i.e. undefined, in the subtotal
      |and grand-total rows).
      |rollup is commonly used for analysis over hierarchical data, e.g. total salary
      |by department, by division, and company-wide.
    """.stripMargin)

  logger.info("""Using rollup("city", "year")""")
  val first = sales
    .rollup("city", "year")
    .agg(sum("saleAmount") as "saleAmount")
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  first.show()
  first.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'RollupCityYear'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  logger.info("""Using groupBy("city", "year")""")
  // The rollup above is semantically equivalent to the union of the
  // following three aggregations (second, third and fourth).
  val second = sales
    .groupBy("city", "year") // <-- subtotals (city, year)
    .agg(sum("saleAmount") as "saleAmount")
  second.show()
  second.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCityYearSheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  logger.info("""Using groupBy("city")""")
  val third = sales
    .groupBy("city") // <-- subtotals (city)
    .agg(sum("saleAmount") as "saleAmount")
    .select($"city", lit(null) as "year", $"saleAmount") // <-- year is null
  third.show()
  third.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCitySheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val fourth = sales
    .groupBy() // <-- grand total
    .agg(sum("saleAmount") as "saleAmount")
    .select(lit(null) as "city", lit(null) as "year", $"saleAmount") // <-- city and year are null
  fourth.show()
  fourth.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'groupBySheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val finalDF = second
    .union(third)
    .union(fourth)
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  finalDF.show()
  finalDF.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'unionSheet'!A1:F35")
    .option("header", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  spark.stop()
}
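The equivalence the code relies on, that rollup("city", "year") is the union of three grouping sets, can be sketched without Spark using plain Scala collections (a minimal sketch; the tuples mirror the sales Seq above, and the Maps stand in for the DataFrame rows where rolled-up columns would be null):

```scala
// Plain-Scala sketch of what rollup("city", "year") computes: the union of
// three grouping sets -- (city, year), (city), and () -- where the rolled-up
// columns appear as null in the Spark output.
val sales = Seq(
  ("Dallas", 2016, 100.0), ("Dallas", 2017, 120.0),
  ("Sanjose", 2017, 200.0), ("Plano", 2015, 50.0),
  ("Plano", 2016, 50.0), ("Newyork", 2016, 150.0),
  ("Toronto", 2017, 50.0))

// Grouping set (city, year): same rows as groupBy("city", "year")
val byCityYear: Map[(String, Int), Double] =
  sales.groupBy(r => (r._1, r._2)).map { case (k, rows) => k -> rows.map(_._3).sum }

// Grouping set (city): year would be null in the Spark rollup output
val byCity: Map[String, Double] =
  sales.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._3).sum }

// Grouping set (): the single grand-total row where city and year are both null
val grandTotal: Double = sales.map(_._3).sum

println(byCity("Dallas")) // 220.0
println(grandTotal)       // 720.0
```

These are exactly the values that appear in the rollup output below: the Dallas subtotal row (220.0) and the all-null grand-total row (720.0).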
Output:
Schema:
root
|-- city: string (nullable = true)
|-- year: integer (nullable = false)
|-- saleAmount: double (nullable = false)
Rollup ("city", "year"), subtotals plus grand total:
+-------+----+----------+
| city|year|saleAmount|
+-------+----+----------+
|Toronto|2017| 50.0|
|Toronto|null| 50.0|
|Sanjose|2017| 200.0|
|Sanjose|null| 200.0|
| Plano|2015| 50.0|
| Plano|2016| 50.0|
| Plano|null| 100.0|
|Newyork|2016| 150.0|
|Newyork|null| 150.0|
| Dallas|2016| 100.0|
| Dallas|2017| 120.0|
| Dallas|null| 220.0|
| null|null| 720.0|
+-------+----+----------+
groupBy("city", "year") subtotals:
+-------+----+----------+
| city|year|saleAmount|
+-------+----+----------+
|Toronto|2017| 50.0|
|Sanjose|2017| 200.0|
| Dallas|2017| 120.0|
| Plano|2015| 50.0|
|Newyork|2016| 150.0|
| Dallas|2016| 100.0|
| Plano|2016| 50.0|
+-------+----+----------+
groupBy("city") subtotals (year is null):
+-------+----+----------+
| city|year|saleAmount|
+-------+----+----------+
| Dallas|null| 220.0|
| Plano|null| 100.0|
|Newyork|null| 150.0|
|Toronto|null| 50.0|
|Sanjose|null| 200.0|
+-------+----+----------+
Grand total (city and year are null):
+----+----+----------+
|city|year|saleAmount|
+----+----+----------+
|null|null| 720.0|
+----+----+----------+
Union of the three aggregations (matches the rollup):
+-------+----+----------+
| city|year|saleAmount|
+-------+----+----------+
|Toronto|2017| 50.0|
|Toronto|null| 50.0|
|Sanjose|2017| 200.0|
|Sanjose|null| 200.0|
| Plano|2015| 50.0|
| Plano|2016| 50.0|
| Plano|null| 100.0|
|Newyork|2016| 150.0|
|Newyork|null| 150.0|
| Dallas|2016| 100.0|
| Dallas|2017| 120.0|
| Dallas|null| 220.0|
| null|null| 720.0|
+-------+----+----------+
Conclusion: We have generated a multi-sheet Excel report directly from Spark DataFrames. This approach works well for applications that need quick, ready-to-share reports without a separate reporting tool. If you found this article useful, please like and share it.