Getting started with Apache Spark


This article will get you started with downloading and installing the Apache Spark interactive shell on Windows 10. The interactive shell is a useful tool for quickly writing and testing code while developing Spark applications, and most developers use it during development because it speeds up coding and testing. Before installing and working with Spark, here is another article that describes how Spark processes data. Now let's get started.

First, make sure Java is installed on your computer. If Java is not installed, go to the Java download page, click the Windows online or offline link, and run the installer. After the installation is complete, a confirmation dialog box will appear.

To verify that Java is installed on your computer, open a command prompt and execute the following command. It will show the version of Java installed on your computer:

java -version
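The exact output depends on the Java release you installed; it will look roughly like the lines below (the version numbers shown here are only an illustration and will differ on your machine).

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)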

Next, download and install Scala. Apache Spark is built on Scala, so installing it is mandatory. Go to this link to download and install Scala.
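Once the Scala installer finishes, you can verify the installation the same way you verified Java: open a command prompt and run the following command, which prints the Scala version you installed.

scala -version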

Then click the SBT link to download and install it. SBT is a command-line build tool used to compile Scala and Spark applications.
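SBT is not needed for the interactive shell examples in this article, but when you later package a standalone Spark application, a minimal build.sbt looks roughly like the sketch below. The project name and version are placeholders; the Spark dependency should match the Spark release you download.

// build.sbt - minimal sketch for a standalone Spark 1.6.0 application
name := "spark-getting-started"

version := "0.1"

// Spark 1.6.0 is built against Scala 2.10 by default
scalaVersion := "2.10.6"

// "provided" because the Spark jars are already available at runtime on the cluster
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"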

Next, go to this link and download Apache Spark.

For this tutorial, select the 1.6.0 release pre-built for Hadoop 2.6. Click Download Spark and a .tgz file will be downloaded to your computer. To extract this compressed file, download and install 7-Zip from this link, then extract the contents of the downloaded archive to a folder. You are now all set to run the Apache Spark interactive shell. As mentioned earlier, most developers use this shell to develop and test their Spark code before compiling it into an application that runs on a cluster. To run the shell, open a command prompt, go to the bin folder under the directory where you extracted Apache Spark, and execute the following command:

C:\Software\spark-1.6.0\bin\spark-shell

This will launch the Apache Spark shell in interactive mode. spark-shell is quite verbose, so you will see many log messages scroll by while it starts up.

When you see the scala> prompt, you are ready to start issuing commands, which will be executed as you enter them. Let's start with the classic example of counting words in a text file, which is a sort of Hello World for Apache Spark.

Create a text file called fil.txt, paste the following text into it, and save the file:

this is first line
this is second line
this is third line

Next, issue the following command in the interactive shell, replacing the path and file name with the location where you saved fil.txt:

val fil = sc.textFile("C:/fil.txt")

Note that sc is the SparkContext object; we will cover this context in a later tutorial and a link will be added here. In short, the command above creates an RDD, or Resilient Distributed Dataset, from the text file. An RDD is Spark's in-memory collection of objects; if the file is large and the application runs on a cluster of computers, the RDD is partitioned and distributed across the nodes. Also note that this command is a transformation, which does not trigger Spark to execute anything. Spark lazily evaluates all transformations until an action is performed.
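To see this laziness in action, you can define another transformation on the fil RDD and notice that nothing happens until you call an action on it. This is just a small illustration using standard RDD operations:

// A transformation: Spark records it but does not read the file yet
val upper = fil.map(line => line.toUpperCase)

// An action: only now does Spark read the file and apply the map
println(upper.count())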

Next, issue the following command in the Spark shell:

val counts = fil.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
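If the trailing (_ + _) looks cryptic, the same command can also be written with the reducing function spelled out as an explicit anonymous function; both forms are equivalent:

val counts = fil.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)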

Several operations are chained on the file RDD in this command. First each line is split into words and flatMap flattens the results into a single RDD of words. Next, map creates (word, 1) tuples with the word as the key and 1 as the value. Then reduceByKey brings all values for the same key onto the same partition and aggregates them. Notice the use of (_ + _): this is Scala shorthand for a function that adds up the 1s for each key. Check out a Scala tutorial to learn more about the language syntax. This command is still a transformation, so Spark will not execute it immediately. Next, issue an action as follows:

counts.foreach(println)

This is an action, so Spark now builds an execution plan, executes it, and prints the results to the screen.
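foreach(println) is only one of several actions you could finish with. For example, collect() brings all (word, count) pairs back to the driver as a local array, and saveAsTextFile writes them out to a directory; the output path below is just an example, so pick any folder that does not already exist:

// Bring all (word, count) pairs back to the driver as a local array
val result = counts.collect()
result.foreach(println)

// Or save the pairs to a directory of text files (the directory must not already exist)
counts.saveAsTextFile("C:/spark-output/wordcounts")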

If you have made it to this point, you have successfully installed Apache Spark and executed some code. Here are a few more examples.

val cnt = fil.filter(line => line.contains("this")).count()
println(cnt)

The code above runs on the file RDD and counts the lines that contain the word "this". The println command prints the count.
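Note that filter works line by line, so a line containing "this" twice is still counted only once. To count how many times the word itself appears, you could split the lines into words first, for example:

val wordCnt = fil.flatMap(line => line.split(" ")).filter(word => word == "this").count()
println(wordCnt)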

val linelengths = fil.map(l => l.length)
val totallength = linelengths.reduce( (c1, c2) => c1 + c2 )
println(totallength)

This code prints the total number of characters in the file. First, map produces an RDD containing the length of each line; then reduce sums those lengths to get the total.
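The same map-and-reduce pattern works for other aggregates over the line lengths, for example the length of the longest line, again using the linelengths RDD defined above:

// reduce with math.max keeps the larger of each pair of line lengths
val longest = linelengths.reduce((c1, c2) => math.max(c1, c2))
println(longest)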




