A convenient way to use PySpark with Jupyter on a local computer
Data analysts, engineers, and scientists – you might want to set up a local environment to brush up your Apache Spark skills with Python (PySpark). This short and easy article will help you do it.
1) Installing Docker Desktop
Download the Docker Desktop installer here: https://www.docker.com/products/docker-desktop
Open the installer file and follow the instructions of the setup wizard to complete the Docker Desktop installation.
2) Pulling A PySpark-Notebook Docker Image
[Assuming you have already successfully installed Docker Desktop]
You will need to pull a Docker image that is built and published to Docker Hub by the Jupyter project. This step is done in a command-line terminal, so open one and run the following Docker pull command:
docker pull jupyter/pyspark-notebook
Note that this image already has Python 3, Apache Spark, Jupyter Notebook, and a number of other Python libraries packed in. You can find it on Docker Hub here: https://hub.docker.com/r/jupyter/pyspark-notebook
3) Running PySpark-Notebook Image in Docker Desktop
After the Docker pull command above completes successfully, open Docker Desktop; the image will be available as follows:
Hit RUN and fill in the pop-up form (an optional container name and the port mapping) as follows:
Click RUN: running the PySpark-Notebook image initializes a Docker container in Docker Desktop. [This running container is the PySpark environment in which you will run your PySpark code later.]
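If you prefer the terminal to the Docker Desktop form, the same container can be started with a `docker run` command. This is a sketch: port 8888 is Jupyter's default, and `/home/jovyan/work` is the working directory used by the Jupyter Docker images; the container name is arbitrary.

```shell
# Start a container from the pulled image, publishing Jupyter's
# default port 8888 and mounting the current directory so your
# notebooks survive container restarts.
docker run -it --rm \
  -p 8888:8888 \
  -v "$PWD":/home/jovyan/work \
  --name pyspark-notebook \
  jupyter/pyspark-notebook
```

The `--rm` flag removes the container when you stop it; drop it if you want the container to stay listed in Docker Desktop between sessions.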
4) Using the Favorite Jupyter Notebook
[Assuming the PySpark-Notebook container is already running in Docker Desktop]
At this point, you only need to grab the URL shown in the container's LOGS window in Docker Desktop to open Jupyter Notebook in a web browser.
Copy the highlighted URL (it includes an access token) from LOGS and paste it into your browser's address bar to start Jupyter Notebook as follows:
That’s it. The PySpark development environment is up and running. Enjoy your hands-on PySpark practice!