A convenient way to use PySpark with Jupyter on a local computer
Data analysts, engineers, and scientists – you might want to set up a local environment to brush up your Apache Spark skills with Python (PySpark). This short and easy article will help you do it.
1) Installing Docker Desktop
Download the Docker Desktop installer here: https://www.docker.com/products/docker-desktop
Open the installer file and follow the instructions of the setup wizard to complete the Docker Desktop installation.
2) Pulling A PySpark-Notebook Docker Image
[Assuming you have already successfully installed Docker Desktop]
You will need to pull a Docker image that is built and published to Docker Hub by the Jupyter project. This step is done in a command-line terminal, so open one and run the following Docker pull command:
docker pull jupyter/pyspark-notebook
Note that this image already has Python 3, Apache Spark, Jupyter Notebook, and a number of other Python libraries packed in. You can find it on Docker Hub here: https://hub.docker.com/r/jupyter/pyspark-notebook
3) Running PySpark-Notebook Image in Docker Desktop
After the Docker pull command above completes successfully, open Docker Desktop; the image will be available as follows:
Hit RUN and fill in the pop-up form (an optional container name and the port mapping) as follows:
Click RUN: running the PySpark-Notebook image initializes a Docker container in Docker Desktop. [This running container is the PySpark environment in which you will run your PySpark code later.]
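If you prefer the terminal to the Docker Desktop form, the same container can be started with a `docker run` command. This is a sketch: port 8888 is Jupyter's default, and `/home/jovyan/work` is the working directory used by the Jupyter Docker images; the container name is arbitrary.

```shell
# Start a container from the pulled image, publishing Jupyter's
# default port 8888 and mounting the current directory so your
# notebooks survive container restarts.
docker run -it --rm \
  -p 8888:8888 \
  -v "$PWD":/home/jovyan/work \
  --name pyspark-notebook \
  jupyter/pyspark-notebook
```

The `--rm` flag removes the container when you stop it; drop it if you want the container to stay listed in Docker Desktop between sessions.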
4) Using the Favorite Jupyter Notebook
[Assuming the PySpark-Notebook container is already running in Docker Desktop]
At this point, you only need to grab the URL shown in the container's LOGS window in Docker Desktop to open Jupyter Notebook in a web browser.
Copy the highlighted URL (it includes an access token) from LOGS and paste it into your browser's address bar to start Jupyter Notebook as follows:
That’s it. The PySpark development environment is up and running. Enjoy your hands-on PySpark practice!