Building a Custom Devcontainer For Local Databricks Development
After developing several Databricks applications on AWS, I have come to appreciate the advantages of simulating AWS interactions during the backend development of PySpark applications. Simulating AWS locally not only eases local testing but also accelerates iterative development. It also allows the resulting tests to be integrated seamlessly into automated pipelines, such as Azure DevOps build tests.
For local AWS simulation, we can set up a local Delta Lake and emulate an AWS server. Given the variety of Databricks runtime environments for different Python versions, being able to switch smoothly between these versions is vital. Although it is possible to set up this development environment directly on the host, using a devcontainer in Visual Studio Code (VSCode) provides a more structured project management solution. This approach also simplifies sharing setup configurations within a team or business. (Note: this will always be more resource-intensive than installing locally, so be prepared with a beefy computer and only use these setups for small-scale testing.)
VSCode enables users to connect directly to a Docker container and utilize a fully-fledged development environment within that container, known as a devcontainer.
Use Case
One common use case for such a container is the following pattern.
We have a landing bucket where a dropped file triggers an Airflow orchestration process. This process orchestrates an integration process (usually handled by another team) and a Databricks ingestion process. The ingestion process must then post-process the files from the staging bucket and load them into an external Databricks Delta table, which is normally stored on a bronze bucket.
With our devcontainer we can essentially mock the Databricks ingestion process locally, allowing us to:
- develop and debug the post-processing logic without touching real AWS resources;
- write repeatable tests against a local Delta Lake; and
- run those tests automatically in build pipelines such as Azure DevOps, as sketched below.
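As a rough sketch of what such a locally mocked test can look like (the bucket name, file key, and the commented-out ingestion entry point are illustrative assumptions, not part of the actual pipeline):
import boto3
from moto import mock_aws

# minimal sketch, assuming moto >= 5 (which provides `mock_aws`) and boto3
@mock_aws
def test_ingest_from_staging():
    # emulate the staging bucket entirely in memory
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="staging-bucket")
    s3.put_object(
        Bucket="staging-bucket",
        Key="incoming/data.csv",
        Body=b"id,value\n1,foo\n",
    )
    # here we would call the ingestion entry point under test, e.g.
    # ingest_staged_file("s3://staging-bucket/incoming/data.csv")
    assert s3.list_objects_v2(Bucket="staging-bucket")["KeyCount"] == 1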
How To Set Up A Container
step-1: Install Visual Studio Code and familiarize yourself with workspaces.
step-2: Install the Dev Containers extension. This extension facilitates the use of Docker images as development environments.
step-3: In a working directory, create a .devcontainer folder containing a Dockerfile and a devcontainer.json.
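The resulting layout looks like this:
.devcontainer/
├── Dockerfile
└── devcontainer.json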
We will create our own custom Dockerfile for flexibility.
step-4: We will build our Docker container with pyenv, several pre-installed Python versions (useful when working with different Python-based components of an application), PySpark, Delta Lake and Poetry.
We can optimize the Dockerfile later when we finally deploy our application.
In the Dockerfile:
# let's use ubuntu 23.04 as our base image
FROM ubuntu:23.04
# unminimize the image https://wiki.ubuntu.com/Minimal
RUN apt-get update && yes | unminimize
# install the required packages and remove caches
RUN apt-get update \
&& apt-get install -y \
git \
wget \
curl \
sudo \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
llvm \
libncurses5-dev \
libncursesw5-dev \
default-jdk \
default-jre \
scala \
man-db \
shellcheck \
xz-utils \
tk-dev \
libffi-dev \
liblzma-dev \
libsasl2-dev \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
# set up the environment variables required to install pyenv and pyspark
ENV PYENV_ROOT /root/.pyenv
ENV PATH $PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
ENV SPARK_HOME=/mnt/spark
ENV HOME /root
# install pyenv and several Python versions, each with poetry
RUN curl https://pyenv.run | bash \
&& pyenv install 3.7 && pyenv global 3.7 && pip install poetry && pyenv rehash \
&& pyenv install 3.8 && pyenv global 3.8 && pip install poetry && pyenv rehash \
&& pyenv install 3.9 && pyenv global 3.9 && pip install poetry && pyenv rehash \
&& pyenv install 3.10 && pyenv global 3.10 && pip install poetry && pyenv rehash \
&& pyenv install 3.11 && pyenv global 3.11 && pip install poetry && pyenv rehash
# This is to fix potential compilation issues when installing and compiling
# packages on python 3.11
RUN cp $PYENV_ROOT/versions/3.8.*/include/python3.8/longintrepr.h \
$PYENV_ROOT/versions/3.11.*/include/python3.11
# install nvm to manage node packages
RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.4/install.sh | bash
# install the bash language server, makes development easier
RUN . ~/.nvm/nvm.sh \
&& nvm install node \
&& nvm install-latest-npm \
&& npm i -g bash-language-server
# install spark
RUN wget https://dlcdn.apache.org/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz \
&& tar -xvzf spark-3.4.2-bin-hadoop3.tgz \
&& mv spark-3.4.2-bin-hadoop3 /mnt/spark \
&& rm spark-3.4.2-bin-hadoop3.tgz
# Installing starship makes things in the terminal pretty ;)
RUN curl -sS https://starship.rs/install.sh | sh -s -- --yes && \
echo 'eval "$(starship init bash)"' >> ~/.bashrc
# Set the default python version
RUN pyenv global 3.8
WORKDIR /root/
# quick and easy (if hacky) way to download and install the packages required for Delta Lake
RUN echo "import sys" > dummy.py \
&& ${SPARK_HOME}/bin/spark-submit \
--packages org.apache.hadoop:hadoop-aws:3.3.2,io.delta:delta-core_2.12:2.4.0 \
dummy.py \
&& rm dummy.py
# when checking shell scripts we wish to ignore https://www.shellcheck.net/wiki/SC1017
RUN echo "disable=SC1017" > /root/.shellcheckrc
# If we want to simulate more complex interactions we will need to install the
# docker engine on the docker image and enable docker in docker.
RUN sudo apt-get update \
&& sudo apt-get install -y ca-certificates curl gnupg \
&& sudo install -m 0755 -d /etc/apt/keyrings \
&& curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg \
&& sudo chmod a+r /etc/apt/keyrings/docker.gpg \
&& echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null \
&& sudo apt-get update \
&& sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
step-5: Now let's configure our devcontainer.json. We will add some useful common extensions for formatting, linting, and boto3 type support (note that we can also point directly to a prebuilt image instead of building from a Dockerfile).
{
  "name": "Existing Dockerfile",
  // allow docker in docker
  "runArgs": [
    "-v",
    "//var/run/docker.sock:/var/run/docker.sock"
  ],
  // use the workspace folder as the build context
  // and target the Dockerfile above
  "build": {
    "context": "..",
    // alternatively we can point to a prebuilt image
    // via the top-level "image" property
    "dockerfile": "Dockerfile"
  },
  // add some extensions
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance",
        "ms-python.black-formatter",
        "charliermarsh.ruff",
        "Boto3typed.boto3-ide",
        "ms-azure-devops.azure-pipelines",
        "janjoerke.align-by-regex",
        "samuelcolvin.jinjahtml",
        "mikestead.dotenv",
        "foxundermoon.shell-format",
        "mads-hartmann.bash-ide-vscode",
        "rogalmic.bash-debug",
        "shardulm94.trailing-spaces",
        "ms-azuretools.vscode-docker"
      ],
      // some default formatting
      "settings": {
        "[python]": {
          "editor.defaultFormatter": "ms-python.black-formatter",
          "editor.formatOnSave": true,
          // the ruff actions belong under editor.codeActionsOnSave
          "editor.codeActionsOnSave": {
            "source.fixAll.ruff": false,
            "source.organizeImports.ruff": true
          }
        },
        "[shellscript]": {
          "editor.defaultFormatter": "foxundermoon.shell-format",
          "editor.formatOnSave": true
        }
      }
    }
  }
}
step-6: Now we can open the Command Palette and select Dev Containers: Rebuild and Reopen in Container.
And now we can develop our application with our configured workspace.
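For instance, once inside the container we can sanity-check the cached Spark and Delta packages with a small PySpark session (a quick sketch; the app name and output path are arbitrary):
from pyspark.sql import SparkSession

# minimal local Delta-enabled session; the jars below were already cached
# by the dummy spark-submit run in the Dockerfile
spark = (
    SparkSession.builder.appName("local-delta-check")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.2,io.delta:delta-core_2.12:2.4.0",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# write and read back a tiny local Delta table
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/demo_delta")
spark.read.format("delta").load("/tmp/demo_delta").show()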
Combining our devcontainer with moto, we'll be able to model the components and basic interactions of our AWS-based Databricks application.
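For example, moto can also run as a standalone server, and Spark's S3A client can be pointed at it so that PySpark jobs read and write s3a:// paths without ever touching AWS. The endpoint, port, bucket name, and dummy credentials below are assumptions for illustration (this presumes moto[server] is installed and running, e.g. via moto_server -p 5000):
import boto3
from pyspark.sql import SparkSession

ENDPOINT = "http://localhost:5000"  # the locally running moto server

# create a bucket against the emulated S3 endpoint
s3 = boto3.client(
    "s3",
    endpoint_url=ENDPOINT,
    aws_access_key_id="testing",  # moto accepts any credentials
    aws_secret_access_key="testing",
    region_name="us-east-1",
)
s3.create_bucket(Bucket="bronze-bucket")

# point Spark's S3A client at the same endpoint
spark = (
    SparkSession.builder.appName("moto-s3a-demo")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.2")
    .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "testing")
    .config("spark.hadoop.fs.s3a.secret.key", "testing")
    .getOrCreate()
)

# the ingestion job can now write output to the mocked bronze bucket
spark.range(3).write.mode("overwrite").parquet("s3a://bronze-bucket/demo")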