From Jupyter to Prototypes & Small Apps
As we non-programmers get into Data Science, Jupyter notebooks almost always seem to be the first platform to analyse data. However, notebooks become quite limited when use cases get more complex:
- Sharing with others: Notebooks can only be shared with other Jupyter users, who have to set up an environment beforehand. This is often a barrier for non-technical users who prefer a simple UI. You also want reproducibility, so that others get the same results from your code.
- The complexity of your project: As your models and tools grow, you often have to organize your code in different files or use object-oriented or functional programming. Jupyter does not help you manage this complexity.
- Scheduling jobs: Often you want to schedule jobs so that they run automatically on cloud infrastructure, rather than clicking through the cells of a notebook manually.
- Computational power: Notebooks consume resources of their own, which can slow down your application.
The good news: there are tools that can solve these challenges. I will share what I have identified as good tools for building prototypes and small apps and give you an overview of how they work. This article does not mean you should replace your Jupyter notebook; it is still great for exploration and experimentation. Rather, it should show you what else is out there.
This article will introduce the following concepts and tools:
- Shell
- Object-oriented programming
- PyCharm
- GitHub
- Cloud Computing
- Docker/Kubernetes
- Streamlit
Prerequisites
Shell
What is this? It is a command language for a command-line interface. Many of you may remember the DOS shell (command.com) or have heard of bash, which are famous implementations of such languages. They are used for managing the file system and processes and for monitoring the system. For Windows users, this might be a bit unfamiliar because you usually work in a graphical environment.
Why do I need it?
- It is commonly used on Linux-based cloud infrastructure
- It is helpful for all the tools I will introduce below
- The shell is often easier and faster to use than graphical interfaces (once you have learned it)
- You can schedule or script things very easily (imagine how painful Excel macros are to record)
How can I learn it? It is quite easy to learn, and there are plenty of resources. I can highly recommend these two tutorials: Earth Data Science and Introduction to Shell from DataCamp (paid).
Object-Oriented Programming
What is this? It is not a tool, but rather a different programming paradigm. While in a notebook you often just define one step after another, with OOP you define classes, which are blueprints for objects that bundle data with the actions that can be performed on it. Imagine it like cooking from a recipe:
Jupyter Notebook
Stand up, go five steps forward, move your left arm by 45 degrees, 30 cm forward and grab (now you have a knife in your hand), ...
OOP
- Class Yourself (methods: moveto(location), grab(item))
- Class Kitchen (location = x, y)
- Class Knife (location = x, y; can be grabbed with arm = true)
- yourself.moveto(kitchen), then yourself.grab(knife)
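The recipe analogy above can be sketched in Python roughly as follows. This is only an illustration: the class and method names (Person standing in for "yourself", Kitchen, Knife, move_to, grab) and the coordinates are made up and do not come from any library.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Location = Tuple[int, int]


@dataclass
class Kitchen:
    """A place with a fixed position (hypothetical coordinates)."""
    location: Location


@dataclass
class Knife:
    """An item that lies somewhere and may be picked up."""
    location: Location
    can_be_grabbed: bool = True


@dataclass
class Person:
    """Bundles state (where am I, what do I hold) with behaviour."""
    location: Location = (0, 0)
    holding: Optional[Knife] = None

    def move_to(self, place: Kitchen) -> None:
        # The caller no longer counts steps; the object handles the details.
        self.location = place.location

    def grab(self, item: Knife) -> bool:
        # Grabbing only works if the item is grabbable and within reach.
        if item.can_be_grabbed and item.location == self.location:
            self.holding = item
            return True
        return False


kitchen = Kitchen(location=(5, 0))
knife = Knife(location=(5, 0))
me = Person()

me.move_to(kitchen)
grabbed = me.grab(knife)
print(grabbed, me.location)  # → True (5, 0)
```

The step-by-step instructions are still there, but they are hidden inside the methods, so the last lines read almost like the recipe itself.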
Why do I need it?
It will make things easier in the long term by structuring your code better. In Python, most of the important libraries use OOP. Take sklearn, for example: it provides a class for each model (e.g. LinearRegression). Imagine this being written in a step-wise way: it would not only be hard to follow, but there would also be plenty of opportunity to confuse variables and steps.
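To make the sklearn comparison a bit more concrete, here is a minimal sketch of the same fit/predict pattern in plain Python. TinyLinearRegression is a hypothetical toy class written for this post, not sklearn's implementation; it only handles a single input feature, and the data is invented.

```python
from statistics import mean
from typing import List, Sequence


class TinyLinearRegression:
    """Toy one-feature least-squares model (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # Learned parameters live on the object instead of in loose variables.
        self.slope = 0.0
        self.intercept = 0.0

    def fit(self, x: Sequence[float], y: Sequence[float]) -> "TinyLinearRegression":
        x_mean, y_mean = mean(x), mean(y)
        covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        variance = sum((xi - x_mean) ** 2 for xi in x)
        self.slope = covariance / variance
        self.intercept = y_mean - self.slope * x_mean
        return self

    def predict(self, x: Sequence[float]) -> List[float]:
        return [self.slope * xi + self.intercept for xi in x]


# Two models can coexist without their parameters getting mixed up.
model_a = TinyLinearRegression().fit([1, 2, 3], [2, 4, 6])
model_b = TinyLinearRegression().fit([1, 2, 3], [3, 2, 1])

print(model_a.predict([4]))  # → [8.0]
print(model_b.predict([4]))  # → [0.0]
```

In a step-wise notebook you would need separately named slope and intercept variables for every model; here each object carries its own.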
How can I learn it? I found the concept quite easy to understand on a high level, but quite difficult to apply. Here you can find how to use classes in Python.
Programming
PyCharm
What is this? PyCharm is an IDE (integrated development environment), i.e. a program to write code in. Jupyter can be seen as an IDE as well, but PyCharm has gained quite a lot of popularity. It differs from Jupyter in that it focuses on developing applications rather than on experimentation.
Why do I need it?
- If you have a complex project structure with dependencies or multiple files, it becomes difficult to manage in Jupyter
- PyCharm integrates directly with Git and GitHub
- It has an integrated terminal and Python shell
- You can create all kinds of files in one environment (HTML, CSS, ...)
Where can I learn it? You can just install it and play around. Furthermore, they have a nice video tutorial.
GitHub
What is this? Software is usually developed iteratively. As you go along, you might want to keep track of all the versions and maybe even revert to an earlier one. Git is a system for managing versions, while GitHub is a platform that hosts your versions.
Why do I need it?
- You want to share code with others
- You want to track versions and revert changes
- You want to work collaboratively on products and need a way to organize various people’s input
Where can I learn it? The official guide is quite good for a first overview. You can also easily create a GitHub account and try it out.
Cloud Computing
Cloud Providers (Google, Microsoft, AWS)
What is this? Imagine big halls filled with computers. Want to use them instead of your slow old laptop? That’s possible because Google, Amazon, Microsoft and others rent out their spare infrastructure.
Why do I need it?
- It can be faster than your local machine, since you can rent more powerful hardware
- It offers more storage space than your local machine
- It is typically more reliable than your local machine
Where can I learn it? A good starting point is Qwiklabs, which offers very nice interactive tutorials for Google Cloud and AWS.
Docker
What is this? “Docker is a set of platform as a service (PaaS) products that use OS-level virtualization to deliver software in packages called containers.” – Wikipedia. Sounds complicated but makes quite a lot of sense. You package your files and installed programs/libraries into an image, which then runs as a container on whatever computer you want to use it on. Docker Hub is a platform for storing and sharing your images in the cloud.
Why do I need it?
- Ever set up a new computer? Installing the software, moving the data, configuring the settings. With Docker, you package everything once and can replicate it everywhere
- If you use cloud infrastructure, you might set up new machines often. Docker makes that very easy.
- Replication is easy. If you work with multiple people you just need to share your container.
How can I learn it? I would first watch some YouTube videos and browse through the official documentation and docker-curriculum. Afterwards, I would just install Docker Desktop and try it out!
Kubernetes (pronounced like “kiu-ber-netties”)
What is this? You like Docker, but working with one machine and one container is boring? Kubernetes helps you set up multiple containers (Docker and others) on multiple machines, which allows for quite complex architectures.
Why do I need it?
- If you want to build more complex architectures with Docker images and containers
How can I learn it?
This is the most amazing tech-video you will ever see and a very nice introduction to the general concept: The Illustrated Children's Guide to Kubernetes
Streamlit
What is this? Streamlit is a tool for building a UI on top of your Python scripts with just a few lines of code.
Why do I need it?
- You get a beautiful UI with just a few hours of learning
- Normal people like beautiful UIs 😉
How can I learn it? It is super easy to learn. You can keep your scripts and only have to add a few lines of code to define the structure of the UI. It will take you no more than a few hours until you can build cool things. The official documentation is pretty good; I also liked this small tutorial.
Of course, there are plenty more tools like Flask (building applications), PySpark (data processing) or Dash (dashboards). Maybe I'll cover them in a later post!