A Physics Simulation and the Power of the Raspberry Pi in the Data Science World
Raspberry Pis. These small, not very powerful computers are seen as a novelty by some, but they are a valuable product and resource to others. Not only can they teach individuals to code, but they can also serve as a platform for small data analytics projects and teach skills ranging from the command line to networking.
But what allows me to say anything about this? The projects that I have undertaken with the Raspberry Pis that I own have given me invaluable experience. Sitting in my dorm room as a physics student at the University of Illinois at Urbana-Champaign, I was able to use four Raspberry Pis to create a cluster of machines to run a host of programs. Not only would they run algorithms separately, they would also work in parallel to accomplish a larger task.
What is a Raspberry Pi?
To quote the official site of the Raspberry Pi, raspberrypi.org, "The Raspberry Pi is a tiny and affordable computer that you can use to learn programming through fun, practical projects." This small machine is not only useful for that, however. Through some clever programming and problem solving it can be used for more advanced applications and can be invaluable for certain projects. At around $35, this small computer can be an extremely cost-effective component in many analytics projects.
The Program
The plan was to write a program that ran a simple physics simulation -- simulating the gravitational interaction between the particles of a system. That doesn't seem too difficult or computationally intensive, but consider the pair counting: with N particles, each particle i must have its interaction calculated against the N - i particles after it, for N(N - 1)/2 unique pairs in total. This compounds fast. For 10 particles it is only 45 interactions, but for 20 particles it jumps to 190 calculations of gravitational force. So, when running a simulation with hundreds of particles you have a problem: how do you process all of these force calculations on one machine? If it has a quad-core processor you could run a thread on each core, splitting the number of calculations by four, but that would still take a large amount of time. So what is the solution? Network the Raspberry Pis and use them for the calculations. You can send the data to the four machines, have each one run a quarter of it, and split each quarter into four again in order to run four threads on each Raspberry Pi. This cuts the computation time down considerably. The challenge, however, is networking the machines effectively.
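To make that scaling concrete, here is a minimal sketch of the kind of pairwise force loop involved. This is not the project's actual code -- the function and variable names are my own, and it assumes the positions and masses live in NumPy arrays.

```python
import itertools
import numpy as np

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def pairwise_forces(positions, masses):
    """Accumulate the Newtonian gravitational force on each particle.

    Visits every unique pair exactly once, so an N-particle system
    costs N * (N - 1) / 2 force evaluations.
    """
    n = len(positions)
    forces = np.zeros_like(positions)
    for i, j in itertools.combinations(range(n), 2):
        r = positions[j] - positions[i]               # vector from i to j
        dist = np.linalg.norm(r)
        f = G * masses[i] * masses[j] / dist**3 * r   # F = G*m1*m2/r^2, directed along r
        forces[i] += f                                # Newton's third law:
        forces[j] -= f                                # equal and opposite forces
    return forces
```

Because the loop touches each unique pair once, the work grows quadratically with the particle count -- which is exactly why splitting the pairs across machines pays off.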
The Networking
This was a large undertaking for me, seeing as I had no previous networking experience. To tackle this problem, I turned to my favorite programming language -- Python. Python is an extremely powerful language because it has a multitude of modules made by other people, giving you the ability to do a thousand things without having to devise the code yourself. To detour from the conversation a bit, Python modules are collections of pre-made code for a certain purpose, such as Astropy, which allows you to read in and manipulate .fits files (an image file type popular in astronomy), or Paramiko, which I used in this project for the networking of the machines.
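To give a flavor of what Paramiko looks like in practice, here is a minimal sketch of opening an SSH connection to a Pi and pushing a file over SFTP. The address, credentials, and file paths are placeholders for illustration, not my cluster's actual settings.

```python
import paramiko

# Placeholder host and credentials -- substitute your own Pi's details.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("192.168.1.101", username="pi", password="raspberry")

# Open an SFTP session and copy a local file to the Pi.
sftp = client.open_sftp()
sftp.put("particles.csv", "/home/pi/particles.csv")  # local -> remote
sftp.close()
client.close()
```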
The Hardware
What exactly did I need for this endeavor? Not too much, but it was more than I expected. I needed a power supply for each Raspberry Pi and an Ethernet cable for each one, along with one for my laptop, which served as the master node that sent out the data, recombined it after it had been processed, and rendered the simulation. I also needed a power strip to power everything and a desktop switch in order to connect all of the machines. Finally, I needed one last Ethernet cable to connect the switch to the internet jack, giving the Raspberry Pis and my laptop access to the internet.
The Code
The Python code for this project wasn't quite as simple as I would have liked, but it works. First, the particle system is generated with particles at random positions. Next, using the Pandas module (a module for data manipulation and analysis), the particles, the forces between them (zero vectors at the start), their positions, their masses, and their velocities are all written to a .csv file. This file is how I transport the data to the Raspberry Pis.

The first approach I used was to send the data directly to the Pis using sockets, but this proved to be extremely slow, to the point of being the bottleneck of the program. I switched to using the Paramiko module and its SFTP (Secure File Transfer Protocol) capability to send the entire .csv file to each machine. The reason the entire file must be sent, rather than just one fourth of it to each machine, is that each Pi must calculate the interactions between all of the particles and the fourth it is responsible for, so it needs the full dataset to do that.

After transferring the data file to the Raspberry Pis, the master node sends a message through the sockets telling the waiting Pis to go ahead and process the data. The Pis take over from there. Each one runs through a fourth of the data (which fourth is determined by arguments passed to the Pi when its program starts) and calculates all of the forces. These are written to a smaller data file and sent back to the master node, along with a message saying that the Pi has finished processing. Once all four machines are done, the master node takes over once again and combines the four small files into one complete data file, replacing the old one. It then extracts the data from the file and performs the operations necessary to render the frame, and the program just rinses and repeats. All of this code is on GitHub, by the way, if you want to take a look at it.
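To give a feel for that flow, here is a simplified sketch of the worker side running on each Pi. The port number, file names, and the "go"/"done" messages are illustrative assumptions on my part; the real code on GitHub differs in the details, and the force calculation itself is elided.

```python
import socket
import sys

import pandas as pd

# Which quarter of the data this Pi handles (0-3), passed as a
# command-line argument -- mirroring the arguments mentioned above.
QUARTER = int(sys.argv[1])
PORT = 5000  # assumed port, not necessarily the project's

# Wait for the master node to connect and say "go".
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("", PORT))
server.listen(1)
conn, addr = server.accept()

if conn.recv(1024) == b"go":
    df = pd.read_csv("particles.csv")   # full dataset, delivered via SFTP
    chunk = len(df) // 4
    mine = df.iloc[QUARTER * chunk:(QUARTER + 1) * chunk]
    # ... calculate the forces between every particle and this quarter ...
    mine.to_csv(f"forces_{QUARTER}.csv", index=False)
    conn.sendall(b"done")               # tell the master we finished
conn.close()
```

On the master side, the recombination step maps naturally onto pandas.concat, e.g. pd.concat([pd.read_csv(f"forces_{q}.csv") for q in range(4)]), before writing the merged frame back out as the new data file.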
The Value of the Program
But why should you care about this program that I wrote? The answer is that you can learn a multitude of things from it. The networking aspect of the program can be invaluable for data science: this program is not only about physics, but about data analytics as well. The method of splitting the data and processing it separately transfers easily to almost any data set and almost any algorithm, assuming the algorithm can be parallelized. This comes in handy for large sets of data that need to be processed, and the Raspberry Pis offer a cheap alternative to a single large, powerful machine. The Pis cost about $35 each, so my cluster cost somewhere around $140 without all of the cables and accessories. That is relatively cheap compared to a desktop that could cost up to and past $1000. The desktop would probably perform better, seeing as the Raspberry Pi 2 Bs that I am using each have a 900 MHz processor and 1 GB of RAM, but imagine a cluster of 16 machines. Including cables, switches, and such, that would most likely still cost less than the desktop, and with its parallelization capabilities it could possibly give the desktop a run for its money (I'm not entirely sure, as I haven't been able to test it yet, but that is an upcoming project).
The Takeaway
The most important thing to take away from this is that cheap data analytics is not impossible, and it is likely to become a larger factor in the data science world. The ability of one person to build a small supercomputer and process data on it is no small feat, and certainly not something to be overlooked. It gives an analytics startup the ability to compete -- in a small capacity, admittedly, but it means they are not completely out of the game from the beginning. Furthermore, it gives universities and companies a cheap alternative to expensive supercomputers or paying for cloud computing time. They can instead invest in a small cluster of cheap machines that is easily scalable, with the added advantage that the cluster can be upgraded and customized to their specifications.