Creating photo mosaics from Instagram using Python, Linear Optimisation and Neural Networks
People seem to absolutely love Instagram. From a data science perspective at least, it’s a fantastic data source for Natural Language Processing (NLP), network analysis, audience segmentation and plenty more. With maybe a bit too much free time over Christmas I wondered: could something cool be done with the photos themselves?
Who doesn’t love a personalised gift too? So let’s see if we can make someone a mosaic from all their own photos.
There are plenty of companies who make photo mosaics, but they seem to have two main limitations for this purpose:
Overall objective: Programmatically create a high-quality photo mosaic from a given Instagram account and target image.
Overview of approach
1) Get photos from Instagram
After getting a few unsuccessful attempts quickly out of the way, the oracle finally led me to a few websites (igdownloader.com, igram.io) that seem to work just fine. The user simply inputs the URL of the post, and hey presto, you can download the photos.
I eventually worked out roughly how they do it, using some articles and videos as great starting points:
2) Crop and resize photos
The final mosaic is essentially our best representation of the target image, using hundreds of tiny square tiles. We therefore need to 1) crop and resize our target image to the size of our desired output, and 2) crop and resize our input photos to the same size as the tiny square tiles.
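A minimal sketch of the cropping step, assuming a simple centred crop is good enough. The function name centre_crop_box is made up for illustration. It returns the box in (left, upper, right, lower) order, which is the order Pillow's Image.crop expects; using Pillow at all is an assumption, and Image.resize would then scale the crop down to the tile size.

```python
def centre_crop_box(width, height, target_aspect):
    """Largest centred box with the given width/height ratio.

    Returns (left, upper, right, lower) in pixel coordinates.
    """
    if width / height > target_aspect:
        # Image is too wide: keep the full height and trim the sides.
        new_width = round(height * target_aspect)
        left = (width - new_width) // 2
        return (left, 0, left + new_width, height)
    # Image is too tall (or already correct): keep the full width.
    new_height = round(width / target_aspect)
    upper = (height - new_height) // 2
    return (0, upper, width, upper + new_height)


# A portrait photo cropped to a square tile (aspect ratio 1.0).
portrait_box = centre_crop_box(1080, 1350, 1.0)
# A landscape photo cropped to a square tile.
landscape_box = centre_crop_box(1920, 1080, 1.0)
```

The same function covers both jobs: pass the output aspect ratio for the target image, and 1.0 for the square input photos.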
The first step here is deciding on output dimensions, which were kindly borrowed from thebirthposter.com. Using this we can crop and resize the target image.
Since we know the number of available input photos, we can use a bit of maths to work out the number of rows and columns of the tile grid, and therefore the necessary tile dimensions.
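One way the "bit of maths" could look, as a sketch under stated assumptions: tiles are square, every tile needs its own photo, and the grid should match the output aspect ratio as closely as possible. The function name tile_grid and the example dimensions are hypothetical.

```python
import math


def tile_grid(n_photos, out_width, out_height):
    """Return (rows, cols, tile_size) for a grid of square tiles.

    The grid uses at most n_photos tiles and fits inside the output size.
    """
    aspect = out_width / out_height
    # rows * cols <= n_photos and cols / rows ~= aspect
    # gives cols ~= sqrt(n_photos * aspect).
    cols = max(1, math.floor(math.sqrt(n_photos * aspect)))
    rows = max(1, math.floor(cols / aspect))
    # Guard for very small inputs, where the floors above can overshoot.
    while rows * cols > n_photos and cols > 1:
        cols -= 1
        rows = max(1, math.floor(cols / aspect))
    tile_size = out_width // cols  # side length of each square tile, in pixels
    return rows, cols, tile_size


rows, cols, tile_size = tile_grid(500, 1000, 1000)
```

Any photos left over after filling the grid simply go unused.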
3) Score possible photo placements
To create the best representation of the target image, we need to place each input photo in the tile it looks the most similar to. We can use the mean squared error (MSE) function from scikit-image, which compares two images pixel by pixel. With pixel values scaled to the range 0 to 1, the score runs from 0 (images are identical) to 1 (images are opposites, e.g. pure black against pure white). This captures far more than comparing just the average colours of two images, which loses valuable information about the shapes and regions of colour inside.
For our case, we make a matrix of results by scoring each input photo against each tile.
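A minimal sketch of the score matrix, written with plain NumPy rather than scikit-image so that it stands alone. It assumes the photos and tiles are already cropped to the same size and scaled to floats between 0 and 1; the function name score_matrix is made up for illustration.

```python
import numpy as np


def score_matrix(photos, tiles):
    """MSE of every photo against every tile.

    photos: array of shape (P, height, width, channels), values in [0, 1]
    tiles:  array of shape (T, height, width, channels), values in [0, 1]
    Returns an array of shape (P, T); entry [p, t] is the MSE of photo p on tile t.
    """
    flat_photos = photos.reshape(len(photos), -1).astype(float)
    flat_tiles = tiles.reshape(len(tiles), -1).astype(float)
    # Broadcasting builds every photo/tile difference in one go: (P, T, pixels).
    differences = flat_photos[:, None, :] - flat_tiles[None, :, :]
    return (differences ** 2).mean(axis=2)


black = np.zeros((2, 2, 3))
white = np.ones((2, 2, 3))
grey = np.full((2, 2, 3), 0.5)

scores = score_matrix(np.stack([black, white]), np.stack([black, white, grey]))
```

Black against white scores 1, identical images score 0, and either one against mid-grey scores 0.25. The broadcast holds every pairwise difference in memory at once, so with hundreds of photos it may be kinder to loop over the photos instead.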
4) Arrange input photos optimally
‘Where should I place each input photo so that I get the best-looking mosaic?’ is quite a tricky question to answer. My approach can be broken down into three main iterations:
4A) Best match
‘Place the best-matching photo first’ is basically this idea, i.e. select the photo-tile combination with the lowest mean squared error. Since this tile is now filled, we then ignore it going forward. Since we want to make this a ‘true’ mosaic and avoid duplicating any photos, we ignore the input photo going forward too.
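The idea can be sketched in a few lines, assuming scores[p][t] holds the MSE of photo p on tile t. The function name greedy_assign is hypothetical. Sorting every photo-tile pair once and walking down the list gives the same result as repeatedly picking the lowest remaining pair.

```python
def greedy_assign(scores):
    """Fill tiles by taking the lowest-MSE photo-tile pair first.

    scores[p][t] is the MSE of photo p on tile t.
    Returns a dict mapping tile index -> photo index.
    """
    n_photos, n_tiles = len(scores), len(scores[0])
    pairs = sorted(
        (scores[p][t], p, t) for p in range(n_photos) for t in range(n_tiles)
    )
    used_photos = set()
    assignment = {}
    for _, photo, tile in pairs:
        if photo in used_photos or tile in assignment:
            continue  # this photo or this tile has already been taken
        assignment[tile] = photo
        used_photos.add(photo)
        if len(assignment) == n_tiles:
            break
    return assignment


# Two photos, two tiles: photo 0 fits tile 0 best, so it is placed first.
scores = [
    [0.10, 0.20],
    [0.15, 0.90],
]
assignment = greedy_assign(scores)
total = sum(scores[photo][tile] for tile, photo in assignment.items())
```

In this toy example photo 0 grabs tile 0 straight away, which leaves photo 1 stuck with a tile it matches badly.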
Although some tiles are filled brilliantly at the start (low MSE), we can see that by the end, there are large regions that have to be filled with whatever happens to be left over (high MSE).
4B) Linear optimisation
We can rephrase our question slightly more mathematically as ‘what set of acceptable photo-tile combinations has the minimum total mean squared error?’
Although the ‘best match’ approach above makes logical sense, it does not necessarily produce the optimal result. Enter linear optimisation.
Linear optimisation seeks to minimise or maximise an objective function, given a set of constraints. In a manufacturing setting, the objective might be ‘maximise profit from selling tables and chairs’, and the constraints may be the total units of wood, metal and labour available. In other words, how many tables and/or chairs should I make in order to maximise profit?
For this application, the objective function is to ‘minimise total MSE’ and the constraints are ‘use exactly one photo per tile’, and ‘use each input picture no more than once’.
We can apply these constraints by using a dummy variable for each photo-tile combination, i.e. its value is 1 (we use that photo-tile combination) or 0 (we don’t).
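Written out, the problem looks roughly like this. The symbols are my own shorthand rather than anything from PuLP: c_{pt} is the MSE of photo p on tile t, and x_{pt} is the dummy variable for that photo-tile combination.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{p} \sum_{t} c_{pt} \, x_{pt} \\
\text{subject to} \quad
  & \sum_{p} x_{pt} = 1 \quad \text{for every tile } t, \\
  & \sum_{t} x_{pt} \le 1 \quad \text{for every photo } p, \\
  & x_{pt} \in \{0, 1\}.
\end{aligned}
```

The first constraint is ‘exactly one photo per tile’ and the second is ‘each input picture no more than once’.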
To set up and solve this problem I used a great package called PuLP. The specific algorithms that do the solving are a bit too involved to explain in detail here, but if you are interested in a bit more background, check out this talk at PyData.
When comparing the two approaches, linear optimisation reduces the average error by 26% (0.093 to 0.069). We can also see it has a narrower distribution, with fewer excellently placed tiles (very low MSE) and far fewer poorly placed tiles (high MSE).
4C) Tile weighting
Unfortunately, not all tiles were created equal. Having a good match for a face tile is much more important than for a background tile. The more important the tile, the more important it is to match it with a photo that gets a low MSE.
Considering our linear optimisation problem (minimise total MSE), if we scale up the MSEs corresponding to an important tile by a factor of 1000, any mismatch on that tile dominates the total, so the algorithm will strongly favour giving that tile the best photo it can.
It follows then: the more important the tile, the higher the scaling factor.
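A toy illustration of the effect, not the real solver: for two photos and two tiles we can just try every arrangement, so exhaustive search stands in for PuLP here. The function name best_assignment and all the numbers are made up.

```python
from itertools import permutations


def best_assignment(scores, weights):
    """Try every arrangement and return the one with the lowest weighted total.

    scores[p][t] is the MSE of photo p on tile t; weights[t] scales tile t.
    Returns a list where entry t is the photo chosen for tile t.
    Only practical for tiny examples.
    """
    n_photos, n_tiles = len(scores), len(weights)
    best = min(
        permutations(range(n_photos), n_tiles),
        key=lambda photos: sum(
            weights[t] * scores[p][t] for t, p in enumerate(photos)
        ),
    )
    return list(best)


# Tile 0 is a face tile, tile 1 is background.
scores = [
    [0.10, 0.30],
    [0.20, 0.70],
]
unweighted = best_assignment(scores, [1, 1])
weighted = best_assignment(scores, [10, 1])
```

With equal weights the face tile gets photo 1 (MSE 0.20), because that frees up photo 0 for the background. Scale the face tile by 10 and it gets photo 0 (MSE 0.10), at the expense of a worse background match.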
First, we have to identify the important objects in the target image. Identifying people and faces in images and video is a relatively common data science problem, with many pretrained neural network models readily available. For objects such as people, PointRend (Facebook AI Research, 2020) is able to predict object boundaries in great detail. For faces, MediaPipe (Google, 2019) is able to predict an incredible 468 facial landmarks, again capturing the boundaries in great detail.
Second, we identify the tiles corresponding to the labelled face and person regions. The final step is then to determine the weighting of each tile, which I chose to be a function of distance from the labelled regions. This prevents any harsh borders around the person/face and also seems to give a nice gradient border around the person.
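As a sketch of what a distance-based weighting could look like: the exponential fall-off, the peak and falloff parameters and the function name tile_weights are all assumptions for illustration, not the exact function used for the mosaics above.

```python
import math


def tile_weights(important, rows, cols, peak=10.0, falloff=0.5):
    """Weight each tile by its distance from the nearest important tile.

    important: set of (row, col) positions covered by a face or person.
    Returns a rows x cols nested list; important tiles get `peak`,
    and the weight decays smoothly towards 1 further away.
    """
    if not important:
        return [[1.0] * cols for _ in range(rows)]
    weights = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Straight-line distance, in tiles, to the nearest important tile.
            distance = min(math.hypot(r - ir, c - ic) for ir, ic in important)
            row.append(1.0 + (peak - 1.0) * math.exp(-falloff * distance))
        weights.append(row)
    return weights


# A 3 x 3 grid with a single important tile in the middle.
weights = tile_weights({(1, 1)}, rows=3, cols=3)
```

The centre tile gets the full weight, its direct neighbours a little less, and the corners less again, which is what produces the soft gradient rather than a hard edge.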
As expected, the linear optimisation algorithm has found great matches for the important face regions, and less so for the background.
Comparing all methods, it is clear that adding in the tile weighting improves the output dramatically.
Of the five companies I was able to test, only one did it without the heavy image masking.
Next steps
I started this little project just out of curiosity, and it’s been great fun developing it so far. It’s unlikely to turn into a huge money spinner, but it would be exciting to solve the problems surrounding scalability. I am currently limited by:
For now at least, I’ve created an Instagram page to show some of the end results — feel free to check them out on @gramdesigns_ or connect with me on LinkedIn.
:)
And there was me thinking you'd stayed up all night, printing them out on your printer, cutting them up, and arranging them by hand
Love this Tom Hudson - super interesting breakdown too! 👏
love these Tom Hudson