How I used an object detection API to solve an inventory problem.
Advancements in machine learning algorithms have been booming for the last couple of years. They are changing our day-to-day lives and raising interesting new philosophical questions. We owe this to the cloud computing revolution, which made serious computational power available to almost anybody with a laptop, an internet connection, and a desire to learn.
With the landing of Amazon Go stores and e-commerce setting the level of customer service sky-high, traditional brick-and-mortar retailers are facing some tough times if they aren't equipped for modern customers' expectations. What are those? To name a few: any item available at all times, accurate inventory, and of course the fastest possible checkout, with no cashiers or scanning involved.
Even though most retailers don't have access to the kind of technology and engineering teams that Amazon has, they should not be discouraged. Google released the TensorFlow Object Detection API, which I used to show how anybody can create a machine learning model to track physical stock on the shelf.
How can you use that? Stores can use computer vision to monitor shelves: detect items that were picked up, spot misplaced items, and make sure items are always available on the shelf. A robust computer vision model can keep track of all the items picked up by a specific customer and then charge their account directly, no cashiers needed. This kind of information is also very valuable for the marketing team to extract trends, as well as for security to detect suspicious behavior.
Google has open-sourced many high-quality tools for machine learning and provided courses, extensive documentation, and free credits for Google Cloud Platform. For this project, I used TensorFlow, Python, one of the pre-trained Google models, and compute power on GCP.
Ready?
You can find code for this project on my GitHub repository.
First things first. Data.
To train any model you need data. In our case, that means pictures of items on the shelf to show our algorithm so it learns to detect them. There are plenty of such datasets publicly available via Kaggle or a simple Google search. Another option is to create your own dataset. You can also generate synthetic data from an existing dataset using image processing libraries like OpenCV or PIL (rotate, change brightness, zoom, add noise, shift color). This introduces more variation (helpful in my case, since video frames tend to repeat themselves) and makes the model more robust.
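As a rough illustration of that augmentation step, here is a minimal sketch using Pillow (PIL); the transform parameters are my own illustrative defaults, not values from the original project:

```python
# A minimal augmentation sketch using Pillow (PIL), assuming frames are RGB images.
# Each helper returns a new, modified copy of the input frame.
import random
from PIL import Image, ImageEnhance

def rotate(frame: Image.Image, max_deg: float = 15) -> Image.Image:
    """Rotate by a random small angle to vary object orientation."""
    return frame.rotate(random.uniform(-max_deg, max_deg), expand=False)

def change_brightness(frame: Image.Image, low: float = 0.6, high: float = 1.4) -> Image.Image:
    """Darken or brighten the frame to simulate lighting changes."""
    return ImageEnhance.Brightness(frame).enhance(random.uniform(low, high))

def zoom(frame: Image.Image, factor: float = 1.2) -> Image.Image:
    """Crop the center and resize back, imitating a closer camera."""
    w, h = frame.size
    cw, ch = int(w / factor), int(h / factor)
    left, top = (w - cw) // 2, (h - ch) // 2
    return frame.crop((left, top, left + cw, top + ch)).resize((w, h))

def augment(frame: Image.Image) -> Image.Image:
    """Apply a random pair of the transforms above."""
    for op in random.sample([rotate, change_brightness, zoom], k=2):
        frame = op(frame)
    return frame
```

One caveat worth remembering: geometric transforms like rotation and zoom move the objects in the frame, so the bounding box annotations have to be transformed accordingly, or the labels will no longer match the images.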
I bought a couple of jars of my favorite pickles and made a short video of them on the shelf. Any object detection model requires annotations: boxes around the objects of interest and a label value for each one. So I broke the video into frames, labeled them using the LabelImg tool, and ended up with a file containing the bounding box coordinates for each image.
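By default LabelImg saves annotations in Pascal VOC-style XML, one file per image. A small sketch of pulling the label and pixel coordinates out of such a file with the standard library (the tag names follow the VOC format, not any project-specific schema):

```python
# Parse one Pascal VOC-style annotation (the format LabelImg writes by default)
# into a flat list of labeled pixel boxes.
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_text: str):
    """Return a list of (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        boxes.append((
            obj.findtext("name"),
            int(box.findtext("xmin")),
            int(box.findtext("ymin")),
            int(box.findtext("xmax")),
            int(box.findtext("ymax")),
        ))
    return boxes
```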
To use the TensorFlow Object Detection API, our data should be converted into tfrecords (TensorFlow's binary storage format, which makes large amounts of data easier to handle when training the model). Honestly, it's a tricky part... but the API has all you need to get going and create tfrecords from your data. We also need a label.pbtxt file that maps our labels to numeric values. The format looks like this; you will find samples in the API's GitHub repo:
item {
  id: 1
  name: 'jar'
}
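For a sense of what one tfrecord entry looks like, here is a sketch of building a single tf.train.Example, assuming TensorFlow is installed and following the feature keys used by the Object Detection API's sample dataset tools (image bytes plus normalized box coordinates and class info):

```python
# Build one tf.train.Example for an image with labeled boxes.
# Pixel boxes are normalized to [0, 1] as the API expects.
import tensorflow as tf

def make_example(image_bytes, width, height, boxes, label_id=1, label_text=b'jar'):
    """boxes: list of (xmin, ymin, xmax, ymax) in pixels."""
    xmins = [b[0] / width for b in boxes]
    ymins = [b[1] / height for b in boxes]
    xmaxs = [b[2] / width for b in boxes]
    ymaxs = [b[3] / height for b in boxes]
    feature = {
        'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        'image/object/bbox/xmin': tf.train.Feature(float_list=tf.train.FloatList(value=xmins)),
        'image/object/bbox/ymin': tf.train.Feature(float_list=tf.train.FloatList(value=ymins)),
        'image/object/bbox/xmax': tf.train.Feature(float_list=tf.train.FloatList(value=xmaxs)),
        'image/object/bbox/ymax': tf.train.Feature(float_list=tf.train.FloatList(value=ymaxs)),
        'image/object/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label_id] * len(boxes))),
        'image/object/class/text': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label_text] * len(boxes))),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
```

A tf.python_io.TFRecordWriter (or tf.io.TFRecordWriter in newer versions) then serializes each example to the record file.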
Note: if this is your first time trying to build an object detection model I’d suggest this walkthrough from TensorFlow.
Moving on to the actual model.
I was using transfer learning, of course. If that term is new to you, here's the idea: transfer learning is the practice of taking a pre-trained model as a base and continuing training on your own data to reach higher recognition accuracy for the images you care about.
Google spent weeks and a lot of computing power training these models, froze the trained graphs, and made them available to anyone (sending rays of sunshine to Google). These convolutional neural networks are very good feature extractors, so you just need to present some number of images you wish to recognize and that's it. By contrast, if you build a model from scratch you will need much more data and training time, and you will most likely still end up with lower accuracy.
The only question is which model to use. Each has pros and cons in terms of accuracy and execution speed. This comparison might help you decide.
Ready. Set. Train.
I chose ssd_mobilenet_v1_coco, trained on the COCO dataset, because it's light and fast. Training was performed in the cloud using Google Cloud ML Engine.
To schedule a training job you need to provide a config file with paths to the training data, the model of your choice to retrain, your data converted to tfrecords (training and validation sets), the file that maps labels to numeric values, the API itself (packaged), and an output directory. Everything should live in bucket storage on GCP. The Object Detection API provides sample config files for us. There is also a very comprehensive tutorial on how to deploy training jobs on ML Engine.
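To give a feel for that config file, here is a heavily trimmed fragment in the API's protobuf text format, modeled on the samples shipped in the repo; the gs:// paths and batch size are placeholders, and many required sections are elided:

```
model {
  ssd {
    num_classes: 1
    # ...anchor, box predictor, and feature extractor settings...
  }
}
train_config {
  fine_tune_checkpoint: "gs://my-bucket/ssd_mobilenet_v1_coco/model.ckpt"
  batch_size: 24
}
train_input_reader {
  tf_record_input_reader {
    input_path: "gs://my-bucket/data/train.record"
  }
  label_map_path: "gs://my-bucket/data/label.pbtxt"
}
```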
ML Engine will let you know when the job has been successfully deployed and will report the training error for each period in the logs. Once you think the model has converged, you can stop the training and extract a frozen model from the output directory on the bucket you specified earlier.
Checking the result.
I was eager to check how my model would do. I had a recorded test video clip of pickles on the shelf.
Just like before, I split it into frames, fed them to my freshly retrained model, got the inference results, and put the frames back together into a video with boxes showing the detected objects. But processing took too much time and I was too impatient, so I came up with a bash script that made my life much easier and inference faster (here it is).
This script takes a few parameters: a path to your video, the name of the output folder, and how many processes you want to run in parallel (for me it was 16, on a 16-core virtual machine on GCP). It uses the ffmpeg command-line tool to split the video into frames, scatters them between directories (so that each background process reads from a separate folder, avoiding collisions or delays), and then starts as many inference Python processes as you specified.
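The scatter step is easy to picture in pure Python. This sketch distributes frame files across N worker folders round-robin so each inference process reads from its own directory; the worker_XX directory names are illustrative, not taken from the original script:

```python
# Distribute extracted video frames across N worker directories, round-robin,
# so each parallel inference process gets its own folder to read from.
import os
import shutil

def scatter_frames(frame_paths, out_root, n_workers):
    """Return {worker_dir: [frames]} and copy each frame into its worker dir."""
    plan = {os.path.join(out_root, f"worker_{i:02d}"): [] for i in range(n_workers)}
    dirs = sorted(plan)
    for idx, path in enumerate(sorted(frame_paths)):
        plan[dirs[idx % n_workers]].append(path)
    for d, frames in plan.items():
        os.makedirs(d, exist_ok=True)
        for f in frames:
            shutil.copy(f, d)
    return plan
```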
But what does the inference script do? It runs each frame through our retrained model, extracts the bounding boxes of detected jars (filtered by the highest confidence score), and generates CSV (comma-separated values) files with the coordinates of the boxes. With the coordinates in hand, I can run a final Jupyter notebook that reads the CSV files and draws the boxes on each frame of the video using PIL.
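The post-processing step can be sketched like this: per frame, keep only detections above a confidence threshold, with the strongest box first. The CSV column layout here is an assumption for illustration, not the original script's exact format:

```python
# Filter per-frame detections from a CSV of boxes by confidence score.
import csv
import io
from collections import defaultdict

def best_boxes(csv_text, min_score=0.5):
    """csv rows: frame,score,xmin,ymin,xmax,ymax -> {frame: [(score, box), ...]}"""
    per_frame = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["score"])
        if score >= min_score:
            box = tuple(int(row[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
            per_frame[row["frame"]].append((score, box))
    for frame in per_frame:
        per_frame[frame].sort(reverse=True)  # strongest detection first
    return dict(per_frame)
```

The notebook's drawing step then only needs PIL's ImageDraw.rectangle over each surviving box.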
Done! Celebrate.
The TensorFlow Object Detection API is pretty robust. It abstracts away a lot of low-level programming and frees you from having to create a model from scratch, so you can start implementing ideas and solving various problems with machine learning in a matter of days.
Enjoy the magic of ML and open-source!
You can find code for this project on my GitHub repository.
By Daria Gurova.