Image Diff
Overview
I got a challenge today. One of my friends asked me whether machine learning can tell the difference between two images and highlight the spots that differ. The answer is yes, and I did a weekend project to prove it. Along the way, I also realized this is not as simple a question as it looks.
I created a web service with the following capabilities implemented:
- Image uploading
- Calculates a couple of metrics used for image comparison, including pixel matching rate, feature matching, and SSIM score
- On-picture masks that highlight the regions that differ between two images
- Euclidean and cosine distance from MobileNet embeddings
- Image classification
Let's take a look at the results first!
Here are two sample results I got from my testing data.
These are two Tesla Model X pictures captured from the official website. Besides the color, I selected the 20'' Two-Tone Slipstream Wheels for the blue one. We can see the model works pretty well: it highlights the difference on the wheels, which could easily be overlooked by a human. Interestingly, the white car is categorized as a minivan but the blue one is recognized as a sports car. Hmm, it might be a good idea to spend $2000 on wheels to upgrade your minivan to a sports car :-)
Another test was done on fake ID images. The pixel matching and feature matching scores are higher than for the Tesla images because most of the regions are the same between the two pictures, even from a color perspective. The distance between the embeddings is smaller, which suggests color has an obvious impact, maybe more than shape. Classification is not really working, maybe because there wasn't much training data with IDs as a target.
Along the way, I found it's a longer story than it looks. Besides what I have done, there is more to explore. We need to define what the difference between two images actually means, and there are many metrics to measure it. They need to be selected according to the user scenario.
First, there are different types of difference.
Pixel difference. We know images have 3 channels: Red, Green, and Blue. Each channel has M * N pixels ranging from 0 to 255. By simply comparing the pixel values, we can tell how different two images are. This is straightforward and easy to implement; however, the result can be very different from what a human perceives.
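To make this concrete, here is a minimal sketch of an exact pixel comparison with NumPy and OpenCV (not the exact code in my service; it assumes both images have the same dimensions):

import cv2 as cv
import numpy as np

def pixel_match_rate(f1, f2):
    # OpenCV loads each image as an M x N x 3 array (BGR channel order)
    a = cv.imread(f1)
    b = cv.imread(f2)
    assert a.shape == b.shape, "images must be the same size"
    # A pixel matches only if all three channel values are identical
    matches = np.all(a == b, axis=-1)
    return matches.mean()  # fraction of matching pixels, 0.0 to 1.0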
Feature difference. An image can have many features, such as a blue sky, white clouds, or the roof of a building. It's hard to say how humans find these features, as that is coded in our brains. In software, features are found by feature detection techniques and can be used to tell the similarity of two pictures. A common approach is to look for a region of a picture that shows large variation when you move a window over it. As shown in the picture below from OpenCV, corners can be good features.
Category difference. A picture can be described using human language, and can be categorized accordingly. This is another perspective that does not simply look at pixels or features; it simulates human judgment. For example, a picture of the Golden Gate Bridge shot in the morning and another one taken in the afternoon could be very close in category difference, but differ on every single pixel if you look at the RGB channels. Deep learning image recognition models can solve this problem pretty well; however, there is also a way to cheat. See attacking a network with adversarial examples: https://cs230.stanford.edu/spring2020/lecture3.pdf
Secondly, how do we measure the difference?
Pixel difference can be measured by exact match: if the integer values differ, we consider it a mismatch. By counting the mismatching pixels, we can calculate a match rate as the measure. Another approach is to use ImageHash.
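For the ImageHash route, here is a minimal sketch using the imagehash package (a perceptual hash; subtracting two hashes yields a Hamming distance, so smaller means more similar):

import imagehash
from PIL import Image

h1 = imagehash.phash(Image.open(f1))
h2 = imagehash.phash(Image.open(f2))
print(h1 - h2)  # Hamming distance between the two perceptual hashes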
Feature difference can be measured by how many similar features the two pictures share, and how similar the corresponding feature regions are.
Category difference is a little complicated. Fortunately, embeddings are all you need. As with NLP problems, there are various models that map an image to a vector of numeric values containing the feature information of the input. Unlike classic feature detection, deep learning models can carry more feature information, but the downside is that they cannot easily tell you what features they are looking at.
We can use the embedding values to calculate the distance between two images. However, the performance heavily relies on the quality of the embedding, so we need to select the embedding model carefully according to the problem we want to solve. Choosing a face recognition model for embeddings won't be very effective if your candidate images are all about vehicles.
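Once you have the embedding vectors, the distance math itself is simple. A sketch with NumPy:

import numpy as np

def euclidean(e1, e2):
    # Straight-line distance between the two embedding vectors
    return np.linalg.norm(e1 - e2)

def cosine_distance(e1, e2):
    # 1 minus cosine similarity; 0 means the vectors point the same way
    return 1 - np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))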
OK, what prior work can we reuse?
As engineers, we should always remember: "Don't reinvent the wheel." Here are the major components I am going to use.
- OpenCV is a widely used project in computer vision. It provides Python and JS APIs to operate on images, video, and even cameras, and it also provides feature detection APIs. I will use it for image processing and feature similarity calculation, such as ORB.
- Scikit-image has APIs to calculate similarity between images. I will use it for feature matching calculation, such as the Structural Similarity Index (SSIM).
- TensorFlow Hub has prebuilt models that can be reused for embedding or classification. I will use TensorFlow to load MobileNet models for image classification and distance calculation.
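For reference, all of these (plus the smaller helpers used later) can be installed from PyPI; these are the package names as I know them, so double-check versions for your own setup:

pip install opencv-python scikit-image tensorflow tensorflow-hub imutils diffimg imagehash flask flask-dropzone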
Design
Implementation
Pixel matching
This is the easiest one. I simply use the API provided by diffimg.
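A minimal sketch of the call (as I understand the diffimg API, diff returns the average pixel difference as a ratio between 0 and 1):

from diffimg import diff

# 0.0 means the images are identical; 1.0 means completely different
ratio = diff(f1, f2, delete_diff_file=True)
pixel_match_rate = 1 - ratio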
Feature matching
I want to calculate ORB and SSIM scores. First, I need to load the images using OpenCV and convert them to grayscale.
import cv2 as cv

# Load both images and convert them to grayscale for feature detection
i1 = cv.imread(f1)
i2 = cv.imread(f2)
img1 = cv.cvtColor(i1, cv.COLOR_BGR2GRAY)
img2 = cv.cvtColor(i2, cv.COLOR_BGR2GRAY)
Then I can use the ORB detector to detect keypoints, compute descriptors, and match them across the two images. From there I can generate an overlay that shows the matching features on the images.
# ORB: find keypoints and binary descriptors in each image
orb = cv.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matcher with Hamming distance, which suits ORB's binary descriptors
bf = cv.BFMatcher(cv.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)

# Draw the matched keypoints side by side and save the visualization
img3 = cv.drawMatches(i1, kp1, i2, kp2, matches, None,
    flags=cv.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
cv.imwrite(f3, img3)
The SSIM score calculation is similar. One advantage of SSIM is that we also get a diff image, which can be used to find contours highlighting the regions that differ between the two images. Thanks to this article by Adrian Rosebrock; it saved me a lot of time.
import imutils
# in newer scikit-image versions compare_ssim is exposed as structural_similarity
from skimage.metrics import structural_similarity as compare_ssim

(score, diff) = compare_ssim(img1, img2, full=True)
# the diff image is float in [0, 1]; scale it to 8-bit for OpenCV
diff = (diff * 255).astype("uint8")
print("SSIM: {}".format(score))
# threshold the difference image, followed by finding contours to
# obtain the regions of the two input images that differ
thresh = cv.threshold(diff, 0, 255,
    cv.THRESH_BINARY_INV | cv.THRESH_OTSU)[1]
cnts = cv.findContours(thresh.copy(), cv.RETR_EXTERNAL,
    cv.CHAIN_APPROX_SIMPLE)
cnts = imutils.grab_contours(cnts)
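From the contours, drawing the highlight boxes is straightforward, following the same article:

# Draw a bounding box around each differing region on both originals
for c in cnts:
    (x, y, w, h) = cv.boundingRect(c)
    cv.rectangle(i1, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv.rectangle(i2, (x, y), (x + w, y + h), (0, 0, 255), 2)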
Category matching
A side topic here. To use TensorFlow Hub pre-trained models, I need the tf.keras API. However, it does not work with Flask if you turn on debug mode. This is a weird bug that exists in all versions from 1.15 to 2.1; I had to upgrade to tf 2.2rc3 to make it work. There is a long thread tracking it.
Also, the TensorFlow API keeps changing. The 2.0 API is quite different from 1.0, and you may have to understand both to follow most of the tutorials and make the code work in your project. This is probably why many people have switched to PyTorch, which has a more human-friendly and Pythonic API design.
I found two MobileNet models on TF Hub, one for classification and another for embedding. Since I am using the 2.0 API, I can use the SavedModel format directly. Here is the sample code for classification:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from PIL import Image

def img_classification(f1):
    # wrap the TF Hub classification model in a Keras model
    m = tf.keras.Sequential([
        hub.KerasLayer("https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4", output_shape=[1001])
    ])
    m.build([None, 224, 224, 3])
    # MobileNet expects 224x224 RGB input scaled to [0, 1]
    IMAGE_SHAPE = (224, 224)
    img = Image.open(f1).convert('RGB').resize(IMAGE_SHAPE)
    img = np.array(img) / 255.0
    # run inference on a batch of one and take the top class
    result = m.predict(img[np.newaxis, ...])
    predicted_class = np.argmax(result[0], axis=-1)
    # map the class index to a human-readable ImageNet label
    labels_path = tf.keras.utils.get_file('ImageNetLabels.txt',
        'https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt')
    imagenet_labels = np.array(open(labels_path).read().splitlines())
    predicted_class_name = imagenet_labels[predicted_class]
    return predicted_class_name
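The embedding side works the same way, except the model outputs a feature vector instead of class logits. A sketch, assuming the companion tf2-preview feature_vector model on TF Hub:

def img_embedding(f1):
    # the feature_vector variant returns an embedding instead of class logits
    m = tf.keras.Sequential([
        hub.KerasLayer("https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4")
    ])
    m.build([None, 224, 224, 3])
    img = Image.open(f1).convert('RGB').resize((224, 224))
    img = np.array(img) / 255.0
    return m.predict(img[np.newaxis, ...])[0]

e1, e2 = img_embedding(f1), img_embedding(f2)
# Euclidean and cosine distance between the two embeddings
print("euclidean:", np.linalg.norm(e1 - e2))
print("cosine:", 1 - np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))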
Web Service
Finally, I created a Flask web service to put all the functionality together. I created two endpoints:
- File upload. I use the Flask-Dropzone project for this feature.
- Image evaluation. It takes the file names returned by the file upload API and runs a synchronous evaluation that calls pixel matching, feature matching, and category matching in sequence (a minimal sketch follows).
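Here is that sketch of the evaluation endpoint; pixel_matching and feature_matching are hypothetical names standing in for the functions described above:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/evaluate')
def evaluate():
    # file names previously returned by the upload endpoint
    f1 = request.args.get('f1')
    f2 = request.args.get('f2')
    # run the three comparisons synchronously, in sequence
    return jsonify({
        'pixel': pixel_matching(f1, f2),      # hypothetical helper
        'feature': feature_matching(f1, f2),  # hypothetical helper
        'category': [img_classification(f1), img_classification(f2)],
    })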
TODO
- I thought I could do some face recognition. Maybe next time.
- Adding GPU support can significantly speed up the category matching tasks.