Image recognition. Deep learning. And cake.

It's not easy. But that doesn't mean it can't be done.

This post is about how we played around with image recognition to find out what we can do with it. Technical details are kept to a bare minimum.

How it all began

It all started at bol.com's hackathon, an event where we take the time to build or learn whatever we like using new technologies and cool stuff, and also to stuff ourselves with all the free food (don't worry, it's NOT an all-you-can-eat event. IT folk need to be fed, right?).

While I was enjoying my "I stopped counting after 3" chocolate cake, I found myself thinking of the expression "easy as a piece of cake!". No idea why that popped into my head ... The brain does what it wants ...

Suddenly, I had a flashback to my first weeks in the Netherlands. I'm not all that quick at learning foreign languages, and Dutch had me baffled. Imagine how difficult it is to look for products you want to buy if you don't even know what they're called. To be honest, most of the time I didn't even know the English words. I wished I could just hold my camera in front of a product and have all related items shown to me. Shopping would have been so much easier.

Another flashback. This time to the bol.com Inspiration day, more specifically to a talk given by Barrie Kersbergen, a data scientist at bol.com. During his talk, Barrie showed what we can do with deep learning, especially in image recognition.

So ... Image recognition. Deep learning. And cake. Could it be a piece of cake? I decided to use my Hackathon time to find out.

Playing around

So I started playing around with Barrie's sample project, which is built on the TensorFlow library. My goal was very simple: distinguish between a bottle, a chair and a laptop.

First, I needed to provide training data. I fed it 20 training samples for each of the bottle, chair and laptop labels:

Then we tested it with:

It labeled all of the images correctly!

Even with the noisy backgrounds, that's pretty amazing, right? With only 20 training samples for each object? But wait! The magic was not in those training images.

Finding the magic

Enter the Inception-v3 model from the TensorFlow library, or the black box that holds the magic. The model was pre-trained for several weeks on multiple GPUs, using an image library from ImageNet, and it can recognize up to 1,000 different types of objects.
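
To get a feel for the black box before retraining anything, here's a minimal sketch that asks the pre-trained model for its top guesses on a single image. It uses today's tf.keras API rather than the original project code, and the file name some_laptop.jpg is made up:

```python
import numpy as np
import tensorflow as tf

# The full pre-trained model, including its 1,000-class final layer.
model = tf.keras.applications.InceptionV3(weights="imagenet")

# Inception-v3 expects 299x299 inputs, rescaled by preprocess_input.
img = tf.keras.utils.load_img("some_laptop.jpg", target_size=(299, 299))
x = tf.keras.applications.inception_v3.preprocess_input(
    np.expand_dims(tf.keras.utils.img_to_array(img), axis=0))

# Print the three most likely ImageNet labels with their confidences.
preds = model.predict(x)
print(tf.keras.applications.inception_v3.decode_predictions(preds, top=3))
```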

Thanks to a technique called transfer learning, we were able to reuse the pre-trained Inception-v3 model with our own training samples. Combined with the DeCAF technique, it allows the final layer of the network to be retrained from scratch. According to TensorFlow: "Though it's not as good as a full training run, this is surprisingly effective for many applications, and can be run in as little as thirty minutes on a laptop, without requiring a GPU."
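
The hackathon itself used Barrie's sample project and TensorFlow's retraining tooling; the following is just a rough sketch of the same idea in modern tf.keras, assuming the images sit in training_images/<label>/ folders:

```python
# A rough sketch of the retraining idea in modern tf.keras (NOT the
# original sample project). Assumes images live in
# training_images/<label>/ folders, e.g. training_images/bottle/.
import tensorflow as tf

# Inception-v3 pre-trained on ImageNet, with its final layer chopped off;
# "avg" pooling turns each image into a single 2048-value feature vector.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = False  # freeze the pre-trained layers

# Only this new final layer is trained from scratch, for our 3 labels.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 20 images per label, resized to Inception-v3's 299x299 input size.
train = tf.keras.utils.image_dataset_from_directory(
    "training_images", image_size=(299, 299), batch_size=8)
train = train.map(lambda x, y: (
    tf.keras.applications.inception_v3.preprocess_input(x), y))

model.fit(train, epochs=10)
```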

There are a few tricks to improve the trained model with the dataset we had. During training, each sample image is randomly flipped, cropped, scaled and changed in brightness. These distortions in the training data bring it closer to what can happen in real life.
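
Here is a sketch of those distortions using TensorFlow's image ops; the exact parameter values are illustrative, not the retraining script's defaults:

```python
import tensorflow as tf

def distort(image):
    # image: a 299x299x3 float tensor
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    # Random crop: keep a slightly smaller region, then scale it back up,
    # which also randomly shifts and rescales the object in the frame.
    image = tf.image.random_crop(image, size=(269, 269, 3))
    image = tf.image.resize(image, (299, 299))
    return image
```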

Understanding the magic

The magic itself is the deep convolutional neural network that is used in the Inception-v3 model.

A convolutional neural network is a network architecture based on 3 ideas: local receptive fields, shared weights and biases, and pooling layers. (If you want to learn more about them, check out this deep learning page; it's clearly written and easy to understand.)
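
As a toy illustration of the three ideas (layer sizes picked for the example, not taken from Inception-v3): each Conv2D filter below is a 5x5 local receptive field whose weights and bias are shared across the whole image, and MaxPooling2D condenses every 2x2 region to its strongest response.

```python
import tensorflow as tf

toy_cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=(5, 5), activation="relu",
                           input_shape=(299, 299, 3)),   # shared weights/bias
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),      # pooling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="softmax"),      # bottle/chair/laptop
])
```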

Without convolution, another way of sliding a window across the image is to search for the target within each window position (template matching, a brute-force search algorithm).
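
For comparison, a brute-force template matcher looks something like this sketch (grayscale NumPy arrays, sum-of-squared-differences scoring):

```python
import numpy as np

def match(image, template):
    # Slide the template over every window position and score each one.
    H, W = image.shape
    h, w = template.shape
    best_score, best_pos = np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            score = np.sum((image[y:y + h, x:x + w] - template) ** 2)
            if score < best_score:
                best_score, best_pos = score, (y, x)
    return best_pos  # top-left corner of the best match
```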

A rough idea of neural networks (can be skipped)

Now it's time to break down those weird terms ...

A neural network or artificial neural network is a learning algorithm inspired by the structure and functional aspects of biological neural networks (Wikipedia). A neural network used in machine learning normally consists of 3 different types of layers: an input layer, hidden layer(s) and an output layer. Each layer contains a number of neurons, which are responsible for consuming input data and producing output data, as determined by their internal weights and biases.

There are a few types of neural networks: a perceptron network, which consumes and produces binary data (step function), or a sigmoid network, which consumes and produces non-binary data (sigmoid function). The sigmoid function ensures that small changes in a neuron's weights and bias lead to small, observable changes in its output.
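
A quick sketch of the difference between the two activation functions:

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1, 0)       # perceptron: binary, jumps from 0 to 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid: smooth, any value in (0, 1)
```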

A deep neural network is a neural network that consists of several hidden layers. A neural network that only contains a single hidden layer is called a shallow network.

What is so fancy about those hidden layers? Simply put, the more hidden layers a neural network has, the higher the level of data combination and abstraction it can handle.

Before convolutional networks, training deep neural networks was typically done with plain feed-forward networks and backpropagation. Those networks rely on gradients of a cost function, and the gradients tend to become unstable (vanishing or exploding) as the network grows deeper and more complex.

Objects can be identified, let's go a bit further

Now we can identify the object in the captured image. But that's only half of the goal: I also want it to determine the specific brand of the identified product, to allow more relevant recommendations.

So I repeated the previous test. But this time I used different brands of the same product type as training data.

There were 20 training samples for each laptop brand:

Then we tested it with:

This time the prediction results looked more like a game of chance :)

Looking at the training data, we can quickly see why it's not easy to identify the correct brand: most laptops have the same shape and sometimes even the same styling. The prediction confidence for an Apple laptop is still very high, more than 90% (someone seems to be doing good design work). For HP and Dell laptops, however, the confidence dropped to around 60% (barely better than a coin flip).

So why did it fail this time?

Understanding the result - identifying the problems

The most obvious way of identifying a laptop's brand is the logo. But as it's randomly located and usually very small, it's difficult to find, even for human beings (at least it is for me).

The training data set is far from enough; with deep learning, more data normally beats a better algorithm. Furthermore, the Inception-v3 model doesn't really help in this case: it was trained on general object types to tell whether an object is a laptop or something else, not to identify the laptop's brand.

Something to experiment with

Of course we're not going to let those problems stop us! What other options can we explore?

Well, the very first and easiest step: feed it more training data! All-you-can-eat style, as much as the machine can handle. This will probably improve the results. (For faster training, one or more high-end GPUs are highly recommended. TensorFlow loves that.) I have an integrated GPU, so besides playing SimCity, it doesn't help much.

If that doesn't work, starting with a clean convolutional neural network can be the next step. Why? It allows us to adjust the size of the local receptive field to match the size of the brand's logo. When max pooling is applied, a larger local receptive field can probably capture most of the logo's information in the image.
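
Purely as a hypothetical sketch, the first layers of such a network might look like this; the 11x11 receptive field and other sizes are guesses meant to roughly cover a logo-sized patch, not tuned values:

```python
import tensorflow as tf

logo_net = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, kernel_size=(11, 11), strides=2,
                           activation="relu", input_shape=(299, 299, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=(4, 4)),  # keep strongest responses
])
```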

What about moving to an SVM to solve this problem? Since the problem has been scaled down to identifying the brand, using an SVM with controlled features may give us a very promising result. Feature extraction then needs to be done manually (which is more complicated, but gives more control). With a reasonably sized training data set and well-chosen features, an SVM just may get the job done.
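
As a sketch of that direction with scikit-learn: here the features are the frozen network's 2048-value bottleneck vectors rather than hand-crafted ones, and the file names and test split are made up for illustration. In the manual variant described above, X would instead hold hand-extracted features (e.g. logo descriptors).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = np.load("bottleneck_features.npy")  # one feature vector per laptop image
y = np.load("brand_labels.npy")         # apple / hp / dell labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

clf = SVC(kernel="rbf", C=1.0)          # RBF-kernel support vector classifier
clf.fit(X_train, y_train)
print("brand accuracy:", clf.score(X_test, y_test))
```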

The end

You've made it to the very end of this post. Well done. And thank you for reading this.