
Machine Learning: Zero Shot Learning

Zero-Shot Learning (ZSL)1 is a form of supervised learning in which the goal is to train a model (or models) to correctly label objects of a type it has never been trained on.

This is done by taking one model's output for the unseen object and feeding it as input to a second, different kind of model, which then makes the prediction.

For instance, imagine you are playing a game in which a teammate holds a picture of an object you can't see, and you have to guess what it is from the descriptions your teammate gives you. Say the picture is of a dog. Your teammate might tell you it is "a mammal, four-legged, brown, furry, has a tail, etc." That description is like the output of the first model.

In your mind you take the description of the object, try to figure out what it most closely resembles, and make your guess. This is akin to the second model. As you can imagine, the deeper your knowledge of animals (say you know 100 characteristics of each: average height, colors, length of fur, length of snout, type of tail, etc.), the more likely you are to guess what the picture shows.

In machine learning we can create this rich model by, for instance, training a model on an encyclopedia of animals. After training, each animal is represented by a vector which could look like {100, 56, 97, 46, 76 … 2}. Each position in the vector represents a characteristic {mammal-ness, big-ness, four-legged-ness, long snout-ness, tail-ness … scaly-ness}. The reason you'd have degrees of a particular characteristic is that you will be comparing that animal with different ones. Cats, for instance, would have a similar degree of mammal-ness to a dog (100), but perhaps not as much big-ness (i.e. smaller, say 32). This output vector is technically called an embedding.
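A toy sketch of such attribute embeddings, with made-up characteristic scores on a 0–100 scale (the animals and numbers here are illustrative, not from any real trained model):

```python
import numpy as np

# Hypothetical attribute embeddings, one vector per animal.
# Each position scores one characteristic:
# {mammal-ness, big-ness, four-legged-ness, tail-ness, scaly-ness}
embeddings = {
    "dog":   np.array([100, 56, 97, 90, 2]),
    "cat":   np.array([100, 32, 97, 92, 2]),
    "horse": np.array([100, 88, 98, 85, 1]),
}

# A cat is fully a mammal (100) but scores lower on big-ness than a horse.
print(embeddings["cat"])
```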

Another way to think about it is that the trained model has extracted, for each animal, a score along each characteristic dimension we know about. Now we can measure how similar two animals are using cosine similarity. Imagine the animals placed on a sphere, closer together the more similar they are to each other. If we drew a line from the centre of the sphere to each of two animals, we could measure the angle between the lines: the smaller the angle, the more similar the animals. A cat and a dog would have a smaller angle between them than a cat and a horse, for example.
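Cosine similarity is just the cosine of that angle between two vectors. A minimal sketch, reusing the toy embeddings from above (again, the numbers are invented for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # 1.0 means the same direction (angle 0), smaller values mean
    # a wider angle and hence less similar animals.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat   = np.array([100, 32, 97, 92, 2])
dog   = np.array([100, 56, 97, 90, 2])
horse = np.array([100, 88, 98, 85, 1])

# In this toy space a cat sits closer to a dog than to a horse.
print(cosine_similarity(cat, dog) > cosine_similarity(cat, horse))
```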

In the game above we would also need an image-processing model that can take a picture of an animal and extract details to hand to the encyclopedia model. But instead of labeling every picture of a dog with the word "dog", we label it with the dog's embedding from the encyclopedia model, and then train the model on those targets.
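A toy sketch of that training setup: the "image features" are random stand-ins for a real feature extractor, and a linear least-squares fit stands in for training a neural network to map features onto the encyclopedia embeddings. Everything here is an assumption for illustration, not a real pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# 6 hypothetical training images, each reduced to 4 made-up features.
image_features = rng.normal(size=(6, 4))

# The training targets are not class labels but the encyclopedia
# embedding of each image's class.
targets = np.array([
    [100, 56, 97, 90, 2],   # dog photo    -> dog embedding
    [100, 56, 97, 90, 2],
    [100, 32, 97, 92, 2],   # cat photo    -> cat embedding
    [100, 32, 97, 92, 2],
    [  5, 10, 97, 80, 95],  # lizard photo -> lizard embedding
    [  5, 10, 97, 80, 95],
], dtype=float)

# Fit a linear map W so that image_features @ W approximates the targets.
W, *_ = np.linalg.lstsq(image_features, targets, rcond=None)

predicted = image_features @ W
print(predicted.shape)  # (6, 5): one embedding-sized vector per image
```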

When the trained model is shown a picture of an animal it was not trained on, it produces an output vector with the same dimensions as the encyclopedia embeddings. That vector occupies a position on the sphere, so we can use cosine similarity to find the encyclopedia animal it is most similar to, and hence apply a label.
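Putting it together, zero-shot labeling is a nearest-neighbour lookup over the encyclopedia embeddings. A sketch with the same toy vectors (the "wolf" prediction is an invented output for a class the image model never saw):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Encyclopedia embeddings (hypothetical attribute scores).
encyclopedia = {
    "dog":    np.array([100, 56, 97, 90, 2]),
    "cat":    np.array([100, 32, 97, 92, 2]),
    "lizard": np.array([  5, 10, 97, 80, 95]),
}

def zero_shot_label(predicted_embedding):
    # Pick the encyclopedia entry whose embedding makes the smallest
    # angle with the image model's predicted vector.
    return max(encyclopedia,
               key=lambda name: cosine_similarity(predicted_embedding,
                                                  encyclopedia[name]))

# Suppose the image model, shown a photo of a wolf (a class absent from
# its training data), outputs something dog-like:
wolf_prediction = np.array([98, 60, 95, 88, 3])
print(zero_shot_label(wolf_prediction))  # prints "dog"
```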

In a sentence, the idea is to use the multidimensionality of embeddings to provide rich descriptions of labeled objects, which can then be used to triangulate a label for a novel object. In practice these embeddings can have thousands of dimensions.

This is also the idea behind transfer learning: we take the embeddings of a richer pre-trained model (richer in terms of dimensions) and map them to the embedding space of a smaller model. If one rich model can serve several smaller models, we save the time and money of training a larger model for each task.


Further reading:

  • Andriy Burkov. The Hundred-Page Machine Learning Book. 2019. Chapter 7.11.
  • Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. Chapter 15.2.
  • Teng Long et al. Zero-shot learning via discriminative representation extraction. Pattern Recognition Letters 109 (2018), 27–34.
  • Gencer Sumbul et al. Fine-Grained Object Recognition and Zero-Shot Learning in Remote Sensing Imagery. 2017. arXiv:1712.03323.

  1. Also sometimes called zero-data learning↩︎