Humans can easily detect and identify objects present in an image. The human visual system is fast and accurate and can perform complex tasks like identifying multiple objects and detect obstacles with little conscious thought. With the availability of large amounts of data, faster GPUs, and better algorithms, we can now easily train computers to detect and classify multiple objects within an image with high accuracy. In this blog, we will explore terms such as object detection, object localization, loss function for object detection and localization, and finally explore an object detection algorithm known as “You only look once” (YOLO).
An image classification or image recognition model simply detect the probability of an object in an image. In contrast to this, object localization refers to identifying the location of an object in the image. An object localization algorithm will output the coordinates of the location of an object with respect to the image. In computer vision, the most popular way to localize an object in an image is to represent its location with the help of bounding boxes. Fig. 1 shows an example of a bounding box.
A bounding box can be initialized using the following parameters:
bx, by :
coordinates of the center of the bounding box
width of the bounding box w.r.t the image width
height of the bounding box w.r.t the image height
An approach to building an object detection is to first build a classifier that can classify closely cropped images of an object. Fig 2. shows an example of such a model, where a model is trained on a dataset of closely cropped images of a car and the model predicts the probability of an image being a car.
Now, we can use this model to detect cars using a sliding window mechanism. In a sliding window mechanism, we use a sliding window (similar to the one used in convolutional networks) and crop a part of the image in each slide. The size of the crop is the same as the size of the sliding window. Each cropped image is then passed to a ConvNet model which in turn predicts the probability of the cropped image is a car.
After running the sliding window through the whole image, we resize the sliding window and run it again over the image again. We repeat this process multiple times. Since we crop through a number of images and pass it through the ConvNet, this approach is both computationally expensive and time-consuming, making the whole process really slow. Convolutional implementation of the sliding window helps resolve this problem.
The YOLO (You Only Look Once) Algorithm
A better algorithm that tackles the issue of predicting accurate bounding boxes while using the convolutional sliding window technique is the YOLO algorithm. YOLO stands for you only look once and was developed in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. It’s popular because it achieves high accuracy while running in real time. This algorithm is called so because it requires only one forward propagation pass through the network to make the predictions.
The algorithm divides the image into grids and runs the image classification and localization algorithm (discussed under object localization) on each of the grid cells. For example, we have an input image of size 256 × 256. We place a 3 × 3 grid on the image (see Fig.).
Next, we apply the image classification and localization algorithm on each grid cell. Do everything once with the convolution sliding window. Since the shape of the target variable for each grid cell is 1 × 9 and there are 9 (3 × 3) grid cells, the final output of the model will be:
Final Output= 3 X 3 X 9 ( Number of grid cells X Output for )
The advantages of the YOLO algorithm is that it is very fast and predicts much more accurate bounding boxes. Also, in practice to get more accurate predictions, we use a much finer grid, say 19 × 19, in which case the target output is of the shape 19 × 19 × 9.
With this, we come to the end of the introduction to object detection. We now have a better understanding of how we can localize objects while classifying them in an image. We also learned to combine the concept of classification and localization with the convolutional implementation of the sliding window to build an object detection system. In the next blog, we will go deeper into the YOLO algorithm, loss function used, and implement some ideas that make the YOLO algorithm better. Also, we will learn to implement the YOLO algorithm in real time.