Computer Vision - UPSCFEVER

Computer vision is one of the areas that's been advancing rapidly thanks to deep learning.

Deep learning computer vision is now helping self-driving cars figure out where the other cars and pedestrians around them are, so as to avoid them.

It is making face recognition work much better than ever before, so that perhaps some of you will be able to unlock a phone, or even a door, using just your face.

And if you look on your cell phone, you have many apps that show you pictures of food, or pictures of a hotel, or just fun pictures of scenery. And some of the companies that build those apps are using deep learning to help show you the most attractive, the most beautiful, or the most relevant pictures.

And I think deep learning is even enabling new types of art to be created.

So, I think there are two reasons I'm excited about deep learning for computer vision, and why I think you might be too. First, rapid advances in computer vision are enabling brand new applications that were simply impossible a few years ago.

And by learning these tools, perhaps you will be able to invent some of these new products and applications.

Second, even if you don't end up building computer vision systems per se, I've found that because the computer vision research community has been so creative and so inventive in coming up with new neural network architectures and algorithms, it is leading to a lot of cross-fertilization into other areas as well.

For example, when working on speech recognition, we sometimes actually took inspiration from ideas in computer vision and borrowed them into the speech literature. So, even if you don't end up working on computer vision, I hope that you find some of the ideas you learn about in this course helpful for some of your algorithms and your architectures.

Here are some examples of computer vision problems we'll study in this course.

[Figure: examples of computer vision problems]

You've already seen image classification, sometimes also called image recognition, where you might take as input, say, a 64 by 64 image and try to figure out: is that a cat?

Another example of a computer vision problem is object detection. So, if you're building a self-driving car, maybe you don't just need to figure out that there are other cars in this image. Instead, you need to figure out the position of the other cars in the picture, so that your car can avoid them.

In object detection, we usually have to not just figure out that there are other objects, say cars, in the picture, but also draw boxes around them, or have some other way of recognizing where in the picture these objects are.

And notice also, in this example, that there can be multiple cars in the same picture, or at least every one of them within a certain distance of your car.

Another example, maybe a more fun one, is neural style transfer. Let's say you have a picture, and you want this picture repainted in a different style.

So in neural style transfer, you have a content image, and you have a style image. The image on the right given below is actually a Picasso. And you can have a neural network put them together to repaint the content image (that is the image on the left below), but in the style of the image on the right, and you end up with the new image at the bottom. So, algorithms like these are enabling new types of artwork to be created.

[Figure: neural style transfer, with the content image on the left, the style image on the right, and the generated image at the bottom]

One of the challenges of computer vision problems is that the inputs can get really big. For example, in previous courses, you've worked with 64 by 64 images. And so that's 64 by 64 by 3 because there are three color channels. And if you multiply that out, that's 12288.

So x, the input feature vector, has dimension 12288. But 64 by 64 is actually a very small image. If you work with larger images, say a 1000 pixel by 1000 pixel image, that's actually just one megapixel. But the dimension of the input features will be 1000 by 1000 by 3, because you have three RGB channels, and that's three million.
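The arithmetic above can be checked with a small Python sketch. The helper name `num_input_features` is hypothetical, introduced here only for illustration, and it assumes the image is unrolled into a single feature vector with three color channels.

```python
def num_input_features(height, width, channels=3):
    """Number of values in an image once it is unrolled into one feature vector.

    Hypothetical helper for illustration; channels=3 assumes an RGB image.
    """
    return height * width * channels


small = num_input_features(64, 64)       # 64 x 64 RGB image
large = num_input_features(1000, 1000)   # 1000 x 1000 RGB image (one megapixel)

print(small)  # 12288
print(large)  # 3000000
```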

[Figure: deep learning on large images]

If you have three million input features, then x will be three-million-dimensional. So if the first hidden layer has just 1000 hidden units, and you use a standard fully connected network, then the weight matrix \( W^{[1]} \) will be a 1000 by 3 million dimensional matrix, because x is now in \( \mathbb{R}^{3M} \), where I'm using 3M to denote three million.

This means that \( W^{[1]} \) will have three billion parameters (1000 × 3M), which is just very, very large.
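As a rough check of that count, here is a minimal Python sketch. The function name `dense_weight_count` is an assumption made for illustration, and the count covers only the weight matrix of the first fully connected layer, not its bias vector.

```python
def dense_weight_count(n_inputs, n_hidden):
    """Entries in a fully connected weight matrix of shape (n_hidden, n_inputs).

    Hypothetical helper for illustration; bias terms are not counted.
    """
    return n_hidden * n_inputs


n_x_small = 64 * 64 * 3        # 12288 input features
n_x_large = 1000 * 1000 * 3    # 3 million input features

w1_small = dense_weight_count(n_x_small, 1000)
w1_large = dense_weight_count(n_x_large, 1000)

print(w1_small)  # 12288000
print(w1_large)  # 3000000000
```

Even the 64 by 64 case already needs about 12 million weights in the first layer; the one-megapixel case needs three billion.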

And with that many parameters, it's difficult to get enough data to prevent a neural network from overfitting.

And also, the computational and memory requirements to train a neural network with three billion parameters are just a bit infeasible.
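To get a feel for the memory side, the sketch below estimates the storage for the first-layer weights alone. It assumes each parameter is stored as a 32-bit float (4 bytes), which the text does not specify, and it ignores gradients, optimizer state, and activations, all of which add to the total during training. The name `weight_memory_gb` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each parameter is a 32-bit float


def weight_memory_gb(n_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Approximate storage for n_params parameters, in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


w1_params = 1000 * (1000 * 1000 * 3)   # 1000 hidden units x 3 million inputs

print(weight_memory_gb(w1_params))  # 12.0
```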

But for computer vision applications, you don't want to be stuck using only tiny little images. You want to use large images. To do that, you need to implement the convolution operation, which is one of the fundamental building blocks of convolutional neural networks.

Let's see what this means, and how you can implement it, in the next section. We'll illustrate convolutions using the example of edge detection.