Let's try to understand the convolutions neural network at some high level and get deeper into the working of the network and how awesome it is. So before diving into the convolutional neural net
What is the most high-level explanation of the convolutional neural net?
Fig:1 Ignore how ugly the image is, the idea is just to give you an overall picture of the conv layers.
This is the most high-level view of a convolutional neural net,
The network takes a batch of images as input, it goes through a process of pre-processing the image by performing some reshaping and data augmentation.
The convolutional layer, this layer extracts the features and these features are passed into the dense layer.
The dense layer looks into the features and provides a class probability on the input image.
The question we are interested in here is
Why do we use a convolutional neural net, why not something else?
What really happens in this convolutional layer?
How are these networks implemented?
Why do we use a convolutional neural net, why not something else?
Am going to write this under the assumption that you understand the basic principles of neural networks. In simple neural networks is a huge stack of liner functions connected to each other. Look into and read about it, if you haven't we will write about it as soon as possible.
Neural networks are a great tool in regression problems, classification problems, in fact, the first Netflix recommendation challenge also used a sequence of neural networks for the recommendation system. For a long time, we had some reluctance in using this tool for machine learning purposes we were not able to build deeper neural networks eventually we fixed it reasonably well. The question we are going to ask here is how good are these neural networks in looking at image data. It turns out neural networks are not so good at capturing image data. Why?
If you look into an image, an image is not just about the sequential arrangement of pixels. The computer sees the image as a sequential arrangement of pixels. But there is more to this arrangement. What distinguishes a cat and dog image? What distinguishes them is the features. What are these features? These features range from the details on what the fur of a cat to how it looks like, that is where is the head, tail, and legs of a cat is located.
A neural network Is not good to capture these specific low-level and high-level features of an image. Which is essentially the building block of the image itself. The network could look at the pattern in the pixel but it completely misses out the relations with the features. Apart from that using a neural network to do a task like this is a very inefficient way.
Fig:2 As you can see in the figure, this is what is mean by high and low-level features. And neural nets are not meant to capture these details.
Fig:3 Consider the scenarios of recognizing the faces, The low-level features look at the tiniest edges, shadows, stokes which make the image, and the high-level features look at the overall structure of the face.
Earlier we used different clustering, classification and weak learner-based algorithms to figure out what is in the scene. But eventually, we realized this is the method to efficiently make vision-based applications. The most recent way of achieving this is through a combination of something called a convolutional layer and the neural network together this is called the convolutional neural network. So in all the computer vision problems, there are 2 parts to it, the feature extraction (which we already discussed what is feature) then the machine learning model. Initially, we were using classical machine learning models for this purpose eg : svm , knn etc. . And the feature extraction was done manually. Now we have a single pipeline of convolutional neural network which automatically detect the features and makes the cv problems easy.
Today we have very deep convolutional neural nets that solve almost a lot of the computer vision application to natural language processing. They have proven their capability in capturing maximum features, which makes it a good solution for object detection problems. The densenet which has more that 100+ layers . These systems are some of the state-of-the-art solutions for a lot the computer vision problems.
Comentarios