In this article, I am going to provide a 30,000-foot view of Neural Networks. The post is written for absolute beginners who are dipping their toes into Machine Learning and Deep Learning.

We will keep this short, sweet and math-free.

## Neural Networks as Black Box

We will start by treating a Neural Network as a magical black box. You don’t know what’s inside the black box. All you know is that it has one input and three outputs. The input is an image of any size, color, kind etc. The three outputs are numbers between 0 and 1. The outputs are labeled “Cat”, “Dog”, and “Other”. The three numbers always add up to 1.

### Understanding the Neural Network Output

The magic it performs is very simple. If you input an image to the black box, it will output three numbers. A perfect neural network would output (1, 0, 0) for a cat, (0, 1, 0) for a dog and (0, 0, 1) for anything that is not a cat or a dog. In reality, though, even a well-trained neural network will not give such clean results. For example, if you input the image of a cat, the number under the label “Cat” could say 0.97, the number under “Dog” could say 0.01 and the number under the label “Other” could say 0.02. The outputs can be interpreted as probabilities. This specific output means that the black box “thinks” there is a 97% chance that the input image is that of a cat and a small chance that it is either a dog or something it does not recognize. Note that the output numbers add up to 1.

This particular problem is called **image classification**; given an image, you can use the label with the highest probability to assign it a class ( Cat, Dog, Other ).
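The classification rule above can be sketched in a few lines. The labels and probability values below are just the hypothetical numbers from the cat example, not output from a real network.

```python
# Pick the class label with the highest probability.
def classify(probs):
    """probs maps each class label to a probability; the values sum to 1."""
    return max(probs, key=probs.get)

# Hypothetical output of the black box for a cat image:
output = {"Cat": 0.97, "Dog": 0.01, "Other": 0.02}
label = classify(output)  # "Cat"
```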

### Understanding the Neural Network Input

Now, you are a programmer, and you may be thinking you could use floats or doubles to represent the outputs of the Neural Network.

How do you input an image?

Images are just arrays of numbers. A 256×256 image with three channels is simply an array of 256×256×3 = 196,608 numbers. Most libraries you use for reading the image will read a 256×256 color image into a contiguous block of 196,608 numbers in memory.

With this new knowledge, we know the input is slightly more complicated. It is actually 196,608 numbers. Let us update our black box to reflect this new reality.
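A quick sketch of that arithmetic, using plain nested lists as a stand-in for a real image (an actual image library would give you the same numbers in one contiguous block):

```python
# A color image is just height x width x channels numbers.
height, width, channels = 256, 256, 3
num_values = height * width * channels  # 196,608

# A tiny stand-in "image": nested lists of pixel values in [0, 255].
image = [[[0 for _ in range(channels)] for _ in range(width)]
         for _ in range(height)]

# Flattened, it is one long run of numbers -- the layout most image
# libraries use in memory.
flat = [v for row in image for pixel in row for v in pixel]
# len(flat) equals num_values, i.e. 196,608
```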

I know what you are thinking. What about images that are not 256×256? Well, you can always convert any image to size 256×256 using the following steps.

**Non-square aspect ratio**: If the input image is not square, you can resize the image so that the smaller dimension is 256. Then, crop 256×256 pixels from the center of the image.

**Grayscale image**: If the input image is not a color image, you can create a 3-channel image by copying the grayscale image into all three channels.

People use many different tricks to convert an image to a fixed size ( e.g. 256×256 ), but since I promised to keep it simple, I won’t go into those tricks. The important thing to note is that any image can be converted into a fixed size image even though we lose some information when we crop and resize an image to that fixed size.
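The resize-then-center-crop geometry can be worked out without any image library. The helper below is a hypothetical sketch that only computes the target dimensions and crop box; a real pipeline would then hand these to whatever imaging library you use.

```python
def resize_then_crop_dims(w, h, target=256):
    """Return (resized_w, resized_h, crop_box): scale so the shorter side
    becomes `target`, then take a target x target crop from the center.
    crop_box is (left, top, right, bottom)."""
    if w < h:
        new_w, new_h = target, round(h * target / w)
    else:
        new_w, new_h = round(w * target / h), target
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return new_w, new_h, (left, top, left + target, top + target)

# A 512x384 landscape photo: the 384 side shrinks to 256, the 341-wide
# result is cropped to the central 256 columns.
new_w, new_h, box = resize_then_crop_dims(512, 384)
```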

### What does it mean to train a Neural Network ?

The black box has knobs that can be used to “tune” it. In technical jargon, these knobs are called weights. When the knobs are in the right position, the neural network gives the right output more often for different inputs.

Training the neural net simply means finding the right knob settings ( or weights ).

### How do you train a Neural Network?

If you had this magical black box but did not know the right knob settings, it would be a useless box.

The good news is that you can find the right knob settings by “training” the Neural Network.

Training a Neural Network is very similar to training a little child. You show the child a ball and tell her that it is a “ball”. When you do that many times with different kinds of balls, the child figures out that it is the shape of the ball that makes it a ball and not the color, texture or size. You then show the child an egg and ask, “What is this?” She responds “Ball.” You correct her that it is not a ball, but an egg. When this process is repeated several times, the child is able to tell the difference between a ball and an egg.

To train a Neural Network, you show it several thousand examples of the classes ( e.g. Cat, Dog, Other ) you want it to learn. This kind of training is called **Supervised Learning** because you are providing the Neural Network an image of a class and explicitly telling it that it is an image from that class.

To train a neural network, we, therefore, need three things.

**Training data**: Thousands of images of each class and the expected output. For example, for all images of cats in this dataset, the expected output is (1, 0, 0).

**Cost function**: We need to know if the current knob setting is better than the previous knob setting. A cost function sums up the errors made by the neural network over all images in the training set. For example, a common cost function is called **sum of squared errors (SSE)**. If the expected output for an image of a cat is (1, 0, 0) and the neural network outputs (0.37, 0.5, 0.13), the squared error made by the neural network on this particular image is (1 − 0.37)² + (0 − 0.5)² + (0 − 0.13)² = 0.6638. The total cost over all images is simply the sum of squared errors over all images. **The goal of training is to find the knob settings that will minimize the cost function**.

**How to update the knob settings**: Finally, we need a way to update the knob settings based on the error we observe over all training images.
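The sum-of-squared-errors cost can be written down directly. The numbers below are the cat example from the text; `predict` stands in for whatever the (untrained) network computes.

```python
def squared_error(expected, actual):
    """Squared error between expected and actual output tuples."""
    return sum((e - a) ** 2 for e, a in zip(expected, actual))

# The cat example: expected (1, 0, 0), network said (0.37, 0.5, 0.13).
err = squared_error((1, 0, 0), (0.37, 0.5, 0.13))  # 0.6638

def total_cost(dataset, predict):
    """SSE over the whole training set: dataset is a list of
    (image, expected_output) pairs, predict maps image -> output tuple."""
    return sum(squared_error(expected, predict(img))
               for img, expected in dataset)
```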

### Training a neural network with a single knob

Let’s say we have a thousand images of cats, a thousand images of dogs, and a thousand images of random objects that are not cats or dogs. These three thousand images are our training set. If our neural network has not been trained, it will have some random knob settings, and when you input these three thousand images, the output will be right only about one in three times.

For simplicity, let’s say our neural network has just one knob. Since we have just one knob, we could test a thousand different knob settings spanning the range of expected knob values and find the best knob setting that minimizes the cost function. This would complete our training.
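The one-knob brute-force search is a few lines of code. The `cost` function below is a hypothetical bowl standing in for the sum-of-squared-errors over the training set, with its minimum placed at 0.42 just for illustration.

```python
# Stand-in cost: a bowl with its minimum at knob = 0.42.
def cost(knob):
    return (knob - 0.42) ** 2 + 1.0

# Brute force: try 1000 settings spanning the expected range [0, 1)
# and keep the one with the lowest cost.
candidates = [i / 1000 for i in range(1000)]
best = min(candidates, key=cost)  # close to 0.42
```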

However, real-world neural networks do not have a single knob. For example, VGG-Net, a popular neural network architecture, has 138 million knobs!

### Training a neural network with multiple knobs

When we had just one knob, we could easily find the best setting by testing all (or a very large number of) possibilities. This quickly becomes unrealistic: even with just three knobs at a thousand settings each, we would have to test a billion (1000³) combinations. Imagine the number of possibilities with something as large as VGG-Net. Needless to say, a brute force search for the optimal knob settings is not feasible.

Fortunately, there is a way out. When the cost function is convex ( i.e. shaped like a bowl ), there is a principled way to iteratively find the best weights using a method called **Gradient Descent**.

### Gradient Descent

Let’s go back to our Neural Network with just one knob and assume that we have a current estimate of the knob setting ( or weight ). If our cost function is shaped like a bowl, we can find the slope of the cost function at that estimate and move a step closer to the optimum knob setting. This procedure is called **Gradient Descent** because we are moving down (descending) the curve based on the slope (gradient). When you reach the bottom of the bowl, the gradient or slope goes to zero and that completes your training. These bowl-shaped functions are technically called convex functions.
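Here is the one-knob version of that loop as a sketch. The bowl `cost(w) = (w − 3)²` and its slope `2(w − 3)` are made-up stand-ins for the real cost; the update simply steps against the slope.

```python
# Gradient descent on a bowl-shaped (convex) cost with one knob.
# cost(w) = (w - 3)**2 has slope 2*(w - 3).
def gradient(w):
    return 2 * (w - 3)

w = 10.0               # random initial guess
learning_rate = 0.1    # size of each step down the slope
for _ in range(200):
    w -= learning_rate * gradient(w)
# w has descended to (very near) the bottom of the bowl at 3,
# where the slope is (almost) zero
```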

How do you come up with the first estimate? You can pick a random number.

**Note**: If you are using popular neural network architectures like GoogleNet or VGG-Net, you can use the weight trained on ImageNet instead of picking random initial weights to get much faster convergence.

Gradient Descent works similarly when there are multiple knobs. For example, when there are two knobs, the cost function is a bowl in 3D. If we place a ball on any part of this bowl, it will roll down to the bottom following the path of the maximum downward slope. This is exactly how gradient descent works. Also, note that if you let the ball roll down at full velocity, it will overshoot the bottom and take much more time to settle there compared to a ball that is rolled down slowly in a more controlled manner. Similarly, while training a neural network, we use a parameter called the **learning rate** to control how quickly the cost converges to its minimum.
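With two knobs, the gradient is simply a pair of slopes, one per knob, and the same update applies to each. The 3D bowl below, with its minimum at (1, −2), is again a made-up stand-in for the real cost; the small learning rate plays the role of rolling the ball down slowly.

```python
# Gradient descent with two knobs: cost(w1, w2) = (w1 - 1)**2 + (w2 + 2)**2
# is a bowl in 3D; its gradient is one slope per knob.
def grad(w1, w2):
    return 2 * (w1 - 1), 2 * (w2 + 2)

w1, w2 = 5.0, 5.0
learning_rate = 0.1    # too large a value would overshoot the bottom
for _ in range(300):
    g1, g2 = grad(w1, w2)
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2
# (w1, w2) settles near the bottom of the bowl at (1, -2)
```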

When we have millions of knobs (weights), the shape of the cost function is a bowl in this higher dimensional space. Even though such a bowl is impossible to visualize, the concept of slope and Gradient Descent works just as well. Therefore, Gradient Descent allows us to converge to a solution thus making the problem tractable.

### Backpropagation

There is one piece left in the puzzle. Given our current knob settings, how do we know the slope of the cost function?

First, let’s remember that the cost function, and therefore its gradient, depends on the difference between the true output and the current output for all images in the training set. In other words, every image in the training set contributes to the final gradient calculation based on how badly the Neural Network performs on those images.

The algorithm used for estimating the gradient of the cost function is called **Backpropagation**. We will cover backpropagation in a future post, and yes, it does involve calculus. You would be surprised, though, that backpropagation is simply the repeated application of the **chain rule** that you might have learned in high school.
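To see the chain rule in miniature, take a made-up one-knob cost that is a composition of two functions; backpropagation applies exactly this step repeatedly, layer by layer, to get the slope with respect to every knob.

```python
# Chain rule: if cost = f(g(w)), then d(cost)/dw = f'(g(w)) * g'(w).
# Toy example: cost(w) = (2*w - 4)**2, i.e. f(u) = u**2 with u = g(w) = 2*w - 4.
def d_cost(w):
    u = 2 * w - 4            # inner function g(w)
    du_dw = 2                # g'(w)
    dcost_du = 2 * u         # f'(u)
    return dcost_du * du_dw  # chain rule

# Cross-check against a direct numerical slope at w = 5:
w, eps = 5.0, 1e-6
numeric = ((2 * (w + eps) - 4) ** 2 - (2 * (w - eps) - 4) ** 2) / (2 * eps)
# d_cost(5.0) gives 24.0, and `numeric` agrees to several decimal places
```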