# Regularization Methods

**Simple vs complex models**

The timeline of where we are

- In this section, we will look at how better regularization methods have accelerated the growth of Deep Learning (DL) over the last decade
- Why do we need **regularization**?
- To answer this question, we must look at a concept known as the **bias-variance trade-off**. The bias that we are speaking about here is different from the bias parameter $b$ that we have seen so far in neural networks
- Consider the following toy data visualisation
- In the above figure, the true relation is **y = f(x), where f(x) = sin(x)**; in practice, however, that is not known to us, so we try to approximate it with models. We write $\hat{f}$ for a model's approximation of $f$.

**Simple** (degree 1): $y = \hat{f}(x) = w_1 x + w_0$

- We assume that the relationship between y and x is a straight line of the form mx + c
- This looks like a very naive assumption.
- It is represented by the Red line in the figure
- The best-fitting Red line is obtained by minimising the error/loss between the predicted points and the actual points
- This is a pretty bad model: even the minimised loss is still far too high

**Complex** (degree 25): $y = \hat{f}(x) = \sum_{i=1}^{25} w_i x^i + w_0$

- This is a degree 25 polynomial, with 26 parameters (including $w_0$)
- It is represented by the Blue curve in the figure
- The Blue curve is obtained the same way, by minimising the error/loss between predicted and actual values
- Here, there is zero error/loss; it is a perfect fit.

- Now, how does this relate to bias and variance, and how does that in turn lead to regularization?
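The comparison between the two models can be tried out with a minimal sketch. It assumes noisy samples of sin(x) on [0, 2π]; the noise level, the 100 training points and the seed are illustrative choices, not values read off the figure. The fit uses NumPy's `Polynomial.fit`, which rescales x internally for numerical stability.

```python
import numpy as np
from numpy.polynomial import Polynomial

def make_toy_data(n_points, noise_std=0.2, seed=0):
    """Sample noisy points around the true relation y = sin(x) (assumed setup)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    y = np.sin(x) + rng.normal(0.0, noise_std, size=n_points)
    return x, y

def mse(model, x, y):
    """Mean squared error between the model's predictions and the targets."""
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = make_toy_data(n_points=100)

# Simple model: degree 1, i.e. w1 * x + w0 (2 parameters).
simple_model = Polynomial.fit(x_train, y_train, deg=1)
# Complex model: degree 25 (26 parameters, including w0).
complex_model = Polynomial.fit(x_train, y_train, deg=25)

print("simple  model, training MSE:", mse(simple_model, x_train, y_train))
print("complex model, training MSE:", mse(complex_model, x_train, y_train))
```

With more training points than parameters, the complex model's training error in this sketch is small rather than exactly zero; it can only reach zero when the curve is able to pass through every training point.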

**7.3.2: Analysing the behaviour of simple and complex models**

What happens if you train using different sets of training data?

- Consider a dataset of say 1000 points. When we train our models (Simple and Complex), we shuffle the dataset and then take different subsets of data (around 100 points each).
- Let us observe how the two models behave when dealing with varying training subsets from the same dataset.

**Simple** (degree 1): $y = \hat{f}(x) = w_1 x + w_0$

- Let us look at how the model behaves for 3 different subsets of 100 points each
- What we can infer from this is that the model is not very sensitive to the training data, i.e. the model doesn't respond much to the particular points given, so all the predicted lines are very similar to each other.

**Complex** (degree 25): $y = \hat{f}(x) = \sum_{i=1}^{25} w_i x^i + w_0$

- Let us look at how the model behaves for 3 different subsets
- Here, we can see that the fitted functions are quite different from each other
- What we can infer from this is that the model is highly sensitive to the training data provided, i.e. the model adapts strongly to the points given, thus producing a different curve each time.

**7.3.3: Bias and Variance**

Let’s define some terms based on our observations.

- Here is the same experiment as conducted above, except for 25 subsets instead of 3.

Let us define the term **Bias**:

$\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$

Here $\hat{f}(x)$ is the model's prediction, $f(x)$ is the true value, and $E$ stands for the expected value of the predictions (the average of the predictions over the different training subsets). Bias is the difference between the expectation of the predicted values and the true value.

| Simple Model | Complex Model |
|---|---|
| In the simple model, the expected value is very different from the true value, leading to a high bias. | In the complex model, the expected value is very similar to the true value, leading to a low bias. |

Let's define another term, **Variance**:

$\text{Variance}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$

As before, $E$ stands for the expected value, which is nothing but the average value. First, we calculate the squared error between each prediction and the average prediction. Then we take the average/expected value of that squared error term.

| Simple Model | Complex Model |
|---|---|
| In the simple model, the average line is very similar to the other lines. The lines all predict very similar values. Thus, the squared error between the lines and the average line will be small, and so its expected value will also be small. This corresponds to a low variance. | In the complex model, the average curve is quite different from the other curves. The curves predict noticeably different values. Thus, the squared error between the curves and the average curve will be large, and so its expected value will also be large. This corresponds to a high variance. |

- The following observations can be made
- **Simple Model**: high bias, low variance
- **Complex Model**: low bias, high variance
- **Ideal Model**: low bias, low variance.
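The two definitions can be estimated numerically. The sketch below is one possible setup, not the exact experiment behind the figures: it assumes a pool of 1000 noisy samples of sin(x), draws 25 subsets of 100 points, fits a polynomial to each subset, and then averages over the fits. The noise level, the seed and the evaluation grid (kept away from the interval edges, where high-degree fits are least stable) are illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

def estimate_bias_variance(degree, n_subsets=25, subset_size=100, seed=0):
    """Estimate bias and variance of a polynomial model over random training subsets."""
    rng = np.random.default_rng(seed)
    # Assumed toy dataset: 1000 noisy samples of the true function f(x) = sin(x).
    x_all = rng.uniform(0.0, 2.0 * np.pi, size=1000)
    y_all = np.sin(x_all) + rng.normal(0.0, 0.5, size=1000)
    # Points at which the fitted models are compared with the true function.
    x_eval = np.linspace(1.0, 2.0 * np.pi - 1.0, 50)

    predictions = []
    for _ in range(n_subsets):
        idx = rng.choice(len(x_all), size=subset_size, replace=False)
        model = Polynomial.fit(x_all[idx], y_all[idx], deg=degree)
        predictions.append(model(x_eval))
    predictions = np.array(predictions)            # shape: (n_subsets, n_eval)

    mean_prediction = predictions.mean(axis=0)     # E[f_hat(x)]
    bias = mean_prediction - np.sin(x_eval)        # E[f_hat(x)] - f(x)
    variance = ((predictions - mean_prediction) ** 2).mean(axis=0)
    # Summarise both quantities over the evaluation points.
    return float(np.mean(np.abs(bias))), float(np.mean(variance))

bias_simple, var_simple = estimate_bias_variance(degree=1)
bias_complex, var_complex = estimate_bias_variance(degree=25)
print("simple : |bias| =", bias_simple, " variance =", var_simple)
print("complex: |bias| =", bias_complex, " variance =", var_complex)
```

Both calls use the same seed, so the simple and complex models see the same pool and the same subsets; only the degree differs.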

**Test error due to high bias and high variance**

What is the effect of high bias and high variance on the test error?

- So far, we have been analysing the performance of the models on training data, and determining if they were high/low bias/variance
- The Simple Model failed miserably on the training data, with a very high error/loss value
- The Complex Model, however, performed extremely well. Though it did deviate from the sine function (the true curve), it was still able to fit all the training points, scoring a very low error/loss value

- Let’s look at how it performs on the test dataset
- Consider the simple model
- Let’s look at a visualisation of the test set predictions
- Here, the high bias model does poorly on the test set. This is understandable: the model performed poorly on the training set, so it was never very likely to perform well on the test set

- Consider the complex model
- Let’s look at a visualisation of the test set predictions
- Here, the high variance model also shows a high test error, unlike its training set performance. This is because the model over-familiarised itself with the training set, to the point that it was unable to successfully predict new points from the test set.

- Let us look at how training and test error vary with model complexity
- From the above figure, we can make the following observations
- For simpler/high-bias models, the training and test error are both very high. This is because the model has not adjusted in accordance with the inputs given. It can be said that the model is under-fitting.
- For complex/high-variance models, the training error is low but the test error is high. This is because the model has adjusted too much to the training inputs given, thereby not being able to predict any new points well. It can be said that the model is overfitting.
- The sweet-spot of model-complexity is the perfect trade-off between bias and variance. It is characterised by low training and test error.
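The shape of these curves can be sketched by sweeping the polynomial degree and recording both errors. This is an illustrative setup under assumptions (noisy sin(x) data, 100 training points, 1000 test points, an arbitrary list of degrees); it is not the data behind the figure.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def sample(n_points, noise_std=0.3):
    """Draw noisy samples of the assumed true function sin(x) on [0, 2*pi]."""
    x = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    return x, np.sin(x) + rng.normal(0.0, noise_std, size=n_points)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = sample(100)
x_test, y_test = sample(1000)

degrees = [1, 3, 5, 10, 25]        # increasing model complexity
train_errors, test_errors = [], []
for degree in degrees:
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_errors.append(mse(model, x_train, y_train))
    test_errors.append(mse(model, x_test, y_test))

for degree, train_err, test_err in zip(degrees, train_errors, test_errors):
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

The degree-1 row should show the under-fitting pattern (both errors high), while at degree 25 the training error drops below the test error, which is the gap associated with overfitting.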

**Overfitting in deep neural networks**

Why do we care about the bias-variance trade-off in the context of Deep Learning?

- Consider the same image from the previous section
- Deep Neural Networks are highly complex models (many parameters and many non-linearities)
- Easy to overfit (drive training error to 0)
- The aim is to maintain the model complexity near the sweet-spot and not have it get too complex.
- How do we deal with this in practice in Deep Neural Networks? Let’s look at some of the recommended practices
- Divide data into train, test and validation/development splits
- Good ratios would be (60:20:20) or (70:20:10) in the order train:validation:test
- Never handle the test data except during the final evaluation. All other evaluation must be done with the training set first then the validation set.
- Training data is used to minimise the loss/error
- Validation data is used to check if the model has become too complex or not.
- We must aim to get a good score during evaluation of the validation set

- Start with some network configuration (say, 2 hidden layers, 50-100 neurons each)
- Make sure that you are using the:
- Right activation function (tanh(RNN), ReLU(CNN), leaky ReLU(CNN))
- Right initialisation method (Xavier, He)
- Right optimization method (say Adam)

- Monitor the training and validation error, as in the figure in point number 1, and act on them as follows.

| Training Error | Validation Error | Cause | Solution |
|---|---|---|---|
| High | High | High Bias | Increase model complexity; train for more epochs |
| Low | High | High Variance | Add more training data (dataset augmentation); use regularization; use early stopping (train less) |
| Low | Low | Perfect trade-off | You are done! |
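The data split and the table above can be sketched in a few lines. The helper names `train_val_test_split` and `diagnose`, as well as the `high_threshold` cut-off, are hypothetical and only serve this illustration; what counts as a "high" error depends on the problem.

```python
import numpy as np

def train_val_test_split(x, y, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the data once, then cut it into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_train = round(ratios[0] * len(x))
    n_val = round(ratios[1] * len(x))
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]          # whatever remains is the test set
    return ((x[train_idx], y[train_idx]),
            (x[val_idx], y[val_idx]),
            (x[test_idx], y[test_idx]))

def diagnose(train_error, val_error, high_threshold):
    """Map a (training error, validation error) pair to a row of the table.

    `high_threshold` is a hypothetical, problem-specific cut-off for "high".
    """
    train_high = train_error > high_threshold
    val_high = val_error > high_threshold
    if train_high and val_high:
        return "high bias: increase model complexity or train for more epochs"
    if not train_high and val_high:
        return "high variance: add data, use regularization or early stopping"
    if not train_high and not val_high:
        return "good trade-off"
    # Not a row of the table: training error high but validation error low.
    return "unusual: check the data splits"

x = np.arange(1000)
y = 2 * x
train, val, test = train_val_test_split(x, y)
print(len(train[0]), len(val[0]), len(test[0]))
print(diagnose(train_error=0.02, val_error=0.40, high_threshold=0.10))
```

The test split is created once and then left alone until the final evaluation; only the training and validation parts are used while tuning.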

**A detour into hyperparameter tuning**

Is the concept of train/validation error also related to hyperparameter tuning?

- The following image shows us all the variables under our control when configuring a DNN
- To determine the ideal combination of variables when configuring a DNN, it is recommended to analyse the curves shown in the figure above
- We need to minimise the difference between train and validation error based on monitoring the curves plotted above.
- Parameters are variables you learn from the data, i.e. weights, biases etc
- Hyper-Parameters are variables that you figure out during experiments on the model, by analysing the error and other evaluators.
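The distinction can be made concrete with a small sketch. It assumes a linear model trained by gradient descent on synthetic data (all values below are made up for illustration): the weights `w` and bias `b` are parameters learned from the training data, while the learning rate is a hyperparameter that we pick by comparing validation errors.

```python
import numpy as np

def train_linear_model(x, y, learning_rate, n_steps):
    """Gradient descent on mean squared error; returns the learned parameters."""
    w = np.zeros(x.shape[1])      # parameters: learned from the data
    b = 0.0
    for _ in range(n_steps):
        error = x @ w + b - y
        gradient_w = 2.0 * (x.T @ error) / len(y)
        gradient_b = 2.0 * error.mean()
        w = w - learning_rate * gradient_w
        b = b - learning_rate * gradient_b
    return w, b

def mse(w, b, x, y):
    return float(np.mean((x @ w + b - y) ** 2))

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + 0.3 + rng.normal(0.0, 0.1, size=200)
x_train, y_train = x[:120], y[:120]
x_val, y_val = x[120:], y[120:]

# Hyperparameter candidates: chosen by us, compared on the validation set.
candidate_learning_rates = [1e-4, 1e-2, 1e-1]
val_errors = {}
for lr in candidate_learning_rates:
    w, b = train_linear_model(x_train, y_train, learning_rate=lr, n_steps=200)
    val_errors[lr] = mse(w, b, x_val, y_val)

best_lr = min(val_errors, key=val_errors.get)
print("validation errors:", val_errors)
print("selected learning rate:", best_lr)
```

The same loop extends to other hyperparameters (number of layers, number of neurons, and so on) by training one model per candidate configuration.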

**L2 regularization**

What is the intuition behind L2 regularization?

- Consider the error curves for training and test set
- In the case of squared error loss: $L_{train}(\theta) = \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2$
- Where $\theta = [W_{111}, W_{112}, \dots, W_{Lnk}]$ and $\hat{f}(x_i)$ is the model's prediction for $x_i$
- Our aim has been to minimise the loss function: $\min_{\theta} L(\theta)$

- Now, imagine that we include a new term in the minimisation: $\min_{\theta} L(\theta) = L_{train}(\theta) + \Omega(\theta)$
- Here, in addition to minimising the training loss, we are also minimising some other quantity that is dependent on our parameters
- In the case of L2 Regularisation, $\Omega(\theta) = \|\theta\|_2^2$ (the sum of the squares of the weights, i.e. the squared L2 norm)
- $\Omega(\theta) = W_{111}^2 + W_{112}^2 + \dots + W_{Lnk}^2$
- Here, we should aim to minimise both $L_{train}(\theta)$ and $\Omega(\theta)$; it wouldn't make sense for either of them to be high.

- What if we set all weights to 0? In this case, the model would not have learned much, therefore $L_{train}(\theta)$ would be high.
- What if we try to minimise $L_{train}(\theta)$ to 0? In this case, it is possible that some of the weights would take on large values, thereby driving the value of $\Omega(\theta)$ high.
- To counter the previous point's shortcoming, we need to minimise $L_{train}(\theta)$ but shouldn't allow the weights to grow too large
- Thus, as shown in the figure, in L2 Regularisation we do not drive the training loss all the way to zero; instead we keep it slightly above zero, so that $\Omega(\theta)$ doesn't become too high
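- This works in the Gradient Descent Algorithm as well
- The algorithm

**Initialise:** $w_{111}, w_{112}, \dots, w_{313}, b_1, b_2, b_3$ randomly

**Iterate over data**

- Compute $\hat{y}$
- Compute $L(w, b)$ (cross-entropy loss function)
- $w_{111} = w_{111} - \eta \Delta w_{111}$
- $w_{112} = w_{112} - \eta \Delta w_{112}$

…

- $w_{313} = w_{313} - \eta \Delta w_{313}$

**Till satisfied**

- The derivative of the loss function w.r.t. any weight is $\Delta W_{ijk} = \frac{\partial L(\theta)}{\partial W_{ijk}}$
- In the case of L2 Regularisation, that value would be $\Delta W_{ijk} = \frac{\partial L_{train}(\theta)}{\partial W_{ijk}} + \frac{\partial \Omega(\theta)}{\partial W_{ijk}}$
- Here, when differentiating the regularisation term, every term except the one containing the concerned weight vanishes, i.e. $\frac{\partial \Omega(\theta)}{\partial W_{ijk}} = 2 W_{ijk}$
- So the new derivative term will be $\Delta W_{ijk} = \frac{\partial L_{train}(\theta)}{\partial W_{ijk}} + 2 W_{ijk}$
- PyTorch handles this automatically: autograd computes the derivatives, and optimizers such as `torch.optim.SGD` accept a `weight_decay` argument that adds an L2 penalty on the weights.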
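To see the extra $2W_{ijk}$ term at work, here is a minimal NumPy sketch. It swaps the neural network and cross-entropy loss of the algorithm above for a linear model with mean squared error, purely to keep the example short. The strength factor `alpha` is an assumption added for illustration: `alpha = 1` matches the objective written above, and `alpha = 0` switches the penalty off.

```python
import numpy as np

def gradient_descent(x, y, alpha, learning_rate=0.05, n_steps=2000, seed=0):
    """Minimise mean squared error + alpha * ||w||^2 for a linear model y ~ x @ w."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.1, size=x.shape[1])            # random initialisation
    for _ in range(n_steps):
        grad_train = 2.0 * (x.T @ (x @ w - y)) / len(y)  # derivative of L_train
        grad_penalty = 2.0 * w                           # derivative of Omega = sum of w^2
        w = w - learning_rate * (grad_train + alpha * grad_penalty)
    return w

def train_mse(w, x, y):
    return float(np.mean((x @ w - y) ** 2))

# Synthetic regression data, assumed for illustration only.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 10))
y = x @ rng.normal(size=10) + rng.normal(0.0, 0.5, size=50)

w_plain = gradient_descent(x, y, alpha=0.0)   # no regularisation
w_l2 = gradient_descent(x, y, alpha=1.0)      # with the L2 penalty

print("||w|| without L2:", np.linalg.norm(w_plain), " train MSE:", train_mse(w_plain, x, y))
print("||w|| with L2   :", np.linalg.norm(w_l2), " train MSE:", train_mse(w_l2, x, y))
```

The regularised run ends with smaller weights and a slightly higher training loss, which is the trade described above: the training loss is not pushed to its minimum so that the weights do not grow too large.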

**Dataset Augmentation and Early Stopping**

- What is the intuition behind dataset augmentation?
- Let’s look at the train-validation error curves as drawn in the previous explanations
- If our original dataset size is small, then it becomes easy to drive the training error to zero (too many parameters for very little data). This is because the parameters learn from the data too well to the point of overfitting. Here we will see a low train error and a high validation error
- Augmenting with more data will make it harder to drive the training error to zero
- Data augmentation can be used to obtain multiple data points from a single input, by performing operations such as blurring, cropping, translating (moving horizontally or vertically) etc. The benefit of this is that no extra effort needs to be made in labelling the data, as all the augmented images have the same label as the original image
- By augmenting the data, we might also end up seeing data which is similar to the validation/test data (hence effectively reducing the validation/test error)
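A minimal sketch of label-preserving augmentation, assuming greyscale images stored as 2D NumPy arrays. The helpers `translate`, `box_blur` and `augment` are hypothetical names written for this illustration; only translation and blurring are shown, and cropping would follow the same pattern.

```python
import numpy as np

def translate(image, dy, dx):
    """Shift the image by (dy, dx) pixels, filling the exposed pixels with zeros."""
    h, w = image.shape
    out = np.zeros_like(image)
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

def box_blur(image):
    """Blur with a 3x3 mean filter; edge pixels are repeated at the border."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    out = np.zeros_like(image, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def augment(image, label):
    """Return several variants of one labelled image; every variant keeps the label."""
    variants = [
        image,
        translate(image, dy=0, dx=2),   # move horizontally
        translate(image, dy=2, dx=0),   # move vertically
        box_blur(image),
    ]
    return [(variant, label) for variant in variants]

rng = np.random.default_rng(0)
image = rng.random((8, 8))              # stand-in for a real training image
samples = augment(image, label=3)
print(len(samples), "training samples obtained from one labelled image")
```

In practice, libraries such as torchvision provide ready-made transforms for this; the sketch only shows why no extra labelling is needed.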

- What is early stopping?
- Look at the image of the error curves to see how early stopping works
- First we keep training our model for a large number of epochs, and keep monitoring the loss.
- With a patience parameter p, say p = 5 epochs, we monitor the validation error up to the current epoch k.
- If the training error continues to decrease but the validation error has not improved during the patience period of 5 epochs, then we stop training and revert to the weights from epoch k-p.
- This can be compared to losing patience while waiting for the loss to decrease.
- Thus, we return the weights corresponding to the epoch with the lowest validation error.
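The patience rule can be sketched as a small function over a recorded list of validation losses. The function name and the example losses are made up for illustration; in a real training loop the weights would be saved every time the validation loss improves, so that the best epoch can be restored.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops at the first epoch k where the validation loss has not
    improved for `patience` epochs; the weights to keep are those of epoch
    k - patience, i.e. the epoch with the lowest validation loss so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch      # improvement: remember this epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch                 # patience exhausted: stop here
    return best_epoch, len(val_losses) - 1           # never ran out of patience

# Made-up validation losses for epochs 0, 1, 2, ...
val_losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.52, 0.51, 0.53, 0.54, 0.6, 0.4]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=5)
print("stop at epoch", stop_epoch, "and restore the weights from epoch", best_epoch)
```

Note that the last loss in the list (0.4) is never seen, because training has already stopped; this is the price of a finite patience.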

**Summary**

Let’s look at where we are now

- We have covered a lot of interesting topics in regularization
- We haven’t covered the regularization methods such as dropout & batch-normalisation, but they will be covered as we move forward
- The next few sections will be more hands-on, and we will get to start working with PyTorch and CNNs
- The next contest will cover all of the following concepts