In the previous post on overfitting and underfitting, we mentioned that one of the most popular ways to reduce overfitting is to restrict the weights. This technique is called regularization.
Regularization (weight regularization)
- Weight regularization refers to techniques that restrict weights from becoming too large.
- Regularizing weights improves the model's generalization performance.
- A model cannot be considered to have good performance if it fixates on certain data points and fails to adapt to new data.
- This is precisely what we mean when we say a model is not generalized.
- In such cases, using regularization to restrict weights can improve generalization performance by preventing the model from fixating on specific data points.
- The most representative regularization methods are L1 regularization and L2 regularization.
L1 Regularization
L1 regularization adds the L1 norm of the weights, that is, the sum of their absolute values, to the loss function.
$ \|w\|_1 = \sum_{i=1}^n |w_i| $
- Above is the L1 norm.
- Lowercase letters represent vectors.
- Conversely, bold capital letters typically represent matrices.
- Since n in the L1 norm is the number of weights, L1 regularization can be understood as 'adding the sum of the absolute values of the weights to the loss function'.
- L1 regularization is formed by multiplying the L1 norm by the parameter $\alpha$, which controls the amount of regularization, and adding the result to the loss function.
$L = -(y\log(a) + (1-y)\log(1-a)) + \alpha\sum_{i=1}^n |w_i|$
- $\alpha$ is a hyperparameter that controls the regularization strength.
- When $\alpha$ is large, the sum of the absolute weights must decrease to keep the total loss from growing too large.
- This is called stronger regularization because the weights are pushed to be smaller.
- Conversely, if $\alpha$ is small, the loss does not increase much even when the sum of the absolute weights grows.
- In other words, regularization becomes weaker.
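The effect of $\alpha$ on the loss can be checked numerically. The following is a minimal sketch, not code from the original post: the function name `l1_logistic_loss` and the sample values are illustrative assumptions, and the loss is computed for a single sample.

```python
import numpy as np

def l1_logistic_loss(w, b, x, y, alpha):
    """Logistic loss for one sample plus alpha times the L1 norm of w."""
    z = np.dot(w, x) + b
    a = 1.0 / (1.0 + np.exp(-z))            # sigmoid activation
    a = np.clip(a, 1e-10, 1 - 1e-10)        # avoid log(0)
    data_loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))
    penalty = alpha * np.sum(np.abs(w))     # the bias b is not penalized
    return data_loss + penalty

# Illustrative values (assumptions, chosen only for the demonstration)
w = np.array([0.5, -2.0, 1.5])
x = np.array([1.0, 0.2, -0.3])
b, y = 0.1, 1

for alpha in (0.0, 0.01, 0.1, 1.0):
    print(alpha, l1_logistic_loss(w, b, x, y, alpha))
```

With the same weights, the printed loss grows as `alpha` grows, so minimizing it favors smaller weights when `alpha` is large.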
Derivative of L1 Regularization for Gradient Descent
$\frac{\partial L}{\partial w} = -(y - a)x + \alpha \cdot \operatorname{sign}(w)$
- Differentiating the absolute value $|w|$ with respect to $w$ leaves only the sign of $w$, written as sign(w): $+1$ for positive $w$ and $-1$ for negative $w$. Strictly speaking, $|w|$ is not differentiable at $w = 0$; np.sign returns 0 there.
$w = w - \eta\frac{\partial L}{\partial w} = w + \eta((y - a)x - \alpha \cdot \operatorname{sign}(w))$
- The weight update equation becomes as shown above.
- The learning rate is applied by multiplying it by the derivative of the L1-regularized loss function.
- In the above equation $\eta$ represents the learning rate.
w_grad += alpha * np.sign(w)
- When optimizing the logistic loss function with L1 regularization using gradient descent, you add the product of the regularization hyperparameter $\alpha$ and the sign of the weights to the gradient used for the update.
- The Python code is shown above.
- An important point to note here is that we don't apply regularization to the bias.
- This is because the bias affects the model differently than weights do.
- Regularizing the bias only shifts the model in a certain direction without affecting its complexity.
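Putting the update equation and the bias rule together, a single gradient descent step might look like the sketch below. The function name `l1_update` and the input values are assumptions made for illustration; only the `alpha * np.sign(w)` term comes from the text above.

```python
import numpy as np

def l1_update(w, b, x, y, alpha, eta):
    """One gradient descent step for logistic regression with an L1 penalty."""
    z = np.dot(w, x) + b
    a = 1.0 / (1.0 + np.exp(-z))            # sigmoid activation
    err = -(y - a)                          # derivative of the logistic loss w.r.t. z
    w_grad = err * x                        # gradient of the data term w.r.t. w
    b_grad = err                            # gradient w.r.t. the bias
    w_grad = w_grad + alpha * np.sign(w)    # L1 penalty term; np.sign(0) is 0
    w_new = w - eta * w_grad
    b_new = b - eta * b_grad                # no penalty is applied to the bias
    return w_new, b_new

# Illustrative values (assumptions)
w = np.array([0.5, -2.0, 0.0])
b = 0.1
x = np.array([1.0, 0.2, -0.3])

w_reg, b_reg = l1_update(w, b, x, y=1, alpha=0.1, eta=0.1)
w_plain, b_plain = l1_update(w, b, x, y=1, alpha=0.0, eta=0.1)
print(w_reg - w_plain)   # equals -eta * alpha * sign(w)
print(b_reg - b_plain)   # the bias update is unaffected by alpha
```

Comparing the regularized step with the unregularized one isolates the penalty: every nonzero weight is moved toward zero by the same amount `eta * alpha`, and the bias is untouched.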
Lasso model
- The same principle can be applied to regression models by adding L1 regularization to the squared-error loss function.
- This model is called Lasso.
- Lasso can reduce weights and even make some weights zero.
- Since features with zero weights are effectively eliminated from the model, this provides a feature selection effect.
- In scikit-learn, the Lasso model is provided as the sklearn.linear_model.Lasso class.
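The feature selection effect can be seen on synthetic data. This is a small sketch under stated assumptions: the data set, the random seed, and `alpha=0.5` are made up for the demonstration. Note also that scikit-learn scales its Lasso objective differently from the per-sample formula in this post, so its `alpha` is not directly comparable to the $\alpha$ above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 5 features, only the first two affect the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.5)   # alpha is the regularization strength
lasso.fit(X, y)

print(lasso.coef_)                    # learned weights
print(np.flatnonzero(lasso.coef_))    # indices of the features that were kept
```

The weights of the three irrelevant features should come out as zero, while the two informative features keep nonzero (but shrunken) weights.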
L2 Regularization
- As the derivative shows, the L1 penalty contributes $\alpha \cdot sign(w)$ to the gradient, so its size depends only on the regularization hyperparameter $\alpha$.
- In other words, the amount of regularization does not change with the size of the weight: a large weight and a small weight are pushed toward zero by the same amount.
- L2 regularization, by contrast, scales the penalty with the weight itself, which is why it is often considered to regularize better than L1.
- L2 regularization adds the square of the L2 norm of weights to the loss function.
- The formula for the L2 norm is as shown below.
$\|w\|_2 = \sqrt{\sum_{i=1}^n w_i^2}$
- The logistic loss function with L2 regularization applied is as follows.
$L = -(y\log(a) + (1 - y)\log(1 - a)) + \frac{1}{2}\alpha\sum_{i=1}^n w_i^2$
- As with L1 regularization, $\alpha$ is a hyperparameter that controls the amount of regularization, and the 1/2 is simply a coefficient added to make the derivative simpler.
Derivative of L2 Regularization for Gradient Descent
$\frac{\partial L}{\partial w} = -(y - a)x + \alpha \cdot w$
- Differentiating the L2 penalty leaves only $\alpha$ times the weight vector.
$w = w - \eta\frac{\partial L}{\partial w} = w + \eta((y - a)x - \alpha \cdot w)$
- When substituting the derivative result into the weight update equation, it appears as above.
w_grad += alpha * w
- Applying L2 regularization in gradient descent looks like the code above: simply add the product of the regularization hyperparameter $\alpha$ and the weights to the gradient.
- L2 regularization is somewhat more effective than L1 regularization because the weight value itself is included in the gradient calculation, rather than just using the sign of the weight as in L1 regularization.
- Also, L2 regularization doesn't completely reduce weights to zero.
- Making weights exactly zero has the effect of excluding features, but it also reduces the model's complexity.
- For these reasons, L2 regularization is more commonly used.
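The difference between the two penalty terms is easy to see side by side. The sketch below uses made-up weights and hyperparameters (assumptions) and looks only at the penalty part of the gradient, ignoring the data term.

```python
import numpy as np

alpha = 0.1
w = np.array([-3.0, -0.01, 0.0, 0.01, 3.0])   # illustrative weights

l1_penalty_grad = alpha * np.sign(w)   # same magnitude for every nonzero weight
l2_penalty_grad = alpha * w            # magnitude scales with the weight

print(l1_penalty_grad)
print(l2_penalty_grad)

# Applying only the L2 penalty repeatedly shrinks a weight geometrically:
# each step multiplies it by (1 - eta * alpha), so it never reaches exactly zero.
eta = 0.1
w_l2 = 1.0
for _ in range(100):
    w_l2 -= eta * alpha * w_l2
print(w_l2)
```

The L1 term pushes a weight of 3.0 and a weight of 0.01 by the same amount, while the L2 term pushes large weights hard and small weights only slightly.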
Ridge model
- A regression model with L2 regularization applied is called a Ridge model.
- Scikit-learn provides the Ridge model as the sklearn.linear_model.Ridge class.
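A short sketch of how the regularization strength affects Ridge follows. The synthetic data, the seed, and the `alpha` values are assumptions for illustration. scikit-learn's Ridge penalizes the squared L2 norm of the weights without the 1/2 factor used in this post, so its `alpha` is scaled differently from the $\alpha$ above.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption): 5 features, only the first two affect the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

coef_norms = {}
for alpha in (0.1, 10.0, 1000.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    coef_norms[alpha] = np.linalg.norm(ridge.coef_)   # overall size of the weights
    print(alpha, np.round(ridge.coef_, 3))
```

As `alpha` increases, the weights shrink toward zero as a group rather than being removed one by one.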
And more
- There are many more techniques for dealing with the overfitting problem:
- Dropout
- Data augmentation
- Early stopping
Don't let the opinions of others define your worth. Believe in yourself and your abilities.
- Max Holloway -