Weight Initialization?
- When training a neural network, one of the things we should care about is where we should start on the loss function to reach the optimal solution more easily and quickly.
- When training a neural network, the method of determining the starting position on the loss function is called Model Initialization.
- In particular, weights account for the largest portion of the model's parameters.
- The learning performance varies greatly depending on the weight initialization method.
- A good initialization also helps prevent the vanishing/exploding gradient problem.
1. Zero Initialization (Constant Initialization)
- Sets all weights to 0.
- Problem:
- All neurons output the same values, preventing effective learning.
- The same problem occurs when we initialize with any other constant.
- If all parameter values are the same, even after updating through back propagation, they will all change to the same value.
- If all parameters of neural network nodes are identical, having multiple nodes in the neural network becomes meaningless.
- This is because it effectively becomes equivalent to having just one node per layer.
- Initializing all weights to the same constant creates symmetry in the neural network, causing all neurons in the same layer to operate identically.
- Therefore, initial values must be set randomly.
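To see the symmetry problem concretely, here is a minimal NumPy sketch (the toy network, the random data, and the constant value 0.5 are all assumptions for illustration): with constant weights, every hidden neuron computes the same output and receives the same gradient, so the neurons never become different from each other.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))              # 4 samples, 3 input features
y = rng.normal(size=(4, 1))              # regression targets

W1 = np.full((3, 5), 0.5)                # constant-initialized hidden weights
W2 = np.full((5, 1), 0.5)                # constant-initialized output weights

h = np.tanh(x @ W1)                      # every hidden column is identical
err = h @ W2 - y                         # prediction error
grad_W1 = x.T @ ((err @ W2.T) * (1 - h ** 2))   # backprop through tanh

# All 5 columns of grad_W1 are identical, so after any gradient step
# the 5 hidden neurons remain exact copies of each other.
print(np.allclose(grad_W1, grad_W1[:, :1]))     # True
```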
2. Random Initialization
- Initializes with small random values.
- The easiest way to assign random(different) values to parameters is to use a probability distribution.
- Typically drawn from a uniform distribution over a small range (e.g., -0.01 to 0.01) or a normal distribution with a small standard deviation (e.g., 0.01).
- Simple but may cause gradient vanishing/exploding problems in deep networks.
- Details:
- For example, we can set all weights differently by assigning values that follow a normal distribution.
- When activation values are close to 0 or 1, the derivative of the logistic (sigmoid) function is nearly zero.
- This leads to the vanishing gradient phenomenon, where learning stalls.
- So we can reduce the standard deviation to keep values from being pushed toward these extremes.
- This mitigates the vanishing gradient effect.
- However, most activation outputs then end up clustered around 0.5.
- As mentioned in the constant initialization, when all nodes' activation function outputs are similar, having multiple nodes loses its meaning.
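Both failure modes above can be observed directly. The following sketch (the 5-layer sigmoid MLP, the layer width of 100, and the two standard deviations are assumptions for illustration) pushes random data through layers initialized with a large and a small standard deviation: the large one saturates activations near 0 and 1, while the small one collapses everything around 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 100))             # 1000 samples, 100 units per layer

for std in (1.0, 0.01):
    a = x
    for _ in range(5):                       # 5 sigmoid layers, 100 units each
        W = rng.normal(0.0, std, size=(100, 100))
        a = sigmoid(a @ W)
    # std=1.0  -> activations pile up near 0 and 1 (saturation, vanishing gradients)
    # std=0.01 -> activations collapse around 0.5 (neurons become near-identical)
    print(f"std={std}: mean={a.mean():.3f}, spread={a.std():.3f}")
```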
3. Xavier/Glorot Initialization
- Xavier initialization is an initialization method designed to solve the problems mentioned above.
- Xavier initialization doesn't use a fixed standard deviation.
- Instead, it adjusts based on the number of nodes in the previous hidden layer.
- When there are n nodes in the previous hidden layer and m nodes in the current hidden layer, weights are initialized using a normal distribution with a standard deviation of √(2 / (n + m)).
- Formula: W = random_normal(0, sqrt(2/(fan_in + fan_out)))
- fan_in: number of input nodes(number of neurons in previous layer)
- fan_out: number of output nodes(number of neurons in current layer)
- As a result, activation values will be spread much more evenly than in the previous 2 methods.
- Because weights are initialized according to the number of nodes in each layer, it's much more robust than using a fixed standard deviation, even when setting different numbers of nodes per layer.
- Suitable for tanh or sigmoid activation functions.
- But not suitable for ReLU.
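A minimal sketch of the formula above, assuming a plain NumPy implementation and an arbitrary example layer with 256 inputs and 128 outputs:

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng=None):
    """Draw weights from N(0, 2 / (fan_in + fan_out))."""
    rng = rng or np.random.default_rng()
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = xavier_normal(256, 128)        # a layer with 256 inputs and 128 outputs
print(W.std())                     # close to sqrt(2 / (256 + 128)) ~= 0.072
```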
4. He Initialization
- Suitable for ReLU family activation functions.
- Very effective for ReLU and widely used in deep learning.
- Uses twice the variance of Xavier (2/fan_in versus Xavier's 1/fan_in when fan_in and fan_out are equal).
- Even as the layers get deeper, all activation values maintain an even distribution.
- Formula: W = random_normal(0, sqrt(2/fan_in))
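To illustrate the claim that activations keep an even spread as layers get deeper, here is a rough comparison sketch (the 10-layer ReLU stack, the width of 256, and the random input batch are assumptions for illustration): with He's standard deviation the activation scale stays roughly constant, while Xavier's smaller standard deviation lets it shrink layer by layer under ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 256))                 # random input batch

inits = {
    "xavier": np.sqrt(2.0 / (256 + 256)),        # std for a 256 -> 256 layer
    "he":     np.sqrt(2.0 / 256),
}
for name, std in inits.items():
    a = x
    for _ in range(10):                          # 10 ReLU layers
        W = rng.normal(0.0, std, size=(256, 256))
        a = np.maximum(a @ W, 0.0)               # ReLU
    # He keeps the activation scale roughly constant; Xavier lets it shrink
    # by about a factor of sqrt(2) per ReLU layer.
    print(f"{name}: activation std after 10 layers = {a.std():.4f}")
```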
5. LeCun Initialization
- Suitable for SELU activation function.
- SELU activation function:
- It was designed to have self-normalizing characteristics in deep learning networks.
- Automatically normalizes input values' mean and variance.
- Maintains consistent mean and variance of activations during training.
- Mitigates the vanishing/exploding gradient problem.
- Enables stable training of deeper networks.
- Fast learning speed.
- Model becomes simpler as normalization layers are not needed.
- LeCun initialization uses a normal distribution with mean 0 and variance 1/fan_in.
- Formula: W = random_normal(0, sqrt(1/fan_in))
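A minimal sketch of LeCun normal initialization following the formula above (the layer sizes are arbitrary examples):

```python
import numpy as np

def lecun_normal(fan_in, fan_out, rng=None):
    """Draw weights from N(0, 1 / fan_in)."""
    rng = rng or np.random.default_rng()
    std = np.sqrt(1.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = lecun_normal(256, 128)
print(W.std())                     # close to sqrt(1 / 256) ~= 0.0625
```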
6. Orthogonal Initialization (Orthogonal Matrix Initialization)
- In RNN(Recurrent Neural Network)s, orthogonal initialization is used.
- Orthogonal initialization ensures that the hidden-to-hidden weights don't blow up or shrink the hidden state when multiplied repeatedly across time steps: an orthogonal matrix has all singular values equal to 1, so it preserves the norm of any vector it multiplies.
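A small sketch of the idea (the QR-based construction and the 64-unit linear recurrence are assumptions for illustration): an orthogonal recurrent weight matrix preserves the norm of the hidden state under repeated multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 64))
Q, _ = np.linalg.qr(A)                 # Q is orthogonal: Q.T @ Q == I
W_hh = Q                               # hidden-to-hidden recurrent weights

h = rng.normal(size=64)                # initial hidden state
norm_before = np.linalg.norm(h)
for _ in range(100):                   # 100 steps of the linear recurrence
    h = W_hh @ h
norm_after = np.linalg.norm(h)

# The norm is preserved (up to floating-point error): no explosion or vanishing.
print(norm_before, norm_after)
```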
Reference
I stay ready so I don't have to get ready.
- Conor McGregor -