
[Kaggle Extra Study] 13. Weight Initialization

dongsunseng 2024. 11. 9. 15:31

Weight Initialization?

  • When training a neural network, one of the things we should care about is where on the loss surface we start, so that we can reach the optimal solution more easily and quickly.
  • The method of determining this starting position on the loss function is called Model Initialization.
  • In particular, weights account for the largest portion of the model's parameters.
  • The learning performance varies greatly depending on the weight initialization method.

1. Zero Initialization (Constant Initialization)

  • Sets all weights to 0.
  • Problem:
    • All neurons output the same values, preventing effective learning.
  • The same problem occurs when we initialize with other constants.
  • If all parameter values are the same, then even after updating through backpropagation, they will all change to the same value.
  • If all parameters of neural network nodes are identical, having multiple nodes in the neural network becomes meaningless. 
    • This is because it effectively becomes equivalent to having just one node per layer.
    • Initializing all weights to the same constant creates symmetry in the neural network, causing all neurons in the same layer to operate identically (see the sketch after this list).
    • Therefore, initial values must be set randomly.
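Below is a minimal NumPy sketch (not from the original post; the layer sizes and data are made up) of the symmetry problem described above: with zero-initialized weights, every hidden unit computes the same output and receives the same gradient, so the units never become different from one another.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))          # 8 samples, 4 input features (dummy data)
y = rng.normal(size=(8, 1))          # dummy regression targets

W1 = np.zeros((4, 3))                # zero initialization: all weights identical
W2 = np.zeros((3, 1))

h = sigmoid(x @ W1)
print(np.allclose(h[:, 0:1], h))     # True: all 3 hidden units output the same value

# One gradient step with a squared-error loss: the identical hidden units
# receive identical gradients, so the symmetry is never broken.
out = h @ W2
grad_out = 2 * (out - y) / len(x)
grad_W2 = h.T @ grad_out
grad_h = grad_out @ W2.T
grad_W1 = x.T @ (grad_h * h * (1 - h))
W1 -= 0.1 * grad_W1
W2 -= 0.1 * grad_W2
print(np.allclose(W1[:, 0:1], W1))   # True: the columns of W1 are still identical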

2. Random Initialization

  • Initializes with small random values.
  • The easiest way to assign random(different) values to parameters is to use a probability distribution.
    • Typically drawn from a uniform distribution between -0.01 and 0.01, or from a normal distribution with a comparably small standard deviation.
  • Simple but may cause gradient vanishing/exploding problems in deep networks.
  • Details:
    • For example, we can set all weights differently by assigning values that follow a normal distribution.
    • When activation values are close to 0 or 1, the derivative of the logistic (sigmoid) function is nearly zero.
    • This leads to the vanishing gradient phenomenon, where learning does not occur.
    • So we can reduce the standard deviation to prevent the activations from being pushed toward these extremes.
    • In this way, we can avoid the vanishing gradient effect.
    • However, most output values then end up clustered around 0.5 (see the sketch after this list).
    • As mentioned for constant initialization, when all nodes' activation function outputs are similar, having multiple nodes loses its meaning.
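As a rough illustration of the trade-off above (a sketch with made-up layer sizes, not code from the original post), the snippet below draws weights from a normal distribution with a small fixed standard deviation and pushes data through several sigmoid layers; the printed statistics show the activations collapsing toward 0.5.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
a = rng.normal(size=(1000, 100))                 # 1000 samples, 100 features

for layer in range(5):                           # 5 hidden layers of 100 units
    W = rng.normal(0.0, 0.01, size=(100, 100))   # small fixed std = 0.01
    a = sigmoid(a @ W)
    print(f"layer {layer}: mean={a.mean():.3f}, std={a.std():.3f}")
# The small std keeps the pre-activations away from the saturated regions of the
# sigmoid, but the outputs all cluster around 0.5, so the units look nearly identical.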

3. Xavier/Glorot Initialization

  • Xavier initialization is an initialization method designed to solve the problems mentioned above.
  • Xavier initialization doesn't use a fixed standard deviation.
    • Instead, it adjusts based on the number of nodes in the previous hidden layer.
    • When there are n nodes in the previous hidden layer and m nodes in the current hidden layer, weights are initialized using a normal distribution with a standard deviation of √(2 / (n + m)) (see the sketch after this list).
    • Formula: W = random_normal(0, sqrt(2/(fan_in + fan_out)))
      • fan_in: number of input nodes(number of neurons in previous layer)
      • fan_out: number of output nodes(number of neurons in current layer)
  • As a result, activation values will be spread much more evenly than in the previous 2 methods.
    • Because weights are initialized according to the number of nodes in each layer, it's much more robust than using a fixed standard deviation, even when setting different numbers of nodes per layer.
  • Suitable for tanh or sigmoid activation functions.
  • But not suitable for ReLU.
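A minimal sketch of the Xavier/Glorot normal formula above (the layer sizes here are arbitrary examples, not values from the post):

import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(fan_in, fan_out):
    # Glorot/Xavier normal: std = sqrt(2 / (fan_in + fan_out))
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = xavier_normal(256, 128)   # e.g. a 256 -> 128 dense layer
print(W.std())                # close to sqrt(2 / 384) ≈ 0.072

Most frameworks ship this as a built-in initializer (e.g. glorot_normal in Keras or torch.nn.init.xavier_normal_ in PyTorch), so the hand-written version above is only for illustration.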

4. He Initialization

  • Suitable for ReLU family activation functions.
    • Very effective for ReLU and widely used in deep learning.
  • Uses twice the variance of Xavier: 2/fan_in instead of 2/(fan_in + fan_out), which compensates for ReLU zeroing out roughly half of the activations.
  • Even as the layers get deeper, all activation values maintain an even distribution.
  • Formula: W = random_normal(0, sqrt(2/fan_in))
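A minimal sketch of the He normal formula above, following the same pattern as the Xavier example (layer sizes are again arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    # He normal: std = sqrt(2 / fan_in), i.e. variance 2 / fan_in
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_normal(256, 128)       # e.g. a 256 -> 128 dense layer feeding a ReLU
print(W.std())                # close to sqrt(2 / 256) ≈ 0.088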

5. LeCun Initialization

  • Suitable for SELU activation function.
    • SELU activation function: 
      • It was designed to have self-normalizing characteristics in deep learning networks.
        • Automatically normalizes input values' mean and variance.
        • Maintains consistent mean and variance of activations during training.
      • Solves gradient vanishing/explosion problem.
      • Enables stable training of deeper networks.
      • Fast learning speed.
      • Model becomes simpler as normalization layers are not needed.
  • Uses normal distribution with mean 0 and variance 1/fan_in.
  • Formula: W = random_normal(0, sqrt(1/fan_in))
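A minimal sketch of the LeCun normal formula above (arbitrary layer sizes, for illustration only):

import numpy as np

rng = np.random.default_rng(0)

def lecun_normal(fan_in, fan_out):
    # LeCun normal: std = sqrt(1 / fan_in), i.e. variance 1 / fan_in
    std = np.sqrt(1.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = lecun_normal(256, 128)    # e.g. a 256 -> 128 dense layer feeding SELU
print(W.std())                # close to sqrt(1 / 256) = 0.0625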

6. Orthogonal Initialization

  • In RNN(Recurrent Neural Network)s, orthogonal initialization is used.
  • Orthogonal initialization ensures that the weights applied to the hidden state in recurrent cells don't make it grow or shrink when multiplied repeatedly, because an orthogonal matrix preserves the norm of any vector it multiplies (all of its singular values are 1).
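One common way to build such a matrix is to take the Q factor of a QR decomposition of a random Gaussian matrix; a minimal sketch (the size 128 is a made-up example) follows:

import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(n):
    # QR-decompose a random Gaussian matrix; Q is an orthogonal matrix.
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    # Multiply each column by the sign of r's diagonal so the result is
    # uniformly distributed over orthogonal matrices.
    return q * np.sign(np.diag(r))

W_hh = orthogonal_init(128)   # e.g. hidden-to-hidden weights of an RNN cell
print(np.allclose(W_hh.T @ W_hh, np.eye(128)))   # True: columns are orthonormal

Deep learning libraries also expose this directly (e.g. torch.nn.init.orthogonal_ in PyTorch or the Orthogonal initializer in Keras).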

Reference

 

  • Neural Networks and Deep Learning, DeepLearning.AI (www.coursera.org)
  • Do it! 딥러닝 입문 (Do it! Introduction to Deep Learning), 박해선 (tensorflow.blog)
  • 가중치 초기화 (Weight Initialization) · Data Science (yngie-c.github.io)

I stay ready so I don't have to get ready.
- Conor McGregor -