
[Kaggle Extra Study] 13. Weight Initialization

dongsunseng 2024. 11. 9. 15:31

Weight Initialization?

  • When training a neural network, one of the things we should care about is where we start on the loss surface, so that we can reach the optimal solution more easily and quickly.
  • The method of determining this starting position on the loss function is called Model Initialization.
  • In particular, weights account for the largest portion of the model's parameters.
  • The learning performance varies greatly depending on the weight initialization method.
  • Proper initialization also helps prevent the vanishing / exploding gradients problem.

1. Zero Initialization (Constant Initialization)

  • Sets all weights to 0.
  • Problem:
    • All neurons output the same values, preventing effective learning.
  • The same problem occurs when we initialize with other constants.
  • If all parameter values are the same, even after updating through back propagation, they will all change to the same value.
  • If all parameters of neural network nodes are identical, having multiple nodes in the neural network becomes meaningless. 
    • This is because it effectively becomes equivalent to having just one node per layer.
    • Initializing all weights to the same constant creates symmetry in the neural network, causing all neurons in the same layer to operate identically.
    • Therefore, initial values must be set randomly.
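
A minimal NumPy sketch of this symmetry problem (the layer sizes and the constant 0.5 are arbitrary choices for the demo): every hidden unit computes the same activation and receives the same gradient, so an update step leaves them identical.

```python
import numpy as np

# Tiny 1-hidden-layer network whose weights are all the same constant.
# Every hidden unit computes the same activation and receives the same
# gradient, so the units stay identical after any number of updates.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 features (arbitrary sizes)
y = rng.normal(size=(4, 1))          # regression targets

W1 = np.full((3, 5), 0.5)            # constant initialization (0.5 everywhere)
W2 = np.full((5, 1), 0.5)

h = np.tanh(x @ W1)                  # all 5 hidden columns are identical
err = h @ W2 - y                     # prediction error

# Manual backprop for this tiny network
grad_W2 = h.T @ err
grad_W1 = x.T @ ((err @ W2.T) * (1.0 - h**2))

print(np.allclose(h[:, :1], h))              # True: identical activations
print(np.allclose(grad_W1[:, :1], grad_W1))  # True: identical per-unit gradients
```

With zeros specifically, grad_W1 is exactly zero, so the hidden weights would not move at all.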

2. Random Initialization

  • Initializes with small random values.
  • The easiest way to assign random(different) values to parameters is to use a probability distribution.
    • Values are typically drawn from a uniform distribution over a small range (e.g., -0.01 to 0.01) or from a normal distribution with a similarly small standard deviation.
  • Simple but may cause gradient vanishing/exploding problems in deep networks.
  • Details:
    • For example, we can set all weights differently by assigning values that follow a normal distribution.
    • When activation values are close to 0 or 1, the derivative of the logistic (sigmoid) function is nearly zero.
    • This leads to the vanishing-gradient phenomenon, where learning does not occur.
    • We can reduce the standard deviation to keep values from being pushed toward these extremes.
    • This mitigates the vanishing-gradient effect.
    • However, most output values then cluster around 0.5.
    • As mentioned for constant initialization, when all nodes' activation outputs are similar, having multiple nodes loses its meaning (see the sketch after this list).
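
A small sketch of this behavior, assuming a hypothetical stack of five 100-unit sigmoid layers initialized with standard deviation 0.01: the activations avoid the 0/1 extremes, but they collapse toward 0.5 with an ever-shrinking spread.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small random init (std = 0.01): no saturation at 0 or 1, but the sigmoid
# outputs crowd around 0.5 and their spread shrinks with depth.
rng = np.random.default_rng(42)
a = rng.normal(size=(1000, 100))                   # a batch of inputs

for layer in range(5):
    W = rng.normal(0.0, 0.01, size=(100, 100))     # small random initialization
    a = sigmoid(a @ W)
    print(f"layer {layer + 1}: mean={a.mean():.3f}, std={a.std():.4f}")
```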

3. Xavier/Glorot Initialization

  • Xavier initialization is an initialization method designed to solve the problems mentioned above.
  • Xavier initialization doesn't use a fixed standard deviation.
    • Instead, it adjusts based on the number of nodes in the previous hidden layer.
    • When there are n nodes in the previous hidden layer and m nodes in the current hidden layer, weights are initialized using a normal distribution with a standard deviation of √(2 / (n + m)).
    • Formula: W = random_normal(0, sqrt(2/(fan_in + fan_out)))
      • fan_in: number of input nodes (number of neurons in the previous layer)
      • fan_out: number of output nodes (number of neurons in the current layer)
  • As a result, activation values will be spread much more evenly than in the previous 2 methods.
    • Because weights are initialized according to the number of nodes in each layer, it's much more robust than using a fixed standard deviation, even when setting different numbers of nodes per layer.
  • Suitable for tanh or sigmoid activation functions.
  • But not suitable for ReLU.
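
A minimal NumPy sketch of the formula above (the helper name and layer sizes are only for illustration):

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng=None):
    """Glorot/Xavier normal init: std = sqrt(2 / (fan_in + fan_out))."""
    if rng is None:
        rng = np.random.default_rng()
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example: a layer with 256 inputs and 128 outputs
W = xavier_normal(256, 128)
print(W.std())   # close to sqrt(2 / (256 + 128)) ≈ 0.072
```

In Keras, Dense layers default to the uniform variant of this scheme (glorot_uniform); tf.keras.initializers.GlorotNormal is the normal variant shown here.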

4. He Initialization

  • Suitable for ReLU family activation functions.
    • Very effective for ReLU and widely used in deep learning.
  • Uses twice the variance of Xavier (comparing the fan_in-only forms: 2/fan_in instead of 1/fan_in).
  • Even as the layers get deeper, all activation values maintain an even distribution.
  • Formula: W = random_normal(0, sqrt(2/fan_in))
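
A sketch of this behavior, assuming a hypothetical stack of five 100-unit ReLU layers: with std = sqrt(2 / fan_in), the spread of the activations stays roughly stable from layer to layer rather than shrinking or exploding.

```python
import numpy as np

# He normal init (std = sqrt(2 / fan_in)) keeps the spread of ReLU
# activations roughly constant across the stack of layers.
rng = np.random.default_rng(0)
a = rng.normal(size=(1000, 100))

for layer in range(5):
    W = rng.normal(0.0, np.sqrt(2.0 / 100), size=(100, 100))   # He normal
    a = np.maximum(0.0, a @ W)                                  # ReLU
    print(f"layer {layer + 1}: std={a.std():.3f}")
```

In Keras this corresponds to kernel_initializer='he_normal' (which uses a truncated normal with the same scale).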

5. LeCun Initialization

  • Suitable for SELU activation function.
    • SELU activation function: 
      • It was designed to have self-normalizing characteristics in deep learning networks.
        • Automatically normalizes input values' mean and variance.
        • Maintains consistent mean and variance of activations during training.
      • Solves gradient vanishing/explosion problem.
      • Enables stable training of deeper networks.
      • Fast learning speed.
      • Model becomes simpler as normalization layers are not needed.
  • Uses a normal distribution with mean 0 and variance 1/fan_in.
  • Formula: W = random_normal(0, sqrt(1/fan_in))
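
A sketch of the self-normalizing behavior, assuming a hypothetical stack of five 100-unit SELU layers with LeCun normal initialization; the alpha and lambda values below are the standard SELU constants.

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805   # standard SELU constants

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1))

# LeCun normal init (std = sqrt(1 / fan_in)) + SELU keeps activations close
# to zero mean / unit variance as the network gets deeper.
rng = np.random.default_rng(0)
a = rng.normal(size=(1000, 100))

for layer in range(5):
    W = rng.normal(0.0, np.sqrt(1.0 / 100), size=(100, 100))   # LeCun normal
    a = selu(a @ W)
    print(f"layer {layer + 1}: mean={a.mean():.3f}, std={a.std():.3f}")
```

In Keras this is the documented pairing of activation='selu' with kernel_initializer='lecun_normal'.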

6. Orthogonal Initialization (Orthogonal Matrix Initialization)

  • In RNN(Recurrent Neural Network)s, orthogonal initialization is used.
  • Orthogonal initialization ensures that weights for hidden states in recurrent cells don't become too large or too small when multiplied repeatedly.
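
A minimal sketch of why this helps (matrix size, step count, and the 0.2 scale of the comparison matrix are arbitrary): an orthogonal matrix, built here from the QR decomposition of a random Gaussian matrix, preserves the norm of the hidden state under repeated multiplication, while a plain random matrix makes it explode or vanish.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Build an n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))   # flip column signs to remove the QR sign ambiguity

rng = np.random.default_rng(0)
W_orth = orthogonal_init(64, rng)
W_rand = rng.normal(0.0, 0.2, size=(64, 64))

h0 = rng.normal(size=64)
h_orth, h_rand = h0.copy(), h0.copy()
for _ in range(50):                      # 50 "time steps" of repeated multiplication
    h_orth = W_orth @ h_orth
    h_rand = W_rand @ h_rand

print(np.linalg.norm(h0))                # initial hidden-state norm
print(np.linalg.norm(h_orth))            # unchanged: orthogonal matrices preserve length
print(np.linalg.norm(h_rand))            # explodes (or vanishes for smaller scales)
```

Keras follows the same idea: the default recurrent_initializer for LSTM and GRU layers is 'orthogonal'.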

Reference

  • Neural Networks and Deep Learning (DeepLearning.AI): www.coursera.org
  • Do it! 딥러닝 입문 (Do it! Introduction to Deep Learning): tensorflow.blog
  • 가중치 초기화 (Weight Initialization) · Data Science: yngie-c.github.io

I stay ready so I don't have to get ready.
- Conor McGregor -