In neural networks, the activation function is the key element that enables the network to learn complex, non-linear patterns.
Reasons why activation functions are necessary:
- Introduction of Non-linearity
- Without activation functions, neural networks would be merely combinations of linear transformations.
- No matter how many layers are stacked, they could be expressed as a single linear function, meaning complex patterns cannot be learned.
- Activation functions introduce non-linearity, allowing neural networks to approximate complex functions (see the short sketch after this list).
- Feature Extraction and Transformation
- Each neuron learns to respond to specific features in the input data.
- Activation functions transform these features non-linearly to create more useful representations.
- This enables effective capture of important patterns in the data.
- Control of Information Flow
- Activation functions regulate the strength of signals.
- They can strengthen important information and suppress unnecessary information.
- This is similar to how biological neurons activate only when exceeding certain thresholds.
- Stability in Learning Process
- Choosing appropriate activation functions can mitigate learning problems like vanishing or exploding gradients.
- They also stabilize learning by limiting output values to specific ranges.
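As a quick illustration of the non-linearity point above, here is a minimal numpy sketch showing that two linear layers with no activation in between collapse into a single linear map, while inserting a non-linearity breaks that equivalence. The layer sizes and random weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation function between them.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)       # "deep" network without a non-linearity
collapsed = (W2 @ W1) @ x        # a single equivalent linear layer

print(np.allclose(two_layers, collapsed))   # True: the extra layer added nothing

# Inserting a non-linearity (here ReLU) between the layers breaks the
# equivalence, so the stacked network is no longer a single linear function.
relu = lambda z: np.maximum(0.0, z)
print(np.allclose(W2 @ relu(W1 @ x), collapsed))   # generally False
```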
Historical Development:
- Early neural networks mainly used S-shaped activation functions like sigmoid and tanh.
- However, these functions caused vanishing gradient problems in deep neural networks.
- ReLU was introduced to solve this, leading to breakthroughs in modern deep learning.
- Various modifications like Leaky ReLU were later proposed to address ReLU's shortcomings.
1. Sigmoid function
- Formula: f(x) = 1 / (1 + e^(-x))
- Output normalized between 0 and 1.
- Frequently used in the output layer for binary classification.
- Smooth S-shaped curve.
- Problems:
- Saturation:
- Looking at the sigmoid function's graph, we can see that when the sum of input signals (the pre-activation value) is very large or very small, the gradient approaches 0.
- This phenomenon, where the gradient approaches 0 in certain regions of the activation function, is called saturation.
- This leads to the Vanishing Gradient Problem (see the section at the very bottom of this post for details).
- The exp() operation is somewhat computationally expensive.
- Sigmoid outputs are not zero-centered
- dL/dW (the gradient of the loss with respect to W) can be found through the upstream gradient (dL/da) and the local gradient (da/dW).
- Here, da/dW equals X.
- This explains the rule for finding partial derivatives in multiplication operations during backpropagation.
- If f = W * X, df/dW becomes X and df/dX becomes W.
- This is due to the differentiation rule for multiplication. When differentiating with respect to one variable, the other variable is treated as a constant and remains unchanged.
- Therefore, if X is always positive, the sign of dL/dW is determined entirely by the sign of dL/da.
- Consequently, all elements of W's gradient share the same sign: all positive or all negative.
- This means the update to W always moves in the same sign direction along every dimension.
- Let's take a look at the graph below.
- Assume W is 2-dimensional, with w1 on the x-axis and w2 on the y-axis.
- In this case, W's update can only happen in 2 ways: w1 and w2 both increase, or w1 and w2 both decrease.
- However, the optimal solution lies in the direction where w1 increases while w2 decreases (the blue line).
- When W's gradient is always all-positive or all-negative, the search for the optimal solution is inefficient.
- Instead of moving along the blue line as shown in the figure, it searches in a zigzag pattern (because each update can only move up-right, when everything is positive, or down-left, when everything is negative).
- This is an inefficient way to update the weights because it requires many more steps.
- This is precisely why we generally want zero-mean data.
- If the input X contains both positive and negative values, the gradient of W is no longer forced to be entirely positive or entirely negative (both issues are illustrated in the small sketch after this list).
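Both problems above can be seen in a minimal numpy sketch; the sample points and random inputs below are arbitrary and only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Saturation: the gradient peaks at 0.25 around x = 0 and is nearly 0 for large |x|.
for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x = {x:6.1f}   sigmoid = {sigmoid(x):.4f}   gradient = {sigmoid_grad(x):.6f}")

# Not zero-centered: every sigmoid output is positive, so the inputs X fed to
# the next layer are all positive, which forces the zigzag weight updates.
acts = sigmoid(np.random.default_rng(0).normal(size=1000))
print(acts.min() > 0, round(acts.mean(), 3))   # True, mean well above 0 (around 0.5)
```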
2. Tanh function
- Formula: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- Output normalized between -1 ~ 1.
- Steeper gradient than Sigmoid.
- Output is symmetric around 0 -> zero-centered.
- Frequently used in RNN/LSTM.
- Problems:
- There are still saturated regions where the gradient dies (at both large positive and large negative inputs).
- Thus, it is still possible to hit the vanishing gradient problem in deep networks (the sketch after this list compares the two gradients).
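To make the "steeper gradient" and "still saturates" points concrete, here is a minimal numpy comparison of the two gradients; the grid of x values is arbitrary.

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 1001)

sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)
tanh_grad = 1.0 - np.tanh(x) ** 2      # derivative of tanh

print(sigmoid_grad.max())            # 0.25 -> every backprop factor is well below 1
print(tanh_grad.max())               # 1.0  -> steeper around 0 than sigmoid
print(tanh_grad[0], tanh_grad[-1])   # ~0 at x = -6 and x = +6: tanh still saturates
```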
3. ReLU function
- Abbreviation of "Rectified Linear Unit".
- Most widely used activation function.
- Used by AlexNet, which won ImageNet 2012.
- Simple and fast computation.
- Computationally efficient.
- Converges much faster than sigmoid/tanh in practice (about 6 times faster).
- Does not saturate in the positive region.
- Problems:
- The non-zero-centered problem occurs again -> zigzag updates.
- The saturation problem occurs again in the negative region.
- Neurons might die in the negative region: the Dying ReLU Problem.
- Dying ReLU Problem
- What does it mean when ReLU "dies":
- ReLU function always outputs 0 when the input is less than 0.
- When a specific neuron only outputs 0, we say it has "died".
- In other words, this neuron can no longer transmit any useful information.
- Why is this problem?
- Dead neurons have a gradient of 0, meaning they no longer learn.
- The network's capacity decreases.
- Once a neuron dies, it's usually difficult to recover.
- Dead ReLU can occur when ReLU is far from the data cloud.
- First, it can happen due to poor initialization.
- This occurs when the weight plane is far from the data cloud.
- In this case, it won't activate for any input data.
- A more common case is when the learning rate is too high.
- When the learning rate for weight parameter updates is high, ReLU can deviate from the data manifold.
- This is quite common and can easily occur during the training process.
- That's why training can appear to be going well and then a neuron can suddenly die.
- How to tell whether a ReLU unit will be active on the data cloud or not:
- In the 2D example above, the weights define a hyperplane separating the half-space where the ReLU is active from the half-space where it is dead.
- Depending on where W's hyperplane lies relative to the data, the hyperplane can end up entirely away from the data cloud, so the unit never activates for any input (a toy numerical example of such a dead unit follows after this list).
- In practice, to avoid Dead ReLU, positive biases are sometimes added when initializing ReLU.
- This is done to increase the chances of having active ReLUs during weight updates.
- However, there are conflicting opinions about whether this approach is helpful or not.
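Here is a minimal numpy sketch of a "dead" ReLU unit; the toy data cloud, weights, and the large negative bias are made-up values chosen only to reproduce the situation described above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))     # a toy "data cloud" centered at the origin

# A hypothetical neuron whose hyperplane ended up far from the data cloud,
# e.g. after bad initialization or an overly large learning-rate step.
w = np.array([1.0, 1.0])
b = -10.0                          # pushes every pre-activation below zero

pre = X @ w + b
out = relu(pre)
grad_mask = (pre > 0).astype(float)   # ReLU's local gradient (1 if active, else 0)

print(out.max())        # 0.0 -> the neuron outputs 0 for every data point
print(grad_mask.sum())  # 0.0 -> the gradient is 0 everywhere, so w and b never update again
```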
4. Leaky ReLU function
- Similar to ReLU but no longer zero in the negative region.
- Does not get saturated.
- Still computationally efficient and fast.
- No more Dead ReLU phenomenon.
- Has small gradient for negative inputs.
- More stable learning than ReLU.
5. PReLU function
- "Parametric Rectifier".
- PReLU is similar to Leaky ReLU in that it has a slope in the negative space.
- However, here the slope is determined by a parameter called alpha.
- Instead of fixing alpha to a specific value, it is made a parameter that is learned through backpropagation (compare it with Leaky ReLU in the sketch below).
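The difference between Leaky ReLU and PReLU is easiest to see side by side. A minimal numpy sketch follows; the sample inputs and the alpha value are arbitrary.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Fixed small slope in the negative region.
    return np.where(x > 0, x, negative_slope * x)

def prelu(x, alpha):
    # Same form, but alpha is a learnable parameter updated by backpropagation.
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # d(PReLU)/d(alpha): nonzero only for negative inputs,
    # which is the gradient used to learn alpha.
    return np.where(x > 0, 0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))          # [-0.03  -0.005  0.    0.5   3.  ]
print(prelu(x, alpha=0.2))    # [-0.6   -0.1    0.    0.5   3.  ]
print(prelu_grad_alpha(x))    # [-3.    -0.5    0.    0.    0.  ]
```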
6. ELU function
- ELU is part of the LU family (ReLU, LeakyReLU, PReLU...).
- However, ELU shows outputs close to zero-mean (as you can see from the graph, it's smoother around 0).
- This gives it a significant advantage compared to other LU family members that don't have zero-mean outputs.
- When compared to Leaky ReLU, ELU "saturates" in the negative region instead of having a constant slope.
- ELU claims that this saturation can be more robust to noise.
- It argues that this kind of deactivation can provide more robustness.
- The ELU paper is said to explain well why ELU is superior.
- ELU can be considered a middle ground between ReLU and Leaky ReLU.
- ELU pushes its outputs toward zero mean (something Leaky ReLU also aims for but only partially achieves), while it also saturates in the negative regime, somewhat like ReLU.
- ReLU definitely does not produce zero-mean outputs.
- Leaky ReLU isn't truly zero-mean either, since it passes positive inputs through unchanged and only applies a small slope in the negative region.
- ELU achieves outputs closer to zero mean by using an exponential function in the negative region (illustrated in the sketch below).
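A minimal numpy sketch of ELU and its closer-to-zero-mean behavior; alpha = 1 and the random inputs are just illustrative assumptions.

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x > 0; saturates smoothly toward -alpha for very negative x.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -2.0, -0.5, 0.0, 0.5, 2.0])
print(elu(x))   # negative inputs approach -1.0 instead of following a straight slope

# The negative outputs pull the mean of the activations toward zero.
z = np.random.default_rng(0).normal(size=10_000)
print(round(np.maximum(0.0, z).mean(), 3))   # ReLU: mean clearly above 0 (~0.4)
print(round(elu(z).mean(), 3))               # ELU: mean noticeably closer to 0 (~0.16)
```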
7. Maxout "Neuron"
- This looks a bit different from the activation functions we've seen so far.
- It doesn't take the usual form of a single dot product followed by a fixed non-linearity.
- Instead, it computes two linear functions of the input: (w1 · x + b1) and (w2 · x + b2).
- Maxout selects the maximum of these two values.
- Maxout is a more generalized form of ReLU and Leaky ReLU, because Maxout takes two linear functions.
- (If you look at ReLU and Leaky ReLU, you can see they are combinations of two linear functions)
- Since each piece of Maxout is linear (the function is piecewise linear), it doesn't saturate and the gradient doesn't die.
- The problem here is that the number of parameters per neuron doubles.
- Now each neuron needs two sets of parameters, (W1, b1) and (W2, b2) (see the sketch below).
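A minimal numpy sketch of a single Maxout neuron, including how ReLU falls out as a special case when one of the two linear functions is fixed at zero; all weights and biases below are arbitrary.

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    # Takes the maximum of two linear functions of the input.
    return np.maximum(w1 @ x + b1, w2 @ x + b2)

x = np.array([-2.0, 1.0, 3.0])

# Two independent sets of parameters per neuron -> twice the parameters.
w1, b1 = np.array([0.5, -1.0, 0.2]), 0.1
w2, b2 = np.array([-0.3, 0.8, 0.0]), -0.2
print(maxout(x, w1, b1, w2, b2))   # 1.2 (the larger of the two linear responses)

# ReLU as a special case: one linear function is identically 0.
w, b = np.array([1.0, 1.0, 1.0]), 0.0
print(maxout(x, np.zeros(3), 0.0, w, b))   # equals max(0, w @ x + b), i.e. ReLU
```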
Conclusion
1. For general hidden layers, use ReLU (recommended as the default).
2. If learning is unstable with ReLU, try Leaky ReLU.
3. For the output layer, the choice depends on the task: sigmoid for binary classification, softmax for multi-class classification, and a linear (identity) output for regression (a small sketch of these output activations follows below).
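As a quick reference for point 3, here is a minimal numpy sketch of the usual output-layer choices; the logits are arbitrary numbers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])

print(sigmoid(logits[0]))   # binary classification: probability of the positive class
print(softmax(logits))      # multi-class classification: probabilities that sum to 1
print(logits)               # regression: linear (identity) output, no squashing
```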
Vanishing Gradient Problem
- The neural network structure consists of Visible layers and Hidden layers.
- The visible layers consist of Input and Output layers, while the layers in between are called Hidden layers because we cannot see what calculations are performed inside them.
- In multi-layer perceptrons with many hidden layers, as the error signal is propagated backward through more and more hidden layers, it shrinks until learning effectively stops - this is known as the vanishing gradient problem.
- When the gradient vanishes to nearly 0, network learning becomes very slow and may stop before learning is complete.
- It can then look as if training is stuck in a local minimum.
- As we learned earlier, functions like sigmoid cause the vanishing gradient problem quickly because their derivative is at most 0.25, so every factor multiplied in during backpropagation is well below 1.
- This is easy to understand if you think about how repeatedly multiplying numbers less than 1 together results in values approaching 0 (see the toy sketch at the end of this section).
- This can be mitigated by choosing an activation function whose gradient does not vanish in this way (e.g., the ReLU function).
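A toy numpy sketch of the multiplication argument above: each sigmoid-derivative factor is at most 0.25, so the product shrinks rapidly with depth. The random pre-activations and the 20-layer depth are arbitrary assumptions, and weight factors are ignored to isolate the activation's effect.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(20):
    pre_activation = rng.normal()          # a toy pre-activation value at this layer
    grad *= sigmoid_grad(pre_activation)   # each factor is at most 0.25

print(grad)   # after 20 layers the accumulated gradient is vanishingly small

# With ReLU, the local gradient is exactly 1 for positive inputs,
# so the activation function itself does not shrink the backpropagated signal.
```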
Reference
https://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture6.pdf
I am cocky in prediction. I am confident in preparation, but I am always humble in victory or defeat.
- Conor McGregor -