
[NLP] 3. How Does the Transformer Work?

dongsunseng 2025. 1. 3. 15:52

Background

  • Transformer, introduced by Google in 2017 for natural language processing, is a language model that's leading innovation in the AI field. 
  • ChatGPT, which first enabled us to use AI through web and API interfaces, is also based on Transformer, as are the language models that companies like Google and Facebook are developing as competitors.
  • The Transformer has gone on to achieve state-of-the-art performance not only in natural language processing but also in other fields such as computer vision and speech recognition.

Shift from CNN Dominance to Transformer

  • Deep learning can be traced back to the Perceptron of the 1950s, which was inspired by human neurons.
  • However, deep learning went through a dark age from the 1990s through the 2000s and into the early 2010s, due to insufficient computing power and, more importantly, a lack of data to learn from.
  • However, in the 2010s, data increased explosively through smartphones and social media.
  • In 2012, AlexNet, a deep learning model, became a breakthrough in the ImageNet Challenge (classifying images into 1,000 categories) by lifting image classification accuracy by more than 10 percentage points above the previous 70-80% range.
  • AlexNet consists of 5 convolutional (CNN) layers and 3 fully connected (FC) layers.
  • After that, progress in computer vision centered on CNN-based models, and ResNet, introduced in 2015, achieved an image recognition error rate of around 3%, comparable to human performance.

Natural Language Processing's History

  • In contrast, for natural language processing, RNN, which is an artificial neural network for processing sequential data like text, emerged in the 1980s, and its improved version LSTM came out in 1997. 
  • However, for a long time they could not overcome the long-term dependency problem: as input sentences got longer, it became harder to retain information from earlier in the sequence.
  • There were also attempts to analyze sentence sentiment by creating embedding vectors with CNNs, which were popular at the time.
  • The Sequence to Sequence language model, introduced in 2014, is considered one of the greatest inventions in natural language processing history.
  • It could not only convert existing sentences into numerical values but also generate new sentences using these values.
  • Machine Translation is a typical example, such as generating English sentences from Korean input.
  • However, the Seq2Seq model still suffered from RNN's chronic problem of forgetting earlier information as input sentences got longer, since it used RNNs in both the encoder (which processes the input sentence) and the decoder (which generates the new sentence).
  • Also, information loss occurred when trying to reconstruct target sentences using only the numerical information from the encoder's last timestep.
  • This issue was later resolved with the addition of Attention, enabling translation regardless of sentence length.

RNN's Main Problems (Summary)

  • They process the input data sequentially, one element after another. Such a recurrent process cannot exploit modern graphics processing units (GPUs), which are designed for parallel computation, and thus makes training quite slow.
  • They become quite ineffective when elements are distant from one another. Information is passed along at each step, and the longer the chain is, the more likely it is that the information gets lost along the way.

Attention?

Image source: https://wikidocs.net/22893

  • The basic idea of Attention is that, since the numerical information from the encoder's last timestep alone is not sufficient, the decoder refers back to the entire input sentence at every timestep when predicting output words.
  • However, it doesn't reference all input words equally - instead, it pays more attention to words most relevant to the word being predicted at that timestep.
  • Mathematically, the decoder's hidden state at the current timestep is multiplied by a weight matrix to form a query, which is then dot-multiplied with every encoder timestep's output; these weights are learned through backpropagation so that the model attends more strongly to the input words relevant to the word being predicted (see the sketch after this list).
  • Although the addition of Attention somewhat removed limitations on sentence length, RNN-based Seq2Seq models still produced lower quality translations compared to humans.
  • However, the emergence of the Transformer brought significant changes to natural language processing.
  • In 2017, Google introduced the Transformer model in the paper "Attention Is All You Need," implementing both the encoder and the decoder entirely with attention mechanisms, rather than using attention merely as a supplement to RNNs.
  • The Transformer model became not only free from sentence length constraints but also better at understanding input sentences through the encoder and previously generated words through the decoder.
  • All famous pre-trained language models (PLMs) since then have been Transformer-based.
  • BERT consists of 12 Transformer encoders and excels at natural language understanding, while GPT-1 consists of 12 Transformer decoders and shows strength in natural language generation.
  • Subsequent language models have evolved by increasing model size and datasets - GPT-3's largest version has 96 decoders and 175 billion parameters.
  • ChatGPT is a model fine-tuned for conversation from the GPT-3.5 series, a successor of GPT-3.
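To make the dot-product attention described above concrete, here is a minimal NumPy sketch (the toy dimensions and variable names are illustrative, not from the paper): it scores one decoder hidden state against every encoder timestep and builds a weighted context vector.

```python
import numpy as np

def dot_product_attention(query, encoder_outputs):
    """Score one decoder hidden state against every encoder timestep.

    query:           (d,)   decoder hidden state at the current timestep
    encoder_outputs: (T, d) encoder hidden states for all T input tokens
    Returns the attention weights (T,) and the context vector (d,).
    """
    scores = encoder_outputs @ query             # (T,) dot-product scores
    weights = np.exp(scores - scores.max())      # softmax -> probability
    weights /= weights.sum()                     #   distribution over tokens
    context = weights @ encoder_outputs          # (d,) weighted sum
    return weights, context

# Toy example: 4 input tokens, hidden size 8
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(4, 8))
dec_hidden = rng.normal(size=(8,))
w, ctx = dot_product_attention(dec_hidden, enc_states)
print(w.sum(), ctx.shape)   # weights sum to 1.0; context has shape (8,)
```

In a trained Seq2Seq model the query and the encoder states come from learned layers; random vectors here simply show the shapes involved.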

Transformer Excelling in the Image Field

  • The Transformer model is achieving good results not only in natural language processing but also in image processing.
  • Vision Transformer (ViT), announced in 2020, applies the Transformer model to the vision field.
  • It divides input images into patches, feeds them into the Transformer's encoder, and uses attention to capture interdependencies between different positions of the input image as well as global image features (see the patch-splitting sketch after this list).
  • Additionally, Transformer is used in popular text-to-image generation models like DALL-E 2 and Stable Diffusion.
  • These models learn optimal weights for image generation by adding noise to images and restoring them, but instead of blindly restoring images, they find directions for restoration conditioned on given text information.
  • The Transformer is used not only to understand information between texts but also to model interactions between text and image representations.
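The patch-splitting step ViT performs before its encoder can be sketched as follows; the 224x224 image size and 16x16 patch size are common ViT settings used here for illustration, and the learned projection to the model dimension is omitted.

```python
import numpy as np

def image_to_patches(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Each patch becomes one "token"; ViT then linearly projects it to the
    model dimension and feeds the sequence to the Transformer encoder.
    """
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, C)
            .transpose(0, 2, 1, 3, 4)            # group the two patch axes
            .reshape(rows * cols, patch * patch * C))

img = np.zeros((224, 224, 3))
print(image_to_patches(img).shape)   # (196, 768): 14*14 patches of 16*16*3 values
```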

More into the Model 

  • Transformer uses attention in both the encoder (which understands input sentences) and the decoder (which generates target sentences).
  • There are three types of attention in the Transformer:
    1. Encoder Self-Attention
      • Used within the encoder for understanding input sentences
    2. Decoder Self-Attention (also called Masked Attention)
      • Used within the decoder for understanding the sentence it's generating
      • Called "masked attention" because it masks future tokens during the word-by-word sentence generation process
    3. Encoder-Decoder Attention
      • The original purpose of attention
      • Used for the decoder to reference information from the encoder when generating sentences, supplementing any missing information
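A minimal NumPy sketch of the masking used in decoder self-attention (the helper names are mine): scores for future positions are set to negative infinity before the softmax, so each token attends only to itself and earlier tokens.

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular mask: position i may attend only to positions 0..i."""
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    """Block masked-out (future) positions by setting their scores to -inf."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))   # 4 generated tokens
print(masked_softmax(scores, causal_mask(4)))
# Row i has non-zero weights only in columns 0..i
```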

Image source: https://wikidocs.net/31379

  • Looking at the process step by step from where words enter the encoder:
    1. Input sentences are tokenized, and a dictionary (vocabulary) of tokens is built
    2. Tokens are mapped to integers
    3. These pass through the embedding layer
    4. This creates embedding values for tokens that the model will learn
  • The Transformer maintains a consistent dimensionality of 512 for both word embedding vectors and all input/output values within the model (see the toy sketch below).
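A toy sketch of steps 1-4 (the whitespace tokenizer and the random embedding table are simplifications; in a real model the embedding table is a learned parameter):

```python
import numpy as np

d_model = 512
sentence = "나는 학교 에 간다"             # "I am going to school"

# 1-2. Tokenize and map each token to an integer via a dictionary
tokens = sentence.split()                  # toy whitespace tokenizer
vocab = {tok: i for i, tok in enumerate(tokens)}
ids = [vocab[tok] for tok in tokens]       # [0, 1, 2, 3]

# 3-4. Embedding layer: one learnable 512-dim vector per vocabulary entry
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), d_model))
embedded = embedding_table[ids]            # shape (4, 512), fed into the encoder
print(embedded.shape)
```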

Breaking the Transformer's detailed operation down step by step:
1. Multi-head Attention in First Encoding Layer

  • Contextual representations are generated by calculating similarities between the tokens of the input sentence
  • Instead of computing these similarities over the full 512-dimensional token vectors all at once
  • The model splits them into n heads that learn in parallel (hence "Multi-head Attention")
  • The paper used 8 heads

2. Example Calculation

  • For a sentence like "나는, 학교, 에, 간다" ("I am going to school"):
  • Instead of one full (4, 512) x (4, 512).T matrix multiplication over the complete embeddings
  • The per-head vector size is reduced to 64 dimensions (512 / 8)
  • Enabling 8 parallel (4, 64) x (4, 64).T matrix operations (see the shape sketch below)
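A shape-level sketch of that head split, using random data for illustration: one (4, 512) input becomes 8 parallel (4, 64) attention computations, each producing a (4, 4) token-to-token score matrix.

```python
import numpy as np

n_tokens, d_model, n_heads = 4, 512, 8
d_head = d_model // n_heads                       # 64 dimensions per head

x = np.random.default_rng(0).normal(size=(n_tokens, d_model))

# Split the 512-dim representation into 8 heads of 64 dims each
heads = x.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)   # (8, 4, 64)

# Each head computes its own (4, 4) scaled dot-product score matrix in parallel
scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)       # (8, 4, 4)
print(heads.shape, scores.shape)
```

In the real model the query, key, and value projections for each head are learned weight matrices; reusing x for all of them here only demonstrates the shapes.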

3. Efficient Processing

  • Uses matrix multiplication between input values and model weights
  • Processes efficiently through:
    • Batch matrix operations
    • Parallel attention processing via multi-head attention mechanism

4. Subsequent Encoder Blocks

  • Each subsequent block performs self-attention on the output of the previous block
  • Each encoder block has its own weight parameters
  • The model's expressiveness improves as the layers stack up (see the sketch below)
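Assuming PyTorch is available, its built-in modules can illustrate this stacking; d_model=512 and 8 heads match the paper, and 6 layers is the paper's base setting.

```python
import torch
import torch.nn as nn

# One encoder block: multi-head self-attention + feed-forward sublayers
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

# Stack 6 blocks; each block keeps its own, independently trained parameters
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(1, 4, 512)   # batch of 1 sentence, 4 tokens, 512-dim embeddings
out = encoder(x)             # output of the final encoder block
print(out.shape)             # torch.Size([1, 4, 512])
```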

5. Decoder Operation

  • Performs masked self-attention on the output-sentence tokens generated so far (a sketch of the full decoder step follows this list)
  • Conducts encoder-decoder attention using:
    • Self-attention values
    • Values passed through final encoder block
  • Both self-attention and encoder-decoder attention use parallel-processed multi-head attention
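A hedged PyTorch sketch of one decoder step, combining masked self-attention over the generated tokens with encoder-decoder attention over the final encoder output; the token counts are arbitrary examples.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                       batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)

memory = torch.randn(1, 4, d_model)   # final encoder block output (4 source tokens)
tgt = torch.randn(1, 3, d_model)      # embeddings of 3 already-generated tokens

# Causal mask: each target position attends only to itself and earlier positions
tgt_mask = torch.triu(torch.full((3, 3), float("-inf")), diagonal=1)

out = decoder(tgt, memory, tgt_mask=tgt_mask)   # both attentions are multi-head
print(out.shape)                                # torch.Size([1, 3, 512])
```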

Conclusion

  • The Transformer achieved several key breakthroughs:
    1. Overcame Sentence Length Limitations
      • Through attention mechanisms
      • Improved understanding of both input and generated sentences
    2. Efficient Processing
      • Handles massive matrix operations between input values and weights
      • Achieves efficiency through parallel processing of all operations
    3. Foundation for Large Language Models
      • Enabled development of large-scale language models like GPT (Generative Pre-trained Transformer)
      • Made it possible to pre-train on massive datasets
      • Achieved superior performance through this architecture
  •  This architecture laid the groundwork for modern large language models and continues to drive innovation in AI.

Reference

 

Transformer 모델이란? : AI 혁신을 주도하는 트랜스포머 알고리즘 (What is the Transformer model? The Transformer algorithm leading AI innovation), blog-ko.superb-ai.com

You can find a detailed walkthrough of how the Transformer makes sense of its input data and generates new data in this tutorial:

https://www.datacamp.com/tutorial/how-transformers-work


Failure is an option here. If things are not failing, you are not innovating enough.
- Elon Musk -