
[NLP] 3. How Does the Transformer Work?

dongsunseng 2025. 1. 3. 15:52

Background

  • Transformer, introduced by Google in 2017 for natural language processing, is a language model that's leading innovation in the AI field. 
  • ChatGPT, which first enabled us to use AI through web and API interfaces, is also based on Transformer, as are the language models that companies like Google and Facebook are developing as competitors.
  • The Transformer has gone on to achieve state-of-the-art performance not only in natural language processing but also in other fields such as computer vision and speech recognition.

Shift from CNN Dominance to Transformer

  • Deep learning can be traced back to the Perceptron of the 1950s, which was inspired by human neurons.
  • However, deep learning went through a dark age from the 1990s through the 2000s and into the early 2010s, due to insufficient computing power and, more importantly, a lack of data to learn from.
  • However, in the 2010s, data increased explosively through smartphones and social media.
  • In 2012, AlexNet, a deep learning model, became a breakthrough in the ImageNet Challenge (classifying images into 1,000 categories) by lifting image classification accuracy by more than 10 percentage points above the previous 70-80% range.
  • AlexNet consists of 5 convolutional (CNN) layers and 3 fully connected (FC) layers.
  • After that, progress in computer vision centered on CNN-based models, and ResNet, introduced in 2015, achieved an image recognition error rate of around 3%, comparable to human performance.

Natural Language Processing's History

  • In contrast, for natural language processing, RNN, which is an artificial neural network for processing sequential data like text, emerged in the 1980s, and its improved version LSTM came out in 1997. 
  • However, for a long time they could not overcome the long-term dependency problem: as input sentences got longer, it became harder to retain information from earlier in the sequence.
  • There were also attempts to analyze sentence sentiment by creating embedding vectors with CNNs, which were popular at the time.
  • The Sequence to Sequence language model, introduced in 2014, is considered one of the greatest inventions in natural language processing history.
  • It could not only convert existing sentences into numerical values but also generate new sentences using these values.
  • Machine Translation is a typical example, such as generating English sentences from Korean input.
  • However, the Seq2Seq model still suffered from RNN's chronic problem of forgetting earlier information as input sentences got longer, since it used RNNs in both the encoder (which processes the input sentence) and the decoder (which generates the new sentence).
  • Also, information loss occurred when trying to reconstruct target sentences using only the numerical information from the encoder's last timestep.
  • This issue was later resolved with the addition of Attention, enabling translation regardless of sentence length.

RNN's Main Problems (Summary)

  • They process the input data sequentially, one element after another. Such a recurrent process cannot exploit modern graphics processing units (GPUs), which are designed for parallel computation, and thus makes training quite slow.
  • They become quite ineffective when elements are distant from one another. Information is passed along at each step, and the longer the chain is, the more likely it is that the information gets lost along the way.

Attention?

Image source: https://wikidocs.net/22893

  • The basic idea of Attention is that, since the numerical information from the encoder's last timestep alone is not sufficient, the decoder refers back to the entire input sentence at every timestep when predicting output words.
  • However, it doesn't reference all input words equally - instead, it pays more attention to words most relevant to the word being predicted at that timestep.
  • Mathematically, the decoder's hidden state at the current timestep is multiplied by a weight matrix to form a query, which is then dot-multiplied with every encoder timestep's output; these weights are learned through backpropagation so that the model attends more strongly to the input words relevant to the word being predicted (see the sketch after this list).
  • Although the addition of Attention somewhat removed limitations on sentence length, RNN-based Seq2Seq models still produced lower quality translations compared to humans.
  • However, the emergence of the Transformer brought significant changes to natural language processing.
  • In 2017, Google introduced the Transformer model in the paper "Attention Is All You Need," implementing both the encoder and the decoder entirely with attention mechanisms, rather than using attention merely as a supplement to RNNs.
  • The Transformer model became not only free from sentence length constraints but also better at understanding input sentences through the encoder and previously generated words through the decoder.
  • All famous pre-trained language models (PLMs) since then have been Transformer-based.
  • BERT consists of 12 Transformer encoders and excels at natural language understanding, while GPT-1 consists of 12 Transformer decoders and shows strength in natural language generation.
  • Subsequent language models have evolved by increasing model size and datasets - GPT-3's largest version has 96 decoders and 175 billion parameters.
  • ChatGPT is a model fine-tuned for conversation from the GPT-3.5 series, a successor of GPT-3.
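To make the dot-product attention described above concrete, here is a minimal NumPy sketch (the toy dimensions and variable names are illustrative, not from the paper): it scores one decoder hidden state against every encoder timestep and builds a weighted context vector.

```python
import numpy as np

def dot_product_attention(query, encoder_outputs):
    """Score one decoder hidden state against every encoder timestep.

    query:           (d,)   decoder hidden state at the current timestep
    encoder_outputs: (T, d) encoder hidden states for all T input tokens
    Returns the attention weights (T,) and the context vector (d,).
    """
    scores = encoder_outputs @ query             # (T,) dot-product scores
    weights = np.exp(scores - scores.max())      # softmax -> probability
    weights /= weights.sum()                     #   distribution over tokens
    context = weights @ encoder_outputs          # (d,) weighted sum
    return weights, context

# Toy example: 4 input tokens, hidden size 8
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(4, 8))
dec_hidden = rng.normal(size=(8,))
w, ctx = dot_product_attention(dec_hidden, enc_states)
print(w.sum(), ctx.shape)   # weights sum to 1.0; context has shape (8,)
```

In a trained Seq2Seq model the query and the encoder states come from learned layers; random vectors here simply show the shapes involved.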

Transformer Excelling in the Image Field

  • The Transformer model is achieving good results not only in natural language processing but also in image processing.
  • Vision Transformer (ViT), announced in 2020, applies the Transformer model to the vision field.
  • It divides input images into patches, feeds them into the Transformer's encoder, and uses attention to capture interdependencies between different positions of the input image as well as global image features (see the patch-splitting sketch after this list).
  • Additionally, Transformer is used in popular text-to-image generation models like DALL-E 2 and Stable Diffusion.
  • These models learn optimal weights for image generation by adding noise to images and restoring them, but instead of blindly restoring images, they find directions for restoration conditioned on given text information.
  • The Transformer is used not only to understand information between texts but also to model interactions between text and image representations.
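The patch-splitting step ViT performs before its encoder can be sketched as follows; the 224x224 image size and 16x16 patch size are common ViT settings used here for illustration, and the learned projection to the model dimension is omitted.

```python
import numpy as np

def image_to_patches(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Each patch becomes one "token"; ViT then linearly projects it to the
    model dimension and feeds the sequence to the Transformer encoder.
    """
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, C)
            .transpose(0, 2, 1, 3, 4)            # group the two patch axes
            .reshape(rows * cols, patch * patch * C))

img = np.zeros((224, 224, 3))
print(image_to_patches(img).shape)   # (196, 768): 14*14 patches of 16*16*3 values
```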

More into the Model 

  • Transformer uses attention in both the encoder (which understands input sentences) and the decoder (which generates target sentences).
  • There are three types of attention in the Transformer:
    1. Encoder Self-Attention
      • Used within the encoder for understanding input sentences
    2. Decoder Self-Attention (also called Masked Attention)
      • Used within the decoder for understanding the sentence it's generating
      • Called "masked attention" because it masks future tokens during the word-by-word sentence generation process
    3. Encoder-Decoder Attention
      • The original purpose of attention
      • Used for the decoder to reference information from the encoder when generating sentences, supplementing any missing information
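A minimal NumPy sketch of the masking used in decoder self-attention (the helper names are mine): scores for future positions are set to negative infinity before the softmax, so each token attends only to itself and earlier tokens.

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular mask: position i may attend only to positions 0..i."""
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    """Block masked-out (future) positions by setting their scores to -inf."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))   # 4 generated tokens
print(masked_softmax(scores, causal_mask(4)))
# Row i has non-zero weights only in columns 0..i
```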

Image source: https://wikidocs.net/31379

  • Looking at the process step by step from where words enter the encoder:
    1. Input sentences are tokenized, and a dictionary (vocabulary) of tokens is built
    2. Tokens are mapped to integers
    3. These pass through the embedding layer
    4. This creates embedding values for tokens that the model will learn
  • The Transformer maintains a consistent dimensionality of 512 for both word embedding vectors and all input/output values within the model (see the toy sketch below).
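A toy sketch of steps 1-4 (the whitespace tokenizer and the random embedding table are simplifications; in a real model the embedding table is a learned parameter):

```python
import numpy as np

d_model = 512
sentence = "나는 학교 에 간다"             # "I am going to school"

# 1-2. Tokenize and map each token to an integer via a dictionary
tokens = sentence.split()                  # toy whitespace tokenizer
vocab = {tok: i for i, tok in enumerate(tokens)}
ids = [vocab[tok] for tok in tokens]       # [0, 1, 2, 3]

# 3-4. Embedding layer: one learnable 512-dim vector per vocabulary entry
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), d_model))
embedded = embedding_table[ids]            # shape (4, 512), fed into the encoder
print(embedded.shape)
```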

Breaking the Transformer's detailed operation down step by step:
1. Multi-head Attention in First Encoding Layer

  • Contextual representations are generated by calculating similarities between the tokens of the input sentence
  • Instead of computing these similarities over the full 512-dimensional token vectors all at once
  • The model splits them into n heads that learn in parallel (hence "Multi-head Attention")
  • The paper used 8 heads

2. Example Calculation

  • For a sentence like "나는, 학교, 에, 간다" ("I am going to school"):
  • Instead of one full (4, 512) x (4, 512).T matrix multiplication over the complete embeddings
  • The per-head vector size is reduced to 64 dimensions (512 / 8)
  • Enabling 8 parallel (4, 64) x (4, 64).T matrix operations (see the shape sketch below)
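A shape-level sketch of that head split, using random data for illustration: one (4, 512) input becomes 8 parallel (4, 64) attention computations, each producing a (4, 4) token-to-token score matrix.

```python
import numpy as np

n_tokens, d_model, n_heads = 4, 512, 8
d_head = d_model // n_heads                       # 64 dimensions per head

x = np.random.default_rng(0).normal(size=(n_tokens, d_model))

# Split the 512-dim representation into 8 heads of 64 dims each
heads = x.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)   # (8, 4, 64)

# Each head computes its own (4, 4) scaled dot-product score matrix in parallel
scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)       # (8, 4, 4)
print(heads.shape, scores.shape)
```

In the real model the query, key, and value projections for each head are learned weight matrices; reusing x for all of them here only demonstrates the shapes.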

3. Efficient Processing

  • Uses matrix multiplication between input values and model weights
  • Processes efficiently through:
    • Batch matrix operations
    • Parallel attention processing via multi-head attention mechanism

4. Subsequent Encoder Blocks

  • Each subsequent block performs self-attention on the output of the previous block
  • Each encoder block has its own weight parameters
  • The model's expressiveness improves as the layers stack up (see the sketch below)
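Assuming PyTorch is available, its built-in modules can illustrate this stacking; d_model=512 and 8 heads match the paper, and 6 layers is the paper's base setting.

```python
import torch
import torch.nn as nn

# One encoder block: multi-head self-attention + feed-forward sublayers
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

# Stack 6 blocks; each block keeps its own, independently trained parameters
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(1, 4, 512)   # batch of 1 sentence, 4 tokens, 512-dim embeddings
out = encoder(x)             # output of the final encoder block
print(out.shape)             # torch.Size([1, 4, 512])
```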

5. Decoder Operation

  • Performs masked self-attention on the output-sentence tokens generated so far (a sketch of the full decoder step follows this list)
  • Conducts encoder-decoder attention using:
    • Self-attention values
    • Values passed through final encoder block
  • Both self-attention and encoder-decoder attention use parallel-processed multi-head attention
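A hedged PyTorch sketch of one decoder step, combining masked self-attention over the generated tokens with encoder-decoder attention over the final encoder output; the token counts are arbitrary examples.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                       batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)

memory = torch.randn(1, 4, d_model)   # final encoder block output (4 source tokens)
tgt = torch.randn(1, 3, d_model)      # embeddings of 3 already-generated tokens

# Causal mask: each target position attends only to itself and earlier positions
tgt_mask = torch.triu(torch.full((3, 3), float("-inf")), diagonal=1)

out = decoder(tgt, memory, tgt_mask=tgt_mask)   # both attentions are multi-head
print(out.shape)                                # torch.Size([1, 3, 512])
```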

Conclusion

  • The Transformer achieved several key breakthroughs:
    1. Overcame Sentence Length Limitations
      • Through attention mechanisms
      • Improved understanding of both input and generated sentences
    2. Efficient Processing
      • Handles massive matrix operations between input values and weights
      • Achieves efficiency through parallel processing of all operations
    3. Foundation for Large Language Models
      • Enabled development of large-scale language models like GPT (Generative Pre-trained Transformer)
      • Made it possible to pre-train on massive datasets
      • Achieved superior performance through this architecture
  •  This architecture laid the groundwork for modern large language models and continues to drive innovation in AI.

Reference

 

Transformer 모델이란? : AI 혁신을 주도하는 트랜스포머 알고리즘 (What is the Transformer model? The Transformer algorithm leading AI innovation), blog-ko.superb-ai.com

You can find a detailed walkthrough of how the Transformer makes sense of its input data and generates new data in this tutorial:

https://www.datacamp.com/tutorial/how-transformers-work


Failure is an option here. If things are not failing, you are not innovating enough.
- Elon Musk -