
[NLP] 1. Natural Language Processing Basics

dongsunseng 2024. 11. 18. 15:58

This post heavily relies on the book 'Natural Language Processing with PyTorch':

https://books.google.co.kr/books?id=AIgxEAAAQBAJ&printsec=copyright&redir_esc=y#v=onepage&q&f=false

NLP?

  • A set of techniques that apply statistical methods, with or without insights from linguistics, to understand text for the purpose of solving practical problems.
  • The 'understanding' of text is mainly achieved by converting text into computable representations.
  • These representations are structures that combine discrete or continuous elements such as vectors, tensors, graphs, and trees.
    • Discrete <-> Continuous

Deep Learning?

  • Technique that effectively learns representations from data using computational graphs and numerical optimization techniques.

Encoding?

  • In NLP, when we want to use samples (i.e., text) in machine learning algorithms, we need to represent them numerically - this is called the encoding process.
  • Numerical vectors are a simple way to represent text.
  • There are countless methods to perform such mapping or representation.
  • Among them, count-based representations all start with fixed-dimension vectors.

One-hot Representation

  • One-hot representation starts with a zero vector and sets the elements corresponding to words that appear in the sentence or document to 1.
  • Suppose we have two sentences: "Time flies like an arrow." and "Fruit flies like a banana."
    • After splitting the sentences into tokens, ignoring punctuation, and lowercasing everything, we get a vocabulary of 8 words.
      • vocabulary: {time, fruit, flies, like, a, an, arrow, banana}
    • Thus, each word can be represented as an 8-dimensional one-hot vector.
      • For example, time becomes [1, 0, 0, 0, 0, 0, 0, 0]
    • The one-hot representation of a phrase, sentence, or document is built by stacking the one-hot vectors of its constituent words.
    • The one-hot representation of "like a banana" becomes a 3x8 matrix.
      • [[0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1]]
      • Here, each row is an 8-dimensional one-hot vector.
    • We also often see a collapsed binary encoding, which represents a text/phrase as a single vector of vocabulary size (effectively the logical OR of the word-level one-hot vectors).
      • Here, 1 and 0 indicate the presence or absence of each word.
      • The binary encoding of "like a banana" is [0, 0, 0, 1, 1, 0, 0, 1]

from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns

corpus = ['Time flies like an arrow.', 'Fruit flies like a banana.']

# binary=True yields 0/1 presence indicators instead of raw counts
one_hot_vectorizer = CountVectorizer(binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
vocab = one_hot_vectorizer.get_feature_names_out()

sns.heatmap(one_hot, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])
plt.show()

  • The default value of CountVectorizer's binary parameter is False, so it produces TF (count) representations by default.
  • Setting binary=True, as above, produces the one-hot style binary encoding.
  • Also, by default CountVectorizer ignores single-character tokens (e.g., 'a').
  • CountVectorizer's default output format is a sparse matrix, which stores only non-zero values, so we call .toarray() to convert it into a dense matrix.

 

Both the TF representation and the TF-IDF representation are popular techniques in NLP. These representations have a long history in the field of Information Retrieval (IR) and are still actively used in modern commercial NLP systems.

Term-Frequency (TF) Representation

  • TF representation of phrases, sentences, and documents is created by simply adding up the one-hot representations of their constituent words.
  • For example, using the one-hot encoding method mentioned earlier, the TF representation of 'Fruit flies like time flies a fruit' would be as follows.
    • [1, 2, 2, 1, 1, 0, 0, 0]
    • This kind of representation is called 'Bag of Words: BoW'(단어 가방 모델).
  • Each element represents the number of times the corresponding word appears in the sentence.
    • In NLP, a dataset is called a corpus (plural: corpora).
  • The TF of a word w is denoted as TF(w).
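
To make the counting concrete, here is a minimal sketch of my own (not from the book) that reproduces the TF vector above by counting tokens against the hand-built 8-word vocabulary:

from collections import Counter

# Hand-built vocabulary, in the same order as the one-hot example above
vocabulary = ['time', 'fruit', 'flies', 'like', 'a', 'an', 'arrow', 'banana']

sentence = 'Fruit flies like time flies a fruit'
counts = Counter(sentence.lower().split())         # count each lowercased token
tf_vector = [counts[word] for word in vocabulary]  # one count per vocabulary word
print(tf_vector)  # [1, 2, 2, 1, 1, 0, 0, 0]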

Term-Frequency-Inverse-Document-Frequency (TF-IDF) Representation

  • TF representation assigns weights to words proportionally to their frequency of appearance.
  • However, taking a collection of patent documents as an example, common words like "claim" don't contain any specific information about a particular patent.
  • Conversely, rare words like "tetrafluoroethylene" don't appear frequently but well represent the characteristics of the patent document.
  • In such situations, Inverse Document Frequency (IDF) is more appropriate.
  • IDF lowers the scores of common tokens and increases the scores of rare tokens in vector representations.
$IDF(w) = \log\frac{N}{n_w}$
  • $n_w$ is the number of documents containing the word w, and N is the total number of documents.
  • The TF-IDF score is the product of TF and IDF: TF(w) * IDF(w).
  • For very common words that appear in all documents (i.e., $n_w = N$), IDF(w) is 0 (because $\log 1 = 0$), and consequently their TF-IDF score is also 0.
    • Therefore, these words are completely excluded.
  • Conversely, if a word appears very rarely, such as in only one document, the IDF reaches its maximum value of $\log N$ (because $n_w$ is 1).
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import seaborn as sns

# Reuses `corpus` and `vocab` from the CountVectorizer example above
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus).toarray()
sns.heatmap(tfidf, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])
plt.show()
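
Note that the heatmap values will not match the plain IDF(w) = log(N / n_w) formula exactly: by default, scikit-learn's TfidfVectorizer smooths the IDF and L2-normalizes each row. As a sanity check against the formula itself, here is a minimal sketch of my own (not from the book) that computes the unsmoothed, unnormalized TF-IDF for the two example sentences:

import numpy as np

vocabulary = ['time', 'fruit', 'flies', 'like', 'a', 'an', 'arrow', 'banana']
docs = [['time', 'flies', 'like', 'an', 'arrow'],
        ['fruit', 'flies', 'like', 'a', 'banana']]

tf = np.array([[doc.count(w) for w in vocabulary] for doc in docs])  # term frequencies
n_w = np.array([sum(w in doc for doc in docs) for w in vocabulary])  # document frequencies
idf = np.log(len(docs) / n_w)                                        # IDF(w) = log(N / n_w)
print(tf * idf)  # 'flies' and 'like' appear in both documents, so their scores are 0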

Since the goal of deep learning is representation learning, it typically does not encode inputs with heuristic methods such as TF-IDF.

Instead, neural network inputs are usually built from integer indices (a compact form of one-hot encoding) combined with a special embedding lookup layer.
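
As an illustration, here is a minimal PyTorch sketch of my own (not from the book) showing how integer token indices feed an embedding lookup layer; the word-to-index mapping and the embedding size of 4 are just assumptions for the example:

import torch
import torch.nn as nn

# Assumed word -> index mapping over the 8-word vocabulary from earlier
word_to_index = {'time': 0, 'fruit': 1, 'flies': 2, 'like': 3,
                 'a': 4, 'an': 5, 'arrow': 6, 'banana': 7}

# One learnable 4-dimensional vector per vocabulary word
embedding = nn.Embedding(num_embeddings=8, embedding_dim=4)

# "like a banana" as integer indices rather than explicit one-hot vectors
indices = torch.tensor([word_to_index[w] for w in ['like', 'a', 'banana']])
vectors = embedding(indices)  # looks up one embedding vector per token
print(vectors.shape)          # torch.Size([3, 4])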

Target Encoding

  • The form of target variables differs depending on the type of NLP problem being solved.
  • For example, in machine translation, summarization, and question-answering problems, the target is also text and is encoded using methods like one-hot encoding.
  • In fact, many NLP tasks use categorical labels.
    • The model must predict one label from a fixed set of labels.
    • The most common encoding method is to assign a unique index to each label, but this simple representation becomes problematic when the number of output labels grows too large.
    • Language modeling, which predicts the next word given the previous words, is an example of this: the label space becomes the entire vocabulary of the language.
  • Some NLP problems predict numerical values from a given text.
    • For example, you might need to assign a numerical grade to an essay or predict a restaurant review rating to one decimal place.
    • You might also need to predict a user's age group from their tweets.
  • There are several ways to encode numerical targets.
    • One simple method is to convert the targets into categorical bins such as '0-18', '19-25', and '26-30', and treat the task as an ordinal classification problem, as sketched below.
    • In such cases, you need to pay careful attention because target encoding has a huge impact on performance.
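
As an illustration of that binning idea, here is a minimal sketch of my own (not from the book); the age boundaries and the fallback bin are just assumptions for the example:

# Hypothetical bins for turning a numeric age into an ordinal class label
BINS = [(0, 18, '0-18'), (19, 25, '19-25'), (26, 30, '26-30')]

def encode_age(age):
    """Map a numeric age to (bin index, bin label)."""
    for index, (low, high, label) in enumerate(BINS):
        if low <= age <= high:
            return index, label
    return len(BINS), '31+'  # fallback bin for ages above the last boundary

print(encode_age(22))  # (1, '19-25')
print(encode_age(42))  # (3, '31+')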

Don't wait for an opportunity. Create it.
- Conor McGregor -