Kaggle Supplement

[Kaggle Extra Study] 5. Cross Validation 교차 검증

dongsunseng 2024. 10. 23. 19:21

Overfitting and Underfitting

Before we talk about what cross validation method is, we should first think about overfitting and underfitting in terms of machine learning.

 

In typical ML problems, we use the explanatory variables in the train data to predict the target variable in the test data. However, the test data has no target variable, which makes it difficult to validate the performance of the ML model. Reusing the train data for validation is not best practice because that data was already used to train the model.

 

To make the model generalize, meaning that it shows decent performance on data beyond the train set, we should apply special methods (such as cross validation). This prevents the model from being overfitted or underfitted.

 

  • Overfitting: When the model is excessively fitted to the train data so that it cannot generalize: it performs well only on the specific train data (the data used for training).
  • Underfitting: When the model is not sufficiently trained.

There are various methods for preventing the model from being overfitted:

  • Hold out
  • Cross Validation
  • Leave-one-out

The key point of these methods is to separate validation data from the train data so that the model generalizes.

 

You can find more about overfitting and underfitting from my other post:

 

[Kaggle Study] 4. Overfitting, Underfitting, Variance and Bias


dongsunseng.com


Hold out

The hold-out method simply divides the original train data into (1) train data and (2) validation data in a specific ratio (8:2, for example).
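As a minimal sketch, the 8:2 hold-out split above can be done with scikit-learn's `train_test_split` (the toy arrays here are illustrative, not from the original post):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 explanatory variables
y = np.array([0, 1] * 25)          # binary target

# 8:2 split; random_state fixes the shuffle for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_val))  # 40 10
```

The model is then fit on `X_train` and scored once on the fixed `X_val`.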

 

One thing we should be aware of when using the hold-out method is that the model can easily overfit if we tune too heavily against the validation score, because the validation data is fixed.

 

We can use the cross validation method to avoid this problem.

Cross Validation

k-fold cross validation

The cross validation method (1) divides the whole train data into k subsets, (2) treats each subset in turn as the validation dataset and uses the rest to train the model, and (3) repeats this process k times.
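The three steps above can be sketched with scikit-learn's `KFold` (the toy data is illustrative, not from the original post):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# k = 5: each fold serves as the validation set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # train on 8 samples, validate on the remaining 2
    print(fold, len(train_idx), len(val_idx))  # e.g. 0 8 2
```

Averaging the k validation scores gives a more stable performance estimate than a single hold-out split.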

 

Compared with the hold-out method, we can build a more generalized model because the model is validated on k different subsets of the data.

  • When the amount of data is too small, it's good to use cross-validation instead of splitting into a validation set. 
  • Generally, if you have around 100,000 data points, split them in a ratio of 8:1:1.
  • For datasets with over 1 million samples, divide them in a ratio of approximately 98:1:1.
  • Generally, if you can secure more than 10,000 samples for validation and test sets, it's better to allocate more samples to the training set.

There are several kinds of cross validation methods but these are the 2 most popular ones:

  • standard k-fold cross validation: most basic one which is explained above
    • usually 5 or 10 is used for k
    • pro: we can utilize all data for both training AND validation -> better generalization
    • con: we must repeat the training process k times
  • stratified k-fold cross validation: enhanced version of standard k-fold cross validation method that ensures the preservation of class proportion in both training and validation datasets
    • maintains the same class distribution in each fold as in the complete dataset
    • if the overall dataset has a positive:negative ratio of 3:7, each fold maintains the same ratio
    • works well with imbalanced datasets (e.g., medical diagnosis, fraud detection: dataset that has minority classes)
    • usually used for binary or multi-class classification problems
    • pro: prevents biased evaluation
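For an imbalanced dataset like the 3:7 example above, `StratifiedKFold` keeps the class ratio in every fold. A minimal sketch (the toy labels are illustrative, not from the original post):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 50 samples with a positive:negative ratio of 3:7
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0] * 5)
X = np.zeros((len(y), 1))  # dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # every validation fold preserves the 3:7 ratio: 3 positives, 7 negatives
    print(np.bincount(y[val_idx]))  # [7 3]
```

Note that unlike plain `KFold`, `split` takes `y` as well, since the labels are needed to stratify.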

Leave-one-out

The leave-one-out method is similar to k-fold cross validation, but every single sample of the train dataset becomes the validation dataset once.

 

In other words, we iterate as many times as there are train samples. Therefore, this method should only be used when the train dataset is not that large.

 

  • pro: less biased model -> better generalization
  • con: computational cost even higher than k-fold cross validation

There is also the leave-p-out method, which leaves out a specific number p of samples as the validation set at each iteration.
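Both variants are available in scikit-learn; a minimal sketch showing how quickly the number of splits grows (the toy data is illustrative):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.arange(12).reshape(6, 2)  # just 6 samples

loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 6: one split per sample

# leave-p-out with p=2 tries every pair: C(6, 2) = 15 splits
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X))  # 15
```

Since leave-p-out enumerates every combination of p held-out samples, its cost explodes even faster than leave-one-out as the dataset grows.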

 

Reference

 

GitHub - taehojo/getting_started_with_kaggle: source code collection for the book 쉽게 시작하는 캐글 데이터 분석 (Gilbut).


github.com

 

 

He that can have patience can have what he will.
- Benjamin Franklin -