Kaggle Supplement

[Kaggle Extra Study] 5. Cross Validation 교차 검증

dongsunseng 2024. 10. 23. 19:21

Overfitting and Underfitting

Before we talk about what cross validation method is, we should first think about overfitting and underfitting in terms of machine learning.

 

In typical ML problems, we use the explanatory variables in the train data to predict the target variable in the test data. However, the test data has no target variable, which makes it difficult to validate the performance of the ML model. Reusing the train data for validation is not best practice because that data was already used to train the model.

 

To make the model generalize, meaning that it shows decent performance on data beyond the train set, we should apply special methods (such as cross validation). This prevents the model from being overfitted or underfitted.

 

  • Overfitting: When the model is excessively fitted to the train data so that it cannot generalize: it performs well only on the specific train data (the data used for training).
  • Underfitting: When the model is not sufficiently trained.

There are various methods for preventing the model from being overfitted:

  • Hold out
  • Cross Validation
  • Leave-one-out

The key point of these methods is to separate validation data from the train data so that the model generalizes.

 

You can find more about overfitting and underfitting from my other post:

 

[Kaggle Study] 4. Overfitting, Underfitting, Variance and Bias


dongsunseng.com


Hold out

The hold-out method simply divides the original train data into (1) train data and (2) validation data in a specific ratio (8:2, for example).
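As a minimal sketch, the 8:2 hold-out split above can be done with scikit-learn's `train_test_split` (the toy arrays here are illustrative, not from the original post):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 explanatory variables
y = np.array([0, 1] * 25)          # binary target

# 8:2 split; random_state fixes the shuffle for reproducibility
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_val))  # 40 10
```

The model is then fit on `X_train` and scored once on the fixed `X_val`.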

 

One thing we should be aware of when using the hold-out method is that the model can easily overfit if we tune too heavily against the validation score, because the validation data is fixed.

 

We can use the cross validation method to avoid this problem.

Cross Validation

k-fold cross validation

The cross validation method (1) divides the whole train data into k subsets, (2) treats each subset in turn as the validation dataset and uses the rest to train the model, and (3) repeats this process k times.
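The three steps above can be sketched with scikit-learn's `KFold` (the toy data is illustrative, not from the original post):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# k = 5: each fold serves as the validation set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # train on 8 samples, validate on the remaining 2
    print(fold, len(train_idx), len(val_idx))  # e.g. 0 8 2
```

Averaging the k validation scores gives a more stable performance estimate than a single hold-out split.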

 

Compared with the hold-out method, we can build a more generalized model because the model is validated on k different subsets of the data.

  • When the amount of data is too small, it's good to use cross-validation instead of splitting into a validation set. 
  • Generally, if you have around 100,000 data points, split them in a ratio of 8:1:1.
  • For datasets with over 1 million samples, divide them in a ratio of approximately 98:1:1.
  • Generally, if you can secure more than 10,000 samples for validation and test sets, it's better to allocate more samples to the training set.

There are several kinds of cross validation methods but these are the 2 most popular ones:

  • standard k-fold cross validation: most basic one which is explained above
    • usually 5 or 10 is used for k
    • pro: we can utilize all data for both training AND validation -> better generalization
    • con: we must repeat the training process k times
  • stratified k-fold cross validation: enhanced version of standard k-fold cross validation method that ensures the preservation of class proportion in both training and validation datasets
    • maintains the same class distribution in each fold as in the complete dataset
    • if the overall dataset has a positive:negative ratio of 3:7, each fold maintains the same ratio
    • works well with imbalanced datasets (e.g., medical diagnosis, fraud detection: dataset that has minority classes)
    • usually used for binary or multi-class classification problems
    • pro: prevents biased evaluation
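For an imbalanced dataset like the 3:7 example above, `StratifiedKFold` keeps the class ratio in every fold. A minimal sketch (the toy labels are illustrative, not from the original post):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 50 samples with a positive:negative ratio of 3:7
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0] * 5)
X = np.zeros((len(y), 1))  # dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # every validation fold preserves the 3:7 ratio: 3 positives, 7 negatives
    print(np.bincount(y[val_idx]))  # [7 3]
```

Note that unlike plain `KFold`, `split` takes `y` as well, since the labels are needed to stratify.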

Leave-one-out

The leave-one-out method is similar to k-fold cross validation, but every single sample of the train dataset becomes the validation dataset once.

 

In other words, we iterate as many times as there are train samples. Therefore, this method should only be used when the train dataset is not that large.

 

  • pro: less biased model -> better generalization
  • con: computational cost even higher than k-fold cross validation

There is also the leave-p-out method, which leaves out a specific number p of samples as the validation set at each iteration.
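Both variants are available in scikit-learn; a minimal sketch showing how quickly the number of splits grows (the toy data is illustrative):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.arange(12).reshape(6, 2)  # just 6 samples

loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 6: one split per sample

# leave-p-out with p=2 tries every pair: C(6, 2) = 15 splits
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X))  # 15
```

Since leave-p-out enumerates every combination of p held-out samples, its cost explodes even faster than leave-one-out as the dataset grows.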

 

Reference

 

GitHub - taehojo/getting_started_with_kaggle: source code collection for the book 쉽게 시작하는 캐글 데이터 분석 (Gilbut).


github.com

 

 

He that can have patience can have what he will.
- Benjamin Franklin -