
[Kaggle Extra Study] 7. Data Imputation

dongsunseng 2024. 10. 27. 14:53

While training machine learning models, it is necessary to preprocess missing values because they can greatly affect model performance. Accordingly, there are various methods for dealing with missing values.

Types of Missing Values:

  • Missing completely at random (MCAR)
    • Case where missing values occur independently of any other variable; in other words, completely at random.
      • No correlation with specific variables.
      • Independent of both observed and missing values.
    • For example, data missing due to system errors, communication problems, etc.
    • NO DISTRIBUTION CHANGE BEFORE AND AFTER IMPUTATION
    • Deletion is viable if there are many observations.
  • Missing at random (MAR)
    • When missingness is related to other observed variables but not to the missing value itself.
      • Missing values occur randomly.
      • Can occur conditionally based on specific variable values.
      • Can be estimated using various imputation techniques.
    • For example, a sundial cannot take measurements at night: a clear, observable cause for the missingness exists.
  • Not missing at random (NMAR)
    • When missingness depends on the missing values themselves.
      • Missing values do not occur randomly.
      • Influenced by both observed and unobserved (missing) values.
      • It is difficult to identify the cause of the missing values, and simple imputation methods alone are insufficient.
      • Definite reasons for the missing data exist, but they are unusual and hard to identify from the observed data.
    • For example, when survey respondents intentionally conceal information and give false responses.
At first glance, these distinctions might seem unclear. MCAR represents an ideal case where missing values occur completely at random, making their removal acceptable. In contrast, both MAR and NMAR missingness arise from definite causes; the key difference is that MAR's causes are clearly identifiable from other observed variables, while NMAR's causes remain unknown. The short sketch below simulates each mechanism.
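
Code example: a minimal sketch that simulates the three mechanisms on a toy DataFrame; the column names 'age' and 'income' and the missingness probabilities are made up for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 1000),
    "income": rng.normal(50_000, 10_000, 1000),
})

# MCAR: each income value goes missing with the same probability,
# regardless of any variable.
mcar = df.copy()
mcar.loc[rng.random(len(df)) < 0.1, "income"] = np.nan

# MAR: income is more likely to be missing for younger people,
# i.e. missingness depends only on the observed 'age' column.
mar = df.copy()
mar.loc[(df["age"] < 30) & (rng.random(len(df)) < 0.4), "income"] = np.nan

# NMAR: high earners hide their income, so missingness depends on the
# (unobserved) income value itself.
nmar = df.copy()
nmar.loc[(df["income"] > 60_000) & (rng.random(len(df)) < 0.4), "income"] = np.nan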

Methods of Data Imputation

1. Do Nothing

  • Some algorithms, such as XGBoost, automatically identify missing values and learn how to handle them during training so as to improve performance in terms of the loss function (see the sketch after this list).
  • There are also algorithms that simply ignore missing values: LightGBM (use_missing=False).
  • However, many algorithms raise errors on missing values. To deal with them, we have to handle the missing values ourselves and build a cleaner training dataset.
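
Code example: a minimal sketch of letting XGBoost train directly on data that still contains NaNs; the feature names and target are made up, and it assumes the xgboost package is installed.

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({"f0": rng.normal(size=200), "f1": rng.normal(size=200)})
y = (X["f0"] > 0).astype(int)                # toy binary target
X.loc[rng.random(200) < 0.2, "f1"] = np.nan  # inject missing values

# No imputation step: the booster learns a default split direction for NaNs.
model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
preds = model.predict(X)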

2. Imputation Using Mean or Median Values

  • Pro:
    • Easy and fast.
    • Reasonable for numerical variables when the dataset is not very large.
  • Con:
    • Not that accurate.
    • Only applicable at the column level, without considering correlations between variables.
    • Not suitable for encoded categorical variables.
    • Unable to explain the uncertainty in the results of missing value imputation.

Code Example:

# Impute the values using the scikit-learn SimpleImputer class
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(strategy='mean')  # for median imputation, replace 'mean' with 'median'
imp_mean.fit(train)
imputed_train_df = imp_mean.transform(train)  # returns a NumPy array with missing values filled

 

3. Imputation Using Most Frequent (Mode) or Zero/Constant Values

  • Mode method's pro:
    • The mode can be used for categorical variables too.
  • Mode method's con: 
    • Doesn't consider correlations between variables
    • Creates bias in the data
  • The zero/constant method simply fills in the missing values with 0 or with a chosen constant value.

Code Example:

# Impute the values using the scikit-learn SimpleImputer class

from sklearn.impute import SimpleImputer

imp_mode = SimpleImputer(strategy='most_frequent')  # use strategy='constant', fill_value=0 for zero imputation
imp_mode.fit(train)
imputed_train_df = imp_mode.transform(train)

 

4. Imputation Using k-NN

  • k-nearest neighbors (k-NN) is a simple algorithm used for classification.
  • Uses 'feature similarity' to predict the values of new data points.
  • A new data point is assigned a predicted value based on how close it is to the points in the training dataset.
  • k-NN algorithm is useful for finding neighboring data for samples with missing values and replacing missing values based on the valid values of neighboring data samples.
  • How does it work?
    • The KNN imputation algorithm first builds a KDTree from a complete matrix produced by basic mean imputation. A KDTree (k-dimensional tree) is a binary tree data structure used for storing and efficiently searching points in k-dimensional space.
    • It then uses the KDTree to calculate the Nearest Neighbors.
    • Once k-NNs are found, the algorithm takes the weighted average of the neighboring data.
  • Pro:
    • Can be much more accurate than the mean, median, or mode imputation methods (how much depends on the dataset).
  • Con:
    • Computationally resource-intensive.
    • KNN is an algorithm that loads the entire training set into memory for calculations.
    • Unlike SVM, KNN is sensitive to outliers in the data.

Code Example:

import sys
from impyute.imputation.cs import fast_knn
sys.setrecursionlimit(100000)  # increase Python's recursion limit (fast_knn can recurse deeply)

# impyute version: KNN imputation over the whole feature matrix
imputed_training = fast_knn(train.values, k=30)

# scikit-learn version
from sklearn.impute import KNNImputer
train_knn = train.copy(deep=True)

knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
train_knn['Age'] = knn_imputer.fit_transform(train_knn[['Age']])

 

5. Imputation Using Multivariate Imputation by Chained Equations (MICE)

  • Performed by repeatedly filling in missing values multiple times.
    • A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.
    • It performs multiple regressions over random samples of the data, then averages the regression results and uses that average to impute the missing value.
  • Multiple imputations (MIs) are much better than single imputation because they provide a better measure of the uncertainty in the missing values.
  • The chained equations approach is very flexible, capable of handling various variable types such as continuous real numbers or binary values, and can deal with correlations between variables and even intentional non-response patterns in surveys.
  • In R's mice package, statistical modeling is performed on the imputed datasets using the with() function, and the results are combined by averaging the m imputation sets using the pool() function.
  • Imputation methods such as MICE are often better supported in R than in Python.
  • Steps:
    • Imputation: impute m datasets based on the distribution
    • Analysis: analyze the m completed datasets
    • Pooling: synthesize results by calculating means, variances, and confidence intervals

Code Example:

import pandas as pd
from impyute.imputation.cs import mice

# run MICE imputation (impyute version)
np_imputed = mice(train.values)
df_imputed = pd.DataFrame(np_imputed, columns=train.columns)

# scikit-learn version
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
train_mice = train.copy(deep=True)

mice_imputer = IterativeImputer()
train_mice['Age'] = mice_imputer.fit_transform(train_mice[['Age']])
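
The IterativeImputer call above produces a single completed dataset. As a minimal sketch of the imputation / analysis / pooling steps listed earlier (assuming train contains only numeric columns; not the exact procedure of the mice packages), m different imputations can be drawn with sample_posterior=True and a statistic pooled across them:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

m = 5
estimates = []
for i in range(m):
    # sample_posterior=True makes each run draw different plausible values
    imp = IterativeImputer(sample_posterior=True, random_state=i)
    completed = imp.fit_transform(train)       # imputation step; analysis would use this dataset
    estimates.append(completed.mean(axis=0))   # analysis step: e.g. compute the column means

pooled = np.mean(estimates, axis=0)            # pooling step: average the m estimates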

 

6. Imputation Using Deep Learning (Datawig library)

  • This method works well with categorical and other non-numeric variables.
  • Datawig is a library that trains machine learning models using deep neural networks (DNNs) to fill in missing values in dataframes.
  • It supports both CPU and GPU for model training.
  • Pro:
    • Considerably more accurate than other imputation methods.
    • Provides functions like Feature Encoder that can handle categorical variables.
    • Supports CPU and GPU.
  • Con:
    • Only applicable to single column imputation.
    • Can take quite long with large datasets.
    • Must manually specify other columns that have high correlation with or contain information about the target column to be imputed.

Code example:

import datawig

df_train, df_test = datawig.utils.random_split(train)

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['1', '2', '3', '4', '5', '6', '7', 'target'],  # column(s) containing information about the column we want to impute
    output_column='0',           # the column we'd like to impute values for
    output_path='imputer_model'  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)

 

7. Delete row / column

  • Removing entire rows or columns containing missing values.
  • Deleting data risks losing important information contained in it. Moreover, it can leave too little training data.
  • Thus, we should only delete the missing values from a dataset if their proportion is very small.
  • Example of removal standards:
    • Less than 10%: Delete or impute
    • 10% ~ 50%: Regression or model based imputation
    • Over 50%: Remove the variable entirely

  • Pairwise Deletion:
    • Pairwise deletion is used when values are missing completely at random, i.e. MCAR.
    • During pairwise deletion, only the missing values are skipped: each computation uses every case that has the values it needs.
    • All aggregation operations in pandas, such as mean and sum, intrinsically skip missing values in this way.
  • Listwise Deletion / Dropping rows:
    • During listwise deletion, complete rows (which contain the missing values) are deleted.
    • As a result, it is also called "Complete Case Deletion".
    • Like pairwise deletion, listwise deletion is also only used for MCAR values.
    • It is advisable to use it only when the number of missing values is very small, because otherwise a major chunk of data, and hence a lot of information, can be lost (see the sketch after this list).
  • Dropping complete columns:
    • If a column contains a lot of missing values, say more than 80%, and the feature is not significant, you might want to delete that feature.
    • However, again, it is not a good methodology to delete data.
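
Code example: a minimal sketch of the deletion strategies above, using the same hypothetical 'train' DataFrame as the other examples.

# Pairwise deletion: pandas statistics skip NaNs column by column by default.
col_means = train.mean(numeric_only=True)

# Listwise deletion / dropping rows: remove every row that has any missing value.
train_listwise = train.dropna(axis=0, how="any")

# Dropping complete columns: remove columns whose missing ratio exceeds 80%.
missing_ratio = train.isna().mean()
train_reduced = train.drop(columns=missing_ratio[missing_ratio > 0.8].index)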

8. Other methods

  • Regression Imputation
    • This method involves treating variables without missing values as features and the variable to be imputed as the target for a regression task.
    • While this preserves relationships between variables, since missing values are predicted from the other variables in the data, it does not preserve variability among the imputed values.
      • In regression analysis the fitted line has no random component, so every imputed value lies exactly on the regression line.
  • Stochastic Regression Imputation
    • Similar to regression imputation, this method predicts each missing value with a regression on related variables in the same dataset, but adds a random residual to the prediction (a minimal sketch follows after this list).
    • It maintains all the advantages of the regression method while also benefiting from having a random component.
  • Extrapolation and Interpolation
    • Estimates missing values from other observations of the same subject, either inside (interpolation) or outside (extrapolation) the range formed by the existing data points.
    • This is only possible with longitudinal data(such as height data collected while tracking children's growth).
  • Hot-Deck Imputation
    • Replaces missing values by randomly sampling values from similar or correlated records (donors).
    • Advantageous when the range of possible values for variables with missing data is limited.
    • Additionally, since values are randomly selected, it adds some variability, which contributes to the accuracy of standard errors.
  • Cold-Deck Imputation
    • Similar to hot-deck imputation, this method replaces missing values by selecting values from similar records.
    • However, unlike hot-deck imputation, it doesn't use random sampling but selects values based on certain rules (for example, taking the k-th sample).
    • This removes the random variation present in hot-deck imputation.
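
Code example: a minimal sketch of stochastic regression imputation, assuming hypothetical columns 'age' (with missing values) and 'fare' (fully observed, numeric) in the 'train' DataFrame.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
observed = train.dropna(subset=["age"])
missing = train[train["age"].isna()]

# 1) Plain regression imputation: predict age from fare.
reg = LinearRegression().fit(observed[["fare"]], observed["age"])
pred = reg.predict(missing[["fare"]])

# 2) Stochastic part: add noise drawn from the residual distribution,
#    so the imputed values do not all lie exactly on the regression line.
residual_std = (observed["age"] - reg.predict(observed[["fare"]])).std()
train.loc[train["age"].isna(), "age"] = pred + rng.normal(0, residual_std, size=len(pred))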

Single Imputation vs. Multiple Imputation

  • Single Imputation:
    • Selecting a single value as a replacement for missing values.
    • However, unless the missing values occurred randomly, this leads to biased parameter estimates such as mean, correlation, and regression coefficients.
    • Due to this estimation bias, it might perform worse than simply deleting rows with missing values.
  • Multiple Imputation:
    • Selecting multiple estimated values as replacements for missing values.
    • Among the imputation methods mentioned above, hot-deck and stochastic regression are commonly used because of their random component.

Imputation Algorithm Packages

  • autoimpute
  • impyute
  • Scikit-learn built-in Imputers
  • Datawig
  • missingpy
  • MIDA (Multiple Imputation based Denoising Autoencoders)
  • knn_impute

 

Reference

 

A Guide to Handling Missing values in Python (www.kaggle.com)

6 Different Ways to Compensate for Missing Data (Data Imputation with examples) (towardsdatascience.com)

[ Python ] imputation algorithm package 정리 (data-newbie.tistory.com)

Seven Ways to Make up Data: Common Methods to Imputing Missing Data (www.theanalysisfactor.com)

 

 

Surround yourself with people who believe in your dreams and want to see you succeed.

- Max Holloway -