When training machine learning models, it is necessary to preprocess missing values because they can greatly affect model performance. Accordingly, there are various methods to deal with missing values.
Types of Missing Values:
- Missing completely at random (MCAR)
- The case where missing values occur independently of other variables; in other words, completely at random.
- No correlation with specific variables.
- Independent of both observed and missing values.
- For example, data missing due to system errors, communication problems, etc.
- The distribution does not change before and after imputation.
- Deletion is viable if there are many observations.
- Missing at random (MAR)
- When missing values are related to specific variables but not correlated with the desired outcome.
- Missing values occur randomly.
- Can occur conditionally based on specific variable values.
- Can be estimated using various imputation techniques.
- For example, a sundial that cannot measure time at night: a clear cause for the missingness exists.
- Not missing at random (NMAR)
- When the missingness is related to the missing (unobserved) values themselves, not only to other observed variables.
- Missing values do not occur randomly.
- Influenced by both observed and missing values.
- Difficult to identify cause of missing values, and simple imputation methods alone are insufficient.
- Reasons for the missing data exist, but they are unusual and hard to identify.
- For example, when respondents intentionally conceal information and provide false responses.
At first glance, these distinctions might seem unclear. MCAR represents the ideal case where missing values occur completely at random, so simply removing them is acceptable. In contrast, both MAR and NMAR missing values arise from definite causal relationships; the key difference is that MAR has clearly identifiable causes, while NMAR's causes remain unknown.
Methods of Data Imputation
1. Do Nothing
- Some algorithms, such as XGBoost, automatically identify missing values and learn how to handle them in whatever way best improves the loss function.
- There are also algorithms that can simply ignore missing values, e.g. LightGBM(use_missing=False), as shown in the sketch below.
- However, many algorithms raise errors on missing values. To deal with this, we should handle them and build a cleaner training dataset.
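Code example (a minimal sketch of this behavior; it assumes the xgboost and lightgbm packages are installed, and the tiny X and y arrays are made up purely for illustration):
import numpy as np
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0], [4.0, 6.0],
              [5.0, np.nan], [6.0, 1.0], [np.nan, 2.0], [8.0, 7.0]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
# XGBoost treats np.nan as missing and learns a default split direction for it
xgb_model = XGBClassifier(n_estimators=10).fit(X, y)
# LightGBM can be told to ignore missing values entirely via use_missing=False
lgb_model = LGBMClassifier(n_estimators=10, use_missing=False).fit(X, y)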
2. Imputation Using Mean or Median Values
- Pro:
- Easy and fast.
- Reasonable for numerical variables when the dataset is not very large.
- Con:
- Not that accurate.
- Only applicable at the column level, without considering correlations between variables.
- Not suitable for encoded categorical variables.
- Unable to explain the uncertainty in the results of missing value imputation.
Code Example:
# Impute the values using the scikit-learn SimpleImputer class
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(strategy='mean')  # for median imputation replace 'mean' with 'median'
imp_mean.fit(train)
imputed_train_df = imp_mean.transform(train)
3. Imputation Using Most Frequent (Mode) or Zero/Constant Values
- Mode method's pro:
- Mode value can be used in categorical variables too.
- Mode method's con:
- Doesn't consider correlations between variables
- Creates bias in the data
- The Zero/Constant method simply fills the missing values with 0 or another fixed constant value (a constant-fill sketch follows the code example below).
Code Example:
# Impute the values using the scikit-learn SimpleImputer class
from sklearn.impute import SimpleImputer
imp_mode = SimpleImputer(strategy='most_frequent')
imp_mode.fit(train)
imputed_train_df = imp_mode.transform(train)
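For the zero/constant variant mentioned above, SimpleImputer handles this directly via strategy='constant'. A minimal sketch, reusing the same train DataFrame:
from sklearn.impute import SimpleImputer
imp_const = SimpleImputer(strategy='constant', fill_value=0)  # fill every missing value with 0
imputed_train_df = imp_const.fit_transform(train)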
4. Imputation Using k-NN
- k-nearest neighbors (k-NN) is an algorithm used for simple classification tasks.
- Uses 'feature similarity' to predict the values of new data points.
- A new data point is assigned a predicted value based on how close it is to the points in the training dataset.
- k-NN algorithm is useful for finding neighboring data for samples with missing values and replacing missing values based on the valid values of neighboring data samples.
- How does it work?
- The k-NN imputer first builds a KDTree from a complete dataset generated by basic mean imputation. A KDTree (k-dimensional tree) is a binary tree data structure used for storing and efficiently searching data in k-dimensional space.
- It then uses the KDTree to calculate the Nearest Neighbors.
- Once the k nearest neighbors are found, the algorithm takes a weighted average of the neighbors' values.
- Pro:
- Can achieve much more accurate results than mean, median or mode imputation methods.
- Different variants exist (e.g. choice of k and distance weighting), so the method can be adapted to the dataset.
- Con:
- Computationally resource-intensive.
- KNN is an algorithm that loads the entire training set into memory for calculations.
- Unlike SVM, k-NN is sensitive and vulnerable to outliers in the data.
Code Example:
import sys
from impyute.imputation.cs import fast_knn
sys.setrecursionlimit(100000)  # increase Python's recursion limit (fast_knn can recurse deeply)
# start the KNN imputation
imputed_training = fast_knn(train.values, k=30)
# sklearn version
from sklearn.impute import KNNImputer
train_knn = train.copy(deep=True)
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
train_knn['Age'] = knn_imputer.fit_transform(train_knn[['Age']])
5. Imputation Using Multivariate Imputation by Chained Equations (MICE)
- Performed by repeatedly filling in missing values multiple times.
- A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.
- It performs multiple regressions over random samples of the data, then averages the regression results and uses that value to impute the missing value.
- Multiple Imputations (MIs) are much better than single imputation because they provide a better measure of the uncertainty in the missing values.
- The chained equations approach is very flexible, capable of handling various variable types such as continuous real numbers or binary values, and can deal with correlations between variables and even intentional non-response patterns in surveys.
- In R's mice package, statistical modeling is performed with the with() function, and the results of the m imputed datasets are combined (averaged) using the pool() function.
- Imputation methods such as MICE may therefore work better in R than in Python.
- Steps:
- imputation: impute m datasets based on distribution
- analysis: analyze m completed datasets
- Pooling: synthesize results by calculating means, variances, and confidence intervals
Code Example:
import pandas as pd
from impyute.imputation.cs import mice
# train MICE
np_imputed = mice(df.values)
df_imputed = pd.DataFrame(np_imputed)
# sklearn ver
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
train_mice = train.copy(deep=True)
mice_imputer = IterativeImputer()
train_mice['Age'] = mice_imputer.fit_transform(train_mice[['Age']])
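Note that IterativeImputer only benefits from the chained-equations idea when it is given several columns, since each feature is modeled as a function of the others. A minimal multi-column sketch; the numeric column names ('Age', 'Fare', 'Pclass') are assumptions for illustration:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
num_cols = ['Age', 'Fare', 'Pclass']  # hypothetical numeric columns with correlations
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
train_mice = train.copy(deep=True)
train_mice[num_cols] = mice_imputer.fit_transform(train_mice[num_cols])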
6. Imputation Using Deep Learning (Datawig library)
- This method works well with categorical variables and non-numeric variables.
- Datawig is a library that trains machine learning models using deep neural networks(DNN) to fill in missing values in dataframes.
- It supports both CPU and GPU for model training.
- Pro:
- Considerably more accurate than other imputation methods.
- Provides functions like Feature Encoder that can handle categorical variables.
- Supports CPU and GPU.
- Con:
- Only applicable to single column imputation.
- Can take quite long with large datasets.
- Must manually specify other columns that have high correlation with or contain information about the target column to be imputed.
Code example:
import datawig
df_train, df_test = datawig.utils.random_split(train)
# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['1','2','3','4','5','6','7', 'target'],  # column(s) containing information about the column we want to impute
    output_column='0',  # the column we'd like to impute values for
    output_path='imputer_model'  # stores model data and metrics
)
# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)
# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
7. Delete row / column
- Removing entire rows or columns containing missing values.
- Deleting data risks losing records that contain important information, and it can also leave too little training data.
- Thus, we should only delete missing values from a dataset when their proportion is very small.
- Example of removal standards:
- Less than 10%: Delete or impute
- 10% ~ 50%: Regression or model based imputation
- Over 50%: Remove the variable entirely
- Pairwise Deletion:
- Pairwise deletion is used when values are missing completely at random, i.e. MCAR.
- During pairwise deletion, only the missing values are excluded from each individual calculation, rather than dropping whole rows.
- Most operations in pandas, such as mean and sum, intrinsically skip missing values (see the pandas sketch after this list).
- Listwise Deletion / Dropping rows:
- During listwise deletion, entire rows that contain missing values are deleted.
- As a result, it is also called "Complete Case Deletion".
- Like Pairwise deletion, listwise deletions are also only used for MCAR values.
- It is advisable to use it only when the number of missing values is very small, because a major chunk of the data, and hence a lot of information, can be lost.
- Dropping complete columns:
- If a column contains a lot of missing values, say more than 80%, and the feature is not significant, you might want to delete that feature.
- However, again, it is not a good methodology to delete data.
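A minimal pandas sketch of the deletion strategies above, assuming a DataFrame train with an 'Age' column (the column name is just for illustration):
import pandas as pd
# pairwise: pandas aggregations skip NaN by default (skipna=True)
mean_age = train['Age'].mean()
# listwise: drop every row that contains at least one missing value
train_listwise = train.dropna(axis=0)
# drop columns that are more than 80% missing (keep columns with at least 20% observed values)
train_cols = train.dropna(axis=1, thresh=int(0.2 * len(train)))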
8. Other methods
- Regression Imputation
- This method involves treating variables without missing values as features and the variable to be imputed as the target for a regression task.
- While this preserves relationships between variables since it predicts missing values based on other variables in the data, it doesn't maintain variability between predictions.
- Remember that in regression analysis the prediction has no random component: the imputed value lies exactly on the regression line.
- Stochastic Regression Imputation
- Similar to regression imputation, this method predicts missing values from a regression on related variables in the same dataset, then adds a random residual to each prediction to obtain the final imputed value (see the sketch after this list).
- It maintains all the advantages of the regression method while also benefiting from having a random component.
- Extrapolation and Interpolation
- Estimates missing values from other data samples obtained from the same subject within a certain range formed by a set of data points.
- This is only possible with longitudinal data(such as height data collected while tracking children's growth).
- Hot-Deck Imputation
- Replaces missing values by randomly sampling data from a set of correlated or similar variables.
- Advantageous when the range of possible values for the variable with missing data is limited.
- Additionally, since values are randomly selected, it adds some variability, which contributes to the accuracy of standard errors.
- Cold-Deck Imputation
- Similar to hot-deck imputation, this method replaces missing values by selecting from similar data in other variables.
- However, unlike hot-deck imputation, it doesn't use random sampling but selects values based on certain rules (for example, taking the k-th sample).
- This removes the random variation present in hot-deck imputation.
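A minimal sketch of the stochastic regression imputation described above, assuming a DataFrame train where 'Age' has missing values and 'Fare' and 'Pclass' are complete numeric predictors (the column names are illustrative):
import numpy as np
from sklearn.linear_model import LinearRegression
predictors = ['Fare', 'Pclass']  # hypothetical complete columns
observed = train['Age'].notna()
reg = LinearRegression().fit(train.loc[observed, predictors], train.loc[observed, 'Age'])
# residual standard deviation estimated from the observed part
resid_std = (train.loc[observed, 'Age'] - reg.predict(train.loc[observed, predictors])).std()
# regression prediction plus a random residual for each missing entry
missing = ~observed
pred = reg.predict(train.loc[missing, predictors])
train.loc[missing, 'Age'] = pred + np.random.normal(0, resid_std, size=missing.sum())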
Single Imputation vs. Multiple Imputation
- Single Imputation:
- Selecting a single value as a replacement for missing values.
- However, unless the missing values occurred randomly, this leads to biased parameter estimates such as mean, correlation, and regression coefficients.
- Due to this estimation bias, it might perform worse than simply deleting rows with missing values.
- Multiple Imputation:
- Selecting multiple estimated values as replacements for missing values.
- Among the imputation methods mentioned above, hot-deck and stochastic regression are commonly used because of their random component.
Imputation Algorithm Packages
autoimpute
- Handles categorical variables
- Can process variables individually
- Compatible with scikit-learn
- https://pypi.org/project/autoimpute/
impyute
- Provides common imputation methods like mice, em, knn
- Does not support categorical variables
- https://pypi.org/project/impyute/
Scikit-learn built-in Imputers
Datawig
- Works well with categorical and non-numeric variables using deep learning
- Imputes only one column at a time, so it can be slow.
- https://github.com/awslabs/datawig
missingpy
- KNN and random forest based imputation methods
- Can impute numeric variables and integer-encoded categorical variables
- https://github.com/epsilon-machine/missingpy
MIDA (Multiple Imputation based Denoising Autoencoders)
- Not a package, but a method that enables multiple imputation using a denoising autoencoder
- https://arxiv.org/abs/1705.02737
- Works with categorical variables (https://github.com/Oracen/MIDAS/blob/master/midas/midas_base.py)
- https://github.com/ambareeshsrja16/Python-Module-for-Missing-Data-Imputation
knn_impute
- Supports both numeric and categorical variables
- Distance-based imputation
- https://gist.github.com/YohanObadia
Surround yourself with people who believe in your dreams and want to see you succeed.
- Max Holloway -