During the data preprocessing step of data analysis, it is necessary to check the scale of the features.
What do we mean by "scale of a feature"?
- It simply means the range of values that a particular feature possesses.
- For example, when feature A has numerical values in the 1 ~ 10 range and feature B has numerical values in the 500 ~ 1000 range, we can say that these two features have a large difference in scale.
- Some algorithms, such as Gradient Descent, are very sensitive to the scale of the features; thus, we should adjust the scale during the data preprocessing step for better model performance.
- When the difference in scale between features is large, the process of finding the optimal weights fluctuates significantly, because the weight updates vary widely in magnitude during the learning process.
- In other words, because the gradient along the large-scale feature is large, the weights fluctuate heavily, and we can say that this causes the model to converge unstably; the sketch below illustrates this.
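To make this concrete, here is a minimal sketch of that effect (the data, ranges, and true weights are made up for illustration): with plain mean-squared-error linear regression, the gradient for the large-scale feature dominates the update, which forces a tiny learning rate and makes convergence unstable.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical features: A in the 1 ~ 10 range, B in the 500 ~ 1000 range
X = np.column_stack([
    rng.uniform(1, 10, size=100),
    rng.uniform(500, 1000, size=100),
])
y = X @ np.array([2.0, 0.01]) + rng.normal(0, 1, size=100)

# One gradient-descent step for linear regression with MSE loss:
# each weight's gradient scales with its feature's values,
# so feature B's gradient is orders of magnitude larger than feature A's.
w = np.zeros(2)
grad = -2 * X.T @ (y - X @ w) / len(y)
print(grad)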
How can we adjust the scale?
- Although there are various scaling methods available, standardization is one of the most frequently used scaling techniques in neural networks.
- Standardization involves subtracting the mean and dividing by the standard deviation: z = (x - mean) / std.
- In other words, this process results in features with zero mean and unit variance (1).
- We can easily utilize sklearn's StandardScaler class (a short sketch follows this list).
- Because the features then change at similar rates, the weight update process converges quickly to the optimal value.
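A minimal sketch of the StandardScaler route (the toy array is a made-up stand-in for a real training set):

import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1.0, 500.0], [5.0, 750.0], [10.0, 1000.0]])  # toy data

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learns the mean/std, then transforms

print(x_train_scaled.mean(axis=0))  # ~0 for each feature
print(x_train_scaled.std(axis=0))   # ~1 for each feature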
Code example without sklearn, using only NumPy:

import numpy as np

# Per-feature statistics, computed across samples (axis=0)
train_mean = np.mean(x_train, axis=0)
train_std = np.std(x_train, axis=0)

# Standardize: each feature gets zero mean and unit variance
x_train_scaled = (x_train - train_mean) / train_std
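We can sanity-check the result: after standardization, each column of x_train_scaled should have a mean close to 0 and a standard deviation close to 1 (assuming x_train is a 2-D NumPy array).

print(np.allclose(x_train_scaled.mean(axis=0), 0))  # True
print(np.allclose(x_train_scaled.std(axis=0), 1))   # True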
Scaling the validation dataset
- Standardizing only the training set and then proceeding with validation will not result in a valid validation process.
- While the validation dataset must also be standardized, care must be taken not to scale the training and validation sets with different parameters (different means and standard deviations).
- If the validation set is scaled differently from the training set, the model will misinterpret the validation samples, because they no longer lie on the scale the model was trained on.
- This principle also applies when using the test set and when the deployed model processes new samples in production.
- In a production environment, predictions are often made for individual samples, so it is not possible to compute meaningful means or standard deviations from the incoming data itself.
- In all of these cases, the data (validation set, test set, or new samples) should be transformed using the mean and standard deviation derived from the training dataset.
x_val_scaled = (x_val - train_mean) / train_std  # reuse the training statistics
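The same rule expressed with StandardScaler: fit the scaler on the training set only, then reuse it to transform the validation set. A minimal sketch, with toy arrays standing in for real splits:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for real train/validation splits
x_train = np.array([[1.0, 500.0], [5.0, 750.0], [10.0, 1000.0]])
x_val = np.array([[2.0, 600.0], [8.0, 900.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # statistics come from the training set only
x_val_scaled = scaler.transform(x_val)          # reuse the training mean/std; never fit on x_val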
Your mindset is the most powerful tool you possess. Use it wisely.
- Max Holloway -