
[Kaggle Study] 2. Scale of Features

dongsunseng 2024. 10. 29. 13:54

During the data preprocessing step of data analysis, it is necessary to check the scale of the features.

What do you mean by "scale of feature"?

  • It simply means the range of values that a particular feature takes.
  • For example, if feature A has numerical values in the range 1 ~ 10 and feature B has values in the range 500 ~ 1000, we can say that these two features have a large difference in scale (a small sketch illustrating this follows the list).
  • Some algorithms, such as gradient descent, are very sensitive to the scale of the features; thus, we should adjust the scale during the data preprocessing step for better model performance.
  • When the difference in scale between features is large, the process of finding the optimal weights fluctuates significantly because the values vary widely during training.
  • In other words, because the gradient is large along the large-scale feature, the weights fluctuate heavily, which causes the model to converge unstably.
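
As a quick illustration, here is a minimal sketch with made-up data whose ranges simply mirror the A/B example above:

import numpy as np

# Toy data: feature A roughly in 1 ~ 10, feature B roughly in 500 ~ 1000
rng = np.random.default_rng(42)
x = np.column_stack([
    rng.uniform(1, 10, size=1000),      # feature A
    rng.uniform(500, 1000, size=1000),  # feature B
])

print(np.ptp(x, axis=0))  # per-feature range: roughly [9, 500]
print(np.std(x, axis=0))  # per-feature spread: feature B is dozens of times larger

A gradient-based model trained on x as-is would take much larger update steps along feature B than along feature A, which is exactly the unstable behavior described above.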

How can we adjust scale?

  • Although there are various scaling methods available, standardization is one of the most frequently used scaling techniques for neural networks.
  • Standardization means subtracting the mean and dividing by the standard deviation.
  • In other words, this process produces features with zero mean and unit variance.
  • We can easily use sklearn's StandardScaler class (a short sketch follows the NumPy example below).
  • Because the features then change at similar rates, the weight update process converges to the optimal value quickly.

Code example using only NumPy (no scikit-learn):

import numpy as np

# Per-feature mean and standard deviation computed from the training set
train_mean = np.mean(x_train, axis=0)
train_std = np.std(x_train, axis=0)
x_train_scaled = (x_train - train_mean) / train_std
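
The same transformation can be done with sklearn's StandardScaler mentioned above (a minimal sketch, assuming x_train is a 2-D NumPy array of shape (samples, features)):

from sklearn.preprocessing import StandardScaler

# fit_transform() learns the per-feature mean and standard deviation, then standardizes
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)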

Scaling the validation dataset

  • Standardizing only the training set and then running validation does not produce a valid validation process.
  • The validation dataset must also be standardized, but care must be taken not to scale the training and validation sets with different statistics.
  • If the validation set is scaled differently from the training set, the model will misinterpret the validation samples.
  • The same principle applies to the test set and to new samples processed by the model after deployment in production.
  • In a production environment, predictions are made for individual samples, so it is not possible to compute a meaningful mean or standard deviation from the incoming data itself.
  • Therefore, the validation set (and any later data) should be transformed using the mean and standard deviation derived from the training dataset, as in the line below and the scikit-learn sketch that follows.
# Standardize the validation set with the training statistics
x_val_scaled = (x_val - train_mean) / train_std
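
The same rule expressed with StandardScaler: fit on the training set only, then reuse the fitted scaler for validation and for new samples (a sketch; x_new is a hypothetical single production sample):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # statistics come from the training set only
x_val_scaled = scaler.transform(x_val)          # validation set reuses the training statistics

# Hypothetical production case: one incoming sample, scaled with the stored training statistics
x_new_scaled = scaler.transform(x_new.reshape(1, -1))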

 

 

Reference

 

Do it! 딥러닝 입문


tensorflow.blog

 

 

Your mindset is the most powerful tool you possess. Use it wisely.

- Max Holloway -

 
