
[Kaggle Study] 2. Scale of Features

dongsunseng 2024. 10. 29. 13:54

During the data preprocessing step of data analysis, it is necessary to check the scale of the features.

What do you mean by "scale of feature"?

  • It simply means the range of values that a particular feature takes.
  • For example, if feature A has numerical values in the range 1 ~ 10 and feature B has values in the range 500 ~ 1000, we can say that these two features have a large difference in scale (a small sketch illustrating this follows the list).
  • Some algorithms, such as gradient descent, are very sensitive to the scale of the features; thus, we should adjust the scale during the data preprocessing step for better model performance.
  • When the difference in scale between features is large, the process of finding the optimal weights fluctuates significantly because the values vary widely during training.
  • In other words, because the gradient is large along the large-scale feature, the weights fluctuate heavily, which causes the model to converge unstably.
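
As a quick illustration, here is a minimal sketch with made-up data whose ranges simply mirror the A/B example above:

import numpy as np

# Toy data: feature A roughly in 1 ~ 10, feature B roughly in 500 ~ 1000
rng = np.random.default_rng(42)
x = np.column_stack([
    rng.uniform(1, 10, size=1000),      # feature A
    rng.uniform(500, 1000, size=1000),  # feature B
])

print(np.ptp(x, axis=0))  # per-feature range: roughly [9, 500]
print(np.std(x, axis=0))  # per-feature spread: feature B is dozens of times larger

A gradient-based model trained on x as-is would take much larger update steps along feature B than along feature A, which is exactly the unstable behavior described above.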

How can we adjust scale?

  • Although there are various scaling methods available, standardization is one of the most frequently used scaling techniques for neural networks.
  • Standardization means subtracting the mean and dividing by the standard deviation.
  • In other words, this process produces features with zero mean and unit variance.
  • We can easily use sklearn's StandardScaler class (a short sketch follows the NumPy example below).
  • Because the features then change at similar rates, the weight update process converges to the optimal value quickly.

Code example using only NumPy (no scikit-learn):

import numpy as np

# Per-feature mean and standard deviation computed from the training set
train_mean = np.mean(x_train, axis=0)
train_std = np.std(x_train, axis=0)
x_train_scaled = (x_train - train_mean) / train_std
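
The same transformation can be done with sklearn's StandardScaler mentioned above (a minimal sketch, assuming x_train is a 2-D NumPy array of shape (samples, features)):

from sklearn.preprocessing import StandardScaler

# fit_transform() learns the per-feature mean and standard deviation, then standardizes
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)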

Scaling the validation dataset

  • Standardizing only the training set and then running validation does not produce a valid validation process.
  • The validation dataset must also be standardized, but care must be taken not to scale the training and validation sets with different statistics.
  • If the validation set is scaled differently from the training set, the model will misinterpret the validation samples.
  • The same principle applies to the test set and to new samples processed by the model after deployment in production.
  • In a production environment, predictions are made for individual samples, so it is not possible to compute a meaningful mean or standard deviation from the incoming data itself.
  • Therefore, the validation set (and any later data) should be transformed using the mean and standard deviation derived from the training dataset, as in the line below and the scikit-learn sketch that follows.
# Standardize the validation set with the training statistics
x_val_scaled = (x_val - train_mean) / train_std
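
The same rule expressed with StandardScaler: fit on the training set only, then reuse the fitted scaler for validation and for new samples (a sketch; x_new is a hypothetical single production sample):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # statistics come from the training set only
x_val_scaled = scaler.transform(x_val)          # validation set reuses the training statistics

# Hypothetical production case: one incoming sample, scaled with the stored training statistics
x_new_scaled = scaler.transform(x_new.reshape(1, -1))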

 

 

Reference

 

Do it! 딥러닝 입문


tensorflow.blog

 

 

Your mindset is the most powerful tool you possess. Use it wisely.

- Max Holloway -

 
