What makes handling Categorical Variables important?
- Categorical variables - like colors, cities, or education levels - cannot be directly used in machine learning models since these models only understand numbers.
- Categorical variable encoding solves this problem by converting these text-based categories into numerical formats that machines can process.
- The challenge lies in choosing the right encoding method that best preserves the meaning of your categories while making them machine-readable.
- Let's explore the different methods available for this transformation.
Types of Categorical Variables
- Nominal variables: Values without order (cat, dog)
- Ordinal variables: Values with order (low, medium, high)
- Binary variables: Values with exactly two states (1, 0)
- Cyclic variables: Values that repeat in a cycle (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday; see the sine/cosine sketch below)
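- The encoders below do not cover cyclic variables; a common trick is to map each value onto a circle with sine and cosine so that the last value stays close to the first. A minimal sketch, assuming days are indexed 0 through 6 (an illustrative convention, not a library default):
import numpy as np
days = np.arange(7)  # Monday=0 ... Sunday=6 (hypothetical indexing)
angle = 2 * np.pi * days / 7  # position of each day on a circle
day_sin, day_cos = np.sin(angle), np.cos(angle)  # two features that keep Sunday adjacent to Monday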
1. Label Encoding
- Assigns numbers sequentially to each category.
- Pro:
  - Simple and memory efficient.
- Con:
  - May introduce unintended ordinal relationships.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# LabelEncoder assigns codes in alphabetical order: blue=0, green=1, red=2
encoded = encoder.fit_transform(['red', 'blue', 'green'])  # [2, 0, 1]
2. One-Hot Encoding
- Transforms each category into separate binary columns.
- Unlike Ordinal Encoding below, one-hot encoding does not assume any order between categories.
- Therefore, this method works well when the categorical data has no clear order (for example, "Red" is neither greater nor less than "Yellow").
- Categorical variables without such clear ordering are called nominal variables.
- One-hot encoding generally does not work well when there are many categories (it is typically not recommended when there are more than 15 categories).
- Pro:
  - Treats categories fairly without ordering.
- Con:
  - Increases dimensionality and memory usage.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it easy to inspect
encoded = encoder.fit_transform([['red'], ['blue'], ['green']]).toarray()
# Columns follow alphabetical category order (blue, green, red):
# [[0, 0, 1],   <- red
#  [1, 0, 0],   <- blue
#  [0, 1, 0]]   <- green
3. Ordinal Encoding
- Used for categorical data with meaningful order.
- Categorical variables with such a meaningful order are called ordinal variables.
- Example: Education level (High school, Bachelor's, Master's, PhD).
from sklearn.preprocessing import OrdinalEncoder
# Pass the categories explicitly so the codes follow the intended order,
# not the default alphabetical order
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
encoded = encoder.fit_transform([['High School'], ['Master'], ['PhD']])  # [[0.], [2.], [3.]]
4. Feature Hashing
- Uses hash functions to convert categories into fixed-size vectors.
- Pro:
- Memory efficient.
- Can handle new categories.
- Con:
- Possibility of hash collisions.
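- A minimal sketch using scikit-learn's FeatureHasher (n_features=8 is an arbitrary size chosen for illustration; real projects usually use a much larger power of two):
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=8, input_type='string')
# Each sample is a list of string tokens; hashing needs no fit step,
# so unseen categories at inference time are hashed consistently
hashed = hasher.transform([['red'], ['blue'], ['green']])
print(hashed.toarray())  # shape (3, 8); two categories can collide in the same bucket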
5. Target Encoding
- Replaces each category with the mean of the target variable within that category.
- Pro:
  - Can improve predictive power.
- Con:
  - Risk of overfitting.
def target_encode(category_column, target_column):
    # Replace each category with the mean target value observed for that category
    return category_column.map(target_column.groupby(category_column).mean())
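- For example, on a toy DataFrame (hypothetical city/price data for illustration):
import pandas as pd
df = pd.DataFrame({'city': ['A', 'A', 'B', 'B', 'B'],
                   'price': [100, 200, 50, 60, 70]})
df['city_encoded'] = target_encode(df['city'], df['price'])
# 'A' -> 150.0 (mean of 100 and 200), 'B' -> 60.0 (mean of 50, 60, 70)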
6. Count/Frequency Encoding
- Replaces each category with its frequency (count or proportion) in the data.
def frequency_encode(category_column):
    # Replace each category with its relative frequency in the column
    return category_column.map(category_column.value_counts(normalize=True))
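- For example, on a toy column (hypothetical data for illustration):
import pandas as pd
city = pd.Series(['A', 'A', 'B', 'B', 'B'])
print(frequency_encode(city))
# 'A' -> 0.4 (2 of 5 rows), 'B' -> 0.6 (3 of 5 rows)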
Considerations when choosing a method:
- Data characteristics (presence of ordinal relationships)
  - Categories with ordinal relationships:
    - Ordinal Encoding might be appropriate
    - Label Encoding could work since the numeric order has meaning
  - Categories without ordinal relationships:
    - One-Hot Encoding is usually better
    - Label Encoding would be inappropriate as it suggests false ordering
- Number of categories
- Need to handle new categories
- Memory constraints
- Model characteristics
The world can be yours if you believe in yourself.
- Conor McGregor -