What makes handling Categorical Variables important?
- Categorical variables - like colors, cities, or education levels - cannot be used directly in machine learning models, since these models only understand numbers.
- Categorical variable encoding solves this problem by converting these text-based categories into numerical formats that machines can process.
- The challenge lies in choosing the right encoding method that best preserves the meaning of your categories while making them machine-readable.
- Let's explore the different methods available for this transformation.
Types of Categorical Variables
- Nominal variables: Values without inherent order (cat, dog)
- Ordinal variables: Values with a meaningful order (low, medium, high)
- Binary variables: Values that take only two states (1, 0)
- Cyclic variables: Values that wrap around (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday); see the sketch below
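Cyclic variables need special care because the last value is adjacent to the first (Sunday is as close to Monday as Tuesday is). One common trick, shown here as a minimal sketch with days numbered 0 through 6, is to project each value onto a circle with sine and cosine:

import numpy as np

# Map day-of-week (0 = Monday, ..., 6 = Sunday) onto the unit circle,
# so that Sunday (6) ends up next to Monday (0) instead of far away.
days = np.arange(7)
day_sin = np.sin(2 * np.pi * days / 7)
day_cos = np.cos(2 * np.pi * days / 7)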
1. Label Encoding
- Assigns numbers sequentially to each category.
- Pro:
- Simple and memory efficient.
- Con:
- May introduce unintended ordinal relationships.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# Classes are numbered in alphabetical order: blue=0, green=1, red=2
encoded = encoder.fit_transform(['red', 'blue', 'green'])  # [2 0 1]
2. One-Hot Encoding
- Transforms each category into separate binary columns.
- Unlike Ordinal Encoding below, one-hot encoding does not assume any order between categories.
- Therefore, this method works well when there is no clear order in the categorical data (for example, "Red" is neither greater nor less than "Yellow").
- Categorical variables without such clear ordering are called nominal variables.
- One-hot encoding generally does not work well when there are many categories (it is typically not recommended when there are more than 15 categories).
- Pro:
- Treats categories fairly without ordering.
- Con:
- Increases dimensionality and memory usage.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # sparse_output needs sklearn >= 1.2
encoded = encoder.fit_transform([['red'], ['blue'], ['green']])
# Columns follow alphabetical order (blue, green, red):
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
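For quick experiments, pandas offers an equivalent one-liner; a minimal sketch (recent pandas versions return boolean columns):

import pandas as pd

# get_dummies builds one binary column per category, like OneHotEncoder.
colors = pd.Series(['red', 'blue', 'green'])
print(pd.get_dummies(colors))
#     blue  green    red
# 0  False  False   True
# 1   True  False  False
# 2  False   True  False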
3. Ordinal Encoding
- Used for categorical data with meaningful order.
- Categorical variables with such a meaningful order are called "ordinal variables".
- Example: Education level (High School, Bachelor's, Master's, PhD).
from sklearn.preprocessing import OrdinalEncoder
# Pass the categories explicitly so the encoding follows the real order
# rather than the default alphabetical order.
encoder = OrdinalEncoder(categories=[['High School', 'Master', 'PhD']])
encoded = encoder.fit_transform([['High School'], ['Master'], ['PhD']])  # [[0.] [1.] [2.]]
4. Feature Hashing
- Uses hash functions to convert categories into fixed-size vectors.
- Pro:
- Memory efficient.
- Can handle new categories.
- Con:
- Possibility of hash collisions.
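scikit-learn ships a FeatureHasher that implements this idea; a minimal sketch, where the city names and n_features=8 are chosen purely for illustration:

from sklearn.feature_extraction import FeatureHasher

# Hash each category string into a fixed 8-dimensional vector.
# Larger n_features means fewer hash collisions.
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([['Seoul'], ['Busan'], ['Seoul']]).toarray()
# Rows 0 and 2 are identical, and a category never seen during training
# would still get a vector of the same size, with no refit required.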
5. Target Encoding
- Replaces each category with the mean of the target variable for that category.
- Pro:
- Can improve predictive power.
- Con:
- Risk of overfitting.
def target_encode(category_column, target_column):
    # Replace each category with the mean target value observed for it.
    return category_column.map(target_column.groupby(category_column).mean())
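A common way to reduce the overfitting risk is smoothing: blend each category's mean with the global mean so that rare categories lean on the global statistic. A minimal sketch, where the weight m is a hypothetical choice:

def smoothed_target_encode(category_column, target_column, m=10):
    # m controls how many rows a category needs before its own mean
    # outweighs the global mean; rare categories shrink toward it.
    global_mean = target_column.mean()
    stats = target_column.groupby(category_column).agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    return category_column.map(smoothed)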
6. Count/Frequency Encoding
- Transforms categories based on their frequency.
def frequency_encode(category_column):
    # Replace each category with its relative frequency in the column.
    return category_column.map(category_column.value_counts(normalize=True))
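For example, applied to a small pandas Series (the city names are illustrative):

import pandas as pd

cities = pd.Series(['Seoul', 'Seoul', 'Busan', 'Daegu'])
print(frequency_encode(cities))
# 0    0.50
# 1    0.50
# 2    0.25
# 3    0.25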
Considerations when choosing a method:
- Data characteristics (presence of ordinal relationships)
- Categories with ordinal relationships:
- Ordinal Encoding might be appropriate
- Label Encoding could work since the numeric order has meaning
- Categories without ordinal relationships:
- One-Hot Encoding is usually better
- Label Encoding would be inappropriate as it suggests false ordering
- Number of categories
- Need to handle new categories
- Memory constraints
- Model characteristics
The world can be yours if you believe in yourself.
- Conor McGregor -