Kaggle Supplement

[Kaggle Extra Study] 16. Handling Categorical Variables

dongsunseng 2024. 11. 11. 00:06

What makes handling Categorical Variables important?

  • Categorical variables - like colors, cities, or education levels - cannot be directly used in machine learning models since these models only understand numbers.
  • Categorical variable encoding solves this problem by converting these text-based categories into numerical formats that machines can process.
  • The challenge lies in choosing the right encoding method that best preserves the meaning of your categories while making them machine-readable.
  • Let's explore the different methods available for this transformation.

Types of Categorical Variables

  1. Nominal variables: Values without order (cat, dog)
  2. Ordinal variables: Values with order (low, medium, high)
  3. Binary variables: Binary values (1, 0)
  4. Cyclic variables: Cyclical values (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)
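
A common trick for the cyclic type above (not shown in the original post) is to project the value onto a circle with sine and cosine, so that the "ends" of the cycle stay close together. A minimal sketch, assuming days are coded 0=Monday through 6=Sunday:

```python
import numpy as np

# Hypothetical sketch: map day-of-week codes (0=Monday ... 6=Sunday) onto a
# circle so that Sunday (6) and Monday (0) end up close, as they are in time.
days = np.array([0, 3, 6])  # Monday, Thursday, Sunday
day_sin = np.sin(2 * np.pi * days / 7)
day_cos = np.cos(2 * np.pi * days / 7)
# Each day becomes a (sin, cos) pair; distances on the circle respect the cycle.
```

With a plain numeric encoding, Sunday (6) would sit far from Monday (0); on the circle they are neighbors.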

1. Label Encoding

  • Assigns numbers sequentially to each category.
  • Pro:
    • Simple and memory efficient.
  • Con:
    • May introduce unintended ordinal relationships.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoded = encoder.fit_transform(['red', 'blue', 'green'])  # [2, 0, 1]
# LabelEncoder sorts classes alphabetically: blue=0, green=1, red=2

2. One-Hot Encoding

  • Transforms each category into separate binary columns.
  • Unlike Ordinal Encoding below, one-hot encoding does not assume any order between categories.
    • Therefore, this method works well when there is no clear order in categorical data (for example, "Red" is neither greater nor less than "Yellow").
    • Categorical variables without such clear ordering are called nominal variables.
    • One-hot encoding generally does not work well when there are many categories (it is typically not recommended when there are more than 15 categories).
  • Pro:
    • Treats categories fairly without ordering.
  • Con:
    • Increases dimensionality and memory usage.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; use .toarray() to view it densely
encoded = encoder.fit_transform([['red'], ['blue'], ['green']]).toarray()
# Columns follow sorted category order (blue, green, red):
# [[0, 0, 1],
#  [0, 1, 0],
#  [1, 0, 0]]

3. Ordinal Encoding

  • Used for categorical data with meaningful order.
  • Categorical variables with a meaningful order are called "ordinal variables".
  • Example: Education level (High School, Bachelor's, Master's, PhD).
from sklearn.preprocessing import OrdinalEncoder
# Pass the categories explicitly so the encoding follows the intended order,
# not the default alphabetical order.
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
encoded = encoder.fit_transform([['High School'], ['Master'], ['PhD']])
# [[0.], [2.], [3.]]

4. Feature Hashing

  • Uses hash functions to convert categories into fixed-size vectors.
  • Pro:
    • Memory efficient.
    • Can handle new categories.
  • Con:
    • Possibility of hash collisions.
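
A minimal sketch of feature hashing with scikit-learn's FeatureHasher; the city names and the choice of 8 output dimensions are hypothetical:

```python
from sklearn.feature_extraction import FeatureHasher

# n_features is a trade-off: smaller vectors save memory but raise collision risk.
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([['Seoul'], ['Tokyo'], ['Paris']]).toarray()
# Every row is an 8-dim vector regardless of how many categories exist,
# and an unseen category can be hashed without refitting:
new = hasher.transform([['Berlin']]).toarray()
```

Note there is no fit step to speak of: the hash function is stateless, which is exactly why new categories pose no problem.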

5. Target Encoding

  • Replaces categories with mean of target variable.
  • Pro:
    • Can improve predictive power.
  • Con:
    • Risk of overfitting.
# category_column and target_column are pandas Series of equal length
def target_encode(category_column, target_column):
    return category_column.map(target_column.groupby(category_column).mean())
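
A tiny usage sketch with hypothetical data (the helper is repeated so the snippet runs standalone):

```python
import pandas as pd

# Same helper as above, repeated so the snippet is self-contained
def target_encode(category_column, target_column):
    return category_column.map(target_column.groupby(category_column).mean())

df = pd.DataFrame({'city': ['Seoul', 'Seoul', 'Busan', 'Busan'],
                   'target': [1, 0, 1, 1]})
encoded = target_encode(df['city'], df['target'])
# Seoul -> mean(1, 0) = 0.5 ; Busan -> mean(1, 1) = 1.0
```

The overfitting risk mentioned above is easy to see here: with only two rows per city, the encoded value leaks a lot of target information.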

6. Count/Frequency Encoding

  • Transforms categories based on their frequency.
# category_column is a pandas Series; each category maps to its relative frequency
def frequency_encode(category_column):
    return category_column.map(category_column.value_counts(normalize=True))

Considerations when choosing a method:

  1. Data characteristics (presence of ordinal relationships)
    • Categories with ordinal relationships:
      • Ordinal Encoding might be appropriate
      • Label Encoding could work since the numeric order has meaning
    • Categories without ordinal relationships:
      • One-Hot Encoding is usually better
      • Label Encoding would be inappropriate as it suggests false ordering
  2. Number of categories
  3. Need to handle new categories
  4. Memory constraints
  5. Model characteristics
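
For consideration 3 (new categories), one concrete option is One-Hot Encoding with `handle_unknown='ignore'`. A minimal sketch; the color values are hypothetical:

```python
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' makes the encoder tolerate categories never seen
# during fitting: they become an all-zero row instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit([['red'], ['blue'], ['green']])
unseen = encoder.transform([['purple']]).toarray()
# unseen == [[0., 0., 0.]]
```

Feature Hashing handles the same situation without any special flag, which is one reason it is favored for high-cardinality, fast-changing categories.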

Reference

Categorical Variables - Kaggle Notebook (www.kaggle.com)

The world can be yours if you believe in yourself.
- Conor McGregor -