
[Kaggle Study] #2 Porto Seguro's Safe Driver Prediction

dongsunseng 2024. 11. 19. 20:03

The second competition in Yuhan Lee's Kaggle curriculum: a binary classification competition on tabular data.

First Kernel

  • Data preparation and exploration kernel.

Insights / Summary:

1.

  • The info method (e.g. train.info()) shows each column's data type and non-null count, along with the total number of entries in the dataframe.
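A minimal usage sketch (the file path is an assumption; adjust it to your environment):

import pandas as pd

# Load the competition's training data
train = pd.read_csv('train.csv')

# Prints each column's dtype, its non-null count, and overall memory usage
train.info()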

2. 

  • Storing metadata about each variable can make data management easier.
  • Code example:
import pandas as pd

data = []
for f in train.columns:
    # Defining the role
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
         
    # Defining the level
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == float:
        level = 'interval'
    elif train[f].dtype == int:
        level = 'ordinal'
        
    # Initialize keep to True for all variables except for id
    keep = True
    if f == 'id':
        keep = False
    
    # Defining the data type 
    dtype = train[f].dtype
    
    # Creating a Dict that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)
    
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)

 

3. 

  • The train data has only about 3.645% target values of 1; the rest are 0.
  • This extreme imbalance can distort model evaluation.
  • Since the training set is fairly large, we can try undersampling the majority class (the kernel's undersampling code is shown below).
  • Undersampling:
    • Undersampling is one of the techniques used to handle imbalanced data.
    • Example:
      • In credit card fraud detection:
        • Normal transactions: 99,000 cases (majority class)
        • Fraudulent transactions: 1,000 cases (minority class)
      • Applying Undersampling
        • Randomly select only 1,000 cases from majority class (normal transactions)
        • Keep all 1,000 cases from minority class (fraudulent transactions)
        • Results in a balanced dataset with 1:1 ratio
      • Advantages:
        • Reduced training time
        • Resolves class imbalance
        • Lower memory usage
      • Disadvantages:
        • Potential information loss
        • Might miss important patterns from majority class
from sklearn.utils import shuffle

desired_apriori = 0.10

# Get the indices per target value
idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index

# Get original number of records per target value
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])

# Calculate the undersampling rate and resulting number of records with target=0
undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))

# Randomly select records with target=0 to get at the desired a priori
undersampled_idx = shuffle(idx_0, random_state=37, n_samples=undersampled_nb_0)

# Construct list with remaining indices
idx_list = list(undersampled_idx) + list(idx_1)

# Return undersample data frame
train = train.loc[idx_list].reset_index(drop=True)

# Result:
''' Rate to undersample records with target=0: 0.34043569687437886
Number of records with target=0 after undersampling: 195246 '''

 

4. 

  • Checking the cardinality of the categorical variables
    • Cardinality refers to the number of different values in a variable. 
    • As we might create dummy variables from the categorical variables, we need to check whether there are variables with many distinct values.
    • We should handle these variables differently as they would result in many dummy variables.
v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))

 

5. 

  • Target Encoding
    • Target encoding is one of the methods for converting categorical variables into numerical ones.
    • It replaces each category with the mean target value (dependent variable) of the data belonging to that category.
    • Advantages:
      • Effective even with many categories
      • No dimension increase compared to one-hot encoding
      • Directly reflects the relationship between categories and targets
    • Disadvantages:
      • Risk of overfitting
      • Requires careful handling when splitting training/test data
      • Need strategy for handling new categories
  • Data leakage:
    • Phenomenon where information that shouldn't be used "leaks" into the model during the training process.
    • Target Leakage
      • When future information that wouldn't be available at prediction time is included in training data
    • Train-Test Contamination
      • When test data information leaks into the training process
  • Problem Situation:
    • The current problem arose while considering how to encode the variable ps_car_11_cat, which has high cardinality.
    • The kernel author's first solution was to encode this variable with a supervised ratio (per-category target mean), which led to a data leakage problem.
      • The issue arose because ratios were calculated using target values and then applied to the entire dataset.
    • New solution:
      • The basic encoding calculation method remains the same:
        • Previous one:
          • cat_perc = train[['ps_car_11_cat', 'target']].groupby(['ps_car_11_cat'],as_index=False).mean()
        • New one:
          • averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
      • The difference is that noise and smoothing are added.
        • The previous method reflected the target values more directly in the encoded values because it used no smoothing or noise; with smoothing and noise, the data leakage problem is somewhat mitigated.
      • This is not a perfect solution; a K-fold (out-of-fold) encoding approach addresses the data leakage problem more fundamentally.
    • Why did we add noise?
      • Added slight noise to prevent overfitting
    • Smoothing?
      • Categories with fewer samples are weighted more heavily toward the overall mean (prior)
      • Categories with more samples are weighted more heavily toward their own category mean (see the numeric sketch after this list)
    • fillna(prior)
      • New categories are replaced with the overall mean
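As a rough numeric illustration of how that smoothing weight behaves, here is a small sketch using the kernel's own sigmoid formula (the prior of 0.10 and the category counts/means are hypothetical):

import numpy as np

min_samples_leaf = 100   # same values passed to target_encode below
smoothing_param = 10
prior = 0.10             # hypothetical overall target mean
cat_mean = 0.30          # hypothetical per-category target mean

# Sigmoid weight from the kernel: a category with exactly min_samples_leaf samples gets weight 0.5
for count in [10, 100, 1000]:
    w = 1 / (1 + np.exp(-(count - min_samples_leaf) / smoothing_param))
    encoded = prior * (1 - w) + cat_mean * w
    print(count, round(w, 3), round(encoded, 3))
# Small categories stay close to the prior; large categories approach their own mean.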

Previous Code:

cat_perc = train[['ps_car_11_cat', 'target']].groupby(['ps_car_11_cat'],as_index=False).mean()
cat_perc.rename(columns={'target': 'ps_car_11_cat_tm'}, inplace=True)
train = pd.merge(train, cat_perc, how='inner', on='ps_car_11_cat')
train.drop('ps_car_11_cat', axis=1, inplace=True)
meta.loc['ps_car_11_cat','keep'] = False  # Updating the meta

 

New Code:

# Code: https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features
import numpy as np
import pandas as pd

def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None, 
                  tst_series=None, 
                  target=None, 
                  min_samples_leaf=1, 
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior  
    """ 
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean 
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index 
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
# reading data
trn_df = pd.read_csv("../input/train.csv", index_col=0)
sub_df = pd.read_csv("../input/test.csv", index_col=0)

# Target encode ps_car_11_cat
trn, sub = target_encode(trn_df["ps_car_11_cat"], 
                         sub_df["ps_car_11_cat"], 
                         target=trn_df.target, 
                         min_samples_leaf=100,
                         smoothing=10,
                         noise_level=0.01)
trn.head(10)

 

6. 

  • Interval variables:
    • Has order (values can be compared)
    • Has consistent intervals (equal spacing)
    • No absolute zero point
      • For example, while we can say that the difference between 10°C and 20°C is the same as the difference between 20°C and 30°C, we cannot say that 20°C is "twice as hot" as 10°C.
  • Example:
    • Celsius/Fahrenheit temperature
    • IQ scores
    • Calendar years
    • Test scores

7. 

  • Taking a sample of the train data is a good way to speed up the EDA process (plotting, checking correlations, ...).
s = train.sample(frac=0.1)

 

8. 

  • If two variables show high correlation, how do we choose which one to keep?
    • Removing redundant variables is generally desirable.
  • We could perform Principal Component Analysis (PCA) on the variables to reduce the dimensions.
  • But when the number of correlated variables is rather low, we can let the model do the heavy lifting (a quick way to spot correlated pairs is sketched below).
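A minimal sketch for spotting highly correlated interval variables; the 0.9 threshold is arbitrary, and it reuses the 10% sample s and the meta dataframe from earlier:

# Correlation matrix of the interval variables, computed on the 10% sample
v = meta[(meta.level == 'interval') & (meta.keep)].index
corr = s[v].corr()

# Print variable pairs whose absolute correlation exceeds 0.9
for i, a in enumerate(v):
    for b in v[i + 1:]:
        if abs(corr.loc[a, b]) > 0.9:
            print(a, b, round(corr.loc[a, b], 3))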

PCA Example (Reducing numerical features with PCA):

1) Standardizing the numerical features before performing PCA

from sklearn.preprocessing import StandardScaler

# numFeatures: list of numerical feature column names, assumed to be defined beforehand
sc = StandardScaler()
train_nums_std = sc.fit_transform(train[numFeatures])

 

2) PCA

  • Set n_components to None to keep all principal components and their explained variance
from sklearn.decomposition import PCA

pca = PCA(n_components=None)
train_nums_pca = pca.fit_transform(train_nums_std)
varExp = pca.explained_variance_ratio_

 

3) Plot the cumulative explained variance as a function of the number of components

import numpy as np
import matplotlib.pyplot as plt

cumVarExplained = []
nb_components = []
counter = 1
for i in varExp:
    cumVarExplained.append(varExp[0:counter].sum())
    nb_components.append(counter)
    counter += 1

plt.subplots(figsize=(8, 6))
plt.plot(nb_components, cumVarExplained, 'bo-')
plt.ylabel('Cumulative Explained Variance')
plt.xlabel('Number of Components')
plt.ylim([0.0, 1.1])
plt.xticks(np.arange(1, len(nb_components), 1.0))
plt.yticks(np.arange(0.0, 1.1, 0.10))

  • With 7 components we already explain more than 90% of all variance in the features.
  • So we could reduce the number of features to half of the original numerical features.
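A small follow-up sketch that picks the number of components directly from the explained-variance curve (the 0.90 cutoff mirrors the observation above):

# Smallest number of components whose cumulative explained variance reaches 90%
n_keep = int(np.argmax(np.cumsum(varExp) >= 0.90)) + 1
print('Keeping {} components'.format(n_keep))

# Re-fit PCA with that many components and transform the standardized features
pca = PCA(n_components=n_keep)
train_nums_pca = pca.fit_transform(train_nums_std)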

9. 

Creating dummy variables

  • If the values of the categorical variables do not represent any order or magnitude, we can create dummy variables to deal with that.
    • For instance, category 2 is not twice the value of category 1.
  • We drop the first dummy variable as this information can be derived from the other dummy variables generated for the categories of the original variable.
v = meta[(meta.level == 'nominal') & (meta.keep)].index
print('Before dummification we have {} variables in train'.format(train.shape[1]))
train = pd.get_dummies(train, columns=v, drop_first=True)
print('After dummification we have {} variables in train'.format(train.shape[1]))

 

10. 

Creating interaction variables

from sklearn.preprocessing import PolynomialFeatures

v = meta[(meta.level == 'interval') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
# Note: in scikit-learn >= 1.2, use poly.get_feature_names_out(v) instead of get_feature_names(v)
interactions = pd.DataFrame(data=poly.fit_transform(train[v]), columns=poly.get_feature_names(v))
interactions.drop(v, axis=1, inplace=True)  # Remove the original columns
# Concat the interaction variables to the train data
print('Before creating interactions we have {} variables in train'.format(train.shape[1]))
train = pd.concat([train, interactions], axis=1)
print('After creating interactions we have {} variables in train'.format(train.shape[1]))
  • This adds extra interaction variables to the train data.
  • Interaction Variables:
    • Purpose of creating interaction variables is to model the 'interaction effects' between variables.
    • In the code, PolynomialFeatures is creating two types of interactions:
      • Products of two variables (e.g., X1 * X2)
      • Squares of each variable (e.g., X1², X2²) - included because of interaction_only=False
    • Main reasons for creating interaction variables:
      • Capturing non-linear relationships
      • Modeling synergy/interaction effects between variables
      • Increasing model expressiveness
  • Thanks to the get_feature_names method (renamed get_feature_names_out in newer scikit-learn versions) we can assign column names to these new variables.

11. 

Feature Selection

import numpy as np
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=.01)
selector.fit(train.drop(['id', 'target'], axis=1)) # Fit to train without id and target variables

f = np.vectorize(lambda x : not x) # Function to toggle boolean array elements

v = train.drop(['id', 'target'], axis=1).columns[f(selector.get_support())]
print('{} variables have too low variance.'.format(len(v)))
print('These variables are {}'.format(list(v)))
  • Removing features with low or zero variance
    • Classifiers these days are powerful enough to decide which features to keep or drop.
    • But there is one thing that we can do ourselves: removing features with no or a very low variance.
    • Sklearn has a handy method to do that: VarianceThreshold.
      • By default it removes features with zero variance.
      • This will not be applicable for this competition as we saw there are no zero-variance variables in the previous steps.
      • But if we removed features with less than 1% variance, we would remove 31 variables.
      • We would lose rather many variables if we selected based on variance alone.
      • But because we do not have so many variables, we'll let the classifier choose.
      • For data sets with many more variables this could reduce the processing time.
      • Sklearn also comes with other feature selection methods.
      • One of these methods is SelectFromModel in which you let another classifier select the best features and continue with these.

Code example:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = train.drop(['id', 'target'], axis=1)
y_train = train['target']

feat_labels = X_train.columns

rf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)

rf.fit(X_train, y_train)
importances = rf.feature_importances_

indices = np.argsort(rf.feature_importances_)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,feat_labels[indices[f]], importances[indices[f]]))
  • Here we base feature selection on the feature importances of a random forest.
  • With Sklearn's SelectFromModel you can then specify how many variables you want to keep.
  • You can set a threshold on the level of feature importance manually.
  • But we'll simply select the top 50% best variables.
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(rf, threshold='median', prefit=True)
print('Number of features before selection: {}'.format(X_train.shape[1]))
n_features = sfm.transform(X_train).shape[1]
print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])

# result:
# Number of features before selection: 162
# Number of features after selection: 81

train = train[selected_vars + ['target']]

Second Kernel

  •  

Insights / Summary:

1. 

 

Third Kernel

  • XGBoost CV kernel.

Insights / Summary:

1. Optimizing Number of Rounds

MAX_ROUNDS = 400
OPTIMIZE_ROUNDS = False
LEARNING_RATE = 0.07
EARLY_STOPPING_ROUNDS = 50  
# Note: I set EARLY_STOPPING_ROUNDS high so that (when OPTIMIZE_ROUNDS is set)
#       I will get lots of information to make my own judgment.  You should probably
#       reduce EARLY_STOPPING_ROUNDS if you want to do actual early stopping.
  • "Round" relates to how XGBoost operates:
    • XGBoost is a boosting algorithm, where learning occurs iteratively over multiple rounds.
    • In each round, a new tree is created and it compensates for prediction errors from previous rounds.
  • This kernel recommends initially setting MAX_ROUNDS fairly high and using OPTIMIZE_ROUNDS to find the optimal number of rounds (a minimal sketch of this workflow follows this list).
    • Two recommended methods for setting MAX_ROUNDS:
      1. Using best_ntree_limit as reference:
        • Look at the maximum value of best_ntree_limit across all cross-validation folds
        • If the model is well regularized, you can set it slightly higher than this value
      2.  Using verbose=True method:
        • Output detailed training process information
        • Find a number of rounds that works well across all folds
    • Example:
      • If best_ntree_limit values in 5-fold CV are [95, 98, 92, 97, 90]
      • Since the maximum value is 98, set MAX_ROUNDS to around 100 or 105
      • Note: This higher setting is only safe if the model is well regularized against overfitting
  • Then, turn off OPTIMIZE_ROUNDS and set MAX_ROUNDS to the identified optimal number.
  •  Problems with Early Stopping:
    • Choosing best round per fold leads to overfitting on validation data
    • Can result in:
      • Not optimal model for test data prediction
      • Excessive weight in ensemble/stacking scenarios
    • Alternative Approach (XGBoost Default):
      • Use the round where early stop occurs
        • Include lag time to verify no improvement
          • Lag time:
            • Waiting period to confirm that the model has truly stopped improving.
            • E.g. 
              • Round   |  Validation Score
                1       |  0.65
                2       |  0.70
                3       |  0.72
                4       |  0.72
                5       |  0.72
                6       |  0.72
                7       |  0.72
              • If lag = 3:
                • No improvement starts at round 4
                • Instead of stopping immediately, wait for 3 more rounds (5,6,7)
                • Verify that there's really no improvement
        • Results so far:
          • Solves overfitting issue
          • But may lead to underfitting
          • In this competition, in fact, 20-round early stopping per fold performed worse than a fixed number of rounds
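A minimal sketch of the MAX_ROUNDS / OPTIMIZE_ROUNDS workflow described above (not the kernel's exact code: the fold setup, the feature matrix X, the target y, and the XGBoost version-specific early-stopping API are assumptions):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

MAX_ROUNDS = 400            # same settings as in the kernel's config above
OPTIMIZE_ROUNDS = False
LEARNING_RATE = 0.07
EARLY_STOPPING_ROUNDS = 50

best_rounds = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, val_idx in kf.split(X, y):
    model = XGBClassifier(n_estimators=MAX_ROUNDS, learning_rate=LEARNING_RATE)
    if OPTIMIZE_ROUNDS:
        # Fit with a validation set and early stopping to see where each fold stops
        # (in XGBoost >= 2.0, early_stopping_rounds moves to the constructor)
        model.fit(X.iloc[train_idx], y.iloc[train_idx],
                  eval_set=[(X.iloc[val_idx], y.iloc[val_idx])],
                  early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                  verbose=False)
        best_rounds.append(model.best_iteration)
    else:
        # Fixed number of rounds chosen from a previous OPTIMIZE_ROUNDS run
        model.fit(X.iloc[train_idx], y.iloc[train_idx])

if OPTIMIZE_ROUNDS:
    # Set MAX_ROUNDS slightly above the largest per-fold best iteration
    print('Best iterations per fold:', best_rounds, '-> max:', max(best_rounds))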

2. 

# Compute gini

# from CPMP's kernel https://www.kaggle.com/cpmpml/extremely-fast-gini-computation
import numpy as np
from numba import jit

@jit
def eval_gini(y_true, y_prob):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_prob)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini
  • gini - Competition's evaluation metric
    • Gini coefficient is a performance metric commonly used for binary classification models, especially in finance and insurance.
  • Details:
    • @jit decorator uses Numba to compile the code for faster execution
      • Numba:
        • JIT (Just-In-Time) compiler that converts Python code to machine code.
      • Same code runs at C/C++ level speed with Numba.
      • Especially significant performance improvement in loops (for, while).
    • Gini Coefficient Characteristics:
      • Range: -1 to 1
      • Closer to 1 = better model
      • 0 = random prediction
      • Negative values = poor prediction
  • Process:
    1. Function takes two inputs:
      • y_true: actual values (0 or 1)
      • y_prob: predicted probabilities
    2.  Convert to numpy array
      • y_true = np.asarray(y_true) 
    3. Sort by predicted probabilities
      • y_true = y_true[np.argsort(y_prob)]
    4. Calculates Gini coefficient through loop:
      • ntrue: number of actual positive (1) cases
      • delta: cumulative number of negative (0) cases
      • Applies formula to calculate final Gini coefficient
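A quick sanity-check sketch: the normalized Gini used here equals 2 * AUC - 1, so eval_gini can be compared against scikit-learn's roc_auc_score (the random labels and probabilities below are purely illustrative):

from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=10000)   # illustrative binary labels
y_prob = rng.rand(10000)                 # illustrative predicted probabilities

print('eval_gini   :', eval_gini(y_true, y_prob))
print('2 * AUC - 1 :', 2 * roc_auc_score(y_true, y_prob) - 1)
# The two values should agree up to floating-point error.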

Fourth Kernel

  •  

Insights / Summary:

1. 


A positive attitude is the key to overcoming any obstacle in life.
- Max Holloway -