Annotation of a discussion post about feature engineering ideas:
https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550863
CIBMTR - Equity in post-HCT Survival Predictions: Improve prediction of transplant survival rates equitably for allogeneic HCT patients (www.kaggle.com)
Feature Engineering Ideas
- Hi everyone! My current best CV and LB scores (CV 0.683 and LB 0.688) come from just ensembling and/or stacking various models (without feature engineering).
- Each model is trained with different targets and different losses.
- I have not performed any feature engineering or data augmentation, nor used any external data yet.
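(For context, one common and minimal way to blend models that were trained with different targets and losses is rank averaging; the sketch below is purely illustrative and not the author's actual ensemble, and the prediction arrays and weights are hypothetical.)

```python
import numpy as np
import pandas as pd

def rank_average(*prediction_arrays, weights=None):
    # Convert each model's predictions to percentile ranks so that models
    # trained with different targets/losses live on a comparable scale,
    # then take a weighted sum of the ranked predictions.
    ranked = [pd.Series(p).rank(pct=True).to_numpy() for p in prediction_arrays]
    weights = weights or [1.0 / len(ranked)] * len(ranked)
    return np.sum([w * r for w, r in zip(weights, ranked)], axis=0)

# Hypothetical usage with two models' test predictions:
# ensemble = rank_average(preds_model_a, preds_model_b, weights=[0.5, 0.5])
```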
Categorical vs. Numerical
- Let's discuss feature engineering.
- This dataset has 35 categorical features and 22 numerical features (for a total of 57 features).
- However, 55 features look categorical, with only a few unique values.
- Only donor_age and age_at_hct look like true numerical features with many continuous values.
- It is true that 20 of the numerical features may be ordinal (meaning that the order of values matters), but for my NN, treating all features (except the two ages) as categorical worked best.
- So we could treat all features (except the two ages) as categorical and combine them creatively (a quick sketch follows below).
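A minimal sketch of this idea, assuming the competition dataframe is called `train`; the id/target column names ("ID", "efs", "efs_time") are assumed here and should be adjusted if different:

```python
import pandas as pd

# Treat every feature except the two age columns as categorical.
TRUE_NUMERICAL = ["donor_age", "age_at_hct"]
NON_FEATURES = ["ID", "efs", "efs_time"]  # assumed id/target columns, adjust as needed

feature_cols = [c for c in train.columns if c not in TRUE_NUMERICAL + NON_FEATURES]
for col in feature_cols:
    # the pandas "category" dtype can be consumed directly by GBDTs such as
    # LightGBM/XGBoost, or label-encoded afterwards for an NN embedding layer
    train[col] = train[col].astype("category")
```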
Feature Engineering
- Let's have a discussion about which feature engineering to try.
- One technique is to combine columns with train["new"] = train["col1"].astype("str") + "_" + train["col2"].astype("str").
- Then we have a new categorical feature.
- We can even combine 3, 4, 5, etc columns.
- When we do this the cardinality increases, so we can try advanced techniques like target encoding, count encoding, etc., to process the new high-cardinality feature.
- Another idea is to try mathematical combinations like train["new"] = function( train["col1"], train["col2"] ).
- Here the function could just multiply the columns, or it could do something more advanced like taking a product of the logs or taking the difference, etc. (both ideas are sketched in the code below).
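A rough sketch of both ideas, assuming a pandas dataframe `train` with hypothetical columns col1/col2 plus the two age columns mentioned above; count encoding is shown instead of target encoding, since the latter needs out-of-fold handling:

```python
import numpy as np

# 1) Combine two categorical columns into one higher-cardinality feature,
#    then count-encode it (map each combined value to its frequency).
train["col1_col2"] = train["col1"].astype("str") + "_" + train["col2"].astype("str")
train["col1_col2_count"] = train["col1_col2"].map(train["col1_col2"].value_counts())

# 2) Mathematical combinations of the truly numerical columns, e.g. the
#    difference of the two ages or a product of their logs.
train["age_diff"] = train["age_at_hct"] - train["donor_age"]
train["age_logprod"] = np.log1p(train["age_at_hct"]) * np.log1p(train["donor_age"])
```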
Data Augmentation
- Another idea to boost CV and LB is data augmentation.
- With tabular data and GBDT, one way to perform data augmentation is to make copies of the train data.
- Then for each copy, we can augment (i.e. modify/change) the data.
- Then we concatenate all the copies and train a GBDT on the new concatenated dataframe.
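A minimal sketch of this copy-and-perturb idea, assuming `train` is the dataframe and that only the two numerical age columns are perturbed; the number of copies and the noise level are purely illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
copies = [train]
for _ in range(2):  # two augmented copies, purely illustrative
    aug = train.copy()
    for col in ["donor_age", "age_at_hct"]:
        # add small Gaussian noise relative to each column's spread;
        # categorical columns are left untouched here, but could e.g. be randomly masked
        noise = rng.normal(0.0, 0.01 * aug[col].std(), size=len(aug))
        aug[col] = aug[col] + noise
    copies.append(aug)

# a GBDT would then be trained on the concatenated dataframe instead of `train`
train_augmented = pd.concat(copies, axis=0).reset_index(drop=True)
```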
External Data
- Another thing that helps is external data.
- Has anyone found any good external data sets?
Recursive Feature Reduction
- Another idea is to remove each feature one by one and see if the CV score and LB score increase.
- Sometimes there are features whose presence hurts CV score and LB score.
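A sketch of this leave-one-feature-out loop, assuming a feature matrix `X` and target `y` are already prepared; the model and the default scorer are placeholders (this competition's actual metric is a stratified concordance index, so a custom scorer would be needed in practice):

```python
from sklearn.model_selection import cross_val_score
import lightgbm as lgb  # any GBDT works here; LightGBM is just an example

model = lgb.LGBMRegressor(n_estimators=200)

# baseline CV with all features, then drop each feature in turn
baseline = cross_val_score(model, X, y, cv=5).mean()
for col in X.columns:
    score = cross_val_score(model, X.drop(columns=[col]), y, cv=5).mean()
    if score > baseline:
        print(f"Dropping {col} improves CV: {score:.4f} > {baseline:.4f}")
```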
Model Hyperparameter Optimization
- It is true that optimizing each model's hyperparameters in our ensemble will boost our overall CV score and LB score, but at this time I am more interested in discussing feature engineering.
- So let's discuss feature engineering!
Let's Discuss
- Let's discuss the ideas we are each trying in order to improve CV and LB score.
- So far the only public notebooks are using different models, but nobody has suggested or tried ways to modify, change, or increase the data.
Comments:
![](https://blog.kakaocdn.net/dn/brevOS/btsMcGO7cH8/FlWwoqQSmegp1xJeYlXIZk/img.png)
![](https://blog.kakaocdn.net/dn/c48vpf/btsMcd0XlOj/UeG1GZshr9cGUU4kgkNR8k/img.png)
You don't have to be great to start, but you have to start to be great.
- Zig Ziglar -