Annotation of a discussion post about feature engineering ideas:
https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550863
CIBMTR - Equity in post-HCT Survival Predictions: Improve prediction of transplant survival rates equitably for allogeneic HCT patients (www.kaggle.com)
Feature Engineering Ideas
- Hi everyone! My current best CV and LB scores (CV 0.683 and LB 0.688) come from just ensembling and/or stacking various models (without feature engineering).
- Each model is trained with different targets and different losses.
- I have not performed any feature engineering or data augmentation, nor used any external data yet.
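(For context, one common and minimal way to blend models that were trained with different targets and losses is rank averaging; the sketch below is purely illustrative and not the author's actual ensemble, and the prediction arrays and weights are hypothetical.)

```python
import numpy as np
import pandas as pd

def rank_average(*prediction_arrays, weights=None):
    # Convert each model's predictions to percentile ranks so that models
    # trained with different targets/losses live on a comparable scale,
    # then take a weighted sum of the ranked predictions.
    ranked = [pd.Series(p).rank(pct=True).to_numpy() for p in prediction_arrays]
    weights = weights or [1.0 / len(ranked)] * len(ranked)
    return np.sum([w * r for w, r in zip(weights, ranked)], axis=0)

# Hypothetical usage with two models' test predictions:
# ensemble = rank_average(preds_model_a, preds_model_b, weights=[0.5, 0.5])
```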
Categorical vs. Numerical
- Let's discuss feature engineering.
- This dataset has 35 categorical features and 22 numerical features (for a total of 57 features).
- However, 55 features look categorical, with only a few unique values.
- Only donor_age and age_at_hct look like true numerical features with many continuous values.
- It is true that 20 of the numerical features may be ordinal (meaning that the order of values matters), but for my NN, treating all features (except the two ages) as categorical worked best.
- So we could treat all features (except the two ages) as categorical and combine them creatively (a quick sketch follows below).
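A minimal sketch of this idea, assuming the competition dataframe is called `train`; the id/target column names ("ID", "efs", "efs_time") are assumed here and should be adjusted if different:

```python
import pandas as pd

# Treat every feature except the two age columns as categorical.
TRUE_NUMERICAL = ["donor_age", "age_at_hct"]
NON_FEATURES = ["ID", "efs", "efs_time"]  # assumed id/target columns, adjust as needed

feature_cols = [c for c in train.columns if c not in TRUE_NUMERICAL + NON_FEATURES]
for col in feature_cols:
    # the pandas "category" dtype can be consumed directly by GBDTs such as
    # LightGBM/XGBoost, or label-encoded afterwards for an NN embedding layer
    train[col] = train[col].astype("category")
```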
Feature Engineering
- Let's have a discussion about which feature engineering to try.
- One technique is to combine columns with train["new"] = train["col1"].astype("str") + "_" + train["col2"].astype("str").
- Then we have a new categorical feature.
- We can even combine 3, 4, 5, etc columns.
- When we do this the cardinality increases, so we can try advanced techniques like target encoding, count encoding, etc., to process the new high-cardinality feature.
- Another idea is to try mathematical combinations like train["new"] = function( train["col1"], train["col2"] ).
- Here the function could just multiply the columns, or it could do something more advanced like taking a product of the logs or taking the difference, etc. (both ideas are sketched in the code below).
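A rough sketch of both ideas, assuming a pandas dataframe `train` with hypothetical columns col1/col2 plus the two age columns mentioned above; count encoding is shown instead of target encoding, since the latter needs out-of-fold handling:

```python
import numpy as np

# 1) Combine two categorical columns into one higher-cardinality feature,
#    then count-encode it (map each combined value to its frequency).
train["col1_col2"] = train["col1"].astype("str") + "_" + train["col2"].astype("str")
train["col1_col2_count"] = train["col1_col2"].map(train["col1_col2"].value_counts())

# 2) Mathematical combinations of the truly numerical columns, e.g. the
#    difference of the two ages or a product of their logs.
train["age_diff"] = train["age_at_hct"] - train["donor_age"]
train["age_logprod"] = np.log1p(train["age_at_hct"]) * np.log1p(train["donor_age"])
```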
Data Augmentation
- Another idea to boost CV and LB is data augmentation.
- With tabular data and GBDT, one way to perform data augmentation is to make copies of the train data.
- Then for each copy, we can augment (i.e. modify/change) the data.
- Then we concatenate all the copies and train a GBDT on the new concatenated dataframe.
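A minimal sketch of this copy-and-perturb idea, assuming `train` is the dataframe and that only the two numerical age columns are perturbed; the number of copies and the noise level are purely illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
copies = [train]
for _ in range(2):  # two augmented copies, purely illustrative
    aug = train.copy()
    for col in ["donor_age", "age_at_hct"]:
        # add small Gaussian noise relative to each column's spread;
        # categorical columns are left untouched here, but could e.g. be randomly masked
        noise = rng.normal(0.0, 0.01 * aug[col].std(), size=len(aug))
        aug[col] = aug[col] + noise
    copies.append(aug)

# a GBDT would then be trained on the concatenated dataframe instead of `train`
train_augmented = pd.concat(copies, axis=0).reset_index(drop=True)
```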
External Data
- Another thing that helps is external data.
- Has anyone found any good external data sets?
Recursive Feature Reduction
- Another idea is to remove each feature one by one and see if the CV score and LB score increase.
- Sometimes there are features whose presence hurts CV score and LB score.
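A sketch of this leave-one-feature-out loop, assuming a feature matrix `X` and target `y` are already prepared; the model and the default scorer are placeholders (this competition's actual metric is a stratified concordance index, so a custom scorer would be needed in practice):

```python
from sklearn.model_selection import cross_val_score
import lightgbm as lgb  # any GBDT works here; LightGBM is just an example

model = lgb.LGBMRegressor(n_estimators=200)

# baseline CV with all features, then drop each feature in turn
baseline = cross_val_score(model, X, y, cv=5).mean()
for col in X.columns:
    score = cross_val_score(model, X.drop(columns=[col]), y, cv=5).mean()
    if score > baseline:
        print(f"Dropping {col} improves CV: {score:.4f} > {baseline:.4f}")
```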
Model Hyperparameter Optimization
- It is true that optimizing each model's hyperparameters in our ensemble will boost our overall CV score and LB score, but at this time I am more interested in discussing feature engineering.
- So let's discuss feature engineering!
Let's Discuss
- Let's discuss the ideas we are each trying in order to improve CV and LB score.
- So far the only public notebooks are using different models, but nobody has suggested or tried ways to modify, change, or increase the data.
Comments:
![](https://blog.kakaocdn.net/dn/brevOS/btsMcGO7cH8/FlWwoqQSmegp1xJeYlXIZk/img.png)
![](https://blog.kakaocdn.net/dn/c48vpf/btsMcd0XlOj/UeG1GZshr9cGUU4kgkNR8k/img.png)
You don't have to be great to start, but you have to start to be great.
- Zig Ziglar -