
CIBMTR - Equity in post-HCT Survival Predictions #6 How To Train XGBoost with Survival Loss

dongsunseng 2025. 2. 5. 16:34

Annotation of Chris Deotte's discussion about "How To Train XGBoost with Survival Loss".

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141

 


How To Train XGBoost with Survival Loss

Targets Explained

  • For patients with efs=1, we observe they had an event and know exactly how long they were event free (namely efs_time).
  • For patients with efs=0, we observe that they were event free for efs_time, but we do not know whether they will eventually have an event.
  • So we only know they are event free for at least efs_time.
  • Survival models are new to me, so my starter notebook from yesterday does not use survival models directly.
  • Instead, I studied the metric and mathematically determined how to transform the two targets efs and efs_time into a single target y, then trained a regression model to predict a proxy for an inverse risk score (a minimal sketch of this idea follows this list).
  • My starter discussion is here.
  • Today I learned that XGBoost and CatBoost can train survival models directly.
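
As a minimal sketch of that single-target idea, assuming a Kaplan-Meier survival probability as the proxy (the author's exact transform lives in his starter discussion, so treat the details here as illustrative; train is the competition training DataFrame):

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
# efs=1 rows are observed events; efs=0 rows are right-censored at efs_time.
kmf.fit(train["efs_time"], event_observed=train["efs"])

# Each patient's estimated survival probability at their own efs_time:
# a low survival probability means high risk, so this y is a proxy for inverse risk.
train["y"] = kmf.survival_function_at_times(train["efs_time"]).values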

XGBoost Survival:Cox Model

  • Starting from my public starter notebook, we can train an XGBoost survival model as follows.
  • First, we make a new column called efs_time2 that encodes both efs and efs_time (for survival:cox, a negative time marks a right-censored observation):
train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs==0,"efs_time2"] *= -1
  • Then exclude this new column from the features by changing code cell #5 to:
RMV = ["ID","efs","efs_time","y","efs_time2"]
  • Then we train using this target:
y_train = train.loc[train_index,"efs_time2"]
  • And we change the XGBoost parameters to (a full end-to-end sketch follows the CV score below):
    objective='survival:cox',
    eval_metric='cox-nloglik',
CV Score: 0.672
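
Putting those pieces together, a minimal end-to-end sketch (the hyperparameters other than the objective and eval metric are illustrative placeholders, not the notebook's tuned values; a default RangeIndex on train is assumed so .loc works with KFold's positional indices):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

# Encode censoring for survival:cox - a negative efs_time means right-censored.
train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs == 0, "efs_time2"] *= -1

RMV = ["ID", "efs", "efs_time", "y", "efs_time2"]
FEATURES = [c for c in train.columns if c not in RMV]

oof_xgb = np.zeros(len(train))
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(train):
    model = xgb.XGBRegressor(
        objective="survival:cox",
        eval_metric="cox-nloglik",
        enable_categorical=True,  # assumes categorical columns use pandas "category" dtype
        n_estimators=1000,        # illustrative values
        learning_rate=0.05,
        max_depth=4,
    )
    model.fit(train.loc[train_index, FEATURES], train.loc[train_index, "efs_time2"])
    # survival:cox predictions are hazard ratios - larger means higher risk.
    oof_xgb[test_index] = model.predict(train.loc[test_index, FEATURES])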
  • Horikita Saku's comment about this part:
    • He tried:
      train["efs_time2"] = train.efs_time.copy()
      train.loc[train.efs==0,"efs_time2"] *= -1
    • and trained with:
      x_train = train.loc[train_index, FEATURES].copy()
      y_train = train.loc[train_index, "efs_time2"]
      x_valid = train.loc[test_index, FEATURES].copy()
      y_valid = train.loc[test_index, "efs_time2"]
    • with the params:
      eval_metric='cox-nloglik',
      objective='survival:cox',
      booster="dart",
    • and ran the evaluation (scoring) with:
      from metric import score
      y_true = train[["ID","efs","efs_time","race_group"]].copy()
      y_pred = train[["ID"]].copy()
      y_pred["prediction"] = oof_xgb
      m = score(y_true.copy(), y_pred.copy(), "ID")
      print(f"\nOverall CV for XGBoost = {m}")
    • His question: "However, I obtained an Overall CV for XGBoost = 0.9889430880402769, but the LB is 0.58, which seems to be definitely an anomaly. Do you have any ideas on what might be causing this? 🤔"
    • Reply by the author:
      • It is because of the missing code:
         RMV = ["ID","efs","efs_time","y","efs_time2"]
         FEATURES = [c for c in train.columns if c not in RMV]
      • "I'm guessing that your model is using efs_time2 as both the target and a feature. I will add this to the discussion above. Thanks for discovering this."
        • Details:
        • The core issue here is a data leakage problem.
        • Where the problem occurred:
          • A new target variable, efs_time2, was created but was accidentally used as a feature as well.
          • As a result, the model received target information as a feature, leading to an abnormally high CV score (0.98).
          • However, since that leaked target information is not available for the actual test data, the LB score was very low (0.58).
        • Solution:
          • RMV = ["ID","efs","efs_time","y","efs_time2"]
            FEATURES = [c for c in train.columns if c not in RMV]

          • The RMV list specifies columns that should be excluded from features
          • FEATURES selects only columns not included in RMV
          • This explicitly prevents efs_time2 from being used as a feature
        • Why this code is necessary:
          • If the target variable (efs_time2) is included in features, the model essentially "cheats"
          • It ends up using information during training that would never be available in real prediction scenarios
          • This makes it impossible to accurately evaluate the model's generalization performance (a small sanity-check sketch follows below).
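
A hypothetical guard against this class of bug (the assert is mine, not from the notebook): run it right after building FEATURES, before any training.

# Fail fast if any target-derived column slips into the feature list.
TARGET_COLS = {"efs", "efs_time", "efs_time2", "y"}
leaked = TARGET_COLS.intersection(FEATURES)
assert not leaked, f"Target leakage - remove these columns from FEATURES: {leaked}"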

CatBoost Survival:Cox Model

For CatBoost, we use the same efs_time2 target with loss_function="Cox" (a minimal sketch follows the CV score below).

CV Score: 0.670
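
A minimal sketch of the CatBoost variant, assuming the same efs_time2 target (negative values mark censored rows) and the same FEATURES list; the hyperparameters and the CAT_FEATURES list of categorical column names are illustrative assumptions:

from catboost import CatBoostRegressor

model = CatBoostRegressor(
    loss_function="Cox",
    iterations=1000,            # illustrative values
    learning_rate=0.05,
    cat_features=CAT_FEATURES,  # assumed list of categorical column names
    verbose=0,
)
model.fit(train.loc[train_index, FEATURES], train.loc[train_index, "efs_time2"])
risk = model.predict(train.loc[test_index, FEATURES])  # larger means higher risk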

Starter Notebook (the original regression approach)

CV Score: 0.681

UPDATE - We can use the Survival:AFT Model

  • We can also train XGBoost and CatBoost with the Survival:AFT loss (a minimal setup sketch follows below).
  • See discussions here and notebook examples here and here.
My Annotation & Explanation about AFT model here: 
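
For reference, a minimal sketch of the XGBoost Survival:AFT setup. Unlike survival:cox, AFT takes censoring as interval bounds on a DMatrix rather than a signed target; the distribution, scale, and other hyperparameters here are illustrative choices:

import numpy as np
import xgboost as xgb

# AFT encodes censoring as [lower, upper] bounds on the survival time:
#   observed event (efs=1):  lower = upper = efs_time
#   right-censored (efs=0):  lower = efs_time, upper = +inf
lower = train["efs_time"].values
upper = np.where(train["efs"] == 1, train["efs_time"].values, np.inf)

# assumes categorical columns use pandas "category" dtype
dtrain = xgb.DMatrix(train[FEATURES], enable_categorical=True)
dtrain.set_float_info("label_lower_bound", lower)
dtrain.set_float_info("label_upper_bound", upper)

params = {
    "objective": "survival:aft",
    "eval_metric": "aft-nloglik",
    "aft_loss_distribution": "normal",   # illustrative choice
    "aft_loss_distribution_scale": 1.0,  # illustrative choice
    "learning_rate": 0.05,
    "max_depth": 4,
}
model = xgb.train(params, dtrain, num_boost_round=500)

# AFT predicts expected survival time - larger means LOWER risk,
# so negate (or rank-invert) before feeding a risk-based metric.
pred_time = model.predict(xgb.DMatrix(train[FEATURES], enable_categorical=True))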

UPDATE - NN Starter Notebook

  • I published an NN starter notebook here with CV 0.670 and LB 0.676!
My Annotation & Explanation about NN model here: 

It's nice to be important, but it's more important to be nice.
- Dwayne Johnson -