
CIBMTR - Equity in post-HCT Survival Predictions #6 How To Train XGBoost with Survival Loss

dongsunseng 2025. 2. 5. 16:34

Annotation of Chris Deotte's discussion about "How To Train XGBoost with Survival Loss".

https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141

 


How To Train XGBoost with Survival Loss

Targets Explained

  • For patients with efs=1, we observe they had an event and know exactly how long they were event free (namely efs_time).
  • For patients with efs=0, we observe that they were event free for efs_time, but we do not know whether they will eventually have an event.
  • So we only know they are event free for at least efs_time.
  • Survival models are new to me, so my starter notebook from yesterday does not use survival models directly.
  • Instead, I studied the metric and mathematically determined how to transform the two targets efs and efs_time into a single target y, then trained a regression model to predict a proxy for an inverse risk score (a minimal sketch of this idea follows this list).
  • My starter discussion is here.
  • Today I learned that XGBoost and CatBoost can train survival models directly.
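
As a minimal sketch of that single-target idea, assuming a Kaplan-Meier survival probability as the proxy (the author's exact transform lives in his starter discussion, so treat the details here as illustrative; train is the competition training DataFrame):

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
# efs=1 rows are observed events; efs=0 rows are right-censored at efs_time.
kmf.fit(train["efs_time"], event_observed=train["efs"])

# Each patient's estimated survival probability at their own efs_time:
# a low survival probability means high risk, so this y is a proxy for inverse risk.
train["y"] = kmf.survival_function_at_times(train["efs_time"]).values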

XGBoost Survival:Cox Model

  • Starting from my public starter notebook, we can train an XGBoost survival model as follows.
  • First, we make a new column called efs_time2 that encodes both efs and efs_time (for survival:cox, a negative time marks a right-censored observation):
train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs==0,"efs_time2"] *= -1
  • Then exclude this new column from the features by changing code cell #5 to:
RMV = ["ID","efs","efs_time","y","efs_time2"]
  • Then we train using this target:
y_train = train.loc[train_index,"efs_time2"]
  • And we change the XGBoost parameters to (a full end-to-end sketch follows the CV score below):
    objective='survival:cox',
    eval_metric='cox-nloglik',
CV Score: 0.672
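
Putting those pieces together, a minimal end-to-end sketch (the hyperparameters other than the objective and eval metric are illustrative placeholders, not the notebook's tuned values; a default RangeIndex on train is assumed so .loc works with KFold's positional indices):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

# Encode censoring for survival:cox - a negative efs_time means right-censored.
train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs == 0, "efs_time2"] *= -1

RMV = ["ID", "efs", "efs_time", "y", "efs_time2"]
FEATURES = [c for c in train.columns if c not in RMV]

oof_xgb = np.zeros(len(train))
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(train):
    model = xgb.XGBRegressor(
        objective="survival:cox",
        eval_metric="cox-nloglik",
        enable_categorical=True,  # assumes categorical columns use pandas "category" dtype
        n_estimators=1000,        # illustrative values
        learning_rate=0.05,
        max_depth=4,
    )
    model.fit(train.loc[train_index, FEATURES], train.loc[train_index, "efs_time2"])
    # survival:cox predictions are hazard ratios - larger means higher risk.
    oof_xgb[test_index] = model.predict(train.loc[test_index, FEATURES])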
  • Horikita Saku's comment about this part:
    • He tried:
      train["efs_time2"] = train.efs_time.copy()
      train.loc[train.efs==0,"efs_time2"] *= -1
    • and trained with:
      x_train = train.loc[train_index, FEATURES].copy()
      y_train = train.loc[train_index, "efs_time2"]
      x_valid = train.loc[test_index, FEATURES].copy()
      y_valid = train.loc[test_index, "efs_time2"]
    • with the params:
      eval_metric='cox-nloglik',
      objective='survival:cox',
      booster="dart",
    • and ran the evaluation (scoring) with:
      from metric import score
      y_true = train[["ID","efs","efs_time","race_group"]].copy()
      y_pred = train[["ID"]].copy()
      y_pred["prediction"] = oof_xgb
      m = score(y_true.copy(), y_pred.copy(), "ID")
      print(f"\nOverall CV for XGBoost = {m}")
    • His question: "However, I obtained an Overall CV for XGBoost = 0.9889430880402769, but the LB is 0.58, which seems to be definitely an anomaly. Do you have any ideas on what might be causing this? 🤔"
    • Reply by the author:
      • It is because of the missing code:
         RMV = ["ID","efs","efs_time","y","efs_time2"]
         FEATURES = [c for c in train.columns if c not in RMV]
      • "I'm guessing that your model is using efs_time2 as both the target and a feature. I will add this to the discussion above. Thanks for discovering this."
        • Details:
        • The core issue here is a data leakage problem.
        • Where the problem occurred:
          • A new target variable, efs_time2, was created but was accidentally used as a feature as well.
          • As a result, the model received target information as a feature, leading to an abnormally high CV score (0.98).
          • However, since that leaked target information is not available for the actual test data, the LB score was very low (0.58).
        • Solution:
          • RMV = ["ID","efs","efs_time","y","efs_time2"]
            FEATURES = [c for c in train.columns if c not in RMV]

          • The RMV list specifies columns that should be excluded from features
          • FEATURES selects only columns not included in RMV
          • This explicitly prevents efs_time2 from being used as a feature
        • Why this code is necessary:
          • If the target variable (efs_time2) is included in features, the model essentially "cheats"
          • It ends up using information during training that would never be available in real prediction scenarios
          • This makes it impossible to accurately evaluate the model's generalization performance (a small sanity-check sketch follows below).
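
A hypothetical guard against this class of bug (the assert is mine, not from the notebook): run it right after building FEATURES, before any training.

# Fail fast if any target-derived column slips into the feature list.
TARGET_COLS = {"efs", "efs_time", "efs_time2", "y"}
leaked = TARGET_COLS.intersection(FEATURES)
assert not leaked, f"Target leakage - remove these columns from FEATURES: {leaked}"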

CatBoost Survival:Cox Model

For CatBoost, we use the same efs_time2 target with loss_function="Cox" (a minimal sketch follows the CV score below).

CV Score: 0.670
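
A minimal sketch of the CatBoost variant, assuming the same efs_time2 target (negative values mark censored rows) and the same FEATURES list; the hyperparameters and the CAT_FEATURES list of categorical column names are illustrative assumptions:

from catboost import CatBoostRegressor

model = CatBoostRegressor(
    loss_function="Cox",
    iterations=1000,            # illustrative values
    learning_rate=0.05,
    cat_features=CAT_FEATURES,  # assumed list of categorical column names
    verbose=0,
)
model.fit(train.loc[train_index, FEATURES], train.loc[train_index, "efs_time2"])
risk = model.predict(train.loc[test_index, FEATURES])  # larger means higher risk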

Starter Notebook (the original regression approach)

CV Score: 0.681

UPDATE - We can use the Survival:AFT Model

  • We can also train XGBoost and CatBoost with the Survival:AFT loss (a minimal setup sketch follows below).
  • See discussions here and notebook examples here and here.
My Annotation & Explanation about AFT model here: 
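
For reference, a minimal sketch of the XGBoost Survival:AFT setup. Unlike survival:cox, AFT takes censoring as interval bounds on a DMatrix rather than a signed target; the distribution, scale, and other hyperparameters here are illustrative choices:

import numpy as np
import xgboost as xgb

# AFT encodes censoring as [lower, upper] bounds on the survival time:
#   observed event (efs=1):  lower = upper = efs_time
#   right-censored (efs=0):  lower = efs_time, upper = +inf
lower = train["efs_time"].values
upper = np.where(train["efs"] == 1, train["efs_time"].values, np.inf)

# assumes categorical columns use pandas "category" dtype
dtrain = xgb.DMatrix(train[FEATURES], enable_categorical=True)
dtrain.set_float_info("label_lower_bound", lower)
dtrain.set_float_info("label_upper_bound", upper)

params = {
    "objective": "survival:aft",
    "eval_metric": "aft-nloglik",
    "aft_loss_distribution": "normal",   # illustrative choice
    "aft_loss_distribution_scale": 1.0,  # illustrative choice
    "learning_rate": 0.05,
    "max_depth": 4,
}
model = xgb.train(params, dtrain, num_boost_round=500)

# AFT predicts expected survival time - larger means LOWER risk,
# so negate (or rank-invert) before feeding a risk-based metric.
pred_time = model.predict(xgb.DMatrix(train[FEATURES], enable_categorical=True))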

UPDATE - NN Starter Notebook

  • I published an NN starter notebook here with CV 0.670 and LB 0.676!
My Annotation & Explanation about NN model here: 

It's nice to be important, but it's more important to be nice.
- Dwayne Johnson -