Annotation of Chris Deotte's discussion about "How To Train XGBoost with Survival Loss".
https://www.kaggle.com/competitions/equity-post-HCT-survival-predictions/discussion/550141
How To Train XGBoost with Survival Loss
- This competition involves training survival models.
- We need to predict risk scores which are inversely proportional to how long a patient is event free.
- XGBoost can train survival models! (This discussion is a continuation of my first discussion here).
- Annotation for the previous discussion here: https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-5-How-To-Get-Started-Understanding-the-Metric
Targets Explained
- For patients with efs=1, we observe they had an event and know exactly how long they were event free (namely efs_time).
- For patients with efs=0, we observe that they were event free for efs_time but do not know if eventually they will have an event or not.
- So we only know that they were event free for at least efs_time (a small illustrative example follows at the end of this list).
- Survival models are new to me, so my starter notebook from yesterday does not use survival models directly.
- Instead, I studied the metric and mathematically determined how to transform the two targets efs and efs_time into a single target y, then trained a regression model to predict a proxy for the inverse risk score.
- My starter discussion is here.
- Today I learned that XGBoost and CatBoost can train survival models directly.
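As a small illustration (hypothetical rows, not actual competition data), the two target columns are read like this:
import pandas as pd

# Hypothetical example rows (not real competition data)
demo = pd.DataFrame({
    "ID": [1, 2],
    "efs": [1, 0],             # 1 = event observed, 0 = censored
    "efs_time": [12.0, 30.0],  # time event free
})
# ID 1: event observed, event free for exactly 12 units of time.
# ID 2: censored, event free for at least 30 units; we never observe
#       whether an event eventually happens.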
XGBoost Survival:Cox Model
- Starting from my public starter notebook, we can train an XGBoost survival model as follows (the individual steps are consolidated into a sketch below).
- First we make a new column called efs_time2 which includes the information of both efs and efs_time:
train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs==0,"efs_time2"] *= -1
- Then exclude this new column from the features by changing code cell #5 to:
RMV = ["ID","efs","efs_time","y","efs_time2"]
- Then we train using this target:
y_train = train.loc[train_index,"efs_time2"]
- And we change XGBoost parameters to:
objective='survival:cox',
eval_metric='cox-nloglik',
CV Score: 0.672
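Putting the steps above together, here is a minimal single-fold sketch (FEATURES, train_index, and test_index come from the starter notebook's CV loop; the hyperparameters are illustrative, not the author's exact settings):
import xgboost as xgb

# Signed target: positive efs_time for events, negative for censored rows
train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs==0, "efs_time2"] *= -1

RMV = ["ID","efs","efs_time","y","efs_time2"]
FEATURES = [c for c in train.columns if not c in RMV]

x_train = train.loc[train_index, FEATURES].copy()
y_train = train.loc[train_index, "efs_time2"]
x_valid = train.loc[test_index, FEATURES].copy()
y_valid = train.loc[test_index, "efs_time2"]

model = xgb.XGBRegressor(
    objective="survival:cox",
    eval_metric="cox-nloglik",
    learning_rate=0.05,          # illustrative hyperparameters
    max_depth=3,
    n_estimators=2000,
    early_stopping_rounds=100,
    enable_categorical=True,     # assumes categorical columns use dtype "category"
)
model.fit(x_train, y_train, eval_set=[(x_valid, y_valid)], verbose=False)

# Predictions are proportional-hazard risk scores: higher means higher risk
oof_preds = model.predict(x_valid)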
- Horikita Saku's comment about this part:
- He tried:
train["efs_time2"] = train.efs_time.copy()
train.loc[train.efs==0,"efs_time2"] *= -1
- and trained with:
x_train = train.loc[train_index, FEATURES].copy()
y_train = train.loc[train_index, "efs_time2"]
x_valid = train.loc[test_index, FEATURES].copy()
y_valid = train.loc[test_index, "efs_time2"]
- the params were:
eval_metric='cox-nloglik',
objective='survival:cox',
boosting_type= "dart",
- and ran the eval (scoring) with:
from metric import score
y_true = train[["ID","efs","efs_time","race_group"]].copy()
y_pred = train[["ID"]].copy()
y_pred["prediction"] = oof_xgb
m = score(y_true.copy(), y_pred.copy(), "ID")
print(f"\nOverall CV for XGBoost =",m) - However, I obtained an Overall CV for XGBoost = 0.9889430880402769, but the LB is 0.58, which seems to be definitely an anomaly. Do you have any ideas on what might be causing this? 🤔
- Reply by the author:
- It is because of the lack of this code:
RMV = ["ID","efs","efs_time","y","efs_time2"]
FEATURES = [c for c in train.columns if not c in RMV]
- "I'm guessing that your model is using efs_time2 as both the target and a feature. I will add this to the discussion above. Thanks for discovering this."
- Details:
- The core issue here is a data leakage problem
- Where the problem occurred:
- A new target variable efs_time2 was created, but it was accidentally used as a feature as well
- As a result, the model received target information as a feature, leading to an abnormally high CV score (0.98)
- However, since there's no such leakage in the actual test data, the LB score was very low (0.58)
- Solution:
- RMV = ["ID","efs","efs_time","y","efs_time2"]
FEATURES = [c for c in train.columns if not c in RMV]
- The RMV list specifies columns that should be excluded from features
- FEATURES selects only columns not included in RMV
- This explicitly prevents efs_time2 from being used as a feature
- Why this code is necessary:
- If the target variable (efs_time2) is included in features, the model essentially "cheats"
- It ends up using information during training that would never be available in real prediction scenarios
- This makes it impossible to accurately evaluate the model's generalization performance
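A quick sanity check along these lines (a minimal sketch, assuming train, RMV, and FEATURES are defined as above) catches the leak before any training happens:
# No target or ID column may appear in the feature list
leaked = [c for c in FEATURES if c in RMV]
assert not leaked, f"Leaky columns used as features: {leaked}"

# In particular, the survival target itself must never be a feature
assert "efs_time2" not in FEATURES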
CatBoost Survival:Cox Model
For CatBoost, we use the target efs_time2 and loss_function="Cox"
CV Score: 0.670
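A minimal sketch of the CatBoost version (same signed efs_time2 target with negative values marking censored rows; CAT_COLS is a hypothetical list of categorical column names and the hyperparameters are illustrative):
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    loss_function="Cox",       # Cox proportional hazards survival loss
    learning_rate=0.05,        # illustrative hyperparameters
    iterations=2000,
    cat_features=CAT_COLS,     # hypothetical list of categorical columns
)
model.fit(x_train, y_train, eval_set=(x_valid, y_valid), verbose=False)

# As with XGBoost survival:cox, predictions rank patients by risk
oof_preds = model.predict(x_valid)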
Starter Notebook
- I published a starter notebook demonstrating this code here.
- Using these techniques I achieved CV=0.681 and LB=0.685
- Annotation of the starter notebook here: https://dongsunseng.com/entry/CIBMTR-Equity-in-post-HCT-Survival-Predictions-4-GPU-LightGBM-Baseline-CV-681-LB-685
CV Score: 0.681
UPDATE - We can use Survival:AFT Model
- We can also train XGBoost and CatBoost with the Survival:AFT loss (see the sketch below).
- See discussions here and notebook examples here and here.
My Annotation & Explanation about AFT model here:
UPDATE - NN Starter Notebook
- I published an NN starter notebook here with CV 0.670 and LB 0.676!
My Annotation & Explanation about NN model here:
It's nice to be important, but it's more important to be nice.
- Dwayne Johnson -