Introduction
- Basically a survival analysis competition
- Predicting transplant survival rates for allogeneic HCT patients
- allogeneic: transplanting cells, tissues, or organs from a donor of the same species who is not genetically identical to the recipient
- HCT: Hematopoietic Cell Transplantation (also called hematopoietic stem cell transplantation) is a treatment used to cure diseases such as leukemia, where abnormalities occur during cell differentiation, or conditions such as aplastic anemia, where problems arise from a decreased number of hematopoietic stem cells.
Competition Description
- "In this competition, you’ll develop models to improve the prediction of transplant survival rates for patients undergoing allogeneic Hematopoietic Cell Transplantation (HCT) — an important step in ensuring that every patient has a fair chance at a successful outcome, regardless of their background."
- "Improving survival predictions for allogeneic HCT patients is a vital healthcare challenge. Current predictive models often fall short in addressing disparities related to socioeconomic status, race, and geography. Addressing these gaps is crucial for enhancing patient care, optimizing resource utilization, and rebuilding trust in the healthcare system."
- That is why "Equity" is in the title of the competition: reducing those disparities in the predictions is probably the key point of this competition
 
- "This competition aims to encourage participants to advance predictive modeling by ensuring that survival predictions are both precise and fair for patients across diverse groups. By using synthetic data—which mirrors real-world situations while protecting patient privacy—participants can build and improve models that more effectively consider diverse backgrounds and conditions."
- "You’re challenged to develop advanced predictive models for allogeneic HCT that enhance both accuracy and fairness in survival predictions. The goal is to address disparities by bridging diverse data sources, refining algorithms, and reducing biases to ensure equitable outcomes for patients across diverse race groups. Your work will help create a more just and effective healthcare environment, ensuring every patient receives the care they deserve."
Evaluation Metric
Evaluation Criteria
The evaluation of prediction accuracy in the competition will involve a specialized metric known as the Stratified Concordance Index (C-index), adapted to consider different racial groups independently. This method allows us to gauge the predictive performance of models in a way that emphasizes equitability across diverse patient populations, particularly focusing on racial disparities in transplant outcomes.
Concordance index
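It represents the global assessment of the model discrimination power: this is the model’s ability to correctly provide a reliable ranking of the survival times based on the individual risk scores. It can be computed with the following formula:
$$
C = \frac{\sum_{i,j} \mathbf{1}_{T_j < T_i} \cdot \mathbf{1}_{\eta_j > \eta_i} \cdot \delta_j}{\sum_{i,j} \mathbf{1}_{T_j < T_i} \cdot \delta_j}
$$
where $\eta_i$ is the risk score of patient $i$, $T_i$ is the observed time, and $\delta_i$ is 1 if the event was observed and 0 if the patient was censored.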
The concordance index is a value between 0 and 1 where:
- 0.5 is the expected result from random predictions,
- 1.0 is a perfect concordance and,
- 0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)
- A concordance index of 0.0 means the predicted ranking is exactly the reverse of the actual ranking
- Multiplying the predicted values by -1 (i.e., reversing the signs) reverses their order
- The reversed predictions then rank every comparable pair correctly, resulting in a concordance index of 1.0
 
Stratified Concordance Index
For this competition, we adjust the standard C-index to account for racial stratification, thus ensuring that each racial group's outcomes are weighted equally in the model evaluation. The stratified c-index is calculated as the mean minus the standard deviation of the c-index scores calculated within the recipient race categories, i.e., the score will be better if the mean c-index over the different race categories is large and the standard deviation of the c-indices over the race categories is small. This value ranges from 0 to 1; 1 is the theoretical perfect score, but in practice the value will be lower because of censored outcomes.
The submitted risk scores will be evaluated using the score function. This evaluation process involves comparing the submitted risk scores against actual observed values (i.e., survival times and event occurrences) from a test dataset. The function specifically calculates the stratified concordance index across different racial groups, ensuring that the predictions are not only accurate overall but also equitable across diverse patient demographics.
Final score = Mean(c-index for each race) - Standard deviation(c-index for each race)
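To make the arithmetic concrete, here is a minimal sketch of the final-score formula in plain Python. The per-race C-index values and the name `stratified_score` are invented for illustration; they are not competition results or part of the official metric.

```python
from statistics import mean, pstdev


def stratified_score(c_index_by_race):
    """Mean of the per-group C-indices minus their population standard deviation."""
    values = list(c_index_by_race.values())
    # pstdev divides by N, like np.sqrt(np.var(...)) with NumPy's default ddof=0
    return mean(values) - pstdev(values)


# Hypothetical per-race C-indices, invented for illustration only
even = {"group_a": 0.68, "group_b": 0.68, "group_c": 0.68}
uneven = {"group_a": 0.78, "group_b": 0.68, "group_c": 0.58}

print(round(stratified_score(even), 4))    # 0.68
print(round(stratified_score(uneven), 4))  # 0.5984
```

Both models have the same mean C-index (0.68), but the one whose performance varies across groups is penalized by the standard deviation term.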
Evaluation metric implementation:
https://www.kaggle.com/code/metric/eefs-concordance-index
"""
To evaluate the equitable prediction of transplant survival outcomes,
we use the concordance index (C-index) between a series of event
times and a predicted score across each race group.
 
It represents the global assessment of the model discrimination power:
this is the model’s ability to correctly provide a reliable ranking
of the survival times based on the individual risk scores.
 
The concordance index is a value between 0 and 1 where:
 
0.5 is the expected result from random predictions,
1.0 is perfect concordance (with no censoring, otherwise <1.0),
0.0 is perfect anti-concordance (with no censoring, otherwise >0.0)
"""
import pandas as pd
import pandas.api.types
import numpy as np
from lifelines.utils import concordance_index


class ParticipantVisibleError(Exception):
    pass


def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -> float:
    """
    >>> import pandas as pd
    >>> row_id_column_name = "id"
    >>> y_pred = {'prediction': {0: 1.0, 1: 0.0, 2: 1.0}}
    >>> y_pred = pd.DataFrame(y_pred)
    >>> y_pred.insert(0, row_id_column_name, range(len(y_pred)))
    >>> y_true = { 'efs': {0: 1.0, 1: 0.0, 2: 0.0}, 'efs_time': {0: 25.1234,1: 250.1234,2: 2500.1234}, 'race_group': {0: 'race_group_1', 1: 'race_group_1', 2: 'race_group_1'}}
    >>> y_true = pd.DataFrame(y_true)
    >>> y_true.insert(0, row_id_column_name, range(len(y_true)))
    >>> score(y_true.copy(), y_pred.copy(), row_id_column_name)
    0.75
    """
    
    del solution[row_id_column_name]
    del submission[row_id_column_name]
    
    # Define key columns
    event_label = 'efs' # event occurrence
    interval_label = 'efs_time' # survival time
    prediction_label = 'prediction' # predicted value
    
    # Validate submitted predictions
    for col in submission.columns:
        if not pandas.api.types.is_numeric_dtype(submission[col]):
            raise ParticipantVisibleError(f'Submission column {col} must be a number')
    
    # Merging solution and submission dfs on ID
    merged_df = pd.concat([solution, submission], axis=1)
    merged_df.reset_index(inplace=True)
    merged_df_race_dict = dict(merged_df.groupby(['race_group']).groups)
    
    # Calculate c-index for each racial group
    metric_list = []
    for race in merged_df_race_dict.keys():
        # Retrieving values from y_test based on index
        indices = sorted(merged_df_race_dict[race])
        merged_df_race = merged_df.iloc[indices]
        # Calculate the concordance index
        c_index_race = concordance_index(
                        merged_df_race[interval_label],
                        -merged_df_race[prediction_label],
                        merged_df_race[event_label])
        metric_list.append(c_index_race)
    return float(np.mean(metric_list)-np.sqrt(np.var(metric_list)))
- def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str)
- solution: actual answer data including efs, efs_time, race_group columns
- submission: participant's submitted predictions: prediction column
- row_id_column_name: ID column name
 
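- del solution[row_id_column_name]; del submission[row_id_column_name]
- Removes the ID column from both DataFrames (this mutates the inputs, which is why the docstring example passes copies)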
 
- Validate submitted predictions:
- Check if all predictions are numeric
 
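- Merging solution and submission dfs on ID
- First, concatenate solution and submission column-wise with pd.concat(axis=1); because the ID columns were already deleted, rows are aligned by DataFrame index rather than by ID
- Second, reset the index so that row positions can be used for lookup
- Last, group the row indices by race_group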
 
- Calculate c-index for each racial group
- Calculate concordance_index for each racial group
- Note: prediction is multiplied by -1 (high risk score should correlate with low survival time)
 
- return float(np.mean(metric_list)-np.sqrt(np.var(metric_list)))
- Returns the mean of racial c-indices minus their standard deviation (np.var defaults to ddof=0, so this is the population standard deviation)
- This considers both overall performance (mean) and performance differences between races (standard deviation)
 
- More about c-index:
- c-index = concordance-index
- basic concept:
- C-index measures how well a model predicts the "relative risk ranking"
- When comparing two patients, it evaluates whether the model predicted higher risk for the patient who actually died earlier (or experienced the event)
 
- concordance_index(actual_survival_time, predicted_risk, event_occurrence)
- Compares all possible patient pairs
- Concordant pair: pairs where predicted ranking matches actual ranking
- C-index = (number of concordant pairs) / (total number of comparable pairs)
 
- Example:
- Patient A: Survival time 10 days, deceased
- Patient B: Survival time 20 days, deceased
- Patient C: Survival time 15 days, censored
- Predicted risk scores: A: 0.8 (high risk) B: 0.3 (low risk) C: 0.5 (medium risk)
- A vs B: concordant (A died earlier + model predicted higher risk for A)
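- A vs C: concordant (C was censored at day 15, after A died on day 10, so C is known to have outlived A, and the model predicted higher risk for A)
- B vs C: not comparable (C was censored at day 15, before B died on day 20, so it is unknown which of them survived longer)
- C-index = 2 concordant pairs / 2 comparable pairs = 1.0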
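The pairwise comparison can be sketched in plain Python. This is a minimal illustration, not the lifelines implementation: it assumes the usual rule that a pair is comparable only when the patient with the shorter observed time actually had the event, it simply skips tied survival times, and the names `classify_pairs` and `c_index` are hypothetical.

```python
from itertools import combinations

# (name, survival time in days, event observed?, predicted risk score)
patients = [
    ("A", 10, True, 0.8),
    ("B", 20, True, 0.3),
    ("C", 15, False, 0.5),
]


def classify_pairs(rows):
    """Label each pair as concordant, discordant, tied, or not comparable."""
    labels = {}
    for p, q in combinations(rows, 2):
        # Order the pair so that `first` has the shorter observed time
        first, second = sorted((p, q), key=lambda r: r[1])
        key = (p[0], q[0])
        # Comparable only if the shorter time is an observed event
        # (tied times are skipped in this sketch)
        if first[1] == second[1] or not first[2]:
            labels[key] = "not comparable"
        elif first[3] > second[3]:
            labels[key] = "concordant"   # earlier event got the higher risk
        elif first[3] < second[3]:
            labels[key] = "discordant"
        else:
            labels[key] = "tied"
    return labels


def c_index(labels):
    """Concordant pairs (ties count half) divided by comparable pairs."""
    credit = {"concordant": 1.0, "tied": 0.5, "discordant": 0.0}
    comparable = [v for v in labels.values() if v != "not comparable"]
    return sum(credit[v] for v in comparable) / len(comparable)


labels = classify_pairs(patients)
print(labels)
print(c_index(labels))  # 1.0
```

Here A vs C counts as comparable because C was still event-free at day 15, after A's event on day 10, while B vs C does not because C's censoring comes before B's event.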
 
- Why multiply predictions by -1:
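- The submission is a risk score: high risk score = short survival time
- lifelines' concordance_index expects predicted scores where a higher value means longer survival
- Multiplying by -1 aligns the two directions (highest risk becomes lowest score = shortest predicted survival)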
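The effect of the sign flip can be checked on the docstring example of score(). The sketch below uses a simplified stand-in for lifelines' concordance_index (the function `concordance` is hypothetical, skips tied survival times, and reads a higher score as longer predicted survival); it is meant only to show the direction of the scores.

```python
from itertools import combinations


def concordance(times, scores, events):
    """Simplified C-index; a higher score is read as longer predicted survival."""
    credit, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this sketch
        short, long_ = (i, j) if times[i] < times[j] else (j, i)
        if not events[short]:
            continue  # shorter time is censored: true ordering unknown
        comparable += 1
        if scores[short] < scores[long_]:
            credit += 1.0  # shorter survival got the lower score
        elif scores[short] == scores[long_]:
            credit += 0.5  # tied predictions count half
    return credit / comparable


# Values taken from the docstring example of score()
efs_time = [25.1234, 250.1234, 2500.1234]
efs = [1, 0, 0]
risk = [1.0, 0.0, 1.0]

negated = concordance(efs_time, [-r for r in risk], efs)
raw = concordance(efs_time, risk, efs)
print(negated, raw)  # 0.75 0.25
```

With the negated risk scores the sketch reproduces the 0.75 from the docstring; passing the raw risk scores would score the same predictions as if they were survival-time predictions and give 0.25.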
 
 
Submission
Participants must submit their predictions for the test dataset as real-valued risk scores. These scores represent the model's assessment of each patient's risk following transplantation. A higher risk score typically indicates a higher likelihood of the target event occurrence.
The submission file must include a header and follow this format:
ID,prediction
28800,0.5
28801,1.2
28802,0.8
etc.
where:
ID refers to the identifier for each patient in the test dataset.
prediction is the corresponding risk score generated by your model.
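As a small illustration of this format, the sketch below writes a submission with Python's standard csv module. The helper name `write_submission` is hypothetical, and the rows are the three example rows shown above, not real predictions.

```python
import csv
import io


def write_submission(rows, handle):
    """Write (ID, prediction) pairs with the required header line."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ID", "prediction"])
    for patient_id, risk in rows:
        # Predictions must be numeric, so cast explicitly
        writer.writerow([patient_id, float(risk)])


# The three example rows from the format above
buffer = io.StringIO()
write_submission([(28800, 0.5), (28801, 1.2), (28802, 0.8)], buffer)
print(buffer.getvalue())
```

For a real submission, an opened file (for example `open("submission.csv", "w", newline="")`) would take the place of the in-memory buffer.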
Excellence comes from the ordinary
<GRIT>