
CZII - CryoET Object Identification #2 Baseline UNet Solution

dongsunseng 2025. 1. 15. 17:09

This post is an annotation of the baseline UNet solution kernel from "fnands".

https://www.kaggle.com/code/ldywinner/baseline-unet-train-submit/notebook#Baseline-UNet-training-+-prediction/submission

'Baseline UNet train + submit' - LB score 0.529

 

CZII - CryoET Object Identification #1 - Training Data (dongsunseng.com)

The previous post in this series annotates the training data kernel from "fnands": https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name ("Create Numpy dataset exp name").

1) Installing offline deps

deps_path = '/kaggle/input/czii-cryoet-dependencies'
! cp -r /kaggle/input/czii-cryoet-dependencies/asciitree-0.3.3/ asciitree-0.3.3/
! pip wheel asciitree-0.3.3/asciitree-0.3.3/
! pip install asciitree-0.3.3-py3-none-any.whl
! pip install -q --no-index --find-links {deps_path} --requirement {deps_path}/requirements.txt
  • Process for installing dependency packages in an offline environment
  • "As this is a code comp, there is no internet. So we have to do some silly things to get dependencies in here. Why is asciitree such a PITA?"
  • Kaggle code competitions block internet access for security and fairness, so all dependency packages must be attached in advance in a locally installable form (a dataset of wheels/source archives plus a requirements.txt)
  • The asciitree package in particular is awkward to install offline, so it is first built into a wheel with pip wheel and then installed from that wheel

2) Import deps

from typing import List, Tuple, Union
import numpy as np
import torch
from monai.data import DataLoader, Dataset, CacheDataset, decollate_batch
from monai.transforms import (
    Compose, 
    EnsureChannelFirstd, 
    Orientationd,  
    AsDiscrete,  
    RandFlipd, 
    RandRotate90d, 
    NormalizeIntensityd,
    RandCropByLabelClassesd,
)

3) Define some helper functions

Patching helper functions

def calculate_patch_starts(dimension_size: int, patch_size: int) -> List[int]:
    """
    Calculate the starting positions of patches along a single dimension
    with minimal overlap to cover the entire dimension.
    
    Parameters:
    -----------
    dimension_size : int
        Size of the dimension
    patch_size : int
        Size of the patch in this dimension
        
    Returns:
    --------
    List[int]
        List of starting positions for patches
    """
    if dimension_size <= patch_size:
        return [0]
        
    # Calculate number of patches needed
    n_patches = np.ceil(dimension_size / patch_size)
    
    if n_patches == 1:
        return [0]
    
    # Calculate overlap
    total_overlap = (n_patches * patch_size - dimension_size) / (n_patches - 1)
    
    # Generate starting positions
    positions = []
    for i in range(int(n_patches)):
        pos = int(i * (patch_size - total_overlap))
        if pos + patch_size > dimension_size:
            pos = dimension_size - patch_size
        if pos not in positions:  # Avoid duplicates
            positions.append(pos)
    
    return positions

def extract_3d_patches_minimal_overlap(arrays: List[np.ndarray], patch_size: int) -> Tuple[List[np.ndarray], List[Tuple[int, int, int]]]:
    """
    Extract 3D patches from multiple arrays with minimal overlap to cover the entire array.
    
    Parameters:
    -----------
    arrays : List[np.ndarray]
        List of input arrays, each with shape (m, n, l)
    patch_size : int
        Size of cubic patches (a x a x a)
        
    Returns:
    --------
    patches : List[np.ndarray]
        List of all patches from all input arrays
    coordinates : List[Tuple[int, int, int]]
        List of starting coordinates (x, y, z) for each patch
    """
    if not arrays or not isinstance(arrays, list):
        raise ValueError("Input must be a non-empty list of arrays")
    
    # Verify all arrays have the same shape
    shape = arrays[0].shape
    if not all(arr.shape == shape for arr in arrays):
        raise ValueError("All input arrays must have the same shape")
    
    if patch_size > min(shape):
        raise ValueError(f"patch_size ({patch_size}) must be smaller than smallest dimension {min(shape)}")
    
    m, n, l = shape
    patches = []
    coordinates = []
    
    # Calculate starting positions for each dimension
    x_starts = calculate_patch_starts(m, patch_size)
    y_starts = calculate_patch_starts(n, patch_size)
    z_starts = calculate_patch_starts(l, patch_size)
    
    # Extract patches from each array
    for arr in arrays:
        for x in x_starts:
            for y in y_starts:
                for z in z_starts:
                    patch = arr[
                        x:x + patch_size,
                        y:y + patch_size,
                        z:z + patch_size
                    ]
                    patches.append(patch)
                    coordinates.append((x, y, z))
    
    return patches, coordinates

# Note: I should probably average the overlapping areas,
# but here they are just overwritten by the most recent one.

def reconstruct_array(patches: List[np.ndarray], 
                     coordinates: List[Tuple[int, int, int]], 
                     original_shape: Tuple[int, int, int]) -> np.ndarray:
    """
    Reconstruct array from patches.
    
    Parameters:
    -----------
    patches : List[np.ndarray]
        List of patches to reconstruct from
    coordinates : List[Tuple[int, int, int]]
        Starting coordinates for each patch
    original_shape : Tuple[int, int, int]
        Shape of the original array
        
    Returns:
    --------
    np.ndarray
        Reconstructed array
    """
    reconstructed = np.zeros(original_shape, dtype=np.int64)  # To track overlapping regions
    
    patch_size = patches[0].shape[0]
    
    for patch, (x, y, z) in zip(patches, coordinates):
        reconstructed[
            x:x + patch_size,
            y:y + patch_size,
            z:z + patch_size
        ] = patch
        
    
    return reconstructed
  • "These are mostly used to split large volumes into smaller ones and stitch them back together"
  • This code implements functions for extracting and reconstructing patches from 3D image data (a small round-trip sketch follows this list)
  • def calculate_patch_starts(dimension_size: int, patch_size: int) -> List[int]:
    • Purpose: Calculates the starting positions of patches in one dimension
    • Operation:
      • Returns [0] if dimension size is smaller than patch size
      • Calculates required number of patches: n_patches = ceil(dimension_size / patch_size)
      • Calculates overlap between patches
      • Generates starting positions for each patch considering overlap
    • Example: For dimension size 100 and patch size 40, returns list of positions like [0, 30, 60]
  • def extract_3d_patches_minimal_overlap(arrays: List[np.ndarray], patch_size: int) 
    • Purpose: Divides 3D arrays into smaller patches
    • Key features:
      • Input validation (array shape, size, etc.)
      • Calculates patch starting positions for each dimension (x, y, z)
      • Extracts patches from all possible positions
    • Return values:
      • patches: List of all extracted patches
      • coordinates: List of starting coordinates for each patch
  • def reconstruct_array(patches: List[np.ndarray], coordinates: List[Tuple[int, int, int]], original_shape: Tuple[int, int, int])
    • Purpose: Reconstructs patches back into original-sized 3D array
    • Operation:
      • Creates empty array of original size
      • Places each patch at its corresponding position
      • Overlapping regions are overwritten by most recent patch
    • Note:
      • As mentioned in code comments, using average values for overlapping regions might be better
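A minimal round-trip sketch of the two helpers (not from the original kernel; the (184, 630, 630) shape is just an assumed tomogram-sized volume):

import numpy as np

# Hypothetical integer label volume; reconstruct_array uses an int64 buffer,
# so integer data round-trips exactly.
labels = np.random.randint(0, 7, size=(184, 630, 630))

patches, coords = extract_3d_patches_minimal_overlap([labels], 96)
print(len(patches), patches[0].shape)    # 98 patches of shape (96, 96, 96) for this shape

restored = reconstruct_array(patches, coords, labels.shape)
print(np.array_equal(restored, labels))  # True: overlaps are overwritten with identical values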

Submission helper functions

import pandas as pd

def dict_to_df(coord_dict, experiment_name):
    """
    Convert dictionary of coordinates to pandas DataFrame.
    
    Parameters:
    -----------
    coord_dict : dict
        Dictionary where keys are labels and values are Nx3 coordinate arrays
    experiment_name : str
        Experiment identifier stored in the 'experiment' column
        
    Returns:
    --------
    pd.DataFrame
        DataFrame with columns ['experiment', 'particle_type', 'x', 'y', 'z']
    """
    # Create lists to store data
    all_coords = []
    all_labels = []
    
    # Process each label and its coordinates
    for label, coords in coord_dict.items():
        all_coords.append(coords)
        all_labels.extend([label] * len(coords))
    
    # Concatenate all coordinates
    all_coords = np.vstack(all_coords)
    
    df = pd.DataFrame({
        'experiment': experiment_name,
        'particle_type': all_labels,
        'x': all_coords[:, 0],
        'y': all_coords[:, 1],
        'z': all_coords[:, 2]
    })

    
    return df
  • Purpose:
    • Converts position coordinates of multiple particle types in 3D space into a structured dataframe format
    • Structures the data to match the Kaggle submission format (a short usage sketch follows this list)
  • Input Parameters:
    • coord_dict: A dictionary with particle types as keys and their coordinates (N×3 array) as values
      • Example: {'apo-ferritin': array([[x1,y1,z1], [x2,y2,z2]...]), 'ribosome': array([[x3,y3,z3]...])}
    • experiment_name: Experiment name (e.g., 'TS_5_4')
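A short usage sketch with made-up coordinates, showing the resulting DataFrame layout:

import numpy as np

coord_dict = {
    "apo-ferritin": np.array([[100.0, 200.0, 300.0], [110.0, 210.0, 310.0]]),
    "ribosome":     np.array([[400.0, 500.0, 600.0]]),
}

df = dict_to_df(coord_dict, "TS_5_4")
print(df)
# (approximate output)
#   experiment particle_type      x      y      z
# 0     TS_5_4  apo-ferritin  100.0  200.0  300.0
# 1     TS_5_4  apo-ferritin  110.0  210.0  310.0
# 2     TS_5_4      ribosome  400.0  500.0  600.0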

4) Reading in the data

TRAIN_DATA_DIR = "/kaggle/input/create-numpy-dataset-exp-name"
TEST_DATA_DIR = "/kaggle/input/czii-cryo-et-object-identification"

train_names = ['TS_5_4', 'TS_69_2', 'TS_6_6', 'TS_73_6', 'TS_86_3', 'TS_99_9']
valid_names = ['TS_6_4']

train_files = []
valid_files = []

for name in train_names:
    image = np.load(f"{TRAIN_DATA_DIR}/train_image_{name}.npy")
    label = np.load(f"{TRAIN_DATA_DIR}/train_label_{name}.npy")

    train_files.append({"image": image, "label": label})
    

for name in valid_names:
    image = np.load(f"{TRAIN_DATA_DIR}/train_image_{name}.npy")
    label = np.load(f"{TRAIN_DATA_DIR}/train_label_{name}.npy")

    valid_files.append({"image": image, "label": label})
  • Loading data used in training and validation
  • For each experiment ID:
    • image: 3D volume data (.npy format)
    • label: corresponding label data
    • stored as dictionary {"image": image, "label": label}

Create the training dataloader

"I should probably find a way to create a dataloader that takes more batches."

# Non-random transforms to be cached
non_random_transforms = Compose([
    EnsureChannelFirstd(keys=["image", "label"], channel_dim="no_channel"),
    NormalizeIntensityd(keys="image"),
    Orientationd(keys=["image", "label"], axcodes="RAS")
])

raw_train_ds = CacheDataset(data=train_files, transform=non_random_transforms, cache_rate=1.0)


my_num_samples = 16
train_batch_size = 1

# Random transforms to be applied during training
random_transforms = Compose([
    RandCropByLabelClassesd(
        keys=["image", "label"],
        label_key="label",
        spatial_size=[96, 96, 96],
        num_classes=7,
        num_samples=my_num_samples
    ),
    RandRotate90d(keys=["image", "label"], prob=0.5, spatial_axes=[0, 2]),
    RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),    
])

# Final Dataset and DataLoader Creation:
train_ds = Dataset(data=raw_train_ds, transform=random_transforms)


# DataLoader remains the same
train_loader = DataLoader(
    train_ds,
    batch_size=train_batch_size,
    shuffle=True,
    num_workers=4,
    pin_memory=torch.cuda.is_available()
)
  • This code sets up the training data loader using the MONAI library for medical image data processing with transforms and data loaders
  • Data loader?
    • DataLoader is a pipeline that supplies data to the model
    • Main features:
      1. Batch Creation: Bundles multiple data samples together
      2. Shuffling: Randomly shuffles the order of data
      3. Parallel Processing: Accelerates data loading using multiple CPU cores
      4. Memory Efficiency: Loads data as needed instead of loading everything at once
  • Non-random Transforms Setup:
    • EnsureChannelFirstd: Moves channel dimension to first position
    • NormalizeIntensityd: Normalizes image values (standardizes image values)
    • Orientationd: Aligns 3D images to RAS (Right-Anterior-Superior) standard
  • raw_train_ds = CacheDataset(data=train_files, transform=non_random_transforms, cache_rate=1.0)
    • Caches transformed data in memory for fast access
    • cache_rate=1.0: Caches all data
  • Random Transforms Setup:
    • RandCropByLabelClassesd: Random cropping by label classes (96×96×96 size)
    • RandRotate90d: Random 90-degree rotation (50% probability)
    • RandFlipd: Random flipping (50% probability)
    • Random Transform doesn't replace original data but applies new transformations each time data is loaded
    • my_num_samples = 16  # Creates 16 transformed samples per image
      • Creates 16 different transformations from each original image per epoch
      • 6 training images × 16 samples = total of 96 samples used in each epoch
      • New random transformations are applied every epoch
  • Final Dataset and DataLoader Creation (a quick shape check is sketched after this list):
    • batch_size=1: Number of samples to process at once
    • shuffle=True: Shuffle data order each epoch
    • num_workers=4: Number of workers for parallel data loading
    • pin_memory=True: Memory performance optimization for GPU usage
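A quick shape check I would add here (an assumption, not in the original kernel): MONAI's DataLoader collates the num_samples crops produced by RandCropByLabelClassesd into the batch dimension, so one batch should contain train_batch_size × my_num_samples = 16 crops.

batch = next(iter(train_loader))
print(batch["image"].shape)  # expected: torch.Size([16, 1, 96, 96, 96])
print(batch["label"].shape)  # expected: torch.Size([16, 1, 96, 96, 96])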

Create the validation dataloader

  • "Here I deviate a little from the source notebooks."
  • "In the source, the validation dataloader also used the random transformations. This is bad practice and will result in noisy validation."
  • "Here I split the validation dataset in (slightly) overlapping blocks of (96, 96 , 96) so that we can have a consistent validation set that uses all the validation data.
val_images,val_labels = [dcts['image'] for dcts in valid_files],[dcts['label'] for dcts in valid_files]

val_image_patches, _ = extract_3d_patches_minimal_overlap(val_images, 96)
val_label_patches, _ = extract_3d_patches_minimal_overlap(val_labels, 96)

val_patched_data = [{"image": img, "label": lbl} for img, lbl in zip(val_image_patches, val_label_patches)]


valid_ds = CacheDataset(data=val_patched_data, transform=non_random_transforms, cache_rate=1.0)


valid_batch_size = 16
# DataLoader remains the same
valid_loader = DataLoader(
    valid_ds,
    batch_size=valid_batch_size,
    shuffle=False,
    num_workers=4,
    pin_memory=torch.cuda.is_available()
)
  • valid_batch_size = 16
    • Larger batch size than during training (1)
  • shuffle=False
    • Keeps the patch order fixed
  • Dataloader configuration details (a patch-count sanity check is sketched after this list):
    • Consistency:
      • Random transforms would result in unstable performance evaluation
      • Fixed patches enable consistent evaluation
    • Completeness:
      • Using entire data allows more accurate evaluation
      • Slight overlap ensures good evaluation of boundary regions
    • Efficiency:
      • Can use larger batch size
      • Faster validation process than training
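A small sanity check (a sketch; the exact count depends on the tomogram shape). For a (184, 630, 630) volume split into 96³ patches, calculate_patch_starts yields 2 × 7 × 7 = 98 patches, i.e. ceil(98 / 16) = 7 validation batches.

print(len(val_image_patches), val_image_patches[0].shape)  # e.g. 98 patches of (96, 96, 96)
print(len(valid_loader))                                   # e.g. 7 batches at batch size 16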

5) Initializing the model

import lightning.pytorch as pl

from monai.networks.nets import UNet
from monai.losses import TverskyLoss
from monai.metrics import DiceMetric

class Model(pl.LightningModule):
    def __init__(
        self, 
        spatial_dims: int = 3,
        in_channels: int = 1,
        out_channels: int = 7,
        channels: Union[Tuple[int, ...], List[int]] = (48, 64, 80, 80),
        strides: Union[Tuple[int, ...], List[int]] = (2, 2, 1),
        num_res_units: int = 1,
        lr: float=1e-3):
    
        super().__init__()
        self.save_hyperparameters()
        self.model = UNet(
            spatial_dims=self.hparams.spatial_dims,
            in_channels=self.hparams.in_channels,
            out_channels=self.hparams.out_channels,
            channels=self.hparams.channels,
            strides=self.hparams.strides,
            num_res_units=self.hparams.num_res_units,
        )
        self.loss_fn = TverskyLoss(include_background=True, to_onehot_y=True, softmax=True)  # softmax=True for multiclass
        self.metric_fn = DiceMetric(include_background=False, reduction="mean", ignore_empty=True)

        self.train_loss = 0
        self.val_metric = 0
        self.num_train_batch = 0
        self.num_val_batch = 0

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch['image'], batch['label']
        y_hat = self(x)
        loss = self.loss_fn(y_hat, y)
        self.train_loss += loss
        self.num_train_batch += 1
        torch.cuda.empty_cache()
        return loss

    def on_train_epoch_end(self):
        loss_per_epoch = self.train_loss/self.num_train_batch
        #print(f"Epoch {self.current_epoch} - Average Train Loss: {loss_per_epoch:.4f}")
        self.log('train_loss', loss_per_epoch, prog_bar=True)
        self.train_loss = 0
        self.num_train_batch = 0
    
    def validation_step(self, batch, batch_idx):
        with torch.no_grad(): # This ensures that gradients are not stored in memory
            x, y = batch['image'], batch['label'] # Extract images and labels from batch
            y_hat = self(x) # Perform model prediction
            
            # Process predictions
            metric_val_outputs = [AsDiscrete(
                argmax=True,  # Select class with highest probability
                to_onehot=self.hparams.out_channels  # Convert to one-hot encoding
            )(i) for i in decollate_batch(y_hat)]
            
            # Process labels
            metric_val_labels = [AsDiscrete(
                to_onehot=self.hparams.out_channels  # Convert labels to one-hot encoding
            )(i) for i in decollate_batch(y)]

            # compute metric for current iteration
            # Calculate Dice score for current batch
            self.metric_fn(y_pred=metric_val_outputs, y=metric_val_labels)
            # Calculate batch average metric
            metrics = self.metric_fn.aggregate(reduction="mean_batch")
            # Calculate mean across all particle types
            val_metric = torch.mean(metrics) # I used mean over all particle species as the metric. This can be explored.
            
            # Result Accumulation
            self.val_metric += val_metric 
            self.num_val_batch += 1
            
        torch.cuda.empty_cache()
        return {'val_metric': val_metric}

    def on_validation_epoch_end(self):
        metric_per_epoch = self.val_metric/self.num_val_batch
        #print(f"Epoch {self.current_epoch} - Average Val Metric: {metric_per_epoch:.4f}")
        self.log('val_metric', metric_per_epoch, prog_bar=True, sync_dist=False) # sync_dist=True for distributed training
        self.val_metric = 0
        self.num_val_batch = 0
    
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
  • This code implements a 3D U-Net model using PyTorch Lightning (a quick forward-pass shape check is sketched after the model is instantiated below)
  • def __init__(self, spatial_dims=3, in_channels=1, out_channels=7, ...):
    • UNet Model Configuration:
      • spatial_dims=3: Process 3D data
      • in_channels=1: Grayscale image input
      • out_channels=7: 7 class outputs (background + 6 particle types)
      • channels=(48, 64, 80, 80): Number of channels per layer
      • strides=(2, 2, 1): Stride for each layer
      • num_res_units=1: Number of residual units to include in each encoder and decoder block
        • Residual Unit structure:
          • Input -> Conv3D -> BatchNorm -> ReLU -> Conv3D -> BatchNorm -> Add(Input) -> ReLU -> Output
          • One "unit" consists of:
            • 2 3D convolution layers
            • 2 Batch Normalization layers
            • ReLU activation function
            • Skip connection (adding input to output)
        • num_res_units=1 means this structure is repeated once at each level. If num_res_units=2, this entire structure would be repeated twice in sequence
    • Loss Function and Evaluation Metrics:
      • TverskyLoss: Loss function robust to class imbalance
        • Tversky Loss is a generalized version of Dice Loss: the Tversky index is TP / (TP + α·FP + β·FN), and with α = β = 0.5 (MONAI's defaults, used here) it reduces to the Dice score
        • Effective for class imbalance, since α and β can be tuned to penalize false positives or false negatives more heavily
        • Parameter explanation:
          • include_background=True: Include background (class 0) in loss calculation
          • to_onehot_y=True: Convert integer labels to one-hot vectors
          • softmax=True: Apply softmax for multi-class classification
      • DiceMetric: Segmentation performance metric
        • Dice coefficient is a standard metric for evaluating segmentation performance
        • Formula: 2|X∩Y| / (|X|+|Y|)
          • X: Predicted region
          • Y: Actual region
        • Parameter explanation:
          • include_background=False: Exclude background class from evaluation
          • reduction="mean": Average Dice scores across all classes
          • ignore_empty=True: Exclude cases where certain classes are absent
  • Variable initialization
    • self.train_loss = 0      # Accumulate training loss
    • self.val_metric = 0      # Accumulate validation metric
    • self.num_train_batch = 0 # Count processed training batches
    • self.num_val_batch = 0   # Count processed validation batches
    • These variables are used to calculate average performance during an epoch
    • Reset to 0 at the end of each epoch
  • def forward(self, x):
    • Basic inference method for PyTorch models
    • Passes input x through the model
    • Simple but important roles:
      1. Simplifies model calls (enables self(x) instead of model(x))
      2. Used for model inference in other methods
      3. Integrates with PyTorch Lightning's automated features
  • def training_step(self, batch, batch_idx):
    • Process batch data
    • Perform model prediction
    • Calculate and accumulate loss
  • def on_train_epoch_end(self):
    • Calculate average loss per epoch
    • Perform logging
  • def validation_step(self, batch, batch_idx):
    • OVERALL:
      • Perform validation without gradient calculation
      • Convert predictions to class labels
      • Calculate Dice score
    • with torch.no_grad():
      • Turns off gradient calculation as backpropagation isn't needed during validation
      • Reduces memory usage and improves computation speed
    • metric_val_outputs
      • decollate_batch(y_hat):
        • Separates batch into individual samples
        • Example: a batch of 16 → [sample1, sample2, ..., sample16]
      • AsDiscrete(argmax=True):
        • Selects class with highest probability at each position
        • Example: [0.1, 0.7, 0.2] → 1 (second class)
      • to_onehot=7:
        • Converts selected class to one-hot vector
        • Example: 1 → [0, 1, 0, 0, 0, 0, 0]
    • metric_val_labels
      • decollate_batch(y):
        • Separates batch into individual samples
      • AsDiscrete(to_onehot=7):
        • Converts class index to one-hot vector
        • Example: 2 → [0, 0, 1, 0, 0, 0, 0]
  • def on_validation_epoch_end(self):
    • Same pattern as on_train_epoch_end: averages the accumulated metric over the epoch, logs it, and resets the counters
  • def configure_optimizers(self):
    • Use AdamW optimizer
    • Set learning rate
channels = (48, 64, 80, 80)
strides_pattern = (2, 2, 1)       
num_res_units = 1
learning_rate = 1e-3
num_epochs = 100

model = Model(channels=channels, strides=strides_pattern, num_res_units=num_res_units, lr=learning_rate)
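A quick forward-pass sketch (my own check, assuming MONAI's UNet returns the same spatial size as its input, which holds for 96³ crops with these strides): one crop should come out as 7 per-class logit maps.

with torch.no_grad():
    dummy = torch.randn(1, 1, 96, 96, 96)  # (batch, channel, D, H, W)
    print(model(dummy).shape)              # expected: torch.Size([1, 7, 96, 96, 96])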

6) Training the model

torch.set_float32_matmul_precision('medium')

# Check if CUDA is available and then count the GPUs
if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    print(f"Number of GPUs available: {num_gpus}")
else:
    print("No GPU available. Running on CPU.")
    num_gpus = 0  # avoid a NameError below when no GPU is present
devices = list(range(num_gpus))
print(devices)


trainer = pl.Trainer(
    max_epochs=num_epochs,        # Total number of training epochs (100)
    #strategy="ddp_notebook",     # Distributed training strategy (currently commented)
    accelerator="gpu",           # Use GPU
    devices=[0],                 # Use only first GPU
    num_nodes=1,                 # Use single node
    log_every_n_steps=10,        # Log every 10 steps
    enable_progress_bar=True,    # Enable progress bar
)

trainer.fit(model, train_loader, valid_loader)
  • torch.set_float32_matmul_precision('medium')
    • Sets precision of 32-bit floating-point matrix multiplication to 'medium'
    • Balances speed and accuracy
  • GPU Availability Check
    • Checks for CUDA (GPU) availability
    • Counts available GPUs
    • Creates list of GPU indices (e.g., [0,1,2] for 3 GPUs)
  • Trainer Setup
    • max_epochs: Total number of training iterations
    • accelerator: Hardware to use for training (GPU/CPU)
    • devices: GPU numbers to use
    • num_nodes: Number of nodes for distributed training
    • log_every_n_steps: Logging frequency
    • enable_progress_bar: Visualize training progress
  • trainer.fit(model, train_loader, valid_loader)
    • Training Cycle:
      • Each epoch loads batch data from train_loader
      • Executes training_step method:
        • Model prediction
        • Loss calculation
        • Backpropagation and weight updates
      • Runs on_train_epoch_end at epoch end
    • Validation Cycle:
      • Loads data from valid_loader after each epoch
      • Executes validation_step method:
        • Model prediction
        • Dice score calculation
      • Runs on_validation_epoch_end at epoch end

7) Predicting on the test set

# Model setup
model.eval();
model.to("cuda");

# Configuration File Processing
import json
copick_config_path = TRAIN_DATA_DIR + "/copick.config"

with open(copick_config_path) as f:
    copick_config = json.load(f)

copick_config['static_root'] = '/kaggle/input/czii-cryo-et-object-identification/test/static'

copick_test_config_path = 'copick_test.config'

with open(copick_test_config_path, 'w') as outfile:
    json.dump(copick_config, outfile)

# Copick Setup
import copick

root = copick.from_file(copick_test_config_path)

copick_user_name = "copickUtils"
copick_segmentation_name = "paintedPicks"
voxel_size = 10
tomo_type = "denoised"
  • Switches the trained model to evaluation mode and prepares settings for test data
  • Model Setup
    • eval(): Switches dropout, batch normalization, etc. to evaluation mode
    • to("cuda"): Moves model to GPU memory
  • Configuration File Processing
    • Loads original configuration file
    • Updates test data path
    • Saves new configuration to file
  • Copick Setup
    • Loads configuration using copick library
    • Sets parameters needed for testing
# Setting up Inference Transformations:
# Non-random transforms to be cached
inference_transforms = Compose([
    EnsureChannelFirstd(keys=["image"], channel_dim="no_channel"),
    NormalizeIntensityd(keys="image"),
    Orientationd(keys=["image"], axcodes="RAS")
])

import cc3d

id_to_name = {1: "apo-ferritin", 
              2: "beta-amylase",
              3: "beta-galactosidase", 
              4: "ribosome", 
              5: "thyroglobulin", 
              6: "virus-like-particle"}
  • Setting up Inference Transformations:
    • Unlike training, no random transformations (for consistent predictions)
    • Applied transformations:
      • EnsureChannelFirstd: Moves channel dimension to first position
      • NormalizeIntensityd: Normalizes image values
      • Orientationd: Aligns 3D images to RAS standard
  • cc3d
    • Library for connected-components analysis
    • Used to find and label connected regions in 3D images (a tiny sketch follows)
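A tiny cc3d sketch with a hypothetical binary mask, showing the pieces the inference loop below relies on (component labels, voxel counts, centroids):

import numpy as np
import cc3d

mask = np.zeros((20, 20, 20), dtype=np.uint8)
mask[2:6, 2:6, 2:6] = 1          # one blob
mask[12:15, 12:15, 12:15] = 1    # another blob

cc = cc3d.connected_components(mask)   # labels: 0 = background, 1..N = blobs
stats = cc3d.statistics(cc)
print(stats["voxel_counts"][1:])       # voxels per blob (index 0 is the background)
print(stats["centroids"][1:])          # per-blob centroid, in the array's axis order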

Iterate over test set:

  1. Read in a run
  2. Split it into patches of size (96, 96, 96)
  3. Create a dataset from the patches
  4. Predict the segmentation mask
  5. Glue the mask back together
  6. Find the connected components for each class
  7. Find the centroids of the connected components
  8. Add to the dataframe
  • Then do this for all runs.
  • "This can probably be optimized quite a bit."
BLOB_THRESHOLD = 500
CERTAINTY_THRESHOLD = 0.5

classes = [1, 2, 3, 4, 5, 6]
with torch.no_grad():
    location_df = []
    for run in root.runs:
        print(run)
		
        # "Read in a run"
        tomo = run.get_voxel_spacing(10)
        tomo = tomo.get_tomogram(tomo_type).numpy()

        # "Split into patches"
        tomo_patches, coordinates  = extract_3d_patches_minimal_overlap([tomo], 96)

        # "Create a dataset"
        tomo_patched_data = [{"image": img} for img in tomo_patches]
        tomo_ds = CacheDataset(data=tomo_patched_data, transform=inference_transforms, cache_rate=1.0)

        # "Predict the segmentation mask"
        pred_masks = []

        for i in range(len(tomo_ds)):
            input_tensor = tomo_ds[i]['image'].unsqueeze(0).to("cuda")
            model_output = model(input_tensor)

            probs = torch.softmax(model_output[0], dim=0)
            thresh_probs = probs > CERTAINTY_THRESHOLD
            _, max_classes = thresh_probs.max(dim=0)

            pred_masks.append(max_classes.cpu().numpy())
            
        # "Glue the mask back together"
        reconstructed_mask = reconstruct_array(pred_masks, coordinates, tomo.shape)
        
        location = {}

        for c in classes:
            # "Find the connected components"
            cc = cc3d.connected_components(reconstructed_mask == c)
            stats = cc3d.statistics(cc)
            
            # "Find the centroids"
            zyx=stats['centroids'][1:]*10.012444 #https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/544895#3040071
            zyx_large = zyx[stats['voxel_counts'][1:] > BLOB_THRESHOLD]
            xyz =np.ascontiguousarray(zyx_large[:,::-1])

            location[id_to_name[c]] = xyz

        # "Add to the dataframe"
        df = dict_to_df(location, run.name)
        location_df.append(df)
    
    location_df = pd.concat(location_df)
location_df.insert(
    loc=0,                              # Insert at first position
    column='id',                        # Column name is 'id'
    value=np.arange(len(location_df))   # Sequential numbers starting from 0
)
location_df.to_csv("submission.csv", index=False)
  • Adding ID Column
    • Assigns unique ID to each predicted particle
    • Meets the Kaggle submission format requirements (a quick layout check is sketched after this list)
  • Saving to CSV file
    • index=False: Excludes DataFrame index from saved file
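A quick layout check (the column names follow from dict_to_df plus the inserted id column):

print(location_df.columns.tolist())
# ['id', 'experiment', 'particle_type', 'x', 'y', 'z']
print(location_df.head())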
!cp -r /kaggle/input/hengck-czii-cryo-et-01/* .

from czii_helper import *
from dataset import *
from scipy.optimize import linear_sum_assignment
import matplotlib.pyplot as plt
  • !cp -r ...:
    • Linux command that copies the required helper files from a Kaggle dataset into the current working directory
  • hengck-czii-cryo-et-01 includes:
    • czii_helper.py: Utility functions for evaluation metric calculations
    • dataset.py: Functions for data loading and processing
    • PARTICLE: a constant defined in the copied files, containing characteristics of each particle type (name, radius, difficulty level, etc.)
import os
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    MODE = 'submit'
else:
    MODE = 'local'







valid_dir ='/kaggle/input/czii-cryo-et-object-identification/train'
valid_id = ['TS_6_4', ]

def do_one_eval(truth, predict, threshold):
    P=len(predict)
    T=len(truth)

    if P==0:
        hit=[[],[]]
        miss=np.arange(T).tolist()
        fp=[]
        metric = [P,T,len(hit[0]),len(miss),len(fp)]
        return hit, fp, miss, metric

    if T==0:
        hit=[[],[]]
        fp=np.arange(P).tolist()
        miss=[]
        metric = [P,T,len(hit[0]),len(miss),len(fp)]
        return hit, fp, miss, metric

    #---
    distance = predict.reshape(P,1,3)-truth.reshape(1,T,3)
    distance = distance**2
    distance = distance.sum(axis=2)
    distance = np.sqrt(distance)
    p_index, t_index = linear_sum_assignment(distance)

    valid = distance[p_index, t_index] <= threshold
    p_index = p_index[valid]
    t_index = t_index[valid]
    hit = [p_index.tolist(), t_index.tolist()]
    miss = np.arange(T)
    miss = miss[~np.isin(miss,t_index)].tolist()
    fp = np.arange(P)
    fp = fp[~np.isin(fp,p_index)].tolist()

    metric = [P,T,len(hit[0]),len(miss),len(fp)] #for lb metric F-beta computation
    return hit, fp, miss, metric


def compute_lb(submit_df, overlay_dir):
    valid_id = list(submit_df['experiment'].unique())
    print(valid_id)

    eval_df = []
    for id in valid_id:
        truth = read_one_truth(id, overlay_dir) #=f'{valid_dir}/overlay/ExperimentRuns')
        id_df = submit_df[submit_df['experiment'] == id]
        for p in PARTICLE:
            p = dotdict(p)
            print('\r', id, p.name, end='', flush=True)
            xyz_truth = truth[p.name]
            xyz_predict = id_df[id_df['particle_type'] == p.name][['x', 'y', 'z']].values
            hit, fp, miss, metric = do_one_eval(xyz_truth, xyz_predict, p.radius* 0.5)
            eval_df.append(dotdict(
                id=id, particle_type=p.name,
                P=metric[0], T=metric[1], hit=metric[2], miss=metric[3], fp=metric[4],
            ))
    print('')
    eval_df = pd.DataFrame(eval_df)
    gb = eval_df.groupby('particle_type').agg('sum').drop(columns=['id'])
    gb.loc[:, 'precision'] = gb['hit'] / gb['P']
    gb.loc[:, 'precision'] = gb['precision'].fillna(0)
    gb.loc[:, 'recall'] = gb['hit'] / gb['T']
    gb.loc[:, 'recall'] = gb['recall'].fillna(0)
    gb.loc[:, 'f-beta4'] = 17 * gb['precision'] * gb['recall'] / (16 * gb['precision'] + gb['recall'])
    gb.loc[:, 'f-beta4'] = gb['f-beta4'].fillna(0)

    gb = gb.sort_values('particle_type').reset_index(drop=False)
    # https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/544895
    gb.loc[:, 'weight'] = [1, 0, 2, 1, 2, 1]
    lb_score = (gb['f-beta4'] * gb['weight']).sum() / gb['weight'].sum()
    return gb, lb_score


#debug
if 1:
    if MODE=='local':
    #if 1:
        submit_df=pd.read_csv(
           'submission.csv'
            # '/kaggle/input/hengck-czii-cryo-et-weights-01/submission.csv'
        )
        gb, lb_score = compute_lb(submit_df, f'{valid_dir}/overlay/ExperimentRuns')
        print(gb)
        print('lb_score:',lb_score)
        print('')


        #show one ----------------------------------
        fig = plt.figure(figsize=(18, 8))

        id = valid_id[0]
        truth = read_one_truth(id,overlay_dir=f'{valid_dir}/overlay/ExperimentRuns')

        submit_df = submit_df[submit_df['experiment']==id]
        for p in PARTICLE:
            p = dotdict(p)
            xyz_truth = truth[p.name]
            xyz_predict = submit_df[submit_df['particle_type']==p.name][['x','y','z']].values
            hit, fp, miss, _ = do_one_eval(xyz_truth, xyz_predict, p.radius)
            print(id, p.name)
            print('\t num truth   :',len(xyz_truth) )
            print('\t num predict :',len(xyz_predict) )
            print('\t num hit  :',len(hit[0]) )
            print('\t num fp   :',len(fp) )
            print('\t num miss :',len(miss) )

            ax = fig.add_subplot(2, 3, p.label, projection='3d')
            if hit[0]:
                pt = xyz_predict[hit[0]]
                ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], alpha=0.5, color='r')
                pt = xyz_truth[hit[1]]
                ax.scatter(pt[:,0], pt[:,1], pt[:,2], s=80, facecolors='none', edgecolors='r')
            if fp:
                pt = xyz_predict[fp]
                ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], alpha=1, color='k')
            if miss:
                pt = xyz_truth[miss]
                ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], s=160, alpha=1, facecolors='none', edgecolors='k')

            ax.set_title(f'{p.name} ({p.difficulty})')

        plt.tight_layout()
        plt.show()
        
        #--- 
        zz=0
  • Overall: comprehensive evaluation and visualization of the model's predictions on the local validation run
  • do_one_eval (a worked sketch follows this list):
    • Inputs:
      • truth: actual particle positions
      • predict: predicted particle positions
      • threshold: matching distance threshold
    • Main process:
      1. Handle exceptions (P=0 or T=0 cases)
      2. Calculate distances between predictions and truth
      3. Find optimal matching (using linear_sum_assignment)
      4. Filter valid matches based on threshold
      5. Calculate hits/misses/false positives
    • Returns:
      • hit: correct prediction indices
      • fp: false prediction indices
      • miss: missed particle indices
      • metric: [P, T, num_hits, num_misses, num_fps]
  • compute_lb
    • Inputs:
      • submit_df: prediction results dataframe to submit
      • overlay_dir: ground truth data path
    • Main process:
      1. Evaluate predictions for each experiment ID
      2. Calculate performance per particle type
      3. Calculate precision and recall
      4. Calculate the f-beta4 score: F_beta = (1 + beta²)·precision·recall / (beta²·precision + recall); with beta = 4 this is the 17·P·R / (16·P + R) expression in the code, which weights recall far more than precision
      5. Apply particle type weights
    • Returns:
      • gb: performance metrics per particle type
      • lb_score: final score
  • read_one_truth: loading the ground truth data
  • The local LB score is computed on the validation run we held out earlier: TS_6_4
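A worked sketch of do_one_eval with made-up points (two predictions, two ground-truth points, threshold 50):

import numpy as np

truth   = np.array([[100., 100., 100.], [500., 500., 500.]])
predict = np.array([[110., 100., 100.], [900., 900., 900.]])

hit, fp, miss, metric = do_one_eval(truth, predict, threshold=50)
print(hit)     # [[0], [0]] -> prediction 0 matched truth 0 (distance 10 <= 50)
print(fp)      # [1]        -> prediction 1 matched nothing within the threshold
print(miss)    # [1]        -> truth 1 was never matched
print(metric)  # [2, 2, 1, 1, 1] = [P, T, hits, misses, false positives]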

I think it's very important to have a feedback loop, where you're constantly thinking about what you've done and how you could be doing it better. 
- Elon Musk -