This post is an annotation of the baseline UNet solution kernel from "fnands".
'Baseline UNet train + submit' - LB score 0.529
- A literal baseline solution, so the LB score is not particularly high
- Based on 3 notebooks:
- Pre-computed the input data and stored it as numpy arrays so it doesn't have to be extracted every time the notebook is run:
- My annotation of that part here: https://dongsunseng.com/entry/CZII-CryoET-Object-Identification-1-Training-Data
1) Installing offline deps
deps_path = '/kaggle/input/czii-cryoet-dependencies'
! cp -r /kaggle/input/czii-cryoet-dependencies/asciitree-0.3.3/ asciitree-0.3.3/
! pip wheel asciitree-0.3.3/asciitree-0.3.3/
! pip install asciitree-0.3.3-py3-none-any.whl
! pip install -q --no-index --find-links {deps_path} --requirement {deps_path}/requirements.txt
- Process of installing dependency packages in an offline environment
- "As this is a code comp, there is no internet. So we have to do some silly things to get dependencies in here. Why is asciitree such a PITA?"
- In Kaggle competitions, internet access is restricted, so necessary packages must be prepared in advance
- Kaggle competition environments block internet access for security and fairness.
- The asciitree package, in particular, is tricky to install and requires special handling
- All dependency packages must be prepared in advance in a locally installable format.
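As a hedged sketch of how such an offline dependency dataset can be prepared in the first place (the folder name my-offline-deps and the requirements.txt file below are just examples, not the dataset used in this notebook): download the wheels once in an internet-enabled environment, upload the folder as a Kaggle dataset, then install only from it.
# Step 1 - run where internet IS available: collect wheels/sdists for every requirement,
# then upload the resulting folder as a (private) Kaggle dataset
! pip download -r requirements.txt -d ./my-offline-deps
# Step 2 - run inside the offline competition notebook: install strictly from the attached dataset
! pip install -q --no-index --find-links /kaggle/input/my-offline-deps -r /kaggle/input/my-offline-deps/requirements.txt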
2) Import deps
from typing import List, Tuple, Union
import numpy as np
import torch
from monai.data import DataLoader, Dataset, CacheDataset, decollate_batch
from monai.transforms import (
Compose,
EnsureChannelFirstd,
Orientationd,
AsDiscrete,
RandFlipd,
RandRotate90d,
NormalizeIntensityd,
RandCropByLabelClassesd,
)
3) Define some helper functions
Patching helper functions
def calculate_patch_starts(dimension_size: int, patch_size: int) -> List[int]:
"""
Calculate the starting positions of patches along a single dimension
with minimal overlap to cover the entire dimension.
Parameters:
-----------
dimension_size : int
Size of the dimension
patch_size : int
Size of the patch in this dimension
Returns:
--------
List[int]
List of starting positions for patches
"""
if dimension_size <= patch_size:
return [0]
# Calculate number of patches needed
n_patches = np.ceil(dimension_size / patch_size)
if n_patches == 1:
return [0]
# Calculate overlap
total_overlap = (n_patches * patch_size - dimension_size) / (n_patches - 1)
# Generate starting positions
positions = []
for i in range(int(n_patches)):
pos = int(i * (patch_size - total_overlap))
if pos + patch_size > dimension_size:
pos = dimension_size - patch_size
if pos not in positions: # Avoid duplicates
positions.append(pos)
return positions
def extract_3d_patches_minimal_overlap(arrays: List[np.ndarray], patch_size: int) -> Tuple[List[np.ndarray], List[Tuple[int, int, int]]]:
"""
Extract 3D patches from multiple arrays with minimal overlap to cover the entire array.
Parameters:
-----------
arrays : List[np.ndarray]
List of input arrays, each with shape (m, n, l)
patch_size : int
Size of cubic patches (a x a x a)
Returns:
--------
patches : List[np.ndarray]
List of all patches from all input arrays
coordinates : List[Tuple[int, int, int]]
List of starting coordinates (x, y, z) for each patch
"""
if not arrays or not isinstance(arrays, list):
raise ValueError("Input must be a non-empty list of arrays")
# Verify all arrays have the same shape
shape = arrays[0].shape
if not all(arr.shape == shape for arr in arrays):
raise ValueError("All input arrays must have the same shape")
if patch_size > min(shape):
raise ValueError(f"patch_size ({patch_size}) must be smaller than smallest dimension {min(shape)}")
m, n, l = shape
patches = []
coordinates = []
# Calculate starting positions for each dimension
x_starts = calculate_patch_starts(m, patch_size)
y_starts = calculate_patch_starts(n, patch_size)
z_starts = calculate_patch_starts(l, patch_size)
# Extract patches from each array
for arr in arrays:
for x in x_starts:
for y in y_starts:
for z in z_starts:
patch = arr[
x:x + patch_size,
y:y + patch_size,
z:z + patch_size
]
patches.append(patch)
coordinates.append((x, y, z))
return patches, coordinates
# Note: I should probably average the overlapping areas,
# but here they are just overwritten by the most recent one.
def reconstruct_array(patches: List[np.ndarray],
coordinates: List[Tuple[int, int, int]],
original_shape: Tuple[int, int, int]) -> np.ndarray:
"""
Reconstruct array from patches.
Parameters:
-----------
patches : List[np.ndarray]
List of patches to reconstruct from
coordinates : List[Tuple[int, int, int]]
Starting coordinates for each patch
original_shape : Tuple[int, int, int]
Shape of the original array
Returns:
--------
np.ndarray
Reconstructed array
"""
reconstructed = np.zeros(original_shape, dtype=np.int64) # To track overlapping regions
patch_size = patches[0].shape[0]
for patch, (x, y, z) in zip(patches, coordinates):
reconstructed[
x:x + patch_size,
y:y + patch_size,
z:z + patch_size
] = patch
return reconstructed
- "These are mostly used to split large volumes into smaller ones and stitch them back together"
- This code implements functions for extracting and reconstructing patches from 3D image data
- def calculate_patch_starts(dimension_size: int, patch_size: int) -> List[int]:
- Purpose: Calculates the starting positions of patches in one dimension
- Operation:
- Returns [0] if the dimension size is less than or equal to the patch size
- Calculates required number of patches: n_patches = ceil(dimension_size / patch_size)
- Calculates overlap between patches
- Generates starting positions for each patch considering overlap
- Example: For dimension size 100 and patch size 40, returns positions [0, 30, 60] (verified in the snippet after this list)
- def extract_3d_patches_minimal_overlap(arrays: List[np.ndarray], patch_size: int)
- Purpose: Divides 3D arrays into smaller patches
- Key features:
- Input validation (array shape, size, etc.)
- Calculates patch starting positions for each dimension (x, y, z)
- Extracts patches from all possible positions
- Return values:
- patches: List of all extracted patches
- coordinates: List of starting coordinates for each patch
- def reconstruct_array(patches: List[np.ndarray], coordinates: List[Tuple[int, int, int]], original_shape: Tuple[int, int, int])
- Purpose: Reconstructs patches back into original-sized 3D array
- Operation:
- Creates empty array of original size
- Places each patch at its corresponding position
- Overlapping regions are overwritten by most recent patch
- Note:
- As mentioned in code comments, using average values for overlapping regions might be better
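A quick usage check of the three helpers above (assuming their definitions have already been run; the (100, 100, 100) dummy volume is made up for illustration):
import numpy as np

# 1D starts: dimension 100, patch 40 -> ceil(100/40) = 3 patches,
# overlap (3*40 - 100) / (3 - 1) = 10, so the starts are [0, 30, 60]
print(calculate_patch_starts(100, 40))             # [0, 30, 60]

# 3D: a (100, 100, 100) cube with 96^3 patches needs 2 starts per axis ([0, 4]),
# i.e. 2*2*2 = 8 slightly overlapping patches per input array
dummy = np.zeros((100, 100, 100))
patches, coords = extract_3d_patches_minimal_overlap([dummy], 96)
print(len(patches), patches[0].shape, coords[:2])  # 8 (96, 96, 96) [(0, 0, 0), (0, 0, 4)]

# Round trip: stitching the patches back recovers the original shape
print(reconstruct_array(patches, coords, dummy.shape).shape)  # (100, 100, 100)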
Submission helper functions
import pandas as pd
def dict_to_df(coord_dict, experiment_name):
"""
Convert dictionary of coordinates to pandas DataFrame.
Parameters:
-----------
coord_dict : dict
Dictionary where keys are labels and values are Nx3 coordinate arrays
Returns:
--------
pd.DataFrame
DataFrame with columns ['x', 'y', 'z', 'label']
"""
# Create lists to store data
all_coords = []
all_labels = []
# Process each label and its coordinates
for label, coords in coord_dict.items():
all_coords.append(coords)
all_labels.extend([label] * len(coords))
# Concatenate all coordinates
all_coords = np.vstack(all_coords)
df = pd.DataFrame({
'experiment': experiment_name,
'particle_type': all_labels,
'x': all_coords[:, 0],
'y': all_coords[:, 1],
'z': all_coords[:, 2]
})
return df
- Purpose:
- Converts position coordinates of multiple particle types in 3D space into a structured dataframe format
- Structures data to match the submission format for Kaggle competition
- Input Parameters:
- coord_dict: A dictionary with particle types as keys and their coordinates (N×3 array) as values
- Example: {'apo-ferritin': array([[x1,y1,z1], [x2,y2,z2]...]), 'ribosome': array([[x3,y3,z3]...])}
- experiment_name: Experiment name (e.g., 'TS_5_4')
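A minimal usage sketch of dict_to_df (the coordinates are made up; it only illustrates the resulting column layout):
import numpy as np

coord_dict = {
    "ribosome": np.array([[100.0, 200.0, 300.0],
                          [110.0, 210.0, 310.0]]),
    "apo-ferritin": np.array([[50.0, 60.0, 70.0]]),
}
df = dict_to_df(coord_dict, experiment_name="TS_6_4")
print(df)
#   experiment particle_type      x      y      z
# 0     TS_6_4      ribosome  100.0  200.0  300.0
# 1     TS_6_4      ribosome  110.0  210.0  310.0
# 2     TS_6_4  apo-ferritin   50.0   60.0   70.0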
4) Reading in the data
TRAIN_DATA_DIR = "/kaggle/input/create-numpy-dataset-exp-name"
TEST_DATA_DIR = "/kaggle/input/czii-cryo-et-object-identification"
train_names = ['TS_5_4', 'TS_69_2', 'TS_6_6', 'TS_73_6', 'TS_86_3', 'TS_99_9']
valid_names = ['TS_6_4']
train_files = []
valid_files = []
for name in train_names:
image = np.load(f"{TRAIN_DATA_DIR}/train_image_{name}.npy")
label = np.load(f"{TRAIN_DATA_DIR}/train_label_{name}.npy")
train_files.append({"image": image, "label": label})
for name in valid_names:
image = np.load(f"{TRAIN_DATA_DIR}/train_image_{name}.npy")
label = np.load(f"{TRAIN_DATA_DIR}/train_label_{name}.npy")
valid_files.append({"image": image, "label": label})
- Loading data used in training and validation
- For each experiment ID:
- image: 3D volume data (.npy format)
- label: corresponding label data
- stored as dictionary {"image": image, "label": label}
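A small sanity check of what gets loaded (the exact array shapes depend on the pre-computed dataset, so the prints are just for inspection):
# Inspect one training example; image and label should share the same 3D shape
sample = train_files[0]
print(sample["image"].shape, sample["image"].dtype)
print(sample["label"].shape, sample["label"].dtype)
print("label values:", np.unique(sample["label"]))  # expected: 0 = background, 1..6 = particle types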
Create the training dataloader
"I should probably find a way to create a dataloader that takes more batches."
# Non-random transforms to be cached
non_random_transforms = Compose([
EnsureChannelFirstd(keys=["image", "label"], channel_dim="no_channel"),
NormalizeIntensityd(keys="image"),
Orientationd(keys=["image", "label"], axcodes="RAS")
])
raw_train_ds = CacheDataset(data=train_files, transform=non_random_transforms, cache_rate=1.0)
my_num_samples = 16
train_batch_size = 1
# Random transforms to be applied during training
random_transforms = Compose([
RandCropByLabelClassesd(
keys=["image", "label"],
label_key="label",
spatial_size=[96, 96, 96],
num_classes=7,
num_samples=my_num_samples
),
RandRotate90d(keys=["image", "label"], prob=0.5, spatial_axes=[0, 2]),
RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
])
# Final Dataset and DataLoader Creation:
train_ds = Dataset(data=raw_train_ds, transform=random_transforms)
# DataLoader remains the same
train_loader = DataLoader(
train_ds,
batch_size=train_batch_size,
shuffle=True,
num_workers=4,
pin_memory=torch.cuda.is_available()
)
- This code sets up the training data pipeline (transforms, dataset, and data loader) for medical image data using the MONAI library
- Data loader?
- DataLoader is a pipeline that supplies data to the model
- Main features:
- Batch Creation: Bundles multiple data samples together
- Shuffling: Randomly shuffles the order of data
- Parallel Processing: Accelerates data loading using multiple CPU cores
- Memory Efficiency: Loads data as needed instead of loading everything at once
- Non-random Transforms Setup:
- EnsureChannelFirstd: Moves channel dimension to first position
- NormalizeIntensityd: Normalizes image values (standardizes image values)
- Orientationd: Aligns 3D images to RAS (Right-Anterior-Superior) standard
- raw_train_ds = CacheDataset(data=train_files, transform=non_random_transforms, cache_rate=1.0)
- Caches transformed data in memory for fast access
- cache_rate=1.0: Caches all data
- Random Transforms Setup:
- RandCropByLabelClassesd: Random cropping by label classes (96×96×96 size)
- RandRotate90d: Random 90-degree rotation (50% probability)
- RandFlipd: Random flipping (50% probability)
- Random Transform doesn't replace original data but applies new transformations each time data is loaded
- my_num_samples = 16 # Creates 16 transformed samples per image
- Creates 16 different transformations from each original image per epoch
- 6 training images × 16 samples = 96 crops used in each epoch (see the batch-shape check after this list)
- New random transformations are applied every epoch
- Final Dataset and DataLoader Creation:
- batch_size=1: Number of samples to process at once
- shuffle=True: Shuffle data order each epoch
- num_workers=4: Number of workers for parallel data loading
- pin_memory=True: Memory performance optimization for GPU usage
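To make the num_samples bookkeeping concrete, here is a hedged batch-shape check; with batch_size=1 and num_samples=16, MONAI's DataLoader is expected to flatten the 16 random crops of a single volume into the batch dimension (the shapes below are what I expect, not verified output):
# Peek at one batch from the training loader
batch = next(iter(train_loader))
# Expected (assuming MONAI's default list collation of the 16 crops):
# image torch.Size([16, 1, 96, 96, 96]), label torch.Size([16, 1, 96, 96, 96])
print(batch["image"].shape, batch["label"].shape)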
Create the validation dataloader
- "Here I deviate a little from the source notebooks."
- "In the source, the validation dataloader also used the random transformations. This is bad practice and will result in noisy validation."
- "Here I split the validation dataset in (slightly) overlapping blocks of (96, 96, 96) so that we can have a consistent validation set that uses all the validation data."
val_images,val_labels = [dcts['image'] for dcts in valid_files],[dcts['label'] for dcts in valid_files]
val_image_patches, _ = extract_3d_patches_minimal_overlap(val_images, 96)
val_label_patches, _ = extract_3d_patches_minimal_overlap(val_labels, 96)
val_patched_data = [{"image": img, "label": lbl} for img, lbl in zip(val_image_patches, val_label_patches)]
valid_ds = CacheDataset(data=val_patched_data, transform=non_random_transforms, cache_rate=1.0)
valid_batch_size = 16
# DataLoader remains the same
valid_loader = DataLoader(
valid_ds,
batch_size=valid_batch_size,
shuffle=False,
num_workers=4,
pin_memory=torch.cuda.is_available()
)
- valid_batch_size = 16
- Larger batch size than in training (which uses batch_size=1)
- shuffle=False
- Maintains a fixed order
- Dataloader configuration details:
- Consistency:
- Random transforms would result in unstable performance evaluation
- Fixed patches enable consistent evaluation
- Completeness:
- Using entire data allows more accurate evaluation
- Slight overlap ensures good evaluation of boundary regions
- Efficiency:
- Can use larger batch size
- Faster validation process than training
5) Initializing the model
- This model is pretty much directly copied from https://www.kaggle.com/code/zhuowenzhao11/3d-u-net-pytorch-lightning-distributed-training
import lightning.pytorch as pl
from monai.networks.nets import UNet
from monai.losses import TverskyLoss
from monai.metrics import DiceMetric
class Model(pl.LightningModule):
def __init__(
self,
spatial_dims: int = 3,
in_channels: int = 1,
out_channels: int = 7,
channels: Union[Tuple[int, ...], List[int]] = (48, 64, 80, 80),
strides: Union[Tuple[int, ...], List[int]] = (2, 2, 1),
num_res_units: int = 1,
lr: float=1e-3):
super().__init__()
self.save_hyperparameters()
self.model = UNet(
spatial_dims=self.hparams.spatial_dims,
in_channels=self.hparams.in_channels,
out_channels=self.hparams.out_channels,
channels=self.hparams.channels,
strides=self.hparams.strides,
num_res_units=self.hparams.num_res_units,
)
self.loss_fn = TverskyLoss(include_background=True, to_onehot_y=True, softmax=True) # softmax=True for multiclass
self.metric_fn = DiceMetric(include_background=False, reduction="mean", ignore_empty=True)
self.train_loss = 0
self.val_metric = 0
self.num_train_batch = 0
self.num_val_batch = 0
def forward(self, x):
return self.model(x)
def training_step(self, batch, batch_idx):
x, y = batch['image'], batch['label']
y_hat = self(x)
loss = self.loss_fn(y_hat, y)
self.train_loss += loss
self.num_train_batch += 1
torch.cuda.empty_cache()
return loss
def on_train_epoch_end(self):
loss_per_epoch = self.train_loss/self.num_train_batch
#print(f"Epoch {self.current_epoch} - Average Train Loss: {loss_per_epoch:.4f}")
self.log('train_loss', loss_per_epoch, prog_bar=True)
self.train_loss = 0
self.num_train_batch = 0
def validation_step(self, batch, batch_idx):
with torch.no_grad(): # This ensures that gradients are not stored in memory
x, y = batch['image'], batch['label'] # Extract images and labels from batch
y_hat = self(x) # Perform model prediction
# Process predictions
metric_val_outputs = [AsDiscrete(
argmax=True, # Select class with highest probability
to_onehot=self.hparams.out_channels # Convert to one-hot encoding
)(i) for i in decollate_batch(y_hat)]
# Process labels
metric_val_labels = [AsDiscrete(
to_onehot=self.hparams.out_channels # Convert labels to one-hot encoding
)(i) for i in decollate_batch(y)]
# compute metric for current iteration
# Calculate Dice score for current batch
self.metric_fn(y_pred=metric_val_outputs, y=metric_val_labels)
# Calculate batch average metric
metrics = self.metric_fn.aggregate(reduction="mean_batch")
# Calculate mean across all particle types
val_metric = torch.mean(metrics) # I used mean over all particle species as the metric. This can be explored.
# Result Accumulation
self.val_metric += val_metric
self.num_val_batch += 1
torch.cuda.empty_cache()
return {'val_metric': val_metric}
def on_validation_epoch_end(self):
metric_per_epoch = self.val_metric/self.num_val_batch
#print(f"Epoch {self.current_epoch} - Average Val Metric: {metric_per_epoch:.4f}")
self.log('val_metric', metric_per_epoch, prog_bar=True, sync_dist=False) # sync_dist=True for distributed training
self.val_metric = 0
self.num_val_batch = 0
def configure_optimizers(self):
return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
- This code implements a 3D U-Net model using PyTorch Lightning
- def __init__(self, spatial_dims=3, in_channels=1, out_channels=7, ...):
- UNet Model Configuration:
- spatial_dims=3: Process 3D data
- in_channels=1: Grayscale image input
- out_channels=7: 7 class outputs (background + 6 particle types)
- channels=(48, 64, 80, 80): Number of channels per layer
- strides=(2, 2, 1): Stride for each layer
- num_res_units=1: Number of residual units to include in each encoder and decoder block
- Residual Unit structure:
- Input -> Conv3D -> BatchNorm -> ReLU -> Conv3D -> BatchNorm -> Add(Input) -> ReLU -> Output
- One "unit" consists of:
- 2 3D convolution layers
- 2 Batch Normalization layers
- ReLU activation function
- Skip connection (adding input to output)
- num_res_units=1 means this structure is repeated once at each level. If num_res_units=2, this entire structure would be repeated twice in sequence
- Loss Function and Evaluation Metrics:
- TverskyLoss: Loss function robust to class imbalance
- Tversky Loss is a generalized version of Dice Loss
- Effective for handling class imbalance problems
- Parameter explanation:
- include_background=True: Include background (class 0) in loss calculation
- to_onehot_y=True: Convert integer labels to one-hot vectors
- softmax=True: Apply softmax for multi-class classification
- DiceMetric: Segmentation performance metric
- Dice coefficient is a standard metric for evaluating segmentation performance
- Formula: 2|X∩Y| / (|X|+|Y|)
- X: Predicted region
- Y: Actual region
- Parameter explanation:
- include_background=False: Exclude background class from evaluation
- reduction="mean": Average Dice scores across all classes
- ignore_empty=True: Exclude cases where certain classes are absent
- Variable initialization
- self.train_loss = 0 # Accumulate training loss
- self.val_metric = 0 # Accumulate validation metric
- self.num_train_batch = 0 # Count processed training batches
- self.num_val_batch = 0 # Count processed validation batches
- These variables are used to calculate average performance during an epoch
- Reset to 0 at the end of each epoch
- def forward(self, x):
- Basic inference method for PyTorch models
- Passes input x through the model
- Simple but important roles:
- Simplifies model calls (enables self(x) instead of self.model(x))
- Used for model inference in other methods
- Integrates with PyTorch Lightning's automated features
- def training_step(self, batch, batch_idx):
- Process batch data
- Perform model prediction
- Calculate and accumulate loss
- def on_train_epoch_end(self):
- Calculate average loss per epoch
- Perform logging
- def validation_step(self, batch, batch_idx):
- OVERALL:
- Perform validation without gradient calculation
- Convert predictions to class labels
- Calculate Dice score
- with torch.no_grad():
- Turns off gradient calculation as backpropagation isn't needed during validation
- Reduces memory usage and improves computation speed
- metric_val_outputs
- decollate_batch(y_hat):
- Separates batch into individual samples
- Example: a batch of 16 → [sample1, sample2, ..., sample16]
- AsDiscrete(argmax=True):
- Selects class with highest probability at each position
- Example: [0.1, 0.7, 0.2] → 1 (second class)
- to_onehot=7:
- Converts selected class to one-hot vector
- Example: 1 → [0, 1, 0, 0, 0, 0, 0]
- metric_val_labels
- decollate_batch(y):
- Separates batch into individual samples
- AsDiscrete(to_onehot=7):
- Converts class index to one-hot vector
- Example: 2 → [0, 0, 1, 0, 0, 0, 0]
- def on_validation_epoch_end(self):
- Same pattern as on_train_epoch_end(self), but for the validation metric
- def configure_optimizers(self):
- Use AdamW optimizer
- Set learning rate
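Relating back to the TverskyLoss / DiceMetric choice above, a minimal numeric sketch of the Dice coefficient, and of how Tversky reduces to it when both penalty weights are 0.5 (which I believe is MONAI's default); the toy masks are made up:
import numpy as np

# Toy binary masks: prediction vs. ground truth for one class
pred  = np.array([1, 1, 1, 0, 0])
truth = np.array([1, 1, 0, 0, 1])

tp = np.sum(pred * truth)                   # |X ∩ Y| = 2
dice = 2 * tp / (pred.sum() + truth.sum())  # 2*2 / (3+3) ≈ 0.667
print(dice)

# Tversky index: TP / (TP + alpha*FP + beta*FN); alpha = beta = 0.5 gives Dice again
fp = np.sum(pred * (1 - truth))             # 1 false positive
fn = np.sum((1 - pred) * truth)             # 1 false negative
alpha = beta = 0.5
print(tp / (tp + alpha * fp + beta * fn))   # 2 / (2 + 0.5 + 0.5) ≈ 0.667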
channels = (48, 64, 80, 80)
strides_pattern = (2, 2, 1)
num_res_units = 1
learning_rate = 1e-3
num_epochs = 100
model = Model(channels=channels, strides=strides_pattern, num_res_units=num_res_units, lr=learning_rate)
6) Training the model
torch.set_float32_matmul_precision('medium')
# Check if CUDA is available and then count the GPUs
if torch.cuda.is_available():
num_gpus = torch.cuda.device_count()
print(f"Number of GPUs available: {num_gpus}")
else:
print("No GPU available. Running on CPU.")
devices = list(range(num_gpus))
print(devices)
trainer = pl.Trainer(
max_epochs=num_epochs, # Total number of training epochs (100)
#strategy="ddp_notebook", # Distributed training strategy (currently commented)
accelerator="gpu", # Use GPU
devices=[0], # Use only first GPU
num_nodes=1, # Use single node
log_every_n_steps=10, # Log every 10 steps
enable_progress_bar=True, # Enable progress bar
)
trainer.fit(model, train_loader, valid_loader)
- torch.set_float32_matmul_precision('medium')
- Sets precision of 32-bit floating-point matrix multiplication to 'medium'
- Balances speed and accuracy
- GPU Availability Check
- Checks for CUDA (GPU) availability
- Counts available GPUs
- Creates list of GPU indices (e.g., [0,1,2] for 3 GPUs)
- Trainer Setup
- max_epochs: Total number of training iterations
- accelerator: Hardware to use for training (GPU/CPU)
- devices: GPU numbers to use
- num_nodes: Number of nodes for distributed training
- log_every_n_steps: Logging frequency
- enable_progress_bar: Visualize training progress
- trainer.fit(model, train_loader, valid_loader)
- Training Cycle:
- Each epoch loads batch data from train_loader
- Executes training_step method:
- Model prediction
- Loss calculation
- Backpropagation and weight updates
- Runs on_train_epoch_end at epoch end
- Validation Cycle:
- Loads data from valid_loader after each epoch
- Executes validation_step method:
- Model prediction
- Dice score calculation
- Runs on_validation_epoch_end at epoch end
7) Predicting on the test set
# Model setup
model.eval();
model.to("cuda");
# Configuration File Processing
import json
copick_config_path = TRAIN_DATA_DIR + "/copick.config"
with open(copick_config_path) as f:
copick_config = json.load(f)
copick_config['static_root'] = '/kaggle/input/czii-cryo-et-object-identification/test/static'
copick_test_config_path = 'copick_test.config'
with open(copick_test_config_path, 'w') as outfile:
json.dump(copick_config, outfile)
# Copick Setup
import copick
root = copick.from_file(copick_test_config_path)
copick_user_name = "copickUtils"
copick_segmentation_name = "paintedPicks"
voxel_size = 10
tomo_type = "denoised"
- Switches the trained model to evaluation mode and prepares settings for test data
- Model Setup
- eval(): Switches dropout, batch normalization, etc. to evaluation mode
- to("cuda"): Moves model to GPU memory
- Configuration File Processing
- Loads original configuration file
- Updates test data path
- Saves new configuration to file
- Copick Setup
- Loads configuration using copick library
- Sets parameters needed for testing
# Setting up Inference Transformations:
# Non-random transforms to be cached
inference_transforms = Compose([
EnsureChannelFirstd(keys=["image"], channel_dim="no_channel"),
NormalizeIntensityd(keys="image"),
Orientationd(keys=["image"], axcodes="RAS")
])
import cc3d
id_to_name = {1: "apo-ferritin",
2: "beta-amylase",
3: "beta-galactosidase",
4: "ribosome",
5: "thyroglobulin",
6: "virus-like-particle"}
- Setting up Inference Transformations:
- Unlike training, no random transformations (for consistent predictions)
- Applied transformations:
- EnsureChannelFirstd: Moves channel dimension to first position
- NormalizeIntensityd: Normalizes image values
- Orientationd: Aligns 3D images to RAS standard
- cc3d
- Library for Connected Components analysis
- Used to find and label connected regions in 3D images
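A tiny hedged sketch of the two cc3d calls used in the inference loop below (the toy mask is made up; the statistics keys are the ones the prediction code reads):
import numpy as np
import cc3d

# Toy binary mask with two separate blobs
mask = np.zeros((10, 10, 10), dtype=bool)
mask[1:3, 1:3, 1:3] = True      # blob of 8 voxels
mask[6:9, 6:9, 6:9] = True      # blob of 27 voxels

labels = cc3d.connected_components(mask)   # labeled array: 0 = background, 1..N = blobs
stats = cc3d.statistics(labels)

# Index 0 is the background component, so it is skipped (as in the loop below)
print(stats['voxel_counts'][1:])   # blob sizes, e.g. [ 8 27]
print(stats['centroids'][1:])      # per-blob centroids in array-index order (treated as z, y, x later)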
Iterate over test set:
- Read in a run
- Split it into patches of size (96, 96, 96)
- Create a dataset from the patches
- Predict the segmentation mask
- Glue the mask back together
- Find the connected components for each class
- Find the centroids of the connected components
- Add to the dataframe
- Then do this for all runs.
- "This can probably be optimized quite a bit."
BLOB_THRESHOLD = 500
CERTAINTY_THRESHOLD = 0.5
classes = [1, 2, 3, 4, 5, 6]
with torch.no_grad():
location_df = []
for run in root.runs:
print(run)
# "Read in a run"
tomo = run.get_voxel_spacing(10)
tomo = tomo.get_tomogram(tomo_type).numpy()
# "Split into patches"
tomo_patches, coordinates = extract_3d_patches_minimal_overlap([tomo], 96)
# "Create a dataset"
tomo_patched_data = [{"image": img} for img in tomo_patches]
tomo_ds = CacheDataset(data=tomo_patched_data, transform=inference_transforms, cache_rate=1.0)
# "Predict the segmentation mask"
pred_masks = []
for i in range(len(tomo_ds)):
input_tensor = tomo_ds[i]['image'].unsqueeze(0).to("cuda")
model_output = model(input_tensor)
probs = torch.softmax(model_output[0], dim=0)
thresh_probs = probs > CERTAINTY_THRESHOLD
_, max_classes = thresh_probs.max(dim=0)
pred_masks.append(max_classes.cpu().numpy())
# "Glue the mask back together"
reconstructed_mask = reconstruct_array(pred_masks, coordinates, tomo.shape)
location = {}
for c in classes:
# "Find the connected components"
cc = cc3d.connected_components(reconstructed_mask == c)
stats = cc3d.statistics(cc)
# "Find the centroids"
zyx=stats['centroids'][1:]*10.012444 #https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/544895#3040071
zyx_large = zyx[stats['voxel_counts'][1:] > BLOB_THRESHOLD]
xyz =np.ascontiguousarray(zyx_large[:,::-1])
location[id_to_name[c]] = xyz
# "Add to the dataframe"
df = dict_to_df(location, run.name)
location_df.append(df)
location_df = pd.concat(location_df)
location_df.insert(
loc=0, # Insert at first position
column='id', # Column name is 'id'
value=np.arange(len(location_df)) # Sequential numbers starting from 0
)
location_df.to_csv("submission.csv", index=False)
- Adding ID Column
- Assigns unique ID to each predicted particle
- Meets Kaggle submission format requirements
- Saving to CSV file
- index=False: Excludes DataFrame index from saved file
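A quick hedged check of the submission layout right after writing it (the values are whatever the model predicted; only the column order matters here):
import pandas as pd

# Re-read the file to confirm the expected columns: id, experiment, particle_type, x, y, z
print(pd.read_csv("submission.csv").head())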
!cp -r /kaggle/input/hengck-czii-cryo-et-01/* .
from czii_helper import *
from dataset import *
from scipy.optimize import linear_sum_assignment
import matplotlib.pyplot as plt
- !cp -r ...:
- Linux command that copies the required helper files from a Kaggle dataset into the current working directory
- hengck-czii-cryo-et-01 includes:
- czii_helper.py: Utility functions for evaluation metric calculations
- dataset.py: Functions for data loading and processing
- PARTICLE: a constant defined in the copied files, containing characteristics of each particle type (name, radius, difficulty level, etc.)
import os
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
MODE = 'submit'
else:
MODE = 'local'
valid_dir ='/kaggle/input/czii-cryo-et-object-identification/train'
valid_id = ['TS_6_4', ]
def do_one_eval(truth, predict, threshold):
P=len(predict)
T=len(truth)
if P==0:
hit=[[],[]]
miss=np.arange(T).tolist()
fp=[]
metric = [P,T,len(hit[0]),len(miss),len(fp)]
return hit, fp, miss, metric
if T==0:
hit=[[],[]]
fp=np.arange(P).tolist()
miss=[]
metric = [P,T,len(hit[0]),len(miss),len(fp)]
return hit, fp, miss, metric
#---
distance = predict.reshape(P,1,3)-truth.reshape(1,T,3)
distance = distance**2
distance = distance.sum(axis=2)
distance = np.sqrt(distance)
p_index, t_index = linear_sum_assignment(distance)
valid = distance[p_index, t_index] <= threshold
p_index = p_index[valid]
t_index = t_index[valid]
hit = [p_index.tolist(), t_index.tolist()]
miss = np.arange(T)
miss = miss[~np.isin(miss,t_index)].tolist()
fp = np.arange(P)
fp = fp[~np.isin(fp,p_index)].tolist()
    metric = [P,T,len(hit[0]),len(miss),len(fp)] #for lb metric F-beta computation
return hit, fp, miss, metric
def compute_lb(submit_df, overlay_dir):
valid_id = list(submit_df['experiment'].unique())
print(valid_id)
eval_df = []
for id in valid_id:
truth = read_one_truth(id, overlay_dir) #=f'{valid_dir}/overlay/ExperimentRuns')
id_df = submit_df[submit_df['experiment'] == id]
for p in PARTICLE:
p = dotdict(p)
print('\r', id, p.name, end='', flush=True)
xyz_truth = truth[p.name]
xyz_predict = id_df[id_df['particle_type'] == p.name][['x', 'y', 'z']].values
hit, fp, miss, metric = do_one_eval(xyz_truth, xyz_predict, p.radius* 0.5)
eval_df.append(dotdict(
id=id, particle_type=p.name,
P=metric[0], T=metric[1], hit=metric[2], miss=metric[3], fp=metric[4],
))
print('')
eval_df = pd.DataFrame(eval_df)
gb = eval_df.groupby('particle_type').agg('sum').drop(columns=['id'])
gb.loc[:, 'precision'] = gb['hit'] / gb['P']
gb.loc[:, 'precision'] = gb['precision'].fillna(0)
gb.loc[:, 'recall'] = gb['hit'] / gb['T']
gb.loc[:, 'recall'] = gb['recall'].fillna(0)
gb.loc[:, 'f-beta4'] = 17 * gb['precision'] * gb['recall'] / (16 * gb['precision'] + gb['recall'])
gb.loc[:, 'f-beta4'] = gb['f-beta4'].fillna(0)
gb = gb.sort_values('particle_type').reset_index(drop=False)
# https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/544895
gb.loc[:, 'weight'] = [1, 0, 2, 1, 2, 1]
lb_score = (gb['f-beta4'] * gb['weight']).sum() / gb['weight'].sum()
return gb, lb_score
#debug
if 1:
if MODE=='local':
#if 1:
submit_df=pd.read_csv(
'submission.csv'
# '/kaggle/input/hengck-czii-cryo-et-weights-01/submission.csv'
)
gb, lb_score = compute_lb(submit_df, f'{valid_dir}/overlay/ExperimentRuns')
print(gb)
print('lb_score:',lb_score)
print('')
#show one ----------------------------------
fig = plt.figure(figsize=(18, 8))
id = valid_id[0]
truth = read_one_truth(id,overlay_dir=f'{valid_dir}/overlay/ExperimentRuns')
submit_df = submit_df[submit_df['experiment']==id]
for p in PARTICLE:
p = dotdict(p)
xyz_truth = truth[p.name]
xyz_predict = submit_df[submit_df['particle_type']==p.name][['x','y','z']].values
hit, fp, miss, _ = do_one_eval(xyz_truth, xyz_predict, p.radius)
print(id, p.name)
print('\t num truth :',len(xyz_truth) )
print('\t num predict :',len(xyz_predict) )
print('\t num hit :',len(hit[0]) )
print('\t num fp :',len(fp) )
print('\t num miss :',len(miss) )
ax = fig.add_subplot(2, 3, p.label, projection='3d')
if hit[0]:
pt = xyz_predict[hit[0]]
ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], alpha=0.5, color='r')
pt = xyz_truth[hit[1]]
ax.scatter(pt[:,0], pt[:,1], pt[:,2], s=80, facecolors='none', edgecolors='r')
if fp:
pt = xyz_predict[fp]
ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], alpha=1, color='k')
if miss:
pt = xyz_truth[miss]
ax.scatter(pt[:, 0], pt[:, 1], pt[:, 2], s=160, alpha=1, facecolors='none', edgecolors='k')
ax.set_title(f'{p.name} ({p.difficulty})')
plt.tight_layout()
plt.show()
#---
zz=0
- Overall: evaluates and visualizes the model's predictions against the held-out validation run
- do_one_eval:
- Inputs:
- truth: actual particle positions
- predict: predicted particle positions
- threshold: matching distance threshold
- Main process:
- Handle exceptions (P=0 or T=0 cases)
- Calculate distances between predictions and truth
- Find optimal matching (using linear_sum_assignment)
- Filter valid matches based on threshold
- Calculate hits/misses/false positives
- Returns:
- hit: correct prediction indices
- fp: false prediction indices
- miss: missed particle indices
- metric: [P, T, num_hits, num_misses, num_fps]
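A small worked example of do_one_eval (assuming the function defined above is in scope; the coordinates are made up):
import numpy as np

truth   = np.array([[0.0, 0.0, 0.0],
                    [100.0, 100.0, 100.0]])
predict = np.array([[1.0, 1.0, 1.0],        # ~1.7 away from the first truth point -> hit
                    [50.0, 50.0, 50.0]])    # far from everything -> false positive

hit, fp, miss, metric = do_one_eval(truth, predict, threshold=10)
print(hit)     # [[0], [0]] -> predict[0] matched truth[0]
print(fp)      # [1]        -> predict[1] is unmatched
print(miss)    # [1]        -> truth[1] was never predicted
print(metric)  # [2, 2, 1, 1, 1] = [P, T, hits, misses, false positives]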
- compute_lb
- Inputs:
- submit_df: prediction results dataframe to submit
- overlay_dir: ground truth data path
- Main process:
- Evaluate predictions for each experiment ID
- Calculate performance per particle type
- Calculate precision and recall
- Calculate f-beta4 score (beta=4 weights recall)
- Apply particle type weights
- Returns:
- gb: performance metrics per particle type
- lb_score: final score
- read_one_truth: loading the ground truth data
- We compute a local LB score using the validation run we held out: TS_6_4 (the f-beta4 weighting is sketched below)
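A minimal numeric sketch of the f-beta4 scoring used in compute_lb (the precision/recall values are made up):
# F-beta with beta = 4 weights recall 16x more than precision:
# f_beta = (1 + beta^2) * p * r / (beta^2 * p + r) = 17*p*r / (16*p + r)
precision, recall = 0.5, 0.8
f_beta4 = 17 * precision * recall / (16 * precision + recall)
print(round(f_beta4, 3))   # ~0.773, dominated by recall

# The final score is the weighted average of f-beta4 over particle types,
# using the weights [1, 0, 2, 1, 2, 1] from the code above (beta-amylase weighted 0).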
I think it's very important to have a feedback loop, where you're constantly thinking about what you've done and how you could be doing it better.
- Elon Musk -