
CZII - CryoET Object Identification #1 - Training Data

dongsunseng 2025. 1. 14. 02:09

This post is an annotation of the training-data kernel by "fnands".

https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name

 


Kernel 'Create Numpy dataset exp name'

  • Overall this kernel is about PREPARING TRAINING DATA
!pip install git+https://github.com/copick/copick-utils.git matplotlib tqdm copick 
!pip install -q "monai-weekly[mlflow]"

 

 

Together, these packages create a complete environment for processing, analyzing, and applying machine learning models to cryo-electron tomography (cryo-ET) data.

  1. copick-utils (`git+https://github.com/copick/copick-utils.git`)
    • A companion utility library for copick, used here for cryo-ET (cryo-electron tomography) data
    • Installed directly from its GitHub repository
    • Provides helpers for processing, analyzing, and visualizing tomographic data (e.g., the segmentation_from_picks and write utilities used later in this kernel)
  2. matplotlib
    • Python's primary visualization library
    • Used for creating and displaying graphs, charts, and images
    • Essential tool for visualizing data analysis results
  3. tqdm
    • Library that provides progress bars
    • Enables real-time monitoring of long-running tasks
    • Particularly useful when processing large datasets (a one-line demo follows this list)
  4. copick
    • Core library for accessing cryo-ET datasets through a shared project configuration
    • Provides functionality for reading tomograms, managing picks, and writing results
    • Serves as the base framework that copick-utils builds on
  5. monai-weekly[mlflow]
    • MONAI (Medical Open Network for AI) is a deep learning framework for medical images
    • Built on PyTorch and specialized for medical image processing
    • Key features:
      • Data preprocessing and augmentation
      • Neural network models for medical images
      • Training and evaluation tools
    • [mlflow] is an optional extra:
      • MLflow is a platform for tracking and managing machine learning experiments
      • Records and manages experiment results, models, and parameters
      • Helps compare and reproduce model performance
    • The '-q' flag runs pip in 'quiet' mode, minimizing installation output.
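As a quick illustration of tqdm (a sketch; the loop and sleep are made up), wrapping any iterable renders a live progress bar, exactly how this kernel later wraps root.runs:

# Sketch: tqdm wraps an iterable and displays a progress bar.
from tqdm import tqdm
import time

for _ in tqdm(range(5), desc="demo"):
    time.sleep(0.1)  # stand-in for real per-item work
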
!pip install zarr
!pip install copick
  • zarr:
    • A format and library for storing and processing N-dimensional arrays
    • Main Features:
      • Chunked compression storage
      • Parallel processing support
      • Hierarchical organization capability
      • Cloud storage compatibility
      • NumPy-compatible interface
    • Purpose:
      • Processing large-scale scientific data
      • Data sharing in distributed computing environments
      • Processing datasets larger than available memory
    • Advantages:
      • Memory efficient: Can process data without loading entire dataset into memory
      • Fast I/O performance: Efficient data access through chunk-based approach
      • Flexible storage format: Supports various storage options (local disk, cloud, etc.)
      • Parallel processing: Multiple processes can access data simultaneously
    • Common Use Cases:
      • Large scientific datasets (e.g., meteorological data, satellite images)
      • Machine learning datasets
      • Biological data (e.g., cryo-electron microscopy data)
    • Relationship with the asciitree package:
      • zarr uses asciitree to render the hierarchical structure of its groups and arrays as a tree in the terminal
      • asciitree is only a display dependency, but it can occasionally be troublesome to install or use
    • A minimal zarr usage sketch follows this list.
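To make the chunked-storage idea concrete, here is a minimal, self-contained zarr sketch; the path, shapes, and chunk sizes are illustrative choices, not values from this kernel:

# Sketch: create a chunked, compressed on-disk array; only the chunks
# actually touched are loaded into memory.
import numpy as np
import zarr

z = zarr.open("example.zarr", mode="w", shape=(1000, 1000, 1000),
              chunks=(100, 100, 100), dtype="f4")
z[0:100, 0:100, 0:100] = np.random.rand(100, 100, 100)  # writes a single chunk

# Read back a small slice without loading the whole array.
block = z[0:50, 0:50, 0:50]
print(block.shape)  # (50, 50, 50)
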
# Make a copick project
import os
import shutil

# Define configuration for protein structures and project settings
config_blob = """{
   "name": "czii_cryoet_mlchallenge_2024",
   "description": "2024 CZII CryoET ML Challenge training data.",
   "version": "1.0.0",

   "pickable_objects": [
       {
           "name": "apo-ferritin",
           "is_particle": true,
           "pdb_id": "4V1W",
           "label": 1,
           "color": [0, 117, 220, 128],
           "radius": 60,
           "map_threshold": 0.0418
       },
       {
           "name": "beta-amylase",
           "is_particle": true,
           "pdb_id": "1FA2", 
           "label": 2,
           "color": [153, 63, 0, 128],
           "radius": 65,
           "map_threshold": 0.035
       },
       {
           "name": "beta-galactosidase",
           "is_particle": true,
           "pdb_id": "6X1Q",
           "label": 3,
           "color": [76, 0, 92, 128],
           "radius": 90,
           "map_threshold": 0.0578
       },
       {
           "name": "ribosome",
           "is_particle": true,
           "pdb_id": "6EK0",
           "label": 4,
           "color": [0, 92, 49, 128],
           "radius": 150,
           "map_threshold": 0.0374
       },
       {
           "name": "thyroglobulin",
           "is_particle": true,
           "pdb_id": "6SCJ",
           "label": 5,
           "color": [43, 206, 72, 128],
           "radius": 130,
           "map_threshold": 0.0278
       },
       {
           "name": "virus-like-particle",
           "is_particle": true,
           "pdb_id": "6N4V",            
           "label": 6,
           "color": [255, 204, 153, 128],
           "radius": 135,
           "map_threshold": 0.201
       }
   ],

   "overlay_root": "/kaggle/working/overlay",
   "overlay_fs_args": {
       "auto_mkdir": true
   },
   "static_root": "/kaggle/input/czii-cryo-et-object-identification/train/static"
}"""

# Define paths
copick_config_path = "/kaggle/working/copick.config"
output_overlay = "/kaggle/working/overlay"

# Write configuration file
with open(copick_config_path, "w") as f:
   f.write(config_blob)
   
# Update the overlay
# Define source and destination directories
source_dir = '/kaggle/input/czii-cryo-et-object-identification/train/overlay'
destination_dir = '/kaggle/working/overlay'

# Walk through the source directory
for root, dirs, files in os.walk(source_dir):
   # Create corresponding subdirectories in the destination
   relative_path = os.path.relpath(root, source_dir)
   target_dir = os.path.join(destination_dir, relative_path)
   os.makedirs(target_dir, exist_ok=True)
   
   # Copy and rename each file
   for file in files:
       # Add prefix 'curation_0_' if not already present
       if file.startswith("curation_0_"):
           new_filename = file
       else:
           new_filename = f"curation_0_{file}"
       
       # Define full paths for the source and destination files
       source_file = os.path.join(root, file)
       destination_file = os.path.join(target_dir, new_filename)
       
       # Copy the file with the new name
       shutil.copy2(source_file, destination_file)
       print(f"Copied {source_file} to {destination_file}")
  • This code sets up a copick project for the competition.
  • shutil:
    • shutil is a Python standard library - it stands for "shell utility"
    • It provides high-level file operations such as copying, moving, and removing files and file collections
  • The config_blob = """...""" part
    • Contains information about 6 protein structures:
      • apo-ferritin: Iron storage protein
      • beta-amylase: Enzyme protein
      • beta-galactosidase: Sugar breakdown enzyme
      • ribosome: Protein synthesis structure
      • thyroglobulin: Thyroid hormone precursor
      • virus-like-particle: Virus-like particle
    • Attributes defined for each structure:
      • name: Structure name
      • is_particle: Particle status
      • pdb_id: Protein Data Bank ID
      • label: Classification label (1-6)
      • color: RGBA color value ([R,G,B,A])
      • radius: Particle radius
      • map_threshold: Mapping threshold
    • "overlay_root": "/kaggle/working/overlay"
      • Specifies the root directory where generated data (overlays) will be stored
      • Represents the working directory for use in Kaggle environment
      • /kaggle/working/ is a writable directory in Kaggle notebooks
    • "overlay_fs_args": {
          "auto_mkdir": true
      }

      • Sets file system related arguments
      • auto_mkdir: true means it will automatically create directories if they don't exist
      • Creates necessary paths automatically when saving files or data
    • "static_root": "/kaggle/input/czii-cryo-et-object-identification/train/static"
      • Specifies the path where original or unchanging static data is stored
      • Path to input data for the Kaggle competition
      • /kaggle/input/ is the read-only data directory provided by Kaggle
    • Within the Kaggle environment, these settings define:
      • Where to read data from
      • Where to store processed results
      • How to manage the file system
    • is_particle (particle status):
      • Set to true in the data
      • Indicates whether the object should be treated as an independent particle
      • true means this structure is an individually identifiable, separate particle
      • This affects how the object is handled during image processing and analysis
    • pdb_id (Protein Data Bank ID):
      • Unique identifier like "6N4V"
      • PDB (Protein Data Bank) is a global database storing 3D structural information of proteins and nucleic acids
      • This ID allows access to detailed structural information of the molecule
      • For example, "6N4V" for virus-like-particle is a unique identifier storing atomic-level details of this structure
      • Detailed information can be viewed by searching this ID on the PDB website (rcsb.org)
    • radius:
      • Typically measured in Angstroms (Å) or nanometers (nm)
      • Reflects the actual physical size of virus-like particles
      • Set based on average particle size visible in electron microscope images
    • map_threshold:
      • Threshold value for identifying particles in electron density maps
      • Higher values mean stricter particle identification criteria
      • 0.201 is significantly higher than other particles (e.g., apo-ferritin's 0.0418, beta-amylase's 0.035)
      • This might be because virus-like particles show stronger contrast in electron microscope images
    • color:
      • RGBA color value ([R, G, B, A]), where the fourth channel is alpha (opacity)
      • Set for visualization purposes; it does not affect the analysis
      • The last value, 128, is the alpha channel (the midpoint of the 0-255 range, i.e., half-transparent)
  • File system setup:
    • copick_config_path = "/kaggle/working/copick.config"
      output_overlay = "/kaggle/working/overlay"
      • Specifies paths for configuration file and output directory
      • Set up for Kaggle environment
  • The for-loop part:
    • for root, dirs, files in os.walk(source_dir):
      • Uses os.walk to traverse all files and subdirectories in the source directory
      • Creates an identical directory structure at the destination
    • for file in files:
      • if/else clause:
        • Adds the "curation_0_" prefix to every file name
        • Leaves files that already have the prefix unchanged
      • shutil.copy2(source_file, destination_file):
        • Uses shutil.copy2 to copy each file
        • Also copies metadata (creation time, modification time, etc.)
  • Overall, this prepares and structures the dataset needed for training machine learning models, specifically for identifying and classifying the protein structures captured by cryo-electron tomography. (A small JSON-inspection sketch follows.)
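As a quick illustration, the particle definitions inside config_blob can be inspected with nothing but the standard json module; this sketch is independent of copick itself:

# Sketch: parse config_blob and list the six particle definitions.
import json

config = json.loads(config_blob)
for obj in config["pickable_objects"]:
    print(f"label={obj['label']}  name={obj['name']:<22}  radius={obj['radius']}")
# Expected output includes, e.g., label=1  name=apo-ferritin  radius=60
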
import os
import numpy as np
from pathlib import Path
import torch
import torchinfo
import zarr, copick
from tqdm import tqdm
from monai.data import DataLoader, Dataset, CacheDataset, decollate_batch
from monai.transforms import (
    Compose, 
    EnsureChannelFirstd, 
    Orientationd,  
    AsDiscrete,  
    RandFlipd, 
    RandRotate90d, 
    NormalizeIntensityd,
    RandCropByLabelClassesd,
)
from monai.networks.nets import UNet
from monai.losses import DiceLoss, FocalLoss, TverskyLoss
from monai.metrics import DiceMetric, ConfusionMatrixMetric
import mlflow
import mlflow.pytorch

Preparing the dataset

1. Get copick root

root = copick.from_file(copick_config_path)

copick_user_name = "copickUtils"
copick_segmentation_name = "paintedPicks"
voxel_size = 10
tomo_type = "denoised"
  • Initializing the basic configuration of the copick project
  • root = copick.from_file(copick_config_path)
    • Initializes a copick object by reading the configuration file from copick_config_path
    • Loads settings including protein structure information and paths into this object
  • copick_user_name = "copickUtils"
    • Sets an identifier for the user/tool performing the work
    • Used to track and distinguish results
  • copick_segmentation_name = "paintedPicks"
    • Specifies the name for segmentation (image region distinction) results
    • Results will be saved and referenced using this name
  • voxel_size = 10
    • Sets the voxel size (here 10, i.e., 10 Å per voxel in this dataset) that defines the resolution of the 3D volumes
    • A voxel is the basic unit of a 3D image, analogous to a pixel in a 2D image
  • tomo_type = "denoised"
    • Specifies which type of tomogram (3D volume) data to use
    • "denoised" selects reconstructions that have had noise removed
    • Noise removal improves image quality and makes analysis easier (a short sanity-check sketch follows this list)
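As a sanity check that the project loaded correctly, the root object can be inspected directly; this sketch only uses attributes this kernel itself relies on later (root.runs, run.name, root.pickable_objects):

# Sketch: list the runs and pickable objects exposed by the copick root.
for run in root.runs:
    print("run:", run.name)  # e.g. "TS_86_3"
for obj in root.pickable_objects:
    print("object:", obj.name, "label:", obj.label, "radius:", obj.radius)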

2. Generate multi-class segmentation masks from picks and save them to the copick overlay directory (one-time)

# Import segmentation-related utilities
from copick_utils.segmentation import segmentation_from_picks
import copick_utils.writers.write as write
from collections import defaultdict

# Just do this once
generate_masks = True

if generate_masks:
    # Stores label and radius information for each particle in a dictionary
    # Only processes objects where is_particle is true
    target_objects = defaultdict(dict)
    for object in root.pickable_objects:
        if object.is_particle:
            target_objects[object.name]['label'] = object.label
            target_objects[object.name]['radius'] = object.radius

    # Process Tomograms and Create Masks
    for run in tqdm(root.runs):
        # Get tomogram data
        tomo = run.get_voxel_spacing(10)
        tomo = tomo.get_tomogram(tomo_type).numpy()
        
        # Create empty target array
        target = np.zeros(tomo.shape, dtype=np.uint8)
        
        # Generate Segmentation Masks
        for pickable_object in root.pickable_objects:
            pick = run.get_picks(object_name=pickable_object.name, user_id="curation")
            if len(pick):  
                target = segmentation_from_picks.from_picks(pick[0], 
                                                            target, 
                                                            target_objects[pickable_object.name]['radius'] * 0.8,
                                                            target_objects[pickable_object.name]['label']
                                                            )
        write.segmentation(run, target, copick_user_name, name=copick_segmentation_name)
  • from collections import defaultdict
    • defaultdict automatically handles default values for missing dictionary keys
  • generate_masks = True
    • Flag that controls whether to generate segmentation masks or not
    • Generating segmentation masks is a time-consuming operation
    • It only needs to be done once (this is why the comment says "Just do this once")
    • A segmentation mask is a binary or multi-class label map used to distinguish specific objects or regions in an image
    • It's used to distinguish 6 different protein structures
    • Each structure has a unique label (1-6)
    • Background is marked as 0
    • The mask is in 3D form, indicating which structure each voxel belongs to
  • for run in tqdm(root.runs):
    • Gets tomogram data for each run
    • Retrieves data at specified voxel size (10)
    • Converts to numpy array for processing
    • Creates empty array for storing segmentation masks
  • for pickable_object in root.pickable_objects:
    • For each object:
      • Gets its pick (coordinate) information
      • Creates a segmentation mask if picks exist
      • Uses 80% of the radius (* 0.8) when painting, presumably to keep neighboring spheres from overlapping
      • Uses the object's label as the voxel value (a NumPy-only sketch of this "paint a sphere per pick" idea follows this list)
  • write.segmentation(run, target, copick_user_name, name=copick_segmentation_name)
    • Saves generated segmentation masks
    • Saves with specified user name and segmentation name
  • root.runs:
    • Represents each experimental run in the dataset
    • In this code, we can see there are 7 experimental datasets:
      1. TS_86_3
      2. TS_6_6
      3. TS_6_4
      4. TS_5_4
      5. TS_73_6
      6. TS_99_9
      7. TS_69_2
    •  Each run represents one electron microscope imaging session
    • Therefore, for run in tqdm(root.runs):
      • For each experimental session (TS_*)
      • Retrieves the tomogram data from that session
      • Locates each protein
      • Generates segmentation masks
    • Each run contains:
      • Tomogram data (via run.get_voxel_spacing(...).get_tomogram(...))
      • Protein location information (run.get_picks())
      • Other metadata
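To make the mask-generation step concrete, here is a NumPy-only sketch of the "paint a sphere of the object's label around each pick" idea. This is an illustration of the concept, not copick-utils' actual implementation; the volume size, center, and radius are made up, and the sketch works directly in voxel units:

# Sketch: paint a filled sphere of a given label into a 3D volume.
import numpy as np

def paint_sphere(target, center_zyx, radius_vox, label):
    """Set all voxels within radius_vox of center_zyx to label."""
    zz, yy, xx = np.ogrid[:target.shape[0], :target.shape[1], :target.shape[2]]
    cz, cy, cx = center_zyx
    dist2 = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
    target[dist2 <= radius_vox ** 2] = label
    return target

mask = np.zeros((64, 64, 64), dtype=np.uint8)
mask = paint_sphere(mask, (32, 32, 32), radius_vox=6, label=4)  # e.g. ribosome
print(np.unique(mask))  # [0 4]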

3. Get tomograms and their segmentation-mask arrays (from picks)

data_dicts = []  # Create empty list to store data
for run in tqdm(root.runs):  # Iterate over 7 experimental datasets
    # Get tomogram data
    tomogram = run.get_voxel_spacing(voxel_size)  # Get data at resolution set to voxel_size=10
    tomogram = tomogram.get_tomogram(tomo_type)   # Get "denoised" type tomogram
    tomogram = tomogram.numpy()                    # Convert to numpy array

    # Get segmentation masks
    segmentation = run.get_segmentations(
        name=copick_segmentation_name,    # "paintedPicks"
        user_id=copick_user_name,         # "copickUtils"
        voxel_size=voxel_size,           # 10
        is_multilabel=True               # Mask distinguishing multiple classes (proteins)
    )[0].numpy()

    # Add to data dictionary
    data_dicts.append({
        "name": run.name,        # Experiment name (e.g., "TS_86_3")
        "image": tomogram,       # Tomogram data
        "label": segmentation    # Segmentation mask
    })

# Print label values from first data
print(np.unique(data_dicts[0]['label']))  # Outputs [0 1 2 3 4 5 6]
  • Collects the tomogram and segmentation mask from each experimental run
  • Results explanation:
    • [0 1 2 3 4 5 6] are all unique values in the mask:
      • 0: Background
      • 1: apo-ferritin
      • 2: beta-amylase
      • 3: beta-galactosidase
      • 4: ribosome
      • 5: thyroglobulin
      • 6: virus-like-particle
  • Each dictionary created for experimental data includes:
    • Experiment name
    • Original image (tomogram)
    • Segmentation mask (labels)
  • This prepared data can be used later for training machine learning models (a quick shape-consistency sketch follows below).
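Before anything is written to disk, a small consistency sketch (it only touches the data_dicts built above) can verify that every tomogram and its mask line up:

# Sketch: each tomogram and its mask must share the same 3D shape,
# and the mask should only contain the labels 0-6.
for d in data_dicts:
    assert d["image"].shape == d["label"].shape, d["name"]
    print(d["name"], d["image"].shape, np.unique(d["label"]))
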
# For each of the 7 experimental datasets
for i in range(7):
    # Save image (tomogram) data
    with open(f"train_image_{data_dicts[i]['name']}.npy", 'wb') as f:
        np.save(f, data_dicts[i]['image'])
    
    # Save label (segmentation mask) data    
    with open(f"train_label_{data_dicts[i]['name']}.npy", 'wb') as f:
        np.save(f, data_dicts[i]['label'])
  • Saves the previously created data to files
  • Specifically:
    • Two .npy files are created for each experiment:
      • train_image_TS_XX_X.npy: tomogram data
      • train_label_TS_XX_X.npy: segmentation mask
    • File format:
      • .npy: NumPy's binary array storage format
      • 'wb': opens the file in binary write mode
    • A small round-trip sketch follows below.
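As a round-trip check (a sketch; TS_86_3 is one of the run names listed earlier), a saved pair can be reloaded with np.load:

# Sketch: reload one saved image/label pair to confirm the round trip.
image = np.load("train_image_TS_86_3.npy")
label = np.load("train_label_TS_86_3.npy")
print(image.shape, label.shape)  # shapes should match
print(np.unique(label))          # expect [0 1 2 3 4 5 6]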

I could either watch it happen or be a part of it.
- Elon Musk -