
CZII - CryoET Object Identification #1 - Training Data

dongsunseng 2025. 1. 14. 02:09

This post is an annotation of the training-data kernel by "fnands".

https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name

 


Kernel 'Create Numpy dataset exp name'

  • Overall this kernel is about PREPARING TRAINING DATA
!pip install git+https://github.com/copick/copick-utils.git matplotlib tqdm copick 
!pip install -q "monai-weekly[mlflow]"

 

 

Together, these packages create a complete environment for processing, analyzing, and applying machine learning models to cryo-electron tomography (cryo-ET) data.

  1. copick-utils (`git+https://github.com/copick/copick-utils.git`)
    • A companion utility library for copick, used here for cryo-ET (cryo-electron tomography) data
    • Installed directly from its GitHub repository
    • Provides helpers for processing, analyzing, and visualizing tomographic data (e.g., the segmentation_from_picks and write utilities used later in this kernel)
  2. matplotlib
    • Python's primary visualization library
    • Used for creating and displaying graphs, charts, and images
    • Essential tool for visualizing data analysis results
  3. tqdm
    • Library that provides progress bars
    • Enables real-time monitoring of long-running tasks
    • Particularly useful when processing large datasets (a one-line demo follows this list)
  4. copick
    • Core library for accessing cryo-ET datasets through a shared project configuration
    • Provides functionality for reading tomograms, managing picks, and writing results
    • Serves as the base framework that copick-utils builds on
  5. monai-weekly[mlflow]
    • MONAI (Medical Open Network for AI) is a deep learning framework for medical images
    • Built on PyTorch and specialized for medical image processing
    • Key features:
      • Data preprocessing and augmentation
      • Neural network models for medical images
      • Training and evaluation tools
    • [mlflow] is an optional extra:
      • MLflow is a platform for tracking and managing machine learning experiments
      • Records and manages experiment results, models, and parameters
      • Helps compare and reproduce model performance
    • The '-q' flag runs pip in 'quiet' mode, minimizing installation output.
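As a quick illustration of tqdm (a sketch; the loop and sleep are made up), wrapping any iterable renders a live progress bar, exactly how this kernel later wraps root.runs:

# Sketch: tqdm wraps an iterable and displays a progress bar.
from tqdm import tqdm
import time

for _ in tqdm(range(5), desc="demo"):
    time.sleep(0.1)  # stand-in for real per-item work
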
!pip install zarr
!pip install copick
  • zarr:
    • A format and library for storing and processing N-dimensional arrays
    • Main Features:
      • Chunked compression storage
      • Parallel processing support
      • Hierarchical organization capability
      • Cloud storage compatibility
      • NumPy-compatible interface
    • Purpose:
      • Processing large-scale scientific data
      • Data sharing in distributed computing environments
      • Processing datasets larger than available memory
    • Advantages:
      • Memory efficient: Can process data without loading entire dataset into memory
      • Fast I/O performance: Efficient data access through chunk-based approach
      • Flexible storage format: Supports various storage options (local disk, cloud, etc.)
      • Parallel processing: Multiple processes can access data simultaneously
    • Common Use Cases:
      • Large scientific datasets (e.g., meteorological data, satellite images)
      • Machine learning datasets
      • Biological data (e.g., cryo-electron microscopy data)
    • Relationship with the asciitree package:
      • zarr uses asciitree to render the hierarchical structure of its groups and arrays as a tree in the terminal
      • asciitree is only a display dependency, but it can occasionally be troublesome to install or use
    • A minimal zarr usage sketch follows this list.
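To make the chunked-storage idea concrete, here is a minimal, self-contained zarr sketch; the path, shapes, and chunk sizes are illustrative choices, not values from this kernel:

# Sketch: create a chunked, compressed on-disk array; only the chunks
# actually touched are loaded into memory.
import numpy as np
import zarr

z = zarr.open("example.zarr", mode="w", shape=(1000, 1000, 1000),
              chunks=(100, 100, 100), dtype="f4")
z[0:100, 0:100, 0:100] = np.random.rand(100, 100, 100)  # writes a single chunk

# Read back a small slice without loading the whole array.
block = z[0:50, 0:50, 0:50]
print(block.shape)  # (50, 50, 50)
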
# Make a copick project
import os
import shutil

# Define configuration for protein structures and project settings
config_blob = """{
   "name": "czii_cryoet_mlchallenge_2024",
   "description": "2024 CZII CryoET ML Challenge training data.",
   "version": "1.0.0",

   "pickable_objects": [
       {
           "name": "apo-ferritin",
           "is_particle": true,
           "pdb_id": "4V1W",
           "label": 1,
           "color": [0, 117, 220, 128],
           "radius": 60,
           "map_threshold": 0.0418
       },
       {
           "name": "beta-amylase",
           "is_particle": true,
           "pdb_id": "1FA2", 
           "label": 2,
           "color": [153, 63, 0, 128],
           "radius": 65,
           "map_threshold": 0.035
       },
       {
           "name": "beta-galactosidase",
           "is_particle": true,
           "pdb_id": "6X1Q",
           "label": 3,
           "color": [76, 0, 92, 128],
           "radius": 90,
           "map_threshold": 0.0578
       },
       {
           "name": "ribosome",
           "is_particle": true,
           "pdb_id": "6EK0",
           "label": 4,
           "color": [0, 92, 49, 128],
           "radius": 150,
           "map_threshold": 0.0374
       },
       {
           "name": "thyroglobulin",
           "is_particle": true,
           "pdb_id": "6SCJ",
           "label": 5,
           "color": [43, 206, 72, 128],
           "radius": 130,
           "map_threshold": 0.0278
       },
       {
           "name": "virus-like-particle",
           "is_particle": true,
           "pdb_id": "6N4V",            
           "label": 6,
           "color": [255, 204, 153, 128],
           "radius": 135,
           "map_threshold": 0.201
       }
   ],

   "overlay_root": "/kaggle/working/overlay",
   "overlay_fs_args": {
       "auto_mkdir": true
   },
   "static_root": "/kaggle/input/czii-cryo-et-object-identification/train/static"
}"""

# Define paths
copick_config_path = "/kaggle/working/copick.config"
output_overlay = "/kaggle/working/overlay"

# Write configuration file
with open(copick_config_path, "w") as f:
   f.write(config_blob)
   
# Update the overlay
# Define source and destination directories
source_dir = '/kaggle/input/czii-cryo-et-object-identification/train/overlay'
destination_dir = '/kaggle/working/overlay'

# Walk through the source directory
for root, dirs, files in os.walk(source_dir):
   # Create corresponding subdirectories in the destination
   relative_path = os.path.relpath(root, source_dir)
   target_dir = os.path.join(destination_dir, relative_path)
   os.makedirs(target_dir, exist_ok=True)
   
   # Copy and rename each file
   for file in files:
       # Add prefix 'curation_0_' if not already present
       if file.startswith("curation_0_"):
           new_filename = file
       else:
           new_filename = f"curation_0_{file}"
       
       # Define full paths for the source and destination files
       source_file = os.path.join(root, file)
       destination_file = os.path.join(target_dir, new_filename)
       
       # Copy the file with the new name
       shutil.copy2(source_file, destination_file)
       print(f"Copied {source_file} to {destination_file}")
  • This code sets up a copick project for the competition.
  • shutil:
    • shutil is a Python standard library - it stands for "shell utility"
    • It provides high-level file operations such as copying, moving, and removing files and file collections
  • The config_blob = """...""" part
    • Contains information about 6 protein structures:
      • apo-ferritin: Iron storage protein
      • beta-amylase: Enzyme protein
      • beta-galactosidase: Sugar breakdown enzyme
      • ribosome: Protein synthesis structure
      • thyroglobulin: Thyroid hormone precursor
      • virus-like-particle: Virus-like particle
    • Attributes defined for each structure:
      • name: Structure name
      • is_particle: Particle status
      • pdb_id: Protein Data Bank ID
      • label: Classification label (1-6)
      • color: RGBA color value ([R,G,B,A])
      • radius: Particle radius
      • map_threshold: Mapping threshold
    • "overlay_root": "/kaggle/working/overlay"
      • Specifies the root directory where generated data (overlays) will be stored
      • Represents the working directory for use in Kaggle environment
      • /kaggle/working/ is a writable directory in Kaggle notebooks
    • "overlay_fs_args": {
          "auto_mkdir": true
      }

      • Sets file system related arguments
      • auto_mkdir: true means it will automatically create directories if they don't exist
      • Creates necessary paths automatically when saving files or data
    • "static_root": "/kaggle/input/czii-cryo-et-object-identification/train/static"
      • Specifies the path where original or unchanging static data is stored
      • Path to input data for the Kaggle competition
      • /kaggle/input/ is the read-only data directory provided by Kaggle
    • Within the Kaggle environment, these settings define:
      • Where to read data from
      • Where to store processed results
      • How to manage the file system
    • is_particle (particle status):
      • Set to true in the data
      • Indicates whether the object should be treated as an independent particle
      • true means this structure is an individually identifiable, separate particle
      • This affects how the object is handled during image processing and analysis
    • pdb_id (Protein Data Bank ID):
      • Unique identifier like "6N4V"
      • PDB (Protein Data Bank) is a global database storing 3D structural information of proteins and nucleic acids
      • This ID allows access to detailed structural information of the molecule
      • For example, "6N4V" for virus-like-particle is a unique identifier storing atomic-level details of this structure
      • Detailed information can be viewed by searching this ID on the PDB website (rcsb.org)
    • radius:
      • Typically measured in Angstroms (Å) or nanometers (nm)
      • Reflects the actual physical size of virus-like particles
      • Set based on average particle size visible in electron microscope images
    • map_threshold:
      • Threshold value for identifying particles in electron density maps
      • Higher values mean stricter particle identification criteria
      • 0.201 is significantly higher than other particles (e.g., apo-ferritin's 0.0418, beta-amylase's 0.035)
      • This might be because virus-like particles show stronger contrast in electron microscope images
    • color:
      • RGBA color value ([R, G, B, A]), where the fourth channel is alpha (opacity)
      • Set for visualization purposes; it does not affect the analysis
      • The last value, 128, is the alpha channel (the midpoint of the 0-255 range, i.e., half-transparent)
  • File system setup:
    • copick_config_path = "/kaggle/working/copick.config"
      output_overlay = "/kaggle/working/overlay"
      • Specifies paths for configuration file and output directory
      • Set up for Kaggle environment
  • The for-loop part:
    • for root, dirs, files in os.walk(source_dir):
      • Uses os.walk to traverse all files and subdirectories in the source directory
      • Creates an identical directory structure at the destination
    • for file in files:
      • if/else clause:
        • Adds the "curation_0_" prefix to every file name
        • Leaves files that already have the prefix unchanged
      • shutil.copy2(source_file, destination_file):
        • Uses shutil.copy2 to copy each file
        • Also copies metadata (creation time, modification time, etc.)
  • Overall, this prepares and structures the dataset needed for training machine learning models, specifically for identifying and classifying the protein structures captured by cryo-electron tomography. (A small JSON-inspection sketch follows.)
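As a quick illustration, the particle definitions inside config_blob can be inspected with nothing but the standard json module; this sketch is independent of copick itself:

# Sketch: parse config_blob and list the six particle definitions.
import json

config = json.loads(config_blob)
for obj in config["pickable_objects"]:
    print(f"label={obj['label']}  name={obj['name']:<22}  radius={obj['radius']}")
# Expected output includes, e.g., label=1  name=apo-ferritin  radius=60
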
import os
import numpy as np
from pathlib import Path
import torch
import torchinfo
import zarr, copick
from tqdm import tqdm
from monai.data import DataLoader, Dataset, CacheDataset, decollate_batch
from monai.transforms import (
    Compose, 
    EnsureChannelFirstd, 
    Orientationd,  
    AsDiscrete,  
    RandFlipd, 
    RandRotate90d, 
    NormalizeIntensityd,
    RandCropByLabelClassesd,
)
from monai.networks.nets import UNet
from monai.losses import DiceLoss, FocalLoss, TverskyLoss
from monai.metrics import DiceMetric, ConfusionMatrixMetric
import mlflow
import mlflow.pytorch

Preparing the dataset

1. Get copick root

root = copick.from_file(copick_config_path)

copick_user_name = "copickUtils"
copick_segmentation_name = "paintedPicks"
voxel_size = 10
tomo_type = "denoised"
  • Initializing the basic configuration of the copick project
  • root = copick.from_file(copick_config_path)
    • Initializes a copick object by reading the configuration file from copick_config_path
    • Loads settings including protein structure information and paths into this object
  • copick_user_name = "copickUtils"
    • Sets an identifier for the user/tool performing the work
    • Used to track and distinguish results
  • copick_segmentation_name = "paintedPicks"
    • Specifies the name for segmentation (image region distinction) results
    • Results will be saved and referenced using this name
  • voxel_size = 10
    • Sets the voxel size (here 10, i.e., 10 Å per voxel in this dataset) that defines the resolution of the 3D volumes
    • A voxel is the basic unit of a 3D image, analogous to a pixel in a 2D image
  • tomo_type = "denoised"
    • Specifies which type of tomogram (3D volume) data to use
    • "denoised" selects reconstructions that have had noise removed
    • Noise removal improves image quality and makes analysis easier (a short sanity-check sketch follows this list)
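As a sanity check that the project loaded correctly, the root object can be inspected directly; this sketch only uses attributes this kernel itself relies on later (root.runs, run.name, root.pickable_objects):

# Sketch: list the runs and pickable objects exposed by the copick root.
for run in root.runs:
    print("run:", run.name)  # e.g. "TS_86_3"
for obj in root.pickable_objects:
    print("object:", obj.name, "label:", obj.label, "radius:", obj.radius)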

2. Generate multi-class segmentation masks from picks and save them to the copick overlay directory (one-time)

# Import segmentation-related utilities
from copick_utils.segmentation import segmentation_from_picks
import copick_utils.writers.write as write
from collections import defaultdict

# Just do this once
generate_masks = True

if generate_masks:
    # Stores label and radius information for each particle in a dictionary
    # Only processes objects where is_particle is true
    target_objects = defaultdict(dict)
    for object in root.pickable_objects:
        if object.is_particle:
            target_objects[object.name]['label'] = object.label
            target_objects[object.name]['radius'] = object.radius

    # Process Tomograms and Create Masks
    for run in tqdm(root.runs):
        # Get tomogram data
        tomo = run.get_voxel_spacing(10)
        tomo = tomo.get_tomogram(tomo_type).numpy()
        
        # Create empty target array
        target = np.zeros(tomo.shape, dtype=np.uint8)
        
        # Generate Segmentation Masks
        for pickable_object in root.pickable_objects:
            pick = run.get_picks(object_name=pickable_object.name, user_id="curation")
            if len(pick):  
                target = segmentation_from_picks.from_picks(pick[0], 
                                                            target, 
                                                            target_objects[pickable_object.name]['radius'] * 0.8,
                                                            target_objects[pickable_object.name]['label']
                                                            )
        write.segmentation(run, target, copick_user_name, name=copick_segmentation_name)
  • from collections import defaultdict
    • defaultdict automatically handles default values for missing dictionary keys
  • generate_masks = True
    • Flag that controls whether to generate segmentation masks or not
    • Generating segmentation masks is a time-consuming operation
    • It only needs to be done once (this is why the comment says "Just do this once")
    • A segmentation mask is a binary or multi-class label map used to distinguish specific objects or regions in an image
    • It's used to distinguish 6 different protein structures
    • Each structure has a unique label (1-6)
    • Background is marked as 0
    • The mask is in 3D form, indicating which structure each voxel belongs to
  • for run in tqdm(root.runs):
    • Gets tomogram data for each run
    • Retrieves data at specified voxel size (10)
    • Converts to numpy array for processing
    • Creates empty array for storing segmentation masks
  • for pickable_object in root.pickable_objects:
    • For each object:
      • Gets its pick (coordinate) information
      • Creates a segmentation mask if picks exist
      • Uses 80% of the radius (* 0.8) when painting, presumably to keep neighboring spheres from overlapping
      • Uses the object's label as the voxel value (a NumPy-only sketch of this "paint a sphere per pick" idea follows this list)
  • write.segmentation(run, target, copick_user_name, name=copick_segmentation_name)
    • Saves generated segmentation masks
    • Saves with specified user name and segmentation name
  • root.runs:
    • Represents each experimental run in the dataset
    • In this code, we can see there are 7 experimental datasets:
      1. TS_86_3
      2. TS_6_6
      3. TS_6_4
      4. TS_5_4
      5. TS_73_6
      6. TS_99_9
      7. TS_69_2
    •  Each run represents one electron microscope imaging session
    • Therefore, for run in tqdm(root.runs):
      • For each experimental session (TS_*)
      • Retrieves the tomogram data from that session
      • Locates each protein
      • Generates segmentation masks
    • Each run contains:
      • Tomogram data (via run.get_voxel_spacing(...).get_tomogram(...))
      • Protein location information (run.get_picks())
      • Other metadata
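To make the mask-generation step concrete, here is a NumPy-only sketch of the "paint a sphere of the object's label around each pick" idea. This is an illustration of the concept, not copick-utils' actual implementation; the volume size, center, and radius are made up, and the sketch works directly in voxel units:

# Sketch: paint a filled sphere of a given label into a 3D volume.
import numpy as np

def paint_sphere(target, center_zyx, radius_vox, label):
    """Set all voxels within radius_vox of center_zyx to label."""
    zz, yy, xx = np.ogrid[:target.shape[0], :target.shape[1], :target.shape[2]]
    cz, cy, cx = center_zyx
    dist2 = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
    target[dist2 <= radius_vox ** 2] = label
    return target

mask = np.zeros((64, 64, 64), dtype=np.uint8)
mask = paint_sphere(mask, (32, 32, 32), radius_vox=6, label=4)  # e.g. ribosome
print(np.unique(mask))  # [0 4]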

3. Get tomograms and their segmentation-mask arrays (from picks)

data_dicts = []  # Create empty list to store data
for run in tqdm(root.runs):  # Iterate over 7 experimental datasets
    # Get tomogram data
    tomogram = run.get_voxel_spacing(voxel_size)  # Get data at resolution set to voxel_size=10
    tomogram = tomogram.get_tomogram(tomo_type)   # Get "denoised" type tomogram
    tomogram = tomogram.numpy()                    # Convert to numpy array

    # Get segmentation masks
    segmentation = run.get_segmentations(
        name=copick_segmentation_name,    # "paintedPicks"
        user_id=copick_user_name,         # "copickUtils"
        voxel_size=voxel_size,           # 10
        is_multilabel=True               # Mask distinguishing multiple classes (proteins)
    )[0].numpy()

    # Add to data dictionary
    data_dicts.append({
        "name": run.name,        # Experiment name (e.g., "TS_86_3")
        "image": tomogram,       # Tomogram data
        "label": segmentation    # Segmentation mask
    })

# Print label values from first data
print(np.unique(data_dicts[0]['label']))  # Outputs [0 1 2 3 4 5 6]
  • Collects the tomogram and segmentation mask from each experimental run
  • Results explanation:
    • [0 1 2 3 4 5 6] are all unique values in the mask:
      • 0: Background
      • 1: apo-ferritin
      • 2: beta-amylase
      • 3: beta-galactosidase
      • 4: ribosome
      • 5: thyroglobulin
      • 6: virus-like-particle
  • Each dictionary created for experimental data includes:
    • Experiment name
    • Original image (tomogram)
    • Segmentation mask (labels)
  • This prepared data can be used later for training machine learning models (a quick shape-consistency sketch follows below).
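Before anything is written to disk, a small consistency sketch (it only touches the data_dicts built above) can verify that every tomogram and its mask line up:

# Sketch: each tomogram and its mask must share the same 3D shape,
# and the mask should only contain the labels 0-6.
for d in data_dicts:
    assert d["image"].shape == d["label"].shape, d["name"]
    print(d["name"], d["image"].shape, np.unique(d["label"]))
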
# For each of the 7 experimental datasets
for i in range(7):
    # Save image (tomogram) data
    with open(f"train_image_{data_dicts[i]['name']}.npy", 'wb') as f:
        np.save(f, data_dicts[i]['image'])
    
    # Save label (segmentation mask) data    
    with open(f"train_label_{data_dicts[i]['name']}.npy", 'wb') as f:
        np.save(f, data_dicts[i]['label'])
  • Saves the previously created data to files
  • Specifically:
    • Two .npy files are created for each experiment:
      • train_image_TS_XX_X.npy: tomogram data
      • train_label_TS_XX_X.npy: segmentation mask
    • File format:
      • .npy: NumPy's binary array storage format
      • 'wb': opens the file in binary write mode
    • A small round-trip sketch follows below.
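As a round-trip check (a sketch; TS_86_3 is one of the run names listed earlier), a saved pair can be reloaded with np.load:

# Sketch: reload one saved image/label pair to confirm the round trip.
image = np.load("train_image_TS_86_3.npy")
label = np.load("train_label_TS_86_3.npy")
print(image.shape, label.shape)  # shapes should match
print(np.unique(label))          # expect [0 1 2 3 4 5 6]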

I could either watch it happen or be a part of it.
- Elon Musk -