This post is an annotation of the training-data-preparation kernel by "fnands".
https://www.kaggle.com/code/fnands/create-numpy-dataset-exp-name
Kernel 'Create Numpy dataset exp name'
- Overall, this kernel is about PREPARING TRAINING DATA
!pip install git+https://github.com/copick/copick-utils.git matplotlib tqdm copick
!pip install -q "monai-weekly[mlflow]"
This combination of packages creates a complete environment for processing, analyzing, and applying machine learning models to cryo-electron microscopy data.
- copick-utils (`git+https://github.com/copick/copick-utils.git`)
- A utility library for processing cryo-EM (cryo-electron microscopy) data
- Installed directly from its GitHub repository
- Provides tools for processing, analyzing, and visualizing electron microscope images
- matplotlib
- Python's primary visualization library
- Used for creating and displaying graphs, charts, and images
- Essential tool for visualizing data analysis results
- tqdm
- Library that provides progress bars
- Enables real-time monitoring of long-running tasks
- Particularly useful when processing large datasets
- copick
- Main library for Cryo-EM data
- Provides functionality for image processing, data management, and analysis
- Serves as the basic framework for utilizing copick-utils features
- monai-weekly[mlflow]
- MONAI (Medical Open Network for AI) is a deep learning framework for medical images
- Built on PyTorch and specialized for medical image processing
- Key features:
- Data preprocessing and augmentation
- Neural network models for medical images
- Training and evaluation tools
- [mlflow] is an optional extra dependency:
- MLflow is a platform for tracking and managing machine learning experiments
- Records and manages experimental results, models, and parameters
- Helps compare and reproduce model performance
- The '-q' option means 'quiet' mode, which minimizes installation process output.
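For context, here is a minimal sketch of how MLflow tracking is typically used. The run name, parameter, and metric below are hypothetical examples, not values from this kernel:

import mlflow

# Hypothetical illustration of MLflow experiment tracking
with mlflow.start_run(run_name="unet-baseline"):
    mlflow.log_param("learning_rate", 1e-4)      # record a hyperparameter
    mlflow.log_metric("val_dice", 0.87, step=1)  # record a metric at a given step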
!pip install zarr
!pip install copick
- zarr:
- Format and library for storing and processing N-dimensional arrays
- Main Features:
- Chunked compression storage
- Parallel processing support
- Hierarchical organization capability
- Cloud storage compatibility
- NumPy-compatible interface
- Purpose:
- Processing large-scale scientific data
- Data sharing in distributed computing environments
- Processing datasets larger than available memory
- Advantages:
- Memory efficient: Can process data without loading entire dataset into memory
- Fast I/O performance: Efficient data access through chunk-based approach
- Flexible storage format: Supports various storage options (local disk, cloud, etc.)
- Parallel processing: Multiple processes can access data simultaneously
- Common Use Cases:
- Large scientific datasets (e.g., meteorological data, satellite images)
- Machine learning datasets
- Biological data (e.g., cryo-electron microscopy data)
- Relationship with the asciitree package:
- asciitree is used to visually represent the Zarr data structure
- Zarr's tree() method uses it to print the array hierarchy in tree form in the terminal
- asciitree is only needed for this visualization; it can occasionally be awkward to install, but Zarr itself works without it
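To make zarr's chunked access concrete, here is a minimal sketch; the file name, shape, and chunk size are arbitrary example values:

import numpy as np
import zarr

# Create a chunked on-disk array (example path and shape)
z = zarr.open("example.zarr", mode="w", shape=(1000, 1000, 1000),
              chunks=(100, 100, 100), dtype="f4")
# Writing one chunk's worth of data touches only that chunk on disk
z[0:100, 0:100, 0:100] = np.random.rand(100, 100, 100).astype("f4")
# Reading a slice loads only the chunks that intersect it,
# so the full array never has to fit in memory
middle_slice = z[500, :, :]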
# Make a copick project
import os
import shutil
# Define configuration for protein structures and project settings
config_blob = """{
"name": "czii_cryoet_mlchallenge_2024",
"description": "2024 CZII CryoET ML Challenge training data.",
"version": "1.0.0",
"pickable_objects": [
{
"name": "apo-ferritin",
"is_particle": true,
"pdb_id": "4V1W",
"label": 1,
"color": [0, 117, 220, 128],
"radius": 60,
"map_threshold": 0.0418
},
{
"name": "beta-amylase",
"is_particle": true,
"pdb_id": "1FA2",
"label": 2,
"color": [153, 63, 0, 128],
"radius": 65,
"map_threshold": 0.035
},
{
"name": "beta-galactosidase",
"is_particle": true,
"pdb_id": "6X1Q",
"label": 3,
"color": [76, 0, 92, 128],
"radius": 90,
"map_threshold": 0.0578
},
{
"name": "ribosome",
"is_particle": true,
"pdb_id": "6EK0",
"label": 4,
"color": [0, 92, 49, 128],
"radius": 150,
"map_threshold": 0.0374
},
{
"name": "thyroglobulin",
"is_particle": true,
"pdb_id": "6SCJ",
"label": 5,
"color": [43, 206, 72, 128],
"radius": 130,
"map_threshold": 0.0278
},
{
"name": "virus-like-particle",
"is_particle": true,
"pdb_id": "6N4V",
"label": 6,
"color": [255, 204, 153, 128],
"radius": 135,
"map_threshold": 0.201
}
],
"overlay_root": "/kaggle/working/overlay",
"overlay_fs_args": {
"auto_mkdir": true
},
"static_root": "/kaggle/input/czii-cryo-et-object-identification/train/static"
}"""
# Define paths
copick_config_path = "/kaggle/working/copick.config"
output_overlay = "/kaggle/working/overlay"
# Write configuration file
with open(copick_config_path, "w") as f:
    f.write(config_blob)
# Update the overlay
# Define source and destination directories
source_dir = '/kaggle/input/czii-cryo-et-object-identification/train/overlay'
destination_dir = '/kaggle/working/overlay'
# Walk through the source directory
for root, dirs, files in os.walk(source_dir):
    # Create corresponding subdirectories in the destination
    relative_path = os.path.relpath(root, source_dir)
    target_dir = os.path.join(destination_dir, relative_path)
    os.makedirs(target_dir, exist_ok=True)

    # Copy and rename each file
    for file in files:
        # Add prefix 'curation_0_' if not already present
        if file.startswith("curation_0_"):
            new_filename = file
        else:
            new_filename = f"curation_0_{file}"

        # Define full paths for the source and destination files
        source_file = os.path.join(root, file)
        destination_file = os.path.join(target_dir, new_filename)

        # Copy the file with the new name
        shutil.copy2(source_file, destination_file)
        print(f"Copied {source_file} to {destination_file}")
- This code sets up a project for the competition:
- shutil:
- shutil is a Python standard library - it stands for "shell utility"
- It provides high-level file operations such as copying, moving, and removing files and file collections
- The config_blob = """...""" part
- Contains information about 6 protein structures:
- apo-ferritin: Iron storage protein
- beta-amylase: Enzyme protein
- beta-galactosidase: Sugar breakdown enzyme
- ribosome: Protein synthesis structure
- thyroglobulin: Thyroid hormone precursor
- virus-like-particle: Virus-like particle
- Attributes defined for each structure:
- name: Structure name
- is_particle: Particle status
- pdb_id: Protein Data Bank ID
- label: Classification label (1-6)
- color: RGBA color value ([R,G,B,A])
- radius: Particle radius
- map_threshold: Mapping threshold
- "overlay_root": "/kaggle/working/overlay"
- Specifies the root directory where generated data (overlays) will be stored
- Represents the working directory for use in Kaggle environment
- /kaggle/working/ is a writable directory in Kaggle notebooks
- "overlay_fs_args": {
"auto_mkdir": true
}
- Sets file system related arguments
- auto_mkdir: true means it will automatically create directories if they don't exist
- Creates necessary paths automatically when saving files or data
- "static_root": "/kaggle/input/czii-cryo-et-object-identification/train/static"
- Specifies the path where original or unchanging static data is stored
- Path to input data for the Kaggle competition
- /kaggle/input/ is the read-only data directory provided by Kaggle
- These configurations define in the Kaggle environment:
- Where to read data from
- Where to store processed results
- How to manage the file system
- is_particle (particle status):
- Set to true in the data
- Indicates whether the object should be treated as an independent particle
- true means this structure is an individually identifiable, separate particle
- This affects how the object is handled during image processing and analysis
- pdb_id (Protein Data Bank ID):
- Unique identifier like "6N4V"
- PDB (Protein Data Bank) is a global database storing 3D structural information of proteins and nucleic acids
- This ID allows access to detailed structural information of the molecule
- For example, "6N4V" for virus-like-particle is a unique identifier storing atomic-level details of this structure
- Detailed information can be viewed by searching this ID on the PDB website (rcsb.org)
- radius:
- Typically measured in Angstroms (Å) or nanometers (nm)
- Reflects the actual physical size of virus-like particles
- Set based on average particle size visible in electron microscope images
- map_threshold:
- Threshold value for identifying particles in electron density maps
- Higher values mean stricter particle identification criteria
- 0.201 is significantly higher than other particles (e.g., apo-ferritin's 0.0418, beta-amylase's 0.035)
- This might be because virus-like particles show stronger contrast in electron microscope images
- color:
- RGBA color values (red, green, blue, alpha)
- Set for visualization purposes; doesn't affect analysis
- The last value, 128, is the alpha (transparency) channel: mid-range on the 0-255 scale, i.e., roughly 50% opacity
- File system setup:
- copick_config_path = "/kaggle/working/copick.config" and output_overlay = "/kaggle/working/overlay"
- Specifies paths for the configuration file and output directory
- Set up for the Kaggle environment
- For loop part:
- for root, dirs, files in os.walk(source_dir):
- Uses os.walk to traverse all files and subdirectories in source directory
- Creates identical directory structure at destination
- for file in files:
- if else clause:
- Adds "curation_0_" prefix to all file names
- Keeps files that already have the prefix unchanged
- shutil.copy2(source_file, destination_file):
- Uses shutil.copy2 to copy files
- Also copies metadata (creation time, modification time, etc.)
- Overall, this code prepares and structures the dataset needed for training machine learning models, specifically for identifying and classifying the protein structures captured by cryo-electron microscopy.
import os
import numpy as np
from pathlib import Path
import torch
import torchinfo
import zarr, copick
from tqdm import tqdm
from monai.data import DataLoader, Dataset, CacheDataset, decollate_batch
from monai.transforms import (
Compose,
EnsureChannelFirstd,
Orientationd,
AsDiscrete,
RandFlipd,
RandRotate90d,
NormalizeIntensityd,
RandCropByLabelClassesd,
)
from monai.networks.nets import UNet
from monai.losses import DiceLoss, FocalLoss, TverskyLoss
from monai.metrics import DiceMetric, ConfusionMatrixMetric
import mlflow
import mlflow.pytorch
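To make the transform imports concrete, here is a minimal sketch of how such dictionary transforms are typically composed; the patch size, probabilities, and sample count are illustrative values, not this kernel's actual training configuration:

from monai.transforms import (
    Compose,
    EnsureChannelFirstd,
    NormalizeIntensityd,
    RandCropByLabelClassesd,
    RandFlipd,
    RandRotate90d,
)

# Illustrative pipeline: add a channel axis, normalize intensities,
# crop patches balanced across the 7 classes (background + 6 particles),
# then apply random flips/rotations for augmentation
train_transforms = Compose([
    EnsureChannelFirstd(keys=["image", "label"], channel_dim="no_channel"),
    NormalizeIntensityd(keys="image"),
    RandCropByLabelClassesd(
        keys=["image", "label"], label_key="label",
        spatial_size=[96, 96, 96], num_classes=7, num_samples=4,
    ),
    RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
    RandRotate90d(keys=["image", "label"], prob=0.5, spatial_axes=(1, 2)),
])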
Preparing the dataset
1. Get copick root
root = copick.from_file(copick_config_path)
copick_user_name = "copickUtils"
copick_segmentation_name = "paintedPicks"
voxel_size = 10
tomo_type = "denoised"
- Initializing the basic configuration of the copick project
- root = copick.from_file(copick_config_path)
- Initializes a copick object by reading the configuration file from copick_config_path
- Loads settings including protein structure information and paths into this object
- copick_user_name = "copickUtils"
- Sets an identifier for the user/tool performing the work
- Used to track and distinguish results
- copick_segmentation_name = "paintedPicks"
- Specifies the name for segmentation (image region distinction) results
- Results will be saved and referenced using this name
- voxel_size = 10
- Sets the voxel size that defines the resolution of 3D images
- A voxel is the basic unit of 3D images, similar to pixels in 2D images
- tomo_type = "denoised"
- Specifies the type of tomogram (3D image) data to use
- "denoised" means using processed images with noise removed
- Noise removal improves image quality and facilitates analysis
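Once root is loaded, the configuration can be inspected directly, for example:

# List the pickable objects defined in the configuration
for obj in root.pickable_objects:
    print(obj.name, obj.label, obj.radius, obj.is_particle)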
2. Generate multi-class segmentation masks from picks, and save them to the copick overlay directory (one-time)
# Import segmentation-related utilities
from copick_utils.segmentation import segmentation_from_picks
import copick_utils.writers.write as write
from collections import defaultdict
# Just do this once
generate_masks = True
if generate_masks:
    # Store label and radius information for each particle in a dictionary
    # Only process objects where is_particle is true
    target_objects = defaultdict(dict)
    for object in root.pickable_objects:
        if object.is_particle:
            target_objects[object.name]['label'] = object.label
            target_objects[object.name]['radius'] = object.radius

    # Process tomograms and create masks
    for run in tqdm(root.runs):
        # Get tomogram data
        tomo = run.get_voxel_spacing(10)
        tomo = tomo.get_tomogram(tomo_type).numpy()

        # Create an empty target array
        target = np.zeros(tomo.shape, dtype=np.uint8)

        # Generate segmentation masks
        for pickable_object in root.pickable_objects:
            pick = run.get_picks(object_name=pickable_object.name, user_id="curation")
            if len(pick):
                target = segmentation_from_picks.from_picks(
                    pick[0],
                    target,
                    target_objects[pickable_object.name]['radius'] * 0.8,
                    target_objects[pickable_object.name]['label'],
                )

        write.segmentation(run, target, copick_user_name, name=copick_segmentation_name)
- from collections import defaultdict
- defaultdict automatically handles default values for missing dictionary keys
- generate_masks = True
- Flag that controls whether to generate segmentation masks or not
- Generating segmentation masks is a time-consuming operation
- It only needs to be done once (this is why the comment says "Just do this once")
- A segmentation mask is a binary or multi-class label map used to distinguish specific objects or regions in an image
- It's used to distinguish 6 different protein structures
- Each structure has a unique label (1-6)
- Background is marked as 0
- The mask is in 3D form, indicating which structure each voxel belongs to (a conceptual sketch of painting such a mask appears after this annotation list)
- for run in tqdm(root.runs):
- Gets tomogram data for each run
- Retrieves data at specified voxel size (10)
- Converts to numpy array for processing
- Creates empty array for storing segmentation masks
- for pickable_object in root.pickable_objects:
- For each object:
- Gets pick information
- Creates segmentation mask if pick exists
- Uses 80% of radius (* 0.8) for mask creation
- Uses object's label
- write.segmentation(run, target, copick_user_name, name=copick_segmentation_name)
- Saves generated segmentation masks
- Saves with specified user name and segmentation name
- root.runs:
- Represents each experimental run in the dataset
- In this code, we can see there are 7 experimental datasets:
- TS_86_3
- TS_6_6
- TS_6_4
- TS_5_4
- TS_73_6
- TS_99_9
- TS_69_2
- Each run represents one electron microscope imaging session
- Therefore, for run in tqdm(root.runs):
- For each experimental session (TS_*)
- Retrieves the tomogram data from that session
- Locates each protein
- Generates segmentation masks
- Each run contains:
- Tomogram data (via run.get_voxel_spacing(...).get_tomogram(...))
- Protein location information (run.get_picks())
- Other metadata
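Conceptually, painting a pick into the label volume amounts to filling a sphere around the pick's center. The sketch below is a hypothetical NumPy illustration of that geometry; it is not the copick-utils implementation (from_picks handles coordinate transforms and edge cases internally):

import numpy as np

def paint_sphere(target, center, radius, label):
    # Hypothetical helper: set all voxels within `radius` of `center` to `label`
    zc, yc, xc = center
    z, y, x = np.ogrid[:target.shape[0], :target.shape[1], :target.shape[2]]
    inside = (z - zc) ** 2 + (y - yc) ** 2 + (x - xc) ** 2 <= radius ** 2
    target[inside] = label
    return target

# Example: ribosome (label 4, radius 150 Angstrom) at voxel_size 10,
# scaled by the same 0.8 factor used above: 150 / 10 * 0.8 = 12 voxels
target = np.zeros((64, 64, 64), dtype=np.uint8)
target = paint_sphere(target, center=(32, 32, 32), radius=12, label=4)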
3. Get tomograms and their segmentation masks (from picks) as arrays
data_dicts = []  # Create empty list to store data

for run in tqdm(root.runs):  # Iterate over the 7 experimental runs
    # Get tomogram data
    tomogram = run.get_voxel_spacing(voxel_size)  # Get data at resolution set to voxel_size=10
    tomogram = tomogram.get_tomogram(tomo_type)   # Get "denoised" type tomogram
    tomogram = tomogram.numpy()                   # Convert to numpy array

    # Get segmentation masks
    segmentation = run.get_segmentations(
        name=copick_segmentation_name,  # "paintedPicks"
        user_id=copick_user_name,       # "copickUtils"
        voxel_size=voxel_size,          # 10
        is_multilabel=True              # Mask distinguishing multiple classes (proteins)
    )[0].numpy()

    # Add to data dictionary
    data_dicts.append({
        "name": run.name,       # Experiment name (e.g., "TS_86_3")
        "image": tomogram,      # Tomogram data
        "label": segmentation   # Segmentation mask
    })

# Print label values from the first run
print(np.unique(data_dicts[0]['label']))  # Outputs [0 1 2 3 4 5 6]
- Collects tomograms and segmentation masks from each experimental data
- Results explanation:
- [0 1 2 3 4 5 6] are all unique values in the mask:
- 0: Background
- 1: apo-ferritin
- 2: beta-amylase
- 3: beta-galactosidase
- 4: ribosome
- 5: thyroglobulin
- 6: virus-like-particle
- Each dictionary created for experimental data includes:
- Experiment name
- Original image (tomogram)
- Segmentation mask (labels)
- This prepared data can be used later for training machine learning models.
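As an aside, here is a minimal sketch of how data_dicts could later feed MONAI's data pipeline (illustrative only; this kernel itself stops at saving .npy files):

from monai.data import DataLoader, Dataset
from monai.transforms import Compose, EnsureChannelFirstd, NormalizeIntensityd

# Illustrative: wrap the collected dicts in a MONAI Dataset
transform = Compose([
    EnsureChannelFirstd(keys=["image", "label"], channel_dim="no_channel"),
    NormalizeIntensityd(keys="image"),
])
dataset = Dataset(data=data_dicts, transform=transform)
loader = DataLoader(dataset, batch_size=1)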
# For each of the 7 experimental datasets
for i in range(7):
    # Save image (tomogram) data
    with open(f"train_image_{data_dicts[i]['name']}.npy", 'wb') as f:
        np.save(f, data_dicts[i]['image'])

    # Save label (segmentation mask) data
    with open(f"train_label_{data_dicts[i]['name']}.npy", 'wb') as f:
        np.save(f, data_dicts[i]['label'])
- Saves the previously created data to files
- Specifically:
- Two .npy files are created for each experiment:
- train_image_TS_XX_X.npy: tomogram data
- train_label_TS_XX_X.npy: segmentation mask
- File format:
- .npy: NumPy's array storage format
- 'wb': open file in binary write mode
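A quick sanity check when loading the files back (using one of the run names from above):

import numpy as np

# Load one saved pair and verify shapes and label values
image = np.load("train_image_TS_86_3.npy")
label = np.load("train_label_TS_86_3.npy")
print(image.shape, label.shape, np.unique(label))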
I could either watch it happen or be a part of it.
- Elon Musk -