반응형
    
    
    
  This is an annotation of code that produces datasets for YOLO solution with additional data(synthetic data)
- Trained weights: https://www.kaggle.com/datasets/sersasj/czii-yolo-l-trained-with-synthetic-data/data
- Code: https://www.kaggle.com/code/sersasj/czii-making-datasets-for-yolo-synthetic-data#CZII:-Creating-Datasets-for-YOLO-with-Additional-Data
CZII making datasets for YOLO + synthetic data
- Modified version of https://www.kaggle.com/code/itsuki9180/czii-making-datasets-for-yolo
- Basically generates datasets with additional synthetic data, denoised using Gaussian denoising
- Quite controversy whether models trained with synthetic data perform better or not (https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/555247)
- However, discovered true for YOLO
- "If someone manages to incorporate the original denoise model or IsoNet, I’m sure that better results could be achieved."
- Model was trained with TS_5_4, TS_69_2, TS_6_4, TS_6_6 as validation.
1) Install + Import
!pip install zarr opencv-python
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zarr
from tqdm import tqdm
import glob, os
import cv2
import shutil2) Load + Organize data
runs = sorted(glob.glob('/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/*'))
print(runs)
runs = [os.path.basename(x) for x in runs]
# Processing additional dataset
additional_runs = sorted(glob.glob('/kaggle/input/czii10441/10441/T*'))
print(additional_runs)
additional_runs = [os.path.basename(x) for x in additional_runs]
runs = runs + additional_runs
# Creating mapping dictionaries
i2r_dict = {i: r for i, r in zip(range(len(runs)), runs)}
r2t_dict = {r: i for i, r in zip(range(len(runs)), runs)}
print("Runs:", i2r_dict)- runs = sorted(glob.glob('/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/*'))
- Collects paths from the base training dataset
- Uses glob.glob() to get all experiment paths
- Uses sorted() to arrange the paths
 
- runs = [os.path.basename(x) for x in runs]
- Extracting experiment names from paths
- Uses os.path.basename() to extract just the experiment names from full paths
- Processes all paths using list comprehension
 
- Processing additional data part
- Processes paths from additional dataset (czii10441) in the same way
- Merges additional dataset with existing experiment list
 
- Creating mapping dictionaries part
- i2r_dict: Maps indices to experiment names (index to run)
- r2t_dict: Maps experiment names to indices (run to index)
- Used as lookup tables for later data processing or reference
 
3) Helper function - Normalize function
# Normalize the image to a value between 0 and 255
def convert_to_8bit(x):
    # 1. Calculate percentiles for outlier removal
    lower, upper = np.percentile(x, (0.5, 99.5))
    
    # 2. Remove extreme values (clipping)
    x = np.clip(x, lower, upper)
    
    # 3. Convert to 0-255 range using Min-max normalization
    x = (x - x.min()) / (x.max() - x.min() + 1e-12) * 255
    
    # 4. Convert to 8-bit integer
    return x.round().astype("uint8")- Normalizes image data to 8-bit format (0-255 range)
- Crucial preprocessing step in CryoET image processing
- Clipping: Reduces the impact of noise and extreme values
- Min-max normalization:
- 1e-12: Small value added to prevent division by zero
- * 255: Scales 0-1 range to 0-255 range
 
4) Information about labels
p2i_dict = {
    'apo-ferritin': 0,
    'beta-amylase': 1,
    'beta-galactosidase': 2,
    'ribosome': 3,
    'thyroglobulin': 4,
    'virus-like-particle': 5
}
i2p = {v: k for k, v in p2i_dict.items()}
particle_radius = {
    'apo-ferritin': 60,
    'beta-amylase': 65,
    'beta-galactosidase': 90,
    'ribosome': 150,
    'thyroglobulin': 130,
    'virus-like-particle': 135,
}
particle_names = ['apo-ferritin', 'beta-amylase', 'beta-galactosidase', 'ribosome', 'thyroglobulin', 'virus-like-particle']from scipy.ndimage import gaussian_filter, median_filter
def denoise_tomogram(tomogram, method='gaussian', **kwargs):
    """
    Apply denoising to a tomogram.
    Parameters:
        tomogram (np.ndarray): The input tomogram to denoise.
        method (str): The denoising method ('gaussian' or 'median').
        kwargs: Parameters for the respective method.
    
    Returns:
        np.ndarray: The denoised tomogram.
    """
    if method == 'gaussian':
        return gaussian_filter(tomogram, sigma=kwargs.get('sigma', 1))
    elif method == 'median':
        return median_filter(tomogram, size=kwargs.get('size', 3))
    else:
        raise ValueError(f"Unsupported denoising method: {method}")- Removes noise using Gaussian or median filter
- Filter parameters can be flexibly adjusted via kwargs
name_map = {
    'apo-ferritin': 'ferritin_complex',
    'beta-amylase': 'beta_amylase',
    'beta-galactosidase': 'beta_galactosidase',
    'ribosome': 'cytosolic_ribosome',
    'thyroglobulin': 'thyroglobulin',
    'virus-like-particle': 'pp7_vlp',
}def ndjson_to_json(ndjson_path):
    # Check if file exists
    if not os.path.isfile(ndjson_path):
        raise FileNotFoundError(f"The file {ndjson_path} does not exist.")
    data = []
    # Parse each line as JSON object
    try:
        with open(ndjson_path, 'r', encoding='utf-8') as ndjson_file:
            for line_number, line in enumerate(ndjson_file, start=1):
                stripped_line = line.strip()
                if stripped_line:  
                    try:
                        json_object = json.loads(stripped_line)
                        data.append(json_object)
                    except json.JSONDecodeError as e:
                        raise json.JSONDecodeError(
                            f"Error decoding JSON on line {line_number}: {e.msg}",
                            e.doc,
                            e.pos
                        )
    except Exception as e:
        raise e
    return data- Parses NDJSON (Newline Delimited JSON) files
- Converts each line to individual JSON objects
- Includes error handling and line number tracking
import os
import glob
import json
import pandas as pd
import numpy as np
import zarr
import cv2
from tqdm import tqdm
# Takes experiment name, train/validation flag, synthetic data flag as input
def make_annotate_yolo(run_name, is_train_path=True, is_syntetic=False):
    dataset_split = 'train' if is_train_path else 'val'
    
    # Loading and preprocessing volume data
    # Setting the path to the denoised volume(data)
    if is_syntetic:
        vol_path = glob.glob(f'/kaggle/input/czii10441/10441/{run_name}/**/Tomograms/**/*.zarr', recursive=True)
        if not vol_path:
            print(f"No volume found for run {run_name} in synthetic data.")
            return
        vol_path = vol_path[0]
    else:
        vol_path = f'/kaggle/input/czii-cryo-et-object-identification/train/static/ExperimentRuns/{run_name}/VoxelSpacing10.000/denoised.zarr'
    
    print(f"Volume path: {vol_path}")
    if not os.path.exists(vol_path):
        print(f"Volume file not found: {vol_path}")
        return
    # Read the volume
    vol = zarr.open(vol_path, mode='r') # loads volume data in zarr format
    vol = vol[0]
    if is_syntetic:
        vol = denoise_tomogram(np.array(vol)[:184], method='gaussian', sigma=1)  # Apply denoise for synthetic data
    vol2 = convert_to_8bit(vol) # into 8-bit format
    
    n_imgs = vol2.shape[0]
    print(n_imgs)
    
    # Image generation - CONVERT 3D Volume data into 2D Images that YOLO can process
    for j in range(n_imgs):
        # 1. Extract current slice
        newvol = vol2[j]
        
        # 2. Convert grayscale to RGB
        newvolf = np.stack([newvol]*3, axis=-1)
        
        # 3. Resize to YOLO input size
        newvolf = cv2.resize(newvolf, (640, 640))
        
        # 4. Save image
        image_filename = f'images/{dataset_split}/{run_name}_{j*10}.png'
        cv2.imwrite(image_filename, newvolf)
        
        # 5. Create empty label file
        label_filename = f'labels/{dataset_split}/{run_name}_{j*10}.txt'
        with open(label_filename, 'w') as f:
            pass
    
    # Process each particle type (label processing)
    for p, particle in enumerate(tqdm(particle_names, desc=f"Processing particles for run {run_name}")):
        if particle == "beta-amylase":
            continue
        
        # Load JSON data for each particle
        if is_syntetic:
            particle_name_in_file = name_map.get(particle)
            if not particle_name_in_file:
                print(f"Particle name mapping not found for: {particle}")
                continue
            
            ndjson_each_particle = glob.glob(f'/kaggle/input/czii10441/10441/{run_name}/**/Annotations/**/*.ndjson', recursive=True)
            if not ndjson_each_particle:
                print(f"No NDJSON files found for particle: {particle} in run: {run_name}")
                continue
            
            filtered_ndjson_files = [f for f in ndjson_each_particle if particle_name_in_file in f]
            if not filtered_ndjson_files:
                print(f"No NDJSON files match the particle: {particle} for run: {run_name}")
                continue
            
            json_each_particle = ndjson_to_json(filtered_ndjson_files[0])
            df = pd.DataFrame(json_each_particle)
            
        # Data loading for real data
        else:
            json_each_particle = f"/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/{run_name}/Picks/{particle}.json"
            
            if not os.path.exists(json_each_particle):
                print(f"JSON file not found: {json_each_particle}")
                continue
            print(f"Loading JSON file: {json_each_particle}")
            try:
                df = pd.read_json(json_each_particle)
            except ValueError as e:
                print(f"Error reading JSON file {json_each_particle}: {e}")
                continue
                
        # Coordinate Extraction Processing 
        if is_syntetic:
            column_name = 'location'
        else:
            column_name = 'points'
        if  column_name not in df.columns:
            print(f"'{column_name}' column not found in DataFrame for particle: {particle}")
            continue
        
        if is_syntetic:
            # Flattens nested JSON data into dataframe format
            normalized_data = pd.json_normalize(df[column_name])
            # *10.012: applies pixel scaling factor(converts to actual physical size)
            df[['x', 'y', 'z']] = normalized_data * 10.012
        
        # For real data: Extracts coordinates for each axis (x, y, z)
        else:      
            for axis in ["x", "y", "z"]:
                df[axis] = df[column_name].apply(lambda x: x["location"][axis] if "location" in x and axis in x["location"] else np.nan)
                print("aquiii",df.head())
        # Missing Value Handling: Removes rows with NaN in any of x, y, z coordinates
        df.dropna(subset=["x", "y", "z"], inplace=True)
        # Get defined radius for each particle type
        radius = particle_radius.get(particle)
        if radius is None:
            print(f"Radius not defined for particle: {particle}")
            continue
        divide_by = 10.012
        
        # Convert to YOLO format
        for i, row in df.iterrows():    
            # Calculate Z-axis range (range of slices where particle is visible)
            start_z = np.round(row['z'] - radius).astype(np.int32)
            start_z = max(0, start_z//10) 
            end_z = np.round(row['z'] + radius).astype(np.int32)
            end_z = min(n_imgs, end_z//10)
            
            # Generate YOLO format labels for each slice
            for j in range(start_z, end_z):
                label_filename = f'labels/{dataset_split}/{run_name}_{j*10}.txt'
                
                # Calculate normalized coordinates
                x_center = row["x"] / divide_by / vol2.shape[1]
                y_center = row["y"] / divide_by / vol2.shape[2]
                box_width = (radius * 2) / divide_by / vol2.shape[1]
                box_height = (radius * 2) / divide_by / vol2.shape[2]
                
                # Save in YOLO format
                # format: class_id center_x center_y width height
                with open(label_filename, 'a') as f:
                    f.write(f'{p2i_dict.get(particle, 0)} {x_center:.6f} {y_center:.6f} {box_width:.6f} {box_height:.6f}\n')- Generating datasets for YOLO training
- Overall process:
- Convert 3D coordinates to 2D YOLO format
- Generate labels for all slices within particle's Z-axis range
- Normalize coordinates and box sizes to 0-1 range
- YOLO format: class_id x_center y_center width height
- This code plays a crucial role in converting 3D particle location information into 2D bounding box format that YOLO can understand.
 
- Image generation
- newvolf = np.stack([newvol]*3, axis=-1)
- [newvol]*3: Replicate the same grayscale image 3 times
- axis=-1: Stack along the last dimension (creating R,G,B channels)
- Result: (height, width) -> (height, width, 3)
 
- Image resizing:
- 640x640 is YOLOv5's default input size
 
 
- newvolf = np.stack([newvol]*3, axis=-1)
- Label Processing
- Labels here refer to annotation information used for training YOLO object detection models
- Exclude beta-amylase (excluded from competition evaluation)
- Data Loading - For Synthetic Data
- Synthetic data stored in NDJSON format
- Filter by matching particle type in filename
- Convert NDJSON to DataFrame
 
-  Data Loading - For Real Data
- Real data stored in JSON format
- Direct JSON file loading
 
- Coordinate Extraction Processing
- Different coordinate extraction methods for synthetic/real data
- Normalize and scale coordinate values
 
 
5) Prepare folders
os.makedirs("images/train", exist_ok=True)
os.makedirs("images/val", exist_ok=True)
os.makedirs("labels/train", exist_ok=True)
os.makedirs("labels/val", exist_ok=True)- exist_ok=True: No error if directories already exist
6) Create Dataset
validation_indices = [0, 1, 2, 3]  # TS_5_4, TS_69_2 TS_6_4 TS_6_6
#runs = runs[:7] 
    
for i, r in enumerate(runs):
    # Determine if training or validation
    is_train_path = i not in validation_indices
    
    # Determine if synthetic data (after index 7 is synthetic)
    is_syntetic = i > 7
    
    print(f"Processing Run {i}: {r}, Is Train: {is_train_path}")
    
    # Call dataset generation function
    make_annotate_yolo(r, is_train_path=is_train_path, is_syntetic=is_syntetic)- Generates the dataset by splitting it into training and validation sets
images_train_dir = "images/train"
labels_train_dir = "labels/train"7) Organize Dataset Folder Structure
# Create top-level dataset directory
os.makedirs('datasets/czii_det2d', exist_ok=True)
# Move image and label files to new locations
shutil.move('images/train', 'datasets/czii_det2d/images/train')
shutil.move('images/val', 'datasets/czii_det2d/images/val')
shutil.move('labels/train', 'datasets/czii_det2d/labels/train')
shutil.move('labels/val', 'datasets/czii_det2d/labels/val')- Reorganizes the generated training data into final directory structure expected by YOLO
- Final dir structure:
- datasets/
 └── czii_det2d/
 ├── images/
 │ ├── train/ # Training images
 │ └── val/ # Validation images
 └── labels/
 ├── train/ # Training labels
 └── val/ # Validation labels
 
- datasets/
8) Create Configuration File for YOLO
config_content = """
path: /kaggle/input/czii-making-datasets-for-yolo/datasets/czii_det2d  # Dataset root path
train: images/train  # Training images path (relative to path)
val: images/val      # Validation images path (relative to path)
# Classes
names:               # Class (particle type) definitions
  0: apo-ferritin
  1: beta-amylase
  2: beta-galactosidase
  3: ribosome
  4: thyroglobulin
  5: virus-like-particle
"""
# Create YAML file
with open("czii_conf.yaml", "w") as f:
    f.write(config_content.strip())
- Generates a configuration file (YAML) for YOLO model training
In order to make the impossible possible, you need to change the rules.
- Elon Musk -
반응형