
[Kaggle Extra Study] 15. GBM vs. XGBoost

dongsunseng 2024. 11. 10. 14:42

In this post, I want to talk about the differences between GBM (Gradient Boosting Machine) and XGBoost.

Reference: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d

 

Reading the previous post will be helpful for your understanding:

 

[Kaggle Extra Study] 14. Tree-based Ensemble Models


GBM vs. XGBoost

  • XGBoost is essentially a Gradient Boosting algorithm optimized through parallel processing, tree pruning, built-in handling of missing values, and regularization to avoid overfitting and bias.
    • In other words, Gradient Boosting Machine suffered from slow training speed and potential overfitting, and XGBoost was created to address these problems.
  • We can fundamentally separate the differences between GBM and XGBoost into two broad categories: System Optimization and Algorithmic Updates. A minimal usage comparison follows below.
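
A minimal, hedged sketch of how the two models are used side by side (it assumes scikit-learn and the xgboost Python package are installed; the dataset and parameter values are arbitrary choices for illustration only):

```python
# Plain GBM (scikit-learn) vs. XGBoost on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Classic GBM: sequential tree building, no built-in regularization terms.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(X_tr, y_tr)

# XGBoost: same boosting idea, plus parallel threads (n_jobs),
# depth-limited trees with backward pruning, and L2 regularization.
xgb_clf = XGBClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    n_jobs=-1, reg_lambda=1.0, tree_method="hist",
)
xgb_clf.fit(X_tr, y_tr)

print("GBM accuracy:    ", accuracy_score(y_te, gbm.predict(X_te)))
print("XGBoost accuracy:", accuracy_score(y_te, xgb_clf.predict(X_te)))
```

The XGBoost-specific arguments here (n_jobs, reg_lambda, tree_method) map directly onto the System Optimization and Algorithmic Updates discussed below.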

System Optimization

  1. Parallelization(병렬 처리)
    • XGBoost supports parallel processing.
      • Sequential tree building in GBM is a nested loop: an outer loop enumerates the leaf nodes of a tree, and an inner loop calculates the candidate features, so the outer loop cannot advance until the (more expensive) inner loop finishes.
      • XGBoost makes these loops interchangeable by initializing with a global scan of all instances and sorting with parallel threads, which improves execution speed.
      • This change enhances performance by removing computational overhead.
        • "Computational overhead" refers to the extra processing time and resources required by the traditional sequential processing of the nested loops.
  2. Tree Pruning
    • Instead of using the "criterion first" approach, XGBoost sets the "max_depth" parameter.
    • This "depth-first" approach has brought significant computational benefits.
    • "The stopping criterion for tree splitting within GBM framework is greedy in nature and depends on the negative loss criterion at the point of split. XGBoost uses ‘max_depth’ parameter as specified instead of criterion first, and starts pruning trees backward."
      • Greedy approach: decides splits based on 'loss reduction'.
        • Continues splitting at each node until there's no further loss reduction.
        • This can be computationally expensive and inefficient.
      • "starts pruning trees backward" meaning:
        • First grows the tree to maximum depth specified by the 'max_depth' parameter.
        • Then performs pruning backward.
          • pruning: an important technique in decision trees to prevent overfitting and simplify the model.
            • To explain simply:
              • Just as a gardener cuts unnecessary branches to make a healthier tree, in machine learning, pruning is the process of removing unnecessary tree branches (nodes).
            • Pre-pruning
              • Pruning while the tree is growing
              • Example: Limiting with parameters like max_depth, min_samples_split
            • Post-pruning
              • Pruning after growing the tree to its maximum
              • This is the method used by XGBoost
              • Removes splits afterward that don't provide enough loss reduction (in XGBoost, splits whose gain falls below the gamma threshold are pruned away; general post-pruning can also use a validation set)
        • This is called the depth-first approach.
    • Results in:
      • More efficient memory access (allows continuous memory access).
      • Easier parallel processing (can process each depth simultaneously).
      • Pruning is done at once, increasing computational efficiency.
    • To explain with an example:
      • Regular GBM: Similar to a restaurant deciding whether to prepare food as orders come in from each table
      • XGBoost: Similar to taking all table orders first and then processing them efficiently all at once
  3. Hardware Optimization
    • XGBoost is designed to efficiently utilize hardware resources.
    • It is cache-aware: an internal buffer is allocated in each thread to store gradient statistics.
    • Advanced features like out-of-core computing optimize available disk space when processing large datasets that don't fit in memory.
    • A short parameter sketch showing how these options are exposed follows this list.
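
As a rough illustration of how the system-level optimizations above surface in practice, here is a short parameter sketch; the values are illustrative rather than recommendations, and out-of-core (external-memory) training is configured separately through XGBoost's DMatrix interface, so it is not shown here.

```python
# Parameters that expose XGBoost's system-level optimizations (illustrative values).
from xgboost import XGBClassifier

model = XGBClassifier(
    n_jobs=-1,           # parallelization: use all available CPU threads
    max_depth=6,         # depth-first growth: build to this depth, then prune backward
    gamma=0.1,           # minimum loss reduction required to keep a split (pruning)
    tree_method="hist",  # histogram-based split finding, cache- and memory-friendly
    n_estimators=300,
    learning_rate=0.05,
)
# model.fit(X_train, y_train)  # X_train / y_train are placeholders for your data
```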

Algorithmic Updates

  1. Regularization
    • XGBoost penalizes model complexity through both L1 (Lasso) and L2 (Ridge) regularization on the leaf weights, which helps prevent overfitting; standard GBM implementations have no comparable built-in penalty.
  2. Sparsity Awareness
    • XGBoost is sparsity-aware: it automatically learns the best default split direction for missing values based on the training loss, so it handles sparse patterns in various data types efficiently.
      • Sparsity: A state where data has many missing values or zeros
      • Examples: Customer purchase data, user-item matrix in recommendation systems
      • XGBoost's Smart Processing Method:
        • Doesn't simply replace missing values with 0 or mean values.
        • Automatically learns the optimal direction for missing values during training.
        • Utilizes the information contained in missing patterns themselves.
  3. Weighted Quantile Sketch
    • XGBoost uses the "distributed weighted Quantile sketch" algorithm to efficiently find optimal split points.
      • Weighted Quantile Sketch(가중 분위수 스케치) Algorithm:
        • Quantile Sketch?
          • A method to approximately calculate percentiles of large-scale data.
          • Can efficiently process data without loading the entire dataset into memory.
        • Meaning of "Weighted"
          • Assigns different importance (weights) to each data point.
          • Weights are determined based on the prediction errors from the previous trees.
          • Higher weights are given to samples that are harder to predict.
  4. Cross-validation
    • Cross-validation comes built in (via XGBoost's cv routine), so no external cross-validation loop is needed; the sketches after this list show these algorithmic features in code.
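
The sketch below ties the algorithmic features above to concrete XGBoost usage; the synthetic data, the fraction of injected missing values, and all parameter values are made up for illustration.

```python
# Regularization, native missing-value handling, and built-in cross-validation.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[rng.random(X.shape) < 0.2] = np.nan        # sparsity: ~20% missing values
y = (rng.random(1000) > 0.5).astype(int)

# Missing values can be passed as-is: XGBoost learns a default split direction
# for them during training instead of requiring imputation.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)

params = {
    "objective": "binary:logistic",
    "max_depth": 4,
    "eta": 0.1,
    "reg_alpha": 0.1,    # L1 regularization on leaf weights
    "reg_lambda": 1.0,   # L2 regularization on leaf weights
    "gamma": 0.1,        # minimum loss reduction to keep a split
    "eval_metric": "logloss",
}

# Built-in cross-validation: no external CV loop needed.
cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5,
                    early_stopping_rounds=10, seed=0)
print(cv_results.tail())
```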

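For intuition about the weighted quantile sketch specifically, here is a toy version of the underlying idea: choosing candidate split points at evenly spaced weighted quantiles of a feature. The function name and the random weights are my own illustration; the actual algorithm uses second-order gradient statistics as weights and an approximate streaming sketch rather than an exact sort.

```python
# Toy illustration of weighted-quantile split candidates (not XGBoost's real sketch).
import numpy as np

def weighted_quantile_candidates(values, weights, n_candidates=4):
    """Return candidate split points at evenly spaced weighted quantiles."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = np.cumsum(weights) / weights.sum()               # weighted CDF in (0, 1]
    targets = np.linspace(0, 1, n_candidates + 2)[1:-1]    # interior quantile levels
    idx = np.searchsorted(cdf, targets)
    return values[idx]

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)
weights = rng.random(1000)      # stand-in for the "importance" of each sample
print(weighted_quantile_candidates(feature, weights))
```
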
Reference

 

Vishal Morde, "XGBoost Algorithm: Long May She Reign!", Towards Data Science. https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d

 

 

The more you seek the uncomfortable, the more you will become comfortable.
- Conor McGregor -

 
