
[Kaggle Extra Study] 11. Polars

dongsunseng 2024. 11. 6. 11:37

Polars?

  • The Pandas library is widely used for data analysis and processing because it is flexible and convenient.
  • However, Pandas is limited in handling large-scale data by its processing speed and memory usage.
  • To address these issues, Polars has emerged as a new data processing tool.
  • Polars is implemented in Rust and offers strong performance on large-scale data while keeping memory usage low.
  • Polars natively supports a rich set of data types, including nested types such as List and Struct, which default NumPy-backed Pandas stores only as generic object columns.
  • Polars parallelizes work across CPU cores, which speeds up processing of large-scale data.
  • Polars is also approachable for Pandas users because it offers a similar DataFrame API, although it is built around expressions and is not a drop-in replacement.

Details

  • Polars maximizes the use of a single machine's resources through parallel processing and vectorized operations. It is optimized for column-based processing and manages caches efficiently, which is why it is also described as a vectorized query engine.
  • Apache Arrow-based library
    • Following the Apache Arrow model, it lays data out in memory in a columnar structure and improves performance through vectorized operations and CPU-level optimization with SIMD (Single Instruction, Multiple Data).
      • SIMD: a form of parallel computing in which a single instruction processes multiple data elements simultaneously (an approach common in vector processors and GPUs).
    • Arrow enables zero-copy data sharing and very efficient serialization/deserialization, which reduces the cost of exchanging data between cores or processes.
    • Open source projects such as Pandas (since v2.0), Dask, and Ray have recently adopted Arrow, using PyArrow as the implementation.
    • Polars, however, uses a Rust-based implementation of Arrow internally.
    • Because it is built on Arrow, data can be exchanged as Arrow tables, which keeps a degree of compatibility with other open source tools.

reference: https://ko.wikipedia.org/wiki/SIMD

  • IO feature
    • It supports various data storage layers such as local files, cloud storage, and databases. 
    • It natively reads and writes formats including CSV, JSON, Parquet, and Avro, and can read multiple files at once using glob patterns such as '*'.
    • In practice, the read_database() function is frequently used to submit a query to a database or a query engine such as Trino and get the result back as a polars.DataFrame.
    • The scan_* functions, unlike the read_* functions, return a LazyFrame for the Lazy API, so lazy operations can be chained right away.
    • This is more efficient because the operations are optimized before any data is loaded, rather than everything being loaded into memory up front.
    • The Lazy API uses lazy evaluation: instead of executing each operation immediately, Polars builds a query plan and runs it only when the result is requested.
    • This removes unnecessary intermediate work and applies optimizations such as predicate (filter) pushdown and projection pushdown so that only the required data is processed, reducing memory consumption and computation.
    • Alongside polars.DataFrame, Polars therefore provides polars.LazyFrame, which stores only the query plan and performs the computation when the values are needed, that is, when collect() is called to materialize the result.
  • Streaming API (out-of-core processing)
    • Polars enables out-of-core processing through its streaming feature, which processes large datasets by loading and processing data in chunks from disk or network rather than loading all data into memory at once.
  • User experience
    • Polars provides SQL-like syntax and structure, so it feels familiar to anyone used to SQL operations.
    • Polars offers convenient column selection: columns can be selected by data type or by regular expression, which is especially useful for handling groups of encoded columns during data analysis and feature engineering.
    • Polars is strong at time series work: it supports several time-related data types, offers resampling and time-window grouping, and includes an as-of join that matches the closest value when an exact key does not exist.

Alternatives 

  • Alternatives include Spark, Dask, Ray, and Modin.
  • Spark
    • The learning curve is steeper than for Pandas or Polars.
    • Cost efficiency is low because it requires expensive resources.
    • It suffers from overhead and slow startup when the data is not large-scale.
  • Dask
    • Dask offers Pandas-like syntax with relatively better performance, but a Dask DataFrame is parallelized by splitting the data into many partitions, each a Pandas DataFrame, and this structure limits its performance.
    • It also shows weaknesses in memory consumption, possibly because it builds on Pandas DataFrames.

Reference

 

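Faster and lighter data processing with Polars, with a real-world adoption story | Woowa Brothers Tech Blog (original title: "Polars로 데이터 처리를 더 빠르고 가볍게 with 실무 적용기 | 우아한형제들 기술블로그")

The Delivery Time Prediction Service team uses data and AI to work on the estimated delivery times shown across services in the Baedal Minjok app (Baemin Delivery, B Mart, Baemin Store, etc.) and the time from order until delivery to the customer ...

techblog.woowahan.com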

 

 

There's no talent here. This is hard work. This is an obsession. Talent does not exist, we are all equal as human beings. You could be anyone if you put in the time.

- Conor McGregor -

 
