[Kaggle Study] #15 2017 Kaggle Machine Learning & Data Science Survey

dongsunseng 2024. 12. 5. 00:57

Fourteenth (and last) course following Youhan Lee's curriculum. Not a competition.

First Kernel: Novice to Grandmaster

  • The biggest problem we might face is fake or bogus responses. 
  • Since it is a survey, not everyone answers with proper credentials, so I assume there will be a lot of outliers; one way to flag them is sketched below. 
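
For reference, a minimal sketch of one way to flag implausible answers, assuming the 2017 survey file multipleChoiceResponses.csv and its CompensationAmount column (stored as strings with thousands separators); the IQR rule here is my own choice, not the kernel's:

import pandas as pd

df = pd.read_csv('multipleChoiceResponses.csv', encoding='ISO-8859-1', low_memory=False)
# Compensation is stored as text with commas; coerce it to numbers
comp = pd.to_numeric(df['CompensationAmount'].astype(str).str.replace(',', ''),
                     errors='coerce').dropna()
# Keep values within 1.5 * IQR of the quartiles; the rest are candidate outliers
q1, q3 = comp.quantile([0.25, 0.75])
iqr = q3 - q1
plausible = comp[(comp >= q1 - 1.5 * iqr) & (comp <= q3 + 1.5 * iqr)]
print(f"dropped {len(comp) - len(plausible)} outlier responses")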

Second Kernel: What do Kagglers say about Data Science ?

  • An EDA kernel that also attempts some prediction with modeling techniques.

Insight / Summary:

1. Dimensionality reduction and 2D-plotting

  • The best-known and most widely used dimensionality reduction technique is PCA. 
  • The problem with PCA is that it works best for numerical / continuous variables, which is not the case here.
  • A similar technique, Multiple Correspondence Analysis (MCA), achieves dimensionality reduction for categorical data.
  • Simply put, it's a technique that uses chi-squared independence tests to create a distance between row points, which is then stored in a matrix.
  • Each eigenvalue of this matrix has an inertia (similar to explained variance in PCA), and the process for obtaining the 2D visualization is the same.
### Not working on Kaggle servers (no module 'prince') ###
# import numpy as np
# import prince
#
# np.random.seed(42)
# # Fit a 2-component MCA with Benzécri-corrected inertia rates
# mca = prince.MCA(data_viz, n_components=2, use_benzecri_rates=True)
# # 2D scatter of the row points, colored by compensation
# mca.plot_rows(show_points=True, show_labels=False,
#               color_by='CompensationAmount', ellipse_fill=True)
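
Since prince is unavailable there, here is a rough workaround sketch, assuming data_viz is a DataFrame of categorical columns: one-hot encode it and run truncated SVD on the indicator matrix, which approximates MCA up to its chi-squared weighting (the scikit-learn route is my substitution, not the kernel's code):

import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

np.random.seed(42)
# Expand each categorical column into 0/1 indicator columns
indicator = pd.get_dummies(data_viz.astype(str))
# Project the rows onto the first two components, as MCA would
svd = TruncatedSVD(n_components=2, random_state=42)
coords = svd.fit_transform(indicator)
# Rough analogue of the per-component inertia
print(svd.explained_variance_ratio_)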

Third Kernel: PLOTLY TUTORIAL - 1

  • A plotting tutorial that analyzes the survey responses with Plotly charts.
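
To give a flavor of what the kernel does, a minimal offline Plotly sketch in the style of 2017-era kernels; the file name, encoding, and Country column come from the survey dataset, while the specific chart is my own example:

import pandas as pd
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)
df = pd.read_csv('multipleChoiceResponses.csv', encoding='ISO-8859-1', low_memory=False)

# Horizontal bar chart of the 15 most common respondent countries
counts = df['Country'].value_counts().head(15)
trace = go.Bar(x=counts.values, y=counts.index, orientation='h')
layout = go.Layout(title='Top 15 respondent countries',
                   yaxis=dict(autorange='reversed'))
iplot(go.Figure(data=[trace], layout=layout))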

The first step is to establish that something is possible; then probability will occur.
- Elon Musk -