Getting Scikit-Learn To Run On Top Of Pandas

Ami Tavory

Ami is a data scientist at Facebook Research's Core Data Science group. He previously worked as a machine learning researcher in the fields of bioinformatics and algorithmic trading. In 2010 he received a Ph.D. in Electrical Engineering from Tel Aviv University, in the field of financial information theory. His bachelor's and master's degrees are also from Tel Aviv University.

Ami uses Python and C++ for data analysis. He has contributed to various open source projects, and is the author of a libstdc++ extension shipped with g++ (pb_ds: policy-based data structures).

Abstract

Tags: code-introspection scikit-learn pandas data-science python machine learning

Scikit-Learn is built directly on NumPy, Python's numerical array library. Pandas adds metadata and higher-level munging capabilities on top of NumPy. This talk describes how to intelligently auto-wrap Scikit-Learn to create a version that can leverage Pandas's added features.

Description

Scikit-Learn is the de facto standard Python library for general-purpose machine learning. It operates over NumPy, an efficient but low-level homogeneous array library. Pandas adds metadata, heterogeneity, and higher-level munging capabilities on top of NumPy.
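As a small illustration of the homogeneity point, the following sketch assumes NumPy and Pandas are installed; the column and row labels are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype: mixing ints and floats upcasts everything.
arr = np.array([[1, 0.5], [2, 1.5]])

# A DataFrame keeps one dtype per column, plus row and column labels (metadata).
df = pd.DataFrame(
    {"count": [1, 2], "ratio": [0.5, 1.5]},
    index=["first", "second"],
)

print(arr.dtype)
print(df.dtypes)
print(df.loc["second", "ratio"])  # access by label rather than by position
```

The array stores every value as a float, while the DataFrame keeps the integer column as integers and lets each value be addressed by its labels.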

In the field of visualization, newer-generation libraries such as Seaborn and Bokeh provide safer, more readable, and higher-level functionality by operating over Pandas data structures. Some of these are implemented using Matplotlib, a lower-level NumPy-based plotting library.

This talk describes a library that provides a Pandas-based version of Scikit-Learn. Here, too, giving a machine-learning library a Pandas interface yields code that is safer to use, more readable, and directly integrated with Pandas's higher-level munging capabilities.
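One way the safety benefit could look in practice is sketched below. This is not the API of the library described in the talk: `FrameChecked` is a hypothetical wrapper, and `ArrayLeastSquares` is a toy NumPy-only estimator standing in for a Scikit-Learn regressor so that the example needs only NumPy and Pandas.

```python
import numpy as np
import pandas as pd


class ArrayLeastSquares:
    """Toy NumPy-only estimator, standing in for a Scikit-Learn regressor."""

    def fit(self, X, y):
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.coef_


class FrameChecked:
    """Hypothetical wrapper that remembers column labels at fit time."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.columns_ = list(X.columns)
        self.estimator.fit(X.to_numpy(), y.to_numpy())
        return self

    def predict(self, X):
        # Refuse data whose columns differ from those seen during fitting.
        if list(X.columns) != self.columns_:
            raise ValueError(
                f"expected columns {self.columns_}, got {list(X.columns)}"
            )
        # Keep the row index so predictions stay aligned with the input.
        return pd.Series(self.estimator.predict(X.to_numpy()), index=X.index)


X = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [0.0, 1.0, 0.0]},
    index=["r1", "r2", "r3"],
)
y = pd.Series([1.0, 4.0, 3.0], index=X.index)

model = FrameChecked(ArrayLeastSquares()).fit(X, y)
predictions = model.predict(X)

try:
    model.predict(X[["b", "a"]])  # same data, columns swapped
    swapped_rejected = False
except ValueError:
    swapped_rejected = True
```

With plain arrays, the swapped columns would be accepted silently and produce wrong predictions; with labels available, the mismatch can be detected, and the predictions carry the same row index as the input.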

Because Scikit-Learn's codebase is large and constantly evolving, wrapping it manually is infeasible. Except for a small number of intentional deviations from Scikit-Learn, the library wraps Scikit-Learn modules lazily through module and class introspection and dynamic module loading.
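The general technique can be sketched as follows. This is a minimal illustration under stated assumptions, not the talk's implementation: `LazyWrappedModule` and `wrap_class` are hypothetical names, and the standard library's `collections` module stands in for a Scikit-Learn module so that the example runs without third-party packages.

```python
import importlib
import inspect
import types


def wrap_class(cls):
    """Hypothetical decoration step: subclass `cls` and tag it as wrapped."""
    return type(cls.__name__, (cls,), {"wrapped": True})


class LazyWrappedModule(types.ModuleType):
    """Imports the target module and wraps its classes on first access."""

    def __init__(self, target_name):
        super().__init__("wrapped_" + target_name)
        self._target_name = target_name
        self._target = None  # nothing is loaded until an attribute is requested

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for names not cached yet.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._target is None:
            self._target = importlib.import_module(self._target_name)
        value = getattr(self._target, name)
        if inspect.isclass(value):  # class introspection decides what to wrap
            value = wrap_class(value)
        setattr(self, name, value)  # cache, so later lookups bypass __getattr__
        return value


lazy = LazyWrappedModule("collections")
loaded_before = lazy._target is not None

Counter = lazy.Counter  # triggers the import and the wrapping
loaded_after = lazy._target is not None

print(loaded_before, loaded_after, Counter.wrapped, Counter("aab")["a"])
```

Because wrapping happens on attribute access, only the parts of the wrapped package that are actually used get imported and decorated, and new classes added upstream are picked up without any manual changes.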

Following a short review of the relevant points of Pandas and Scikit-Learn, the talk is roughly divided into two aspects:

  1. Scikit-Learn And Pandas User Perspective
    1. Safety Advantages Of Pandas-Based Estimators
    2. Using Metadata For Inter-Instance Aggregated Features And Cross-Validation
    3. Using Metadata For Advanced Meta-Algorithms: Stacking, Nested Labeled And Stratified Cross-Validation
  2. Python Developer Perspective
    1. Unique Challenges Of Scikit-Learn Introspection And Decoration
    2. Two Approaches For Wrapping Scikit-Learn Estimators
    3. Lazy Dynamic Module Loading