Getting Scikit-Learn To Run On Top Of Pandas
Ami is a data scientist at Facebook Research's Core Data Science group. He previously worked as a machine learning researcher in the fields of bioinformatics and algorithmic trading. In 2010 he received a Ph.D. in Electrical Engineering from Tel Aviv University, in the field of financial information theory. His bachelor's and master's degrees are also from Tel Aviv University.
Ami uses Python and C++ for data analysis. He has contributed to various open source projects and is the author of a libstdc++ extension shipped with g++ (pb_ds: policy-based data structures).
Abstract
Tags: code-introspection scikit-learn pandas data-science python machine learning
Scikit-Learn is built directly over NumPy, Python's numerical array library. Pandas adds metadata and higher-level munging capabilities to NumPy. This talk describes how to intelligently auto-wrap Scikit-Learn to create a version that can leverage Pandas's added features.
Description
Scikit-Learn is the de facto standard Python library for general-purpose machine learning. It operates over NumPy, an efficient but low-level homogeneous array library. Pandas adds metadata, heterogeneity, and higher-level munging capabilities to NumPy.
In the field of visualization, newer-generation libraries such as Seaborn and Bokeh provide safer, more readable, and higher-level functionality by operating over Pandas data structures. Some of these are implemented using Matplotlib, a lower-level NumPy-based plotting library.
This talk describes a library that provides a Pandas-based version of Scikit-Learn. Here, too, giving a machine-learning library a Pandas interface yields code that is safer to use and more readable, and allows direct integration with Pandas's higher-level munging capabilities.
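To make the safety claim concrete, here is a minimal sketch of what a Pandas-aware estimator could look like. It is not the library's actual API: `FrameRegressor` is a hypothetical, hand-written wrapper used only for illustration. It remembers the column names seen during `fit`, realigns columns by name in `predict`, and returns a `Series` that keeps the row index.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class FrameRegressor:
    """Hypothetical Pandas-aware wrapper around a Scikit-Learn regressor."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X: pd.DataFrame, y: pd.Series):
        # Refuse silently misaligned inputs: rows are matched by label.
        if not X.index.equals(y.index):
            raise ValueError("X and y are indexed differently")
        self.columns_ = list(X.columns)
        self.estimator.fit(X.to_numpy(), y.to_numpy())
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        missing = set(self.columns_) - set(X.columns)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        # Select columns by name, in the order seen during fit.
        aligned = X[self.columns_]
        values = self.estimator.predict(aligned.to_numpy())
        return pd.Series(values, index=X.index, name="prediction")


train = pd.DataFrame(
    {"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 1.0, 0.0]},
    index=list("wxyz"),
)
target = 2.0 * train["a"] + 10.0 * train["b"]

model = FrameRegressor(LinearRegression()).fit(train, target)

# Same data, columns in a different order.
shuffled = train[["b", "a"]]
predictions = model.predict(shuffled)             # realigned by column name
raw = model.estimator.predict(shuffled.to_numpy())  # positional, silently wrong
```

With plain arrays the reordered columns are consumed positionally and produce wrong predictions without any error, while the wrapper uses the column labels to restore the expected order.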
Because Scikit-Learn's codebase is large and evolving, wrapping it manually is infeasible. Except for a small number of intentional deviations from Scikit-Learn, the library wraps Scikit-Learn modules lazily through module and class introspection and dynamic module loading.
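The introspection step can be illustrated roughly as follows. This is a sketch under assumptions, not the library's implementation: `estimator_names`, `wrap_estimator`, and `_frame_aware` are hypothetical helpers. Estimator classes are discovered with `inspect`, and a wrapper class is derived on demand with `type()`, decorating `fit` and `predict` so that they accept Pandas objects.

```python
import functools
import inspect

import numpy as np
import pandas as pd
import sklearn.linear_model
from sklearn.base import BaseEstimator


def _to_numpy(value):
    """Strip Pandas metadata from a value, leaving other values untouched."""
    if isinstance(value, (pd.DataFrame, pd.Series)):
        return value.to_numpy()
    return value


def _frame_aware(method):
    """Hypothetical decorator: accept a DataFrame, return index-aligned output."""

    @functools.wraps(method)
    def wrapper(self, X, *args, **kwargs):
        if not isinstance(X, pd.DataFrame):
            # Internal calls with plain arrays pass through unchanged.
            return method(self, X, *args, **kwargs)
        args = [_to_numpy(a) for a in args]
        kwargs = {k: _to_numpy(v) for k, v in kwargs.items()}
        result = method(self, X.to_numpy(), *args, **kwargs)
        if isinstance(result, np.ndarray) and result.shape[:1] == (len(X),):
            make = pd.Series if result.ndim == 1 else pd.DataFrame
            return make(result, index=X.index)
        return result

    return wrapper


def estimator_names(module):
    """Names of the public estimator classes found in a module."""
    return [
        name
        for name, cls in inspect.getmembers(module, inspect.isclass)
        if issubclass(cls, BaseEstimator) and not name.startswith("_")
    ]


def wrap_estimator(cls):
    """Derive a subclass of cls whose fit/predict understand Pandas objects."""
    overrides = {
        name: _frame_aware(getattr(cls, name))
        for name in ("fit", "predict")
        if callable(getattr(cls, name, None))
    }
    return type(cls.__name__, (cls,), overrides)


names = estimator_names(sklearn.linear_model)
FrameLinearRegression = wrap_estimator(sklearn.linear_model.LinearRegression)

X = pd.DataFrame({"a": [0.0, 1.0, 2.0]}, index=["r1", "r2", "r3"])
y = pd.Series([1.0, 3.0, 5.0], index=X.index)
pred = FrameLinearRegression().fit(X, y).predict(X)  # a Series indexed r1..r3
```

Because the wrapper is a subclass produced at run time, nothing has to be written per estimator, which is the property that matters for a large and changing codebase.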
Following a short review of the relevant points of Pandas and Scikit-Learn, the talk is roughly divided into two aspects:
- Scikit-Learn And Pandas User Perspective
  - Safety Advantages Of Pandas-Based Estimators
  - Using Metadata For Inter-Instance Aggregated Features And Cross-Validation
  - Using Metadata For Advanced Meta-Algorithms: Stacking, Nested Labeled And Stratified Cross-Validation
- Python Developer Perspective
  - Unique Challenges Of Scikit-Learn Introspection And Decoration
  - Two Approaches For Wrapping Scikit-Learn Estimators
  - Lazy Dynamic Module Loading
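
As a rough illustration of the last outline item, lazy dynamic module loading can be sketched with the standard library alone. `LazyWrapperModule` is a hypothetical class, not part of the library described in the talk. It mirrors a target package and imports (and optionally wraps) a submodule only when that submodule is first accessed. The standard-library package `xml` stands in for the target so that the sketch runs without extra dependencies; in the setting of the talk the target would presumably be `sklearn`, and `wrap` would decorate its estimators.

```python
import importlib
import types


class LazyWrapperModule(types.ModuleType):
    """Hypothetical module object that mirrors another package lazily."""

    def __init__(self, name, target, wrap=lambda module: module):
        super().__init__(name)
        self._target = target   # name of the package being mirrored
        self._wrap = wrap       # hook applied to each submodule on first load
        self.loaded = []        # submodules imported so far, in order

    def __getattr__(self, attr):
        # Python calls __getattr__ only when normal lookup fails,
        # so this runs on the first access to each submodule.
        if attr.startswith("_"):
            raise AttributeError(attr)
        try:
            module = importlib.import_module(f"{self._target}.{attr}")
        except ImportError as exc:
            raise AttributeError(
                f"{self._target!r} has no submodule {attr!r}"
            ) from exc
        wrapped = self._wrap(module)
        setattr(self, attr, wrapped)  # cache: later accesses skip __getattr__
        self.loaded.append(attr)
        return wrapped


lazy = LazyWrapperModule("lazy_xml", "xml")
before = list(lazy.loaded)   # nothing has been imported through the wrapper yet
dom_module = lazy.dom        # triggers the import of xml.dom
again = lazy.dom             # served from the cache
```

Deferring the import keeps start-up cheap and means the wrapper package never needs a hard-coded list of the target's submodules.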