Scikit-learn Pipeline with Feature Engineering

  • To ensure data consistency, the pipeline should include every step (such as feature engineering) needed to train the model, score the training and testing datasets, and score real-time requests. It does not need to include one-off steps such as removing duplicates.
  • Numerical features are transformed using scikit-learn classes: SimpleImputer fills missing values and StandardScaler scales the features.
  • Categorical columns are transformed similarly: OneHotEncoder is applied to columns containing categorical values. Importantly, I like to define the categories argument explicitly to prevent the curse of dimensionality that can occur when too many categories are present.
  • An example custom feature engineering class, DailyTrendFeature, is included in the pipeline for illustration.
  • The pipeline allows for parallel preprocessing subject to the limits of the computing environment. For example, the preprocessing of categorical and numerical features can take place in parallel because the transformation steps are independent of each other. This is accomplished using scikit-learn’s FeatureUnion(n_jobs=-1, …) class that combines other pipeline steps.
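
The points above can be sketched roughly as follows. This is a minimal illustration, not the code from the complete article: the column names, the sample data, the list of known exchanges, the LinearRegression model and the body of DailyTrendFeature are all assumptions made up for the example. The select helper and handle_unknown="ignore" are also additions of this sketch; FeatureUnion passes the full data frame to every branch, so each branch has to pick its own columns.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names and categories, chosen only for this sketch.
NUMERIC_COLS = ["open", "close", "volume"]
CATEGORICAL_COLS = ["exchange"]
KNOWN_EXCHANGES = ["NYSE", "NASDAQ"]


class DailyTrendFeature(BaseEstimator, TransformerMixin):
    """Illustrative stand-in: relative move from open to close."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        trend = (X["close"] - X["open"]) / X["open"]
        return trend.to_numpy().reshape(-1, 1)


def select(columns):
    """Keep only `columns` and drop the rest (helper assumed for this sketch)."""
    return ColumnTransformer([("select", "passthrough", columns)])


def build_pipeline(n_jobs=-1):
    numeric = Pipeline([
        ("select", select(NUMERIC_COLS)),
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("select", select(CATEGORICAL_COLS)),
        # Fixing `categories` caps the number of one-hot columns;
        # values outside the list are encoded as all zeros.
        ("onehot", OneHotEncoder(categories=[KNOWN_EXCHANGES],
                                 handle_unknown="ignore")),
    ])
    # The three branches are independent, so they can run in parallel.
    features = FeatureUnion(
        [
            ("numeric", numeric),
            ("categorical", categorical),
            ("daily_trend", DailyTrendFeature()),
        ],
        n_jobs=n_jobs,
    )
    return Pipeline([("features", features), ("model", LinearRegression())])


# Made-up sample data with missing volumes and one unlisted exchange.
df = pd.DataFrame({
    "open": [10.0, 10.5, 11.0, 10.8, 11.2, 11.5],
    "close": [10.4, 10.9, 10.7, 11.1, 11.6, 11.4],
    "volume": [1000.0, np.nan, 1200.0, 900.0, np.nan, 1100.0],
    "exchange": ["NYSE", "NASDAQ", "NYSE", "OTC", "NASDAQ", "NYSE"],
})
y = np.array([0.01, -0.02, 0.03, 0.00, 0.02, -0.01])

model = build_pipeline()
model.fit(df, y)                 # the same steps run for training...
predictions = model.predict(df)  # ...and for scoring
```

Because preprocessing and the model live in one Pipeline object, a real-time request can be scored by passing a one-row data frame with the same columns to model.predict. On very small inputs the overhead of n_jobs=-1 likely outweighs any gain, so passing n_jobs=1 to build_pipeline may be the better choice there.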





Adam Novotny
