Scikit-learn Pipeline with Feature Engineering

  • To ensure data consistency, the pipeline should include every step (such as feature engineering) that is required to train on and score the training and testing datasets, and to score real-time requests. The pipeline does not need to include one-off steps such as removing duplicates.
  • Numerical features are transformed using scikit-learn classes: SimpleImputer fills missing values and StandardScaler scales the result.
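  • Categorical columns are transformed in a similar way: OneHotEncoder encodes each column containing categorical values. Importantly, I like to set the categories argument explicitly to limit the curse of dimensionality that can occur when a column has too many distinct values. Values outside the listed categories then require handle_unknown="ignore"; with the default setting the encoder raises an error when it meets them.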
  • An example custom feature engineering class, DailyTrendFeature, is included in the pipeline for illustration.
  • The pipeline allows for parallel preprocessing, subject to the limits of the computing environment. For example, categorical and numerical features can be preprocessed in parallel because their transformation steps are independent of each other. This is accomplished with scikit-learn’s FeatureUnion(n_jobs=-1, …) class, which combines the other pipeline steps.
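
The points above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article's original code: the column names (open, close, volume, exchange), the fixed category list, the ColumnSelector helper, the body of DailyTrendFeature (close minus open), and the LogisticRegression final step are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC_COLS = ["open", "close", "volume"]  # assumed column names
CATEGORICAL_COLS = ["exchange"]             # assumed column name
EXCHANGES = ["NYSE", "NASDAQ"]              # assumed fixed category list


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: keep only the listed DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.columns]


class DailyTrendFeature(TransformerMixin, BaseEstimator):
    """Illustrative custom feature: close price minus open price."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return (X["close"] - X["open"]).to_numpy().reshape(-1, 1)


def build_pipeline(n_jobs=-1):
    """Combine independent preprocessing branches, then fit a model."""
    numeric = Pipeline([
        ("select", ColumnSelector(NUMERIC_COLS)),
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("select", ColumnSelector(CATEGORICAL_COLS)),
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Fixed categories cap the number of one-hot columns; values
        # outside the list are encoded as all zeros.
        ("encode", OneHotEncoder(categories=[EXCHANGES],
                                 handle_unknown="ignore")),
    ])
    trend = Pipeline([
        ("trend", DailyTrendFeature()),
        ("impute", SimpleImputer(strategy="median")),
    ])
    # The branches are independent, so FeatureUnion may run them in parallel.
    features = FeatureUnion(
        [("numeric", numeric), ("categorical", categorical), ("trend", trend)],
        n_jobs=n_jobs,
    )
    return Pipeline([("features", features), ("model", LogisticRegression())])


# Toy data, invented for the example.
train = pd.DataFrame({
    "open": [10.0, 11.0, 12.0, 11.5, 10.5, 12.5],
    "close": [10.5, 10.8, 12.6, 11.0, 11.2, 12.1],
    "volume": [100.0, np.nan, 150.0, 120.0, 90.0, 130.0],
    "exchange": pd.Series(
        ["NYSE", "NASDAQ", np.nan, "NYSE", "LSE", "NASDAQ"], dtype=object
    ),
})
y = (train["close"] > train["open"]).astype(int)

# n_jobs=1 keeps this tiny demo serial; the default of -1 uses all cores.
pipeline = build_pipeline(n_jobs=1)
pipeline.fit(train, y)

# A single "real-time" request passes through the same fitted steps.
prediction = pipeline.predict(train.iloc[[0]])
```

Because every transformation lives inside one fitted object, the same pipeline scores the training set, the test set, and individual requests without any preprocessing being repeated by hand. On a dataset this small the cost of starting worker processes would likely outweigh any gain from n_jobs=-1, which is why the demo runs serially.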


Adam Novotny
