Scikit-learn Pipeline with Feature Engineering

  • To ensure data consistency, the pipeline should include every step (such as feature engineering) needed to train the model, score the training and testing datasets, and score real-time requests. The pipeline does not need to include one-off steps such as removing duplicates.
  • Numerical features are transformed using scikit-learn classes: SimpleImputer fills missing values and StandardScaler scales the result.
  • Categorical columns are transformed similarly: OneHotEncoder is applied to columns containing categorical values. Importantly, I like to define the categories argument explicitly to prevent the curse of dimensionality that can occur when too many categories are present.
  • An example custom feature engineering class, DailyTrendFeature, is included in the pipeline for illustration.
  • The pipeline allows for parallel preprocessing, subject to the limits of the computing environment. For example, categorical and numerical features can be preprocessed in parallel because their transformation steps are independent of each other. This is accomplished with scikit-learn’s FeatureUnion(n_jobs=-1, …) class, which combines the outputs of other pipeline steps.
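
The points above can be sketched roughly as follows. This is a minimal illustration and not the code from the complete article: the column names (open, close, volume, exchange), the body of DailyTrendFeature (close minus open), the LinearRegression estimator, and the build_pipeline helper are all assumptions made for the example. ColumnTransformer is used here only to pick the columns each branch of the FeatureUnion works on.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class DailyTrendFeature(BaseEstimator, TransformerMixin):
    """Hypothetical custom feature: end-of-day value minus start-of-day value."""

    def __init__(self, start_col="open", end_col="close"):
        self.start_col = start_col
        self.end_col = end_col

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        trend = X[self.end_col] - X[self.start_col]
        return trend.to_numpy().reshape(-1, 1)  # 2-D so it can be stacked


def build_pipeline(n_jobs=-1):
    """Assemble preprocessing and model; n_jobs=-1 mirrors the setting above."""
    numeric = ColumnTransformer(
        [("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), ["open", "close", "volume"])],
        sparse_threshold=0.0,  # always return a dense array
    )
    categorical = ColumnTransformer(
        [("cat", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
            # Explicit categories cap the number of one-hot columns;
            # anything not listed is encoded as all zeros.
            ("onehot", OneHotEncoder(categories=[["NYSE", "NASDAQ"]],
                                     handle_unknown="ignore")),
        ]), ["exchange"])],
        sparse_threshold=0.0,
    )
    features = FeatureUnion(
        [("numeric", numeric),
         ("categorical", categorical),
         ("trend", DailyTrendFeature())],
        n_jobs=n_jobs,  # independent branches can run in parallel
    )
    # LinearRegression is only a placeholder estimator for this sketch.
    return Pipeline([("features", features), ("model", LinearRegression())])


# Toy data: missing values only in "volume" and "exchange", because the
# hypothetical trend branch above does no imputation of its own.
df = pd.DataFrame({
    "open": [10.0, 11.0, 11.0, 12.0],
    "close": [10.5, 10.0, 11.5, 13.0],
    "volume": [100.0, 150.0, 120.0, np.nan],
    "exchange": ["NYSE", "NASDAQ", np.nan, "LSE"],
})
y = [1.0, 0.0, 1.0, 2.0]

# n_jobs=1 keeps this tiny example in a single process.
pipeline = build_pipeline(n_jobs=1)
pipeline.fit(df, y)
features = pipeline.named_steps["features"].transform(df)
predictions = pipeline.predict(df)
```

Because every preprocessing step lives inside the fitted pipeline object, the same pipeline.predict call can be used for the training set, the testing set, and a single real-time request passed in as a one-row DataFrame.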

Adam Novotny