I have gone through many iterations of what my preferred scikit-learn custom pipeline looks like. As of 6/2020, here is my latest iteration.
In general, a machine learning pipeline should have the following characteristics:
- Include every step shared between training and scoring to ensure consistency. The pipeline does not need to include one-off steps such as removing duplicates which would not be relevant at scoring time.
- Have as few custom components as necessary. For example, when filling missing values in numerical columns with median value, there is no reason NOT to use SimpleImputer class from sklearn.impute. In my example gist below I included a custom FeatureSelector class just to illustrate how to defined a custom pipeline step. In reality, an already defined scikit-learn class with the same functionality would be used.
- Allow for parallel preprocessing subject to the computing environment limits. For example, the preprocessing of categorical and numerical features can take place in parallel because the transformation steps do not affect each other.
- (optional) The pipeline should be as small as possible when serialized to enable a serverless deployment. Serverless frameworks such as AWS Lambda limit the zipped deployment size of all code. As of 6/2020 Lambda limits the deployment size to 250MB. This is enough for certain combinations of code such as scikit-learn and XGBoost with certain optimization.
- FeatureSelector is defined to illustrate what a custom pipeline step looks like. In this case, the step simply selects a subset of columns to which the transformation defined in the next step of a scikit-learn Pipeline class is applied.
- Numerical features are transformed using a scikit-learn Pipeline class. First, the appropriate columns are selected using the above FeatureSelector class. Then all numerical columns have their missing values filled using each column’s median value. Lastly, all numerical columns are scaled.
- Categorical columns are similarly transformed. OneHotEncoder is applied transforming columns containing categorical values. Importantly, I like to define the categories argument to prevent the Curse of dimensionality that might occur when too many categories are present.
Also published on adamnovotny.com