Google Colab and AutoML: Auto-sklearn Setup

Auto ML is fast becoming a popular solution to build minimal viable models for new projects. A popular library for Python is Auto-sklearn that leverages the most popular Python ML library scikit-learn. Auto-sklearn runs a smart search over scikit-learn models and parameters to find the best performing ensemble of models.

Logos of Google Drive + Colab + Scikit-learn + Auto-sklearn

This tutorial describes how to setup Auto-sklearn on Google Colab. The complete notebook gist includes a toy project that uses an old Airbnb dataset from Kaggle.

The key first step is to install linux dependencies alongside Auto-sklearn:

After running these commands in Colab, restart the Colab runtime and run all commands again.

The Airbnb dataset can be used for a regression project where price is the label. I selected a few numerical and categorical features randomly so the dataset used for modeling has the following characteristics:

Airbnb dataset description

A more sophisticated ML project would require a detailed feature selection process and data analysis at this stage. For example, does the maximum value of 1,250 for minimum_nights make sense? In this case, I am simply showing the Auto-sklearn setup so I will skip these time consuming steps.

Next, all numerical features are standardized and missing values filled. Scikit-learn (and therefore Auto-sklearn) cannot handle string categories so categorical features are one hot encoded. Also, infrequently appearing categories are combined into a single bucket to combat the Curse of dimensionality. In this case, any neighborhood that appears less than 0.5% of the time is renamed to “neighborhood_other”. Before transformations, the first 5 rows of the training dataset have the following items:

Training dataset before transformations

After transformations, the first few columns of the 5 rows look like this:

Training dataset after transformations

I am finally ready to explore Auto-sklearn using few simple commands that fit a new model:

Finally, here is how the model performs on a test dataset:

An alternative approach that doesn’t use Auto-sklearn would be to manually select a model and run a grid search to find best parameters. A typical, well-performing algorithm is RandomForestRegressor so I might try the following:

For comparison, the performance of this model would be:

Impressively, the default Auto-sklearn R2 performance of 0.186 is nearly twice as good as simplistic scikit-learn-only performance of 0.098. These are not intended to be absolute benchmarks because I performed no customization but the relative performance is worth noting. The results suggest that Auto-sklearn can set a very reasonable lower performance bound that no model deployed in production should underperform.

More about me:

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store