# Google Colab and Auto-sklearn with Profiling

--

This article is a follow up to my previous tutorial on how to setup Google Colab and auto-sklean. Here, I will go into more detail that shows auto-sklearn performance on an artificially created dataset. The full notebook gist can be found here.

First, I generated a regression dataset using scikit learn.

`X, y, coeff = make_regression(`

n_samples=1000,

n_features=100,

n_informative=5,

noise=0,

shuffle=False,

coef=True

)

This generates a dataset with 100 numerical features where the first 5 features are informative (these are labeled as “feat_0” to “feat_4”). The rest (“feat_5” to “feat_99”) are random noise. We can see this in the scatter matrix above where only the first 5 features show a correlation with the label.

We know that this is a simple regression problem which could be solved using a linear regression perfectly. However, knowing what to expect helps us to verify the performance of auto-sklearn which trains its ensemble model using the following steps:

`import autosklearn.regressionautoml = autosklearn.regression.AutoSklearnRegressor(`

time_left_for_this_task=300,

n_jobs=-1

)

automl.fit(

X_train_transformed,

df_train["label"]

)

I also created random categorical features which are then one-hot-encoded into a feature set “X_train_transformed“. Running the AutoSklearnRegressor for 5 minutes (time_left_for_this_task=300) produced the following expected results:

`predictions = automl.predict(X_train_transformed)`

r2_score(df_train["label"], predictions)

>> 0.999

predictions = automl.predict(X_test_transformed)

r2_score(df_test["label"], predictions)

>> 0.999

A separate pip package PipelineProfiler helps us visualize the steps auto-sklearn took to achieve the result:

Above we can see the attempts auto-sklearn made to generate the best emsemble of models within the 5 minute constraint I set. The best model found was Liblinear SVM, which produced R2 of nearly 1.0. As a result, this toy ensemble model gives weight of 1.0 to just one algorithm. Libsvm Svr and Gradient boosting scored between 0.9–0.96.

This article was originally published on my personal website adamnovotny.com