3. Scikit-learn

The Scikit-learn package provides a set of machine learning implementations for classification and regression, such as random forests, SVMs, boosting, discriminant analysis and k-nearest neighbors. It also implements easy-to-use cross-validation and data-splitting utilities, as well as Pipelines for combining multiple transformations and a final estimator into a single estimator.

Let’s start with a simple example. First, we generate some synthetic data:

[1]:
import numpy as np
import scipy.stats as stats

n_train = 8000
n_test = 800
d = 30

# random coefficients of a linear decision rule
beta = stats.norm.rvs(size=d)
# label = 1 if the noisy linear score is positive, 0 otherwise
func = lambda x: (np.dot(beta, x) + stats.norm.rvs(scale=2)) > 0

x_train = stats.norm.rvs(scale=3, size=n_train*d).reshape((n_train, d))
y_train = np.apply_along_axis(func, 1, x_train).astype(int)

x_test = stats.norm.rvs(scale=3, size=n_test*d).reshape((n_test, d))
y_test = np.apply_along_axis(func, 1, x_test).astype(int)

Now, let’s create a 3-nearest neighbors classifier for this:

[2]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(3)

And then fit the estimator to the data:

[3]:
clf.fit(x_train, y_train)
[3]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

We can now use our clf to classify the test data:

[4]:
clf.predict(x_test)
[4]:
array([0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0])

Try this to check its accuracy:

[5]:
1 - np.abs((clf.predict(x_test) - y_test)).sum() / y_test.shape[0]
[5]:
0.74625000000000008
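The same value can also be computed with the accuracy_score helper from sklearn.metrics; a quick sketch:

from sklearn.metrics import accuracy_score
# fraction of test points whose predicted label matches the true label
accuracy_score(y_test, clf.predict(x_test))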

Or use the equivalent built-in method:

[6]:
clf.score(x_test, y_test)
[6]:
0.74624999999999997

3.1. Cross validation

Instead of setting a fixed value for the tuning parameters of the estimator, you can easily use cross-validation to choose them:

[7]:
from sklearn.model_selection import GridSearchCV
gs_params = {'n_neighbors': np.arange(1, 4)}
gs_clf = GridSearchCV(clf, gs_params)
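As an aside, if you only want to estimate the performance of a fixed estimator by cross-validation rather than tune it, cross_val_score is convenient; a minimal sketch using the training data from above:

from sklearn.model_selection import cross_val_score
# mean accuracy of a 3-nearest neighbors classifier over 5 folds
scores = cross_val_score(KNeighborsClassifier(3), x_train, y_train, cv=5)
scores.mean()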

If you prefer, you can use a single shuffled train/validation split instead:

[8]:
from sklearn.model_selection import ShuffleSplit
#Validation set == 15% of data
cv = ShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
gs_clf = GridSearchCV(clf, gs_params, cv=cv)
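If you want an explicit hold-out split of the arrays themselves, rather than a cross-validation object, train_test_split follows the same idea; a small sketch with a 15% validation set:

from sklearn.model_selection import train_test_split
# shuffle and split the training data once: 85% for fitting, 15% for validation
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train,
                                            test_size=0.15, random_state=0)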

You can now fit the new estimator gs_clf to the data:

[9]:
gs_clf.fit(x_train, y_train)
gs_clf.predict(x_test)
gs_clf.score(x_test, y_test)
[9]:
0.74624999999999997

Note that gs_clf automatically uses the "best" estimator for classification when you call predict or score. You can also see the full list of results and the individual performance of each estimator on the validation set:

[14]:
#List performance of each estimator
gs_clf.cv_results_

#Obtain best estimator
gs_clf.best_estimator_

#Obtain a specific parameter of the best estimator
gs_clf.best_params_["n_neighbors"]
[14]:
3
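Since cv_results_ is a dictionary of arrays, it is often convenient to view it as a table; a quick sketch, assuming pandas is available:

import pandas as pd
# one row per candidate value of n_neighbors, with its mean validation score
results = pd.DataFrame(gs_clf.cv_results_)
results[['param_n_neighbors', 'mean_test_score', 'rank_test_score']]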

3.2. Pipeline

Sometimes, as when working with images or text, you need to apply some transformation to the data before actually fitting the estimator. Scikit-learn has many of the common transformations already implemented. Suppose, for instance, that your data come as dictionaries:

[15]:
from sklearn.feature_extraction import DictVectorizer
x_train = [
           {'car': 1, 'house': 2},
           {'car': 3, 'man': 1},
           {'house': 3, 'man': 1, 'dog': 2},
           {'man': 1, 'dog': 3, 'car': 4},
           {'dog': 4, 'man': 1},
           {'house': 2, 'man': 5},
          ]
y_train = [1, 1, 0, 1, 0, 1]
transf = DictVectorizer(sparse=False)
x_train_transformed = transf.fit_transform(x_train)
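You can inspect the vocabulary the transformer learned; a quick sketch (older scikit-learn versions expose get_feature_names(), recent ones get_feature_names_out()):

# column names of the dense array produced above, in order
transf.get_feature_names()
x_train_transformed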

You can now fit this transformed data as usual:

[16]:
clf = KNeighborsClassifier(3)
clf.fit(x_train_transformed, y_train)
[16]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

You can then transform new data and predict on it:

[17]:
x_test = [
           {'car': 3, 'house': 4},
           {'dog': 2, 'man': 4, 'house': 3},
          ]
x_test_transformed = transf.transform(x_test)
clf.predict(x_test_transformed)
[17]:
array([1, 0])

But with Pipeline you can also easily create a single estimator that automatically transforms the data and then applies the classifier:

[18]:
from sklearn.pipeline import Pipeline
full_clf = Pipeline([
                     ('transformer', DictVectorizer(sparse=False)),
                     ('classificator', KNeighborsClassifier())
                    ])
full_clf.fit(x_train, y_train)
full_clf.predict(x_test)
[18]:
array([1, 1])
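The individual steps remain accessible inside the fitted Pipeline through named_steps; a quick sketch:

# the fitted DictVectorizer and KNeighborsClassifier inside the pipeline
full_clf.named_steps['transformer']
full_clf.named_steps['classificator']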

You can also use GridSearchCV together with Pipeline; parameters of the individual steps are addressed as <step name>__<parameter>:

[19]:
gs_full_params = {'classificator__n_neighbors': np.arange(1, 4),
                  'classificator__p': np.arange(1, 10)}
gs_full_clf = GridSearchCV(full_clf, gs_full_params, cv=2)
gs_full_clf.fit(x_train, y_train)
gs_full_clf.predict(x_test)
[19]:
array([1, 0])

Here, cv=2 means we are using 2-fold cross-validation.
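As before, you can inspect the winning combination of parameters; a quick sketch:

# best parameter combination found by the grid search, reported with the
# '<step name>__<parameter>' naming convention
gs_full_clf.best_params_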

3.3. Pipeline memory and parallel GridSearchCV

Pipeline has an interesting argument called memory: it allows the fitted transformations to be cached on disk and reused when the same Pipeline is applied to the same data. You can set it to a local folder of your choice or use mkdtemp to get a temporary folder:

[20]:
...
from tempfile import mkdtemp
full_clf = Pipeline([
                     ('transformer', DictVectorizer(sparse=False)),
                     ('classificator', KNeighborsClassifier())
                    ], memory=mkdtemp())
...
[20]:
Ellipsis
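If you use mkdtemp, it can be worth keeping a reference to the cache directory so you can remove it when you are done; a minimal sketch following the same pipeline as above:

from tempfile import mkdtemp
from shutil import rmtree

cachedir = mkdtemp()
cached_clf = Pipeline([
                       ('transformer', DictVectorizer(sparse=False)),
                       ('classificator', KNeighborsClassifier())
                      ], memory=cachedir)
# ... fit and predict as usual ...
rmtree(cachedir)  # delete the cached transformations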

GridSearchCV also has a useful argument called n_jobs that sets the number of jobs to run in parallel (useful on today's multi-core CPUs; n_jobs=-1 uses all available cores):

[21]:
...
gs_full_clf = GridSearchCV(full_clf, gs_full_params, cv=2, n_jobs=4)
...
[21]:
Ellipsis

3.4. Regression

You can work with regression in a similar fashion. First, let’s use the built-in samples_generator from Scikit-learn to generate some data for us:

[22]:
from sklearn.datasets import samples_generator
x_train, y_train = samples_generator.make_regression()
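Note that the samples_generator module was removed in recent scikit-learn releases; make_regression can be imported directly from sklearn.datasets instead:

# equivalent on recent scikit-learn versions
from sklearn.datasets import make_regression
x_train, y_train = make_regression()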

Now, let’s fit a Lasso regression with its regularization parameter chosen by cross-validation:

[23]:
from sklearn.linear_model import LassoCV
reg = LassoCV()
reg.fit(x_train, y_train)
reg.predict(np.zeros(100).reshape(1, -1))
[23]:
array([-0.02769625])
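After fitting, LassoCV exposes the regularization strength it selected by cross-validation and the fitted coefficients; a quick sketch:

reg.alpha_   # regularization parameter chosen by cross-validation
reg.coef_    # fitted coefficient vector (many entries shrunk toward zero)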

3.5. Density estimation

Scikit-learn also has estimators for problems other than classification and regression. To illustrate, here’s code for kernel density estimation with the bandwidth chosen by cross-validation:

[26]:
from sklearn.neighbors import KernelDensity
n = 800
y = stats.norm.rvs(scale=3, size=n).reshape((-1, 1))
params_for_kde_cv = {'bandwidth': np.logspace(-2, 3, 100)}
grid = GridSearchCV(KernelDensity(), params_for_kde_cv)
grid.fit(y)

#obtain estimated density at some points
points = np.array([1.3, 0, -1.3]).reshape(-1, 1)
np.exp(grid.best_estimator_.score_samples(points))

#obtain true density for those points
stats.norm.pdf(np.array(points), scale=3)
[26]:
array([[ 0.12106354],
       [ 0.13298076],
       [ 0.12106354]])
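You can also read off the bandwidth selected by the grid search and draw new samples from the fitted density; a quick sketch:

# selected bandwidth and five new points drawn from the estimated density
grid.best_params_['bandwidth']
grid.best_estimator_.sample(5)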

3.6. Further reading

http://scikit-learn.org/