One of the most convenient features in scikit-learn
is the ability to build complex models by chaining transformers and estimators into pipelines.
Importantly, all (hyper-)parameters of each transformer remain accessible and tunable. That simplicity suffers, however, as soon as we need to add custom preprocessing functions to the pipeline. The “standard” approach using sklearn.preprocessing.FunctionTransformer felt decidedly unsatisfactory once I tried to define parameter search spaces, so I looked into implementing a more usable alternative:
Beautiful is better than ugly!
Full source code for this post is available here: https://github.com/ig248/mario/blob/master/notebooks/factory.ipynb
The problem with FunctionTransformer
Let’s consider a simple stateless transform (i.e. one that does not need to store any fitted parameters):
def scale(x, factor=1.):
"""Scale array by given factor."""
return x * factor
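A quick sanity check (assuming a NumPy array as input, so that the multiplication is element-wise):
>>> import numpy as np
>>> scale(np.array([1., 2., 3.]), factor=2.)
array([2., 4., 6.])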
To wrap this in a pipeline, we can use the built-in FunctionTransformer:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

pipeline = Pipeline([
    ('scaler', FunctionTransformer(scale))
])
We can use pipeline.get_params() or pipeline.named_steps['scaler'].get_params() to inspect the settable parameters. In fact, the only way to change factor is through FunctionTransformer’s catch-all kw_args argument:
pipeline.set_params(scaler__kw_args={'factor': 2.})
If we wanted to perform a hyperparameter search, we would need to define a grid in a rather cumbersome way:
param_grid = {
'scaler__kw_args': [{'factor': 1.}, {'factor': 2.}]
}
Imagine having a function with more than one hyperparameter: instead of having access to various search strategies in multi-dimensional space (including the more advanced ones such as https://scikit-optimize.github.io/), we have to construct a 1D list of parameter dictionaries.
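For example, with a hypothetical two-parameter preprocessing function (shift_and_scale is purely illustrative), every combination has to be enumerated by hand:
def shift_and_scale(x, factor=1., offset=0.):
    """Hypothetical transform with two hyperparameters."""
    return x * factor + offset

# every point of the 2D grid becomes an entry in a flat 1D list
param_grid = {
    'scaler__kw_args': [
        {'factor': factor, 'offset': offset}
        for factor in [1., 2.]
        for offset in [0., 1.]
    ]
}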
Creating a custom transformer
If we really like the scale function, we can wrap it in a custom transformer class:
from sklearn.base import BaseEstimator, TransformerMixin

class ScaleTransformer(BaseEstimator, TransformerMixin):
    """Custom scaling transformer."""
    def __init__(self, factor=1.):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return scale(X, factor=self.factor)
pipeline = Pipeline([
('scaler', ScaleTransformer())
])
Now, we can use pipeline.set_params(scaler__factor=2.)!
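Indeed, factor now shows up as a first-class pipeline parameter:
>>> 'scaler__factor' in pipeline.get_params()
True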
The magic happens under the hood: get_params() (provided by BaseEstimator) inspects the signature of the transformer's __init__ method to determine which parameters are available. However, writing all this boilerplate for every parametric function is repetitive and outright un-pythonic.
Creating custom transformers dynamically
What I wanted was a transformer factory that constructs the equivalent transformer class (or instance) from the function alone, along the lines of:
pipeline = Pipeline([
('scaler', function_transformer(scale))
])
pipeline.set_params(scaler__factor=2.)
To this end, we need to solve three problems:
1. Determine the signature of the input function
2. Create functions for the class methods __init__, fit and transform
3. Create the transformer class
Getting the function signature
Using the all-powerful inspect module, we can get the function name, its positional args, keyword args, and their default values:
import inspect

signature = inspect.signature(func)
args = [name for name, param in signature.parameters.items()
        if param.default is inspect.Parameter.empty]
kwargs_defaults = [(name, param.default) for name, param in signature.parameters.items()
                   if param.default is not inspect.Parameter.empty]
kwargs, defaults = zip(*kwargs_defaults)
all_args = list(args) + list(kwargs)
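For the scale function defined above (i.e. with func = scale), this gives:
>>> args
['x']
>>> kwargs
('factor',)
>>> defaults
(1.0,)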
Creating the class methods
Unfortunately, the only way to create these class methods with the correct signatures seems to involve eval; the FunctionMaker class from the decorator module provides some respite.
from decorator import FunctionMaker

# __init__ stores the wrapped function and every keyword argument on self
init_signature = '__init__(self, {args})'.format(args=', '.join(kwargs))
init_kwarg_string = '\n'.join(['self.{kwarg}={kwarg}'.format(kwarg=kwarg) for kwarg in kwargs])
init_body = """self.func = func
{init_kwarg_string}""".format(init_kwarg_string=init_kwarg_string)
proto__init = FunctionMaker.create(init_signature, init_body, {'func': func}, defaults=defaults)

# fit does nothing for a stateless transform
proto_fit = FunctionMaker.create('fit(self, x)', 'return self', {})

# transform calls the wrapped function with the stored keyword arguments
kwarg_string = ', '.join(['{kwarg}=self.{kwarg}'.format(kwarg=kwarg) for kwarg in kwargs])
transform_body = 'return self.func(x, {kwarg_string})'.format(kwarg_string=kwarg_string)
proto_transform = FunctionMaker.create('transform(self, x)', transform_body, {})
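A quick check (again for func = scale) shows that the generated __init__ exposes the expected signature, something like:
>>> import inspect
>>> inspect.signature(proto__init)
<Signature (self, factor=1.0)>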
Creating the new class
Having created the methods, we build a dictionary of methods and attributes for our new class (and throw in the docstring for good measure):
proto_dict = {
'__doc__': func.__doc__,
'__init__': proto__init,
'fit': proto_fit,
'transform': proto_transform
}
Now, we can use the built-in type() to create the class:
from sklearn.base import BaseEstimator, TransformerMixin
new_class = type('FunctionTransformer_'+func.__name__, (BaseEstimator, TransformerMixin), proto_dict)
new_transformer = new_class()
Behold, the new object now looks and feels as expected!
>>> new_transformer
FunctionTransformer_scale(factor=1.0)
>>> new_transformer.set_params(factor=3)
FunctionTransformer_scale(factor=3)
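And it transforms data just like the hand-written class (with factor=3 set above):
>>> import numpy as np
>>> new_transformer.transform(np.array([1., 2., 3.]))
array([3., 6., 9.])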
Complete code
Putting it all together, I arrived at the implementation available in the repository linked at the top of this post. Usage looks like this:
from sklearn.pipeline import Pipeline
from mario.factory import function_transformer
pipeline = Pipeline([
('identity', function_transformer()),
('scaler', function_transformer(scale, factor=2))
])
pipeline.set_params(scaler__factor=10)
No more parameter grids over lists of kw_args dictionaries! We can perform parameter searches using
param_grid = {
'scaler__factor': [1, 2, 3]
}
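Such a grid can now be handled by scikit-learn's standard search tools; as a quick illustration, sklearn.model_selection.ParameterGrid expands it natively:
>>> from sklearn.model_selection import ParameterGrid
>>> list(ParameterGrid(param_grid))
[{'scaler__factor': 1}, {'scaler__factor': 2}, {'scaler__factor': 3}]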
Now we just need to replace scaling with a preprocessing function that is actually helpful!