One of the most convenient features in scikit-learn
is the ability to build complex models by chaining transformers and estimators into pipelines.
Importantly, all (hyper-)parameters of each transformer remain accessible and tunable. That simplicity suffers, however, as soon as we need to add custom preprocessing functions to the pipeline. The “standard” approach using sklearn.preprocessing.FunctionTransformer felt decidedly unsatisfactory once I tried to define parameter search spaces, so I looked into implementing a more usable alternative:
Beautiful is better than ugly!
Full source code for this post is available here: https://github.com/ig248/mario/blob/master/notebooks/factory.ipynb
The problem with FunctionTransformer
Let’s consider a simple stateless transform (i.e. one that does not need to store any fitted parameters):
def scale(x, factor=1.):
"""Scale array by given factor."""
return x * factor
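A quick sanity check (assuming a NumPy array as input, so that the multiplication is element-wise):
>>> import numpy as np
>>> scale(np.array([1., 2., 3.]), factor=2.)
array([2., 4., 6.])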
To wrap this in a pipeline, we can use the built-in FunctionTransformer:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

pipeline = Pipeline([
    ('scaler', FunctionTransformer(scale))
])
We can use pipeline.get_params() or pipeline.named_steps['scaler'].get_params() to inspect the settable parameters. In fact, the only way to change factor is through FunctionTransformer’s catch-all kw_args argument:
pipeline.set_params(scaler__kw_args={'factor': 2.})
If we wanted to perform a hyperparameter search, we would need to define a grid in a rather cumbersome way:
param_grid = {
'scaler__kw_args': [{'factor': 1.}, {'factor': 2.}]
}
Imagine having a function with more than one hyperparameter: instead of having access to various search strategies in multi-dimensional space (including the more advanced ones such as https://scikit-optimize.github.io/), we have to construct a 1D list of parameter dictionaries.
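For example, with a hypothetical two-parameter preprocessing function (shift_and_scale is purely illustrative), every combination has to be enumerated by hand:
def shift_and_scale(x, factor=1., offset=0.):
    """Hypothetical transform with two hyperparameters."""
    return x * factor + offset

# every point of the 2D grid becomes an entry in a flat 1D list
param_grid = {
    'scaler__kw_args': [
        {'factor': factor, 'offset': offset}
        for factor in [1., 2.]
        for offset in [0., 1.]
    ]
}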
Creating a custom transformer
If we really like the scale function, we can wrap it in a custom transformer class:
from sklearn.base import BaseEstimator, TransformerMixin

class ScaleTransformer(BaseEstimator, TransformerMixin):
    """Custom scaling transformer."""
    def __init__(self, factor=1.):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return scale(X, factor=self.factor)
pipeline = Pipeline([
('scaler', ScaleTransformer())
])
Now, we can use pipeline.set_params(scaler__factor=2.)!
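Indeed, factor now shows up as a first-class pipeline parameter:
>>> 'scaler__factor' in pipeline.get_params()
True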
The magic happens under the hood: get_params() (provided by BaseEstimator) inspects the signature of the transformer's __init__ method to determine which parameters are available. However, writing all this boilerplate for every parametric function is repetitive and outright un-pythonic.
Creating custom transformers dynamically
What I wanted was a transformer factory that constructs the equivalent transformer class (or instance) from the function alone, along the lines of:
pipeline = Pipeline([
('scaler', function_transformer(scale))
])
pipeline.set_params(scaler__factor=2.)
To this end, we need to solve three problems:
1. Determine the signature of the input function
2. Create functions for the class methods __init__, fit and transform
3. Create the transformer class
Getting the function signature
Using the all-powerful inspect module, we can get the function name, its positional args, keyword args, and their default values:
import inspect

signature = inspect.signature(func)
args = [name for name, param in signature.parameters.items()
        if param.default is inspect.Parameter.empty]
kwargs_defaults = [(name, param.default) for name, param in signature.parameters.items()
                   if param.default is not inspect.Parameter.empty]
kwargs, defaults = zip(*kwargs_defaults)
all_args = list(args) + list(kwargs)
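For the scale function defined above (i.e. with func = scale), this gives:
>>> args
['x']
>>> kwargs
('factor',)
>>> defaults
(1.0,)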
Creating the class methods
Unfortunately, the only way to create these class methods with the correct signatures seems to involve eval; the FunctionMaker class from the decorator module provides some respite.
from decorator import FunctionMaker

# __init__ stores the wrapped function and every keyword argument on self
init_signature = '__init__(self, {args})'.format(args=', '.join(kwargs))
init_kwarg_string = '\n'.join(['self.{kwarg}={kwarg}'.format(kwarg=kwarg) for kwarg in kwargs])
init_body = """self.func = func
{init_kwarg_string}""".format(init_kwarg_string=init_kwarg_string)
proto__init = FunctionMaker.create(init_signature, init_body, {'func': func}, defaults=defaults)

# fit does nothing for a stateless transform
proto_fit = FunctionMaker.create('fit(self, x)', 'return self', {})

# transform calls the wrapped function with the stored keyword arguments
kwarg_string = ', '.join(['{kwarg}=self.{kwarg}'.format(kwarg=kwarg) for kwarg in kwargs])
transform_body = 'return self.func(x, {kwarg_string})'.format(kwarg_string=kwarg_string)
proto_transform = FunctionMaker.create('transform(self, x)', transform_body, {})
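A quick check (again for func = scale) shows that the generated __init__ exposes the expected signature, something like:
>>> import inspect
>>> inspect.signature(proto__init)
<Signature (self, factor=1.0)>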
Creating the new class
Having created the methods, we build a dictionary of methods and attributes for our new class (and throw in the docstring for good measure):
proto_dict = {
'__doc__': func.__doc__,
'__init__': proto__init,
'fit': proto_fit,
'transform': proto_transform
}
Now, we can use the built-in type() to create the class:
from sklearn.base import BaseEstimator, TransformerMixin
new_class = type('FunctionTransformer_'+func.__name__, (BaseEstimator, TransformerMixin), proto_dict)
new_transformer = new_class()
Behold, the new object now looks and feels as expected!
>>> new_transformer
FunctionTransformer_scale(factor=1.0)
>>> new_transformer.set_params(factor=3)
FunctionTransformer_scale(factor=3)
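And it transforms data just like the hand-written class (with factor=3 set above):
>>> import numpy as np
>>> new_transformer.transform(np.array([1., 2., 3.]))
array([3., 6., 9.])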
Complete code
Putting it all together, I arrived at the implementation available in the repository linked at the top of this post. Usage looks like this:
from sklearn.pipeline import Pipeline
from mario.factory import function_transformer
pipeline = Pipeline([
('identity', function_transformer()),
('scaler', function_transformer(scale, factor=2))
])
pipeline.set_params(scaler__factor=10)
No more parameter grids over lists of kw_args dictionaries! We can perform parameter searches using
param_grid = {
'scaler__factor': [1, 2, 3]
}
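Such a grid can now be handled by scikit-learn's standard search tools; as a quick illustration, sklearn.model_selection.ParameterGrid expands it natively:
>>> from sklearn.model_selection import ParameterGrid
>>> list(ParameterGrid(param_grid))
[{'scaler__factor': 1}, {'scaler__factor': 2}, {'scaler__factor': 3}]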
Now we just need to replace scaling with a preprocessing function that is actually helpful!