I am looking into using a GPR for a rather peculiar context, where I need to write my own Kernel. However I found out there's no documentation about how to do this. Trying to simply inherit from Kernel and implementing the methods __call__, get_params, diag and is_stationary is enough to get the fitting process to work, but then breaks down when I try to predict y values and standard deviations. What are the necessary steps to build a minimal but functional class that inherits from Kernel while using its own function? Thanks!
Perhaps the most widely used kernel is probably the radial basis function kernel (also called the quadratic exponential kernel, the squared exponential kernel or the Gaussian kernel): k(xₙ, xₘ) = exp(-||xₙ - xₘ||²/2L²), where L the kernel length scale.
Overview. Gaussian processes are non-parametric kernel based Bayesian tools to perform inference. Non-parametric kernel solutions are based on providing a new solution for some new input by using the set of training data. This set is included in the so-called kernel matrix.
The only difference between the two models is the K in the regularisation term. The key theoretical advantage of the kernel approach is that it allows you to interpret a non-linear model as a linear model following a fixed non-linear transformation that doesn't depend on the sample of data.
In other words, the Gaussian kernel transforms the dot product in the infinite dimensional space into the Gaussian function of the distance between points in the data space: If two points in the data space are nearby then the angle between the vectors that represent them in the kernel space will be small.
Depending on how exotic your kernel will be, the answer to your question may be different.
I find the implementation of the RBF kernel quite self-documenting, so I use it as reference. Here is the gist:
class RBF(StationaryKernelMixin, NormalizedKernelMixin, Kernel):
    def __init__(self, length_scale=1.0, length_scale_bounds=(1e-5, 1e5)):
        self.length_scale = length_scale
        self.length_scale_bounds = length_scale_bounds
    @property
    def hyperparameter_length_scale(self):
        if self.anisotropic:
            return Hyperparameter("length_scale", "numeric",
                                  self.length_scale_bounds,
                                  len(self.length_scale))
        return Hyperparameter(
            "length_scale", "numeric", self.length_scale_bounds)
    def __call__(self, X, Y=None, eval_gradient=False):
        # ...
As you mentioned, your kernel should inherit from Kernel, which requires you to implement __call__, diag and is_stationary. Note, that sklearn.gaussian_process.kernels provides StationaryKernelMixin and NormalizedKernelMixin, which implement diag and is_stationary for you (cf. RBF class definition in the code).
You should not overwrite get_params! This is done for you by the Kernel class and it expects scikit-learn kernels to follow a convention, which your kernel should too: specify your parameters in the signature of your constructor as keyword arguments (see length_scale in the previous example of RBF kernel). This ensures that your kernel can be copied, which is done by GaussianProcessRegressor.fit(...) (this could be the reason that you could not predict the standard deviation).
At this point, you may notice the other parameter length_scale_bounds. That is only a constraint on the actual hyper parameter length_scale (cf. constrained optimization). This brings us to the fact, that you need to also declare your hyper parameters, that you want optimized and need to compute gradients for in your __call__ implementation. You do that by defining a property of your class that is prefixed by hyperparameter_ (cf. hyperparameter_length_scale in the code). Each hyper parameter that is not fixed (fixed = hyperparameter.fixed == True) is returned by Kernel.theta, which is used by GP on fit() and to compute the marginal log likelihood. So this is essential if you want to fit parameters to your data.
One last detail about Kernel.theta, the implementation states:
Returns the (flattened, log-transformed) non-fixed hyperparameters.
So you should be careful with 0 values in your hyper parameters as they can end up as np.nan and break stuff.
I hope this helps, even though this question is already a bit old. I have actually never implemented a kernel myself, but was eager to skim the sklearn code base. It is unfortunate that there are no official tutorials on that, the code base, however, is quite clean and commented.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With