We’ll start with the “classic” example. Consider the data plotted in figure 1. Each data point has 3 “properties”: an $x$ coordinate, a $y$ coordinate and a colour (red or blue). Suppose we want to be able to separate all data points into two groups: red points and blue points. Furthermore, we want to be able to do this linearly, i.e. we want to be able to draw a line (or plane or hyperplane) such that all points on one side are blue, all points on the other are red. This is called linear classification.
Suppose for each data point $x$ we generate a representation $\phi(x)$ of the data point. We’ll call $\phi$ the feature map and call $\phi(x)$ the feature vector of $x$. Note that each feature vector exists in $\mathbb{R}^3$. Figure 2 shows a different plot of the data, or rather, of their features. Now we can linearly separate the blue and red data sets using the grey plane as shown. This simple example illustrates that mapping the data into higher dimensional spaces can sometimes allow us to “see” things about the data more clearly or definitively. In this case our feature maps were finite dimensional, but do they need to be?
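Since the figures aren’t reproduced here, here is a minimal sketch of the same idea in Python. The concentric-circles data and the particular map $\phi(x) = (x_1, x_2, x_1^2 + x_2^2)$ are assumptions for illustration, not necessarily the data and map behind figures 1 and 2: two classes that no line in $\mathbb{R}^2$ can separate become separable by a horizontal plane once lifted into $\mathbb{R}^3$.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(n, r_min, r_max):
    """Sample n points from an annulus with radii in [r_min, r_max]."""
    radii = rng.uniform(r_min, r_max, n)
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

red = ring(100, 0.0, 1.0)    # inner disc: "red" points
blue = ring(100, 2.0, 3.0)   # outer ring: "blue" points (not linearly separable from red in 2D)

def phi(X):
    """Feature map (x1, x2) -> (x1, x2, x1^2 + x2^2), lifting R^2 into R^3."""
    return np.column_stack([X[:, 0], X[:, 1], (X ** 2).sum(axis=1)])

# In feature space the horizontal plane z = 1.5^2 separates the two classes:
threshold = 1.5 ** 2
print(np.all(phi(red)[:, 2] < threshold))   # True
print(np.all(phi(blue)[:, 2] > threshold))  # True
```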
So far we haven’t actually mentioned what a kernel is. Informally, a kernel is a function of two data points, defined as the inner product between the projections of the data points into feature space. Intuitively, it’s telling us how ‘close’ these two points are in feature space ($\phi$ maps data points to points in feature space). More technically, a kernel is defined as follows:
Definition 1: kernel. Let $\mathcal{X}$ be a non-empty set. A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a kernel if there exists a real Hilbert space $\mathcal{H}$ and a map $\phi: \mathcal{X} \to \mathcal{H}$ such that for every $x, x' \in \mathcal{X}$, $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product defined on $\mathcal{H}$.
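To make the definition concrete, the simplest instance (a standard example rather than one from the post) is the linear kernel on $\mathcal{X} = \mathbb{R}^d$:
\[
k(x, x') = \langle x, x' \rangle, \qquad \text{with } \mathcal{H} = \mathbb{R}^d \text{ and } \phi(x) = x,
\]
so the feature map can be as trivial as the identity. The interesting kernels are the ones for which $\phi$ does something less obvious.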
Note that the only restriction on $\mathcal{X}$ (the space in which our data points are defined) is that it be a non-empty set. Data can be anything! Numbers, vectors, strings of characters, images, you name it. Kernels have a few neat properties: they can be added together or multiplied together to form new kernels. Can you prove that the sum of two kernels defines a kernel? The product proof is a little more involved. Interested readers can see [1].
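In case you want to check your answer for the sum, here is one possible sketch (a standard argument, not quoted from the post): if $k_1$ and $k_2$ have feature maps $\phi_1: \mathcal{X} \to \mathcal{H}_1$ and $\phi_2: \mathcal{X} \to \mathcal{H}_2$, take $\mathcal{H} = \mathcal{H}_1 \oplus \mathcal{H}_2$ with $\phi(x) = (\phi_1(x), \phi_2(x))$. Then
\[
\langle \phi(x), \phi(x') \rangle_{\mathcal{H}} = \langle \phi_1(x), \phi_1(x') \rangle_{\mathcal{H}_1} + \langle \phi_2(x), \phi_2(x') \rangle_{\mathcal{H}_2} = k_1(x, x') + k_2(x, x'),
\]
so $k_1 + k_2$ satisfies Definition 1 directly.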
There are some immediately useful repercussions of these results. For example, for any kernel $k$, any finite positive integer $m$ and any constant $c \geq 0$, $(k(x, x') + c)^m$ is also a valid kernel. Kernels of this form are referred to as polynomial kernels.
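To see what the feature space of a polynomial kernel looks like, take the illustrative (assumed) case $\mathcal{X} = \mathbb{R}^2$, $k(x, x') = \langle x, x' \rangle$, $c > 0$ and $m = 2$. Expanding,
\[
(\langle x, x' \rangle + c)^2 = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2\, x_1 x_2\, x_1' x_2' + 2c\, x_1 x_1' + 2c\, x_2 x_2' + c^2,
\]
which is exactly $\langle \tilde{\phi}(x), \tilde{\phi}(x') \rangle$ for the six-dimensional feature map
\[
\tilde{\phi}(x) = \big[\, x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2c}\, x_1,\ \sqrt{2c}\, x_2,\ c \,\big]^{\top}.
\]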
A small extension to this idea will introduce us to a group of very important kernels.
The set $\ell^2$ is defined as the set of all infinite sequences $(a_i)_{i \geq 1}$ whose series of squares, $\sum_{i=1}^{\infty} a_i^2$, is bounded (see here for a reminder.)
This allows us to state the following lemma: let $\mathcal{X}$ be a non-empty set and let the sequence of maps $(\phi_i)_{i \geq 1}$ be such that $\phi_i: \mathcal{X} \to \mathbb{R}$ for every $i$ and that, for every $x \in \mathcal{X}$, the sequence $(\phi_i(x))_{i \geq 1}$ is in $\ell^2$. Then we have that, for any $x, x' \in \mathcal{X}$,
\[
k(x, x') := \sum_{i=1}^{\infty} \phi_i(x)\,\phi_i(x')
\]
is a valid kernel. We call $\phi_i$ the $i$-th co-ordinate of the feature map $\phi$.
The proof of this makes nice use of the Cauchy-Schwarz inequality and I recommend trying it at home (if you get stuck, see here for help.)
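For readers who want a nudge, a sketch of the key step (my wording, not the linked solution): by the Cauchy-Schwarz inequality in $\ell^2$,
\[
\left| \sum_{i=1}^{\infty} \phi_i(x)\,\phi_i(x') \right| \leq \left( \sum_{i=1}^{\infty} \phi_i(x)^2 \right)^{1/2} \left( \sum_{i=1}^{\infty} \phi_i(x')^2 \right)^{1/2} < \infty,
\]
so the series defining $k(x, x')$ converges absolutely, and $k$ is simply the $\ell^2$ inner product of the sequences $\phi(x) = (\phi_i(x))_{i \geq 1}$ and $\phi(x') = (\phi_i(x'))_{i \geq 1}$, with $\mathcal{H} = \ell^2$.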
The insightful reader may notice that square summable (infinite) series appear often in applied mathematics: as the Taylor series expansions of infinitely differentiable functions. We can use a restricted version of this result to increase our repertoire of kernels.
Assume the real-valued function $f$ can be written in the form
\[
f(z) = \sum_{n=0}^{\infty} a_n z^n
\]
for $|z| < r$, where $r \in (0, \infty]$, and with $a_n \geq 0$ for all $n$. Then, for a positive integer $d$, let $\mathcal{X} := \{ x \in \mathbb{R}^d : \|x\| < \sqrt{r} \}$. We then have that, for any $x, x' \in \mathcal{X}$,
\[
k(x, x') := f(\langle x, x' \rangle) = \sum_{n=0}^{\infty} a_n \langle x, x' \rangle^n
\]
is a well-defined kernel.
Again, I encourage the reader to try and prove this. Use Cauchy-Schwarz: what bounds the inner product of $x$ and $x'$? Use this to argue that the Taylor series converges. Then re-write $\langle x, x' \rangle^n$ in ‘conventional’ summation notation… do some re-arranging and relabelling to make it look like a ‘conventional’ inner product… and voilà!
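As a worked instance of this lemma (a standard example, assumed rather than taken from the post), consider $f(z) = e^z = \sum_{n=0}^{\infty} z^n / n!$. Every coefficient $a_n = 1/n!$ is non-negative and the radius of convergence is $r = \infty$, so $\mathcal{X}$ can be all of $\mathbb{R}^d$ and
\[
k(x, x') = \exp\big( \langle x, x' \rangle \big) = \sum_{n=0}^{\infty} \frac{\langle x, x' \rangle^n}{n!}
\]
is a valid kernel, sometimes called the exponential kernel. Multiplying it by the kernel $e^{-\|x\|^2/2}\, e^{-\|x'\|^2/2}$ (itself of the form $g(x)\,g(x')$, hence a kernel) gives the familiar Gaussian kernel $\exp\!\big(-\|x - x'\|^2 / 2\big)$.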
All inner products on Hilbert spaces are positive definite so, similarly, all our kernels defined so far also have this property. What’s great is that the converse holds true too: for any symmetric positive definite function, there exists at least one (though usually many) projection map into a Hilbert space such that our original kernel definition holds true. More precisely…
Let $\mathcal{X}$ be a non-empty set. The symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if, for any $n \geq 1$, any set $\{x_1, \dots, x_n\} \subseteq \mathcal{X}$ and any $(a_1, \dots, a_n) \in \mathbb{R}^n$, we have that
\[
\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, k(x_i, x_j) \geq 0.
\]
Luckily, inner products always give us positive definite functions:
Let $\mathcal{H}$ be any Hilbert space and $\mathcal{X}$ a non-empty set. Let $\phi: \mathcal{X} \to \mathcal{H}$ and define $k(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. Then $k$ is positive definite.
Proof: for any $n \geq 1$, any $\{x_1, \dots, x_n\} \subseteq \mathcal{X}$ and any $(a_1, \dots, a_n) \in \mathbb{R}^n$,
\[
\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}} = \left\langle \sum_{i=1}^{n} a_i \phi(x_i), \sum_{j=1}^{n} a_j \phi(x_j) \right\rangle_{\mathcal{H}} = \left\| \sum_{i=1}^{n} a_i \phi(x_i) \right\|_{\mathcal{H}}^2 \geq 0.
\]
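A quick numerical sanity check of this fact, sketched with numpy and the polynomial kernel from earlier (the tolerance is only there to absorb floating-point error): the Gram matrix of a kernel on any finite set of points should have no genuinely negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)

def poly_kernel(X, Y, c=1.0, m=2):
    """Polynomial kernel (<x, y> + c)^m evaluated for all pairs of rows of X and Y."""
    return (X @ Y.T + c) ** m

X = rng.normal(size=(50, 3))          # 50 random points in R^3
K = poly_kernel(X, X)                 # 50 x 50 Gram matrix, K[i, j] = k(x_i, x_j)

eigenvalues = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh is appropriate
print(eigenvalues.min() >= -1e-8)     # True: sum_ij a_i a_j K[i, j] >= 0 for any weights a
```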
Let’s return to our earlier finite feature vector example, where each data point $x$ has the feature vector
\[
\phi(x) = \big[ \phi_1(x),\ \phi_2(x),\ \phi_3(x) \big]^{\top} \in \mathbb{R}^3.
\]
Suppose I define a function $f$ on $\mathcal{X}$ which has the form
\[
f(x) = a\,\phi_1(x) + b\,\phi_2(x) + c\,\phi_3(x),
\]
I can then just as easily write $f$ as the coefficient vector $[a, b, c]^{\top}$. We call this vector a representation of $f$ and, following a fairly standard notational convention, write
\[
f(\cdot) = [a, b, c]^{\top}
\]
to distinguish the representation of $f$ from $f$ itself. Often, though, the two are used interchangeably. Note that a given function may not have a unique representation and, more often than not, will have many equivalent representations. So why do we need the representation of $f$? Well, notice that, at any data point $x$, $f$ can be evaluated using
\[
f(x) = \langle f(\cdot), \phi(x) \rangle,
\]
where I’ve dropped the subscript $\mathcal{H}$ on the inner product for convenience. So $f(\cdot)$ lives in the same Hilbert space as our features! So why can’t we “evaluate” other elements in the Hilbert space? In particular, why can’t we evaluate other features? Well, we can. Let’s evaluate the features of points $x$ and $x'$:
\[
\langle \phi(x), \phi(x') \rangle = k(x, x')
\]
we just get back the kernel. So, by extension of our notation, we can write
\[
\phi(x) = k(\cdot, x).
\]
We call $k(\cdot, x)$ the canonical feature map of this (Reproducing Kernel) Hilbert Space.
Now consider the space of all functions which have representations like that of $f$. This space (or rather, the space of representations of these functions), denoted $\mathcal{H}$, constitutes a Hilbert space and is said to have the reproducing property, namely that for all $f \in \mathcal{H}$ and all $x \in \mathcal{X}$, we have that $f(x) = \langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}}$. Of equal importance is that $k(\cdot, x) \in \mathcal{H}$ for every $x \in \mathcal{X}$, something formally given to us by Riesz’s representation theorem (see next post). The Hilbert space $\mathcal{H}$ is what we call a Reproducing Kernel Hilbert Space and is uniquely defined by its kernel (something we’ll discuss next time).
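To tie this back to the finite-dimensional example (still using the illustrative coefficients $a, b, c$ from above): with $f(\cdot) = [a, b, c]^{\top}$ and $k(\cdot, x) = \phi(x) = [\phi_1(x), \phi_2(x), \phi_3(x)]^{\top}$, the reproducing property is just a dot product in $\mathbb{R}^3$,
\[
\langle f(\cdot), k(\cdot, x) \rangle = a\,\phi_1(x) + b\,\phi_2(x) + c\,\phi_3(x) = f(x),
\]
and choosing $f(\cdot) = k(\cdot, x')$ recovers $\langle k(\cdot, x'), k(\cdot, x) \rangle = k(x', x)$, the kernel itself.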
If things seem a bit confusing, don’t fret! Next time we’ll consider some alternative, equivalent ways of defining an RKHS which may appear clearer and more intuitive to some. After that, we’ll start looking at what exactly an RKHS is good for.
Acknowledgements
This post (and others) draws heavily on Arthur Gretton’s notes on RKHSs, which can be found here: http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/rkhscourse.html
Bibliography
[1] N. Aronszajn, “Theory of Reproducing Kernels”, Transactions of the American Mathematical Society, Vol. 68, No. 3 (May 1950), pp. 337–404. (http://www.math.drexel.edu/~tolya/Aronszajn_RKHS.pdf)