We'll start with the "classic" example. Consider the data plotted in Figure 1. Each data point has three "properties": an $x_1$ coordinate, an $x_2$ coordinate and a colour (red or blue). Suppose we want to be able to separate all data points into two groups: red points and blue points. Furthermore, we want to be able to do this linearly, i.e. we want to be able to draw a line (or plane, or hyperplane) such that all points on one side are blue and all points on the other are red. This is called linear classification.

Figure 1: A scatter of data with three properties: an $x_1$ coordinate, an $x_2$ coordinate and a colour.
Suppose for each data point $x = (x_1, x_2)$ we generate a representation of the data point $\phi(x) = (\phi_1(x), \phi_2(x), \phi_3(x))$. We'll call $\phi$ the feature map and call $\phi(x)$ the feature vector of $x$. Note that each feature vector exists in $\mathbb{R}^3$. Figure 2 shows a different plot of the data, or rather, of their features. Now we can linearly separate the blue and red data sets using the grey plane as shown. This simple example illustrates that mapping the data into higher dimensional spaces can sometimes allow us to "see" things about the data more clearly or definitively. In this case our feature maps were finite dimensional, but do they need to be?

Figure 2: The same data projected into $\mathbb{R}^3$ using the feature map mentioned earlier. Notice how the grey plane can separate the different coloured clusters.
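The post keeps the exact data and feature map abstract, so here is a minimal Python sketch under assumed choices: concentric red/blue rings in $\mathbb{R}^2$ and the hypothetical map $\phi(x) = (x_1, x_2, x_1^2 + x_2^2)$. In that feature space the third coordinate is the squared radius, so a horizontal plane separates the colours.

```python
import numpy as np

# Hypothetical data: red points on an inner ring, blue points on an outer ring.
# These are NOT linearly separable in R^2.
rng = np.random.default_rng(0)
def ring(radius, n):
    angles = rng.uniform(0, 2 * np.pi, n)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

red, blue = ring(1.0, 50), ring(3.0, 50)

# Assumed feature map phi: R^2 -> R^3 (the post leaves the exact map unspecified).
def phi(X):
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

# In feature space the third coordinate is the squared radius, so the plane
# z = 5 separates the two classes: red below, blue above.
print(np.all(phi(red)[:, 2] < 5), np.all(phi(blue)[:, 2] > 5))   # True True
```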
So far we haven't actually mentioned what a kernel is. Informally, a kernel is a function of two data points, defined as the inner product between their images in feature space. Intuitively, it tells us how 'close' two points are in feature space (the feature map $\phi$ maps data points to points in feature space). More technically, a kernel is defined as follows:
Definition 1 (kernel): Let $\mathcal{X}$ be a non-empty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a kernel if there exists a real Hilbert space $\mathcal{H}$ and a map $\phi : \mathcal{X} \to \mathcal{H}$ such that for every $x, x' \in \mathcal{X}$, $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product defined on $\mathcal{H}$.
Note that the only restriction on $\mathcal{X}$ (the space in which our data points are defined) is that it be a non-empty set. Data can be anything! Numbers, vectors, strings of characters, images, you name it. Kernels have a few neat properties: they can be added together or multiplied together to form new kernels. Can you prove that the sum of two kernels defines a kernel? The product proof is a little more involved. Interested readers can see [1].
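As a numerical sanity check (not a proof, and with kernels chosen purely for illustration), the sketch below builds the Gram matrices of two simple kernels and confirms that their element-wise sum and product have no negative eigenvalues, i.e. they still behave like Gram matrices of kernels; this property is formalised as positive definiteness later in the post. A hint for the sum exercise: concatenating the two feature maps gives a feature map for $k_1 + k_2$.

```python
import numpy as np

# Numerical sanity check (not a proof): the element-wise sum and product of two
# kernel Gram matrices remain positive semi-definite. Kernels are illustrative.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

k1 = X @ X.T                       # linear kernel k1(x, x') = <x, x'>
k2 = (X @ X.T) ** 2                # k2(x, x') = <x, x'>^2, also a kernel

for K in (k1 + k2, k1 * k2):       # sum and (element-wise) product of kernels
    eigvals = np.linalg.eigvalsh((K + K.T) / 2)
    print(eigvals.min() >= -1e-9)  # True: no (significantly) negative eigenvalues
```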
There are some immediately useful repercussions from these results. For example, for any kernel $k$, any non-negative constant $c$ and any positive integer $m$, the function $(k(x, x') + c)^m$ is also a valid kernel. Kernels of this form are referred to as polynomial kernels.
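To make this concrete, here is a small check, with illustrative choices $m = 2$, $c = 1$, data in $\mathbb{R}^2$ and the base kernel $k(x, x') = \langle x, x' \rangle$, that $(\langle x, x' \rangle + c)^2$ really is an ordinary inner product of explicit, finite-dimensional feature vectors.

```python
import numpy as np

# Illustration for m = 2, c = 1 in R^2: the polynomial kernel (<x, y> + c)^2
# equals an inner product of explicit (monomial) feature vectors.
c = 1.0

def k_poly(x, y):
    return (x @ y + c) ** 2

def psi(x):
    # Explicit feature map for this particular kernel.
    return np.array([x[0] ** 2, x[1] ** 2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2 * c) * x[0], np.sqrt(2 * c) * x[1],
                     c])

x, y = np.array([0.5, -1.2]), np.array([2.0, 0.3])
print(np.isclose(k_poly(x, y), psi(x) @ psi(y)))  # True
```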
A small extension to this idea will introduce us to a group of very important kernels.
The space $\ell^2$ is defined as the set of all square-summable infinite sequences, i.e. sequences $(a_j)_{j \geq 1}$ with $\sum_{j=1}^{\infty} a_j^2 < \infty$ (see here for a reminder).
This allows us the following lemma: let $\mathcal{X}$ be a non-empty set and let the sequence of maps $(\phi_j)_{j \geq 1}$ be such that $\phi_j : \mathcal{X} \to \mathbb{R}$ for every $j$ and that, for every $x \in \mathcal{X}$, the sequence $(\phi_j(x))_{j \geq 1}$ is in $\ell^2$. Then we have that, for any $x, x' \in \mathcal{X}$,
$$k(x, x') = \sum_{j=1}^{\infty} \phi_j(x) \phi_j(x')$$
is a valid kernel. We call $\phi_j(x)$ the $j$-th co-ordinate of the feature map $\phi(x)$.
The proof of this makes nice use of the Cauchy-Schwarz inequality and I recommend trying it at home (if you get stuck, see here for help).
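Here is one concrete, entirely illustrative instance of the lemma: take $\mathcal{X} = (-1, 1)$ and $\phi_j(x) = x^j$. Each sequence $(\phi_j(x))_j$ is square-summable, and the resulting kernel has the closed form $k(x, x') = \sum_j (x x')^j = 1/(1 - x x')$, which the truncated sum below matches numerically.

```python
import numpy as np

# A concrete instance of the lemma (illustrative; not from the post): take
# X = (-1, 1) and phi_j(x) = x^j for j = 0, 1, 2, ...  Each sequence (phi_j(x))_j
# is square-summable, and the resulting kernel has the closed form
#   k(x, x') = sum_j (x x')^j = 1 / (1 - x x').
x, xp = 0.6, -0.4

features_x  = np.array([x  ** j for j in range(200)])   # truncated feature vectors
features_xp = np.array([xp ** j for j in range(200)])

print(np.isclose(features_x @ features_xp, 1.0 / (1.0 - x * xp)))  # True
```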
The insightful reader may notice that square-summable (infinite) series appear often in applied mathematics, for example as the Taylor series expansions of infinitely differentiable functions. We can use a restricted version of this result to increase our repertoire of kernels.
Assume the real-valued function $f$ can be written in the form
$$f(z) = \sum_{n=0}^{\infty} a_n z^n$$
for $|z| < r$, with $a_n \geq 0$ for all $n$. Then for this $r$ and a positive integer $d$, let $\mathcal{X}$ be the ball of radius $\sqrt{r}$ in $\mathbb{R}^d$. We then have that, for any $x, x' \in \mathcal{X}$,
$$k(x, x') = f(\langle x, x' \rangle) = \sum_{n=0}^{\infty} a_n \langle x, x' \rangle^n$$
is a well-defined kernel.
Again, I encourage the reader to try and prove this. Use Cauchy-Schwarz – what bounds the inner product of $x$ and $x'$? Use this to argue that the Taylor series converges. Then re-write $f(\langle x, x' \rangle)$ in 'conventional' summation notation… do some re-arranging and relabelling to make it look like a 'conventional' inner product… and voilà!
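A standard member of this family (an illustrative choice; the post doesn't single one out) is the exponential kernel obtained from $f(z) = \exp(z)$, i.e. $a_n = 1/n!$ and $r = \infty$. The sketch below checks that the truncated Taylor series matches $\exp(\langle x, x' \rangle)$ and that the resulting Gram matrix has no significantly negative eigenvalues.

```python
import numpy as np
from math import factorial

# Taylor-series kernel with f(z) = exp(z), i.e. a_n = 1/n! >= 0 and r = infinity
# (an illustrative choice; the post does not single out a particular f).
def k_exp(x, y, terms=30):
    z = x @ y
    return sum(z ** n / factorial(n) for n in range(terms))

rng = np.random.default_rng(2)
x, y = rng.normal(size=3), rng.normal(size=3)
print(np.isclose(k_exp(x, y), np.exp(x @ y)))   # truncated series matches exp(<x, y>)

# Sanity check that the resulting Gram matrix is positive semi-definite
# (up to numerical error relative to the matrix scale).
X = rng.normal(size=(20, 3))
K = np.exp(X @ X.T)
print(np.linalg.eigvalsh(K).min() >= -1e-8 * np.linalg.norm(K))   # True
```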
All inner products on Hilbert spaces are positive definite so, similarly, all our kernels defined so far also have this property. What's great is that the converse holds true too: for any symmetric positive definite function, there exists at least one (though usually many) feature map into a Hilbert space such that our original kernel definition holds true. More precisely…
Let $\mathcal{X}$ be a non-empty set. A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if for any $n \geq 1$, any set of points $\{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ and any coefficients $(a_1, \ldots, a_n) \in \mathbb{R}^n$, we have that
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) \geq 0.$$
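The definition can be checked directly by evaluating the double sum, which in matrix form is just $a^{\top} K a$ for the Gram matrix $K_{ij} = k(x_i, x_j)$. A small sketch with an illustrative kernel and random points and coefficients:

```python
import numpy as np

# Directly evaluating the double sum in the definition for one illustrative kernel,
# k(x, y) = (<x, y> + 1)^2, and many random choices of points and coefficients.
rng = np.random.default_rng(3)

for _ in range(1000):
    X = rng.normal(size=(10, 2))              # points x_1, ..., x_n
    a = rng.normal(size=10)                   # coefficients a_1, ..., a_n
    K = (X @ X.T + 1.0) ** 2                  # Gram matrix K_ij = k(x_i, x_j)
    assert a @ K @ a >= -1e-9                 # sum_ij a_i a_j k(x_i, x_j) >= 0
print("all quadratic forms non-negative")
```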
Luckily, inner products always give us positive definite functions:
Let $\mathcal{H}$ be any Hilbert space and $\mathcal{X}$ a non-empty set. Let $\phi : \mathcal{X} \to \mathcal{H}$. Then $k(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$ is positive definite.
Proof:
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) = \left\langle \sum_{i=1}^{n} a_i \phi(x_i), \sum_{j=1}^{n} a_j \phi(x_j) \right\rangle_{\mathcal{H}} = \left\| \sum_{i=1}^{n} a_i \phi(x_i) \right\|_{\mathcal{H}}^2 \geq 0.$$
Let's return to our earlier finite feature vector example, where $\phi(x) = (\phi_1(x), \phi_2(x), \phi_3(x)) \in \mathbb{R}^3$. Suppose I define a function $f$ on $\mathcal{X}$ which has the form
$$f(x) = a_1 \phi_1(x) + a_2 \phi_2(x) + a_3 \phi_3(x).$$
I can then just as easily write $f$ as the coefficient vector $(a_1, a_2, a_3)^{\top}$. We call this vector a representation of $f$ and, following a fairly standard notational convention, write $f(\cdot)$ to distinguish the representation of $f$ from the function value $f(x)$. Often though the two are used interchangeably. Note that a given function may not have a unique representation and, more often than not, will have many equivalent representations. So why do we need the representation of $f$?
Well, notice that at any data point $x \in \mathcal{X}$, $f$ can be evaluated using
$$f(x) = \langle f(\cdot), \phi(x) \rangle,$$
where I've dropped the subscript on the inner product for convenience. So $f(\cdot)$ lives in the same Hilbert space as our features! So why can't we "evaluate" other elements in the Hilbert space? In particular, why can't we evaluate other features? Well, we can. Let's evaluate the features of points $x$ and $x'$:
$$\langle \phi(x), \phi(x') \rangle = k(x, x'),$$
we just get back the kernel. So, by extension of our notation, we can write
$$\phi(x) = k(\cdot, x).$$
We call $k(\cdot, x)$ the canonical feature map of this (Reproducing Kernel) Hilbert Space.
Now consider the space of all functions which have representations like that of $f$. This space (or rather, the space of representations of these functions), denoted $\mathcal{H}_k$, constitutes a Hilbert space and is said to have the reproducing property, namely that for all $f \in \mathcal{H}_k$ and all $x \in \mathcal{X}$, we have that
$$f(x) = \langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}_k}.$$
Of equal importance is that $k(\cdot, x) \in \mathcal{H}_k$ for every $x \in \mathcal{X}$, something formally given to us by Riesz's representation theorem (see next post). The Hilbert space $\mathcal{H}_k$ is what we call a Reproducing Kernel Hilbert Space and is uniquely defined by its kernel (something we'll discuss next time).
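To make the reproducing property tangible, here is a sketch using the illustrative polynomial kernel $k(x, x') = (\langle x, x' \rangle + 1)^2$ on $\mathbb{R}^2$ and its explicit feature map from earlier: a function $f = \sum_i \alpha_i k(\cdot, x_i)$ is represented by a vector $f(\cdot)$ in feature space, evaluating $f$ at $x$ amounts to taking the inner product with the canonical feature $k(\cdot, x)$, and the canonical features reproduce the kernel itself.

```python
import numpy as np

# Sketch of the reproducing property, using the polynomial kernel
# k(x, y) = (<x, y> + 1)^2 on R^2 and its explicit 6-dimensional feature map psi
# (both illustrative choices; the post keeps the kernel abstract).
def k(x, y):
    return (x @ y + 1.0) ** 2

def psi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1], 1.0])

rng = np.random.default_rng(4)
centres = rng.normal(size=(5, 2))   # points x_1, ..., x_5
alpha = rng.normal(size=5)          # coefficients alpha_1, ..., alpha_5

# f = sum_i alpha_i k(., x_i); its representation f(.) is a vector in feature space.
f_dot = sum(a * psi(xi) for a, xi in zip(alpha, centres))

x = rng.normal(size=2)
# Reproducing property: f(x) = <f(.), k(., x)> = sum_i alpha_i k(x_i, x).
lhs = f_dot @ psi(x)
rhs = sum(a * k(xi, x) for a, xi in zip(alpha, centres))
print(np.isclose(lhs, rhs))                      # True

# Canonical feature map: <k(., x), k(., x')> = k(x, x').
xp = rng.normal(size=2)
print(np.isclose(psi(x) @ psi(xp), k(x, xp)))    # True
```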
If things seem a bit confusing, don’t fret! Next time we’ll consider some alternative, equivalent ways of defining RKHS which may appear clearer and more intuitive to some. After that, we’ll start looking at what exactly RKHS is good for.
Acknowledgements
This (and other) posts draw heavily from Arthur Gretton’s notes on RKHS which can be found here: http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/rkhscourse.html
Bibliography
[1] N. Aronszajn, "Theory of Reproducing Kernels", Transactions of the American Mathematical Society, Vol. 68, No. 3 (May 1950), pp. 337–404. (http://www.math.drexel.edu/~tolya/Aronszajn_RKHS.pdf)