In a series of posts I hope to introduce Mathemafrica readers to some useful data analysis methods which rely on operations in a little backwater of Hilbert space theory, namely Reproducing Kernel Hilbert Space (or RKHS).

We’ll start with the “classic” example. Consider the data plotted in figure 1. Each data point has 3 “properties”: an x_1 coordinate, an x_2 coordinate and a colour (red or blue). Suppose we want to be able to separate all data points into two groups: red points and blue points. Furthermore, we want to be able to do this linearly, i.e. we want to be able to draw a line (or plane or hyperplane) such that all points on one side are blue, all points on the other are red. This is called linear classification.

Figure 1: A scatter of data with three properties: an x_1 coordinate, an x_2 coordinate and a colour.

Suppose for each data point we generate a representation of the data point \phi(x)=[x_1, x_2, x_1x_2] . We’ll call \phi the feature map and call \phi(x) the feature vector of x. Note that each feature vector exists in \mathbb{R}^3. Figure 2 shows a different plot of the data, or rather, of their features. Now we can linearly separate the blue and red data sets using the grey plane as shown. This simple example illustrates that mapping the data into higher dimensional spaces can sometimes allow us to “see” things about the data more clearly or definitively. In this case our feature maps were finite dimensional, but do they need to be?

Figure 2: The same data projected into R^3 using the projection mentioned earlier. Notice how the grey plane can separate the different coloured clusters.

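For readers who like to experiment, here is a minimal sketch of the same idea in code (my own illustration, assuming NumPy; the exact data behind the figures isn't reproduced, so points coloured by the sign of x_1x_2 stand in for it):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))      # points (x_1, x_2) in the plane
colour = np.sign(X[:, 0] * X[:, 1])        # +1 ("blue") or -1 ("red"), XOR-style

def phi(x):
    """Feature map phi(x) = [x_1, x_2, x_1*x_2]."""
    return np.array([x[0], x[1], x[0] * x[1]])

features = np.array([phi(x) for x in X])   # the data projected into R^3

# In feature space the plane x_3 = 0 (normal vector w = [0, 0, 1]) separates
# the two colours: the sign of <w, phi(x)> matches the colour of every point.
w = np.array([0.0, 0.0, 1.0])
print(np.all(np.sign(features @ w) == colour))   # True
```

For this stand-in data, no single line in the original plane achieves the separation, which is exactly why mapping into the higher-dimensional feature space helps.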

So far we haven’t actually mentioned what a kernel is. Informally, a kernel is a function of two data points, defined as the inner product between the feature vectors of those points. Intuitively, it tells us how ‘close’ the two points are in feature space (\phi maps data points to points in feature space). More technically, a kernel is defined as follows:

Definition 1 (kernel): Let \mathcal{X} be a non-empty set. A function k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R} is called a kernel if there exists a real Hilbert space \mathbb{H} and a map \phi:\mathcal{X}\rightarrow\mathbb{H} such that for every x,x'\in\mathcal{X}, k(x,x')=\langle\phi(x),\phi(x')\rangle_{\mathbb{H}}, where \langle\cdot,\cdot\rangle_{\mathbb{H}} is the inner product defined on \mathbb{H}.

Note that the only restriction on \mathcal{X} (the space in which our data points are defined) is that it be a non-empty set. Data can be anything! Numbers, vectors, strings of characters, images, you name it. Kernels have a few neat properties: they can be added together or multiplied together to form new kernels. Can you prove that the sum of two kernels defines a kernel? The product proof is a little more involved. Interested readers can see [1].
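As a hint for the sum (one possible sketch, not the only route): if k_1 and k_2 have feature maps \phi_1:\mathcal{X}\rightarrow\mathbb{H}_1 and \phi_2:\mathcal{X}\rightarrow\mathbb{H}_2, consider the “stacked” map \phi(x):=(\phi_1(x),\phi_2(x)) into the direct sum \mathbb{H}_1\oplus\mathbb{H}_2, whose inner product is the sum of the two inner products. Then

\langle\phi(x),\phi(x')\rangle_{\mathbb{H}_1\oplus\mathbb{H}_2}=\langle\phi_1(x),\phi_1(x')\rangle_{\mathbb{H}_1}+\langle\phi_2(x),\phi_2(x')\rangle_{\mathbb{H}_2}=k_1(x,x')+k_2(x,x'),

so k_1+k_2 satisfies Definition 1.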

There are some immediately useful repercussions of these results. For example, for any kernel k(x,x'), any positive integer n and any constant c\geq0, l(x,x'):=(k(x,x')+c)^n is also a valid kernel. Kernels of this form are referred to as polynomial kernels.
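
To make the polynomial kernel concrete, here is a minimal sketch (my own illustration, assuming NumPy) checking that, for the degree-2 case in \mathbb{R}^2, (\langle x,x'\rangle+c)^2 agrees with an explicit finite-dimensional feature-map inner product:

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, n=2):
    """Polynomial kernel l(x, x') = (<x, x'> + c)^n, with c >= 0."""
    return (np.dot(x, xp) + c) ** n

def phi(x, c=1.0):
    """One explicit feature map for the degree-2 case in R^2:
    (<x, x'> + c)^2 = <phi(x), phi(x')> with this choice of phi."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(poly_kernel(x, xp), phi(x) @ phi(xp))   # the two numbers agree
```

Expanding the square with the binomial theorem shows where each coordinate of this \phi comes from.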
A small extension to this idea will introduce us to a group of very important kernels.

The set l_2 is defined as the set of all square-summable infinite sequences (a_i)_{i=1}^\infty (see here for a reminder), i.e. those satisfying

\sum_{i=1}^\infty a_i^2<\infty.

This gives us the following lemma: let \mathcal{X} be a non-empty set and let the sequence of maps (\phi_i)_{i=1}^\infty be such that \phi_i:\mathcal{X}\rightarrow\mathbb{R} for every i and that, for every x\in\mathcal{X}, the sequence (\phi_i(x))_{i=1}^\infty is in l_2. Then we have that, for any x,x'\in\mathcal{X},

k(x,x'):=\sum_{i=1}^\infty \phi_i(x)\phi_i(x')<\infty

is a valid kernel. We call \phi_i the i^{\text{th}} coordinate of the feature map \phi.

The proof of this makes nice use of the Cauchy–Schwarz inequality and I recommend trying it at home (if you get stuck, see here for help.)
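
For reference, the key step is the Cauchy–Schwarz bound in l_2 (just a sketch, not the full proof):

\left|\sum_{i=1}^\infty\phi_i(x)\phi_i(x')\right|\leq\left(\sum_{i=1}^\infty\phi_i(x)^2\right)^{1/2}\left(\sum_{i=1}^\infty\phi_i(x')^2\right)^{1/2}<\infty,

since both sequences lie in l_2 by assumption. A natural choice of feature space is then \mathbb{H}=l_2 itself, with feature map x\mapsto(\phi_i(x))_{i=1}^\infty.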

The insightful reader may notice that such infinite sums appear often in applied mathematics, for example in the Taylor series expansions of infinitely differentiable functions. We can use a restricted version of this result to increase our repertoire of kernels.

Assume that the real-valued function f:\mathbb{R}\rightarrow\mathbb{R} can be written in the form

f(z)=\sum_{i=0}^\infty a_iz^i~~~z\in\mathbb{R}, |z|<r

for some r\in(0,\infty] and a_i\geq0 for all i. Then, for any positive integer d, let \mathcal{X}:=\{x\in\mathbb{R}^d:\|x\|<\sqrt{r}\}. We then have that, for any x,x'\in\mathcal{X},

k(x,x'):=f(\langle x,x'\rangle)=\sum_{i=0}^\infty a_i(\langle x,x'\rangle)^i<\infty

is a well-defined kernel.

Again, I encourage the reader to try and prove this. Use Cauchy–Schwarz: what bounds the inner product \langle x,x'\rangle? Use this to argue that the Taylor series converges. Then re-write \langle x,x'\rangle in ‘conventional’ summation notation… do some re-arranging and relabelling to make it look like a ‘conventional’ inner product… and voilà!
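
As a concrete member of this family (an example of my own, not from the post): taking f(z)=e^z, whose Taylor coefficients a_i=1/i! are non-negative and whose radius of convergence is infinite, gives the exponential kernel k(x,x')=\exp(\langle x,x'\rangle) on all of \mathbb{R}^d. A minimal numerical sketch, assuming NumPy:

```python
import math

import numpy as np

def exp_kernel(x, xp):
    """Exponential (Taylor-series) kernel: k(x, x') = exp(<x, x'>)."""
    return np.exp(np.dot(x, xp))

def exp_kernel_truncated(x, xp, terms=30):
    """The same kernel, approximated by truncating sum_i <x, x'>^i / i!."""
    z = np.dot(x, xp)
    return sum(z ** i / math.factorial(i) for i in range(terms))

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)
print(exp_kernel(x, xp), exp_kernel_truncated(x, xp))   # the two values agree closely
```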

All inner products on Hilbert spaces are positive definite, so all our kernels defined so far also have this property. What’s great is that the converse holds true too: for any symmetric positive definite function, there exists at least one (though usually many) feature map into a Hilbert space such that our original kernel definition holds true. More precisely…

Let \mathcal{X} be a non-empty set. A symmetric function k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R} is positive definite if, for any n\in\mathbb{Z}_+, any (a_1,\ldots,a_n)\in\mathbb{R}^n and any (x_1,\ldots,x_n)\in\mathcal{X}^n, we have that

\sum_{i=1}^n\sum_{j=1}^n a_ia_jk(x_i,x_j)\geq0.

Luckily, inner products always give us positive definite functions:
Let \mathbb{H} be any Hilbert space and \mathcal{X} a non-empty set. Let \phi:\mathcal{X}\rightarrow\mathbb{H}. Then k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}, defined by k(x,y):=\langle\phi(x),\phi(y)\rangle_{\mathbb{H}}, is positive definite.

Proof:

\sum_{i=1}^n\sum_{j=1}^n a_ia_jk(x_i,x_j)=\sum_{i=1}^n\sum_{j=1}^n\langle a_i\phi(x_i),a_j\phi(x_j)\rangle_{\mathbb{H}}=\left\Vert\sum_{i=1}^na_i\phi(x_i)\right\Vert^2_{\mathbb{H}}\geq0.
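
A quick numerical illustration of this (again a sketch of my own, assuming NumPy): build the Gram matrix K_{ij}=k(x_i,x_j) for a handful of points using the kernel from the running example, and check that the quadratic form \sum_{i,j}a_ia_jK_{ij} is non-negative for random coefficients:

```python
import numpy as np

def k(x, xp):
    """Kernel with feature map phi(x) = [x_1, x_2, x_1*x_2]."""
    return x[0] * xp[0] + x[1] * xp[1] + (x[0] * x[1]) * (xp[0] * xp[1])

rng = np.random.default_rng(1)
points = rng.normal(size=(10, 2))                              # ten points x_i in R^2
K = np.array([[k(x, xp) for xp in points] for x in points])    # Gram matrix

for _ in range(5):
    a = rng.normal(size=len(points))
    print(a @ K @ a >= 0)                                      # True each time (up to rounding)
```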

Let’s return to our earlier finite feature vector example

\mathbf{x}=[x_1,x_2] \rightarrow \phi(\mathbf{x})=[x_1,x_2,x_1x_2].

Suppose I define a function on \mathcal{X} which has the form

f(\mathbf{x})=a_1x_1+a_2x_2+a_3x_1x_2,

I can then just as easily write f(\mathbf{x}) as [a_1,a_2,a_3][x_1,x_2,x_1x_2]^\mathtt{T}=[a_1,a_2,a_3]\phi(\mathbf{x})^\mathtt{T}. We call [a_1,a_2,a_3] a representation of f and, following a fairly standard notational convention, write

f(\cdot)=\left[\begin{array}{c}a_1\\a_2\\a_3\end{array}\right]

to distinguish the representation of f from f itself. Often, though, the two are used interchangeably. Note that a given function may not have a unique representation and, more often than not, will have many equivalent representations. So why do we need the representation of f? Well, notice that, at any data point x, f can be evaluated using

f(x)=\langle f(\cdot),\phi(x)\rangle,

where I’ve dropped the subscript on the inner product for convenience. So f(\cdot) lives in the same Hilbert space as our features! Why can’t we “evaluate” other elements of the Hilbert space, then? In particular, why can’t we evaluate other features? Well, we can. Let’s evaluate the features of points x and y:

\langle \phi(x),\phi(y)\rangle=k(x,y)

we just get back the kernel. So, by extension of our notation, we can write

k(\cdot,y)=\left[\begin{array}{c}y_1\\y_2\\y_1y_2\end{array}\right]=\phi(y).

We call k(\cdot,y) the canonical feature map of this (Reproducing Kernel) Hilbert Space.

Now consider the space of all functions which have representations like that of f. This space (or rather, the space of representations of these functions), denoted \mathbb{F}, constitutes a Hilbert space and is said to have the reproducing property, namely that for all x\in\mathcal{X} and all f\in\mathbb{F}, we have \langle f,k(\cdot,x)\rangle=f(x). Of equal importance is that k(\cdot,x)\in\mathbb{F} for every x\in\mathcal{X}, something formally given to us by the Riesz representation theorem (see next post). The Hilbert space \mathbb{F} is what we call a Reproducing Kernel Hilbert Space and is uniquely defined by its kernel (something we’ll discuss next time).
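
To tie this back to the running example, here is a small sketch (my own, assuming NumPy) in which functions are identified with their representations in \mathbb{R}^3, and both evaluation and the reproducing property amount to inner products with the canonical feature map:

```python
import numpy as np

def phi(x):
    """Canonical feature map of the running example: k(., x) = [x_1, x_2, x_1*x_2]."""
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, y):
    """The associated kernel, k(x, y) = <phi(x), phi(y)>."""
    return phi(x) @ phi(y)

f_rep = np.array([1.0, -2.0, 0.5])   # representation of f(x) = x_1 - 2 x_2 + 0.5 x_1 x_2

x = np.array([0.7, 1.3])
y = np.array([-0.4, 2.0])

# Evaluation is an inner product with the canonical feature map: <f(.), k(., x)> = f(x).
print(f_rep @ phi(x), x[0] - 2 * x[1] + 0.5 * x[0] * x[1])   # same value
# "Evaluating" one feature against another just returns the kernel: <k(., x), k(., y)> = k(x, y).
print(phi(x) @ phi(y), k(x, y))                              # same value
```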

If things seem a bit confusing, don’t fret! Next time we’ll consider some alternative, equivalent ways of defining RKHS which may appear clearer and more intuitive to some. After that, we’ll start looking at what exactly RKHS is good for.

Acknowledgements
This (and other) posts draw heavily from Arthur Gretton’s notes on RKHS which can be found here: http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/rkhscourse.html

Bibliography
[1] N. Aronszajn, “Theory of Reproducing Kernels”, Transactions of the American Mathematical Society, Vol. 68, No. 3 (May 1950), pp. 337–404. (http://www.math.drexel.edu/~tolya/Aronszajn_RKHS.pdf)
