What we are talking about
Linear regression is a common and useful statistical tool. You will almost certainly have come across it if your studies have presented you with any sort of statistical problem.
The pros of regression are that it is relatively easy to implement and that the relationship between inputs and outputs is linear (it's in the name, but this simplifies interpreting the relationship significantly). On the downside, it relies fairly heavily on a frequentist interpretation of probability (which is a little counterintuitive) and it's very easy to draw erroneous conclusions from different models.
This post will deal with a measure of how good a model is: $R^2$. First, I will go through what this value means and what it measures. Then, I will discuss an example of how reliance on $R^2$ is a dangerous game when it comes to linear models.
What you should know
Firstly, let's establish a bit of context. We are doing linear regression, where we want to explain some output, which we will call $y$, using some inputs, which we will call $x_1, x_2, \dots$. In the examples presented here, we will consider the case where we have only one input, $x$, as it simplifies the discussion. However, this applies to the general case too. The explanation of this output is with a straight line: that is to say, we believe that the output is a linear function of the input (plus some noise/random variation) and therefore that we can model it as:

$$y = \beta_0 + \beta_1 x + \epsilon,$$

where $\epsilon$ is a normally distributed random variable with mean 0 that accounts for the random variation in what we observe.
To do this we fit a model using least squares minimisation (which is not discussed in detail here). In everyday language this means that we fit a line through the data. Loosely speaking, we want the line (representing our predictions) whose y values are closest, on average, to the y values of the real data.
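As a rough sketch of what this looks like in practice, the snippet below fits a straight line by least squares in Python. The data here is made up purely for illustration, and `np.polyfit` is just one convenient way of doing the minimisation:

```python
# A minimal sketch: fit a straight line by least squares using NumPy.
# The x and y values are made-up example data, not from this post.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with deg=1 performs the least-squares straight-line fit,
# returning the slope and intercept of the fitted line.
slope, intercept = np.polyfit(x, y, deg=1)

# Our predictions are the y values of that line at each x.
y_hat = intercept + slope * x
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```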
What is $R^2$?
$R^2$ can be described as the proportion of the total variation that is explained by performing the regression analysis. If we call our prediction for each data point $\hat{y}_i$, the observed value $y_i$, and the mean of the observed values of $y$ $\bar{y}$, then we can define the following useful things:
Corrected Total Sum of Squares = CTSS = $\sum_i (y_i - \bar{y})^2$
SSReg = $\sum_i (\hat{y}_i - \bar{y})^2$
CTSS is the sum of the squared differences between the true value of y for each data point and the mean of the y values.
SSReg is the sum of the squared differences between our predicted value of y for each data point and the mean of the y values.
$R^2$ is the ratio of these two values, i.e.

$$R^2 = \frac{\text{SSReg}}{\text{CTSS}}$$
Right. Let's look at this a bit more before we get into any pitfalls. If we look at the fraction above and imagine that all our predictions are exactly correct, i.e. $\hat{y}_i = y_i$ for every data point, it is clear to see that $R^2 = 1$. Alternatively, if we predict the mean $\bar{y}$ for every observation, $R^2 = 0$. Anything between these extremes will give an indication of how well we are explaining the data.
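Putting the definitions above into code, a minimal sketch (reusing the same made-up data and fit as the earlier snippet) might look like this:

```python
# Compute R^2 from CTSS and SSReg, as defined above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x          # predictions from the fitted line

y_bar = y.mean()
ctss = np.sum((y - y_bar) ** 2)        # total variation around the mean
ss_reg = np.sum((y_hat - y_bar) ** 2)  # variation explained by the fit

r_squared = ss_reg / ctss
print(f"R^2 = {r_squared:.3f}")        # close to 1 for this near-linear data
```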
The above image demonstrates a good linear model. For this model, the $R^2$ value is high. We can see that the data is much closer to the red line (our predictions) than it is to the mean (the blue line). This is great and can be really useful. Any other straight line fit to this data would produce a lower $R^2$ value, so we have a good model and we have done well at modelling this process.
What can go wrong
The pitfall of using $R^2$ to compare models is thinking that a high $R^2$ means a good model.
This is not necessarily the case, which will be demonstrated with 2 toy examples below.
The two images above show a potential pitfall of relying on $R^2$.
The first of these shows a model where the assumption of a linear relationship between the inputs and outputs is clearly violated. We can see that the regression fit is first below the data, then above the data, then below the data again. This gives us a clue that the relationship is quadratic (the relationship is indeed quadratic in this case, but we'll pretend we don't know that). So this model is bad! We have a linear model of a quadratic relationship: the model does not accurately capture the real trend. What about the $R^2$ value? Intuitively, we might say that it is high, because the red line is still much closer to the data than it is to the blue line. We'd be right: it's 0.97. This is partially because the underlying process generating the data is not very noisy. So here we have a good $R^2$ value for a bad model.
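To make this concrete, here is a hedged sketch of this first pitfall. The quadratic data below is made up for illustration (so it won't reproduce the exact 0.97 above), but a straight-line fit to it still tends to produce a high $R^2$:

```python
# Pitfall 1: a straight-line fit to data from a quadratic process
# can still produce a high R^2. Illustrative data only.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + x ** 2 + rng.normal(scale=2.0, size=x.size)  # quadratic trend, little noise

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"linear fit to quadratic data: R^2 = {r_squared:.2f}")  # typically above 0.9
```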
Now what about the second plot we see here? Here the line seems to go straight through the middle of the data with no clear pattern around it (the relationship looks linear). So it looks like an appropriate model, but if we look at our line it is pretty far away from some of our data points. Maybe it doesn't explain the data that well? The $R^2$ value is 0.56, not very good at all. Does this mean we have a bad model? Or that the above model is better? If we look at the plot again, we can see that the data itself has more noise than in the previous two plots (to see this, look at how the y values vary for x values that are close together). This noise/variation makes the process harder to model in general, and we must therefore expect worse values for this measure of performance. The model that we have is the best linear model that we can have for this data, and it violates no model assumptions. That is to say, with a single input, x, and following the usual process of fitting a regression, this is the best we can do. Here we have a bad $R^2$ value for a good model.
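And a matching sketch of the second pitfall, again with made-up data: a genuinely linear but noisy process gives a noticeably lower $R^2$ even though the straight-line model is the right one:

```python
# Pitfall 2: a correct straight-line model of a noisy linear process
# gives a modest R^2. Illustrative data only.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=6.0, size=x.size)  # linear trend, lots of noise

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"linear fit to noisy linear data: R^2 = {r_squared:.2f}")  # well below 1
```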
What this example aims to show is that $R^2$ can be useful in determining how good a model's fit might be, but that it does not tell us whether we have a good model all on its own. Specifically, we have considered the case of data with a non-linear trend and data with a linear trend but high variation by nature.
Summary
In summary, when doing linear modelling, it is important to a) look at and understand your data and b) check the model assumptions and make sure that none of them are violated.