Introduction
So far we have discussed what p-values are and how they are calculated, as well as how badly designed experiments can lead to artificially small p-values. The next topic comes from a paper by N.N. Taleb (1), in which he derives the meta-distribution of p-values, i.e. the range of p-values we might expect if we repeatedly ran an experiment, sampling from the same underlying distribution each time.
The derivations are fairly involved, and both the content and the implications of the results are fairly new to me, so please point out any discrepancies or misinterpretations so that they can be discussed.
Thankfully, this video (2) provides an explanation that covers some of what the paper says, along with some Monte Carlo simulations. My discussion will focus on simulations of my own, based on those done in the video.
What we are talking about
We have already discussed what p-values mean and how they can go wrong. Now we want to know how they vary under repeated experiments. We are drawing a random sample from an underlying distribution, which means that the test statistic we calculate is random, which in turn means that our p-value is a random variable. If we are going to interpret this random variable, then we should know something about it, namely its distribution. We will now demonstrate a simple method for approximating this distribution.
The procedure
Since the analytical distribution is quite involved, I will limit this to simulating the distribution of the p-values of different experiments by sampling.
This is done as follows:
- Specify the true distribution that the data will be drawn from. In practice this is not known, but here we sample from a known distribution so that we can observe the behaviour of p-values under different circumstances that we already understand
- Draw 10 000 samples (relating to 10 000 repeated experiments) of size 30 from that distribution
- For each experiment, calculate a test statistic and a p-value
- Plot the distribution of the (10 000) p-values
The distributions considered will be Gaussian with a standard deviation of 1 and varying means. The hypothesis being tested is that the mean is greater than 0. Thus, the test statistics will follow a t-distribution with 29 degrees of freedom (which we used in the first post on p-values).
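A minimal sketch of this procedure in Python is shown below. The function name and defaults are my own, and the details may differ from the simulations in the video; it simply draws 10 000 samples of size 30, runs a one-sided one-sample t-test on each, and plots the resulting p-values.

```python
# Sketch of the simulation: repeated experiments from Normal(mu, 1),
# one-sided t-test of H0: mean <= 0 vs H1: mean > 0, one p-value per experiment.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def simulate_p_values(mu, n_experiments=10_000, n=30, seed=0):
    """Return one p-value per simulated experiment."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=mu, scale=1.0, size=(n_experiments, n))
    means = samples.mean(axis=1)
    sds = samples.std(axis=1, ddof=1)
    t_stats = means / (sds / np.sqrt(n))        # t-statistic with n - 1 = 29 df
    return stats.t.sf(t_stats, df=n - 1)        # one-sided (upper-tail) p-values

p_values = simulate_p_values(mu=0.3)
plt.hist(p_values, bins=50)
plt.axvline(0.05, color="red")                  # the 5% cutoff
plt.xlabel("p-value")
plt.show()
```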
Results
The plots below show the results of the simulated experiments for data with different means (0, 0.05, 0.1 and 0.3). The red line in each plot shows the 5% cutoff value: anything to the left of this line means we reject the null hypothesis. The blue line shows the 'true' (or 'typical') p-value that we would get if our sample mean were exactly the true mean of the distribution from which we are sampling.
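For reference, this 'typical' p-value can be computed directly. The sketch below makes the simplifying assumption that the sample standard deviation also equals its true value of 1; the helper name is my own.

```python
import numpy as np
from scipy import stats

def typical_p_value(mu, n=30, sigma=1.0):
    """p-value obtained if the sample mean equals the true mean
    (sample sd assumed equal to sigma for this sketch)."""
    t_typ = mu * np.sqrt(n) / sigma
    return stats.t.sf(t_typ, df=n - 1)

typical_p_value(0.3)   # roughly 0.056 -- just above the 0.05 cutoff
```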
The first plot shows the case where the data is sampled from a distribution under which the null hypothesis is true, i.e. a Normal(0,1) distribution. Here the distribution of p-values is approximately uniform. We note that about 5% of the time we reject the null hypothesis (matching our 5% false positive rate) and that about half of the experiments have p-values below 0.5 (where we find the blue line). This is what we would expect in this situation. Looking at the first column of the table, the proportion of experiments with p-values below some typical cut-off points is roughly what would be expected from a uniform distribution.
Fig 1: mean = 0
It is in figures 2-4 that the pattern of interest begins to emerge. As the mean increases, the 'typical' p-value decreases and moves towards the 5% cutoff. Along with this move, the number of experiments that lead to a rejection of the null hypothesis increases. In a way this is good: even if a 'typical' draw from the distribution does not result in a rejection, the mean is different from 0, so it makes sense that we should be rejecting the null hypothesis more often. The flip side is that these rejections come from slightly atypical draws. But this is a little tangential to the point of this post.
As Taleb notes in the video: it is not that the p-values are stochastic that is surprising, but the asymmetry.
So what we should really be looking at is the shape of this distribution. The most striking features are the fat tails and the large number of experiments with p-values very close to 0. It is clear that as the true mean of the underlying distribution steadily increases, the bin that grows the fastest is the one at the far left of the plot (corresponding to p-values less than 0.02).
So what does this mean? It means that as soon as the underlying distribution differs even slightly from the distribution under the null hypothesis (even when a 'typical' experiment would not give a significant result), the distribution of p-values resulting from repeated experiments gains density fastest in the region of highly significant results.
Consider figure 4 (mean = 0.3) and the final column in the table. If we had drawn a sample with a mean equal to the true mean, we would have a p-value of just over 0.05, i.e. borderline significant. We note that about half the time we get a significant result, i.e. the p-value is below 0.05 about half the time (0.48 in the table). That's fine: if something is borderline and stochastic, we might expect the split to be roughly 50-50. What is more interesting is that, of the significant experiments, about one-third have a p-value of less than 0.005, a full order of magnitude lower (0.1544/0.48 ≈ 0.32). This means that about 1 out of every 6 experiments run on a phenomenon whose true underlying distribution is typically borderline significant will produce results that are highly significant.
Fig 2: mean = 0.05
| p < x   | mean = 0 | mean = 0.05 | mean = 0.1 | mean = 0.3 |
|---------|----------|-------------|------------|------------|
| 0.001   | 0.0007   | 0.0027      | 0.0048     | 0.0554     |
| 0.005   | 0.0048   | 0.0111      | 0.0185     | 0.1544     |
| 0.01    | 0.0093   | 0.02        | 0.0389     | 0.2269     |
| 0.05    | 0.0465   | 0.0849      | 0.1369     | 0.48       |
| typical | 0.5017   | 0.5018      | 0.4965     | 0.4992     |
The columns of this table correspond to the true mean of the distribution from which the data was sampled, and the rows correspond to various cut-off values, where 'typical' is the p-value calculated with the sample mean equal to the true mean. The entries show the proportion of experiments/simulations that gave p-values below each cut-off. Notice that as the true mean increases, the proportion of experiments giving extremely low p-values increases.
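As a rough check, the table entries could be reproduced from the simulated p-values along the following lines, reusing the hypothetical `simulate_p_values` and `typical_p_value` helpers sketched earlier (exact values will differ slightly from run to run).

```python
import numpy as np

means = [0, 0.05, 0.1, 0.3]
cutoffs = [0.001, 0.005, 0.01, 0.05]

for mu in means:
    p_values = simulate_p_values(mu)
    # proportion of experiments with p-values below each cut-off
    row = {x: np.mean(p_values < x) for x in cutoffs}
    row["typical"] = np.mean(p_values < typical_p_value(mu))
    print(mu, row)
```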
Discussion
This is a problem that does not have a clear solution, beyond 'use a far stricter threshold for significance' (which, of course, comes at a cost). It does, however, indicate that the behaviour of p-values can lead to unwanted results, even once we understand them and design experiments properly (the subjects of part 1 and part 2).
As noted, this post contains less established knowledge than the others, so any challenges/questions/complaints/edits are more than welcome.
1. https://arxiv.org/pdf/1603.07532.pdf
2. https://www.youtube.com/watch?v=8qrfSh07rT0