*Warning: wonkish. Also long (but there’s a handy jump).*

Over the course of a career, you become accustomed to reviewers raising strange objections to your work. As sample size builds, though, a few strange objections come up repeatedly – and that’s interesting. Today: the bizarre notion that one shouldn’t do significance testing with simulation data.

I’ve used computer simulation models as a tool many times over my career. The point of such a model is to compare the behaviour of a modeled system with and without a particular bit of modeled biology. I routinely use standard significance testing to make that comparison – to ask whether I have reason to believe that the bit of modeled biology makes a difference to the results. And several times now, I’ve run into the same strange objection to this.* Here’s one reviewer (paraphrased for length and anonymity):

*It seems inappropriate to do significance tests on simulations, because P-values depend on sample size. You can, therefore, get as small a P-value as you want simply by running more simulations.*

I hope you can see immediately why this objection doesn’t make any sense. It’s absolutely true that with simulations, one can “get as small a *P*-value as you like” by running more simulations – **but critically, *only* if the null hypothesis is false.** If there is no real effect, you can run all the simulations you like, and you’ll get significant *P*-values at the expected rate α, but never more.**

If this is obvious to you, great – and you’ll probably want to take this jump past my demonstration that it’s true. But in case it *isn’t* obvious, I ran – appropriately enough – some simulations.

Here’s what I did. I ran ordinary two-sample t-tests, with each individual test comparing means of X and Y, where I drew *n* values of X and then *n* values of Y from normal distributions. What I’m doing here is, essentially, simulating simulations! You can think of each t-test comparing Xs and Ys as a comparison between two sets of simulation runs with and without that focal bit of modeled biology. (Actually, you pretty much have to think of my t-tests that way; otherwise, what follows is nothing more than a straight-up demonstration of how *P*-values work.)

I ran my t-tests-that-represent-simulations with sample sizes of 10 (per group), then for 100, then for 1,000, and so on. This is exactly what reviewers object to: “simply running more simulations” (including more Xs and more Ys in each comparison by t-test). In order to see clearly what happens to *P*-values as we run more and more simulations, I repeated the whole process 1,000 times for each sample size to see how the distribution of P-values changes – or doesn’t.*** Here are the results:

Look at the red lines first. Those are for simulations in which the null hypothesis is false: the *n* values of X were drawn from a normal distribution with mean 10 and standard deviation 1, while the *n* values of Y were drawn from a normal distribution with mean 10.1 and standard deviation 1. (In other words, the focal bit of simulated biology changes the model outcome, although not very much – only from 10 to 10.1, or about 1%.) With a comparison based on just a few simulations (*n* = 10; red dotted line), power is very low. Individual runs aren’t very likely to yield a small *P*-value, and the mean *P*-value is quite large. If we run a few more simulations (*n* = 100; red dashed line), we start to see small *P*-values cropping up more frequently. If we run more simulations still (*n* = 1,000; red solid line), the *P*-value is usually small (75% are < 0.01). If we go all-out and run a ton of simulations (*n* = 10,000; red vertical line), the *P*-value is always small (the distribution can’t be plotted on this scale, but all 1,000 were < 0.001 and the mean was 2 × 10⁻⁷). So, when the null hypothesis is false, everything works as the reviewer suggests: running more simulations yields smaller *P*-values. And so it should: with more simulations, we get data that are increasingly unlikely under the null, and we’re more and more confident that we’re seeing a real effect.

Now the black lines. These are for simulations in which the null hypothesis is true: the *n* values of X and the *n* values of Y are both drawn from a normal distribution with mean 10 and standard deviation 1. (In other words, the focal bit of simulated biology does *not* change the model outcome.) This time, running more simulations makes absolutely no difference: whether you run 10 simulations (*n* = 10; black dotted line), or 1,000 (*n* = 1,000; black dashed line), or 100,000 (*n* = 100,000; black solid line), the *P*-values are consistent with a distribution uniform on [0,1]. (We can breathe a sigh of relief, because if they weren’t, something would be seriously wrong with the universe.)

So: with simulation data, running more and more simulations and seeing what happens to *P*-values is in fact very nicely diagnostic. *P*-values shrink under the alternative hypothesis, but they do not do so under the null. There’s nothing surprising about this, of course: it’s just frequentist statistics working exactly as it should.
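That diagnostic pattern is easy to reproduce. Here’s a minimal Python sketch of the same exercise (the post’s own code, in the footnotes, is in R); it uses a two-sample z-test with known standard deviation 1 as a stand-in for the post’s t-test, which is nearly identical at these sample sizes. All function names here are mine, not from the post.

```python
import math
import random
import statistics

def two_sample_p(x, y):
    """Two-sided two-sample z-test P-value, assuming sd = 1 in both groups.
    (A close stand-in for the post's equal-variance t-test at these n.)"""
    n = len(x)
    z = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(2 / n)
    # two-sided P-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def p_values(n, delta, reps=500, rng=random.Random(42)):
    """P-values from `reps` simulated comparisons of n Xs vs n Ys,
    where Y's true mean exceeds X's by `delta`."""
    out = []
    for _ in range(reps):
        x = [rng.gauss(10, 1) for _ in range(n)]
        y = [rng.gauss(10 + delta, 1) for _ in range(n)]
        out.append(two_sample_p(x, y))
    return out

# Null true (delta = 0): the fraction significant stays near alpha = 0.05 at any n.
# Null false (delta = 0.1): the fraction significant climbs as n grows.
for n in (10, 100, 1000):
    null_rate = sum(p < 0.05 for p in p_values(n, 0.0)) / 500
    alt_rate = sum(p < 0.05 for p in p_values(n, 0.1)) / 500
    print(n, round(null_rate, 3), round(alt_rate, 3))
```

Running this shows exactly the asymmetry in the figure: the null-true rejection rate hovers around 0.05 no matter how many simulations you run, while the null-false rate climbs steadily with *n*.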

I think the more interesting question here is not whether you can use *P*-values with simulation data (of course you can!) – it’s what might lead someone to think that you can’t. Where does this strange objection come from? I can think of four possibilities, arranged here from least to most interesting (to me).

First, I may simply be running into reviewers who lack an intuitive understanding of basic statistics. That’s more common than you might think, even among professional biologists who use statistics to make inference; I suspect it’s partly because the subject is often abysmally taught. I may simply have had a reviewer who doesn’t realize that when the null is true, the expected distribution of *P*-values *doesn’t* depend on sample size. *[Note: in the first version of the post I wrote “when the null is false…”, which is about as embarrassing as a brain misfire can get! I knew what I meant…grrr.]*

Second, I may be running into reviewers who don’t understand the distinction between *P*-values and effect sizes. If you (mistakenly) believe that a small *P*-value indicates an important effect, then it would indeed be worrisome that *P*-values depend on sample sizes. But that’s not what *P*-values do. Running more simulations can make you more and more sure of an effect you’re seeing, but it won’t affect your estimate of how large that effect is.
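The distinction is easy to see numerically. In this hypothetical Python sketch (not from the post, whose code is R), the estimated effect size converges on its true value of 0.1 as *n* grows, while the *P*-value keeps shrinking toward zero; the z-test with known sd = 1 again stands in for a t-test.

```python
import math
import random
import statistics

rng = random.Random(1)

def z_test_p(x, y):
    """Two-sided P-value from a two-sample z-test, assuming sd = 1 in both groups."""
    z = (statistics.fmean(y) - statistics.fmean(x)) / math.sqrt(2 / len(x))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# True effect: Y's mean exceeds X's by 0.1 (about 1%, as in the post's example).
for n in (100, 10_000, 100_000):
    x = [rng.gauss(10.0, 1) for _ in range(n)]
    y = [rng.gauss(10.1, 1) for _ in range(n)]
    effect = statistics.fmean(y) - statistics.fmean(x)  # converges to ~0.1 as n grows
    p = z_test_p(x, y)                                  # keeps shrinking
    print(f"n={n}: effect ~ {effect:.3f}, P = {p:.2g}")
```

More simulations sharpen your certainty that the 0.1 difference is real; they don’t make the difference any bigger.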

Third, I may be running into reviewers who subscribe to a common but strange philosophical position about inference. The reviewer’s objection makes perfect sense if they believe that *the null hypothesis is always false* – in which case the null-hypothesis case in my demonstration (black lines in the plot) is irrelevant and there’s no useful diagnosis to be had from running more simulations. I hear this objection a lot, and it’s always mystified me, because I can, with trivial ease, come up with a null that I’m 100% confident is true (astrology might be involved, or in simulations a throw-away variable that isn’t actually used to calculate results). If those examples seem silly, well, they’re meant to be. If you’re skeptical that more relevant nulls are ever true, please consult a longer exposition here.

Fourth, I may be running into reviewers who think simulation models are fundamentally different from experiments. They aren’t. An experiment – whether it’s in the lab or in the field – is always a simplified model of the real world. So is a simulation model. The only difference is that in a simulation model the connection between inputs and outputs involves electrons in chips (and is usually fully specified); in an experiment it involves neurons firing, or DNA replicating, or plant roots taking up nutrients, or whatever. Sure, you can run more simulations merely by changing a parameter and waiting longer, while running more experimental replicates may be harder (involving money, ethics, space, you name it). But in terms of how inference works there *just isn’t any difference*.

I suspect that the fourth of these lurks behind my reviewer’s objection, and here’s why. The first three explanations account for an objection to *P*-values – but not for an objection to *P*-values *for simulations*. In fact, the more-replicates-will-shrink-the-*P*-value objection applies to experiments just as much as it applies to simulations (if, that is, it applies at all – which it doesn’t).

So it’s strange. This repeated objection, upon just a little close examination, makes no sense at all – and it betrays one of four very peculiar beliefs about the universe. Or more likely, it simply betrays the lack of that “close examination”. Which maybe isn’t strange at all. Most of us – probably all of us – hold a few beliefs that would crumble rapidly under close examination. Yes, even scientists (and I describe one of my own here).

Why did I write this post? Well, I’ve seen the “you can’t use *P*-values with simulations” objection often enough that I’m pretty sure I’m going to see it again. When I do, my response can simply point here. If you find yourself in the same situation, yours can too.

*© Stephen Heard May 7, 2019*

*Writing clearly about statistics is hard (which is one reason I admire **Whitlock and Schluter** so much). If something here is confusing, please let me know!*

^Most recently, it happened last month in the reviews of a paper that’s now in press in *Conservation Biology*. Until I can link to the definitive version, the preprint is here. In case you’re wondering: yes, that paper will be published *with* *P*-values for its simulations, after a Response to Reviews that ended up being a dry run for this blog post.

**^Unless you P-hack by using a stopping rule dependent on *P*, of course.
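A quick illustration of that footnote, in a hypothetical Python sketch (not from the post): peeking at the *P*-value after every batch of runs and stopping as soon as it dips below 0.05 inflates the false-positive rate well above α, even though the null is true throughout.

```python
import math
import random
import statistics

rng = random.Random(7)

def one_sample_p(xs, true_mean=10.0):
    """Two-sided z-test P-value for mean(xs) = true_mean, assuming sd = 1."""
    z = (statistics.fmean(xs) - true_mean) / math.sqrt(1 / len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeking_trial(batches=10, batch_size=10):
    """Null is TRUE here. Add a batch of runs, test, stop if P < 0.05."""
    xs = []
    for _ in range(batches):
        xs += [rng.gauss(10.0, 1) for _ in range(batch_size)]
        if one_sample_p(xs) < 0.05:
            return True   # declared 'significant' -- a false positive
    return False

trials = 2000
honest = sum(one_sample_p([rng.gauss(10.0, 1) for _ in range(100)]) < 0.05
             for _ in range(trials)) / trials
hacked = sum(peeking_trial() for _ in range(trials)) / trials
print(f"fixed n = 100:         false-positive rate ~ {honest:.3f}")  # near alpha = 0.05
print(f"peek after each batch: false-positive rate ~ {hacked:.3f}")  # well above 0.05
```

The fixed-*n* test rejects a true null at about the nominal 5%; the stop-when-significant version rejects it several times more often. Decide the number of simulations in advance and the problem disappears.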

***^If you’d like to do this yourself, here’s the very simple R code:

```r
# set the two group means
mean1 <- 10
mean2 <- 10.1

# set a sample size (per group)
samplesize <- 10000

# set a number of replicate t-tests
numtests <- 1000

# run the replicate t-tests, keeping each P-value
tps <- replicate(numtests,
                 t.test(rnorm(samplesize, mean1),
                        rnorm(samplesize, mean2),
                        var.equal = TRUE)$p.value)

# quick-and-dirty look at the distribution of P-values
plot(density(tps))

# save the P-values for plotting elsewhere
write.csv(tps, file = '/data/hofalse-10000.csv')
```

Notice that I include quick-and-dirty plots, but I didn’t make the illustrative figure above in R. I didn’t have twelve hours to spare, so I used Excel and SigmaPlot (an *actual* graphing program) and made it in about 10 minutes. Don’t @ me.

Markus Eichhorn: Well, this paper disagrees with you in its conclusion, but invokes the same reason: when you set up the simulations, you know a priori that the null hypothesis is false, so you have invalidated the test from the outset.

https://onlinelibrary.wiley.com/doi/full/10.1111/j.1600-0706.2013.01073.x


ScientistSeesSquirrel (post author): Well, sure, in the trivial case that you already know that what you’re simulating has an effect! I’d question why one would even bother to do the research, if one has perfect foreknowledge of how the world/simulations work before doing it!

I’ve certainly done simulations in which a particular way of altering the modeled biology DIDN’T have any effect on the outcome. Unless we can be completely sure, all the time, that our judgement without statistics is correct (about which things will and won’t affect the outcome), then stats play their role. (And this is true of experiments too!)

[EDIT] – maybe “trivial” is too strong. I can probably imagine a case in which one knows the null is false, but does not know the magnitude of the effect? So it’s worth simulating to get effect sizes? Although still – what a strange position to be in, to be able to rule out a priori an effect size of zero, but no other effect size…


Jeff Houlahan: Steve, is this statement a typo? “I may simply have had a reviewer who doesn’t realize that when the null is false, the expected distribution of P-values doesn’t depend on sample size.” Don’t you mean that when the null is true the distribution of p-values doesn’t depend on sample size?

And I’m with Markus and the authors of the paper he references, that there is a fundamental difference between testing null hypotheses in experiments versus in simulations, because the person conducting the null hypothesis test created the simulation. It’s true that for some relationships that might be examined in a simulation it might not be very intuitive whether you’ve built a world where the null is true or the null is false. But, you have built a world where the null is true or false and I would think that an analytical dissection of your simulation would allow you to know whether the effect of variable A on variable B was zero or something different than zero. I would think this is true in all if not almost all cases. When you build a simulation world, you know the underlying model, the variables, the functional relationships, the parameter estimates and the decision rules. I’m not saying it would be trivial to sort out whether the world you built would result in a non-zero relationship between two variables (but, in many cases, I think it would be). We don’t know what the underlying model(s) is/are in the natural world or laboratory world so there is no way to find out analytically.

I would have said that estimating the size of an effect or the relative importance of a particular driver in a model, given the world you’ve built (especially if it was complex), would be important outputs of a simulation … and that these would be difficult to get at analytically. But, is an effect non-zero? I would be interested in looking at a model that you tested the null on to see if I felt that a careful look at the model wouldn’t have allowed somebody to ‘a priori’ sort out whether the null was true or not. I would be surprised if there were many instances where we couldn’t sort out the truth of the null without ever running the simulations. By the way, this is often the kind of challenge where I find that I’m wrong. I’m still surprised by how often my intuition fails me.

Best, Jeff


ScientistSeesSquirrel (post author): Jeff, good catch, you’re of course absolutely right about “when the null is TRUE”, and I’ve edited. Argh.

On your position on nulls in simulations: I don’t understand. If “an analytical dissection of your simulation would allow you to know whether the effect of variable A on variable B was zero or something different than zero”, then what are you running simulations for? Why aren’t you simply analyzing the analytical model?


Jeff Houlahan: Steve, poor choice of words by me – I didn’t mean “analytically” in the sense of solving something analytically versus numerically or graphically. I meant that any model, whether it be mathematical or cellular automata or individual-based, involves equations or decision rules – if you just look at the structure of the model (without seeing output), you could tell whether an effect will be non-zero or not. For you to run a simulation, you have to create the model that will drive the simulation, and you have built into the model whether the null is true or not. It may not be completely intuitive, but you know the equations or decision rules you have built into your model, and I believe (but may be wrong) that in the vast majority of cases you could look at the model you’ve built and, based on its structure, know whether an effect would be non-zero or not.


ScientistSeesSquirrel (post author): I think you’re mistaking what the “null” is, in a simulation. When you say “you built into the model whether the null is true or not”, that suggests you think that the null is a statement about the treatment. It is not – it’s a statement about the outcome. Use an experiment as an analogy. I dump phosphate into six ponds, not into the other six. If my null is “the two groups of ponds don’t differ (immediately) in phosphate”, well, sure, I know it’s false; but I’m not doing anything people would recognize as interesting statistical analysis. If my null is “frogs in the two groups of ponds grow equally quickly”, then I don’t know if that’s false or not.

Same thing in a simulation. If I set a = 1 in one set of sims and a = 2 in another, then sure, if you think the null is “a has the same value”, then it’s false by definition. But that’s not an interesting null. If the null is “b has the same average value in both”, then if you know for a fact that THAT null is false, why on earth are you devoting the effort to running simulations?


Jeff Houlahan: This example is great because it illustrates either my key point or my key misunderstanding. So, let’s keep the variables the same in the experiment and in the simulation (a = phosphate levels and b = frog growth rate). In the experiment, I don’t know what the underlying model is. It might be that phosphate increases primary productivity, which increases food availability resulting in higher growth rates in the ponds that received phosphate. But, it might not be. It might be that phosphate doesn’t affect primary productivity or that food availability doesn’t affect growth rate.

On the other hand, if growth rate is an output of your simulation and phosphate levels are an input, you must write into the code some indirect or direct relationship between phosphate levels and growth rate to get a non-zero effect. If you don’t write a direct or indirect effect into the code, the effect will be zero. And in most cases, you could look at the code and figure out if there was a zero or non-zero effect.

We don’t have the code in the experiment. We have the code in the simulation.


ScientistSeesSquirrel (post author): Yes, agreed. And if the code is trivial enough that there is a known deterministic effect from a to b, then you’re right. I don’t quite see the point of such simulations. Perhaps this is the key difference in philosophy: I’m often simulating behaviour, or something like that, with randomness and complex environmental effects and *I don’t know what’s going to happen in advance*. If I do know what’s going to happen in advance, but I simulate anyway, the problem isn’t one of statistical philosophy – it’s more about my life choices!


Jeff Houlahan: I promise I will stop after this, Steve – I can’t imagine code (no matter how complex) where it wouldn’t be possible, at least in principle, to figure out whether there was a known deterministic effect from a to b. If there is, the null will be false, and if there isn’t, the null will be true. So, when you say *I don’t know what’s going to happen in advance* – I’m saying, one could…always. It’s that last part – ‘…always’ – that I’m not certain of.


Chris Edge: I think I fall between you two. You can’t build a model unless you know there are underlying relationships between the things you want to model – how else would you parameterize it or derive the underlying equations? Building on the phosphate/frog-growth model above, there are three equations:

eq. 1: B (phosphate) = B; we can measure it with no error and know how much is added

eq. 2: R (primary productivity) = B * X + err, where X translates B into primary productivity

eq. 3: G (frog growth) = R * Y + err, where Y translates R into frog growth

One could run a series of simulations with different B (1, 2), with err for both error-containing equations drawn from a defined distribution. In the end we want to know if G differs between B = 1 and B = 2. A statistical test would achieve this goal. However, a more intuitive test would be to allow B to vary between two values over many simulations and calculate an effect size for B. There is no need for a *P*-value in this case, because all we are interested in is the effect size of B on G.


maxmaxxamxam: I started writing a comment here objecting to your objection #3, but then I read your other post and I think that Gavin Simpson’s comments (https://scientistseessquirrel.wordpress.com/2017/04/03/two-tired-misconceptions-about-null-hypotheses/#comment-3180) do an excellent job of illustrating why all null hypotheses can be false.

Sort of a tangent from the point of the post, I suppose… I’d agree that if significance testing can be applied as a heuristic for the real world, why not also use it for simulations. (You just need to remember that all nulls are false when you’re using it!)


ScientistSeesSquirrel (post author): Gavin’s comment is an interesting one. Importantly, he does not argue that all nulls are false. He does argue that all plausible nulls may be false. I don’t think that’s a useful way to think about things – at best, it’s circular (which ones are plausible? The ones for which we know the null is false)! But I’ll admit that there’s room for interesting arguments of the form Gavin makes, even if I don’t agree with them!


Justin: Use of *P*-values in simulations is perfectly fine. I’m not sure I understand the critics’ argument. I also think I can construct an example where I add more data (increase n), but the Xs that I add don’t make the test statistic more extreme (so the *P*-value doesn’t automatically get smaller).

Justin

http://www.statisticool.com


Michael Bode: Better late than never to this discussion, I suppose!

Could you give an example where two models with different dynamics (structure, uncertainty distributions, parameter values, etc.) resulted in interesting outputs that were, in fact, statistically indistinguishable? In your graphed example, the situation where sample size made no difference was when the two models were the same model (and where a glance at the code would indicate that a statistical test was unnecessary).
