The following is an archived copy of a message sent to the CASI Analysis List run by Cambridge Solidarity with Iraq.

Views expressed in this archived message are those of the author, not of Cambridge Solidarity with Iraq (CASI).



[casi-analysis] Summary of criticisms of the Lancet study of Iraqi mortality

[ This message has been sent to you via the CASI-analysis mailing list ]

Dear All,

Below is a summary and evaluation of criticisms of the Lancet study of mortality in Iraq.

The original text is better rendered as a web page than as an email and can be found on:


Per Klevnas


November 11, 2004

Lancet roundup and literature review

Posted by Daniel

Well, the Lancet study has been out for a while now, and it seems as good a time as any to take 
stock of the state of the debate and wrap up a few comments which have hitherto been buried in 
comments threads. Lots of heavy lifting here has been done by Tim Lambert and Chris Lightfoot; I 
thoroughly recommend both posts, and while I’m recommending things, I also recommend a short 
statistics course as a useful way to spend one’s evenings (sorry); it really is satisfying to be 
able to take part in these debates and, I would imagine, pretty embarrassing and 
frustrating not to be able to. As Tim Lambert commented, this study has been “like flypaper for 
innumerates”; people have been lining up to take a pop at it despite being manifestly not in 
possession of the baseline level of knowledge needed to understand what they’re talking about. 
(Being slightly more cynical, I suggested to Tim that it was more like “litmus paper for hacks”; 
it’s up to each individual to decide for themselves whether they think a particular argument is an 
innocent mistake or not). Below the fold, I summarise the various lines of criticism and whether 
they’re valid or (mostly) not.

Starting with what I will describe as “Hack critiques”, without prejudice to the possibility that 
they might in isolated individual cases be innocent mistakes. These are arguments which are purely and simply 
wrong and should not be made because they are, quite simply, slanders on the integrity of the 
scientists who wrote the paper. I’ll start with the most widespread one.

The Kaplan “dartboard” confidence interval critique

I think I pretty much slaughtered this one in my original Lancet post, but it still spread; 
apparently not everybody reads CT (bastards). To recap; Fred Kaplan of Slate suggested that because 
the confidence interval was very wide, the Lancet paper was worthless and we should believe 
something else like the IBC total.

This argument is wrong for three reasons.

1) The confidence interval describes a range of values which are “consistent” with the model[1]. But 
it doesn’t mean that all values within the confidence interval are equally likely, so you can just 
pick one. In particular, the most likely values are the ones in the centre of a symmetrical 
confidence interval. The single most likely value is, in fact, the central estimate of 98,000 
excess deaths. Furthermore, as I pointed out in my original CT post, the truly shocking thing is 
that, wide as the confidence interval is, it does not include zero. You would expect to get a 
sample like this fewer than 2.5 times out of a hundred if the true number of excess deaths was less 
than zero (that is, if the war had made things better rather than worse).
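To make that arithmetic concrete: if we treat the reported 95% interval of roughly 8,000 to 194,000 as coming from a normal distribution around the 98,000 central estimate (a back-of-envelope approximation of my own, not the authors’ actual model-based calculation), the implied standard error and tail probabilities can be recovered in a few lines:

```python
import math

def normal_cdf(x, mu, sigma):
    """Normal CDF via the error function (standard library only)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

centre = 98_000            # central estimate of excess deaths
lo, hi = 8_000, 194_000    # reported 95% confidence interval (approximate)

# For a 95% normal interval, the full width is 2 * 1.96 standard errors
se = (hi - lo) / (2 * 1.96)

p_below_zero = normal_cdf(0, centre, se)       # chance the war made things better
p_below_lo   = normal_cdf(lo, centre, se)      # lower tail, near 2.5%
p_above_hi   = 1 - normal_cdf(hi, centre, se)  # upper tail -- the half of the
                                               # range Kaplan didn't talk about

print(f"implied standard error : {se:,.0f}")
print(f"P(excess deaths < 0)   : {p_below_zero:.1%}")
print(f"P(below {lo:,})        : {p_below_lo:.1%}")
print(f"P(above {hi:,})        : {p_above_hi:.1%}")
```

The tails come out near, not exactly, 2.5%, because the published interval is not quite symmetric about 98,000; and the probability that excess deaths were below zero comes out around 2%, consistent with the “fewer than 2.5 times out of a hundred” point above.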

2) As the authors themselves pointed out in correspondence with the management of Lenin’s Tomb,

“Research is more than summarizing data, it is also interpretation. If we had just visited the 32 
neighborhoods without Falluja and did not look at the data or think about them, we would have 
reported 98,000 deaths, and said the measure was so imprecise that there was a 2.5% chance that 
there had been less than 8,000 deaths, a 10% chance that there had been less than about 45,000 
deaths,….all of those assumptions that go with normal distributions. But we had two other pieces of 
information. First, violence accounted for only 2% of deaths before the war and was the main cause 
of death after the invasion. That is something new, consistent with the dramatic rise in mortality 
and reduces the likelihood that the true number was at the lower end of the confidence range. 
Secondly, there is the Falluja data, which imply that there are pockets of Anbar, or other 
communities like Falluja, experiencing intense conflict, that have far more deaths than the rest of 
the country. We set these data aside in the statistical analysis because the result in this 
cluster was such an outlier, but it tells us that the true death toll is far more likely to be on 
the high-side of our point estimate than on the low side.”

That is, the sample contains important information which is not summarised in the confidence 
interval, but which tells you that the central estimate is not likely to be a massive overestimate. 
The idea that the central 98,000 number might be an underestimate seemed to have blown the mind of 
a lot of commentators; they all just seemed to act like it Did Not Compute.

3) This gave rise to what might be called the use of “asymmetric rhetoric about a symmetric 
confidence interval”, but which I will give the more catchy name of “Kaplan’s Fallacy”. If your 
critique of an estimate is that the range is too wide, then that is one critique you can make. 
However, if this is all you are saying (“this isn’t an estimate, it’s a dartboard”), then 
intellectual honesty demands that you refer to the whole range when using this critique, not just 
the half of it that you want to think about. In other words, it is dishonest to title your essay 
“100,000 dead – or 8,000?” when all you actually have arguments to support is “100,000 dead – or 
8,000 – or 194,000?”. This is actually quite a common way to mislead with statistics; say in 
paragraph 1 “it could be more, it could be less” and then talk for the rest of the piece as if 
you’ve established “it’s probably less”.

The Kaplan piece was really very bad; as well as the confidence interval fallacy, there are the 
germs of several of the other fallacious arguments discussed below. It really looks to me as if 
Kaplan had decided he didn’t want to believe the Lancet number and so started looking around for 
ways to rubbish it, in the erroneous belief that this would make him look hard-headed and 
scientific and would add credibility to his endorsement of the IBC number. I would hazard a guess 
that anyone looking for more Real Problems For The Left would do well to lift their head up from 
the Bible for a few seconds and ponder what strange misplaced and hypertrophied sense of 
intellectual charity it was that made Kaplan, an antiwar Democrat, decide to engage in hackish 
critiques of a piece of good science that supported his point of view.

The cluster sampling critique

There are shreds of this in the Kaplan article, but it reached its fullest and most widely-cited 
form in a version by Shannon Love on the Chicago Boyz website. The idea here is that the cluster 
sampling methodology used by the Lancet team (for reasons of economy, and of reducing the very 
significant personal risks for the field team) reduces the power of the statistical tests and makes 
the results harder to interpret. It was backed up (wayyyyy down in comments threads) by people who 
had gained access to a textbook on survey design; most good textbooks on the subject do indeed 
suggest that it is not a good idea to use cluster sampling when one is trying to measure rare 
effects (like violent death) in a population which has been exposed to heterogeneous risks of those 
rare events (i.e. some places were bombed a lot, some a little and some not at all).

There are two big problems with the cluster sampling critique, and I think that they are both so 
serious that this argument is now a true litmus test for hacks; anyone repeating it either does not 
understand what they are saying (in which case they shouldn’t be making the critique) or does 
understand cluster sampling and thus knows that the argument is fallacious. The problems are:

1) Although sampling textbooks warn against the cluster methodology in cases like this, they are 
very clear about the fact that the reason why it is risky is that it carries a very significant 
danger of underestimating the rare effects, not overestimating them. This can be seen with a simple 
intuitive illustration; imagine that you have been given the job of checking out a suspected 
minefield by throwing rocks into it.

This is roughly equivalent to cluster sampling a heterogeneous population; the dangerous bits are a 
fairly small proportion of the total field, and they’re clumped together (the mines). Furthermore, 
the stones that you’re throwing (your “clusters”) only sample a small bit of the field at a time. 
The larger each individual stone, the better, obviously, but equally obviously it’s the number of 
stones that you have that is really going to drive the precision of your estimate, not their size. 
So, let’s say that you chuck 33 stones into the field. There are three things that could happen:

a) By bad luck, all of your stones could land in the spaces between mines. This would cause you to 
conclude that the field was safer than it actually was.

b) By good luck, you could get a situation where most of your stones fell in the spaces between 
mines, but some of them hit mines. This would give you an estimate that was about right regarding 
the danger of the field.

c) By extraordinary chance, every single one of your stones (or a large proportion of them) might 
chance to hit mines, causing you to conclude that the field was much more dangerous than it 
actually was.

How likely is the third of these possibilities (analogous to an overestimate of the excess deaths) 
relative to the other two? Not very likely at all. Cluster sampling tends to underestimate rare 
effects, not overestimate them[2].
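The minefield intuition is easy to check by simulation. The sketch below uses illustrative parameters of my own choosing (not the Lancet team’s design): it lays clumped mines in a strip of ground, repeatedly throws a handful of stones that each sample a small contiguous patch, and counts how often the resulting estimate understates or overstates the true mine density.

```python
import random

random.seed(42)

N = 10_000          # cells in the field
CLUMPS = 2          # mines come in contiguous clumps, not spread evenly
CLUMP_SIZE = 100
STONES = 12         # stones ("clusters") thrown per survey
STONE_COVER = 10    # contiguous cells each stone samples
TRIALS = 5_000

# Lay the minefield: two non-overlapping clumps of mines.
mines = set()
while len(mines) < CLUMPS * CLUMP_SIZE:
    start = random.randrange(N - CLUMP_SIZE)
    if all(c not in mines for c in range(start, start + CLUMP_SIZE)):
        mines.update(range(start, start + CLUMP_SIZE))
true_rate = len(mines) / N

under = over = 0
total_est = 0.0
for _ in range(TRIALS):
    hits = 0
    for _ in range(STONES):
        s = random.randrange(N - STONE_COVER)
        hits += sum(1 for c in range(s, s + STONE_COVER) if c in mines)
    est = hits / (STONES * STONE_COVER)
    total_est += est
    if est < true_rate:
        under += 1
    elif est > true_rate:
        over += 1

print(f"true mine rate : {true_rate:.3f}")
print(f"mean estimate  : {total_est / TRIALS:.3f}")  # roughly unbiased on average
print(f"underestimates : {under / TRIALS:.1%}")      # the typical outcome
print(f"overestimates  : {over / TRIALS:.1%}")       # rarer, but large when they happen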

And 2), this problem, and other issues with cluster sampling (basically, it reduces your effective 
sample size to something closer to the number of clusters than the number of individuals sampled) 
are dealt with at length in the sampling literature. Cluster sampling ain’t ideal, but needs must 
and it is frequently used in bog-standard epidemiological surveys outside war zones. The effects of 
clustering on standard results of sampling theory are known, and there are standard pieces of 
software that can be used to adjust (widen) one’s confidence interval to take account of these 
design effects. The Lancet team used one of these procedures, which is why their confidence 
intervals are so wide (although, to repeat, not wide enough to include zero). I have not seen 
anybody making the clustering critique who has any argument at all from theory or data which might 
give a reason to believe that the normal procedures are wrong for use in this case. As Richard 
Garfield, one of the authors, said in a press interview, epidemics are often pretty heterogeneously 
distributed too.
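The standard adjustment alluded to here is the Kish design effect: clustering inflates the sampling variance by roughly DEFF = 1 + (m − 1)ρ, where m is the cluster size and ρ is the intra-cluster correlation, so a naive confidence interval must be widened by √DEFF. The numbers below are purely illustrative; I am assuming 33 clusters of 30 households and guessing a value for ρ, and the Lancet team’s actual software-based adjustment was more involved than this.

```python
import math

clusters = 33   # number of clusters -- an assumption for illustration
m = 30          # households per cluster -- an assumption for illustration
rho = 0.05      # intra-cluster correlation -- a guess, for illustration

# Kish design effect: how much clustering inflates the sampling variance
deff = 1 + (m - 1) * rho            # 2.45 with these numbers

# The effective sample size shrinks toward the number of clusters...
n = clusters * m
n_eff = n / deff

# ...and a naive confidence interval half-width widens by sqrt(DEFF)
widening = math.sqrt(deff)

print(f"design effect         : {deff:.2f}")
print(f"nominal sample size   : {n}")
print(f"effective sample size : {n_eff:.0f}")
print(f"CI widening factor    : {widening:.2f}")
```

With these made-up numbers the effective sample size drops from 990 households to around 400, and the confidence interval is roughly 1.6 times wider than a naive calculation would give, which is the kind of thing the wide published interval already reflects.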

There is a variant of this critique which is darkly hinted at by both Kaplan and Love, but neither 
of them appears to have the nerve to say it in so many words[3]. This would be the critique that 
there is something much nastier about the sample; that it is not a random sample, but is 
cherry-picked in some way. In order to believe this, if you have read the paper, you have to be 
prepared to accuse the authors of telling a disgusting barefaced lie, and presumably to accept the 
legal consequences of doing so. They picked the clusters by the use of random numbers selected from 
a GPS grid. In the few cases in which this was logistically difficult (read: insanely dangerous), 
they picked locations off a map and walked to the nearest household. There is no realistic way in 
which a critique of this sort can get off the ground; in any case, it affected only a small 
minority of clusters.

The argument from the UNICEF infant mortality figures

I think that the source for this is Heiko Gerhauser, in various weblog comments threads, but again 
it can be traced back to a slightly different argument about death rates in the Kaplan piece. The 
idea here is that the Lancet study finds a prewar infant mortality rate of 29 per 1000 live births 
and a postwar infant mortality rate of 54 per 1000 live births. Since the prewar infant mortality 
rate was estimated by UNICEF to be over 100, this (it is argued) suggests that the study is giving 
junk numbers and all of its conclusions should be rejected.

This argument was difficult to track down to its lair, but I think we have managed it. One weakness 
is similar to the point I’ve made above; if you believe that the study has structurally 
underestimated infant mortality, then isn’t it also likely to have underestimated adult mortality? 
The authors discuss a few reasons why the movement in infant mortality might be exaggerated 
(mainly, issues of poor recall by the interview subjects), though, and it is good form to look very 
closely at any anomalies in data.

Which is what Chris Lightfoot did.

Basically, the UNICEF estimate is quoted as a 2002 number, but it is actually based on detailed, 
comprehensive, on-the-ground work carried out between 1995 and 1999 and extrapolated forward. The 
method of extrapolation is not one which would take into account the fact that 1999 was the year in 
which the oil-for-food program began to have significant effects on child malnutrition in Iraq. No 
detailed on-the-ground survey has been carried out since 1999, and there is certainly no systematic 
data-gathering apparatus in Iraq which could give any more solid number. The authors of the study 
believe that the infant mortality rates in neighbouring countries are a better comparator than 
pre-oil for food Iraq, and since one of them is Richard Garfield, who was acknowledged as the 
pre-eminent expert on sanctions-related child deaths in the 1990s, there is no reason to gainsay this judgement.

I’d add to Chris’ work a theory of my own, based on the cluster sampling issue discussed above. 
Infant mortality is rare, and it is quite possibly heterogeneously clustered in Iraq (not least, 
post-war, a part of the infant mortality was attributed to babies being born at home because it was 
too dangerous to go to hospital). So it’s not necessarily the case that one needs to have an 
explanation of why they might have been undersampled in this case. Since this undersampling would 
tend to underestimate infant mortality both before and after the war, it wouldn’t necessarily bias 
the estimate of the relative risk ratio and therefore the excess deaths. I’d note that my theory 
and Chris’s aren’t mutually exclusive; I suspect that his is the main explanation.
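The point about the risk ratio is simple arithmetic: if the sampling misses the same fraction of infant deaths before and after the war, both level estimates come out too low but their ratio is untouched. A toy check using the study’s reported rates (the 70% capture rate below is an invented number, there purely to show the cancellation):

```python
pre, post = 29, 54          # reported infant deaths per 1,000 live births
true_ratio = post / pre     # relative risk as measured

capture = 0.7               # suppose the survey catches only 70% of infant
                            # deaths in BOTH periods (invented, for illustration)
pre_obs, post_obs = pre * capture, post * capture

# The undercount scales both rates equally, so it cancels in the ratio
assert abs(post_obs / pre_obs - true_ratio) < 1e-12
print(f"relative risk, with or without the undercount: {true_ratio:.2f}")
```

So an undercount of this kind would bias the infant mortality levels but not the roughly twofold rise, which is what drives the excess-deaths estimate.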

We now move into the area of what might be called “not intrinsically hack” critiques. These are 
issues which one could raise with respect to the study which are not based on either definite or 
likely falsehoods, but which do not impugn the integrity of the study, and which are not themselves 
based on evidence strong enough to make anyone believe that the study’s estimates were wrong unless 
they thought so anyway.

There are two of these that I’ve seen around and about.

The first might be called the “Lying Iraqis” theory. This would be the theory that the interview 
subjects systematically lied to the survey team. In fact, the team did attempt to check against 
death certificates in a subsample of the interviews and found that in 81% of cases, subjects could 
produce them. This would lead me to believe that there is no real reason to suppose that the 
subjects were lying. Furthermore, I would suspect that if the Iraqis hate us enough to invent 
deaths of household members to make us look bad in the Lancet, that’s probably a fairly serious 
problem too. However, the possibility of lying subjects can’t be ruled out in any survey, so it 
can’t be ruled out in this one, so this critique is not intrinsically hackish. Any attempt to 
bolster it either with an attack on the integrity of the researchers, or with a suggestion that the 
researchers mainly interviewed “the resistance” (they didn’t), however, is hack city.

The second, which I haven’t really seen anyone adopt yet, although some people looked like they 
might, could be called the “Outlier theory”. This is basically the theory that this survey is one 
gigantic outlier, and that a 2.5% probability event has happened. This would be a fair enough thing 
to believe, as long as one admitted that one was believing in something quite unlikely, and as long 
as it wasn’t combined with an attack on the integrity of the Lancet team.

Finally, we come onto two critiques of the study which I would say are valid. The first is the one 
that I made myself in the original CT post; that the extrapolated number of 98,000 is a poor way to 
summarise the results of the analysis. I think that the simple fact that we can say with 97.5% 
confidence that the war has made things worse rather than better is just as powerful and doesn’t 
commit one to the really quite strong assumptions one would need to make for the extrapolation to 
be valid.

The second one is one that is attributable to the editors of the Lancet rather than the authors of 
the study. The Lancet’s editorial comment on the study contained the phrase “100,000 civilian 
deaths”. The study itself counts excess deaths and does not attempt to classify them as combatants 
or civilians. The Lancet editors should not have done this, and their denial that they did it to 
sensationalise the claim ahead of the US elections is unconvincing. This does not, however, affect 
the science; to claim that it does is the purest imaginable example of argumentum ad hominem.

Finally, beyond the ultra-violet spectrum of critiques are those which I would classify as “beyond 
hackish”. These are things which anyone who gave them a moment’s thought would realise are 
irrelevant to the issue.

In this category, but surprisingly and disappointingly common in online critiques, is the attempt 
to use the IBC numbers as a stick to beat the Lancet study. The two studies are simply not 
comparable. One final time; the Iraq Body Count is a passive reporting system[4], which aims to 
count civilian deaths as a result of violence. Of course it is going to be lower than the Lancet 
number. Let that please be an end of this.

And there are a number of odds and ends around the web of the sort “each death in this study is 
being taken to stand for XXYY deaths and that is ridiculous”. In other words, arguments which, if 
true, would imply that there could be no valid form of epidemiology, econometrics, opinion polling, 
or indeed pulling up a few spuds to see if your allotment has blight. This truly is flypaper for 
innumerates.

I would also include in this category attempts like that of the Obsidian Order weblog to chaw down 
the 98,000 number by making more or less arbitrary assumptions about what proportion of the excess 
deaths one might be able to call “combatants” and thus people who deserved to die. This is exactly 
what people accuse the Lancet of doing; it’s skewing a number by means of your own subjective 
assessment. Not only is there no objective basis for the actual subjective adjustments that people 
make, but the entire distinction between combatants and civilians is one which does not exist in 
nature. As a reason for not caring that 98,000 people might have died, because you think most of 
them were Islamofascists, it just about passes muster. As a criticism of the 98,000 figure, it’s 
worthless.

Finally, there is the strange world of Michael Fumento, a man who is such a grandiose and 
unselfconscious hack that he brings a kind of grandeur to the role. I can no more summarise what a 
class A fool he’s made of himself in these short paragraphs than I could summarise King Lear. Read 
the posts on Tim’s site and marvel. And if your name is Jamie Doward of the Guardian, have a word 
with yourself; not only are you citing blogs rather than reading the paper, you’re treating Flack 
Central Station as a reliable source!

The bottom line is that the Lancet study was a good piece of science, and anyone who says otherwise 
is lying. Its results (and in particular, its central 98,000 estimate) are not the last word on the 
subject, but then nothing is in statistics. There is a very real issue here, and any pro-war person 
who thinks that we went to war to save the Iraqis ought to be thinking very hard about whether we 
made things worse rather than better (see this from Marc Mulholland, and a very honourable mention 
for the Economist). It is notable how very few people who have rubbished the Lancet study have 
shown the slightest interest in getting any more accurate estimates; often you learn a lot about 
people from observing the way that they protect themselves from news they suspect will disconcert 
them.


[1] This is not the place for a discussion of Bayesian versus frequentist statistics. Stats teachers 
will tell you that it is a fallacy and wrong to interpret a confidence interval as meaning that 
“there is a 95% chance that the true value lies in this range”. However, I would say with 95% 
confidence that a randomly selected stats teacher would not be able to give you a single example of 
a case in which someone made a serious practical mistake as a result of this “fallacy”, so I say 
think about it this way.

[2] Pedants would perhaps object that the more common mines are in the field, the less the tendency to 
underestimate. Yes, but a) by the time you got to a stage where an overestimate became seriously 
likely, you would be talking not about a minefield, but a storage yard for mines with a few patches 
of grass in it and b) we happen to know that violent death in Iraq is still the exception rather 
than the norm, so this quibble is irrelevant.

[3] And quite rightly so; if said in so many words, this accusation would clearly be defamatory.

[4] That is, they don’t go out looking for deaths like the Lancet did; they wait for someone to report 
them. Whatever you think about whether there is saturation media coverage of Iraq (personally, I 
think there is saturation coverage of the green zone of Baghdad and precious little else), this is 
obviously going to be a lower bound rather than a central estimate, and in the absence of any hard 
evidence about casualties there is no reason at all to suppose that we have any basis other than 
convenient subjective air-pulling to adjust the IBC count for how much of an undersample we might 
want to believe they are making.
