Scientific Cooking (Part 1)
This posting is not about the type of cooking that's done on a stove, but the type that's done in a laboratory (or more frequently, on a computer with lab results). There is much cooking that can be done to make raw research look a particular way. All scientists cook – papers don't get published with just a table of statistics – but whether they cook their results in a straightforward, logical manner that is easily reproduced, or dress their results up to appear to be something they are not, is the difference between legitimate research and misrepresentation.
Facts are misrepresented for many reasons – intent cannot be determined from a paper in isolation. The reasons range from accident (setting up a measurement wrong or miscalculating) to ignorance (inadvertently biasing results) to willful ignorance (trying to slant the results) to outright fraud (falsifying results). Even the gold standard in evidence-based IBD research – the double-blind study – doesn't completely protect against misrepresentation. At one end, investigator bias can creep into both protocols and interpretations. At the other, outright alteration of the aggregate results can occur. This post deals with some of the inadvertent bias that can creep into the results of honest researchers, and provides some things to look for when evaluating a research paper.
Before getting started, an understanding of p values is necessary. The p value is the probability of obtaining a result the same as (or more extreme than) what was observed, assuming the null hypothesis is true. In layman's terms, the p value is used to determine whether a result is significant or not. If the p value is lower than a particular threshold, the result is assumed to be significant. The most common thresholds used are .05 and .01, which correspond (kind of*) to a 1 in 20 or a 1 in 100 chance that the result is random. The p value has often been criticized, with critics citing the arbitrary choice of a "significance threshold" of .05 and the ease with which results can be manipulated.(1)
Disraeli is quoted as saying "There are three kinds of lies: lies, damned lies, and statistics"**. This is as true in medical research as it is anywhere else. Dishonest researchers aside, how can an honest researcher fall victim to statistical traps?
Framing
Framing involves structuring an experiment's repetitions in a way that is advantageous to the researcher. Think of the difference between these two experimental designs:
1. Flip 5 quarters 20 times each to determine if any of them always lands on heads.
2. Flip 20 quarters 5 times each to determine if any of the coins always lands on heads.
At first glance, both look the same. If my hypothesis is that a quarter will always land on "heads", both experiments have 100 data points. The second, however, is really a repetition of the same experiment twenty times. In the first experiment, the probability of an individual set coming up all heads is 1/(2^20), or about 1 in 1,000,000, and the probability of any of the 5 sets coming up all heads is approximately 1 in 210,000. In the second, the probability of an individual set coming up all heads is 1/(2^5), or 1 in 32 – still a fairly unlikely occurrence. However, the probability of any of the 20 sets landing all heads is nearly 1 in 2 – a much more likely outcome. Even though both experiments rely on 100 coin flips, by changing the framing of the flips a researcher can affect the likelihood of getting a set with all heads.
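The framing arithmetic above is easy to verify; a minimal sketch:

```python
# Both designs use 100 flips in total, but the odds of seeing an
# "all heads" set differ enormously.
def p_any_all_heads(sets, flips_per_set):
    """Probability that at least one set of fair-coin flips is all heads."""
    p_one_set = 0.5 ** flips_per_set
    return 1 - (1 - p_one_set) ** sets

print(p_any_all_heads(5, 20))   # design 1: about 0.0000048, or 1 in 210,000
print(p_any_all_heads(20, 5))   # design 2: about 0.47, nearly 1 in 2
```

Same 100 flips, same coins – a roughly 100,000-fold difference in the odds of the "interesting" result, purely from the framing.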
Discarding Data
While an honest researcher would not intentionally remove
data that does not fit the expected results, they may choose to discard certain
data points. It is easy for a
researcher, after the fact, to remove specific patients from the results and
find a reason to do so. One of the most
common reasons is dropout – when an individual drops out of an experiment, the
researcher must determine how to count that individual. They have a few options:
1. They can drop the individual completely, as though they were never part of the experiment.
2. They can use the last known datapoint for that individual and extrapolate a result.
3. They can treat the dropouts as a separate category.
4. The dropouts can be counted using their last datapoint without extrapolation.
Because there are multiple options, a researcher can choose
the one that is most beneficial if the choice is made after the event. A better practice is to define what to do
with dropouts before running the experiment.
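To see how much the choice matters, here is a sketch with made-up numbers (the trial size and response counts are hypothetical, not from any study):

```python
# Hypothetical trial: 20 enrolled, 5 drop out before responding,
# and 9 of the 15 completers respond to treatment.
responders = 9
completers = 15
dropouts = 5

# Option 1: drop the dropouts entirely, as if never enrolled.
print(responders / completers)               # 0.6, a 60% response rate
# Option 4: count dropouts at their last datapoint (non-response).
print(responders / (completers + dropouts))  # 0.45, a 45% response rate
```

A 15-point swing in the headline response rate, purely from a bookkeeping choice – which is why the choice should be locked in before the trial starts.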
Similarly, a researcher may discard data that disagrees with
a hypothesis by mentally convincing themselves there was a protocol error. Consider this scenario:
1. The first clinical trial doctor finds that adverse events occurred in 1 out of their 20 patients.
2. The second clinical trial doctor finds that adverse events occurred in 2 out of their 20 patients.
3. The third clinical trial doctor finds that adverse events occurred in none of their 20 patients.
4. The fourth clinical trial doctor finds that adverse events occurred in 10 out of their 20 patients.
Many researchers, instead of concluding that adverse events occurred in 13 out of 80 patients, will look for protocol errors in the fourth dataset in an attempt to throw it out and conclude that 3 of 60 patients had adverse events. This is a bias in itself – the researcher should look for protocol errors in either all of the experiments or none of them. Even worse, the researcher may throw out the whole trial and run it again (and again and again) until they get the "expected" results.
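The arithmetic in the scenario above, spelled out:

```python
# Adverse events per 20-patient site; the fourth site is the
# "inconvenient" one the researcher is tempted to discard.
events_per_site = [1, 2, 0, 10]
patients_per_site = 20

all_sites = sum(events_per_site) / (4 * patients_per_site)
first_three = sum(events_per_site[:3]) / (3 * patients_per_site)
print(all_sites)    # 0.1625 (13 of 80 patients)
print(first_three)  # 0.05   (3 of 60 patients)
```

Dropping one site cuts the reported adverse event rate by more than a factor of three.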
High Dimensionality Experiments
When researchers are touting the benefits of a wonder cure
that has little prior probability of working, you will frequently see them go
on a fishing expedition by measuring dozens of different positive
outcomes. The researchers will then
cherry pick the outcome that supports their hypothesis (that the wonder cure
works). A scenario may go something like
this:
1. The researcher hypothesizes that taking a particular herb will have a positive impact on Crohn's disease.
2. The 50 patients will be split into two blinded groups of 25, with one group receiving the herb.
3. The researcher will conduct blood tests, physical tests, and self-reporting in 100 different areas.
In the above scenario, the researchers will frequently use the p value inappropriately. If 21 of the 25 patients in the herb group show improved CRP levels, they may conclude that, based on the p value, this is a significant finding. Unfortunately, when multiple dimensions are measured, the p value needs to be adjusted for the number of measurement categories. Choose enough categories and what looked like a significant effect is really just random noise. If enough measurements are taken, any group of individuals will improve on average in half and decline on average in half. Most of those changes will not be statistically significant, but with enough measurement buckets and no correction on p values, eventually one of them will appear to be significant – and you can bet the researcher will tout that as a success for their treatment. The even more dishonest researchers will omit the fact that other measurements were taken and discarded.
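Assuming the 100 measurements are independent, the odds of at least one false "finding" can be computed directly; a sketch:

```python
# Chance that at least one of `tests` truly-null measurements falls
# below the significance threshold `alpha` by luck alone.
def p_at_least_one_hit(tests, alpha):
    return 1 - (1 - alpha) ** tests

print(round(p_at_least_one_hit(100, 0.05), 3))        # 0.994: near-certain
# Bonferroni correction: shrink the per-test threshold to alpha / tests.
print(round(p_at_least_one_hit(100, 0.05 / 100), 3))  # 0.049: back under control
```

With 100 uncorrected measurements, a spurious "significant" result is all but guaranteed; the Bonferroni correction restores the intended 1-in-20 error rate across the whole experiment.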
Determining Measurements After the Fact
Similar to the issues with high dimensionality experimentation,
there can be ex post facto issues with selecting measurements. In an ideal experiment, the specific things
to be measured and the success criteria are defined beforehand to avoid bias. Unfortunately, some researchers wait until after the experiment to determine the
measurements. Consider the following
dataset looking at C Reactive Protein, a frequent measurement of inflammation
(a CRP > 3 mg/L is considered High):
CONTROL
Patient    CRP Before    CRP After
A          5             6
B          4             2
C          2             2
D          3             4
E          5             5

TEST
Patient    CRP Before    CRP After
F          5             5
G          4             5
H          3             2
I          3             2
J          5             6
The average CRP for each group remained completely unchanged (3.8 for CONTROL and 4.0 for TEST). Consider how the following true statements can be used, however, to slant the results and make it look like the TEST protocol was effective:
· Twice as many individuals using TEST dropped from a High CRP level to a moderate CRP level.
· CRP was lowered in twice as many patients in the TEST group.
· 40% of the TEST group showed improvement, compared to 20% of the CONTROL group.
While all of the above are true, they are misleading and don’t
really show there was actual efficacy.
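As a sketch, the first table can be recomputed to show how the group averages stay flat while the quoted fractions remain technically true:

```python
# (before, after) CRP per patient, taken from the tables above.
control = {"A": (5, 6), "B": (4, 2), "C": (2, 2), "D": (3, 4), "E": (5, 5)}
test = {"F": (5, 5), "G": (4, 5), "H": (3, 2), "I": (3, 2), "J": (5, 6)}

def mean_before_after(group):
    n = len(group)
    return (sum(b for b, a in group.values()) / n,
            sum(a for b, a in group.values()) / n)

def improved(group):
    """Number of patients whose CRP went down."""
    return sum(a < b for b, a in group.values())

print(mean_before_after(control))         # (3.8, 3.8): no change
print(mean_before_after(test))            # (4.0, 4.0): no change
print(improved(control), improved(test))  # 1 2: "twice as many improved"
```

Both the flat averages and the "twice as many improved" claim come from the same ten patients; which one gets quoted is the researcher's choice.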
The numbers can work the other way as well. Consider the following:
CONTROL
Patient    CRP Before    CRP After
A          5             5
B          4             2
C          2             1
D          3             5
E          5             5

TEST
Patient    CRP Before    CRP After
F          5             0
G          4             5
H          3             4
I          3             3
J          5             6
Like the first experiment, the results don't really show anything: the TEST group's average fell from 4.0 to 3.6, and the CONTROL group's from 3.8 to 3.6. However, consider the following true statements:
· The TEST group's average CRP fell twice as much as the CONTROL group's.
· The TEST group showed one patient completely cured of inflammation, compared to no one in the CONTROL group.
Part 2 will cover other ways that researchers cook the books, statistically speaking, to make their research appear more substantial than it really is.
Bottom Line
· There are many ways to make research look statistically sound, even though the underlying protocols are flawed.
· Research making wild claims with small a priori probability should be viewed through a skeptical lens.
* Technically the p value is only a measure of the evidence against the null hypothesis. In science-based medicine, the prior probability of the event occurring needs to be taken into account, in a more Bayesian approach. That said, the p value is a quick and dirty way to do a base check.
** It was actually Mark Twain quoting Disraeli – there is
quite a bit of doubt as to whether or not Disraeli actually uttered the words.
1. Sellke, Thomas; Bayarri, M. J.; Berger, James O. (2001). "Calibration of p Values for Testing Precise Null Hypotheses". The American Statistician 55 (1): 62–71. doi:10.1198/000313001300339950.