Sunday, August 25, 2013

Research Chicanery and IBD

Scientific Cooking (Part 1)


This posting is not about the type of cooking that’s done on a stove, but the type that’s done in a laboratory (or, more frequently, on a computer with lab results).  There is much cooking that can be done to make raw research look a particular way.  All scientists cook – papers don’t get published with just a table of statistics – but whether they present their results in a straightforward manner that is logical and easily reproduced, or dress their results up to appear to be something they are not, is the difference between legitimate research and misrepresentation.

Facts are misrepresented for many reasons – intent cannot be determined from a paper in isolation.  The reasons range from accident (setting up a measurement wrong or miscalculating) to ignorance (inadvertently biasing results) to willful ignorance (trying to slant the results) to outright fraud (falsifying results).  Even the gold standard in evidence-based IBD research – the double-blind study – doesn’t completely protect against misrepresentation.  At one end, investigator bias can creep into both protocols and interpretations.  At the other, outright alteration of the aggregate results can occur.  This post deals with some of the inadvertent bias that can creep into the results of honest researchers, and provides some things to look for when evaluating a research paper.

Before getting started, an understanding of p values is necessary.  The p value is the probability of obtaining a result the same as (or more extreme than) what was observed, assuming the null hypothesis is true.  In layman’s terms, the p value is used to decide whether a result is significant or not.  If the p value is lower than a particular threshold, the result is assumed to be significant.  The most common thresholds are .05 and .01, which correspond (kind of*) to a 1 in 20 or a 1 in 100 chance that the result is random.  The p value has often been criticized, both for the arbitrary choice of a “significance threshold” of .05 and for the ease with which results can be manipulated.(1)
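As a quick, hand-rolled illustration (not part of any study discussed here), this is what that definition looks like in code: the null hypothesis is that a coin is fair, the made-up observation is 16 heads in 20 flips, and the p value is the probability of a result at least that extreme under the null.

```python
# Minimal sketch of a one-sided p value for a coin-flip experiment.
# The observation (16 heads out of 20 flips) is invented for illustration.
from math import comb

def one_sided_p_value(heads, flips, p_null=0.5):
    """P(X >= heads) for X ~ Binomial(flips, p_null)."""
    return sum(comb(flips, k) * p_null**k * (1 - p_null)**(flips - k)
               for k in range(heads, flips + 1))

p = one_sided_p_value(16, 20)
print(f"p = {p:.4f}")  # ~0.0059, below the common .05 threshold
```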

Disraeli is quoted as saying “There are three kinds of lies:  lies, damned lies, and statistics”**.  This is as true in medical research as it is anywhere else.  Setting aside outright dishonesty, how can an honest researcher fall victim to statistical traps?

Framing


Framing involves structuring an experiment into sets in a way that is advantageous to the researcher.  Think of the difference between these two experimental designs:

1.       Flip 5 quarters 20 times each to determine if any of them always lands on heads.
2.       Flip 20 quarters 5 times each to determine if any of them always lands on heads.

At first glance, both look the same.  If my hypothesis is that a quarter will always land on “heads”, both experiments have 100 data points.  The second, however, is really a repetition of the same small experiment twenty times.  In the first design, the probability of an individual quarter landing heads on all 20 of its flips is 1/(2^20), or roughly 1 in a million, and the probability that any of the 5 quarters does so is roughly 5 in a million – about 1 in 200,000.  In the second design, the probability of an individual quarter landing heads on all 5 of its flips is 1/(2^5), or 1 in 32 – still a fairly unlikely occurrence.  However, the probability that at least one of the 20 quarters lands all heads is nearly 1 in 2 – a much more likely occurrence.  Even though both experiments rely on 100 coin flips, changing the framing of the flips lets a researcher dramatically change the likelihood of getting a set with all heads.
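Here is a minimal sketch of the arithmetic behind the two framings (the function name is just illustrative):

```python
# Chance that at least one quarter comes up heads on every one of its flips.
def p_any_all_heads(num_coins, flips_per_coin):
    p_single = 0.5 ** flips_per_coin           # one coin all heads
    return 1 - (1 - p_single) ** num_coins     # at least one coin all heads

print(p_any_all_heads(5, 20))   # ~0.0000048  (about 5 in a million)
print(p_any_all_heads(20, 5))   # ~0.47       (nearly 1 in 2)
```

The same 100 flips, sliced differently, move the chance of a “perfect” set from vanishingly small to roughly a coin toss.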

Discarding Data


While an honest researcher would not intentionally remove data that does not fit the expected results, they may choose to discard certain data points.  It is easy for a researcher, after the fact, to remove specific patients from the results and find a reason to do so.  One of the most common reasons is dropout – when an individual drops out of an experiment, the researcher must determine how to count that individual.  They have a few options:

1.       They can drop the individual completely as though they were never part of the experiment.
2.       They can use the last known datapoint for that individual and extrapolate a result.
3.       They can treat the dropouts as a separate category.
4.       The dropouts can be counted using their last datapoint without extrapolation.

Because there are multiple options, a researcher can choose the one that is most beneficial if the choice is made after the fact, as the sketch below illustrates.  A better practice is to define how dropouts will be handled before running the experiment.
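As a rough illustration of why the choice matters – all of the patient counts below are invented – here is how the headline response rate shifts depending on whether dropouts are excluded entirely (option 1) or kept in the denominator as non-responders (one way option 4 can play out when a dropout’s last datapoint showed no response):

```python
# Hypothetical numbers (not from any study): 20 patients enrolled, 4 drop out,
# and 12 of the 16 completers respond to treatment.
enrolled = 20
dropouts = 4
responders_among_completers = 12

completers = enrolled - dropouts

# Option 1: drop the dropouts entirely, as though they were never enrolled.
rate_dropouts_excluded = responders_among_completers / completers

# One reading of option 4: keep dropouts in the denominator and count them as
# non-responders, since their last datapoint showed no response.
rate_dropouts_as_failures = responders_among_completers / enrolled

print(f"dropouts excluded:          {rate_dropouts_excluded:.0%}")    # 75%
print(f"dropouts as non-responders: {rate_dropouts_as_failures:.0%}") # 60%
```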
Similarly, a researcher may discard data that disagrees with a hypothesis by mentally convincing themselves there was a protocol error.  Consider this scenario:

1.       The first clinical trial doctor finds adverse events in 1 of their 20 patients.
2.       The second clinical trial doctor finds adverse events in 2 of their 20 patients.
3.       The third clinical trial doctor finds adverse events in none of their 20 patients.
4.       The fourth clinical trial doctor finds adverse events in 10 of their 20 patients.

Many researchers, instead of concluding that adverse events occurred in 13 out of 80 patients, will look for protocol errors in the fourth dataset in an attempt to throw it out and conclude that adverse events occurred in 3 of 60 patients.  This is a bias in itself – the researcher should look for protocol errors in either all of the datasets or none of them.  Even worse, the researcher may throw out the whole trial and run it again (and again and again) until they get the “expected” results.
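The arithmetic is simple, but worth seeing side by side; this tiny sketch just recomputes the two rates from the scenario above:

```python
# Adverse event rate with and without the fourth site's data.
events_per_site = [1, 2, 0, 10]   # adverse events at each 20-patient site
patients_per_site = 20

rate_all_sites = sum(events_per_site) / (patients_per_site * len(events_per_site))
rate_fourth_dropped = sum(events_per_site[:3]) / (patients_per_site * 3)

print(f"all four sites kept: {rate_all_sites:.2%}")       # 13/80 = 16.25%
print(f"fourth site dropped: {rate_fourth_dropped:.2%}")  # 3/60  = 5.00%
```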

High Dimensionality Experiments


When researchers are touting the benefits of a wonder cure that has little prior probability of working, you will frequently see them go on a fishing expedition by measuring dozens of different positive outcomes.  The researchers will then cherry pick the outcome that supports their hypothesis (that the wonder cure works).  A scenario may go something like this:

1.       The researcher hypothesizes that taking a particular herb will have a positive impact on Crohn’s disease.
2.       Fifty patients will be split into two blinded groups of 25; one group will receive the herb and the other a placebo.
3.       The researcher will conduct blood tests, physical tests, and self-reporting in 100 different areas.

In the above scenario, the researchers will frequently use the p value inappropriately.  If 21 of the 25 patients in the treatment group show improved CRP levels, they may conclude that, based on the p value, this is a significant finding.  Unfortunately, when multiple dimensions are measured, the significance threshold needs to be adjusted for the number of measurement categories.  Choose enough categories and what looked like a significant effect is really just random noise.  If enough measurements are taken, any group of individuals will, on average, improve on about half of them and decline on the other half.  Most of those changes will not be statistically significant, but with enough measurement buckets and no correction to the threshold, eventually one of them will appear to be significant, and you can bet the researcher will tout that as a success for their treatment.  Even more dishonest researchers will omit the fact that other measurements were taken and discarded.
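A rough simulation makes the danger concrete.  In the sketch below (illustrative only; it assumes numpy/scipy and uses a plain t test rather than whatever a real trial would use), the two groups are drawn from the same distribution, so every one of the 100 outcome measures is pure noise – yet at a threshold of .05 roughly five of them come up “significant”, while a Bonferroni-corrected threshold (.05 / 100) catches almost none:

```python
# Two groups with NO real difference, compared on 100 invented outcome measures.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_outcomes, n_per_group, alpha = 100, 25, 0.05

p_values = []
for _ in range(n_outcomes):
    treatment = rng.normal(size=n_per_group)   # no real effect in either group
    control = rng.normal(size=n_per_group)
    p_values.append(ttest_ind(treatment, control).pvalue)

print(sum(p < alpha for p in p_values))               # typically ~5 false "hits"
print(sum(p < alpha / n_outcomes for p in p_values))  # Bonferroni: usually 0
```

Bonferroni is the bluntest possible correction; the point is simply that some correction for the number of comparisons has to be made before calling any single bucket significant.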

Determining Measurements After the Fact


Similar to the issues with high dimensionality experimentation, there can be ex post facto issues with selecting measurements.  In an ideal experiment, the specific things to be measured and the success criteria are defined beforehand to avoid bias.  Unfortunately, some researchers wait until after the experiment to determine the measurements.  Consider the following dataset looking at C-reactive protein (CRP), a frequent measure of inflammation (a CRP of 3 mg/L or higher is considered High):

CONTROL
Patient   CRP Before (mg/L)   CRP After (mg/L)
A         5                   6
B         4                   2
C         2                   2
D         3                   4
E         5                   5

TEST
Patient   CRP Before (mg/L)   CRP After (mg/L)
F         5                   5
G         4                   5
H         3                   2
I         3                   2
J         5                   6

The average CRP for each group remained completely unchanged (3.8 mg/L for the CONTROL group and 4.0 mg/L for the TEST group).  Consider, however, how the following true statements can be used to slant the results and make it look like the TEST protocol was effective:

·         Twice as many individuals using TEST dropped from a High CRP level to a moderate CRP level.
·         CRP was lowered in twice as many patients in the TEST group
·         40% of the TEST group showed improvement, compared to 20% of the CONTROL group

While all of the above are true, they are misleading and don’t really show there was actual efficacy.  The numbers can work the other way as well.  Consider the following:

CONTROL
Patient   CRP Before (mg/L)   CRP After (mg/L)
A         5                   5
B         4                   2
C         2                   1
D         3                   5
E         5                   5

TEST
Patient   CRP Before (mg/L)   CRP After (mg/L)
F         5                   0
G         4                   5
H         3                   4
I         3                   3
J         5                   6

Like the first example, the results don’t really show anything: the TEST group’s average fell from 4.0 to 3.6 mg/L, and the CONTROL group’s from 3.8 to 3.6 mg/L.  However, consider the following statements, which would all be true (and which are recomputed in the sketch after this list):

·         The TEST group’s average CRP fell twice as much as the CONTROL group
·         The TEST group showed one patient completely cured of inflammation, compared to no one in the control group
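To make the point concrete, here is a small sketch that recomputes those two statements directly from the made-up numbers in the second pair of tables:

```python
# (before, after) CRP values from the second pair of tables above.
control = {"A": (5, 5), "B": (4, 2), "C": (2, 1), "D": (3, 5), "E": (5, 5)}
test    = {"F": (5, 0), "G": (4, 5), "H": (3, 4), "I": (3, 3), "J": (5, 6)}

def mean_change(group):
    return sum(after - before for before, after in group.values()) / len(group)

def cured(group):
    return sum(1 for _, after in group.values() if after == 0)

print(mean_change(control), mean_change(test))  # -0.2 vs -0.4: "fell twice as much"
print(cured(control), cured(test))              # 0 vs 1: "one patient cured"
# ...yet both groups end at exactly the same average CRP (3.6 mg/L), and other
# post-hoc metrics could just as easily be chosen to favour the CONTROL group.
```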

Part 2 will cover other ways that researchers cook the books, statistically speaking, to make their research appear more substantial than it really is.

Bottom Line


·         There are many ways to make research look statistically sound, even though the underlying protocols are flawed
·         Research making wild claims with a small a priori probability should be viewed through a skeptical lens

* Technically, the p value is only a measure of the evidence against the null hypothesis.  In science-based medicine, the prior probability of the effect needs to be taken into account, in a more Bayesian approach.  That said, the p value is a quick and dirty way to do a base check.
** It was actually Mark Twain quoting Disraeli – there is quite a bit of doubt as to whether or not Disraeli actually uttered the words.

1.       Sellke, Thomas; Bayarri, M. J.; Berger, James O. (2001). "Calibration of p Values for Testing Precise Null Hypotheses". The American Statistician 55 (1): 62–71. doi:10.1198/000313001300339950.

