Tips for evaluating scientific evidence
How then do we validate the onslaught of products which supposedly claim to have scientific evidence? To combat this very problem an article was published in the journal of nature (a highly reputable scientific journal) with 20 tips for identifying scientific claims for validity. The tips are listed below and in some cases have been shortened from the original article by Sutherland et al., 2013.
The real world varies unpredictably. Science is mostly about discovering what causes the patterns we see. There are many explanations for such trends, so the main challenge of research is trying to identify which factor or process is causing the main effect. This challenge is complicated by the fact that most trends are affected by an innumerable number of other factors.
Practically all measurements have some error. If the measurement process were repeated, one might record a different result. In some cases, the measurement error might be large compared with real differences. Thus, if you are told that the economy grew by 0.13% last month, there is a moderate chance that it may have shrunk.
Experimental design or measuring devices may produce atypical results in each direction. For example, determining voting behavior by asking people on the street, at home or through the Internet will sample different proportions of the population, and all may give different results. Because studies that report 'statistically significant' results are more likely to be written up and published, the scientific literature tends to give an exaggerated picture of the magnitude of problems or the effectiveness of solutions. An experiment might be biased by expectations: participants provided with a treatment might assume that they will experience a difference and so might behave differently or report an effect. Researchers collecting the results can be influenced by knowing who received treatment. The ideal experiment is double-blind: neither the participants nor those collecting the data know who received what.
The average taken from many observations will usually be more informative than the average taken from a smaller number of observations. Thus, the effectiveness of a drug treatment will vary naturally between subjects. Its average efficacy can be more reliably and accurately estimated from a trial with tens of thousands of participants than from one with hundreds.
It is tempting to assume that one pattern causes another. However, the correlation might be coincidental, or it might be a result of both patterns being caused by a third factor — a 'confounding' or 'lurking' variable. For example, ecologists at one time believed that poisonous algae were killing fish in estuaries; it turned out that the algae grew where fish died. The algae did not cause the deaths.
Extreme patterns in data are likely to be, at least in part, anomalies attributable to chance or error. The next count is likely to be less extreme. For example, if speed cameras are placed where there has been a spate of accidents, any reduction in the accident rate cannot be attributed to the camera; a reduction would probably have happened anyway.
Patterns found within a given range do not necessarily apply outside that range. Thus, it is very difficult to predict the response of ecological systems to climate change, when the rate of change is faster than has been experienced in the evolutionary history of existing species, and when the weather extremes may be entirely new.
The ability of an imperfect test to identify a condition depends upon the likelihood of that condition occurring (the base rate). For example, a person might have a blood test that is '99% accurate' for a rare disease and test positive, yet they might be unlikely to have the disease. If 10,001 people have the test, of whom just one has the disease, that person will almost certainly have a positive test, but so too will a further 100 people (1%) even though they do not have the disease. This type of calculation is valuable when considering any screening procedure, say for terrorists at airports.
A control group is dealt with in the same way as the experimental group, except that the treatment is not applied. Without a control, it is difficult to determine whether a given treatment really had an effect. The control helps researchers to be reasonably sure that there are no confounding variables affecting the results. Sometimes people in trials report positive outcomes because of the context or the person providing the treatment, or even the color of a tablet. This underlies the importance of comparing outcomes with a control, such as a tablet without the active ingredient (a placebo).
Experiments should, wherever possible, allocate individuals or groups to interventions randomly. Comparing the educational achievement of children whose parents adopt a health program with that of children of parents who do not is likely to suffer from bias (for example, better-educated families might be more likely to join the program). A well-designed experiment would randomly select some parents to receive the program while others do not.
Results consistent across many studies, replicated on independent populations, are more likely to be solid. The results of several such experiments may be combined in a systematic review or a meta-analysis to provide an overarching view of the topic with potentially much greater statistical power than any of the individual studies. Applying an intervention to several individuals in a group, say to a class of children, might be misleading because the children will have many features in common other than the intervention. The researchers might make the mistake of pseudoreplication if they generalize from these children to a wider population that does not share the same commonalities. Pseudoreplication leads to unwarranted faith in the results. Pseudoreplication of studies on the abundance of cod in the Grand Banks in Newfoundland, Canada, for example, contributed to the collapse of what was once the largest cod fishery in the world.
Scientists have a vested interest in promoting their work, often for status and further research funding, although sometimes for direct financial gain. This can lead to selective reporting of results and occasionally, exaggeration. Peer review is not infallible: journal editors might favor positive findings and newsworthiness. Multiple, independent sources of evidence and replication are much more convincing.
Expressed as P, statistical significance is a measure of how likely a result is to occur by chance. Thus P = 0.01 means there is a 1-in-100 probability that what looks like an effect of the treatment could have occurred randomly, and in truth there was no effect at all. Typically, scientists report results as significant when the P-value of the test is less than 0.05 (1 in 20).
The lack of a statistically significant result (say a P-value > 0.05) does not mean that there was no underlying effect: it means that no effect was detected. A small study may not have the power to detect a real difference. For example, tests of cotton and potato crops that were genetically modified to produce a toxin to protect them from damaging insects suggested that there were no adverse effects on beneficial insects such as pollinators. Yet none of the experiments had large enough sample sizes to detect impacts on beneficial species had there been any.
Small responses are less likely to be detected. A study with many replicates might result in a statistically significant result but have a small effect size (and so, perhaps, be unimportant). The importance of an effect size is a biological, physical or social question, and not a statistical one. In the 1990s, the editor of the US journal Epidemiology asked authors to stop using statistical significance in submitted manuscripts because authors were routinely misinterpreting the meaning of significance tests, resulting in ineffective or misguided recommendations for public-health policy.
The relevance of a study depends on how much the conditions under which it is done resemble the conditions of the issue under consideration. For example, there are limits to the generalizations that one can make from animal or laboratory experiments to humans.
Broadly, risk can be thought of as the likelihood of an event occurring in some time frame, multiplied by the consequences should the event occur. People's risk perception is influenced disproportionately by many things, including the rarity of the event, how much control they believe they have and the adverseness of the outcomes, and whether the risk is voluntarily or not. For example, people in the United States underestimate the risks associated with having a handgun at home by 100-fold and overestimate the risks of living close to a nuclear reactor by 10-fold.
It is possible to calculate the consequences of individual events, such as an extreme tide, heavy rainfall and key workers being absent. However, if the events are interrelated, (for example a storm causes a high tide, or heavy rain prevents workers from accessing the site) then the probability of their co-occurrence is much higher than might be expected. The assurance by credit-rating agencies that groups of subprime mortgages had an exceedingly low risk of defaulting together was a major element in the 2008 collapse of the credit markets.
Evidence can be arranged to support one point of view. To interpret an apparent association between consumption of yoghurt during pregnancy and subsequent asthma in offspring, one would need to know whether the authors set out to test this sole hypothesis or happened across this finding in a huge data set. By contrast, the evidence for the Higgs boson specifically accounted for how hard researchers had to look for it — the 'look-elsewhere effect'. The question to ask is: 'What am I not being told?'
Any collation of measures (the effectiveness of a given school, say) will show variability owing to differences in innate ability (teacher competence), plus sampling (children might by chance be an atypical sample with complications), plus bias (the school might be in an area where people are unusually unhealthy), plus measurement error (outcomes might be measured in different ways for different schools). However, the resulting variation is typically interpreted only as differences in innate ability, ignoring the other sources. This becomes problematic with statements describing an extreme outcome ('the pass rate doubled') or comparing the magnitude of the extreme with the mean ('the pass rate in school x is three times the national average') or the range ('there is an x-fold difference between the highest- and lowest-performing schools'). League tables are rarely reliable summaries of performance.
End-of-Chapter Survey: How would you rate the overall quality of this chapter?
- Very Low Quality
- Low Quality
- Moderate Quality
- High Quality
- Very High Quality