If Not Now, It’ll Be Too Late. Science Must Change. We’re Regressing …


The year I graduated from high school, my family’s television required “rabbit ear” antennae with perched aluminum foil. Our farming family had little time to watch TV, but when we did, the ritual included a side trip to reset the antennae’s angle to ensure good reception. Today, I watch a clear picture on myriad devices, no antennae needed.

In the 1980s, my trips to a library to find medical literature were few. A single trip to the library would take hours and net only a small number of papers. Now, I obtain articles on any topic in a matter of minutes.

In the late 1980s and early 1990s, when I returned to school to learn decision analysis and health services research, decision models were simple because of constraints imposed by computing power. Only when I completed the laborious number-crunching process could I slowly fashion insights about care. Today, models are complex, sophisticated instruments running at light speed.

These remarkable advances in technology are obvious. But has the science of caring for people mirrored the same meteoric advances?

I think, actually, our ability to help patients with the technology of clinical science is regressing. The tools of science are not the impediment; the data we study and the weaknesses in our study designs are what thwart us.

This may seem a specious claim if one just superficially notes the number of resources for medical information. PubMed claims 29 million citations; 6,000 biomedical journals are indexed, but tens of thousands more journals are produced. How can it be that we are stagnant or even backtracking, given this treasure trove of information?

First, the fact that there are so many publications is a symptom of the problem, not a solution. Nearly any sort of paper will see the light of day in some journal, somewhere. Researchers can, even now, pay to be published. The mass of information is so unstructured and scientifically unsound that it would take an act of a computing God to find importance floating in the mess.

The biggest problem, in my view, then, is the quality of clinical science foisted upon us. A faster way to garner superior clinical knowledge is desperately needed; once that remedy is found, as a byproduct, the number of poor insights and publications will decline.

The weakest components of the way we presently conduct the scientific method are under-emphasized and under-appreciated. My opinion is informed by assessing scientific studies for more than 25 years as a medical editor, as a teacher of evidence-based medicine, and as a researcher and writer. Usually, critiques of evidence focus on how well the study was carried out. However, applying what we learn from a study to help patients is the problem, in my view, and present studies are limited in their ability to inform the vagaries of individuals.

To refine how insights from studies may be accurately generalized, the following components need improving. (I hint at the deficiency. Future blog posts will explore these in depth.)

  1. Studying the wrong populations in the first place. Our penchant to publish leads to research done with convenience samples, not random samples of patients. This leaves us uncertain if study results are generalizable.
  2. Lack of incorporating important variations in clinical/personal characteristics of studied populations. We flatten our insights by failing to include and plan for variations in individuals being studied. Hence, it is difficult to extrapolate study results to inform individuals whose characteristics, or combinations of characteristics, are unstudied.
  3. Failure to mask intervention and outcome. If a participant or researcher knows the assigned treatment or plan of care, and the outcome variable is subjective in nature, randomization is a waste and no insight is possible.
  4. Failed randomization when researchers are looking for small effect sizes. Random imbalances in numbers of patients in prognostic subgroups can negate insights, especially when small numbers of patients differ in outcome event rates for studied options.
  5. Inappropriately using data from comparative observational studies to inform patients. Observational comparative data, including big data, are either wrong or un-provable; acting on them is dangerous.
  6. Poor presentation of medical evidence to patients. Obfuscation is the norm.

Our present models of information management are outdated, slow and expensive, and information is often irrelevant by the time it reaches the patient’s bedside. We need a better approach to clinical science. At the end of this series, I will propose a model. 

Wrong Population. Wrong Results.

In 1936, the Literary Digest, a respected national magazine, undertook a public opinion poll. Who would win the race between Republican Alfred Landon, governor of Kansas, and Democratic incumbent Franklin D. Roosevelt? Mock ballots were mailed to 10 million people, and about 2.4 million responded, one of the largest survey samples ever created. Their prediction? Landon would carry the day.

Big Data 1936 Style. Note two socialists and a prohibitionist were on the ballot.

They were wrong, and by a landslide for FDR. That’s because respondents were biased toward Landon and did not represent the preferences of all voters. Notably, George Gallup accurately predicted FDR’s victory using a smaller, but representative, sample of about 50,000 people.
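To see why sheer size cannot rescue a biased sample, here is a minimal Python sketch. Every number in it (the 61 percent true share, the response probabilities, the population and sample sizes) is an illustrative assumption, not data from the 1936 polls, and the function name `simulate_poll` is hypothetical.

```python
import random


def simulate_poll(seed=1936):
    """Compare a large self-selected sample with a small random sample.

    All parameters are illustrative assumptions, not 1936 data.
    """
    rng = random.Random(seed)

    # Assumed electorate: True means the voter favors candidate A.
    true_share_a = 0.61
    population = [rng.random() < true_share_a for _ in range(500_000)]

    # Large biased sample: supporters of A are assumed to be less likely
    # to return a ballot than supporters of B.
    p_respond_a, p_respond_b = 0.15, 0.35
    biased = [
        favors_a
        for favors_a in population
        if rng.random() < (p_respond_a if favors_a else p_respond_b)
    ]

    # Small simple random sample: every voter has the same chance of selection.
    representative = rng.sample(population, 5_000)

    return {
        "true_share": sum(population) / len(population),
        "biased_n": len(biased),
        "biased_estimate": sum(biased) / len(biased),
        "random_n": len(representative),
        "random_estimate": sum(representative) / len(representative),
    }


if __name__ == "__main__":
    for name, value in simulate_poll().items():
        print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Under these assumptions, the self-selected sample is many times larger than the random one, yet its estimate falls on the wrong side of 50 percent, while the small random sample stays close to the true share.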

While that slice of presidential election history provides an excellent example of polling error, it also illustrates a significant issue in clinical science: reliable extrapolations from clinical studies depend upon the relevance of the studied population to those for whom the results are generalized.

Clinical science seeks to find statistically and clinically significant differences in outcomes between one plan and another. The size and accuracy of the measured difference, and the ability to use that difference number to extrapolate a study’s result, are the paramount duties of clinical science.

As researchers plan clinical science, much time and effort is devoted to plotting study design, ruminating about statistical “power” and presaging interpretation. These considerations focus on “internal validity,” how well the study, per se, is done. As a result, the “population studied” can become an afterthought (not ignored, but subordinated), which is a major reason that studies fail to properly inform.

Patients in a study are a part of an “eligible to be studied” whole. Clinical science uses information from the partial, studied group to infer to the whole. If the part does not share characteristics of the whole, inference is weak, or wrong. A study with flawless internal validity means little if results do not translate. An egregious example is the overreliance on clinical studies of heart disease among men as the basis for treatment protocols for women, despite differences in how women experience heart disease and in which treatments work best for them.

Let’s break this down further. There are three populations in research:

  1. The entire population of patients with the condition who are eligible to be studied.
  2. The part of the whole invited into a study.
  3. The group of invited patients who accept being in the study.

Unfortunately, the path from the entire eligible population to the group that actually participates may lead to a clinical trial with limited applicability. As an editor, I once saw a group of nearly 10,000 eligible people drop to only 300 who agreed to participate in the study. Based on the numbers alone, it’s hard to imagine that 300 could properly inform the 10,000.

Whether the small part can be generalized to the whole raises a question: Is the best study one with large numbers of patients? We’ve already seen the flaws in the 1936 Literary Digest poll, one of the largest studies ever done. How about a contemporary clinical science example? In 2011, the National Lung Screening Trial (NLST) was published; 53,454 patients were enrolled from 33 centers to test if CT scanning saved lives from lung cancer.

As with all trials, inclusion and exclusion criteria created the population of patients eligible for study; but we don’t know if all potential patients who would have been eligible were taken into account at the outset of the trial. Also, there is no description, either in the published report or in the study protocol sent to ClinicalTrials.gov, of how patients from the eligible population were invited. We don’t know if a consecutive sample was invited, a systematic sample scheme was used, a random sample of eligible patients was asked, or if doctors picked whom to ask.

This failure to describe which people got invited and who accepted being in the study is a profound omission, even, I would argue, a disqualifying one in terms of using the study to make policy or patient decisions. If the people were haphazardly, rather than systematically, invited, the large sample is nothing more than a convenience sample of handpicked patients. Random assignment to treatments after haphazard recruitment does not help us generalize results. It would be better to have a random sample of all eligible people at the outset.
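The point about randomizing after haphazard recruitment can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not a model of the NLST: the "responder" trait, the 20 and 10 percent risks, the 30 and 80 percent shares, and the function name `observed_risk_difference` are all hypothetical.

```python
import random


def observed_risk_difference(share_responders, n_recruited, rng):
    """Randomize recruits 1:1 and return control risk minus treated risk."""
    events = {"treated": 0, "control": 0}
    counts = {"treated": 0, "control": 0}
    for _ in range(n_recruited):
        # Hypothetical trait that determines whether treatment helps.
        responder = rng.random() < share_responders
        arm = "treated" if rng.random() < 0.5 else "control"
        # Assumed risks: 20% baseline, 10% for treated responders only.
        risk = 0.10 if (arm == "treated" and responder) else 0.20
        counts[arm] += 1
        events[arm] += rng.random() < risk
    return events["control"] / counts["control"] - events["treated"] / counts["treated"]


rng = random.Random(2011)
# Assumed: responders are 30% of all eligible patients but 80% of those
# a haphazard recruitment process happens to enroll.
effect_random_invites = observed_risk_difference(0.30, 200_000, rng)
effect_haphazard_invites = observed_risk_difference(0.80, 200_000, rng)

print(f"random invitation:    {effect_random_invites:.3f}")
print(f"haphazard invitation: {effect_haphazard_invites:.3f}")
```

Both simulated trials are properly randomized, so each gives a fair answer for the people it enrolled. Under these assumptions, the expected benefit is 0.30 × 0.10 = 0.03 in the eligible population but 0.80 × 0.10 = 0.08 in the haphazardly recruited group, so only the first estimate can be carried back to all eligible patients.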

Is there evidence that the NLST study population was not generalizable, and, therefore, of limited value to individual patients?

After publication, CT scanning was promoted based on the trial results, and centers began screening. The experiences of other sites did not replicate the NLST findings. For example, the Veterans Administration found that its screened patients were older than those in the NLST (53 percent over 65 years of age versus 27 percent for the NLST), were more likely to be current smokers, had more abnormalities on CT requiring follow-up, had fewer lower-stage cancers detected, and had a complication rate more than twice as high as that reported in the NLST. It also noted variations in patients’ experiences with outcomes, process, costs and complications across eight study sites.
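A small worked example shows how a difference in age mix alone can move an overall complication rate. The age shares (27 and 53 percent over 65) come from the comparison above; the complication rates for each age group are hypothetical placeholders chosen only for illustration, and the helper `overall_rate` is not from any published analysis.

```python
def overall_rate(share_over_65, rate_under_65, rate_over_65):
    """Weighted average of stratum-specific rates, weighted by age mix."""
    return (1 - share_over_65) * rate_under_65 + share_over_65 * rate_over_65


# Age mix reported in the text above.
NLST_SHARE_OVER_65 = 0.27
VA_SHARE_OVER_65 = 0.53

# Hypothetical complication rates by age group (assumptions, not trial data).
RATE_UNDER_65 = 0.01
RATE_OVER_65 = 0.04

nlst_like = overall_rate(NLST_SHARE_OVER_65, RATE_UNDER_65, RATE_OVER_65)
va_like = overall_rate(VA_SHARE_OVER_65, RATE_UNDER_65, RATE_OVER_65)

print(f"NLST-like age mix: {nlst_like:.4f}")
print(f"VA-like age mix:   {va_like:.4f}")
```

With identical age-specific rates, the older population ends up with an overall rate of about 2.6 percent versus about 1.8 percent. This does not reconstruct the Veterans Administration findings; it only illustrates that an average taken from one population need not hold in another with a different makeup.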

Large studies are large for a reason: the anticipated difference in outcomes between the arms of a randomized trial is small, and the base rates for outcomes are low. Some argue that a random sample of a population is unneeded when the base rates of outcome events are small, but the examples above nix that debate. Outcome rates, especially complication rates, vary by patients’ clinical and personal characteristics, and their means.

Large studies (the NLST is not the only example) that fail to include all eligible patients, or that do not randomly invite people from among all who are eligible, are off on the wrong foot from the get-go. Even simple random samples of patients may be inadequate for the future advancement of clinical studies, but that is a topic for a future post.

Clinical research must be more like the 1936 Gallup poll than the convenience sampling of even huge numbers of people. If clinical science can’t get the right population to study at the outset, advancing care via science will be slow and dangerous to some. Generalizability, not internal validity, should dominate study planning.
