A Brief History of Failure: Performance Measurement

By KIP SULLIVAN

Over the last three decades, health care policymakers in Minnesota, where I live, have gotten into a bad habit: They recommend policies without asking whether there is sufficient evidence to implement the policy, and without spelling out how the policy is supposed to work. Measurement and “pay for performance” (P4P) schemes illustrate the problem. Multiple Minnesota commissions, legislators, agencies, and groups have endorsed the notion that it’s possible to measure the cost and quality of doctors, clinics, and hospitals accurately enough to produce results useful to regulators, patients, providers, and insurers.

But these policymakers did so with no explanation of how system-wide measurement was supposed to be done accurately, and without any reference to research demonstrating that accurate system-wide measurement is financially or technically feasible. The Minnesota Health Care Access Commission (in 1991) and the Minnesota Health Care Commission (in 1993) were the first of several commissions to exhibit this “shoot-first, aim-later” mentality. Both commissions recommended the establishment of massive data collection and reporting systems, and both articulated breathtaking expectations of the “report cards” these systems would produce. According to the latter commission, for example, the data collection and number crunching would facilitate “feedback of data that reflects the entire scope of the health care process, from the inputs or structural characteristics of health care to the processes and outcomes of care.” (p. 134) Yet neither commission offered even the crudest details on how such a scheme would be executed or what it would cost, and, not surprisingly, neither commission offered evidence supporting its high hopes.

In 2008, two other commissions and the Minnesota Legislature exhibited the same casual attitude toward evidence and details. That year, the Legislature, egged on by the commissions, passed a law requiring the Minnesota Department of Health (MDH) to create a “standardized set” of quality measures for Minnesota that would be used to punish and reward “health care providers” (Minnesota Statutes, Section 62U.02). The law offered a few guidelines (for example, that MDH should “seek to avoid increasing the administrative burden on health care providers”), but it offered no details on how MDH was supposed to create useful measures.

Policymakers at the federal level have exhibited the same attitude. Like the half-dozen commissions that have advised the Minnesota Legislature over the last three decades, the Medicare Payment Advisory Commission (MedPAC) has endorsed measurement and P4P schemes for Medicare on the basis of zero empirical evidence and without working out the details. As the Minnesota Legislature followed the evidence-free recommendations of the Minnesota commissions, so Congress has followed MedPAC’s undocumented recommendations. MedPAC’s influence is most apparent in the Affordable Care Act of 2010 and the 2015 Medicare Access and CHIP Reauthorization Act, which enacted the nation’s largest P4P program (the insanely complex Merit-based Incentive Payment System).

The proliferation of reporting and P4P schemes has triggered “significant rethinking of measurement activities at the federal government, by national measurement organizations and health care payers, and within state governments,” as MDH put it in a February 2019 report to the Legislature. (p. 8) Minnesota’s Legislature is among those doing some rethinking. In 2017 it enacted a law that requires MDH to develop a “framework” for evaluating MDH’s “performance” measurement program, which was authorized by legislation enacted in 2008. Because feedback is useless if it is not accurate, MDH should make accuracy the single most important criterion in evaluating any proposed quality or cost measure. MDH should use this opportunity to explain to the Legislature why MDH’s measurement system and systems like it are grossly inaccurate.

Three impediments to accuracy

The inaccuracy of “performance” measurement has three distinct causes: 1) It measures a tiny fraction of the thousands of services a clinic or hospital delivers (the “bundled product” problem); 2) it is usually very difficult to determine which patient “belongs” to which doctor or clinic (the “attribution” problem); and 3) for all but the simplest of medical services, it is impossible to adjust scores accurately to reflect factors outside physician or hospital control (the “risk-adjustment” problem). I will illustrate each problem with an example, then examine each in more detail.

The “bundled product” problem is the easiest to understand. To illustrate it, consider this analogy. Imagine that you want to issue cost and quality report cards on Home Depot, Menard’s, and Lowe’s. For the sake of discussion, let’s say these stores sell ten thousand different items (appliances, tools, construction materials, paint, repair services, plants, etc.). You decide your report card will issue grades on just five items: sod, circular saws, tile cleaner, varnish, and drywall. You ignore the other 9,995 items and services. How useful is your report card?

Like home supply stores, clinics and hospitals sell thousands of services. There are 8,000 services doctors bill for (that’s roughly the number in the Current Procedural Terminology manual, the document all doctors use to find codes to put on their claim forms), and 68,000 diagnoses (that’s the number of diagnoses listed in the current iteration of the International Classification of Diseases maintained by the World Health Organization). MDH currently lists 29 measures on its website.
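The arithmetic behind this comparison can be sketched in a few lines of Python. This is only a rough illustration using the figures cited above; the function name `coverage_share` is made up for this example, and treating each measure as covering exactly one billable service is a simplifying assumption, not something MDH states.

```python
# Figures taken from the paragraph above.
CPT_SERVICES = 8_000    # approximate number of billable services in the CPT manual
MDH_MEASURES = 29       # measures MDH currently lists on its website


def coverage_share(measured: int, total: int) -> float:
    """Return the fraction of `total` items that are covered by `measured` items.

    Hypothetical helper: it assumes one measure covers one service,
    which is a simplification made only for illustration.
    """
    return measured / total


share = coverage_share(MDH_MEASURES, CPT_SERVICES)
print(f"Share of billable services with a measure: {share:.2%}")
```

Under these assumptions the share comes out well under one percent, and it would be smaller still if the 68,000 diagnoses were used as the denominator.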

To illustrate the “attribution problem,” consider again the “optimal diabetes” measure discussed in Part I (https://tinyurl.com/mp-sullivan-p1) of this two-part series—a measure that Minnesota Community Measurement (MNCM) and many other report-card manufacturers use. This measures the percent of a doctor’s or clinic’s diabetic patients who have their blood sugar and blood pressure under control, who take aspirin and statins, and who don’t smoke. Obviously, the first step in calculating these percentages is to determine which patients “belong” to which clinic. But how do you do that? If you don’t do it accurately, you will be rewarding or punishing doctors for patients they don’t see.

To illustrate the third obstacle to accuracy—inaccurate adjustment of scores to reflect the impact of factors outside physician or hospital control—imagine that you have chosen the blood pressure measure within the “optimal diabetes” measure to be one of a handful of quality measures in your report card. You know that blood pressure is determined by multiple factors doctors have no control over, including patient age, income, education, willingness to exercise, stress levels at home and work, insurance coverage for and the price of blood pressure medications, etc. How do you adjust the scores on your report card to make sure they measure only physician expertise and not all those other factors?

Now imagine how inaccurate your report card is going to be if you can’t solve even one of these problems, never mind all three.

The bundled product problem: Treating to the test

Even the most expansive measurement-and-reporting schemes measure only a tiny fraction of the thousands of medical services sold in modern societies. Consider furthermore that each service can be evaluated at least four ways—by process measures (did the diabetic patient’s A1c levels get measured?), outcome measures (is the diabetic’s A1c level under 8?), structural measures (does the hospital have a catheterization lab?), and patient satisfaction as measured by surveys. The possible number of “quality” measures is in the tens of thousands. Compare tens of thousands to, for example, the 30 or so enforced by MDH and its contractor, MNCM, over the last 15 years.

A common argument presented by proponents of reporting schemes is that scores on some of the handful of measured services increase over time. But measurement proponents never investigate whether improvement on those scores was financed by “treating to the test,” that is, by shifting resources away from patients whose care was not measured. Common sense and a small body of research indicate that’s in fact what happens: measuring only a tiny fraction of services, as MNCM and other P4P proponents do, induces treating to the test. If in fact improvement on a few scores is financed by a worsening of the quality of unmeasured services, overall quality (at both the system and provider level) may not have improved at all. And if patient preferences were bulldozed by providers under pressure to honor the priorities set by report card producers, overall quality may have gotten worse. In either event, a report card that measures a micro-fraction of all services delivered will be a grossly inaccurate reflection of the quality of the providers subjected to measurement.

The attribution problem: Measuring phantom patients

Unlike the bundled product problem, the attribution problem does not afflict all measurements. We know, for example, exactly which hospitals and which surgeons performed bypass surgery on which patients. If we want to prepare a report card on heart surgeons or the hospitals where heart surgery is performed, we don’t have to make up arbitrary, complex rules to assign patients accurately. But we do have to make up arbitrary and complex rules to attribute patients to doctors, clinics, and hospital-clinic chains when the report card measures services like those in the “optimal diabetes” bundle.

The most widely used attribution rule is to assign patients (without their knowledge) to a clinic or hospital-clinic chain if, during a baseline (or “lookback”) period of one or two years, patients made a plurality of their visits to the clinic or chain. Thus, if I visit Clinic A three times in 2019, Clinic B once, and Clinic C once, the plurality-of-visits rule will “attribute” me to Clinic A for the “performance year” 2020. Even if I never set foot in Clinic A in 2020, the doctors in Clinic A will be rewarded or punished based on my blood pressure, my blood sugar levels, whether I resume smoking in 2020, etc., outcomes they were totally unable to influence during 2020. Health policy analysts and consultants measure the integrity of these attribution algorithms (or the lack thereof) by measuring their “leakage rates,” that is, the rate at which patients fail to seek care often enough during the “performance year” to be assigned to the same clinic the next year. Research on “accountable care organizations” (groups of clinics and hospitals) and “medical homes” (single clinics), for which the plurality-of-visits method is used, finds leakage rates of an astonishing 30% to 40%. As you can imagine, the addition of all those phantom patients to the denominator of measures like the “optimal diabetes” measure, and the subtraction of so many real patients, substantially raises the noise-to-signal ratio of such measures.
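For readers who think in code, the plurality-of-visits rule and the notion of “leakage” can be sketched as follows. This is a minimal illustration of the rule as described above, not any payer’s actual algorithm: the function names, the handling of ties, and the 2020 visit list are all assumptions made for the example.

```python
from collections import Counter


def attribute_patient(visits):
    """Return the clinic with a plurality of visits, or None.

    `visits` is a list of clinic names, one entry per visit in the
    lookback period. Returning None on a tie or on no visits is an
    assumption; real programs use tie-break rules not described here.
    """
    counts = Counter(visits)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place: no plurality
    return ranked[0][0]


def leaked(attributed_clinic, performance_year_visits):
    """True if the performance-year visits would not reassign the
    patient to the same clinic, i.e. the patient has "leaked"."""
    return attribute_patient(performance_year_visits) != attributed_clinic


# The example from the text: three visits to Clinic A, one each to B and C.
visits_2019 = ["Clinic A", "Clinic A", "Clinic A", "Clinic B", "Clinic C"]
# Hypothetical performance year in which the patient never visits Clinic A.
visits_2020 = ["Clinic B", "Clinic B"]

assigned = attribute_patient(visits_2019)
print(assigned)                       # Clinic A
print(leaked(assigned, visits_2020))  # True
```

In this sketch the patient counts toward Clinic A’s 2020 scores even though all of the 2020 visits went elsewhere, which is the “phantom patient” situation described above.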

The risk-adjustment problem

The third major contributor of noise to “performance” measures is crude risk adjustment. Risk adjustment is done to adjust scores for factors providers and insurance companies have no control over. The most efficient way to convey the unacceptable inaccuracy of today’s risk adjusters is to review the inaccuracy of the nation’s most widely used, most studied, and probably most accurate risk adjuster: the one CMS developed in the early 2000s to adjust payments to Medicare Advantage plans. This method, known as the Hierarchical Condition Categories (HCC) model, can predict only 12% of the variation in spending among Medicare enrollees. To understand how bad that is, consider these statistics reported by MedPAC: the HCC overestimates spending on the healthiest 20% of beneficiaries by 62% and underestimates spending on the sickest 1% by 21%. MedPAC has made it clear that it has no expectation that the HCC can be made substantially more accurate.

As these statistics suggest, inaccurate risk adjustment punishes providers who treat an above-average proportion of the sick and the poor and rewards those who treat an above-average proportion of the healthy and higher-income. This worsening-of-disparities effect can be seen, for example, in the outcomes of the Hospital Readmissions Reduction Program (HRRP), a program foisted on the fee-for-service Medicare program by the Affordable Care Act. The HRRP punishes hospitals with 30-day readmission rates above the national average. CMS uses a risk adjustment method similar to the HCC to adjust readmission rates for factors outside hospital control, but the risk adjuster is so bad it routinely punishes hospitals with sicker patients. Research published in the last three years indicates the HRRP may be killing heart failure and pneumonia patients.

MDH, MNCM, and other “performance measurers” use risk-adjustment schemes that are even cruder than the HCC, and in some cases they use no risk adjustment at all. MDH uses payer mix (the percent of patients insured by Medicaid, Medicare, and private insurers) as its risk adjuster. Unlike CMS, which reports the accuracy rate of its adjuster for at least cost (as opposed to quality), MDH has never reported what percent of the variation its payer-mix method explains. In a 2017 report to the Legislature (https://tinyurl.com/mp-2017-mdh), MDH did concede that “risk adjustment can typically only explain a fraction of differences in quality between providers,” and that it knew of no way to improve the accuracy of its crude payer-mix method. But, MDH concluded, that’s OK because the payer-mix method is “reasonable.” (p. 14)

Learning from failure

In its 1993 report to the Legislature, the Minnesota Health Care Commission based its breathtaking expectations of “performance” measurement on this breathtaking assumption: “The commission assumes that the dimensions of health care quality can be defined and measured in a useful and equitable way.” (p. 134) The commission endorsed this assumption without even acknowledging the sources of white noise discussed here (the bundled product, attribution, and risk adjustment problems), much less suggesting ways to overcome them. None of the subsequently appointed commissions questioned the 1993 commission’s fanciful assumption. Nor did the Legislature. It’s time Minnesota policymakers admit that the assumption was based solely on groupthink, that it persists to this day because of groupthink, and that it must at long last be rejected.

Rejecting that assumption does not mean rejecting measurement. The issue at hand is not whether measurement is useful, but whether inaccurate measurement is useful. Nor does it mean abandoning all efforts to improve the quality of medical services or the health of Minnesotans. It means abandoning the default diagnosis that all problems in our health care system are due to defects in our doctors and hospitals, entertaining the possibility that those problems that might be within provider control are due to insufficient resources, and abandoning the comforting myth that it’s possible to adjust “performance” scores accurately to reflect factors outside provider control. Above all, it means accepting the obligation to ensure that measurements are accurate before they are unleashed on Minnesota’s doctors and hospitals.

Kip Sullivan, JD, is a member of the Health Care for All Minnesota Advisory Board. He was a member of Gov. Perpich’s Health Plan Regulatory Reform Commission. His articles have appeared in the New England Journal of Medicine, Health Affairs, and other peer-reviewed journals.
