Measuring Evidence, Physician Behavior, and the Value Proposition
The increasing fraction of the gross domestic product being spent on healthcare without a commensurate increase in measures of population health has led to the widespread assumption that we have a value problem. Many organizations and government agencies now routinely weigh in on what physicians and their organizations ought to do to improve the value of their services. Yet most have been frustrated by what they see as stubborn resistance to change by physicians in the face of the evidence the organization has assembled. Of course, having been a practicing physician for nearly 40 years, I see the issue as one of healthy skepticism, not stubbornness. Rather than bandying words, however, I wish to explore the types of evidence available to physicians and their patients, and then consider ways we might better exploit the strengths of different types of evidence in different clinical situations. I propose that applying the proper tools to these different situations will improve the likelihood of being able to both improve outcomes and control costs.

Types of Evidence

Medical evidence is of four sorts: guidelines, registries, data mining, and “in my experience.”

Guidelines are currently the most prevalent and have been subjected to the most analysis. There are literally thousands of guidelines addressing hundreds of different diseases and clinical problems. The best assemble all of the relevant medical knowledge, use predetermined criteria to establish which studies are valid, and then develop practice guidelines derived explicitly from those data deemed valid. The strength of the guideline is graded based upon the data source that underlies it. Those graded “A” are derived from double-blind, randomized, controlled trials (RCTs). Grade B guidelines are derived from clinical trials that are less rigorous than the RCT. Grade C guidelines are those based upon the expert opinion of those writing the guideline. Unfortunately, most guidelines are grade B or C, because RCTs are expensive, take a lot of time to organize and conduct, and then more time to analyze and publish in peer-reviewed journals. Consequently, they are reserved for diseases or clinical situations that are common and for which two or more therapeutic options, usually expensive or dangerous, seem comparable at the time the study is designed.

There is no serious disagreement with the notion that a properly sized (powered) RCT showing a significant treatment effect should lead to widespread adoption of that treatment. However, there is disagreement about what constitutes a clinically significant, as opposed to a statistically significant, treatment effect. A more general problem is that patients in an RCT are carefully screened prior to entry in an attempt to make sure both arms of the study are similar. Study designers either have to screen out a lot of potential patients, or have to enroll a very large number of people in each arm of the study, to make sure the two treatment groups are reasonably similar (a rough calculation appears below). Any given patient may, or may not, resemble the study group, even if the physician knows the composition of the study group patients. Sophisticated guideline writers recognize this problem, and recognize that no guideline can adequately describe a treatment program for every patient. The problem arises when they fail to consider what proportion of patients “ought” to be considered appropriate for the guideline.
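To give a rough sense of why powered trials require so many patients, here is a minimal sketch using a standard two-proportion sample-size formula. The event rates and thresholds are my own illustrative numbers, not drawn from any particular trial.

```python
import math
from scipy.stats import norm

def patients_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate number of patients per arm needed to detect a
    difference between two event rates with a two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # about 0.84 for 80% power
    p_bar = (p_control + p_treatment) / 2
    term = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
            z_beta * math.sqrt(p_control * (1 - p_control) +
                               p_treatment * (1 - p_treatment))) ** 2
    return math.ceil(term / (p_control - p_treatment) ** 2)

# Hypothetical example: lowering an event rate from 10% to 8% already
# requires roughly 3,200 patients in each arm of the trial.
print(patients_per_arm(0.10, 0.08))
```

Even a modest treatment effect, in other words, demands thousands of carefully screened participants, which is why so few questions ever get a grade A answer.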
A recent analysis shows that only about 20% of rigorously analyzed guidelines make any attempt to stratify the patient’s likelihood of benefiting from a given therapy. In practice, if a guideline recommends treatment X, a clinical performance measure (CPM) is developed defining where the data are to be found and how a positive response is to be identified. The percentage of all eligible patients receiving recommended treatment X is thereby defined as the score. Physicians all went to medical school, and all got used to getting A’s on tests taken during their undergraduate years. They have all learned, however, that it is hard to get an A on the test when achieving the goal depends a lot on whether or not the patient chooses to work toward the same grade. Guideline writers might do well to remember that most people were C students in school.

The government has taken the view that certain guidelines should be viewed as rules, and that the CPM goal should be 100%. Over the past ten years, it has developed increasing numbers of CPMs for hospitalized patients and expects hospitals and physicians to move from previous levels of compliance to 100%. Those who lag behind are penalized financially. As anyone who has done a continuous quality improvement project knows, however, the biggest gains come early, with the first few iterations of the “plan, do, assess” cycle. Subsequent iterations become progressively more expensive and usually yield fewer and fewer gains until the number reaches some sort of stable equilibrium. One perfectly legitimate way to game the system, though, is to exclude patients from analysis. CMS actually permits this, recognizing that some patients have a contraindication to the use of some recommended drugs. In recent years the doctors at my hospital have mostly done a better job of documenting why the patient should not be on the therapy. While this has some value, it is being achieved at great expense, for fear that our numbers will fall below the threshold and CMS will reduce payments. Of course, every other hospital in the county is doing precisely the same thing, so the cost of care goes up as everyone plays the game, but it is not clear that any real value is being derived, either at the individual patient level or at the societal level in the form of improved outcomes. While the government controls what it will pay for the service, hospitals must shift these extra expenses somewhere. While this “hidden tax” has gotten a lot harder for hospitals to levy, they still find ways to do it.

Another type of evidence is obtained from disease registries. For instance, all patients requiring dialysis or a kidney transplant for treatment of kidney failure are entered into a national databank, along with large amounts of required data collected on a regular basis. These data are then analyzed by a third party and published annually. As a result we have good information about the results of treatment over a long period of time and can easily tell when, or if, those results have changed. However, the data do not tell us why they have changed. Physicians often use registry data when considering a patient’s prognosis, and when considering whether a new therapy is likely to be of benefit. Patients, though, usually assume they will beat the odds.

Data mining, sometimes called real world evidence, is a third type of evidence. With widespread computerization of data, it has become possible to mine all sorts of databases for new kinds of associations.
Data mining is, indeed, real world data, and is usually current, or certainly more so than most other kinds of data. But real world evidence is derived from databases created for other, frequently non-clinical, reasons, and inferences must be drawn very cautiously as a result. This kind of evidence is really too new to be in widespread use. Proponents tout it as a way to see what “works” in the real world, but I suspect it will be more useful in defining what does not work. Physicians know that a lot of what they do for patients does not work, in the sense that the patient does not derive any important benefit from the treatment, test, etc. Giving these things up would be easy if only we knew which ones did not matter. Sometimes patients help us along by deciding to stop one or more forms of treatment. More often than we might like to admit, nothing much bad happens to the patient, at least over the short term. Whether stopping is really a good idea or not depends to a large extent on the time frame over which benefit is measured. I will come back to this idea later.

The fourth kind of evidence is “in my experience.” Academic physicians scorn this sort of evidence, noting that no individual physician can hope to see enough aspects of a given disease to derive scientifically provable evidence. Despite this, a lot of the grade C evidence published in guidelines, and a lot of daily clinical practice, is driven by experience, both the physician’s and the patient’s. I think the key problem with using experience is not the lack of universality; it is that physicians remember their disastrous outcomes with stark clarity, often years after the event, but often do not remember when things went according to plan. This is called the anchoring heuristic, reflecting the powerful impact the bad case has on clinical decision making. Physicians have different emotional reactions to adverse events, depending upon whether the bad result is perceived as a result of the disease or as a result of an action taken or withheld by the physician. Consequently, physicians will weight the desirability of a given course of action on the basis of how much emotional risk they are taking. While physicians, like all people, vary in their basic tendency to anxiety, a recent bad result will generate anxiety-driven behaviors in even the most stolid of individuals. Usually this is phrased as a fear of being sued, but in fact it is much broader than that. Anxiety-driven physician behavior is the same even when there is no practical risk of a lawsuit. Many physicians, particularly those who consider themselves scientists, are disturbed by the inherent subjectivity of experience-derived data. They do not like the apparent randomness and unpredictability of the effects listed above. (Government bureaucrats do not like unpredictability, either.) Consequently they seek solace in writing guidelines, with all the limitations listed above. On the other hand, experience, and the subjective doctor-patient encounter, is exquisitely sensitive to the patient’s preferences in a way that no guideline can match.

Applying Evidence to Different Types of Clinical Problems

Previous efforts to address the value problem outlined initially have foundered on the limitations of the evidence. But it is clearly not feasible to simply throw up our hands and surrender. So how to proceed? Before describing a possible way forward, I want to take a short detour and consider a famous mathematical problem: how long is the coast of England?
The answer is that it depends upon how you measure it. If you measure the coastline in kilometers you get one answer. However, since the coastline is irregular, measuring in kilometers means you have to ignore some features of the line. If you thought it important, you could measure in meters, allowing you to capture those smaller features, but then you will markedly, though not infinitely, increase the estimate of the coastline’s length. It is not my intention to quote Mandelbrot’s proof of his mathematical estimate of the degree of roughness or smoothness of the curves described by the coast of England, but it is to suggest that a way forward depends upon defining an appropriate scale.

From the clinician’s perspective, it seems reasonable to propose that our scale depend upon the clinical situation. Although probably too simple, a clinical problem can be characterized as either acute, meaning something of relatively abrupt onset that is likely to either go away or kill the patient soon, or chronic, meaning something the patient is going to have to endure. It is also possible to characterize either of these types of problems as minor or major (although it must be admitted that a dichotomous classification would be difficult to use in practice). Broadly speaking, then, we have four kinds of clinical problems: those that are acute and minor, those that are acute and major, those that are chronic and minor, and those that are chronic and major. However, before proceeding, it is critical to recognize that clinical problems evolve over time. One major limitation of all available guidelines is their complete avoidance of the fact that chronic problems tend to get worse. Further, common experience suggests that chronic problems tend to proliferate with aging. Rare indeed is the 70-year-old with only one chronic disease.

Acute minor problems have been addressed, but are rarely subject to CPMs, probably because the cost is low and most of the care occurs in outpatient settings. Consider the problem of a symptomatic urinary tract infection in a healthy, sexually active woman. It is quite feasible to define a practice guideline for managing this problem that is so specific its execution can be left safely to physician extenders. We should expect a high degree of predictability (that is, the CPM score should approach 100%), and little variation should be expected in outcomes.

Acute major problems would include something such as elective surgery or pneumonia. Here it is possible to write a guideline that can be quite specific, but we cannot expect the same high degree of performance, and there will be more variability in outcome. No matter how hard we try, there will be occasional deaths associated with elective surgery and more than occasional deaths from pneumonia. Although guidelines exist for both of these examples, it is more helpful to think of this problem as one of system design. Let me present an example. In our hospital, the cardiologists decided care of patients with a heart attack could be improved if interventional care could be delivered to the patient within an hour of onset of symptoms. To do this, though, it was necessary to set up a care pathway that took precedence over the normal way care is delivered for other problems. Over the past five years, the mortality rate has fallen from about 8% per admission to about 3% per admission.
Similar care pathways have been developed in many institutions for various problems, including not only heart attacks but trauma and other acute conditions. Each care pathway is institution specific, and as yet there has been little coordinated effort to define which parts of a pathway are necessary to achieve meaningful improvement in outcomes, and which strategies are most effective in overcoming barriers to implementation. I want to emphasize, though, that these problems are subject to more inherent variability, and there will be many patients with an acute problem whose chronic problems make it inadvisable to implement the pathway. What is needed, then, is a clearly defined pathway for the usual patient, and an “opt out” method for the patient who is not typical, or who chooses not to participate. This is different from the traditional model of care, in which the physician has to “opt in” by ordering everything that is to happen to the patient during the hospitalization.

The chronic minor problem is the most difficult to characterize. For example, osteoarthritis, or “wear and tear” arthritis, can be a nuisance in many patients and disabling in others, yet the clinical features on exam and X-ray may not appear all that different. What is minor for one patient may be major for another. At present we do not have a useful way to account for these differences in the way a patient responds. Hypertension is another example. Hypertension controlled by a single medication is probably considered minor by both the physician and the patient, but could evolve, especially if not treated, into a major, life-threatening chronic problem. Furthermore, the presence of hypertension may well complicate management of either another chronic problem or an acute severe problem. The same could be said for type 2 diabetes. Numerous guidelines have been developed for managing hypertension and diabetes, and there are numerous studies of various therapies for both of these problems. There is even recognition in some guidelines that a lot of patients have both problems, and represent a different group from those having either one alone. Achieving therapeutic targets for blood pressure control or for average blood sugar control, though, cannot be done with a prescription. The patient must actually buy the medication, take it as prescribed, not miss (m)any doses, and follow up from time to time so that clinical performance can be measured. A good doctor is defined as one whose patients meet the pre-specified targets at a rate higher than the average for other doctors in the neighborhood. But what if the patient, for whatever reason, does not keep his or her end of the bargain, and consequently blood pressure or blood sugar is not at the “A” level? Ideally, the physician and patient would decide on a plan B, C, or D, and eventually would arrive at a satisfactory “score” on the CPM. If that did not happen, though, the physician might be tempted to run off the patient whose failure to achieve the goal was making the doctor look bad on the scorecard. Commonly referred to as cherry-picking, this phenomenon is a common first response. Another unintended consequence of current CPMs is that they fail to account for the benefit to both the patient and to society from incremental improvement. Is it better to have 20% of the population with diabetes at goal blood sugar levels, with the other 80% lost to follow-up until catastrophe appears, or is it better to have 60% partway to the goal?
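To make the comparison concrete, here is a minimal sketch using hypothetical patient counts and an assumed linear partial-credit rule of my own devising; it shows how the two scenarios score under a pass-fail CPM versus a graduated measure.

```python
def pass_fail_score(patients):
    """Fraction of eligible patients fully at goal (current CPM style)."""
    return sum(1 for progress in patients if progress >= 1.0) / len(patients)

def graduated_score(patients):
    """Average progress toward goal, giving partial credit (assumed rule)."""
    return sum(min(progress, 1.0) for progress in patients) / len(patients)

# progress = 1.0 means the target is met; 0.0 means no progress at all
scenario_one = [1.0] * 20 + [0.0] * 80   # 20% at goal, 80% lost to follow-up
scenario_two = [0.5] * 60 + [0.0] * 40   # 60% of patients halfway to goal

for name, cohort in [("Scenario one", scenario_one),
                     ("Scenario two", scenario_two)]:
    print(name, "pass-fail:", pass_fail_score(cohort),
          "graduated:", graduated_score(cohort))
# Pass-fail rewards scenario one (0.20 vs 0.00); the graduated measure
# rates scenario two higher (0.30 vs 0.20).
```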
If an individual physician’s income is at stake, he or she will take the first option, but society might be better served by the second. I do not think current guidelines for management of diseases such as hypertension and diabetes are a bad idea, but I do think we need to refine the way we measure application of the guidelines in the real world, recognizing that a really good guideline might apply to 80% of the patients 80% of the time. We also need to use performance measures that capture improvement on a graduated scale rather than using a “pass-fail” approach. Perhaps data mining could be used to refine the CPMs for these diseases.

The last category is the chronic severe problem. End-stage kidney failure is a good example. Treatment of patients with dialysis is extremely expensive, and survival of these patients is worse, on average, than that of patients with colon cancer. Here, too, guideline writers have been busy, but to date virtually no studies have shown that applying evidence-based guidelines, regardless of the strength of evidence, has had any major impact upon patient morbidity or mortality. Here again, the problem is not with guidelines that inform physicians about what is known about treating these patients; it is that the only thing dialysis patients have in common is dialysis, and every patient is moving along a continuum of disease characterized by increasing morbidity. It is also not that dialysis fails to work: patients with symptomatic kidney failure live on average 3-6 months without dialysis, and on average 30 months with it. What is needed are guidelines that are stratified based upon patient-specific factors such as age, vitality, and co-morbidity, and that change as the burden of disease grows. Developing a workable tool for stratifying patients is the focus of my current clinical research, but presently no such tool exists. In disease states like this, then, we need to move away from the idea that there are many performance goals that should be applied universally, and move toward the idea that the focus of care must evolve from one driven by the physician’s understanding of what seems likely to produce the best outcome toward one driven by the patient’s understanding of the burdens of living with chronic disease. A more detailed presentation of what this sort of process might look like will appear in a subsequent paper.

Affecting the Value of Health Care

Would the more subtle application of evidence that I have sketched have any effect? The short answer is that I do not know. First, we have a measurement problem. It is difficult to assess the benefit of medical care (as distinct from healthcare generally), but it is relatively easy to measure costs. Thus the value equation is skewed toward costs. A second problem is defining to whom the benefit accrues. The patient’s assessment of benefit from dialysis therapy might differ from society’s assessment. Healthy individuals considering the problems of patients on dialysis often say they would not want to live like that, but relatively few patients with progressive kidney failure reject it outright. Third, improving the quality of care seems worthwhile, but there is currently no useful way to measure or reward it. In my dialysis units, the likelihood of being hospitalized in any given year is statistically lower than the national average.
While this could reflect better selection of patients, I really think the reason is that I have managed to keep a very experienced staff of nurses who actually provide the treatment, monitor the patients, and apply the protocols I have developed for their care. This “center effect” is well recognized, but there are no generally accepted measures of experience or expertise, and there are no financial rewards to either the dialysis unit or the physicians for having a positive center effect.

A fourth problem is that it is difficult to measure costs not incurred. Let us assume we have a therapy that will reduce the death rate in dialysis patients from the current 22% per year to 20%. At first we will see some cost savings as those patients remain “healthier,” but soon the pool of patients on dialysis will grow and total expenditure will rise back to the previous level. Since we are only delaying death, not preventing it forever, if we follow the group long enough we will note that the death rate has gone back up. Note that the statistic of interest to the patient (how long will I live?) is not addressed at all.

A fifth problem is that of small area variability. Studies consistently show that the rate at which medical services are provided in one area differs from that in another in ways that cannot be explained by the diseases present in the population. It is common to characterize the area with higher utilization, and hence expense, as wasteful. No one would argue in favor of waste, but the fact is that we do not really know which small area has the optimum expenditure. Given that we have very few medical interventions that directly affect measurable outcomes for any disease or treatment, there is no a priori way for such studies to define waste.

Is the prognosis so grim that the patient is terminal? I do not think so, but I do think we are going to have to be content with incremental improvements. We need to continue to look for ways to standardize care where it is appropriate to do so, and to change the way care is delivered where appropriate. We also need to be ever alert for the inevitable unintended consequences of our efforts and make revisions when needed to mitigate their effects. In the early 1990s, Jim Cooper, a congressman from my state, was promoting health care reform in Washington. After a speech he gave to our local Chamber of Commerce, we had the chance to talk privately. I told him then, and still believe, that the fundamental problem is that we are trying to develop a rational system to deal with a problem driven by fundamentally irrational motives. People do not want to be sick, to hurt, or to die. Dramatic quantum leaps in changing the value equation are not going to happen unless we as a society decide it is okay to be a little sick or a little hurt without seeking medical care, and to die without seeing how much care we can endure in the process.

17 February 2013