Objectives The aim of this study was to compare three health utility instruments (15D, EQ-5D, SF-6D) and a rating scale for health (EQ-Visual Analogue Scale (VAS)) and to investigate their relationship to clinical parameters in patients with rheumatoid arthritis (RA).
Methods Data were collected from 1041 patients with RA. Agreement between the instruments was assessed with Bland–Altman plots. Linear regression models were fitted for each instrument, with Health Assessment Questionnaire (HAQ) score, age, gender, patient global, disease duration and educational level as independent variables. Differences in utility scores across levels of global health and disability were investigated, as were correlations with disease-specific health status measures.
Results The score range in the 1041 patients with RA was 0.41–1.0 for 15D, −0.48 to 1.0 for EQ-5D, 0.0–1.0 for EQ-VAS and 0.30–1.0 for SF-6D, with a bimodal distribution for EQ-5D. Bland–Altman plots indicated poor agreement between EQ-5D and SF-6D/15D and moderate agreement between SF-6D and 15D. Utility scores were correlated with disease-specific measures, pain and fatigue (r>0.60). Mean utilities ranged from 0.30 (EQ-5D) to 0.69 (15D) in patients rating their own health as poor. When correcting for a non-linear relationship between HAQ and EQ-5D/SF-6D in linear regression models, the estimated utilities had non-overlapping CIs for HAQ values >1.4.
Conclusions Diverging scores were observed across utility instruments, especially in patients with high HAQ scores. The choice of utility instrument may have an impact on the results of cost-utility analyses, with large hypothetical differences in price per quality-adjusted life year.
Rheumatoid arthritis (RA) is a chronic inflammatory disease with a major effect on both physical and psychological health and the ability to work.1–3 Several new therapies can improve outcome, but their cost represents a financial challenge to healthcare systems. Health authorities are increasingly using cost-effectiveness analyses for reimbursement decisions.4 In such analyses, the aim is to compare two alternative treatments by presenting the relationship between the net additional (incremental) resources used (costs) and the net additional health benefits achieved (effects) in an incremental cost-effectiveness ratio.5 6 In cost-utility analysis, a subgroup of cost-effectiveness analyses, the benefit is expressed in terms of quality-adjusted life years (QALYs).5
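The incremental cost-effectiveness ratio described above can be written compactly. The notation is an assumption introduced here for illustration and is not taken from the cited sources: C denotes total cost and E total health effect (QALYs in a cost-utility analysis) for the new and the comparator treatment.

```latex
\[
\mathrm{ICER} \;=\; \frac{C_{\text{new}} - C_{\text{comparator}}}{E_{\text{new}} - E_{\text{comparator}}}
\]
```

Read this way, the ratio is the additional cost paid per additional unit of health benefit gained.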
One central goal of RA therapy is to improve health-related quality of life (HRQoL),7 which can be measured with generic and disease-specific instruments.8–10 The utility instruments are a particular group of generic instruments used for economic evaluation (cost-utility analysis). The idea is that the utility (value) of a health state is measured on a scale from death (0.0) to perfect health (1.0) and reflects preferences among people for different health states. Some instruments (including EQ-5D) have health states worse than death with utility scores below 0.0. QALYs are computed as the utility score multiplied by life years. QALYs represent an important method for comparison between disease groups, especially when setting resource priorities.11 Here, the cost per QALY is used to identify treatments with the lowest costs compared with their effectiveness.
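The QALY arithmetic described above can be sketched in a few lines. This is a minimal illustration, not part of the study's analysis. The function names and the example figures in the comments are hypothetical.

```python
def qalys(utility: float, years: float) -> float:
    """QALYs as defined in the text: utility score multiplied by life years."""
    return utility * years


def cost_per_qaly(incremental_cost: float, utility_gain: float, years: float) -> float:
    """Incremental cost divided by QALYs gained (hypothetical helper)."""
    gained = qalys(utility_gain, years)
    if gained <= 0:
        raise ValueError("cost per QALY is undefined without a positive QALY gain")
    return incremental_cost / gained


# Hypothetical figures: a utility of 0.5 held for 2 years yields 1 QALY.
one_qaly = qalys(0.5, 2.0)
```

For a fixed cost and duration, the cost per QALY is inversely proportional to the utility gain. This is why the choice of utility instrument matters in cost-utility analysis.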
Several methods have been developed to measure utilities by studying people's preferences for health states. While methods such as standard gamble (SG) and time-trade-off (TTO) may be used to measure health states directly, they are less suitable for clinical research. Instead, so-called utility instruments (eg, SF-6D, EQ-5D, Health Utilities Index and 15D) measure utility indirectly. A varying number of health dimensions or attributes (eg, pain, vision, physical functioning) are then used to describe patients' health states by means of scoring systems. The scores are subsequently translated into a utility score between 0 and 1 by means of an algorithm or ‘tariff’. An alternative approach is to use a visual analogue scale (VAS). However, a VAS is not a true utility instrument because it is not based on preferences in the sense that respondents are not requested to sacrifice anything (eg, life years, money) when they express their valuation of health states. The SG and TTO are methods that fit economic models well, but are less widely used for feasibility reasons.11
Previous studies of patients with RA12–14 have indicated that these indirect utility instruments yield different utility scores for the same patients. The aim of this study was to compare utility scores from 15D, EQ-5D and SF-6D and examine their associations with commonly used health status measures in RA, especially the Health Assessment Questionnaire (HAQ). We used the Oslo RA register (ORAR) which includes a large number of patients and is representative of the underlying patient population in a well-defined geographical area.15 Data on patient reported health status are available from several mail surveys undertaken since 1994.
Patients and questionnaires
ORAR was established in 1994 and comprises about 85% of the patients with RA living in Oslo,15 a capital city with nearly 550 000 inhabitants. To be included in the register the patients must fulfil the American College of Rheumatology classification criteria.16 A mail survey was performed among all 1793 patients in the ORAR in 2004, and 1041 (58.1%) responded.17 The non-respondents received one reminder according to recommendations from the local ethical committee. The booklet of questionnaires included a variety of measures of utility and health status including SF-36, 15D, EQ-5D, EQ-VAS, the HAQ,18 the modified HAQ,19 the revised Arthritis Impact Measurement Scales (AIMS2)20 and other clinical and health-related parameters.
The 15D captures 15 dimensions with five response categories in each dimension, making it theoretically possible to describe more than 30 billion health states.21 22 All these utility scores fall between 0.0 and 1.0. The algorithm has been developed on the basis of multiattribute utility theory and the 15D weights are based on a Finnish study from 2001. This instrument is not yet widely used in RA but has been used in some HRQoL studies.23 24 Regression analyses were performed to handle missing values.22
Developed by the EuroQol group,25 26 the EQ-5D captures five health dimensions (mobility, self-care, usual activity, pain/discomfort and anxiety/depression). Three response categories are available for each dimension, describing 243 different health states.27 The UK tariff, in which 36 of the health states were valued directly in a large population survey in the UK, was used in this study. This tariff translates EQ-5D scores into utilities by means of a TTO technique.27 The potential EQ-5D values range from −0.59 to 1.0.
SF-6D is a utility instrument in which SF-36 scores can be translated into a utility score by means of an algorithm based on a SG technique.28 SF-6D has six dimensions, each with 4–6 levels. The SF-6D utility scores range from 0.29 to 1.00 and may describe 18 000 health states.29 Of these, 249 health states were evaluated directly while the rest were imputed from regression analyses.
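The numbers of health states quoted for the three instruments follow from multiplying the number of response levels across dimensions. The check below is a minimal sketch. The text only says that SF-6D has six dimensions with 4–6 levels each, so the per-dimension split used here is an assumption, chosen because it reproduces the stated total of 18 000.

```python
from math import prod

# Response levels per dimension for each instrument.
levels_15d = [5] * 15               # 15 dimensions, five categories each
levels_eq5d = [3] * 5               # five dimensions, three categories each
levels_sf6d = [6, 4, 5, 6, 5, 5]    # assumed split of "4-6 levels" per dimension

health_states = {
    "15D": prod(levels_15d),
    "EQ-5D": prod(levels_eq5d),
    "SF-6D": prod(levels_sf6d),
}
```

The 15D product is 5 to the power of 15, which is consistent with "more than 30 billion" health states.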
On the EQ-VAS scale the patients are asked to place their own health on a thermometer from 0 (worst possible) to 100 (best imaginable health). This is not a true utility measure. A utility instrument is based on a reference material where people have been asked to give preferences in the sense that respondents are requested to sacrifice something (eg, life years, money) when they express their valuation of health states. Because EQ-VAS is often used in studies, it has been included in these analyses.
Disease-specific health status measures
The HAQ instrument evaluates the ability to fulfil daily tasks in eight categories (dressing, rising, eating, walking, hygiene, reach, grip and usual activities) with 20 questions. Scores for each category and for the total score range from 0 to 3 with 25 intervals.18 AIMS2 is a revised and expanded version of AIMS, constructed to measure health status in patients with arthritis. It is a questionnaire with 78 items, 58 of which make up 12 scales, which can produce a 3-component (physical, affect and symptom) or 5-component (physical, affect, symptom, social interaction and role) model.20 The patients completed 100 mm VAS for pain and fatigue with anchors 0 (good health) and 100 (poor health), and for self-reported disease activity (patient global VAS).
To make comparison easier, all EQ-VAS scores were divided by 100 to generate values between 0.0 and 1.0. Descriptive statistics (mean, median, SD, minimum, maximum and frequencies) were computed for the utility instruments and EQ-VAS. Histograms were used to visualise the distribution of the scores. Correlations with other health status and demographic data were examined by Spearman correlation (r) due to non-normality of EQ-5D data. Agreement between the utility instruments was assessed with Bland–Altman plots.30 A Bland–Altman plot is used to compare two methods of measuring similar variables. The x-axis presents the average of the two variables (in this study, two different utility scores) for each case while the y-axis represents the difference between the two utility scores. With good agreement, the observations are clustered evenly around the line representing y=0.
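The quantities behind a Bland–Altman plot can be computed as follows. This is a minimal sketch using only the Python standard library, not the SPSS procedure used in the study. The function name is hypothetical. The limits of agreement (mean difference ± 1.96 SD) are the conventional addition to such plots and are an assumption here, since the text describes only the axes and the y=0 reference line.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b):
    """Return per-patient averages (x-axis), differences (y-axis),
    the mean difference (bias) and conventional 95% limits of agreement."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both instruments need one score per patient")
    averages = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(differences)
    half_width = 1.96 * stdev(differences)  # needs at least two patients
    return averages, differences, bias, (bias - half_width, bias + half_width)


# Hypothetical utility scores for three patients on two instruments.
averages, differences, bias, limits = bland_altman([0.8, 0.6, 0.4], [0.7, 0.6, 0.5])
```

Plotting `differences` against `averages` gives the figure described above. A bias far from zero, or differences that drift with the average, indicates poor agreement.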
Utility levels were compared across a global health status measure. For this comparison we used the response categories in the first item of SF-36, which is not included in the calculations of SF-6D (‘In general, would you say your health is: excellent, very good, good, fair, poor’). Mean utility and EQ-VAS scores were calculated for each of the five response groups. Linear regression models were fitted for each utility instrument with the utility score as the dependent variable and HAQ, age, gender, disease duration, patient global and level of education as independent variables. Due to a non-linear relationship between HAQ and the SF-6D and EQ-5D scores, a squared HAQ term (HAQ²) was included in the linear models. To simplify the model, level of education was divided into only two groups (completing and not completing at least 3 years of education after compulsory school) and smoking status was set to ever or never having smoked. R² was calculated for each model.
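Adding a squared HAQ term keeps the model linear in its coefficients while allowing a curved relationship between HAQ and utility. The sketch below fits utility = b0 + b1·HAQ + b2·HAQ² by ordinary least squares in plain Python. It is an illustration under stated assumptions, not the study's SPSS model. The function names are hypothetical, the other covariates (age, gender, disease duration, patient global, education) are omitted, and the data are synthetic.

```python
def fit_quadratic(haq, utility):
    """Least-squares coefficients [b0, b1, b2] for b0 + b1*HAQ + b2*HAQ^2.
    Requires at least three distinct HAQ values."""
    rows = [[1.0, h, h * h] for h in haq]
    # Normal equations (X^T X) b = X^T y as an augmented 3x4 matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, utility))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


def predict(coefs, haq_value):
    """Estimated utility at a given HAQ score."""
    b0, b1, b2 = coefs
    return b0 + b1 * haq_value + b2 * haq_value ** 2


# Synthetic data generated from a known curve, so the fit should recover it.
haq_scores = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
utilities = [0.9 - 0.1 * h - 0.05 * h * h for h in haq_scores]
coefs = fit_quadratic(haq_scores, utilities)
```

A negative b2, as in this synthetic example, means that utility falls faster at high HAQ scores than a straight line would predict.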
The Statistical Package for Social Sciences (SPSS, Chicago, Illinois, USA) Version 14.0 was used for all analyses.
The 1041 individuals who responded to the mail survey represent 58.1% of the living patients included in ORAR in 2004; 78.1% (n=813) were women, 47.1% were rheumatoid factor (RF)-positive, the mean (SD) age was 61.7 (15.0) years (range 18–94), mean (SD) disease duration was 14.1 (10.9) years (range 0–55) and mean (SD) HAQ score was 1.06 (0.75) (range 0.0–3.0). This cohort represents a broad spectrum of patients, as ORAR is designed to include as many of the patients with RA in Oslo as possible.15 The respondents differed from the non-respondents with regard to mean age (respondents 61.7 years vs non-respondents 66.4 years, p<0.001) and mean disease duration (respondents 14.1 years vs non-respondents 15.5 years, p<0.05).
Distribution of scores
The mean/median scores of 15D, EQ-5D, EQ-VAS and SF-6D were 0.81/0.83, 0.60/0.69, 0.62/0.65 and 0.64/0.62, respectively. EQ-5D had a bimodal pattern with no observations between 0.364 and 0.516 (figure 1). Seventy-one patients (6.8%) had an EQ-5D value below 0. Of the 1041 patients, 4.2% had missing data for EQ-5D, 5.8% for EQ-VAS and 7.9% had missing data for SF-6D. For 15D, regression analyses were performed to handle missing data.22
Both SF-6D and 15D had poor agreement with EQ-5D (figure 2A, B). EQ-5D yielded lower scores than 15D and SF-6D for poor health states but, for health states close to perfect health, 15D and SF-6D yielded lower scores. There was more agreement between SF-6D and 15D (figure 2C), but 15D yielded higher values across all health states. These findings indicate that the instruments produce quite different utility scores for the same health states. The figure also reflects the bimodal distribution of EQ-5D.
Correlations with other health status measures
We examined correlations with clinical and demographic values but did not include components from SF-36 because these values would not be independent of SF-6D. The strongest correlations with utility scores and EQ-VAS were observed for physical and pain parameters, with reasonably similar findings for the four instruments (table 1). Demographic data showed very poor correlations, while mental, social and fatigue parameters were moderately correlated with utility scores and EQ-VAS.
Relationship to levels of global health and disability
The best discrimination was seen for EQ-5D across different levels of global health (question 1 from SF-36) (table 2). The scores achieved with 15D were consistently higher for all groups, whereas SF-6D and EQ-5D had similar mean scores in patients with fair to excellent health.
The linear regression models with utility scores as the dependent variable and HAQ score, HAQ² score, gender, patient global, disease duration, age and education as independent variables are shown in table 3.
To better illustrate the result of the combined regression coefficients for HAQ and HAQ², figure 3 shows the estimated utility values with CI. For HAQ values above 1.4, all three utility instruments had differing results. The estimated 15D values differ from EQ-5D and SF-6D values for all HAQ scores above 0.6.
The results of this study indicate substantial discrepancies between commonly used utility measures with regard to performance and agreement in a large cohort of patients with RA.
The agreement between EQ-5D and the other utility scores was poor, as shown by the Bland–Altman plots (figure 2). In both plots involving EQ-5D, some patients had differences between EQ-5D and the other utility measures exceeding 0.5, and a few patients even had a difference of 1.0. These numbers are large, considering the range of the scales of the instruments. The differences also vary in size across levels of health states, making it difficult to recalculate from one measure to another (figure 2).
This study confirms earlier findings in smaller studies12 13 which have reported a bimodal distribution of EQ-5D scores (figure 1) as well as a ceiling effect for EQ-5D,12 13 31 which indicates that this instrument is less able to differentiate between patients in fairly good health. Our results also illustrate that 15D has a ceiling effect (figure 1).
HAQ is widely used to estimate utilities for economic evaluations of therapeutic interventions.4 Only 10–15% of the variance in our models was explained by HAQ alone (data not shown), and it is questionable whether estimation of utilities from HAQ is a valid method. Such estimations add uncertainty to the calculations, and utility instruments should be included in studies when possible. This is also the recommendation in a study exploring the validity32 of a previously published prediction model33 for EQ-5D and SF-6D from HAQ. If utility instruments have not been included, models should ideally assess potential non-linear relationships. Previous prediction models for EQ-5D from HAQ have presented linear functions between the two34 35 or, to some extent, taken into account the non-linear relationship.33
To exemplify the differences in estimated utility scores from HAQ in our model, we calculated prices per QALY gained in two example patients, both 50-year-old women, of adding biological treatment to previous disease-modifying antirheumatic drugs for 1 year. The first patient had a change in HAQ from 1.6 to 1.2 and a change in patient global from 60 to 40. With the most common instrument (EQ-5D) as comparison, the price per QALY with SF-6D, EQ-VAS and 15D was 2.8, 1.7 and 3.6 times higher, respectively. The second patient had a change in HAQ from 2.6 to 2.2 with a change in patient global from 60 to 40. Here the price was 4.5, 2.1 and 4.5 times higher with SF-6D, EQ-VAS and 15D compared with EQ-5D. These examples are hypothetical and assume an instant stable effect of the new medication which might be artificial, although utility change has been found to be rapid and persistent.36 Taking into account the limitations, they highlight the problems of calculations of utility from HAQ and choice of utility instrument, and are supported by findings in other models.37 38
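The price ratios in these examples can be read as ratios of estimated utility gains. The derivation below is a sketch in notation introduced here for illustration. It assumes that the same incremental cost ΔC and the same duration T apply whichever instrument is used, with P the price per QALY and Δu the estimated utility gain for a given instrument.

```latex
\[
\frac{P_{\text{SF-6D}}}{P_{\text{EQ-5D}}}
= \frac{\Delta C / (\Delta u_{\text{SF-6D}}\, T)}{\Delta C / (\Delta u_{\text{EQ-5D}}\, T)}
= \frac{\Delta u_{\text{EQ-5D}}}{\Delta u_{\text{SF-6D}}}
\]
```

Under these assumptions, a price per QALY that is 2.8 times higher with SF-6D corresponds to an estimated SF-6D utility gain of about 1/2.8 ≈ 0.36 of the gain estimated with EQ-5D.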
We found that the three utility instruments had different associations with HAQ. In particular, lower utility scores were observed for EQ-5D in patients with severe disabilities (figure 3). These differences are reproduced when comparing mean utility scores across groups according to rating of overall health (SF-36 question 1) (table 2). This result may be explained by the content of EQ-5D. Of the five dimensions, four (mobility, self-care, usual activity and pain/discomfort) are likely to be affected in patients with RA. It is conceivable that EQ-5D is particularly sensitive for conditions with physical limitations and disability. Some support for this can be found in a study comparing EQ-5D and SF-6D in seven diseases,31 showing larger mean differences between the two instruments in osteoarthritis than in diseases focusing on pain and discomfort such as irritable bowel syndrome.31 A study investigating the health states of patients with inflammatory arthritis with a score worse than death on EQ-5D found that a large proportion of these patients scored maximum on the pain dimension and moderate on all other dimensions.39 Thus, EQ-5D has a large potential for improvement when a patient is improved from severe to moderate pain or disability. One alternative explanation for the differences in performance is that EQ-5D scores are derived with the TTO technique while SF-6D scores are derived by means of the SG technique.
Limitations to the patient material presented in this study include the fact that the cohort had a low rate of RF-positive patients (47.1%) and a response rate in the mail survey of 58.1%. An extensive questionnaire might have contributed to the low response rate, as might the higher age among non-responders. The participants answered a booklet of questionnaires including the utility instruments and a number of other instruments. The order of these instruments was fixed. It is possible that the answers influenced each other. This could have been addressed by randomising the order in which the instruments appeared, or by randomising the patients to receive only one instrument each. This study was cross-sectional and could not answer questions regarding sensitivity to change. In the analyses, HAQ and to some extent patient global were used as key measures of disease activity and severity, in accordance with their relationship to both damage and inflammatory activity.40 41 C-reactive protein and erythrocyte sedimentation rate data were not available but would have been interesting markers of disease activity to include in our study.
One of the strengths of this study is that a broad group of patients with RA was included. ORAR aims to include all patients with RA regardless of disease level, age and sex, and our study shows the performance of the instruments in a real-life setting. The study also includes a large number of patients. This is in contrast to clinical trials where patients often have high disease activity and the age group is more limited.
EQ-5D is the most commonly used generic utility instrument in cost-utility evaluation,42 and has also been used in studies of anti-tumour necrosis factor agents.43 The scores obtained in the same patient on the same day using 15D, EQ-5D, EQ-VAS and SF-6D may vary considerably. All three utility instruments have floor or ceiling effects. The disagreement between the instruments is most pronounced for patients with poor health and severe disability (HAQ ≥2). The differences between the instruments imply that the results of cost-effectiveness/cost-utility analyses may vary considerably depending on which instrument has been used. Policy makers need to be aware of these differences when interpreting economic evaluations. The differences in utility measures need further confirmation in longitudinal intervention studies in patients with different levels of disease severity.
The authors thank Inge Christoffer Olsen for making figure 3 and for statistical advice.
Competing interests None. The handling editor of this manuscript was Johannes WJ Bijlsma.
Funding Abbott provided an unrestricted grant to Diakonhjemmet Hospital corresponding to a 7-month salary for a research fellow (SL). The sponsor had no influence on the design of the study or wording of the manuscript and did not review or comment on the text of this manuscript at any stage of the research process.
Ethics approval This study was conducted with the approval of the regional ethics committee and the Data Inspectorate.
Provenance and peer review Not commissioned; externally peer reviewed.