Original articles
Statistical Issues Encountered in the Comparison of Health-Related Quality of Life in Diseased Patients to Published General Population Norms: Problems and Solutions

https://doi.org/10.1016/S0895-4356(99)00014-1

Abstract

The objectives of this study were (1) to illustrate the statistical problems encountered when comparing health-related quality of life (HRQL) measured by the Medical Outcome Study Short Form-36 (SF-36) in a diseased group to general population norms, and (2) to define age- and gender-standardized dichotomous indicator variables for each health concept and show that these indicator variables facilitate comparisons between the diseased sample and the general population. Our “diseased” group consisted of 136 sequentially consenting patients referred to the syncope clinic for assessment and treatment. Participants completed the SF-36 questionnaire before undergoing diagnostic testing. General population norms for the SF-36 are available from the responses of 2474 participants in the National Survey of Functional Health Status, conducted in 1990 in the United States. Comparison of the SF-36 in a diseased sample with general population norms is difficult, owing to skewed and unusual distributions in both groups. In addition, making comparisons within age and gender strata is difficult if the within-strata sample size is small. We propose a dichotomous indicator variable for each health concept that classifies an individual as having impaired health if he or she scored lower than the 25th percentile for the appropriate age and gender stratum of the general population. By definition, the prevalence of impaired health in the general population is 25% for all eight health concepts. Comparison between the eight health-concept variables is easy because the population norm is the same for each of them. These indicator variables are age and gender adjusted, so that even if the sample does not have the same age and gender distribution as the general population, comparisons can still be made with the value of 25%.

Introduction

Clinimetrics, especially the evaluation of health-related quality of life (HRQL), has become an important part of clinical epidemiology, clinical trials, and health care research. A number of scales to measure HRQL have been devised and validated (e.g., the Medical Outcome Study SF-36 and the EuroQol EQ-5D) and are now widely accepted [1], [2], [3], [4], [5]. General population norms have been produced and are available for comparison with study samples [3], [4]. We consider here the situation in which the HRQL of a sample of individuals with a disease is compared with general population norms to examine whether the HRQL is impaired (or the degree of impairment) in individuals with this disease. Such a study can be thought of as a case-control study with historical controls and, as such, has similar advantages and disadvantages [6]. The primary advantage is that it removes the need to select, recruit, and collect data on a control group thought to be representative of the general population. However, there are also numerous disadvantages, including the unavailability of complete data on the controls for statistical analysis. Other disadvantages peculiar to health care research may be cultural differences in reporting health, political differences that generate differences in access to the health care systems, and language problems in multicultural societies. For example, although these questionnaires are available in different languages [1], the SF-36 norms have been produced for the U.S. general population [3] and the EuroQol norms for British and European populations [5].

Keeping these limitations in mind, we first discuss the problems in the statistical analysis of studies using the SF-36. We then propose a method of analysis that will circumvent these problems by determining age- and gender-standardized cut points for each of the eight health-concept scales, which identify individuals most likely to have substantial limitations in these health concepts.
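The proposed indicator can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cut points in `CUT_POINTS`, the stratum labels, and the patient scores are hypothetical placeholders rather than the published norms, and the exact one-sided binomial test against the 25% norm is one reasonable way to make the comparison, not necessarily the analysis the authors used.

```python
from math import comb

# Hypothetical 25th-percentile cut points per (age group, gender) stratum
# for ONE health concept. Placeholder values, NOT the published norms.
CUT_POINTS = {
    ("18-24", "F"): 70.0,
    ("18-24", "M"): 75.0,
    ("25-34", "F"): 65.0,
    ("25-34", "M"): 70.0,
}


def is_impaired(score, age_group, gender, cut_points=CUT_POINTS):
    """True if the score is below the stratum's general-population 25th percentile."""
    return score < cut_points[(age_group, gender)]


def upper_tail_binomial(k, n, p=0.25):
    """Exact P(X >= k) for X ~ Binomial(n, p), i.e. a one-sided test against the norm p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical patients: (score, age group, gender).
patients = [
    (55.0, "18-24", "F"),
    (80.0, "18-24", "M"),
    (60.0, "25-34", "F"),
    (90.0, "25-34", "M"),
    (40.0, "18-24", "F"),
    (65.0, "25-34", "M"),
]

flags = [is_impaired(score, age, sex) for score, age, sex in patients]
n_impaired = sum(flags)
prevalence = n_impaired / len(flags)               # compare with the norm of 0.25
p_value = upper_tail_binomial(n_impaired, len(flags))
print(f"impaired: {n_impaired}/{len(flags)} ({prevalence:.1%}), one-sided p = {p_value:.4f}")
```

Because every stratum uses its own cut point, the reference value stays at 25% whatever the age and gender mix of the sample, which is what makes a single test against 0.25 meaningful.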

The SF-36 Health Survey is a standardized generic questionnaire designed for self-completion. It was developed from a previous questionnaire known as the Medical Outcome Study General Health Survey Instrument. The SF-36 provides information on eight scales measuring different concepts of health: (1) physical functioning, (2) role limitation due to physical health problems, (3) bodily pain, (4) general health, (5) vitality (energy/fatigue), (6) social functioning, (7) role limitation due to emotional problems, and (8) mental health (psychological distress and psychological well-being). Each scale produces a standardized score that lies between 0 and 100, with lower scores indicating poorer health or higher disability. In addition, the SF-36 has four dichotomous limitation indicators. These dichotomous indicators identify individuals with (1) physical limitations, (2) emotional limitations, (3) role disability, and (4) an unfavorable personal evaluation of their health in general. The physical-limitation indicator identifies individuals reporting any physical limitation in response to the 10-item SF-36 physical-functioning scale. The emotional-limitation indicator identifies patients who score at or below 52 on the 0 to 100 mental-health scale. An individual is counted as having a role disability when he or she endorses any of the physical-role or emotional-role scale items. An individual is classified as having an unfavorable personal evaluation when he or she rates his or her health in general to be “fair” or “poor,” as opposed to “excellent,” “very good,” or “good.”

General population norms for the SF-36 are available from the responses of 2474 participants in the National Survey of Functional Health Status, conducted in 1990 in the United States [2]. These norms are provided by means of descriptive summary statistics of the distribution (mean, standard deviation, and quartiles), by age and gender, for each of the eight scales of the SF-36 and prevalence by age and gender of the four dichotomous limitation indicators. The presentation of the mean and standard deviations of the dimensions of the SF-36 suggests that parametric methods of analysis can be used. However, the distributions are far from “normal,” as has been indicated previously [7].

In general, comparisons of the distribution of scores in the case group (i.e., individuals with hypothesized impaired HRQL) with the distribution of scores in the general population are difficult. Six of the eight scales of the SF-36 can be thought of as “continuous” in nature, with scores ranging from 0 to 100. As expected, scores in the general population tend to be left-skewed, with the majority of individuals in relatively good health. In samples of diseased individuals with severely impaired HRQL, the reverse skew can even occur. In both cases, the mean of such a distribution may be misleading, as it may not reflect the center of the distribution and may be unduly influenced by unusual extreme values.

The standard deviation of a normally (or near normally) distributed variable is a useful measure of the variability in the sample, as approximately 68% (95%, 99%) of the observations will lie within one (two, three) standard deviations of the mean. For a skewed distribution, even if these percentages approximately reflect the number of observations between the bounds, the observations are unlikely to be symmetrically distributed on either side of the mean, as is implied when the mean ± standard deviation is presented.
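A small numerical sketch may make the point concrete. The ten scores below are invented for illustration only (they are not SF-36 data from the study or the norms); they mimic a left-skewed scale bounded at 100, with most respondents near the ceiling and one very low score.

```python
from statistics import mean, median, stdev

# Hypothetical left-skewed scores on a 0-100 scale (illustrative values only).
scores = [100, 100, 100, 95, 95, 90, 90, 85, 80, 30]

m = mean(scores)      # pulled down by the single low score
med = median(scores)  # closer to where most observations sit
sd = stdev(scores)    # sample standard deviation

lower, upper = m - sd, m + sd
print(f"mean = {m:.1f}, median = {med:.1f}, SD = {sd:.1f}")
print(f"mean ± SD = ({lower:.1f}, {upper:.1f})")

# Count observations on each side of the mean inside the mean ± SD band.
below = sum(1 for s in scores if lower <= s < m)
above = sum(1 for s in scores if m <= s <= upper)
print(f"within one SD: {below} below the mean, {above} above the mean")
```

In this toy sample the upper bound of mean ± SD falls above the maximum possible score of 100, and the observations inside the band are not split evenly around the mean, which is the kind of distortion the paragraph above describes.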

The median is the most useful measure of central tendency in a nonsymmetric distribution, and in a symmetric distribution, the mean will be approximately equal to the median. Appropriate percentiles, such as the 25th and 75th, can be used to describe the spread of the distribution. Problems arise when we want to compare the distributions statistically to determine whether there is evidence of a statistically significant difference between the sample and the general population. One solution is to calculate a confidence interval for the median. A significant difference can be declared if the median for the general population does not lie within the 95% confidence interval for the sample median.
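One way to obtain such an interval without assuming normality is the standard distribution-free interval based on order statistics. The sketch below is an assumption about how this could be done, not the authors' stated procedure; the function names `median_ci` and `differs_from_norm` are hypothetical, and the interval is conservative (its coverage is at least the nominal level).

```python
from math import comb


def median_ci(values, conf=0.95):
    """Distribution-free confidence interval for the median from order statistics.

    Returns (x_(j), x_(n-j+1)), where j is the largest rank with
    P(B <= j - 1) <= alpha / 2 for B ~ Binomial(n, 0.5).
    """
    xs = sorted(values)
    n = len(xs)
    alpha = 1.0 - conf
    cumulative = 0.0
    j = 0
    for k in range(n):
        cumulative += comb(n, k) / 2**n   # adds P(B = k)
        if cumulative > alpha / 2:
            break
        j = k + 1
    if j == 0:
        raise ValueError("sample too small for the requested confidence level")
    return xs[j - 1], xs[n - j]


def differs_from_norm(values, population_median, conf=0.95):
    """True if the population median lies outside the sample's median CI."""
    low, high = median_ci(values, conf)
    return not (low <= population_median <= high)


# Hypothetical sample scores; the population median of 85 is also a placeholder.
sample = [20, 35, 40, 45, 50, 55, 60, 65, 70, 90]
print(median_ci(sample))              # order statistics 2 and 9 for n = 10
print(differs_from_norm(sample, 85))
```

Note that this treats the published population median as a fixed value; it does not account for the sampling error of the norm itself.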

Although this provides a method for statistically comparing the center of the distributions, it does not solve all the problems in comparing other aspects of the distribution or the variability of the distribution. This is an important consideration because when investigating HRQL in health-impaired groups, interest usually lies in the degree of health impairment that might be experienced by patients at the lower end of the scale. In addition, this does not take into consideration the age and gender distribution of the sample. For realistic comparisons, we need to be comfortable with assuming that the age and gender distribution in our sample is similar to that in the population for which the norms were developed.

Norm scores for HRQL in the general population differ substantially according to age and gender. Norms for the SF-36 are presented by age group (seven groups from 18–24 to 75 and older) and gender [2]. The number of participants in each of these subgroups in the general population sample ranges from 103 to 503. Unfortunately, the numbers in these strata for the sample of diseased individuals are often much smaller, which makes comparisons within age and gender strata difficult. Overall comparisons of the sample with the general population will be inaccurate, unless the sample has age and gender distributions representative of the general population.

Last, we must keep in mind that the general population norms are calculated from a sample of the general population, albeit a large sample. Therefore, the sampling error of the estimate of the population norms should also be taken into account. This becomes increasingly important if there are to be comparisons made within age and gender strata.

The two remaining scales of the SF-36, role-emotional and role-physical, are inherently categorical although the scores range between 0 and 100. The possible standardized scores for role-physical are 0, 25, 50, 75, and 100 and for role-emotional are 0, 33, 67, and 100. The use of the mean and standard deviation is very misleading for these variables. In these cases, it would be better to use statistical methods for ordered categorical data.

Clearly, for all eight of the health-concept variables, comparisons with the general population based on statistical methods that require the assumption of a normal distribution (means and confidence intervals or t tests) are inappropriate. Analyses using rank-based methods (nonparametric statistics) are impossible because the original data for the control group are not available.

Norms for the general U.S. population for these four dichotomous indicator variables are presented as the prevalence of these limitations by age (seven groups from 18–24 to 75 and older) and gender [2]. Given that HRQL in the general population varies according to age and gender, it is important to ensure that an apparent difference between the diseased sample and the general population is not primarily due to a different distribution of age and gender within the sample.
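A common way to guard against this is to compute the prevalence that would be expected in the sample if each age and gender stratum experienced the general-population rate (indirect standardization), and compare the observed prevalence with that figure rather than with the overall population prevalence. The sketch below illustrates the arithmetic only; the technique is named here as an assumption rather than as the authors' method, and all prevalences and counts are hypothetical placeholders, not the published norms.

```python
# Hypothetical general-population prevalence of one limitation indicator,
# by (age group, gender) stratum. Placeholder values, NOT the published norms.
POP_PREVALENCE = {
    ("18-24", "F"): 0.10,
    ("18-24", "M"): 0.08,
    ("65-74", "F"): 0.30,
    ("65-74", "M"): 0.25,
}

# Hypothetical number of sampled patients in each stratum.
SAMPLE_COUNTS = {
    ("18-24", "F"): 40,
    ("18-24", "M"): 30,
    ("65-74", "F"): 20,
    ("65-74", "M"): 10,
}


def expected_prevalence(pop_prevalence, sample_counts):
    """Prevalence expected in the sample if every stratum had the population rate."""
    total = sum(sample_counts.values())
    expected_cases = sum(pop_prevalence[s] * n for s, n in sample_counts.items())
    return expected_cases / total


expected = expected_prevalence(POP_PREVALENCE, SAMPLE_COUNTS)
print(f"expected prevalence given the sample's age/gender mix: {expected:.3f}")
```

With a young sample such as this hypothetical one, the expected prevalence sits well below what an older population would show, so comparing against an unadjusted overall norm could overstate or understate the apparent difference.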

The SF-36 dichotomous limitation indicators appear to provide somewhat arbitrary measures of limitation. The prevalence of limitations in the U.S. general population ranges from 4.4% to 96.5%, depending on the limitation and age and gender strata. The overall prevalence of physical limitations is 61.2% (ranging from 31.3% to 96.5% depending on age and gender); the overall prevalence of role disability is 42.8% (range, 26.5 to 39.5); the overall prevalence for emotional limitations is 13.4% (range, 7.4 to 18.8); and the overall prevalence for fair/poor personal health evaluation is 14.6% (range, 6.6 to 36.0). Thus, interpretation of “limitation” is difficult.

The dichotomous limitation indicators are not available for all eight health dimensions. In addition, the role-disability variable is a combination of the role limitation due to emotional problems and the role limitation due to physical problems. In some patient groups, it may be critical to make the distinction between these two forms of limitations.


Methods

We illustrate these problems using the data from a study in which they were encountered. Our “diseased” group sample consisted of 136 sequentially consenting patients referred to the syncope clinic for assessment and treatment. Participants completed the SF-36 questionnaire before undergoing diagnostic testing.

Results

The sample consisted of 136 (79 female and 57 male) patients with a mean age of 40 years (SD = 17). In Figure 1, we illustrate the nonnormal behavior of the eight health-concept variables in patients with syncope and compare these distributions to the general population norms. Although patients with syncope do appear to have a different distribution than the general population, it is difficult at this point to draw a definitive conclusion as to how different these distributions are.

In Figure

Discussion

In summary, comparison of the SF-36 in a diseased sample with general population norms is often more difficult than anticipated. The two role concepts are inherently categorical, and the continuous-type variables are frequently skewed. In addition, when measuring HRQL it is mandatory to account for age and gender, which is difficult to do if the within-strata sample size is small. We propose that an individual be classified as having impaired health on a health concept if he or she scored lower

Acknowledgements

Supported in part by a grant (PG11188) from the Medical Research Council of Canada, Ottawa, Canada to Sheldon.

