- CTT, classical test theory
- ICC, intraclass correlation coefficient
- IRT, item response theory
- MCID, minimal clinically important difference
Function of the shoulder has conventionally been assessed with objective measures such as range of motion and strength. However, objective measures can be impractical in some settings, because they are time consuming and require face to face contact. Besides, although shoulder disorders are often associated with restricted range of motion and muscle weakness, these measures have no direct clinical meaning to patients, who just want to be free of pain and perform their daily activities. Nowadays, the efficacy of treatment is more often evaluated using outcomes that are directly relevant to patients. Both in clinical practice and research, using subjective measures that assess the ability to function in daily life ensures that the treatment and evaluations focus on the patient rather than on the disease.1
In the past decade a large number of shoulder disability questionnaires have been developed, which are designed to assess physical functioning (that is, the performance of daily activities).2–17 The choice of which questionnaire to use may be based on the study group, the purpose of the questionnaire, its clinimetric quality as shown by validity, reproducibility, responsiveness, and on practical considerations (for example, ease of scoring, and how long it takes to complete). Different questionnaires may be required for different patient groups, but this should be balanced against the need to standardise results from different studies by the use of a single instrument.18 Studies comparing the content and clinimetric quality of these shoulder disability questionnaires are lacking. Consequently, little evidence is available to guide the clinician and researcher during questionnaire selection.
Garratt et al stated that structured reviews are prerequisites for standardisation.19 We developed a checklist to evaluate the clinimetric quality of the instruments as shown by their validity, reproducibility, responsiveness, interpretability, and practical burden. The purpose of this paper is to systematically review the content and clinimetric quality of all published shoulder disability questionnaires in order to provide guidelines for clinicians and researchers enabling them to choose the most appropriate measure for different purposes.
Initially, studies were identified by searches of the computerised bibliographic database Medline (1966–July 2002). Subsequently, other databases—that is, CINAHL (1988–July 2002), SPORTDiscus (1949–July 2002), and PsychINFO were searched for additional studies. The following keywords were used to identify eligible studies: shoulder, upper-extremity, disability, functional status, questionnaire, self-report, self-assessment, outcome measure, outcome assessment (MESH term or text word). The names of identified instruments were used as terms for a further search of the electronic databases. References of retrieved articles were screened for additional relevant studies.
Instruments were included in the review if they were self assessed, condition-specific (shoulder or combined shoulder-upper limb problems), and included items on disability or physical functioning (that is, the performance of activities of daily living). Studies were eligible when the main focus of the study was the development and/or the clinimetric evaluation of a shoulder disability questionnaire. Furthermore, only studies that were written as full report (that is, no abstract or letter to the editor) and had been published in English were included. No restrictions were put on the year of publication. Instruments that were developed for groups whose primary complaint did not concern shoulder disorders (for example, wheelchair users, patients with cancer) were excluded.
Data abstraction and quality assessment
A checklist was composed to evaluate and compare the clinimetric properties of the questionnaires. The checklist was partly based on the review criteria developed by the Scientific Advisory Committee of the Medical Outcomes Trust20 and the checklist developed by Bombardier and Tugwell.21 After testing the checklist on papers about other condition-specific questionnaires the final version of the checklist was completed. (This checklist can be found in appendix W1, available at http://www.annrheumdis.com/supplemental.) Two reviewers independently scored the clinimetric quality of each study, according to the checklist. If an instrument had more than one scale, only the subscales containing items on physical functioning were reviewed. Disagreements between the reviewers were discussed and resolved during a consensus meeting.
Description of the questionnaires
Descriptive data extracted from the publications included the target population, domains to which the items could be classified (pain, symptoms, physical functioning, emotional functioning, and social functioning), number of scales, number of items, response options, range of minimal and maximal score, time needed to complete the questionnaire, and study groups used in the clinimetric studies about the questionnaires.
For administrative burden the scoring method was rated: easy, when the items were simply summed; moderate, when a visual analogue scale or simple formula was used; and difficult, when either a visual analogue scale in combination with a formula or a complex formula was used. For respondent burden a positive rating was given when the questionnaires could be completed within 10 minutes.
Validity is the degree to which an instrument measures what it is supposed to measure.20 The instruments were evaluated for content and construct validity. Content validity examines the extent to which the domain of interest is comprehensively sampled by the items in the questionnaire.22 Items on the questionnaire must reflect areas that are important to patients with shoulder disorders. Therefore, studies achieved a positive rating for content validity when patients were involved during item selection. A positive rating for readability or comprehension was given when patients tested the questionnaire in a pilot study.
Internal consistency is a measure of the homogeneity of a (sub)scale. It indicates the extent to which items in a (sub)scale are intercorrelated, thus measuring the same construct. Factor analysis should be applied to determine the dimensionality of the items—that is, to determine whether or not they formed only one overall dimension or more than one. A positive rating for internal consistency was achieved when the dimensional structure of the questionnaire was explored by factor analysis and Cronbach’s α for each dimension separately was between 0.70 and 0.90.23
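As an illustration of the internal consistency criterion, the following minimal sketch computes Cronbach’s α for one (sub)scale and checks it against the 0.70 to 0.90 band. The function names, the inclusive treatment of the bounds, and the input layout (one row of item scores per respondent) are assumptions made for this example; the factor analysis step is not covered.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one (sub)scale.

    rows: one list of item scores per respondent, items in the same order.
    """
    n_items = len(rows[0])
    if n_items < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([r[i] for r in rows]) for i in range(n_items)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(r) for r in rows])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


def alpha_in_review_band(alpha, low=0.70, high=0.90):
    """True when alpha lies in the band used in this review (bounds assumed inclusive)."""
    return low <= alpha <= high


# Hypothetical data: three items that move together perfectly.
alpha = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(round(alpha, 3), alpha_in_review_band(alpha))
```

In this sketch α should be computed separately for each dimension found by factor analysis, not once for the whole questionnaire.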
Construct validity refers to the extent to which scores on a particular instrument relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the constructs that are measured.24 The associations of the questionnaire with other variables that measured disability or physical functioning were abstracted from the studies. Construct validity was considered to be adequately tested if hypotheses were specified and the results corresponded with these hypotheses.
Floor and ceiling effects were considered present if more than 15% of respondents achieved the lowest or highest possible score, respectively.25 Therefore, authors had to provide descriptive statistics of the distribution of scores, which included information on the presence of floor or ceiling effects.
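The 15% rule can be sketched as follows, taking a floor effect to mean clustering at the lowest possible score and a ceiling effect clustering at the highest. The function name, the returned keys, and the example scores are hypothetical.

```python
def floor_ceiling_effects(scores, min_score, max_score, threshold=0.15):
    """Share of respondents at each scale extreme, flagged against the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_share = sum(1 for s in scores if s == min_score) / n
    ceiling_share = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        # An effect is present only when MORE than 15% sit at the extreme.
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical 0-100 scale: 1 of 10 respondents at the minimum, 2 of 10 at the maximum.
result = floor_ceiling_effects([0] + [40] * 7 + [100] * 2, min_score=0, max_score=100)
print(result)
```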
Reproducibility is the extent to which an instrument is free of measurement error. It was assessed by rating test-retest reliability and agreement.26 We considered calculation of the intraclass correlation coefficient (ICC) for each domain as an adequate method for test-retest reliability.26 An ICC >0.70 for group comparisons was rated as positive.20,23,27 Confidence intervals should be presented as an index of the expected random variation. Application of Pearson correlation coefficients to estimate test-retest reliability was rated as doubtful, as it neglects systematic errors if present.28
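The difference between the two coefficients can be shown with a small sketch. The reviewed studies do not all report the same form of ICC, so the choice here of a two way random effects, absolute agreement, single measurement ICC (often labelled ICC(2,1)) is an assumption, as are the function names and the data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_absolute_agreement(data):
    """ICC(2,1): data holds one row per subject, one column per occasion."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical retest in which every patient scores exactly 10 points higher.
test = [10, 20, 30, 40, 50]
retest = [s + 10 for s in test]
r = pearson_r(test, retest)
icc = icc_absolute_agreement([list(pair) for pair in zip(test, retest)])
print(round(r, 3), round(icc, 3), icc > 0.70)
```

With a constant shift between occasions the Pearson coefficient stays at 1, whereas this form of the ICC drops below 1 because the systematic difference counts as disagreement.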
A measure of agreement is important to quantify measurement error and detect systematic differences between two measurements. Calculation of the 95% limits of agreement,29 the κ coefficient,30 or the standard error of measurement (SEM) was regarded as an adequate measure of agreement. It was not possible to define adequate cut off points for the result of an agreement study. Therefore, a positive rating was received when an adequate method for agreement was used.
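Two of the agreement measures named above can be sketched from paired test and retest scores. The 95% limits of agreement are taken as the mean difference plus or minus 1.96 standard deviations of the differences. Estimating the SEM as the standard deviation of the differences divided by the square root of 2 is one common choice and is an assumption here; the κ coefficient is not covered. Names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def limits_of_agreement(test, retest):
    """Return (mean difference, lower limit, upper limit) for paired scores."""
    diffs = [b - a for a, b in zip(test, retest)]
    bias = mean(diffs)   # systematic difference between occasions
    sd = stdev(diffs)    # random variation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def sem_from_differences(test, retest):
    """Standard error of measurement estimated from test-retest differences."""
    diffs = [b - a for a, b in zip(test, retest)]
    return stdev(diffs) / sqrt(2)


# Hypothetical paired scores for four patients.
test = [10, 20, 30, 40]
retest = [12, 19, 33, 40]
bias, lower, upper = limits_of_agreement(test, retest)
print(round(bias, 2), round(lower, 2), round(upper, 2))
print(round(sem_from_differences(test, retest), 2))
```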
Responsiveness refers to an instrument’s ability to detect important change over time in the concept being measured.31–33 There is no single agreed method to assess responsiveness. Calculating change scores for a group of patients whose health is expected to have changed, and examining the correlation of these scores with corresponding changes in a reference measure or transition rating, was considered to be a suitable method to assess responsiveness.32 This method requires predictions about how the results of the questionnaire should correlate with other related measures. Responsiveness was considered adequately tested if hypotheses were specified and the results corresponded with these hypotheses.
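The hypothesis based approach to responsiveness can be sketched as follows: change scores on the questionnaire are correlated with change scores on a reference measure, and the result is compared with a hypothesis stated in advance. The function names, the example hypothesis (a positive correlation of at least 0.5), and the data are assumptions for illustration only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def change_scores(before, after):
    """Follow up score minus baseline score for each patient."""
    return [a - b for b, a in zip(before, after)]


def supports_hypothesis(r, expected_positive, min_abs_r):
    """True when r has the predicted direction and at least the predicted magnitude."""
    right_direction = (r > 0) == expected_positive
    return right_direction and abs(r) >= min_abs_r


# Hypothetical baseline and follow up scores on the questionnaire and a reference measure.
q_change = change_scores([60, 70, 50, 65], [50, 50, 45, 50])
ref_change = change_scores([5, 6, 4, 6], [4, 4, 4, 4])
r = pearson_r(q_change, ref_change)
print(round(r, 2), supports_hypothesis(r, expected_positive=True, min_abs_r=0.5))
```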
Validity, reproducibility, and responsiveness depend on the setting and the population in which they are assessed. Therefore, the description of the design of each individual clinimetric study was rated. A clear description included characteristics of the study group (including diagnosis and clinical features), measurements, testing conditions, and data analysis. Furthermore, methodological weaknesses in the design or execution of a study were recorded. When an adequate description was lacking or methodological weaknesses were found, validity was rated as doubtful.
Interpretability was defined as the degree to which one can assign qualitative meaning to quantitative scores.20 The investigators should provide information about what (difference in) score would be clinically meaningful. We rated if the authors had presented a minimal clinically important difference (MCID) or if other information was present that could aid in interpreting the questionnaires’ scores—for instance, (a) presentation of means and standard deviations (SD) of patients scores before and after treatment; (b) comparative data on the distribution of scores in relevant subgroups; (c) information on the relationship of scores to well known functional measures or to clinical diagnosis; and (d) relating changes in disability score to patients’ global ratings of the magnitude of change they have experienced. Investigators had to provide at least two of the previously described types of information for a positive rating of interpretability.
The Medline search identified 553 publications, in which 22 self administered shoulder disability questionnaires were reported. The additional searches in CINAHL, SPORTDiscus, and PsychINFO identified one additional questionnaire (SPADI) and two additional studies about the clinimetric characteristics of included questionnaires.16,34 Reference tracking resulted in four additional clinimetric studies.11,15,35,36 Of the 23 identified questionnaires, 17 met the inclusion criteria. One questionnaire, the University of Pennsylvania Shoulder Scale (Upenn),37 was not evaluated because the clinimetric properties were obtained by Rasch analysis. Our checklist was based on classical test theory (CTT) and did not accommodate the evaluation of item response theory (IRT) methods. Therefore, we were not able to evaluate this questionnaire. Four questionnaires were excluded because they were developed for special groups (that is, wheelchair users, patients with bone and soft tissue sarcoma, and athletes).38–41 Two questionnaires were excluded as they had no items on physical functioning.42,43 Finally, a total of 28 studies referring to 16 shoulder disability questionnaires were included in the review.
Description of questionnaires
Table 1 presents a description of the 16 included questionnaires (full names are given in appendix 1). The DASH, UEFS, and UEFL were designed as upper extremity questionnaires that can be used for the evaluation of any joint or condition of the upper extremity, and all have been applied in patients with shoulder problems. Two questionnaires were developed for shoulder instability (SIQ, WOSI), one for rotator cuff tears (RC-QOL), and one for osteoarthritis (WOOS). The other questionnaires were developed for shoulder disorders in general. The ASES consists of a self administered and a performance based part. Only the self administered part was reviewed. The RC-QOL had the most items (n = 34), followed by the SSI (n = 31), and the DASH (n = 30), while the SSRS had the smallest number of items (n = 5).
Most questionnaires were developed out of a pool of items generated by patients, experts, and/or the investigator(s). The dimensional structure of only three questionnaires (SST, UEFS, and SPADI) was studied by factor analysis. In two studies the factor analysis of the SPADI showed loading on one factor only, although the questionnaire claims to measure two constructs: pain and disability.16,44 In contrast, factor analysis supported a two factor solution for the SST, while the SST claims to measure a single construct.44 Information on internal consistency was found for seven questionnaires. Cronbach’s α ranged from 0.71 to 0.96. The SIQ and the disability subscale of the SPADI had a Cronbach’s α above 0.90.
Construct validity was studied by correlating the score of the questionnaire with other disability questionnaires, with the physical function dimension of general health instruments, or with a global rating system for shoulder disorders. Six of 19 studies that investigated construct validity did not present hypotheses relating to the magnitude and direction of expected relationships with other instruments. The SSRS had moderate correlations with other shoulder disability questionnaires (0.47−0.50). The correlations between the SST, SSI, ASES, SPADI, and DASH were high (>0.74).
Three questionnaires showed a floor or ceiling effect: The SDQ-UK showed a ceiling effect in a community sample of people with shoulder pain, the UEFL a floor effect for older women in the community, and the SDQ-NL showed a ceiling effect for primary care patients.
Information on the test-retest reliability was found for 10 questionnaires. A Pearson correlation coefficient was used to calculate test-retest reliability of the SIQ and SRQ, while an ICC was reported for the other questionnaires. Except for the SPADI all coefficients were >0.70. Test-retest reliability of the SPADI was investigated in four studies and the ICC for the disability subscale ranged from 0.57 to 0.84.
Six studies presented information on agreement of, in total, 10 questionnaires. Methods used were the coefficient of reliability,29 the SEM, and the percentage of agreement on repeated measures.
The responsiveness of 13 questionnaires was evaluated in 14 studies. Four responsiveness studies presented hypotheses. Most studies compared scale scores before and after the treatment and presented mean change scores only. Furthermore the standardised response mean was used frequently. No data on responsiveness were found for the SDQ-UK, RC-QOL, and UEFL. The number of patients used to measure responsiveness was small (n<43) in eight of 14 responsiveness studies.
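For reference, the standardised response mean is conventionally the mean change score divided by the standard deviation of the change scores. The sketch below uses hypothetical data and function names, and is not taken from any of the reviewed studies.

```python
from statistics import mean, stdev


def standardised_response_mean(before, after):
    """Mean change divided by the standard deviation of the change scores."""
    changes = [a - b for b, a in zip(before, after)]
    return mean(changes) / stdev(changes)


# Hypothetical disability scores for four patients before and after treatment
# (lower is better, so improvement gives a negative SRM).
before = [50, 60, 70, 80]
after = [40, 45, 65, 70]
print(round(standardised_response_mean(before, after), 2))
```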
Five studies paid attention to interpretability of scores, and for three questionnaires (SRQ, SPADI, and SDQ-NL) an MCID was presented. Information on scores of different shoulder disability groups was available for the SST.35 Means and SD (or equivalent) of baseline and follow up scores or scores of relevant subgroups were available for nine questionnaires. No data on the distribution of scores from the SDQ-UK, WOSI, WOOS, SSI, and UEFL were found.
Detailed information on the clinimetric properties of the questionnaires (that is, validity, reproducibility, responsiveness, and interpretability) can be found in tables W1–W3 available at http://www.annrheumdis.com/supplemental.
Only a few studies gave an adequate description of the study design and population characteristics. Eight studies did not adequately describe their study group and in five studies information about data analyses was missing. Nine publications provided insufficient information on the methodological aspects to enable a good evaluation of the study design. Furthermore, information about non-response, subjects lost to follow up, and missing data was often lacking.
Table 2 shows the quality assessment of the 16 shoulder disability questionnaires, summarising each item as good, doubtful, or poor quality. A question mark indicates insufficient information about an aspect of quality. As results are dependent on the population studied, the kind of population is presented (that is, community, primary care, outpatient, or hospital patients). Overall, the DASH received the best ratings for its clinimetric properties (that is, 9 positive scores out of 12).
We identified 16 condition-specific questionnaires for the evaluation of physical functioning in patients with shoulder disorders for which the clinimetric characteristics had been evaluated. None of the questionnaires demonstrated satisfactory results for all categories. Overall, the DASH received the best ratings for its clinimetric properties.
When constructing a questionnaire, one should specify beforehand which constructs it is supposed to measure (that is, whether the questionnaire will be a unidimensional or multidimensional instrument). Subsequently, the theoretical dimensional structure should be tested using factor analysis. We found that this was not done properly, or not done at all. One may assume that the number of scales corresponds with the number of dimensions, but only five questionnaires had an equal number of scales and dimensions. Seven questionnaires claimed to cover more than one dimension, but had one scale only, and the SST and SPADI appeared to have different structures than stated. When the dimensionality of a questionnaire is not analysed, Cronbach’s α may not be interpretable.
The DASH and WOSI were the only questionnaires with a positive rating for test-retest reliability. Test-retest reliability of the SRQ, SSRS, SST, WOOS, and SSI was assessed using small sample sizes (n = 22–41). When statistical estimates are derived from very small samples, confidence intervals will be wide. This indicates a high degree of uncertainty about the precision of the reliability coefficient.
Our checklist was developed to evaluate the measurement properties of questionnaires based on CTT. A relatively new method to develop and evaluate health status questionnaires is IRT.45 IRT has a number of potential advantages over CTT46 and can be helpful in developing health outcome measures with better clinimetric properties. IRT makes it possible to calibrate a large number of physical functioning items on the same scale, which allows different tests to be meaningfully compared with one another, even if they are administered to completely different groups.47 Cook et al used an IRT model to investigate the trait-specific reliability of the DASH, ASES, SST, and Upenn.37 They showed that the questionnaires did not measure all levels of shoulder functioning with equal precision (that is, the questionnaires were unable to measure accurately patients with very low or very high levels of shoulder functioning). The evaluation of shoulder disability questionnaires may be improved by using IRT.37
Validity studies were available for all questionnaires. It is important to formulate hypotheses before validity testing. These hypotheses should specify both magnitude and direction of the expected correlation. The same applies to studies on responsiveness. Most authors looked at the treatment effect, but the magnitude of the treatment effect tells us little about the ability of an instrument to detect clinically relevant change.32
The presence of floor and ceiling effects may influence the responsiveness of an instrument. An intervention effect will be missed for people who occupy the lower levels of the scale before the intervention. Floor and ceiling effects are dependent upon the population being studied. The SDQ-UK had a ceiling effect for community people with shoulder pain, but not for primary care patients; the SDQ-NL showed a ceiling effect for patients with shoulder pain receiving physiotherapy treatment, but not for patients with shoulder disorders visiting their general practitioner.
More information is needed on the interpretation of scores. Only five studies paid attention to interpretability of the outcome scores and an MCID was stated for only three questionnaires (SRQ, SPADI, and SDQ-NL). When investigators do not provide an indication of how to interpret changes in health related quality of life score, the findings are of limited use to clinicians.48 Among others, Lydick and Epstein have described different approaches for interpretation of health related quality of life changes.49 It should be recognised that interpretation of the results is questionable when the clinimetric quality of an instrument is unknown or has not been adequately tested.
It is important to realise that the clinimetric properties of a questionnaire are not fixed, and may vary among different settings and populations.50 The use of various methods and various populations helps in “building” these properties. The DASH, SST, ASES, and SPADI have been studied most often. Besides rating the clinimetric properties of a questionnaire, the choice of a questionnaire depends on its purpose and applicability. An easy scoring method and information about acceptable levels of missing data enhance applicability.
Interest in using patient based instruments in clinical practice for assessment and treatment monitoring of individual patients is growing. These instruments enable clinicians to detect and treat functional and psychological problems that previously may have been missed. Furthermore, they promote shared decision making and facilitate doctor-patient communication.51 Questionnaires with fewer items and shorter administration may be more practical for routine use in clinical practice.25 Clearly, questionnaires used for clinical assessment of individual patients demand higher measurement standards than those used in groups. For test-retest reliability, an ICC >0.70 was regarded as adequate for group comparisons, yet for individual comparisons an ICC of ⩾0.90 should be required.20,23,25,27,52 This means that the SSRS and SPADI may not be applicable for individual patients. In addition, narrow confidence intervals around an individual patient score are needed to make the questionnaire useful for evaluating treatment results in individual patients. McHorney et al suggested that score confidence intervals must be fully documented before standardised health measures are routinely incorporated into clinical practice for assessment of individual patients.53 Cook et al were the only authors who presented confidence intervals around the reliability coefficients. Their results showed wide confidence intervals for the reliability coefficients of both the SPADI and ASES.54
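Why individual assessment demands a higher ICC can be illustrated with a common classical test theory approximation, in which the standard error of measurement equals the score standard deviation multiplied by the square root of one minus the reliability, and a 95% interval around an individual score is the score plus or minus 1.96 SEM. The standard deviation of 20 points and the function names below are hypothetical.

```python
from math import sqrt


def sem_from_reliability(sd, reliability):
    """Approximate standard error of measurement: SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1 - reliability)


def individual_ci_half_width(sd, reliability, z=1.96):
    """Half width of the approximate 95% interval around one patient's score."""
    return z * sem_from_reliability(sd, reliability)


# Hypothetical scale with a between patient SD of 20 points.
for icc in (0.70, 0.90):
    print(icc, round(individual_ci_half_width(20, icc), 1))
```

Under these assumptions, raising the ICC from 0.70 to 0.90 narrows the interval around an individual score from roughly ±21 to roughly ±12 points.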
This review provides information for researchers and clinicians to facilitate the choice among the existing questionnaires for shoulder disability. The “best” scale is always best for a particular purpose, where purpose is defined by the disease, the population, and the treatment.55 The DASH, SPADI, and ASES have been evaluated most often and, overall, the DASH received the best ratings for its clinimetric properties. The DASH and SPADI are recommended for evaluative purposes in outpatient clinics. These questionnaires received positive ratings for responsiveness and have no floor or ceiling effects. The SIQ is recommended for evaluation of patients with shoulder instability and the OSQ for evaluation of patients having a shoulder operation other than stabilisation. The SST is a short, unidimensional questionnaire that had an ICC of 0.99 for test-retest reliability in one study.60 Hence, for discriminative purposes the SST is suggested for patients with shoulder complaints in general. The SSRS and SPADI should not be used for assessment of individual patients.
There are no standardised criteria to evaluate the quality of subjective health measurement questionnaires. The criteria we used to evaluate the quality of the questionnaires may be disputed. However, it was not our intention to create a standardised evaluation checklist, but to provide information about the questionnaires’ clinimetric properties in order to facilitate the choice between questionnaires. Guidelines are needed to set standards and define the criteria by which these instruments should be assessed. Continuing accumulation of research evidence for the clinimetric properties of a scale is important for demonstrating the scale’s usefulness in both clinical practice and research applications.
Table 3 shows the full names of the questionnaires included.
Web-only Appendix and Tables
The appendix and tables are available as downloadable PDFs (printer friendly files).
Files in this Data Supplement:
- Appendix W1 Checklist for rating the clinimetric quality of self-assessment questionnaires
- Table W1 Content and construct validity of the shoulder disability questionnaires
- Table W2 Reproducibility of the shoulder disability questionnaires
- Table W3 Responsiveness and interpretability of the shoulder disability questionnaires