

Interrater reproducibility of clinical tests for rotator cuff lesions
A J K Ostor, C A Richards, A T Prevost, B L Hazleman, C A Speed
Rheumatology Research Unit, Addenbrooke’s Hospital, Cambridge, UK
Correspondence to: Dr A J K Ostor, Rheumatology Research Unit, Box 194, Hills Road, Cambridge CB2 2QQ, UK


Background: Rotator cuff lesions are common in the community but reproducibility of tests for shoulder assessment has not been adequately appraised and there is no uniform approach to their use.

Objective: To study interrater reproducibility of standard tests for shoulder evaluation among a rheumatology specialist, rheumatology trainee, and research nurse.

Methods: 136 patients were reviewed over 12 months at a major teaching hospital. The three assessors examined each patient in random order and were unaware of each other’s evaluation. Each shoulder was examined in a standard manner by recognised tests for specific lesions and a diagnostic algorithm was used. Between-observer agreement was determined by calculating Cohen’s κ coefficients (measuring agreement beyond that expected by chance).

Results: Fair to substantial agreement was obtained for the observations of tenderness, painful arc, and external rotation. Tests for supraspinatus and subscapularis also showed at least fair agreement between observers. 40/55 (73%) κ coefficient assessments were rated at >0.2, indicating at least fair concordance between observers; 21/55 (38%) were rated at >0.4, indicating at least moderate concordance between observers.

Conclusion: The reproducibility of certain tests, employed by observers of varying experience, in the assessment of the rotator cuff and general shoulder disease was determined. This has implications for delegation of shoulder assessment to nurse specialists, the development of a simplified evaluation schedule for general practitioners, and uniformity in epidemiological research studies.

  • shoulders
  • epidemiology
  • rotator cuff lesions
  • reproducibility

Shoulder disorders are common and cause significant morbidity and disability.1–3 The annual incidence (episodes for which general practitioners are consulted) of shoulder pain has been estimated to be >1% in adults over the age of 45 years,4 and the prevalence in population studies lies between 7 and 21%.5,6 Lesions of the rotator cuff account for most episodes of shoulder pain (up to 70%).6 Lack of consensus about appropriate diagnostic criteria for shoulder disorders, however, plagues research in this area. In a review by Buchbinder et al, none of the classification systems for soft tissue disorders of the neck and upper limb appeared acceptable.7 This lack of consensus accounts, to a large degree, for the paucity of evidence based management approaches in the treatment of rotator cuff lesions.8

The need to develop consensus criteria for upper limb disorders has been recognised by the UK Health and Safety Executive,9 and this recognition has been a major step forward in the development of structured protocols for the evaluation of specific soft tissue disorders. The criteria benefit from consensus backing but have not yet been fully validated. The schedule includes some limited clinical tests of the shoulder. Palmer and colleagues have contributed to the schedule by evaluating the criteria further and by adding items to form the Southampton Examination Schedule for the diagnosis of musculoskeletal disorders of the upper limb.10 They found that the schedule was sufficiently reproducible for epidemiological research in the general population.11

Despite these steps there is no universally recognised method for evaluating upper limb disorders, with over 20 clinical tests recommended for assessing the rotator cuff alone.12 None of these tests have been adequately validated in a general population and their sensitivity and specificity are unknown. As a consequence, patients selected for rotator cuff trials are likely to be a heterogeneous group because inclusion is based on clinical criteria. Further difficulties encountered in shoulder studies are problems with case definition,13 variable experience of examiners,14 variations in method of assessment, interpretation of physical signs, and diversity of diagnostic categories used.15–19 This has contributed to conflicting results in interobserver agreement studies, varying from excellent to poor correlation.20–23

To be applicable for widespread use, a test must be accurate and reproducible by clinicians with a spectrum of experience. Our study was undertaken to determine the reproducibility of clinical tests for rotator cuff lesions assessed by clinicians with varying levels of experience. The results would then indicate which tests had the highest interrater agreement and, therefore, were the most useful.

Methods

Selection of clinical tests

The clinical tests outlined in the Southampton schedule formed the basis of the examination series used in the study.10 Additional tests commonly used for assessment of the rotator cuff were included (table 1). The assessors consisted of a consultant rheumatologist with a particular interest in shoulder disorders, a specialist registrar in rheumatology without specific training in shoulder complaints, and a research nurse with no formal musculoskeletal training. The assessment of the consultant was deemed the benchmark against which other assessments were compared.

Table 1

 Tests used in shoulder examination

Sample population

The study group consisted of consecutive patients with shoulder pain referred to the rheumatology unit at a teaching hospital, as well as patients identified in the rheumatology outpatients department whose chief complaint was shoulder pain. The patients were then assessed in a second weekly clinic held at the hospital. Inclusion criteria for participation in the study included ability to give informed consent; age 20–85 years; and shoulder pain regardless of possible underlying aetiology. All patients who were referred were offered appointments; however, a small number cancelled before review as their symptoms had improved. Patient characteristics of this group are unknown.


Before starting the trial, two consultant-led training sessions for the other assessors occurred over consecutive weeks. Ten patients were examined by a method similar to that described by Palmer et al.10 The sessions aimed at ensuring that the observing assessors were familiar with the clinical tests involved in the examination series and how to perform them. A handbook, with full details of all clinical tests as they have been reported, was given to each observer. “Training patients” comprised predominantly those with rotator cuff tendinopathy as formally diagnosed by the consultant. Each observer examined the patient under the supervision of the trainer.

Shoulder assessment

The assessment involved three phases, starting with historical information obtained by the first observer. This was followed by inspection of the shoulder for signs of swelling, deformity, tenderness, winging, degree of external rotation, and a “painful arc”. Finally, assessment of each part of the rotator cuff was undertaken by resistance testing and was defined as either normal, weakness>pain (implying tear), or pain>weakness (implying tendinosis/tendinitis).19 Other tests involved assessment of the long head of biceps (Speed and Yergason tests12), signs of impingement using the Hawkins-Kennedy test, and assessment of the acromioclavicular joint (table 1). A provisional diagnosis was made using a predefined algorithm, depending upon the history and examination findings. More than one final diagnosis was possible using this algorithm (table 1).

Statistical methods

Between-observer reproducibility of physical signs of shoulder disease was evaluated using Cohen’s κ coefficient for categorical variables.24 Cohen’s κ is the preferred statistic for the evaluation of concordance between two clinicians for nominal categories measuring agreement beyond that expected to occur by chance. Landis and Koch proposed guidelines for the interpretation of the strength of concordance reflected by the κ coefficient: <0.00 “poor”; 0.00–0.20 “slight”; 0.21–0.40 “fair”; 0.41–0.60 “moderate”; 0.61–0.80 “substantial”; 0.81–1.00 “almost perfect”.25 A minimum sample size of 125 was chosen on the basis that an observed κ coefficient for a test would then have a 95% confidence interval that covered at most one of these concordance categories either side of the coefficient, on the assumption that the tests would have a test prevalence lying between 20 and 80%.
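As a minimal sketch of the calculation described above, the following Python computes Cohen’s κ for a 2×2 table of two raters, maps it to the Landis and Koch categories, and attaches an approximate 95% confidence interval. The cell counts are hypothetical, the function names are illustrative, and the interval uses a simple large-sample standard error; that choice is an assumption and is not necessarily the method used for the sample size calculation in this study.

```python
import math

def _agreement(a, b, c, d):
    """Return (n, observed agreement, chance-expected agreement).

    a: both raters positive      b: rater 1 positive, rater 2 negative
    c: rater 1 negative, rater 2 positive      d: both raters negative
    """
    n = a + b + c + d
    p_obs = (a + d) / n
    # Chance agreement from the product of each rater's marginal proportions.
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return n, p_obs, p_exp

def cohen_kappa(a, b, c, d):
    """Cohen's kappa: agreement beyond that expected by chance."""
    _, p_obs, p_exp = _agreement(a, b, c, d)
    return (p_obs - p_exp) / (1 - p_exp)

def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch concordance category."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

def kappa_ci(a, b, c, d, z=1.96):
    """Approximate confidence interval (assumed large-sample standard error)."""
    n, p_obs, p_exp = _agreement(a, b, c, d)
    kappa = (p_obs - p_exp) / (1 - p_exp)
    se = math.sqrt(p_obs * (1 - p_obs) / (n * (1 - p_exp) ** 2))
    return kappa - z * se, kappa + z * se

# Hypothetical counts for 125 shoulders rated by two observers.
k = cohen_kappa(40, 10, 15, 60)
low, high = kappa_ci(40, 10, 15, 60)
print(f"kappa = {k:.2f} ({landis_koch(k)}), 95% CI {low:.2f} to {high:.2f}")
```

With a sample of this size the interval around a mid-range κ typically spans more than one Landis and Koch category, which is why the category of the point estimate alone should be read with some caution.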

It is accepted that the sample size will not allow reproducibility of tests for the rare lesions (for example, subscapularis tendinopathies) to be evaluated with precision. Systematic bias between observers was examined with McNemar’s test. We did not evaluate intraobserver agreement.
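To illustrate how systematic bias between two observers can be examined, the sketch below implements an exact (binomial) form of McNemar’s test in Python. Only the discordant pairs enter the test: shoulders rated positive by one observer and negative by the other. The exact form and the example counts are assumptions made for illustration; the text does not state which variant of the test was used.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: observer 1 positive, observer 2 negative
    c: observer 1 negative, observer 2 positive
    Under no systematic bias, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of bias
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical example: one observer calls a sign present far more often.
print(f"p = {mcnemar_exact(1, 9):.3f}")
```

A small p-value suggests that one observer rates the sign as present more often than the other, which is a different question from how often the two agree.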

Results

In total, 136 patients were enrolled in the study over a 12 month period. Table 2 shows their demographic information. The number of affected shoulders was 159 with 23 subjects contributing two sets of observations for assessment; 113 shoulders were normal. The analysis assumes independent observations; however, the patients who contributed both shoulders represent a relatively small percentage of subjects and are unlikely to have an effect on the results.

Table 2

 Baseline characteristics of study subjects

κ Coefficients were calculated for all phases of the assessment. All tests were scored on a dichotomous scale. Fair to substantial agreement was obtained for the observations of tenderness, painful arc, and for external rotation (table 3). Agreement between raters was reduced owing to the lower level of prevalence of tenderness rated by the specialist registrar and nurse and the higher level of prevalence of external rotation rated by the nurse.

Table 3

 Agreement between the three clinicians about the presence of tenderness, painful arc, and external rotation in affected shoulders

Tests for supraspinatus and subscapularis also showed at least fair agreement between observers (table 4). Tests for the acromioclavicular joint showed the poorest reproducibility (table 4). For these tests, adequate agreement could not be achieved among the three raters, either on overall prevalence or on the findings in an individual patient.

Table 4

 Agreement between three clinicians about resistance testing in the affected shoulder

According to the algorithm, diagnosis of shoulder pain showed at least fair concordance for rotator cuff disorders, adhesive capsulitis, and for referred pain, and this was largely associated with a tendency for overestimation of the prevalence of diagnoses by the nurse and specialist registrar relative to the consultant. The diagnosis of impingement and acromioclavicular joint disease between consultant and nurse showed only slight agreement (table 5) (fig 1). When the patients were stratified by age, duration of symptoms, sex, and night pain, no improvement in reproducibility for a diagnosis of impingement was found (table 6). In total 40/55 (73%) of the κ coefficient assessments were rated at >0.2, indicating at least fair concordance between observers; 21/55 (38%) were rated at >0.40, indicating at least moderate concordance between observers. There was almost total agreement between observers in assessment of the unaffected shoulder. More detailed analysis of the agreement between observers with full 2×2 tables is available (see Appendix 1 at

Table 5

 Agreement between the three clinicians about diagnosis of affected arm

Table 6

 Agreement between clinicians about assessment of impingement related to age, sex, duration of symptoms, and presence of night pain

Figure 1

 Agreement among three clinicians in diagnosis in affected shoulders assessed by κ coefficient.

Discussion

The results of our study show slight to substantial agreement among observers of varying experience in assessment of the rotator cuff in patients attending a hospital clinic. This is comparable with most previous studies looking at interobserver agreement of shoulder disorders, although the correlation between research nurse, registrar, and consultant, to our knowledge, has not previously been studied. In addition, earlier studies were not designed to look specifically at the rotator cuff, which represents a narrower field. The results suggest that, with training, delegation of assessment of the shoulder to health staff with no particular expertise may be possible as the optimal agreement for diagnosis was between the specialist registrar and research nurse. This could be useful for epidemiological studies as well as for triage purposes in primary care to maximise use of resources.11

In a study by de Winter et al, reproducibility was moderate (κ = 0.45) between two physiotherapists assessing shoulder disorders.26 The poorest diagnostic agreement was found for subjects with a high degree of pain, bilateral involvement, and chronic complaints. Their patient population was more selective than in our study and therefore our results might have improved if we had not included patients with shoulder pain regardless of possible underlying cause. One explanation for the diagnostic disagreement in the study of de Winter et al was the insufficient mutual exclusivity of diagnostic categories currently used for shoulder disorders—an example being pain associated with adhesive capsulitis rendering any further assessment of the shoulder difficult. In the study of Liesdek et al,20 in contradistinction to that of de Winter et al,26 agreement between general practitioners and physiotherapists in diagnosis of soft tissue disorders of the shoulder was greater in patients with a long duration of symptoms. Their overall κ value for the classification of shoulder disorders was fair at 0.31 and might have been an overestimation, as the physiotherapists were not “blinded” to the diagnosis of the general practitioners.

Bamji et al showed poor agreement between consultant rheumatologists in shoulder assessment, with agreement in only 46% when examined independently and 78% when assessed together.21 One study has, however, shown very high agreement between physiotherapists,22 which is in contrast with the other interobserver studies. The limitations of this study were the small numbers (19 patients), the paucity of information on the clinical characteristics of the patients, and the setting of the study.

The difficulty in assessing soft tissue disorders of the shoulder is further highlighted by observer variability in shoulder movement. Croft et al found that the assessment of external rotation was poorly reproducible owing to random variation in visual estimation and systematic variation in examination technique.27 In a study by Hoving et al, adequate reliability was only obtained for the movements of total shoulder flexion and hand behind back.28

Our results show that there was at least moderate agreement for lesions of the supraspinatus muscle and for signs of capsulitis. A limitation of the κ coefficient is that, for a test of fixed sensitivity and specificity, it has been shown to be smaller when the background prevalence of the parameter in question is particularly low or high than when the prevalence is in the middle of the range (50%). Another limitation with extreme prevalence is that the κ coefficient has relatively large sample to sample variability and wide confidence intervals. Taken together, these limitations may partially explain the reduced reproducibility in tests for less common lesions, which would require evaluation in a larger sample. Differences in prevalence between study populations also complicate effective comparison of κ coefficients between studies because the observed value of κ depends on the similarity of prevalence in the studies.29
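The dependence of κ on prevalence can be illustrated with a small Python sketch. It assumes a simplified model that is not taken from this study: two raters apply a test with identical sensitivity and specificity, and their errors are independent given the true status of the shoulder. The function name and the values used are illustrative only.

```python
def expected_kappa(prevalence, sensitivity, specificity):
    """Expected kappa between two raters under the assumed model.

    Both raters share the same sensitivity and specificity, and their
    errors are conditionally independent given the true lesion status.
    """
    p, se, sp = prevalence, sensitivity, specificity
    both_pos = p * se**2 + (1 - p) * (1 - sp) ** 2
    both_neg = p * (1 - se) ** 2 + (1 - p) * sp**2
    p_obs = both_pos + both_neg
    # Probability that any one rater scores the sign as present.
    q = p * se + (1 - p) * (1 - sp)
    p_exp = q**2 + (1 - q) ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Same test quality (sensitivity and specificity of 0.9) at different prevalences.
for prev in (0.05, 0.20, 0.50, 0.80, 0.95):
    print(f"prevalence {prev:.2f}: kappa {expected_kappa(prev, 0.9, 0.9):.2f}")
```

Under these assumptions the raw proportion of agreement stays the same across prevalences, while κ is highest at 50% prevalence and falls as the lesion becomes rare or very common.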

It is axiomatic that the true worthiness of a diagnostic test lies in its ability to alter treatment and/or in prognostication. There is unfortunately scant evidence about the usefulness of clinical tests for shoulder disease in either of these areas. As the reproducibility of many of the tests in our study was poor, this hampers further efforts to use a large set of tests by various practitioners to aid in shoulder pain management. This issue is further complicated when assessment is undertaken in general practice, where the presentation or severity of disease may be quite different from that seen in a teaching hospital. The minimum set of tests we found to have at least moderate reproducibility comprised those for painful arc, external rotation, and empty can. This is a reasonable start as these tests imply lesions of the capsule (such as frozen shoulder, which is notoriously recalcitrant to treatment) and supraspinatus, the most commonly affected muscle in rotator cuff lesions and hence the main culprit causing shoulder pain.

Although we focused principally on rotator cuff lesions, rarer diagnoses encountered, such as instability, reflex sympathetic dystrophy, myofascial pain, and fibromyalgia, were not included in the diagnostic algorithm. The research nurse, who had no previous knowledge of shoulder disorders, could not make these diagnoses. This highlights one of the limitations of this study and the difficulties in standardising the examination. Missing rare diagnoses by less experienced staff may not be important, however, as patients with an unclear initial assessment could be referred for specialist opinion. We felt a diagnostic algorithm was necessary in order to translate the clinical signs into an entity which could be used for treatment.

The usual management of rotator cuff lesions, which is widely employed by clinicians although not adequately validated,8 involves a variable combination of analgesics, physiotherapy, and injection of corticosteroid and local anaesthetic into the subacromial space or glenohumeral joint. This may be initiated in primary care, with referral of the more severe or unclear cases to specialist clinics. Patients who do not improve with this treatment paradigm could also be referred for specialist assessment.

Despite an attempt at standardisation made in the Southampton Examination Schedule, diagnosis of rotator cuff pathology was the most difficult and the least defined. Training of the observers for that schedule lasted for 6 weeks until consistency was optimised, compared with two teaching sessions in our study, making it less applicable to daily practice. Our results will have to be repeated, however, to see if the reproducibility holds across all practitioners who deal with shoulder disorders and in primary care where subjects may have less defined illness. Our results might have improved with a longer duration and more intensive training session.

Validation of specific clinical tests requires a standard of reference which in many cases may not be available or easily accessible. In our study the consultant’s opinion was taken as the benchmark, as has been used in other studies10; however, this has limitations as true validity is not assessed. The consultant’s opinion was deemed appropriate and adequate, however, for a reproducibility study of this nature.

Primary care is the most appropriate location for appraisal of shoulder disorders. Direct orthopaedic referral in many institutions is becoming increasingly difficult, with the implication of increased referral to the rheumatology department. If most shoulder disorders could be adequately managed in primary care this would significantly reduce the tertiary referral workload.

In summary, this study has identified the reproducibility of certain tests, employed by observers of varying experience, in the assessment of the rotator cuff and general shoulder disease. Further study of these tests is required across all disciplines involved with shoulder disease. The results have important implications for future epidemiological and treatment studies.


Consulting services at the Centre for Applied Medical Statistics, University of Cambridge provided statistical support.


  • Web-only Appendix

    The appendix (Tables W1-6) is available as a downloadable PDF (printer friendly file).


    Files in this Data Supplement:

      Table W1 Agreement of Consultant and Specialist Registrar on resistance testing of the affected shoulder
      Table W2 Agreement of Consultant and Nurse on resistance testing of the affected shoulder
      Table W3 Agreement of Specialist Registrar and Nurse on resistance testing of the affected shoulder
      Table W4 Agreement of Consultant and Specialist Registrar on diagnosis of the affected shoulder
      Table W5 Agreement of Consultant and Nurse on diagnosis of the affected shoulder
      Table W6 Agreement of Specialist Registrar and Nurse on diagnosis of the affected shoulder
