The Southampton examination schedule for the diagnosis of musculoskeletal disorders of the upper limb
  1. Keith Palmer,
  2. Karen Walker-Bone,
  3. Cathy Linaker,
  4. Isabel Reading,
  5. Samantha Kellingray,
  6. David Coggon,
  7. Cyrus Cooper
  1. The MRC Environmental Epidemiology Unit, University of Southampton, Southampton General Hospital, Southampton SO16 6YD
  1. Dr Palmer


OBJECTIVES Following a consensus statement from a multidisciplinary UK workshop, a structured examination schedule was developed for the diagnosis and classification of musculoskeletal disorders of the upper limb. The aim of this study was to test the repeatability and the validity of the newly developed schedule in a hospital setting.

METHOD 43 consecutive referrals to a soft tissue rheumatism clinic (group 1) and 45 subjects with one of a list of specific upper limb disorders (including shoulder capsulitis, rotator cuff tendinitis, lateral epicondylitis and tenosynovitis) (group 2), were recruited from hospital rheumatology and orthopaedic outpatient clinics. All 88 subjects were examined by a research nurse (blinded to diagnosis), and everyone from group 1 was independently examined by a rheumatologist. Between observer agreement was assessed among subjects from group 1 by calculating Cohen's κ for dichotomous physical signs, and mean differences with limits of agreement for measured ranges of joint movement. To assess the validity of the examination, a pre-defined algorithm was applied to the nurse's examination findings in patients from both groups, and the sensitivity and specificity of the derived diagnoses were determined in comparison with the clinic's independent diagnosis as the reference standard.

RESULTS The between observer repeatability of physical signs varied from good to excellent, with κ coefficients of 0.66 to 1.00 for most categorical observations, and mean absolute differences of 1.4°–11.9° for measurements of shoulder movement. The sensitivity of the schedule in comparison with the reference standard varied between diagnoses from 58%–100%, while the specificities ranged from 84%–100%. The nurse and the clinic physician generally agreed in their diagnoses, but in the presence of shoulder capsulitis the nurse usually also diagnosed shoulder tendinitis, whereas the clinic physician did not.

CONCLUSION The new examination protocol is repeatable and gives acceptable diagnostic accuracy in a hospital setting. Examination can feasibly be delegated to a trained nurse, and the protocol has the benefit of face and construct validity as well as consensus backing. Its performance in the community, where disease is less clear cut, merits separate evaluation, and further refinement is needed to discriminate between discrete pathologies at the shoulder.

  • classification
  • diagnosis
  • neck
  • upper limb
  • validity

Musculoskeletal disorders of the upper limb and neck are a common cause of morbidity,1-10 but their exact frequency and burden on health in the community are difficult to ascertain. This difficulty arises in part because they comprise a heterogeneous group of clinical disorders and non-specific regional pain syndromes. Disagreement exists about case definition and about the distinction, relation and overlap between conditions; this lack of consensus hampers meaningful comparison between studies.11 ,12

To address this issue, a workshop convened by the UK Health and Safety Executive (HSE) used a “Delphi” technique to develop consensus criteria for some of the more common upper limb disorders.13 By collating, analysing and re-discussing opinions in a structured manner, a broadly constituted group of rheumatologists, orthopaedic surgeons, occupational physicians, epidemiologists, physiotherapists and ergonomists were able to agree diagnostic criteria for nine categories of upper limb disorder (table 1).

Table 1

Diagnostic criteria for upper limb disorders proposed by the HSE Workshop (adapted from Harrington et al, 1998 13)

These criteria provide a useful starting point for surveys of upper limb and neck complaints in the general population, but do not include full detail of the relevant procedures and definitions.12 It is not apparent, for example, whether two observers would agree where the boundaries of each anatomical region lie, what procedures should be adopted to elicit pain on resisted movement, where the pain should be felt, and what degree of restricted movement is important.

Using the Workshop criteria as a basis for case definition, we have developed a detailed schedule of diagnostic procedures that could be followed in epidemiological field studies, together with a pre-defined algorithm by which the findings are translated into diagnoses. As a first step towards determining the validity and repeatability of our schedule, we have investigated its performance in a hospital setting among relatively clear cut cases of upper limb disease.


The repeatability of the schedule was assessed by comparing the physical signs elicited in a group of hospital outpatients who were examined independently by two observers—a research nurse (CL) and a rheumatologist (KWB). In addition, the face validity of diagnoses derived from the nurse's examination was assessed in comparison with diagnoses made independently in the hospital clinic as the reference standard.

The nurse and the rheumatologist had been trained in the examination schedule through a series of examinations carried out jointly on normal subjects and outpatients with soft tissue rheumatism, followed later by independent assessments with a comparison of findings. Training continued over a six week period (12 sessions) until a high level of consistency was achieved between the observers.

The examination entailed recording the location of pain at the shoulder, elbow, wrist and hand; eliciting signs of tenderness and pain on resisted movement at these sites; conducting three standard clinical provocation tests (Finkelstein's test, Phalen's test and Tinel's test), and searching for tender spots, as described in the American College of Rheumatology criteria for diagnosis of fibromyalgia.14 The range of shoulder movement was measured in accordance with the methods described by Norkin and White,15 using a goniometer for external rotation and a plane inclinometer (pleurimeter) for other shoulder movements.16 Lateral flexion, forward flexion and extension of the neck were also measured using the plane inclinometer. A hand diagram similar to that used by Katz et al 17 was used to record and classify the pattern of sensorineural complaints in the hand. The full schedule also includes a procedure for field measurement of median nerve conduction latencies using a portable electroneurometer: this method has been evaluated elsewhere,18 and is not reported here. (Further particulars of the schedule are available on request).

The study sample came from two sources. Subjects for the repeatability element of the survey were drawn from consecutive referrals to the soft tissue rheumatism clinic at Southampton General Hospital (group 1). Recruitment occurred over a seven month period between November 1997 and May 1998. Everyone who had been referred because of neck or upper limb symptoms was eligible, and all agreed to participate. Each subject underwent two examinations, one by the nurse and one by the rheumatologist. The examinations were conducted during the patient's visit to the clinic and were performed independently, spaced by an interval of a few minutes.

For the assessment of validity, information was separately abstracted on the diagnosis or diagnoses made by the clinic. In addition, to supplement this part of the investigation, a second group of subjects was recruited from patients attending rheumatology and orthopaedic outpatient clinics at Southampton General Hospital and two district general hospitals (group 2). Consecutive cases with one or more of a specified list of upper limb diagnoses were identified by the doctor in the clinic and invited to undergo examination by the research nurse. The conditions included were: adhesive capsulitis, bicipital tendinitis, rotator cuff tendinitis, lateral epicondylitis, medial epicondylitis, carpal tunnel syndrome, de Quervain's disease of the wrist, and tenosynovitis of the wrist. Everyone who was eligible agreed to be examined. Examinations were performed during the clinic visit with the nurse blind to the diagnosis that had been made in the clinic.

Observations on subjects from group 1 contributed to the assessment of repeatability, while those on subjects from both groups were used to assess the validity of the schedule. The between observer repeatability of physical signs was assessed among patients from group 1 by calculating a Cohen's κ coefficient (or weighted κ coefficient, as appropriate) for categorical variables,19 and mean differences and limits of agreement (mean difference ±2SD)20 for the continuous variables (range of shoulder movement and range of neck movement). The repeatability of the diagnoses derived from the examination was also compared between observers. Finally, for all subjects (that is, both group 1 and group 2), the nurse's derived diagnoses were compared with those reached in the clinics, and the sensitivity and specificity of the nurse's diagnoses were determined, with those of the clinician as the reference standard.
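The unweighted κ statistic used for dichotomous signs can be illustrated with a minimal sketch. The function name `cohens_kappa` and the ratings below are hypothetical and are not taken from the study data; the weighted κ used for ordered categories is not shown.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        # Kappa is undefined when both raters use one identical category;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings: 1 = sign present, 0 = sign absent, for ten limbs
nurse = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
rheumatologist = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
kappa = cohens_kappa(nurse, rheumatologist)
print(f"kappa = {kappa:.2f}")
```

In this invented example the observers agree on nine of ten limbs, but part of that agreement is expected by chance alone, which is why κ is lower than the raw proportion of agreement.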


Altogether, 88 subjects were examined (43 patients in group 1 and 45 in group 2). Participants had a median age of 49 years (interquartile range 40–57) and 46 (52%) were women. Table 2 records the diagnoses made by the clinics in these subjects. A total of 56 subjects were identified as having one of the specific upper limb disorders covered by the schedule, three cases had two, and one case had three specific diagnoses. Twenty eight subjects had none of the disorders, but were suffering from other rheumatic complaints. The most common diagnoses were adhesive capsulitis (15 subjects altogether), carpal tunnel syndrome (15), rotator cuff tendinitis (12) and lateral epicondylitis (11). Among subjects considered to have none of the scheduled disorders, the most common diagnoses were cervical spondylosis or brachial neuralgia (14), diffuse arm pain (6), fibromyalgia (2), and seronegative polyarthritis (2).

Table 2

Clinical diagnoses in the study sample


The assessment of repeatability was based on the 43 subjects (86 limbs) from group 1. Table 3 records the extent of agreement on physical signs between the two observers. κ Coefficients varied between 0.54 and 0.93 at the shoulder, between −0.02 and 0.79 at the elbow and between 0 and 1 in the forearm and hand. There were particularly high levels of agreement at the shoulder for presence of a painful arc (κ = 0.93) and pain on resisted external rotation (κ = 0.90), flexion (κ = 0.83) and abduction (κ = 0.81); at the elbow for signs of lateral epicondylitis (κ = 0.75 for tenderness over the lateral epicondyle and for pain on resisted wrist extension); and in the hand for abnormal sensation of light touch affecting the little finger (κ = 1.0) and Finkelstein's test (κ = 0.79). Tenderness of the medial epicondyle and resisted movements of the fingers and thumb proved to be the least repeatable signs (κ −0.02 to 0.55).

Table 3

Between observer repeatability of physical signs included in the schedule

An analysis based on limbs (rather than subjects) could be biased towards agreement through a lack of independence in individuals between paired observations from their right and left arms. To allow for within subject concordance, we recalculated the κ values by examining agreement within patients rather than limbs, but the findings were little changed. Thus, for abnormal light touch in the thumb, the κ value fell from 0.39 to 0.38, while smaller differences than this were found for all of the other estimates of agreement.

The extent of between observer agreement for measurements of neck and shoulder movement is shown in table 4. Mean absolute differences were comparatively small: 1.4–11.9° for active shoulder movements; 1.4–11.0° for passive shoulder movements; and 0.1–6.2° for active cervical movements. Internal rotation was the least repeatable of the three measurements at the shoulder used in the algorithm to assess for adhesive capsulitis.

Table 4

Repeatability of measurements of shoulder and neck movement included in the examination schedule
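For continuous measurements such as ranges of movement, the mean difference and limits of agreement described in the methods can be sketched as follows. This is an illustration only: the function name `limits_of_agreement` and the readings are hypothetical, and the multiplier of 2 follows the "±2SD" convention stated in the methods.

```python
from statistics import mean, stdev


def limits_of_agreement(measure_a, measure_b, k=2.0):
    """Return (mean difference, lower limit, upper limit) for paired readings."""
    if len(measure_a) != len(measure_b) or len(measure_a) < 2:
        raise ValueError("need at least two paired measurements")
    # Signed difference between observers for each subject
    diffs = [a - b for a, b in zip(measure_a, measure_b)]
    bias = mean(diffs)
    spread = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - k * spread, bias + k * spread


# Hypothetical active shoulder abduction readings, in degrees
nurse_degrees = [170, 165, 150, 120, 175]
rheumatologist_degrees = [168, 170, 148, 125, 171]
bias, lower, upper = limits_of_agreement(nurse_degrees, rheumatologist_degrees)
print(f"mean difference {bias:.1f}, limits {lower:.1f} to {upper:.1f} degrees")
```

A mean difference close to zero suggests no systematic offset between observers, while the width of the limits indicates how far two readings on the same patient may plausibly differ.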

In keeping with the broad agreement between observers on physical signs, application of the diagnostic algorithm produced complete between observer concordance on the diagnoses derived from the examination schedule.


The analysis of sensitivity and specificity was based on 176 limbs in all 88 subjects. Table 5 records the number of people who had each of the specified diagnoses according to the clinic doctors, the number according to the nurse, the number for whom the nurse and clinic doctors agreed on the diagnosis, and the sensitivity and specificity of the examination schedule based on the nurse's assessment relative to that of the clinic as the standard.

Table 5

The sensitivity and specificity of the examination schedule

Altogether, 65 specific diagnoses were made by the clinic and 71 by the nurse. For the more common conditions the schedule had a high overall specificity (84%–100%) and a somewhat lower sensitivity: adhesive capsulitis (87%), rotator cuff tendinitis (58%), lateral epicondylitis (73%), carpal tunnel syndrome (67%) and de Quervain's disease (71%).

The diagnosis of rotator cuff tendinitis was found to be the main area of disagreement between the nurse and clinic physician. The nurse diagnosed this condition more often (19 subjects versus 12 for the clinic); and in only seven of 24 cases diagnosed by either the nurse or clinic was there complete agreement between them. Among 12 other cases that were diagnosed by the nurse but not the clinic, adhesive capsulitis was also diagnosed both by the nurse and by the clinic. This suggests that in cases of capsulitis the schedule also makes a diagnosis of shoulder tendinitis, but practising clinicians do not. Five cases were diagnosed only by the clinic, and in these cases the patient did not report shoulder pain on the nurse's inquiry (the algorithm does not allow tendinitis to be diagnosed in the absence of shoulder pain). Similarly, absence of reported pain rather than restricted shoulder movements accounted for the discrepancy in two cases of adhesive capsulitis that were diagnosed by the clinic but not by the nurse.
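The counts quoted for rotator cuff tendinitis can be checked with a short calculation. The sketch below uses only the three figures given in the text (12 clinic cases, 19 nurse cases, 7 agreed); the function and variable names are illustrative. Specificity is defined but not computed from study data, because the number of true negatives is not stated in the text.

```python
def sensitivity(true_positives, reference_positives):
    """Proportion of reference-standard cases that the test also identifies."""
    return true_positives / reference_positives


def specificity(true_negatives, reference_negatives):
    """Proportion of reference-standard non-cases that the test also excludes."""
    return true_negatives / reference_negatives


# Figures quoted in the text for rotator cuff tendinitis
clinic_cases = 12   # diagnosed by the clinic (reference standard)
nurse_cases = 19    # diagnosed from the nurse's examination
agreed_cases = 7    # diagnosed by both

diagnosed_by_either = nurse_cases + clinic_cases - agreed_cases
nurse_only = nurse_cases - agreed_cases
clinic_only = clinic_cases - agreed_cases
rotator_cuff_sensitivity = sensitivity(agreed_cases, clinic_cases)

print(diagnosed_by_either, nurse_only, clinic_only)
print(f"sensitivity = {rotator_cuff_sensitivity:.0%}")
```

The derived values reproduce the 24 cases diagnosed by either party, the 12 nurse-only and 5 clinic-only cases, and the reported sensitivity of 58%.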


There have been many investigations of the frequency,1-10 causes24-27 and treatment28-31 of soft tissue disorders of the upper limb and neck, particularly in occupational health settings, but interpretation of these studies has been hampered because of differences in nomenclature and approaches to case definition.12 ,13 ,32 Some studies have focused only on regional pain symptoms, but these health end points are too non-specific to have direct clinical value. Orthodox approaches to classification, based on classic textbook descriptions of expected clinical features,33 provide an alternative framework for assessment, but one that has not been standardised or validated. Differences exist in taxonomy and the current use of diagnostic labels,32 and in most cases no reliable standard exists against which to resolve disagreements.

Several researchers have adopted a more structured approach to the classification of soft tissue disorders of the upper limb and neck. A systematic review by Buchbinder et al 32 identified four systems—two from Finland33 ,34 and two from North America1 ,35—in which explicit criteria were proposed, intended to classify all, or a significant proportion, of soft tissue disorders into distinct categories. All four schemes were developed to investigate neck and upper limb disorders in epidemiological surveys in the occupational health setting. However, all were criticised because of their failure to demonstrate satisfactory between and within observer repeatability, or to demonstrate construct validity against alternative systems in the same domain of enquiry. In addition, the protocols of Viikari-Juntura34 and Silverstein35 were considered to be elaborate, requiring one or more trained specialists with special skills to undertake the assessment; while in the case of McCormack et al 1 and Waris et al 33 there had been a failure to demonstrate that a delegated examination was adequate and accurate. Another concern was the failure to cater for patients who partially fulfilled the criteria or who had patterns of complaint that fell outwith the specified categories. To these concerns may be added the practical observation that no system specified a method for conducting the assessment that could be described and followed by third parties.

Buchbinder et al recommended that future work should be directed toward improving the systems of classification or developing new approaches that fulfilled basic measurement criteria. Our own study aimed to establish a practical examination schedule that has a measurable standard of construct validity and repeatability, and that can feasibly be delegated to research nurses in field epidemiological investigations. The criteria that underlie the protocol were based on the views of a multi-disciplinary workshop, and have the benefit of face validity and consensus backing.

Although consensus was reached at the HSE workshop, the number of participants was too small for it to be represented as a UK view, much less an international view. Nevertheless, the level of agreement achieved has enabled us to develop, test and endorse a practical measuring instrument that performs well in relation to clinical opinion. Assessment was also made of the between observer repeatability of elements of the examination and of its diagnostic conclusions. The data indicate that the components of the examination schedule are generally repeatable, as judged by Fleiss's criteria (a κ greater than 0.75 is said to denote excellent agreement and that of 0.4–0.75 a good agreement).36 These findings were based on an analysis of limbs, and might have been biased towards agreement if paired observations on the right and left side were not independent; but the κ values changed little in an analysis based on individual subjects rather than limbs. Furthermore, after a period of training, no systematic differences were evident between observers in the elicitation of signs such as tenderness and pain on resisted movement. The derived diagnoses were also found to be repeatable, although numbers in several of the diagnostic groups were small.
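The interpretation bands attributed to Fleiss can be expressed as a simple lookup, sketched below. The "excellent" and "good" thresholds are those quoted in the text; the "poor" label for values below 0.4 is an assumption based on the usual statement of Fleiss's scheme and is not given in this article. The function name `fleiss_band` is hypothetical.

```python
def fleiss_band(kappa):
    """Label a kappa value using the bands quoted in the text."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.4:
        return "good"
    # Assumption: values below 0.4 are labelled "poor"; the article
    # itself names only the two upper bands.
    return "poor"


# Kappa values quoted in the results section
quoted = {
    "painful arc": 0.93,
    "Finkelstein's test": 0.79,
    "tenderness over lateral epicondyle": 0.75,
    "abnormal light touch, thumb": 0.39,
}
for sign, kappa in quoted.items():
    print(f"{sign}: {kappa:.2f} -> {fleiss_band(kappa)}")
```

Note that a value of exactly 0.75 falls in the "good" band under this reading, because the text specifies "greater than 0.75" for excellent agreement.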

Our findings on reliability follow a period of training, provided to promote consistency between the observers. In this context, it should be noted that between observer agreement is not a fixed quantity, but varies according to the degree of training undertaken and how recent it is (and so periodic refresher training and re-evaluation would be warranted when using the instrument in a longitudinal investigation).

There was a high overall level of specificity with an acceptable sensitivity. The reference standard in our analysis of sensitivity and specificity was the diagnosis made in the clinics. We recognise that other clinicians in other clinics might have reached different diagnoses. However, the level of agreement achieved demonstrates face validity in relation to the everyday opinions of a group of rheumatologists and orthopaedic surgeons. In addition, attempting to assess a large number of diagnostic entities, in contrast with studies that consider the presence or absence of a single disorder, poses an extra challenge in interpretation. The imperative for this inclusive approach arises from the need to distinguish between competing diagnostic possibilities, but it has also enabled us to explore the overlap between disorders.

Although the schedule performed well in many respects, there were several areas of difficulty. Our observations suggest that less stringent criteria are used by hospital doctors than required by the HSE Workshop when faced with patients complaining of elbow pain. The criteria for epicondylitis proposed by the Workshop require local pain plus two physical signs, but relaxing the criteria to require only one of the two signs would increase the sensitivity for a diagnosis of lateral epicondylitis from 73% to 91% while only reducing the specificity from 97% to 96%. A similar adjustment to the algorithm would improve the sensitivity for a diagnosis of medial epicondylitis from 0% to 67% but leave the specificity little changed at 98% (down from 100%). In keeping with others,37 we found that patients with physical signs of carpal tunnel syndrome did not always shade a hand diagram in a way that indicated a classic or probable distribution of symptoms. Relaxing the criteria in the presence of supporting signs to permit a diagnosis based on a “possible” pattern of symptoms (as defined by Katz et al 17) would have resulted in complete agreement between the nurse and clinics. Adjustments to the algorithm might improve its performance in this setting, but could lead to an unacceptable rate of false-positive diagnoses when it was applied in the general population.
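The effect of relaxing the epicondylitis rule can be illustrated with a small sketch. It is not the study's algorithm: the class and function names are hypothetical, and the two signs are assumed to be those named in the repeatability results (tenderness over the lateral epicondyle and pain on resisted wrist extension).

```python
from dataclasses import dataclass


@dataclass
class ElbowFindings:
    """Hypothetical record of one limb's findings at the lateral elbow."""
    lateral_elbow_pain: bool
    lateral_epicondyle_tenderness: bool
    pain_on_resisted_wrist_extension: bool


def lateral_epicondylitis(findings, min_signs=2):
    """Local pain plus at least `min_signs` of the two physical signs."""
    signs_present = sum([
        findings.lateral_epicondyle_tenderness,
        findings.pain_on_resisted_wrist_extension,
    ])
    return findings.lateral_elbow_pain and signs_present >= min_signs


# A limb with pain and tenderness but no pain on resisted wrist extension
limb = ElbowFindings(True, True, False)
strict = lateral_epicondylitis(limb)                 # pain plus both signs
relaxed = lateral_epicondylitis(limb, min_signs=1)   # pain plus either sign
print(strict, relaxed)
```

A limb of this kind is a case only under the relaxed rule, which is how lowering the threshold can raise sensitivity at some cost in specificity.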

The discrimination between disorders of the shoulder was particularly difficult. In this small sample the doctors in the clinics never diagnosed rotator cuff tendinitis and adhesive capsulitis in the same person, whereas the schedule nearly always diagnosed the former in the presence of the latter. The data in table 4 indicate an element of disagreement between observers in the estimated range of shoulder movements (albeit similar to that found elsewhere16), but altering the cut points used to define restriction did not affect the extent of overlap, which is not likely to be explained by this source of imprecision. Alternatively, differences might have occurred because the conditions frequently coexist but doctors fail to consider or record a second diagnosis, assuming capsulitis to be the main underlying pathology. However, a more probable explanation is that subjects with capsulitis also experience pain on resisted movement, and thereby fulfil the definition for tendinitis proposed by the HSE Workshop. Refinement of the diagnostic criteria may therefore be appropriate, and the problem merits more detailed inquiry, including observations on a larger sample, studies of natural history, and perhaps investigation with magnetic resonance imaging.

The schedule was tested in a panel of patients from hospital clinics, many of whom were known to be suffering specific disorders of the upper limb severe enough to present for secondary health care. The results of our study should be interpreted in this context. The specificity of a test in any population depends on the group chosen to represent those without the disorder. In a hospital outpatient setting, subjects who were seen in clinic but not considered to have the disease in question form the natural comparative group, and in this context it seems that a trained nurse can achieve an acceptable level of diagnostic accuracy. However, in the community, where non-specific regional pain syndromes may be more common and specific upper limb and neck disorders less severe, the outcome is less assured, and further assessment of the schedule would be required.

As with other initiatives in this area, the criteria that have been developed are not comprehensive as to all possible upper limb pathologies. They also suffer the common weakness of aggregating potentially dissimilar disorders into a single category of “non-specific upper limb pain” to accommodate presentations that do not satisfy the criteria of discrete clinical disorders. This reflects the extent of consensus at the HSE workshop, where there was no agreement, for example, on how to classify arm pain of cervical origin, and no discussion about acromioclavicular joint dysfunction or olecranon bursitis. However, the schedule provides a base from which further expansion is possible. We have added to it some modifications of our own in areas of evident deficiency: notably, the recording of neck pain and restricted neck movement so that its association with arm pain can be mapped. Provided that case definitions can be agreed for other specific disorders of the upper limb and neck, then a similar approach of validation can be followed, permitting the schedule's utility to be extended. A major challenge will be to determine whether patients currently labelled as having non-specific upper limb pain can be subdivided further, into groups defined by clusters of symptoms and signs that predict prognosis and response to treatment: a reproducible physical examination of the kind tested here will be indispensable to this task.


We are grateful to Vanessa Cox who provided computing support, to the staff at the MRC Unit, Southampton who helped in data handling, and to Denise Gould who prepared this manuscript.



  • Funding: this study was supported by a grant from the Health and Safety Executive and a project grant PO552 from the Arthritis Research Campaign. Dr Walker-Bone was supported by an ARC Clinical Research Fellowship and Isabel Reading by a grant from the Colt Foundation.