Objectives: Painful osteoarthritis (OA) of the hand is common and a validated ultrasound (US) scoring system would be valuable for epidemiological and therapeutic outcome studies. US is increasingly used to assess peripheral joints, though most of the US focus in rheumatic diseases has been on rheumatoid arthritis. We aimed to develop a preliminary US hand OA scoring system, initially focusing on relevant pathological features with potentially high reliability.
Methods: A group of experts in the fields of OA, US and novel tool development agreed on domains and suggested scaling of the items to be used in US hand OA scoring systems. A multi-observer reliability exercise was then performed to evaluate the draft items.
Results: Synovitis (grey scale and power Doppler) and osteophytes (representing activity and damage domains) were included and evaluated as the initial components of the scoring system. All three features were evaluated for their presence/absence and if present were scored using a 1–3 scale. The reliability exercise demonstrated intra-reader κ values of 0.444–1.0, 0.211–1.0 and 0.087–1.0 for grey scale synovitis, power Doppler and osteophytes respectively. Inter-reader reliability κ values were 0.398, 0.327 and 0.530 for grey scale synovitis, power Doppler and osteophytes respectively. Without extensive standardisation, both intra- and inter-reader reliability were moderately good.
Conclusions: The draft scoring system demonstrated substantive to almost perfect percentage exact agreement on the presence/absence of the selected OA features and moderate to substantive percentage exact agreement on semi-quantitative grading. This preliminary process provides a good basis from which to further develop an US outcome tool for hand OA that has the potential to be utilised in multicentre clinical trials.
Osteoarthritis (OA) is the most common joint disease1 and is associated with significant health economic consequences.2 3 The prevalence of radiographic OA has been well documented in epidemiological studies; however, the prevalence of symptomatic hand OA is not well documented.4 The Framingham study has estimated the prevalence to be as high as 26% of women and 12% of men over 70.4 While treatment recommendations focus on a holistic approach, the pharmaceutical options are currently largely limited to analgesics.5–7 However, the spectrum of pharmaceutical therapies is expanding, with a recent increase in interest in potential disease modifying therapies in OA.8 9 OA is assessed clinically with attention to symptoms and signs, and confirmed by structural changes on radiographs. Similarly, trials focus on clinical and structural outcomes. The Osteoarthritis Research Society International (OARSI) group has recently published guidelines for conduct of clinical trials of OA of the hand,10 recommending conventional radiographs (CRs) as the standard for assessing structural outcomes. However, they acknowledged that other novel imaging techniques may play a part and require further validation.
Ultrasound (US) appears favourably placed to assess OA both in the clinic and in clinical trials. It has a higher resolution than CR, does not involve ionising radiation, and allows multi-planar, dynamic imaging of joints. In addition, recent studies in inflammatory arthritis have demonstrated US to be more sensitive to synovitis than clinical examination,11–13 more sensitive than CR to the presence of cortical defects,14 15 and to have reasonable sensitivity compared with magnetic resonance imaging for the presence of synovitis and cortical defects.13 16
A group of experts in OA, US and outcome measures met under the auspices of the Disease Characteristics in Hand OA Group (DICHOA) to gain consensus on the content of, and take preliminary steps towards validation of, an US scoring system for OA of the hand.
Experts in OA, outcome measures and ultrasonography from six countries (UK, Austria, France, Ireland, Norway, the Netherlands) took part in this process, which consisted of three steps.
Systematic literature review
A systematic literature review was conducted to identify original articles, published between 1950 and February 2007, examining the validity of B mode or Doppler US in OA of the hand. The aim was to identify studies that had defined and attempted to measure pathological features of hand OA detectable by US. PubMed was searched using the terms “ultrasonography and hand and osteoarthritis” and “ultrasound and hand and osteoarthritis”. Ovid MEDLINE was also searched using the terms (Osteoarthritis, Hip/or osteoarthritis.mp. or osteoarthritis/or Osteoarthritis, Knee/) and (Hand/or Hand Bones/or hand Joints/or hand.mp./or finger joint/or thumb/or base of thumb.mb./or metacarpophalangeal joint/or carpal bones/or schapoid bone/or carpal Joints/or stt joint.mp./or finger joint/or pip joint.mp./or dip.m.p.) and (Ultrasonography, Doppler/or Ultrasonography/or Ultrasonography, Doppler, Colour/or Ultrasonography.mp.). The abstracts were reviewed, and articles were excluded if they did not include the use of B mode US in hand OA, were not in English, were case reports, pictorial reviews or review articles, or did not measure pathological features of hand OA.
Iterative internet exercise
The next step was an iterative internet discussion aimed at reaching consensus on a draft scoring system. Considerations included the joints to be evaluated, domains to be scored, definitions of domains, and proposed scaling systems for each domain. Through this process, consensus on the draft scoring system was obtained.
A workshop involving 15 experts took place at the Cochin Hospital in Paris, France. This included a multi-observer US reliability exercise to evaluate the draft scoring system. The US reliability exercise involved seven ultrasonographers (DK, RJW, MADA, AP, CSW, HG, HBH) using seven commercially available real time scanners (Technos MPX, Esaote, Genoa, Italy) and a LA435 linear multifrequency transducer of 8–14 MHz. The B mode frequency used was 13 MHz. The power Doppler settings were as follows: frequency of 10 MHz, pulse repetition frequency (PRF) of 750 kHz, and medium wall filter. The colour gain was 113 dB and was set at the level at which noise artefacts appeared and then gradually reduced, until only a flow signal, if present, was left. Patients followed in the Department of Rheumatology of Cochin Hospital were invited to participate in this exercise. Seven patients with OA of the hands were scanned by each ultrasonographer to determine inter-reader reliability. Each ultrasonographer rescanned their first patient of the day at the end of the session to assess intra-reader reliability. The US examination imaged 15 joints of the dominant hand: the first carpometacarpal joint, metacarpophalangeal joints 1–5, proximal interphalangeal joints 1–5 and distal interphalangeal joints 2–5. The entire dorsal surface of the joint was imaged in grey scale, and power Doppler was assessed in the dorsal longitudinal plane.
The inter- and intra-reader reliability was assessed according to κ scores, weighted κ scores (κ (w)),17 percentage exact agreement (PEA) and percentage close agreement (PCA). Arbitrary qualitative labels have previously been assigned to κ values, whereby a κ of <0.2 is slight, 0.21–0.4 fair, 0.41–0.6 moderate, 0.61–0.8 substantial and >0.81 almost perfect.18 PEA is the percentage of observations that were given the same score, while PCA is the percentage of observations that were given either the same score or a score within ±1. We are not aware of any standard qualitative interpretation of PEA and PCA, so have applied the same cut-offs as for the κ values. Inter-reader reliability was also assessed with regard to specific agreement, which assesses the agreement specific to each category in the domain. This is less likely to be affected by chance than the PEA. In addition, in assessing intra-reader reliability, the percentage of times the second observation was higher than the first observation was assessed (%m2>m1) in order to identify any potential bias.
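As an illustration of the agreement statistics described above, the following is a minimal Python sketch for two sets of paired scores (for example, a reader's first and second scan of the same joints). It is not the analysis code used in this exercise. The function names, the linear weighting scheme for the weighted κ, and the handling of values that fall exactly on a cut-off are assumptions made for the example, and the scores in the usage note are invented.

```python
def exact_agreement(first, second):
    """PEA: percentage of paired observations given the same score."""
    same = sum(a == b for a, b in zip(first, second))
    return 100.0 * same / len(first)


def close_agreement(first, second, tolerance=1):
    """PCA: percentage of paired observations that differ by at most `tolerance`."""
    close = sum(abs(a - b) <= tolerance for a, b in zip(first, second))
    return 100.0 * close / len(first)


def percent_second_higher(first, second):
    """%m2>m1: percentage of pairs in which the second score exceeds the first."""
    higher = sum(b > a for a, b in zip(first, second))
    return 100.0 * higher / len(first)


def cohen_kappa(first, second, categories, weighted=False):
    """Cohen's kappa for two sets of scores over ordered `categories`.

    With weighted=True, disagreements are penalised in proportion to their
    distance on the scale (linear weights, an assumption of this sketch).
    Returns NaN when chance-expected disagreement is zero.
    """
    n = len(first)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (first score, second score).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        if weighted:
            return abs(i - j) / max(k - 1, 1)
        return 0.0 if i == j else 1.0

    disagree_obs = sum(penalty(i, j) * observed[i][j]
                       for i in range(k) for j in range(k))
    disagree_exp = sum(penalty(i, j) * row[i] * col[j]
                       for i in range(k) for j in range(k))
    if disagree_exp == 0:
        return float("nan")
    return 1.0 - disagree_obs / disagree_exp


def qualitative_label(value):
    """Map a kappa value onto the qualitative labels quoted in the text.

    Treatment of values exactly on a cut-off is an assumption.
    """
    if value <= 0.2:
        return "slight"
    if value <= 0.4:
        return "fair"
    if value <= 0.6:
        return "moderate"
    if value <= 0.8:
        return "substantial"
    return "almost perfect"
```

For example, with the invented semiquantitative scores `first = [0, 1, 2, 3, 0, 1, 2, 3]` and `second = [0, 1, 2, 2, 0, 2, 2, 3]`, calling `cohen_kappa(first, second, categories=[0, 1, 2, 3])` gives the unweighted κ and adding `weighted=True` gives the weighted κ. Because the sketch handles only two sets of scores at a time, applying it to seven readers would require an additional choice (such as averaging over reader pairs) that is not specified here.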
The distribution of pathology in the subjects scanned was assessed post hoc by examining the results of the reliability exercise. Pathology was deemed present if at least four of the seven ultrasonographers (ie, a majority) scored the pathology as present on the dichotomous scale.
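The majority rule above can be sketched in Python as follows. This is an illustrative example only; the function name and the default threshold calculation are assumptions, and the scores in the usage note are invented.

```python
def pathology_present(dichotomous_scores, threshold=None):
    """Return True if a majority of readers scored the pathology as present.

    `dichotomous_scores` holds one 0/1 score per reader for a single joint
    and domain. With seven readers the default threshold is four.
    """
    if threshold is None:
        # Smallest number of readers that constitutes a strict majority.
        threshold = len(dichotomous_scores) // 2 + 1
    return sum(dichotomous_scores) >= threshold
```

For example, `pathology_present([1, 1, 1, 1, 0, 0, 0])` is true, whereas `pathology_present([1, 1, 1, 0, 0, 0, 0])` is false.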
Systematic literature review
Twenty-five articles were identified with the search; the abstracts of these articles were reviewed. Twenty-three articles were excluded; six were not in English, five articles did not involve OA of the hand, three articles were reviews, two were pictorial reviews, six articles utilised neither B mode nor Doppler US, and one did not attempt to define or measure pathological features of hand OA. The remaining two articles are presented in table 1.
Iterative internet exercise
Consensus was obtained that 15 joints of the hand would be examined: the first carpometacarpal joint, metacarpophalangeal joints 1–5, proximal interphalangeal joints 1–5, and distal interphalangeal joints 2–5. Domains to be scored reflected domains of activity and damage: synovial hypertrophy and effusion, power Doppler signal and osteophytosis.
Synovial hypertrophy and effusion were considered together as a single domain “synovitis”. The OMERACT definitions of synovial hypertrophy and effusion developed for RA were applied.19 It was agreed that grey scale synovitis would be scored as either present or absent (0–1), and also on a semiquantitative scale of 0–3 analogous to the scoring systems developed in RA, where 0 represented no synovitis, 1 mild synovitis, 2 moderate synovitis and 3 severe synovitis.
Power Doppler signal was defined as a signal within a region of grey scale synovitis. It was decided to assess both dichotomous (present/absent, 0–1) and semiquantitative (0–3) scales.
Osteophytes were defined for the purpose of this exercise as cortical protrusions seen in two planes. Osteophytes were again evaluated using both dichotomous and semiquantitative scales (the latter scored at each joint as absent, mild, moderate or severe on a scale of 0–3).
It was decided not to include erosions, cartilage parameters or joint space narrowing because of concerns about reliable definitions, acquisition, current available US technology and feasibility related to duration of scanning (see also Discussion).
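The draft items agreed above (15 joints, three domains, each graded 0–3 with a corresponding present/absent score) could be recorded in a structure such as the following Python sketch. The joint abbreviations, domain identifiers and class name are hypothetical labels introduced for illustration. The sketch also assumes that the dichotomous score is derived from the semiquantitative grade (present when the grade is 1 or more), whereas in the exercise the two scales were recorded by the readers.

```python
from dataclasses import dataclass

# The 15 joints of the draft system (abbreviations are illustrative labels).
JOINTS = (
    ["CMC1"]
    + [f"MCP{i}" for i in range(1, 6)]
    + [f"PIP{i}" for i in range(1, 6)]
    + [f"DIP{i}" for i in range(2, 6)]
)

# The three domains of the draft system (identifiers are illustrative).
DOMAINS = ("synovitis", "power_doppler", "osteophytes")


@dataclass(frozen=True)
class JointScore:
    """One semiquantitative grade for one joint and one domain."""

    joint: str
    domain: str
    grade: int  # 0 = absent, 1 = mild, 2 = moderate, 3 = severe

    def __post_init__(self):
        if self.joint not in JOINTS:
            raise ValueError(f"unknown joint: {self.joint}")
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        if self.grade not in (0, 1, 2, 3):
            raise ValueError(f"grade must be 0-3, got {self.grade}")

    @property
    def present(self) -> int:
        """Dichotomous score (assumed here to be 1 whenever the grade is >= 1)."""
        return int(self.grade >= 1)
```

For example, `JointScore("PIP2", "osteophytes", 2).present` evaluates to 1, and constructing a score with a grade outside 0–3 raises a `ValueError`.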
Intra- and inter-reader reliability for each domain and each scaling system are presented in tables 2 and 3. The intra-reader κ values (table 3) varied from slight to almost perfect depending on the observer, domain and scoring scale. Reliability was generally better for dichotomous scales, with substantive to almost perfect PEA for all observers for each domain. The semiquantitative scales generally demonstrated slightly lower κ values and PEA. However, the PCA for the semiquantitative scales was substantive for all observers and all domains. The intra-reader reliability was quite variable between readers; however, as each reader scanned a different patient, this variability should be interpreted cautiously, as those who scanned subjects with less pathology may be expected to have better reliability. The inter-reader κ values (table 2) were fair to moderate, and once again, the dichotomous scales were more reliable than the semiquantitative scales. The PEA ranged from fair to almost perfect, once again depending on domain and scale. For synovitis, the PEA was best towards the normal end of the semiquantitative scale. For the other domains, the results were more variable; for example, agreement on osteophyte scores was better at the extremes of the scale (scores of 0 and 3). There was no consistent bias between or within readers with regard to the second set of observations being higher than the first.
The distribution of pathologies in the subjects scanned is demonstrated in table 4. Osteophytosis was seen in all proximal interphalangeal and distal interphalangeal joints, but there was a wide variation of pathology in other joints.
US has many features rendering it potentially valuable in investigating structure in hand OA. This preliminary work established that international experts in the field of US, OA and outcome measures believe it is worthwhile pursuing this imaging technique as a tool in hand OA. Also, even though US is recognised as a valuable tool in imaging joints in inflammatory diseases, the systematic literature review demonstrated a paucity of information on the validity of US in hand OA.
The development of a preliminary US hand scoring system via an iterative internet exercise allowed experts to come to a consensus as to what US detectable abnormalities in hand OA were both important and feasible domains to be included. These were grey-scale synovitis, power Doppler and osteophytosis. It was felt that current technology would not allow cartilage defects or joint space narrowing to be reliably or meaningfully interpreted, despite these being cardinal pathological features of OA. Given that OARSI guidelines recommend that measures of joint space narrowing are recorded in studies of structural outcomes,10 such a domain may need to be added to the tool in the future. Erosions were also excluded from the tool, largely due to perceived problems with the definition and reliability. US may be a suitable medium to investigate the relationship between erosive and non-erosive OA, although a single study has found US to be less sensitive to erosions in hand OA than CR.20 Future development of this tool may need to revisit the issue of whether to include erosions or not.
The reliability exercise demonstrated that the preliminary US scoring tool was reliable. Intra-reader reliability (interpreted with PCA or PEA) for the majority of observers and domains was moderate to almost perfect, being generally higher when the dichotomous scale was used. It is important to acknowledge that variation between observers may be confounded by which subject was used to assess intra-reader reliability. Reliability is likely to be best when less pathology exists (as has been found for the inter-reader reliability). As each observer scanned a different subject to determine their intra-reader reliability, results should be compared with caution.
Inter-reader reliability assessed with PEA was moderate to almost perfect, being over 70% for each domain using a dichotomous scale and being most reliable for the osteophyte domain. Even utilising the semi-quantitative scaling, the lowest PEA was 48% for synovitis, being higher for Doppler and osteophytosis. Perhaps as would be expected, agreement was generally greatest in the absence of pathology. Given there was no formal or extensive standardisation process prior to the exercise, these results are extremely encouraging. Finally, problems with the draft system and future directions were identified.
Despite limiting domains and numbers of joints scanned and imaging only the dorsal surface of the joint, scanning each subject took up to half an hour. This compares poorly with scoring radiographs of hand OA, in which the most time-consuming method has been shown to take an average of less than 4 min.21 The time taken in this exercise includes the acquisition of images (which is not considered when scoring CRs) and may decrease with increasing observer experience. However, it is likely to be one of the major barriers to a feasible US outcome measure in hand OA. We chose to examine 15 joints for the purpose of this preliminary exercise; however, further consideration of which joints to include in a scoring system, and whether to weight certain joints, is needed. Restricting a scoring system to a limited set of joints, or weighting joints according to importance, will require further clinical and imaging studies to determine the relative significance of each joint or joint combination.
We did not include erosions, joint space narrowing or cartilage defects in this tool. This was not because these domains were not felt to be of structural or pathological importance in OA, but rather because we chose to focus on domains that were felt to be feasible in a multicentre outcome measure given current technology. Erosions can be difficult to visualise due to overlying osteophytes, and in our experience where the cortical surface is very damaged, it can be difficult to determine where an erosion begins and an osteophyte ends. Furthermore, the only study examining the validity of US in detecting erosions in hand OA found US to be less sensitive than CR.20 While cartilage damage is a cardinal feature of hand OA, there are several features of the US appearance of degenerative cartilage, including thinning, transparency and loss of clarity of the interface.22 As cartilage changes in OA are a spectrum, an appropriate scaling system would have been complex, time consuming, and of uncertain significance given that visualisation of cartilage is limited in the small joints of the hand, even with flexion of the joint, due to joint structure. As US can only image the surfaces of joints, and the joint space in the central portion can be obscured by osteophytes in OA of the hand, it was felt that trying to quantify joint space with US would not be feasible.
In the early phase of OA, cytokines that stimulate osteoblasts are released from the chondrocytes. Osteophytes are thus early signs of pathology in the cartilage. Given the high resolution of US, small osteophytes not detected on CR may be clinically relevant for early diagnosis of OA.
The definitions we used may need further consideration. For example, we chose to use a composite of synovial hypertrophy and effusion to determine synovial inflammation, recognising that the clinical and pathological relevance of the two are uncertain. It was noted that effusion could occur in the absence of synovial hypertrophy, and a scoring system that grades the features separately may be of interest when time is not constrained. In addition, it was noted that Doppler signal could be seen within the capsule, but external to the hypoechoic areas of synovial hypertrophy. This signal was not scored in this exercise.
The determination of the clinical, pathological and prognostic importance of US detected abnormalities in hand OA, and relative importance of the joints involved was beyond the scope of this exercise. Proof of concept and epidemiological studies need to be undertaken to investigate these issues, and in the future the domains included in US outcome measures for hand OA may need to be revised.
This process has been a preliminary step in developing an US scoring tool for hand OA. Very good PEA for dichotomous scales was demonstrated in this exercise; however, the semiquantitative results suggest that a standardisation process could improve agreement. In addition, preliminary exercises might allow for selecting the most reliable observers. Importantly, this process has demonstrated that an US outcome measure suitable for multicentre trials is feasible and likely to be reliable. It has also provided a foundation upon which to further develop this tool.
We are grateful to MSD for an unrestricted educational grant that supported part of this work.
Competing interests: None.