Objectives To evaluate different global ultrasonographic (US) synovitis scoring systems as potential outcome measures of rheumatoid arthritis (RA) according to the Outcome Measures in Rheumatoid Arthritis Clinical Trials (OMERACT) filter.
Methods To study selected global scoring systems, for the clinical, B mode and power Doppler techniques, the following joints were evaluated: 28 joints (28-joint Disease Activity Score (DAS28)), 20 joints (metacarpophalangeals (MCPs) + metatarsophalangeals (MTPs)) and 38 joints (28 joints + MTPs) using either a binary (yes/no) or a 0–3 grade. The study was a prospective, 4-month duration follow-up of 76 patients with RA requiring anti-tumour necrosis factor (TNF) therapy (complete follow-up data: 66 patients). Intraobserver reliability was evaluated using the intraclass correlation coefficient (ICC), construct validity was evaluated using the Cronbach α test and external validity was evaluated using level of correlation between scoring system and C reactive protein (CRP). Sensitivity to change was evaluated using the standardised response mean. Discriminating capacity was evaluated using the standardised mean differences in patients considered by the doctor as significantly improved or not at the end of the study.
Results Different clinimetric properties of various US scoring systems were at least as good as the clinical scores with, for example, intraobserver reliability ranging from 0.61 to 0.97 versus from 0.53 to 0.82, construct validity ranging from 0.76 to 0.89 versus from 0.76 to 0.88, correlation with CRP ranging from 0.28 to 0.34 versus from 0.28 to 0.35 and sensitivity to change ranging from 0.60 to 1.21 versus from 0.96 to 1.36 for US versus clinical scoring systems, respectively.
Conclusion This study suggests that US evaluation of synovitis is an outcome measure at least as relevant as physical examination. Further studies are required in order to achieve optimal US scoring systems for monitoring patients with RA in clinical trials and in clinical practice.
Numerous studies have clearly demonstrated that ultrasonographic examination of joints is more sensitive than clinical physical examination.1–6 Over the last decade, due to improvements in ultrasonographic (US) technology and as a result of educational training courses, the reliability of such examination has dramatically improved.7 8 Moreover, because of combined European League Against Rheumatism (EULAR)/Outcome Measures in Rheumatoid Arthritis Clinical Trials (OMERACT) initiatives, there is now consensus on techniques of acquisition and scoring of synovitis in rheumatoid arthritis (RA)9 at the joint level. However, few studies have focused on the properties of an ultrasonographic scoring system at the patient level (ie, a ‘global’ scoring system)10 and this has been highlighted as an area for development by the OMERACT Ultrasonography Task Force.
The development of global US scoring systems for synovitis is of interest for at least two reasons: (i) to facilitate homogeneous decisions such as a therapeutic decision at a single point of time (eg, to initiate a therapy in case of ‘non-acceptable’ score or to taper/stop a therapy in case of an ‘acceptable’ score) and (ii) to evaluate the efficacy of specific therapeutic interventions during longitudinal follow-up. A validated US synovitis count could potentially replace the clinical synovitis count in the current composite indices evaluating RA disease activity, for example, Disease Activity Score (DAS).11 12
To use these scoring systems as outcome measures in clinical trials requires evaluation of the clinimetric/metrological properties according to the OMERACT filter by evaluating the three following characteristics: validity, discrimination and feasibility.13
Additionally, the choice of the scoring system has to consider the following: (1) the choice of the joints to be included in the scoring system, (2) the choice of the scoring system used at each individual joint level (grading or not) and (3) the choice of the US acquisition technique. The choice of the joints to be included in the scoring system has to take into account several parameters, including the frequency of involvement of such joints in RA, their inclusion in previous clinical synovitis scoring systems and the feasibility for analysis (eg, the time constraints of the technique). For each individual joint, the scoring system can be either a binary variable (presence of synovitis yes/no) or a semiquantitative variable (usually by using a scale from 0=no synovitis to 3=severe synovitis). The sum of the score of each individual joint in a global scoring system results in a joint count when using a binary variable and a joint score when using a semiquantitative variable. In the current study, we combined various options into several global joint count/score scoring systems, which were tested for a range of clinimetric properties.
Patients and methods
The study was a prospective, multicentre, international, 4-month duration follow-up of patients with RA requiring a tumour necrosis factor (TNF) blocker according to the opinion of the treating rheumatologist.
This design was chosen in order to evaluate the main clinimetric properties of the different proposed scoring systems, for example, reliability using screening and baseline visits, validity at baseline and sensitivity to change and discriminant capacity following treatment with a TNF blocker.
This study was approved by the appropriate ethical committees. All patients gave their written informed consent before entering the study.
Adult patients fulfilling the 1987 American College of Rheumatology (ACR) criteria for the diagnosis of RA14 were eligible for the study if, in the opinion of the investigators, they required TNF blocker therapy. The minimum level of disease activity was defined as at least six swollen joints at physical examination.
Demographics were collected at screening or baseline including gender, age, disease duration, history of surgery related to RA and previous RA treatments.
Joint count was performed by an investigator (either a research nurse or a rheumatologist) on 66 joints according to the ACR recommendations.15 In each of the nine participating centres, a single investigator was in charge of the monthly monitoring of the patients. For each joint, the scoring system for synovitis was a semiquantitative variable (eg, 0=definitely no synovitis, 1=possibly not, 2=yes moderate, 3=yes obvious and important). Clinical synovitis count was based on the sum of the number of joints with a score of at least 2.
Other outcome variables included the following collected at each visit: number of tender joints (68 joint count), patient's global assessment using a 0–100 visual analogue scale (VAS), functional impairment using the Health Assessment Questionnaire (HAQ) disability index16 and doctor's global assessment of the activity of the disease using a 0–100 VAS.
The US evaluation was performed on 38 joints: the 28 joints included in the 28-joint Disease Activity Score (DAS28) (ie, shoulders, elbows, wrists, metacarpophalangeals (MCPs) × 10, proximal interphalangeals (PIPs) × 10 and knees) plus the metatarsophalangeals (MTPs) × 10.17 These US examinations were performed in a darkened room. In each of the nine participating centres, a single sonographer (either a radiologist or a rheumatologist) with experience (at least 70 different examinations) in evaluating synovitis was in charge of the monthly monitoring of the patients. The sonographer did not have access to the results of the clinical examination and vice versa. Systematic multiplanar greyscale (B mode) and power Doppler examination was carried out with commercially available real-time scanners (eg, ESAOTE Technos MPX, Toshiba Aplio, Esaote MyLab, Philips HD11, BK Mini Focus) using multifrequency linear transducers (7–12 MHz). Ultrasonography scanning techniques, greyscale and power Doppler machine settings and definitions of abnormality were standardised among investigators prior to the study during a 1.5-day meeting. The US scanning method has been described previously.5 18–22 Synovitis was defined according to the OMERACT published definitions.19 20 23 24 A B mode and a power Doppler examination were recorded for each joint. B mode synovitis scoring was evaluated using a four-grade scale from 0 to 3 with the following subjective definitions for each category: grade 0, absence of synovial thickening; grade 1, mild synovial thickening; grade 2, moderate synovial thickening; grade 3, marked synovial thickening.
Power Doppler synovitis scoring was also evaluated using a four-grade scale from 0 to 3 with the following definitions for each category: grade 0: absence of signal, no intra-articular flow; grade 1: mild, one or two vessels signal (including one confluent vessel) for small joints and two or three signals for large joints (including two confluent vessels); grade 2: moderate confluent vessels (>grade 1) and less than 50% of normal area; grade 3: marked vessels signal in more than half of the synovial area.
US synovitis score for each group of joints (for the 20, 28 and 38 joint options) was based on the sum of these different grades, ranging from 0 to 60, 0 to 84 and 0 to 114 for the 20, 28 and 38 joint groups, respectively.
US synovitis count for each of the scoring options (20, 28 and 38 joints) was based on the number of joints with a grade of at least 1, computed independently for the B mode and power Doppler evaluations.
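The score and count definitions above can be illustrated with a short sketch. It assumes the grades for one patient and one technique are available as a list of integers from 0 to 3, one per joint; the function names and the `MAX_GRADE` constant are hypothetical and are not taken from the study.

```python
from typing import Sequence

MAX_GRADE = 3  # semiquantitative scale runs from 0 to 3


def _check(grades: Sequence[int]) -> None:
    """Reject grades outside the 0-3 scale."""
    for g in grades:
        if g not in (0, 1, 2, 3):
            raise ValueError(f"grade out of range: {g!r}")


def joint_score(grades: Sequence[int]) -> int:
    """Global joint score: sum of the 0-3 grades over all joints."""
    _check(grades)
    return sum(grades)


def joint_count(grades: Sequence[int], threshold: int = 1) -> int:
    """Global joint count: number of joints graded at least `threshold`.

    The US count in the text uses a threshold of 1; the clinical count
    uses a threshold of 2.
    """
    _check(grades)
    return sum(1 for g in grades if g >= threshold)


def max_score(n_joints: int) -> int:
    """Upper bound of the score for a group of n_joints joints."""
    return MAX_GRADE * n_joints
```

With 20, 28 and 38 joints, `max_score` gives 60, 84 and 114, which matches the ranges quoted above.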
A statistical analysis plan was elaborated in order to evaluate the different psychometric properties of the different a priori selected scoring systems.
A priori selected scoring systems
For the clinical evaluation and for the US B mode and US power Doppler techniques, the following groups of joints were evaluated: all MCPs+MTPs (ie, 20 joints); the 28 joints of the DAS28 (28 joints) and the DAS28 joints + 10 MTPs (38 joints). For each of the above 9 scoring systems, a count (referring to the binary variable: synovitis yes/no) and a score (referring to the semiquantitative variable: synovitis from 0 to 3) were evaluated, resulting in 18 different scoring systems.
The clinimetric properties of each scoring system were evaluated according to the OMERACT filter as follows.13
Intraobserver reliability (discrimination) was evaluated on the data obtained at screening and baseline visits, with an assumption that the activity of the disease remained stable during these two visits with an interval of time of 2 to 4 days between the two visits. Reliability was evaluated using the intraclass correlation coefficient and its 95% CI and the weighted κ statistics.
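As an illustration of the reliability analysis, the sketch below computes a one-way random-effects intraclass correlation coefficient for repeated measurements of the same patients. The text does not state which form of the ICC was used, so the choice of the one-way form is an assumption, as is the function name; the 95% CI and the weighted κ are not covered here.

```python
from typing import Sequence


def icc_one_way(measurements: Sequence[Sequence[float]]) -> float:
    """One-way random-effects ICC for n subjects measured k times each.

    `measurements[i]` holds the k repeated values for subject i
    (for example, the screening and baseline scores of one patient).
    """
    n = len(measurements)
    k = len(measurements[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 subjects and 2 measurements each")
    if any(len(row) != k for row in measurements):
        raise ValueError("every subject needs the same number of measurements")

    subject_means = [sum(row) / k for row in measurements]
    grand_mean = sum(subject_means) / n

    # Between-subject and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(measurements, subject_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variability in the data")
    return (ms_between - ms_within) / denominator
```

When the two visits give identical values for every patient, the within-subject mean square is zero and the coefficient equals 1.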
Validity (truth) was evaluated on the data obtained at baseline. Internal construct validity was evaluated using the Cronbach's α coefficient. External validity was evaluated using the level of correlation (Spearman test) between, on one hand, the evaluated scoring system and, on the other hand, the level of C reactive protein (CRP) (mg/litre).
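A minimal sketch of the Cronbach's α computation follows. It assumes that each joint is treated as one item and each patient as one observation, which the text does not state explicitly; the function name is hypothetical.

```python
import statistics
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a patients-by-items table.

    `rows[p][i]` is the value of item i (assumed here to be the grade of
    one joint) for patient p.
    """
    n_items = len(rows[0])
    if len(rows) < 2 or n_items < 2:
        raise ValueError("need at least 2 patients and 2 items")
    if any(len(r) != n_items for r in rows):
        raise ValueError("every patient needs the same number of items")

    # Sample variance of each item across patients.
    item_variances = [
        statistics.variance([r[i] for r in rows]) for i in range(n_items)
    ]
    # Sample variance of the total (summed) score across patients.
    total_variance = statistics.variance([sum(r) for r in rows])
    if total_variance == 0:
        raise ValueError("total score has no variance")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)
```

Items that vary together across patients push the coefficient towards 1, while unrelated items push it towards 0.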
Sensitivity to change (discrimination) was based on the capacity of the scoring system to improve during the study using the TNF blocker therapy as a recognised effective therapy. The sensitivity to change was evaluated using the standardised response mean (SRM), which is the ratio of the mean change to the SD of the changes. Usually, an SRM below 0.2 is considered as nil and above 0.6 as relevant.25 Sensitivity to change was evaluated on the changes observed between baseline and the end of the study. The end of the study was defined by completion of the final visit (month 4) or, in case of early discontinuation, as the last visit while the patients were still on TNF blockers. The SD of each SRM was estimated using bootstrap resampling methodology. Because investigators had to find at least six swollen joints at physical examination to fulfil the inclusion criteria, the number of swollen joints at baseline might have been overestimated. To account for this, a second analysis of the sensitivity to change was performed between month 1 and month 4 using the assumption that the effect of TNF blockers on this parameter (ie, synovitis) is linear and constant during the first 3 to 4 months of therapy.
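The SRM and its bootstrap SD can be sketched as follows. The sign convention (change = earlier score minus later score, so that improvement is positive), the number of resamples and the fixed seed are assumptions made for the illustration; the text does not report these details.

```python
import random
import statistics
from typing import Sequence


def standardised_response_mean(changes: Sequence[float]) -> float:
    """SRM: mean of the individual changes divided by their SD."""
    sd = statistics.stdev(changes)
    if sd == 0:
        raise ValueError("changes have no variability")
    return statistics.mean(changes) / sd


def bootstrap_sd_of_srm(
    changes: Sequence[float], n_resamples: int = 1000, seed: int = 0
) -> float:
    """SD of the SRM estimated by resampling patients with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(changes, k=len(changes))
        if statistics.stdev(sample) == 0:
            continue  # SRM is undefined for a constant resample
        estimates.append(standardised_response_mean(sample))
    if len(estimates) < 2:
        raise ValueError("too few usable resamples")
    return statistics.stdev(estimates)
```

Here `changes` would hold one value per patient, for example the baseline joint score minus the score at the end of the study.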
Discriminant capacity was based on the capacity of the changes in the scoring system to discriminate between two groups of patients. The groups of patients were defined according to their status at the end of the study. It is difficult to determine a truly independent global assessment of disease activity but we employed the doctor's global assessment. A 50% improvement during the study was the a priori chosen cut-off for defining a good response. Such discriminant capacity was evaluated using the standardised mean difference (SMD), which is the difference between the mean changes (month 4 to baseline) of the two groups divided by the pooled SD of the changes of the two groups.
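The SMD described above can be sketched as below. The pooled SD is computed with the usual degrees-of-freedom weighting, which is an assumption since the text does not give the pooling formula; the function and argument names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def standardised_mean_difference(
    changes_improved: Sequence[float], changes_not_improved: Sequence[float]
) -> float:
    """SMD: difference in mean change between two groups over the pooled SD."""
    n1, n2 = len(changes_improved), len(changes_not_improved)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least 2 patients")

    var1 = statistics.variance(changes_improved)
    var2 = statistics.variance(changes_not_improved)
    # Pooled SD weighted by each group's degrees of freedom (assumed form).
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    if pooled_sd == 0:
        raise ValueError("changes have no variability")

    mean_difference = statistics.mean(changes_improved) - statistics.mean(
        changes_not_improved
    )
    return mean_difference / pooled_sd
```

The two input lists would be the per-patient changes of patients judged by the doctor as improved by at least 50% and of the remaining patients.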
Simplicity of each scoring system was evaluated by recording the time spent by the sonographer for each group of joints, for example, shoulders, elbows, wrists, MCPs, PIPs, knees, MTPs.
Validity and simplicity were analysed on patients with a clinical and US evaluation at baseline. Sensitivity to change and discriminant capacity were analysed in patients with at least one evaluation performed during anti-TNF therapy.
Patients and study course
Of the 76 enrolled patients, only 7 had a systematic clinical and US evaluation of their synovitis at screening and baseline visits. Therefore, reliability was evaluated on these seven patients. Of the 76 patients with a clinical and US evaluation at baseline, 8 patients did not return for a second visit. Of the remaining 68 patients with at least 1 evaluation performed during anti-TNF therapy, 2 were not evaluated at month 4 but at the discontinuation visit. Therefore, validity was evaluated on 76 patients and sensitivity to change and discriminant capacity on 68 patients. Of the 76 enrolled patients, 64 (84%) were women, mean age was 55±13 years and mean disease duration was 10±9 years. Rheumatoid factor was positive in 59 (78%). The mean number of previous disease modifying antirheumatic drugs (DMARDs) was 3±2. At baseline, TNF blocker therapy was the first TNF blocker in 52 patients, a switch to another TNF blocker in 15 and the initiation of another TNF blocker in patients with a history of at least one previous TNF blocker in 8 (data missing for 1 patient). At baseline, the mean±SD DAS28 (erythrocyte sedimentation rate (ESR)) was 5.12±1.31, CRP was 18±19 mg/litre and HAQ was 1.41±0.68.
Intraobserver reliability of the different synovitis scoring systems, summarised in table 1, ranged from 0.53 to 0.97 with reliability of the US examination at least as good as that of physical examination. Validity of the different synovitis scoring systems is summarised in table 2. Construct validity of the US scoring systems was again at least as good as that of clinical examination. External validity was evaluated based on the level of correlation existing between the scoring system and the level of CRP. Whatever the scoring system, there was a statistically significant correlation with the level of CRP. Moreover, a trend suggesting a higher correlation was observed in the scoring systems involving more joints or at least large joints.
Sensitivity to change between baseline and the end of the study is summarised in table 3. Sensitivity to change improved as more joints were assessed.
In order to test the hypothesis of an artificial overestimation of the clinical synovitis count at baseline, sensitivity to change was also evaluated between month 1 and the end of the study with the results also summarised in table 3. Apart from the composite index DAS, these data suggest an equal or even better sensitivity to change of the different US scoring systems in comparison to the clinical ones.
Discriminant capacity was evaluated on the capacity of the changes in the different scoring systems to discriminate groups of patients according to their status at the end of the study as estimated by a doctor global assessment. Table 4 summarises the results considering the condition of the patients at the end of the study based on the doctor's opinion. Whatever the scoring system, these analyses demonstrated a similar level of discriminant capacity of the different evaluated scoring systems with a trend in favour of the US scoring systems. There was no significant difference between the different joint count and joint score whatever the scoring system. Simplicity (time to perform the acquisition of the US scoring system at baseline) ranged from 11±4, 18±7 to 23±8 min for the 20, 28 and 38 joint scoring system, respectively.
This study evaluated different US scoring systems using a range of joint counts (20, 28 and 38) with binary and semiquantitative assessments and compared clinical, US B mode and US power Doppler scores, resulting in 18 different global scoring systems. Moreover, this study also evaluated the longitudinal performance of the systems and the DAS28, making for 21 different scoring systems that we compared using the OMERACT filter to examine their major psychometric/metrological properties. The findings suggest that US evaluation of synovitis is an outcome measure at least as relevant as physical examination.
The a priori joint groups selected (ie, 20, 28 and 38) were based on several considerations. First, we wished to be able to compare novel US systems with ‘conventional’ clinical scoring systems. Therefore the 28 joints of the DAS28 were included.17 Second, the frequency of synovitis in the MTP joints in the early phase of RA,26 and the well known discrepancies between clinical and US evaluation in these joints4 prompted us to also include the 10 MTP joints, resulting in the 38 joints group. Finally, the need for simplification prompted us to evaluate a group of joints that are frequently involved in RA, that is, all the MCP and MTP joints, which explains the 20 joints group. We acknowledge that such a choice is arbitrary. Other groups have proposed other scoring systems involving different joint combinations such as the German US7 score, which evaluates joints of the dominant hand and foot (MCP II, PIP II, MTP II and MTP V) based on the observed frequency of involvement in rheumatoid arthritis.12
Two main barriers are usually put forward when considering the use of systematic US evaluation of synovitis in clinical trials/studies: its poor reliability and its high complexity.
It has to be recognised that these two parameters may be somewhat linked since better reliability of the technique requires suitable training of the sonographers. The last decade has seen a huge improvement across Europe with respect to US utilisation, with many rheumatology departments having their own US machine and appropriately trained rheumatologists. It has to be emphasised that such training is feasible and we have previously shown that, at least for the evaluation of synovitis of the small joints of the extremities, 70 examinations are sufficient for a junior sonographer to achieve competence.28 Hence, in the current study, the recruitment of centres was not an issue. This study compared the intraobserver reliability of the US examination to ‘conventional’ clinical examination and demonstrated that the intraobserver reliability of the different US scoring systems was at least as good as that of the clinical measure, although we acknowledge the limitation that this is based on a small cohort of seven patients only.
This study also systematically evaluated the time spent by the investigators collecting the US information. Depending on the number of evaluated joints, such time ranged from 10 (20 joints) to 25 (38 joints) min duration, which would be seen as satisfactory for patient acceptance.
The construct and the external validity both appeared to be acceptable for all scoring systems. The construct validity, evaluated via the Cronbach α coefficient, exceeded the threshold of 0.70, which is considered relevant.29 With respect to the external validity, we arbitrarily chose a distinct item (the CRP), which evaluated the domain ‘inflammation’. The data obtained suggest that the US scoring systems are at least as valid as conventional clinical ones. In this study, we did not compare the intermachine power Doppler sensitivity. It has to be recognised that a heterogeneous sensitivity might have influenced the results. This study did not demonstrate any significant difference with respect to the scoring system (ie, binary scoring vs grade score) or with regard to the number of evaluated joints, suggesting that a 28-joint US count system performs at least as well as the others. The demonstration of a treatment effect of an ‘anti-inflammatory’ agent is facilitated by the extent of inflammatory disease before therapy. This concept results in inclusion criteria requiring a prespecified number of swollen joints (usually at least 6 on a 28-joint count).30–32 Because of the willingness of the investigators to recruit patients, such a requirement might result in an overestimation of the number of synovitic joints at baseline. This issue might explain in part the ‘dramatic’ improvement in number of swollen joints in the placebo group of different studies.30–32 The US examination can easily prevent such baseline overestimation and therefore potentially improve the accuracy of changes observed after initiation of therapy. Such an advantage can dramatically improve the conduct of clinical trials. The current study tried to test this hypothesis, arguing that the differences in the performances (sensitivity to change) of the scoring systems between clinical and US should be similar over time.
The differences obtained in this study when considering the changes between the end of the study and baseline or between the end of the study and the first month of therapy (see table 3) are clearly in favour of the hypothesis of an overestimation of synovitis at physical examination at baseline. Another possible explanation is that, after the first dose of the TNF blocker, US picks up smaller amounts of synovitis that are no longer detectable by clinical examination.33
This study suggests that the US scoring systems for synovitis are at least as good as, and possibly more accurate than, physical examination. Further studies are required in order to achieve an optimal US scoring system for monitoring patients with RA in clinical trials and in daily practice.
The authors would like to thank the different investigators who contributed in the recruitment and/or the monitoring of the patients (Jean-David Albert, Pierre Bourgeois, Maxime Breban, Françoise Carbonnelle, Tiffen Couchouron, Laurent Grange, Pascal Guggenbuhl, Cécile Hacquard-Bouder, Christophe Hudry, Rachida Inaoui, Catherine Le Bourlout, Damien Loeuille, Xavier Mariette, Jean-Marcel Meadeb, Anne Miquel).
Funding This study was supported by an unrestricted grant from Abbott, France.
Competing interests None.
Patient consent Obtained.
Ethics approval This study was conducted with the approval of each country participating in the study.
Provenance and peer review Not commissioned; externally peer reviewed.