Background: Radiographic progression in clinical trials is assessed by interpreting changes in total radiographic joint score, and the reliability of those scores depends on an evaluation of sum scores. It is not known how consistently changes in individual joints are identified by independent readers and in independent readings.
Patients and Methods: 7255 single joints from 178 patients who participated in the Trial of Etanercept and Methotrexate with Radiographic Patient Outcomes (TEMPO) were evaluated. Every image was independently scored twice according to the Sharp–van der Heijde method by two independent readers, so that four scores per joint were available. Absolute agreement and consistency of negative and positive erosion change scores across readers and readings were compared on a per-joint level, as well as on a per-patient level.
Results: The number of joints showing a change for erosion was very low in this trial: 691/7255 analysed joints had at least one non-zero change score out of four readings. Absolute agreement between readings was remarkably poor: only 12 joints showed a consistently positive or negative change in all four readings. Change scores in opposite directions in the same joint across independent readings were rare (25 joints). Frequency of opposite joint scores in the same patient (mixed change patterns) was reader dependent.
Conclusion: Substantial intrareader and interreader disagreement in scoring change in individual joints is common. Opposite joint scores in the same patient, however, are rare and reader dependent. Notwithstanding these subtle inconsistencies on the individual joint level, the total Sharp score is a useful and discriminatory outcome measure.
Joint damage progression as measured on consecutive plain radiographs of hands and feet is an important outcome when evaluating the course of rheumatoid arthritis (RA). Several scoring systems and modifications have been developed to quantify progression radiographically. The Larsen and Sharp methods, and their modifications, which were developed to quantify radiographic progression, are best known and applied most frequently in clinical trials.1 2 3 Trials evaluating tumour necrosis factor alpha-blocking drugs in the treatment of RA have recently introduced the phenomenon of negative change, which could, among other things, indicate the repair of previously existing (erosive) damage in joints. A few trials have shown statistically significant negative average progression scores on a group level, leading to the possibility of joint repair. However, it is not known what such mean negative scores truly imply.4 5 There is no doubt that part of the negative (but also the positive) scores is due to measurement error, but it is impossible to separate measurement error from true change. Results from recent studies have reported change scores in either direction that are so low that they could theoretically stem from changes within one or only a few joints. Currently, there are few insights into how the change scores at the patient level (patient score) reflect individual joint elements. Do negative and positive scores occur in the same patient, or does the direction of change dominate the patient score? Another unanswered question is “Are negative or positive changes in a single joint recognised independently of each other by independent readers, or in independent readings by one reader?”
There are a number of reports claiming the existence of joint repair in RA.6 7 8 9 A subcommittee of the Outcome Measures in Rheumatoid Arthritis Clinical Trials (OMERACT) Imaging Committee on the Healing of Erosions has conducted several exercises using selected case reports that underscore the validity of the concept of the repair of erosions.10 11 12 These exercises have provided corroborating evidence regarding the validity of currently existing scoring methods in the detection of repair.
The relation between negative and positive change scores, and the reliability of both phenomena, have never been investigated at the level of single joints in a large unselected sample of patients with RA.
We have therefore evaluated the consistency of positive and negative individual joint change scores, as well as their occurrence within the same patient, in the Trial of Etanercept and Methotrexate with Radiographic Patient Outcomes (TEMPO), which is a large randomised clinical trial that showed a statistically significant negative mean change score in one of the trial arms and a statistically significant positive change score in another trial arm.4 This trial was chosen because a large set of radiographs has been scored twice independently, by the same two readers, thus providing a unique opportunity to learn about agreement in scoring negative and positive changes in individual joints.
Patients and methods
The TEMPO trial was a 3-year study that evaluated clinical and radiographic outcomes of patients with RA treated with methotrexate alone, etanercept alone or the combination of both drugs.4 This analysis has used data collected during the first 2 years of the study.13 During the reading of the radiographic images at the end of the first year, three readers scored all baseline, 6-month and 12-month radiographs in such a manner that every patient was scored by two readers. During the readings after the second year, all available baseline and 12-month radiographs were scored again by two of the three readers of the first panel. By doing so, a set of four readings per joint was available for each of the patients included in this analysis. Radiographs were scored using the van der Heijde modification of the Sharp score method.3 This method quantifies the number and size of erosions in 32 joints of the hands and wrists and 12 joints of the forefeet, and the degree of joint space narrowing in 30 joints of the hands and wrists and 12 joints of the forefeet. Readers see all radiographs of a patient on a screen, grouped for the proximal interphalangeal joints, metacarpophalangeal joints, wrist and feet. They score joint by joint, and decide on their joint scores by simultaneously comparing radiographs from the same patient at different time points, although they do not know the order in time (concealed time order). They do not score change directly, but they can express change by assigning different scores to different time points. They cannot indicate whether they think an observed change in a joint is due to repair or progression, because such an assignment requires knowledge of the true time order.
We have demonstrated previously that readers are unable to assign the true time sequence (or to distinguish repair from progression) to pairs of single joints of hands and feet or pairs of entire radiographs, so that we assume for the remainder of this analysis that the occurrence of change in individual joints under conditions outlined above is a process not driven by the readers’ presumption about the sequence of images.
The analyses provided in this report are based only on those images that were scored four times. As one of the goals of this single joint study was to gain insight into the validity of negative joint scores, and the discussion about repair involves the repair of previously existing erosions rather than the restoration of articular cartilage (joint space width), the analyses provided here are limited to erosion scores only.
In total, change scores of 7255 single joints belonging to 178 patients were investigated. For all 7255 joints, four scores per joint were available. Of these 178 patients, 53 belonged to the methotrexate arm, 60 to the etanercept only arm and 65 to the methotrexate plus etanercept combination arm.
In a first analysis, frequencies of joints scored with negative change (improvement), positive change (worsening) or no change over time were described for each of the four readings regardless of the magnitude of the change.
In a second analysis, we investigated per reading (N = 4) whether negative and positive change scores occurred in the same patient, how frequently this phenomenon occurred, and what was the impact on total change scores.
Finally, the agreement of change scores per joint was investigated by establishing the concordance of positive and negative change scores across the four independent readings.
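The concordance analysis described above can be illustrated with a minimal sketch. It assumes that each joint has four change scores (12-month score minus baseline score, one per reading) and reduces each score to its sign. The function names, the category labels and the example values are hypothetical and are not taken from the trial data.

```python
from collections import Counter

def sign(x):
    """Return +1, -1 or 0 for a positive, negative or zero change score."""
    return (x > 0) - (x < 0)

def classify_joint(change_scores):
    """Summarise the change scores of one joint across readings.

    Returns (n_positive, n_negative, label), where label is one of
    'no change', 'positive only', 'negative only' or 'opposite'.
    """
    signs = [sign(s) for s in change_scores]
    n_pos = signs.count(1)
    n_neg = signs.count(-1)
    if n_pos == 0 and n_neg == 0:
        label = "no change"
    elif n_pos > 0 and n_neg > 0:
        label = "opposite"
    elif n_pos > 0:
        label = "positive only"
    else:
        label = "negative only"
    return n_pos, n_neg, label

# Hypothetical change scores, ordered as: reader 1 reading 1,
# reader 1 reading 2, reader 2 reading 1, reader 2 reading 2.
example_joints = {
    "joint_a": [1, 1, 1, 1],    # same direction in all four readings
    "joint_b": [1, 0, 0, 0],    # positive in one reading only
    "joint_c": [-1, 0, -2, 0],  # negative in two readings
    "joint_d": [1, 0, -1, 0],   # opposite directions across readings
    "joint_e": [0, 0, 0, 0],    # no change in any reading
}

summary = {name: classify_joint(scores) for name, scores in example_joints.items()}
label_counts = Counter(label for _, _, label in summary.values())

for name, (n_pos, n_neg, label) in summary.items():
    print(name, n_pos, n_neg, label)
```

Reducing scores to their sign mirrors the choice to describe change regardless of its magnitude. A joint counts as fully concordant only when all four signs are equal and non-zero.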
The frequency of positive and negative single joint change scores as a percentage of all 7255 single joints that were available for analysis, tabulated by reader and by reading, is shown in table 1. It is apparent that change, either positive or negative, was a very rare feature in this trial: the great majority of joints were scored as unchanged, and between 1.3% and 5.8% of the joints were identified as changed across the readings. There was, however, intrareader and interreader variation: reader 1 scored a higher number of joints with change than reader 2 in both readings, and both readers assigned a change to a higher number of joints in the first reading compared with the second reading. In three of the four readings, there was a slight dominance of negative change scores over positive change scores and only reader 2 saw slightly more positive than negative change scores in one of the two readings.
We further analysed the extent to which both positive and negative single joint change scores co-occur in the same patient by aggregating single joint scores from each patient. In order to do so, three-dimensional frequency plots (histograms) were created, plotting the frequency of patients on the Y-axis, the number of joints with a positive change score on the X1-axis, and the number of joints with a negative change score on the X2-axis (fig 1). The analysis was carried out for each reading. In panel A of fig 1, some patients had no (neither positive nor negative) change in any joint, which is consistent with a sum score of zero (no change at a patient level). These patients (N = 43 for reader 1, first reading) are reflected by the highest bar at the crossing of the three inner axes of the graph (fig 1A). Patients who have one or more joints with only positive or only negative changes are depicted along one of the inner X-axes of the graph. They represent the second most frequent proportion of patients in this analysis. The three-dimensional space of the graph represents the patients who have some joints with positive changes and some joints with negative changes. The most extreme was a patient who had five joints with a negative change score and four joints with a positive change score (circle in fig 1A).
Looking at the four panels together, as well as the summarising table 2, it is obvious that in all readings (except reader 2, second reading, which was extremely “conservative”) the patients with some change outnumber the patients without any change, and that in patients with an observed change those with a unidirectional change outnumber those with a mixed change pattern, but that patients with a mixed pattern of change do exist.
It is also obvious from the figures, when comparing panels A and C with panels B and D, that reader 1 in comparison with reader 2 not only assigned more joints with change (table 1) but also provided more patients with a mixed pattern, both in reading 1 and in reading 2 (table 2).
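The per-patient aggregation behind the frequency plots can be sketched as follows. This is an illustration only: it assumes a list of per-joint erosion change scores for each patient in a single reading, and the patient names, values and pattern labels are hypothetical.

```python
from collections import Counter

def patient_pattern(joint_changes):
    """Count joints with positive and negative change for one patient in one reading."""
    n_pos = sum(1 for c in joint_changes if c > 0)
    n_neg = sum(1 for c in joint_changes if c < 0)
    if n_pos == 0 and n_neg == 0:
        pattern = "no change"
    elif n_pos > 0 and n_neg > 0:
        pattern = "mixed"
    else:
        pattern = "unidirectional"
    return n_pos, n_neg, pattern

# Hypothetical per-joint change scores for three patients (one reading,
# shortened to five joints each for readability).
patients = {
    "patient_1": [0, 0, 0, 0, 0],
    "patient_2": [1, 0, 2, 0, 0],
    "patient_3": [1, 0, -1, 0, 0],
}

# Bivariate frequency table: (n_positive, n_negative) -> number of patients.
histogram = Counter()
patterns = {}
for name, changes in patients.items():
    n_pos, n_neg, pattern = patient_pattern(changes)
    histogram[(n_pos, n_neg)] += 1
    patterns[name] = pattern
    # A sum score of zero can arise from no change or from a mixed pattern.
    print(name, pattern, "sum score:", sum(changes))
```

In this toy data, patient_1 and patient_3 both have a sum score of zero, but only the bivariate count distinguishes "no change" from a mixed pattern.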
Consistency of scoring across independent readings
In the subsequent analysis we investigated the degree of agreement among readers in assigning a positive or negative change score to the same joint (table 3). This table lists the frequency at which joints are assigned a positive or a negative change score in independent readings by independent readers (the possible categories are no change (not shown), positive change and negative change). Only 12 of the 7255 analysed joints had a similar change assignment in all four readings: six with a positive change score and six with a negative change score. More joints had similar change assignments in three of the four, or in two of the four readings.
Given the very poor reproducibility at the individual joint level, we investigated whether opposite scores were being assigned to joints in independent readings (table 4). The table lists the number of joints in which one reading, two readings or three readings assigned the same change (either positive or negative) to a single joint. It also lists per category of agreement the number of opposite scores assigned in one or more of the other readings. Note that the category of “positive in one reading only”, which is applicable to 215 single joints, implies that there are 645 (three times 215) scores available that stem from the other readings in which this joint was not assigned a positive score. The picture is clear in that the majority of “remaining readings” yielded no-change scores. However, opposite results did occur at a low frequency. We identified one joint that was assigned positive change scores in three readings and a negative change score in the remaining reading. Another observation is that opposite results occurred more frequently in the case of positive change scores (4.2%, 3.8% and 7.1%, respectively) than in the case of negative change scores in the majority of the readings (2.8%, 2.2% and 0%, respectively).
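The denominator used for the "remaining readings" can be made explicit with a short sketch. The function names are hypothetical, and the count of opposite scores in the second call is invented for illustration; only the figure of 215 joints comes from the text.

```python
def remaining_readings(n_joints, n_agreeing, n_readings=4):
    """Number of scores from the readings that did not assign the given change."""
    return n_joints * (n_readings - n_agreeing)

def opposite_rate(n_opposite, n_joints, n_agreeing, n_readings=4):
    """Percentage of the remaining scores that point in the opposite direction."""
    return 100.0 * n_opposite / remaining_readings(n_joints, n_agreeing, n_readings)

# From the text: 215 joints were scored positive in exactly one of four readings,
# leaving three other scores per joint.
print(remaining_readings(215, 1))  # 645

# Hypothetical counts, for illustration only: 10 opposite scores among the
# remaining scores of 100 joints that agreed in two of four readings.
print(opposite_rate(10, 100, 2))  # 5.0
```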
This single-joint analysis on radiographic progression showed that the level of agreement in assigning a positive or a negative change score to a pair of joints among readers (or at subsequent occasions) is extremely low. As a matter of fact, in the case of a change score, full agreement (similar results in four out of four readings) was obtained in only 12 of the 706 joints (1.7%), and almost complete agreement (similar results in three out of four readings) in 40 of 706 joints (5.7%).
At first glance, these figures seem disappointing and in contrast to the reproducibility of the Sharp score and its modifications seen in validation studies and clinical trials.4 14 15 Especially since modern clinical trials show only minimal progression, disagreement between readers at the joint level may have a potentially important impact on the study results. So the question is why aggregated joint scores actually do work appropriately in clinical trials. This study provides mitigating insight into how these seemingly discrepant observations can be explained, and how the van der Heijde–modified Sharp score (and probably other scoring systems) actually work in the context of a clinical trial.
First, a very small number of joints was assigned a (positive or negative) change score in at least one of the readings (table 1). The most likely explanation is that the large clinical trial database we investigated assessed therapies with confirmed efficacy for preventing the progression of structural damage. In the entire TEMPO trial, the mean one-year change in erosion scores was +1.68 for the methotrexate group and −0.30 for the methotrexate plus etanercept group. So one could expect a very low proportion of joints with a change assigned. More importantly, the poor absolute agreement in detecting change should be judged against the background of an extremely low prior probability of change, which may influence the performance of the readers. Suppose that only 3% of the joints are truly changed. This means that during a reading, the reader will assign “no change” roughly 32 times more frequently than “change” (97% versus 3%), which may make the reader reluctant to assign change. Readers will tend to assign “no change” in case of doubt. This hypothesis is supported by our observation that opposite scores are actually very rare (table 4); opposite scores occur at a frequency of approximately 4% or less, which is close to the average percentage of joints with change across readings, and as such are most probably due to chance occurrences (differences in judgement).
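The arithmetic behind the supposed 3% rate of true change can be checked with a small sketch. The 3% figure is the hypothetical value used in the text, and the count of 44 joints assumes the erosion component of the van der Heijde-modified Sharp score (32 hand and wrist joints plus 12 forefoot joints). The function name is illustrative.

```python
def no_change_to_change_odds(p_change):
    """Ratio of 'no change' to 'change' assignments for a given rate of change."""
    return (1.0 - p_change) / p_change

P_CHANGE = 0.03          # hypothetical proportion of truly changed joints
N_EROSION_JOINTS = 44    # joints scored for erosions per patient

odds = no_change_to_change_odds(P_CHANGE)
expected_changed = N_EROSION_JOINTS * P_CHANGE

print(round(odds, 1))              # 32.3
print(round(expected_changed, 2))  # 1.32
```

Under this assumption a reader would encounter, on average, little more than one changed joint per patient, which illustrates how a sum score close to zero can hinge on one or two joints.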
Second, an aggregated score such as the van der Heijde–modified Sharp score does not give insight into the pattern of joint changes within a patient. In the pre-biologics era, with less effective treatments, the sum score was composed of positive changes in several joints, but recently a number of trials have shown an average progression of 0 units, or even negative changes. Theoretically, such mean scores around zero could be made up of joint scores with opposite change. We have shown here that this theoretical possibility indeed occurs, but at a low frequency from 2.2% to 6.7% of the 178 patients investigated, depending on the reader, with negligible impact on the total change score.
In comparing two readers in the two readings, we found a difference in the tendency to assign a mixed change pattern to patients. Reader 1 was more willing to accept patients with (a low number of) opposite scores than reader 2. But regardless of the reading or the reader, the great majority of patients with a zero sum score were assigned “no change” to all joints, or, in case of change, a unidirectional pattern of change. It is interesting to speculate on the nature of the mixed change pattern. Because there is a demonstrable reader effect, and because the biological plausibility of a mixed change pattern is rather low, we tend to ascribe the mixed change pattern to measurement error rather than to a true (biological) effect. Reasoning along similar lines, a unidirectional change pattern may add to the credibility of joint damage progression or repair in a patient. This was undisputed with regard to progression, but so far the concept of repair has been criticised as being a measurement artefact. Admittedly, the number of patients with a unidirectional pattern of negative change was not high, and also reader dependent, but neither was the number of patients with a unidirectional pattern of positive change, which was also reader dependent. In the absence of a gold standard, these distinguishable unidirectional patterns add circumstantially to the validity of the concept of repair (or progression).
A few limitations should be mentioned here. For reasons of plausibility, we have only focused on erosion scores in this study, and we have excluded joint space narrowing scores from the analysis, but the picture would not be different as long as we consider scoring of erosions and joint space narrowing as independent phenomena. For reasons of convenience, we have investigated change as a binary variable (change versus no change), thus ignoring quantitative information that may have impacted the total score. In this trial, however, the change in an individual joint was 1 unit in 76% of the joints with change (data not shown), so that we considered the impact of quantification on the total score as negligible.
How do this seemingly poor reliability and these individual joint observations eventually translate into changes in the total Sharp score at the patient level (and at the trial level)?
The overall reported change score of a treatment group in a trial is the average of all individual patient scores. The individual patient score is the average of Sharp scores provided by two (or more) readers. These readers judge entire patients rather than single joints and can implicitly impose a consistent direction of change within a patient. The total Sharp score is the sum of change scores of 44 individual joints. As such, the reported change score of a group of patients is a highly aggregated composite measure, incorporating the effects of hundreds of patients, the opinions of at least two readers about thousands of joints, and factoring in the implicit direction of change. We have seen that the absolute agreement in single joint scores is (very) poor, but we have also seen that change assignments in readings are rarely, if ever, effaced by opposite assignments in independent readings. Similarly, mixed pattern assignments (mutually effacing effects) in individual patients are rare. So, if reader 1 assigns a positive change score to one particular joint, and reader 2 judges change in this particular joint as insufficiently clear and assigns “no change” to all joints, the total Sharp score for reader 1 will be +1 unit and for reader 2 0 units, in spite of the lack of absolute agreement, and the reported average Sharp score will be +0.5 units. If reader 1 factors in a unidirectional trend, that reader may score two other joints positively, with consequences for the reader's total Sharp score (+3 units) and for the grand mean score (+1.5 units), whereas reader 2 would have no reason to do that. Generally, neither reader 1 nor reader 2 would assign negative and positive change scores within the same patient (although the more sensitive reader 1 will probably do that a little more frequently than the conservative reader 2), so that the impact of these stochastic events is very limited.
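The worked example in the preceding paragraph can be reproduced with a minimal sketch. It assumes 44 erosion joints per patient, a change of 1 unit per changed joint, and a patient score that is the plain mean of the readers' totals. The function names and joint positions are hypothetical.

```python
N_JOINTS = 44  # joints scored for erosions per patient

def total_change(joint_changes):
    """Total change score of one reader: the sum over all joint change scores."""
    return sum(joint_changes)

def patient_score(reader_totals):
    """Reported patient score: the mean of the readers' total change scores."""
    return sum(reader_totals) / len(reader_totals)

# Scenario 1: reader 1 scores one joint as changed (+1), reader 2 scores none.
reader1 = [0] * N_JOINTS
reader1[0] = 1
reader2 = [0] * N_JOINTS
score_one_joint = patient_score([total_change(reader1), total_change(reader2)])
print(score_one_joint)  # 0.5

# Scenario 2: reader 1 factors in a unidirectional trend and scores two more joints.
reader1_trend = list(reader1)
reader1_trend[1] = 1
reader1_trend[2] = 1
score_three_joints = patient_score([total_change(reader1_trend), total_change(reader2)])
print(score_three_joints)  # 1.5
```

In both scenarios the readers disagree on every changed joint, yet the averaged patient score is positive and scales with the number of joints that reader 1 is confident about.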
As such, subtle changes in individual joints that are not reproducibly assessed by independent readers because of differences in the level of certainty translate into subtle but quantifiable changes in a patient’s total Sharp score, and eventually contribute to changes in group means. In modern trials with a very low level of true progression, scoring systems such as the modified Sharp score are instruments that challenge the level of confidence of individual readers in assigning change scores to potentially changed joints. These readers judge the joints of the entire patient, and are able to augment potential change if they are sufficiently confident. The low (biological) plausibility of opposite change scores within the patient and the lack of opposite results by other reader(s) protect the scoring system against a lack of sensitivity while there is a natural tendency to maintain specificity (conservatism in case of doubt). It is crucial to maintain an absolute level of blinding of treatment and time order, in order to prevent any potential source of bias that may guide the reader in a spurious direction. In view of the subtle changes occurring in trials, such biased assignments could have an immediate impact on the total score.
This example clearly demonstrates that the common use of cut-off levels for progression is spurious in studies with mean progression scores close to zero. It may qualify a patient as a progressor, whereas in truth the result is the consequence of interreader disagreement.
In summary, we have shown here that, although absolute agreement among readers in individual joint scores is poor, opposing results within the same patient occur rarely. This single joint analysis explains why even very subtle changes in individual joints, assigned by one reader, translate into measurable changes in the total Sharp score and into differentiation between treatment arms.
Competing interests None.