Objective: To assess the performance of radiologists and rheumatologists in detecting sacroiliitis.
Methods: 100 rheumatologists and 23 radiologists participated. One set of films was used for each assessment, another for training, and the third for confidence judgment. Films of HLA-B27+ patients with AS were used to assess sensitivity. For specificity films of healthy HLA-B27− relatives were included. Plain sacroiliac (SI) films with simultaneously taken computed tomographic scans (CTs) were used for confidence judgment. Three months after reading the training set, sensitivity and specificity assessments were repeated. Next, participants attended a workshop. They also rated 26 SI radiographs and 26 CTs for their trust in each judgment. Three months later final assessments were done.
Results: Sensitivity (84.3%/79.8%) and specificity (70.6%/74.7%) for radiologists and rheumatologists were comparable. Rheumatologists showed 6.3% decrease in sensitivity after self education (p=0.001), but 3.0% better specificity (p=0.008). The decrease in sensitivity reversed after the workshop. Difference in sensitivity three months after the workshop and baseline was only 0.5%. Sensitivity <50% occurred in 13% of participants. Only a few participants showed changes of >5% in both sensitivity and specificity. Intraobserver agreement for sacroiliitis grade 1 or 2 ranged from 65% to 100%. Sensitivity for CT (86%) was higher than for plain films (72%) (p<0.001) with the same specificity (84%). Confidence ratings for correctly diagnosing presence (7.7) or absence (8.3) of sacroiliitis were somewhat higher than incorrectly diagnosing the presence (6.6) or absence (7.4) of sacroiliitis (p<0.001).
Conclusion: Radiologists and rheumatologists show modest sensitivity and specificity for sacroiliitis and sizeable intraobserver variation. Overall, neither individual training nor workshops improved performance.
- ankylosing spondylitis
- AS, ankylosing spondylitis
- CT, computed tomography scans
- MRI, magnetic resonance imaging
- SI, sacroiliac
Ankylosing spondylitis (AS) is a chronic inflammatory disease, in which predominantly the axial skeleton is affected. Sacroiliitis is considered the hallmark of the disease. Involvement of the sacroiliac (SI) joints is usually established by plain radiographs or, to a lesser degree, by computed tomography scan (CT) or magnetic resonance imaging (MRI). Rheumatologists in daily practice mostly order plain radiographs for diagnosing sacroiliitis, and they often read these films themselves. Radiologists read and report SI joint radiographs requested by rheumatologists, general practitioners, or orthopaedic surgeons. Reading radiographs of the SI joints is considered difficult and the diagnosis of sacroiliitis is often missed or incorrectly established. Inter- and intraobserver variations have been reported to be large.1–3 κ Statistics to express intraobserver variation in these studies ranged from 0.07 to 1.0,2,3 and interobserver variation from 0.19 to 0.79.1–3
We questioned whether the performance of radiologists and rheumatologists in reading SI radiographs differs, and whether this performance might be improved by offering training sessions. We provided individual self education with a training set of radiographs and as a second step organised workshops to read radiographs of the SI joints. Our study aimed at defining variations in sensitivity and specificity of detecting sacroiliitis for both radiologists and rheumatologists, and at investigating to what degree sensitivity and specificity would persistently change after training sessions. We also aimed at quantifying the intraobserver variability in reading these radiographs. Furthermore, we wanted to estimate the self confidence of radiologists and rheumatologists in determining the presence or absence of sacroiliitis, for both plain radiographs and CT of the SI joints.
All 154 Dutch rheumatologists, except for the six involved in the study, were invited to participate in this project. In total 117 agreed, but for a variety of reasons 17 did not start. At baseline, 100 (85% of the 117 consenting rheumatologists) participated, after self education 86 (74%), and after the workshop 75 (64%) (fig 1). In addition, a random sample of 30 consenting radiologists (out of the total number of 687 radiologists in the Netherlands) was taken from a list of members of the Dutch Radiology Society. The percentage of their daily work which consisted of viewing skeletal radiographs was as follows: for 3 radiologists <10%, 15 radiologists 10–30%, 10 radiologists 30–50%, and this was unknown for 2 radiologists. At baseline 23 (77% of the random sample of radiologists) participated, after self education 22 (73%), and after the workshop 8 (27%) (fig 1).
Radiographs of the sacroiliac joints
For this study, three different sets of radiographs of the SI joints were composed: a scoring set for the assessments, a training set for individual training, and a confidence assessment set to estimate the observer’s perceived certainty in diagnosing sacroiliitis by plain radiograph or CT of the SI joints. All radiographs were derived from a large Swiss family survey among 275 HLA-B27 positive patients with AS and 511 first degree relatives, who all completed questionnaires and underwent physical examination, HLA typing, and radiographic studies of the SI joints.4 The radiographs were projected from an anteroposterior view, limited to the SI joints, and not including the hip joints. All radiographs had been scored twice “blindly” by each of four experts (two rheumatologists, one epidemiologist, and one radiologist). The mean score of the eight readings of each SI joint was rounded to the next whole figure. Only SI films of HLA-B27 positive patients with AS showing definite sacroiliitis (as defined by the expert panel) and fulfilling the modified New York criteria were included in this study to assess the observer’s sensitivity. For the assessment of specificity we used the films of HLA-B27 negative first degree relatives of the HLA-B27 positive patients with AS. These relatives had no signs or symptoms suggestive of AS. In addition, a subset of radiographs with simultaneously taken CTs was selected. In the same way, the four experts had judged these CTs separately from the conventional plain SI films.
The New York scoring method for the SI joints was followed: 0=no abnormalities; 1=suspicious changes (no specific abnormalities); 2=minimal sacroiliitis (loss of definition at the edge of the SI joints, there is some sclerosis and perhaps minimal erosions, there may be some joint space narrowing); 3=moderate sacroiliitis (definite sclerosis on both sides of the joint, blurring and indistinct margins, and erosive changes with loss of joint space); 4=complete fusion or ankylosis of the SI joint (without any residual sclerosis).5 According to the modified New York criteria at least grade 2 bilaterally or grade 3 or 4 unilaterally is necessary for the diagnosis of AS.6 The gradings for the left and right SI joint were recoded into one final grading representing sacroiliitis according to the New York criteria.
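The diagnostic threshold described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `has_sacroiliitis` and its integer inputs are hypothetical and not part of the study materials, and the function returns a yes/no result rather than the full final grading category.

```python
def has_sacroiliitis(left: int, right: int) -> bool:
    """Apply the radiographic threshold of the modified New York criteria.

    Grades run from 0 (no abnormalities) to 4 (complete ankylosis).
    Sacroiliitis is present with at least grade 2 on both sides,
    or grade 3 or 4 on at least one side.
    """
    for grade in (left, right):
        if grade not in range(5):
            raise ValueError(f"grade must be an integer from 0 to 4, got {grade!r}")
    bilateral_grade_2 = min(left, right) >= 2   # both joints at least grade 2
    unilateral_grade_3 = max(left, right) >= 3  # one joint grade 3 or 4
    return bilateral_grade_2 or unilateral_grade_3


# Hypothetical readings of four films: (left grade, right grade)
for left, right in [(0, 0), (1, 2), (2, 2), (0, 3)]:
    print(left, right, has_sacroiliitis(left, right))
```

A film graded 2 on one side and 1 on the other stays below the threshold, which is why small differences between grades 1 and 2 can change the diagnosis.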
The assessment or scoring set comprised 50 radiographs, of which 10 appeared twice (reversed, that is, the left joint was now marked as the right joint and vice versa): 12 (+3 repeats) radiographs with a final New York grading of “no sacroiliitis”, 12 (+3 repeats) with a grading of “dubious abnormalities”, 12 (+3 repeats) with bilateral definite sacroiliitis grade 2, and 4 films (+1 repeat) with a grading of 3 or 4. For each radiograph, only the age and sex of the subject were provided. The mean age of the 16 patients with AS represented in the scoring set was 44.1 years; the mean disease duration was 14.9 years at the time the SI radiographs for this study were taken. The training set also comprised 50 radiographs (10 for each grading 0–4). For each radiograph, information on the grading was provided. The confidence assessment set was composed of 17 plain SI radiographs and 17 corresponding CTs with sacroiliitis and 9 SI/CT pairs without sacroiliitis. For these films no information on grading, age, sex, or clinical findings was provided. All radiographs in the scoring set and in the confidence assessment set appeared in a completely random order.
In total three assessments took place with the scoring set (fig 1). At each of these three occasions all participants individually graded each SI joint of the 50 radiographs according to the New York criteria. Firstly, a baseline score of sensitivity and specificity was established with the scoring set. One month later, each participant received individually the training set in order to practice reading radiographs individually. Three months after this training by self education, the participants again received the scoring set to assess the presence of persisting effects of this training procedure.
Another three months later, the participants attended one of several workshops organised throughout the country (with a maximum of 20 participants per workshop), in which the full spectrum of normal and abnormal SI joints, and the grading of these joints according to the New York criteria, was shown and discussed with the participants. In this workshop the same training set as for individual self education was used. Finally, in the same workshop, the participants also judged another 26 radiographs of the confidence assessment set for the presence or absence of sacroiliitis (according to the modified New York criteria), and rated their self confidence about their “yes/no” judgment from 1 (maximally uncertain) to 10 (absolutely sure). The same exercise was done for the 26 CTs from the same patients. Half of the participants first assessed the plain SI joints, and the other half started with reading the 26 CT films. When all scoring data were completed all the radiographs and CTs were discussed referring to the judgments by the expert panel. Three months after this workshop, the final assessment took place with the scoring set as before.
All radiograph sets were sent by post to the participants, who were requested to read the radiographs within the next two weeks. If necessary, a reminder was sent. After the last measurement, all participants received feedback on their personal scores together with the aggregated (centiles) results of all participants.
After completion of each scoring set, the final grading for each radiograph was dichotomised into the presence or absence of sacroiliitis. Grade 2 or more bilaterally, and grade 3 or 4 unilaterally was taken as the presence of sacroiliitis according to the New York criteria. The results from each participant were compared with the “gold standard” as defined by the expert panel. Sensitivity and specificity for each participant were calculated using 2×2 tables for each assessment period. The results of sensitivity and specificity at baseline were compared with the results after self education and after the workshop by paired t test.
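The 2×2 calculation for one participant can be sketched as follows. The function name and the example reader are hypothetical; the only figures taken from the text are the 16 films with and 24 films without sacroiliitis that make up the scoring set once repeats are excluded.

```python
def sensitivity_specificity(reader, gold):
    """Return (sensitivity, specificity) of one reader against the gold standard.

    Both arguments are equally long sequences of booleans,
    where True means "sacroiliitis present" for that film.
    """
    if len(reader) != len(gold):
        raise ValueError("reader and gold must have the same length")
    tp = sum(r and g for r, g in zip(reader, gold))              # true positives
    fn = sum((not r) and g for r, g in zip(reader, gold))        # false negatives
    tn = sum((not r) and (not g) for r, g in zip(reader, gold))  # true negatives
    fp = sum(r and (not g) for r, g in zip(reader, gold))        # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("gold standard needs both positive and negative films")
    return tp / (tp + fn), tn / (tn + fp)


# Gold standard: 16 films with and 24 films without sacroiliitis.
gold = [True] * 16 + [False] * 24
# Hypothetical reader who misses 4 positive films and overcalls 6 negative films.
reader = [True] * 12 + [False] * 4 + [True] * 6 + [False] * 18
sens, spec = sensitivity_specificity(reader, gold)
print(f"sensitivity {sens:.0%}, specificity {spec:.0%}")  # sensitivity 75%, specificity 75%
```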
The mean intraobserver agreement (concordance rate) was calculated by means of 2×2 tables for both radiologists and rheumatologists for each of the 10 films that were presented in duplicate (although in reversed right-left order). The κ statistic was not applied. Repeated films were included only once in the analysis of sensitivity and specificity.
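A concordance rate for one duplicated film could be computed along the following lines. This sketch assumes that agreement was judged on the dichotomised result (sacroiliitis present or absent) of each reader's two readings; the function name and the example data are hypothetical.

```python
def concordance_rate(first_reading, second_reading):
    """Percentage of readers whose two readings of the same film agree.

    Each argument holds one dichotomised result per reader
    (True = sacroiliitis present), in the same reader order.
    """
    if len(first_reading) != len(second_reading) or not first_reading:
        raise ValueError("need one pair of readings per reader")
    agreeing = sum(a == b for a, b in zip(first_reading, second_reading))
    return 100.0 * agreeing / len(first_reading)


# Hypothetical film read twice by 20 readers; 13 give the same answer both times.
first = [True] * 10 + [False] * 10
second = [True] * 8 + [False] * 2 + [False] * 5 + [True] * 5
print(concordance_rate(first, second))  # 65.0
```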
To calculate sensitivity and specificity the judgments on the presence or absence of sacroiliitis for the 26 SI radiographs and CTs were compared with the standard defined by the experts. Differences in sensitivity and specificity between SI radiographs and CTs were calculated with paired t tests. The ratings that the participants had provided for their self confidence were subdivided into two groups: films correctly (according to the expert panel) diagnosed for the presence or absence of sacroiliitis and films incorrectly diagnosed for the presence or absence of sacroiliitis. This was done for plain radiographs and CTs separately. The differences between ratings on correctly versus incorrectly diagnosed sacroiliitis, and the differences in ratings of radiographs versus CTs, were analysed by independent t tests.
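For readers unfamiliar with the paired t test used for the radiograph versus CT comparison, the test statistic can be sketched with the Python standard library alone. The function name and the example values are hypothetical; the sketch returns only the t statistic and its degrees of freedom, and the p value would still have to be looked up in a t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired observations a and b.

    t is the mean of the pairwise differences divided by its standard error.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Hypothetical sensitivities (%) of four participants on CT and on plain films.
ct = [90, 85, 80, 95]
plain = [70, 75, 65, 80]
t, df = paired_t_statistic(ct, plain)
print(f"t = {t:.2f} with {df} degrees of freedom")
```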
Sensitivity and specificity
Table 1 shows the mean sensitivity and specificity scores for both rheumatologists and radiologists at baseline before any training, after individual self education, and after the workshop. In general, the radiologists showed somewhat higher sensitivity scores and lower specificity scores than the rheumatologists, but the differences were not statistically significant at any time. The scores for the radiologists did not significantly change after self education or the workshop, possibly partly reflecting the smaller numbers involved, especially at the last assessment. In contrast, the rheumatologists showed a statistically significant decrease in sensitivity after individual self education, but at the same time a statistically significant improvement in specificity. The decrease in sensitivity was reversed after the workshop, because at that time a statistically significant increase compared with the results after self education was observed. The differences between sensitivity three months after the workshop and sensitivity at baseline were not statistically significant.
Because the differences in sensitivity and specificity between rheumatologists and radiologists were not statistically significant at each of the three assessments, further analyses were performed with pooled data from both groups.
Table 2 presents the distribution in sensitivity and specificity of all participants at each assessment period. Clearly, a relatively large group of participants had difficulties in diagnosing sacroiliitis (for example, sensitivity of ⩽50% at baseline for 13% of the participants). Participants with a low sensitivity or a low specificity score appeared to show high specificity or sensitivity scores respectively (table 3). Similarly, high sensitivity and high specificity scores were associated with low specificity and low sensitivity scores, respectively.
Change in sensitivity and specificity for individual participants
Minor changes in sensitivity or specificity at the group level do not preclude larger changes for individual participants. Therefore, to find out what kind of changes in sensitivity and specificity occurred after self education or after the workshop for individual participants, we recoded the change scores of each participant dichotomously by considering up to ±5% change in sensitivity or specificity as “no important change” and all other changes above or below this cut off point as a “relevant change”. Table 4 shows the results of the profiles of the participants three months after self education and three months after the workshop. Relevant increases in both sensitivity and specificity occurred only in a minority of the participants. Most of the participants showed an increase in either sensitivity or specificity without a relevant change in specificity or sensitivity, respectively.
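The recoding of change scores can be sketched as below. The function name, the labels for the direction of a relevant change, and the example scores are illustrative assumptions; only the ±5 percentage point cut-off comes from the text.

```python
def classify_change(baseline, follow_up, cutoff=5.0):
    """Recode a change in sensitivity or specificity (percentage points).

    Changes of up to +/- cutoff count as "no important change";
    anything beyond is a relevant increase or decrease.
    """
    delta = follow_up - baseline
    if delta > cutoff:
        return "relevant increase"
    if delta < -cutoff:
        return "relevant decrease"
    return "no important change"


# Hypothetical participant: sensitivity rises from 63% to 75%,
# specificity falls from 80% to 77%.
print(classify_change(63.0, 75.0))  # relevant increase
print(classify_change(80.0, 77.0))  # no important change
```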
The participants with an improvement in sensitivity after self education compared with baseline of >5%, showed a mean (SD) sensitivity at baseline of 63 (20) (n=23), and those with a change of >5% after the workshop a mean baseline sensitivity of 64 (16) (n=23). Similarly, an improvement in specificity of >5% after self education and after the workshop was associated with lower specificity scores at baseline of 60 (11) (n=33) and 63 (10) (n=22), respectively, than the mean specificity scores of the whole group as shown before (table 1). Furthermore, the participants with a decrease of >5% in sensitivity after self education and the workshop compared with baseline, showed higher baseline scores in sensitivity (84 (14) (n=57) and 86 (15) (n=33), respectively), and those with a decrease of >5% in specificity showed higher baseline scores in specificity (77 (16) (n=28) and 84 (14) (n=23), respectively).
Table 5 presents the intraobserver variation for both rheumatologists and radiologists for each of the 10 radiographs—together representing the spectrum of sacroiliitis—that were used for this assessment. No major differences in agreement between each of the repeated radiographs were found between rheumatologists and radiologists. The most extreme grades 0 and 4 showed the highest agreement. Sacroiliitis grades 1 and 2 showed substantial variation represented by lower agreement rates.
Plain radiograph compared with computed tomography
During the workshop each participant had to read independently in random order 26 radiographs and 26 CTs of the same patients for the presence or absence of sacroiliitis according to the New York criteria. In addition, for every judgment the participants also had to provide a rating on a 1–10 scale for their perceived self confidence for each of these diagnostic decisions. Table 6 presents the results of sensitivity and specificity, and the confidence ratings of the participants for both plain SI radiographs and CTs. The sensitivity score for CTs was significantly higher than for plain radiographs, whereas no difference in specificity was found.
The presence of sacroiliitis is mandatory for the diagnosis of AS. The SI joints are unilaterally or bilaterally affected with mild to severe inflammation, which may eventually lead to partial or complete ankylosis.7 The recognition of sacroiliitis is, however, often considered as difficult and requires experience. In this study the performance of rheumatologists and radiologists in detecting sacroiliitis has been evaluated. Three features of this nationwide study—in which more than 50% of all Dutch rheumatologists and a small sample (4.4% at baseline; 1.2% at completion) of radiologists participated—are striking.
Firstly, the diagnosis of radiographic sacroiliitis by radiologists and rheumatologists was comparable. Secondly, sensitivity and specificity scores were relatively moderate: 15–25% of the radiographs were incorrectly classified as if sacroiliitis was present (false positives), and 20–30% of the radiographs were incorrectly classified as if sacroiliitis was absent (false negatives) (table 1). A high sensitivity score was associated with a low specificity score (table 3), and an increase in sensitivity was often accompanied by decreased specificity and vice versa (table 4). Thirdly, improvement in both sensitivity and specificity that will persist for at least three months after a training session appeared to be difficult to achieve. It is worrying that both sensitivity and specificity decreased in a large group of participants. Thus, the individual training sessions and the workshops as provided cannot be regarded as effective in promoting the performance of “blindly” diagnosing the presence or absence of radiographic sacroiliitis. Although individual compliance was not assessed, non-compliance overall cannot explain the apparent lack of effects of the workshop which was attended by the majority (75%) of the participants.
It is difficult to explain these observations. It seems that an improvement or a decrease in sensitivity (or specificity) after a training session was associated with correspondingly lower or higher sensitivity (or specificity) scores at baseline as compared with the mean score of the group at baseline (table 3). This effect may be attributed to regression to the mean or a floor-ceiling effect. Possibly, after training sessions, the attitude towards interpreting radiographs might have changed. Participants with initially low sensitivity scores might now have considered every spot or blurring at SI joints as aberrant, thereby improving the sensitivity score, but at the cost of specificity. Conversely, participants with initially low specificity scores might now have considered every spot or blurring at the SI joint more cautiously, at the cost of sensitivity. However, the participants were not informed about their sensitivity and specificity scores during the study period. Clearly, even after training it remains difficult to distinguish between the normal and abnormal. Possibly, the same intervention should not have been offered to every participant. It might have been better to have assessed sensitivity and specificity first and then provide different targeted interventions to those participants with low sensitivity (and high specificity) and those with high sensitivity (and low specificity). This might be an area for future research.
The relative roles of plain radiographs, CT, and MRI in the radiographic diagnosis of sacroiliitis remain a matter of debate. The high sensitivity of CT and MRI is well known. Several studies have reported that CT and MRI are better than plain radiographs in detecting early sacroiliitis.8–13 However, because of the cost and other limitations to resources it is not always possible to use these techniques in the routine diagnosis of sacroiliitis.14,15 Therefore, plain SI radiographs remain mostly the initial diagnostic tool. CT or MRI may be particularly helpful as an additional diagnostic aid in the early stages of sacroiliitis (when plain radiographs may be negative) if there is a high probability of sacroiliitis, or conventional radiographs are inconclusive.
Owing to the difficulties in interpreting plain radiographs of SI joints, large inter- and intraobserver variations have been reported.1–3 In our study, intraobserver variability was expressed as the percentage of agreement for each of 10 radiographs that appeared twice in the scoring set. Clearly, concordance is highest if SI joints are definitely normal (grade 0) or definitely abnormal (grade 4). The use of κ statistics would not have been useful in this situation because of the high levels of expected agreement. The amount of intraobserver variation was comparable for both rheumatologists and radiologists. Most variation was seen in grades 1 and 2. However, the diagnostic—and possibly therapeutic—consequences of such seemingly small differences in grading of SI joints are most important. Patients with grade 2 sacroiliitis bilaterally will usually be diagnosed and treated as having AS, whereas people with grade 1 will normally not be considered as having an inflammatory rheumatic disease. Especially in these cases, CT or MRI may be helpful.16 It should be noted, however, that there are clear differences in properties—and therefore also in appropriateness of their application—among plain SI films, CT, and MRI. Plain radiographs provide an image where all sections are added to each other, whereas CT and MRI give information in slices. Furthermore, plain films and CT can assess mainly bone and bone destruction, whereas MRI can assess cartilage and inflammation in the acute stage. It should also be realised that AS might sometimes occur in the absence of radiographic sacroiliitis.17
Another aim of our study was to assess the degree of confidence of rheumatologists and radiologists in determining the presence or absence of sacroiliitis on plain radiographs and CTs of the same patients. The ratings for the correctly diagnosed presence or absence of sacroiliitis were on average higher than those for the incorrectly diagnosed presence or absence of sacroiliitis (p<0.001). Although the participants felt less certain about their judgments of those radiographs and CTs which they misdiagnosed than those which they correctly diagnosed, the ratings for the incorrectly diagnosed radiographs remained, somewhat surprisingly, relatively high (mean 6.6 for incorrectly diagnosed presence and 7.4 for incorrectly diagnosed absence, on a 1–10 scale) (table 6). Clearly, the use of CTs compared with the use of radiographs did not increase the self confidence of the participants. However, many rheumatologists felt they did not have sufficient experience in reading CTs of SI joints and, therefore, these results might improve after training. On the other hand, the number of radiologists who participated in this part of the study is too small to generalise the findings.
The prevalence of definite sacroiliitis in the scoring set was 40%. This high a priori likelihood was unknown to the participants. It is unrealistically high in daily practice of radiologists, but on the other hand, diagnostic gain is at its highest level if the pretest probability is about 50%. Therefore, for rheumatologists this prevalence would indicate proficiency in making use of diagnostic tools. If a large number of normal (grade 0) SI films had been included in the scoring set this would have inflated the specificity artificially without clearly predictable effects on the sensitivity of diagnosing radiographic sacroiliitis.
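The 40% figure follows directly from the composition of the scoring set once the 10 repeated films are excluded. A short check, with variable names chosen only for illustration:

```python
# Unique films per final grading category in the scoring set (repeats excluded).
unique_films = {
    "no sacroiliitis": 12,
    "dubious abnormalities": 12,
    "bilateral grade 2": 12,
    "grade 3 or 4": 4,
}

# Categories that count as definite sacroiliitis.
definite = unique_films["bilateral grade 2"] + unique_films["grade 3 or 4"]
total = sum(unique_films.values())
prevalence = definite / total

print(f"{definite} of {total} films = {prevalence:.0%}")  # 16 of 40 films = 40%
```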
Finally, except for data on the age and sex of the patient in the scoring set, no clinical findings from the patient’s history or physical examination were provided. Therefore, only radiographs were presented to radiologists and rheumatologists in order to evaluate their performance in detecting sacroiliitis. This radiological diagnosis is an indispensable condition for the diagnosis of AS. In daily practice, however, rheumatologists mostly take into consideration the clinical information of the patient before they come to a final judgment. Rheumatologists may decide to re-evaluate the patient at a later time, or to refer the patient for additional CT or MRI. Recently, a study has assessed the real performances of (Dutch) rheumatologists in daily practice, visited by patients incognito.18 In particular, a female patient mimicking symptoms suggestive of AS and referred by her general practitioner visited a total of 25 rheumatologists. She brought with her a radiograph from another hospital clearly showing bilateral sacroiliitis. After history taking and physical examination, in which nearly all rheumatologists performed spinal mobility tests, more than 50% of the rheumatologists proposed additional radiographic imaging.18 Evidently, a large group of rheumatologists felt uncertain about interpreting radiographs. Unfortunately, this study does not seem to contribute towards increasing their performance. It should again be emphasised that our study assessed sensitivity, specificity, and observer variation in reading films of SI joints, but did not take into consideration the effects on these characteristics of any clinical information. Such clinical data might already be known before reading the films or may be provided to the observer afterwards. The final effect of such additional information on the precision of establishing sacroiliitis as an indispensable condition for the diagnosis of AS is not yet known.
In conclusion, long-lasting improvements in the performance of diagnosing sacroiliitis seem difficult to achieve, at least through self education using a training set of SI films or through uniform workshops. No statistically significant differences in sensitivity, specificity, and intraobserver variation of reading radiographs of SI joints were found between the radiologists and rheumatologists. Currently, CT of SI joints as compared with plain SI radiographs does not improve self confidence in diagnosing sacroiliitis.
The authors thank the radiologists and rheumatologists for their effort and time spent in participating in this study.
Funding: Dutch Rheumatism Association (Nationaal Reuma Fonds).