Objective To develop ultrasound (US) definitions and a novel US scoring system for major salivary gland (SG) lesions in patients with primary Sjögren’s syndrome (pSS) and to test their intrareader and inter-reader reliability using US video clips.
Methods Twenty-five rheumatologists were subjected to a three-round, web-based Delphi process in order to agree on (1) definitions and scanning procedure of salivary gland ultrasonography (SGUS): parotid, submandibular and sublingual glands (PG, SMG and SLG); (2) definitions for the elementary SGUS lesions in patients with Sjögren’s syndrome; (3) scoring system for grading changes. The experts rated the statements on a 1–5 Likert scale. In the second step, SGUS video clips of patients with pSS and non-pSS sicca cases were collected containing various spectrums of disease severity followed by an intrareader and inter-reader reliability exercise. Each video clip was evaluated according to the agreed definitions.
Results Consensual definitions were developed after three Delphi rounds. Among the three selected SGs, US assessment of PGs and SMGs was agreed on. Agreement was reached to score only greyscale lesions and to focus on anechoic/hypoechoic foci in a semiquantitative manner or, if this was not possible, on a qualitative (present/absent) evaluation of fatty or fibrous lesions. Intrareader reliability for detecting and scoring these lesions was excellent (Cohen’s kappa 0.81) and inter-reader reliability was good (Light’s kappa 0.66).
Conclusion New definitions for developing a novel semiquantitative US score in patients with pSS were developed and tested on video clips. Inter-reader and intrareader reliabilities were good and excellent, respectively.
- salivary glands
- primary sjögren syndrome
- reliability exercise
What is already known about this subject?
The current absence of a consensus about definitions and scoring of elementary lesions detected by salivary gland ultrasonography is an obstacle to the diagnosis and monitoring of primary Sjögren’s syndrome.
What does this study add?
This study substantially advances the validation of a novel, reliable, semiquantitative scoring system for ultrasound abnormalities of the major salivary glands in patients with known or suspected primary Sjögren’s syndrome.
Video clips shared via the internet proved useful for having multiple experts assess ultrasonography examinations.
How might this impact on clinical practice or future developments?
Availability of a reliable scoring system for salivary gland ultrasonography abnormalities should assist in the diagnosis and follow-up of primary Sjögren’s syndrome.
The video clip method used in this study can be expected to improve future reliability assessments.
Primary Sjögren’s syndrome (pSS) is a systemic autoimmune disease of unclear aetiology that primarily targets the lacrimal and salivary glands (SG), causing dryness of the eyes and mouth. Mononuclear lymphoid cells infiltrate SGs, as well as the lacrimal glands, resulting in structural damage.1 In patients with suspected pSS, establishing the diagnosis may be challenging, as no single parameter, such as an autoantibody or other biomarker, is highly specific and SG histology may be inconclusive.2
The American-European Consensus Group criteria are currently the most widely used classification criteria for patients with pSS symptoms.3 Several recent studies suggest that salivary gland ultrasonography (SGUS) may be helpful for evaluating the presence of SG involvement in patients with suspected or established pSS and could be another parameter for diagnosis.4–7 American College of Rheumatology-European League Against Rheumatism (EULAR) criteria were published more recently without including SGUS.8 Furthermore, two studies support the usefulness of SGUS in monitoring the effect of treatment in clinical trials.7 9
Before SGUS can be used as an outcome measurement instrument (OMI), its reliability must be tested. There is an unmet need to standardise and assess SGUS because no international consensus exists on the definitions and scoring of SGUS elementary lesions. A few efforts have been made to test inter-reader reliability based on definitions from a few well-designed studies.10–15
To validate the use of SGUS as a possible OMI for SG lesions in patients with pSS, a subtask force of the Outcome Measures in Rheumatology Clinical Trials (OMERACT) working group was created in June 2016. The objectives of the subtask force were as follows: (1) to develop a consensus on definitions of normal ultrasound (US) findings in major SG and a standardised scanning protocol; (2) to develop US definitions on elementary lesions in patients with pSS and a scoring system; (3) to test the reliability of the scoring system on patients with pSS and non-pSS sicca patients using consensual definitions and the protocol using a web-based video clip platform.
Materials and methods
A stepwise approach was followed using the OMERACT methodological framework.16 First, a Delphi survey including questions on elementary SGUS abnormalities was developed based on a systematic literature review.17 The Delphi process is widely used among experts in diverse fields to reach a consensus of opinion on a specific topic. This process gathers data on what is known or not known about a fact or a topic from a panel of expert respondents to successive surveys. It is a flexible and adaptable tool. Nevertheless, the expert selection and the timing of its conduct should be well defined before its deployment in order to obtain reliable data. Such a process has pitfalls, such as low response rates, biased feedback and the surveying of panel experts who may lack optimal knowledge of the studied topic.18 Thirty-eight statements regarding preliminary definitions were generated by a small steering committee (SJJ, GAWB, EN, MADA) and assembled into three sections. Section 1 comprised 27 statements relevant to the SGUS normal component definitions and scanning procedure of the parotid glands (PG), submandibular glands (SMG) and sublingual glands (SLG). Section 2 was composed of three statements relevant to definitions of elementary SGUS abnormalities in the PGs, SMGs and SLGs in patients with pSS. Finally, section 3 included eight statements relevant to a new scoring system for grading US abnormalities in the PGs, SMGs and SLGs in patients with pSS.
The survey was then sent by email to a broader group of 25 rheumatologists, all experts in US and members of the OMERACT US task force in Sjögren disease. The Delphi participants came from 14 countries (Austria, Czech Republic, Denmark, Egypt, France, Germany, Italy, Mexico, Norway, Slovenia, Spain, Switzerland, The Netherlands and the USA). The participants were asked to rate their level of agreement or disagreement with each statement on a 5-point Likert scale, with 1=strongly disagree (inappropriate), 2=disagree, 3=neither agree nor disagree, 4=agree and 5=strongly agree (completely appropriate). A Likert score of 4 or 5 was considered as agreement. Consensus on a statement was considered reached only when more than 75% of the participants gave it a score of 4 or 5. Statements satisfying this requirement were used for the definitions and applied to the scoring system during the video clip exercise. Statements that had already achieved agreement in the first Delphi round, but attracted suggestions for improved wording, were rephrased according to the experts’ comments and reappraised in the second round.
Statements with a <75% agreement on a score of 4 or 5 on the Likert scale in the first round were not further taken to the second round. Free-comment fields were provided for every statement.
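The consensus rule described above can be illustrated with a minimal Python sketch. The function and constant names are hypothetical and are not taken from the study; the only inputs drawn from the text are the 1–5 Likert scale, the definition of agreement as a score of 4 or 5, and the requirement of more than 75% agreement.

```python
from typing import Sequence

AGREEMENT_SCORES = {4, 5}    # Likert ratings counted as agreement (from the text)
CONSENSUS_THRESHOLD = 0.75   # consensus requires strictly more than 75% (from the text)


def agreement_fraction(ratings: Sequence[int]) -> float:
    """Return the fraction of respondents who rated a statement 4 or 5."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers on the 1-5 Likert scale")
    return sum(r in AGREEMENT_SCORES for r in ratings) / len(ratings)


def has_consensus(ratings: Sequence[int]) -> bool:
    """Apply the >75% agreement rule to one statement (hypothetical helper)."""
    return agreement_fraction(ratings) > CONSENSUS_THRESHOLD


if __name__ == "__main__":
    # Illustrative ratings from 22 respondents; these are invented numbers.
    example = [5] * 10 + [4] * 7 + [3] * 3 + [2] * 2
    print(f"agreement: {agreement_fraction(example):.1%}, "
          f"consensus: {has_consensus(example)}")
```

Under this reading of the rule, a statement with exactly 75% agreement would not count as consensus, because the threshold is applied as a strict inequality.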
Patient and public involvement
Our study advances the development of a reliable consensual semiquantitative scoring system for US abnormalities of the major SG in patients with known or suspected primary Sjögren’s syndrome. This aim was pursued by using and sharing video clips of US-SG among a large panel of experts. There was no patient and public involvement or partnership in the design or conduct of the study.
Video clip collection
The third phase of the validation process consisted of testing the intrareader and inter-reader reliability of the new scoring system. To this end, the 25 experts were asked to assess SGUS video clips in greyscale. First, a subset of video clips was sent to all experts to assess the image quality and to test the feasibility of the video clip exercise. The experts were instructed to record anonymised video clips of SGUS examinations during their usual clinical practice in subsequent patients with pSS and non-pSS sicca patients seen at their respective institutions. Each video clip was limited to a maximum time of 5 s to allow sharing via secure web transfer. Longitudinal and transverse scans were recorded on the same video. In each patient, both PGs and SMGs were examined, yielding 12 video clips per patient. All video clips were sent to the study centre in Brest, France, where they were checked against the quality requirements for high-resolution images, for example, high-frequency linear US probe at 12–18 MHz, greyscale image only, longitudinal and transverse planes of the left and right PGs and SMGs according to the Delphi process, and 5 s video clip duration (video clip examples are now available as online supplementary videos). High-quality greyscale videos were thus assembled to create a video clip atlas of normal and abnormal SGUS findings in order to evaluate the diagnostic usefulness of SGUS irrespective of machine type and preset mode. Some of the findings met none of the definitions developed by the Delphi process, as they exhibited fatty replacement or fibrosis but no anechoic or hypoechoic foci. When the SG scans did not show any anechoic or hypoechoic foci, experts were given the option of scoring such glands qualitatively as fatty or fibrous.
Fatty replacement was defined as fatty infiltration within the gland. The surface of such a fatty gland becomes homogeneously hyperechoic compared with adjacent tissue. Fibrosis was defined as the presence on the gland surface of hyperechoic bands that develop into fibrotic tissue indistinguishable from the adjacent soft tissues.
The video clips that were found to be of high quality were then pooled and sent in random order to the experts who had participated in all three Delphi rounds. There were two video clip reading rounds, one between January and June 2017, and the second between July and November 2017. Thus, each expert read all the video clips.
Inter-reader reliability was assessed by using kappa statistics, and computing the weighted kappa coefficient (Fleiss-Cohen weights) for each pair of readers for statements with more than two ordinal categories.19 The minimum (min) and maximum (max) kappa values were calculated. Then, Light’s kappa, that is, the mean kappa value, was computed. To assess intrareader reliability, we computed the weighted kappa coefficients between two readings by each expert. We then computed Light’s kappa (mean of intrareader kappa values). The bootstrap percentile method was used to compute the 95% CI of Light’s kappa.20 Kappa values were interpreted according to Landis and Koch.21
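The quantities named above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: grades are assumed to be integers 0–3 with no missing values, the Fleiss-Cohen weights are taken to be the usual quadratic weights, and the bootstrap is assumed to resample video clips with replacement (the text does not state the resampling unit). All function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def weighted_kappa(a, b, n_categories):
    """Cohen's weighted kappa with quadratic (Fleiss-Cohen) disagreement weights."""
    if len(a) != len(b) or not a:
        raise ValueError("both readings must be non-empty and of equal length")
    n, k = len(a), n_categories
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2  # the 1/(k-1)^2 normalisation cancels in the ratio
            num += w * observed[i][j]
            den += w * row[i] * col[j]
    if den == 0.0:
        raise ValueError("kappa is undefined when both readers use one category")
    return 1.0 - num / den


def lights_kappa(ratings, n_categories):
    """Mean of pairwise weighted kappas; ratings[r][c] is reader r's grade for clip c."""
    pairs = combinations(range(len(ratings)), 2)
    return mean([weighted_kappa(ratings[i], ratings[j], n_categories)
                 for i, j in pairs])


def bootstrap_ci(ratings, n_categories, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Light's kappa, resampling clips (an assumption)."""
    rng = random.Random(seed)
    n_clips = len(ratings[0])
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_clips) for _ in range(n_clips)]
        resampled = [[reader[i] for i in idx] for reader in ratings]
        try:
            stats.append(lights_kappa(resampled, n_categories))
        except ValueError:
            continue  # skip degenerate resamples where kappa is undefined
    if not stats:
        raise ValueError("no valid bootstrap replicate")
    stats.sort()
    lower = stats[int(alpha / 2 * len(stats))]
    upper = stats[max(0, int((1 - alpha / 2) * len(stats)) - 1)]
    return lower, upper


if __name__ == "__main__":
    # Invented grades from three readers on ten clips, for illustration only.
    readers = [
        [0, 1, 2, 3, 0, 1, 2, 3, 0, 2],
        [0, 1, 2, 3, 0, 2, 2, 3, 1, 2],
        [0, 0, 2, 3, 0, 1, 3, 3, 0, 2],
    ]
    print(lights_kappa(readers, 4), bootstrap_ci(readers, 4))
```

For intrareader reliability, the same `weighted_kappa` function would be applied to the two readings of each expert and the results averaged.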
In the primary analysis, all findings described as fatty or fibrous lesions were considered as missing data. In a second analysis, we performed a sensitivity analysis in which fatty echostructures were graded 1 (minimal change) and fibrous echostructures were graded 3 (severe change) in the new scoring system.
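The two recoding schemes can be sketched as below. The string labels `"fatty"` and `"fibrous"` and the function names are assumptions made for illustration; the mapping itself (missing in the primary analysis, grades 1 and 3 in the sensitivity analysis) follows the text.

```python
from typing import Optional, Union

Reading = Union[int, str]  # a grade 0-3, or the qualitative label "fatty"/"fibrous"

# Sensitivity analysis mapping stated in the text.
SENSITIVITY_GRADES = {"fatty": 1, "fibrous": 3}


def _check_grade(reading: Reading) -> int:
    """Validate a semiquantitative grade (0-3) and return it as an int."""
    if isinstance(reading, int) and 0 <= reading <= 3:
        return reading
    raise ValueError(f"unexpected reading: {reading!r}")


def recode_primary(reading: Reading) -> Optional[int]:
    """Primary analysis: fatty/fibrous readings are treated as missing (None)."""
    if reading in SENSITIVITY_GRADES:
        return None
    return _check_grade(reading)


def recode_sensitivity(reading: Reading) -> int:
    """Sensitivity analysis: fatty is graded 1 and fibrous is graded 3."""
    if reading in SENSITIVITY_GRADES:
        return SENSITIVITY_GRADES[reading]
    return _check_grade(reading)


if __name__ == "__main__":
    readings = [0, "fatty", 2, "fibrous", 3]  # invented example readings
    print([recode_primary(r) for r in readings])
    print([recode_sensitivity(r) for r in readings])
```

Clips recoded to `None` would have to be excluded pairwise before any kappa computation; how the authors handled this in practice is not described in the text.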
All statistical analyses were carried out using R language V.3.2.0 (R Foundation for Statistical Computing, Vienna, Austria; https://www.r-project.org).
Reaching consensus on definitions
The 25 experts reached a consensus on definitions after three Delphi rounds. Of the initial 38 candidate statements in the three sections, 28 were finally accepted by consensus. Of the 25 participants, 22 (88%) responded to the first Delphi questionnaire. All these 22 participants responded to the second and third Delphi questionnaires. The first Delphi questionnaire was composed of 38 statements divided among the three sections. The participants disagreed on 19 statements pertaining to the following: in section 1, PG assessment and vascularisation, definition and assessment of normal SMG echostructure and vascularisation, and SLG assessment; in section 2, presence of hyperechoic bands in the PG and SMG parenchyma; and in section 3, the scoring system and quantitative measurements of each gland. These 19 statements were rephrased for the second Delphi round. The participants then disagreed on nine statements, which pertained to elementary SGUS lesions in section 2 and to the scoring system in section 3. A consensus was reached during this second round on definitions for and the scanning procedure of normal SG. Most of the experts (75%) agreed that the SLGs were too small for a reliable assessment. Table 1 lists the final definitions and the standardised scanning procedure in section 1.
For the third Delphi round, several statements on SLG evaluation, the scoring system and measurement of each gland were rewritten. At the end of the third round, a consensus was achieved for elementary lesions (section 2, 95.4% agreement) and the scoring system (section 3, 79.1% agreement). In section 2, PG and SMG abnormalities in pSS were defined as focal or diffuse anechoic/hypoechoic foci (95.4% agreement). No consensus was reached as to the meaning of hyperechoic bands as an elementary lesion (70.8% agreement). However, the experts agreed that the SGs may be difficult to distinguish from the adjacent soft tissues in some patients with SG abnormalities, raising challenges in determining the score (79.1% agreement). In section 3, a novel four-grade semiquantitative scoring system for the PGs and SMGs in patients with pSS was defined as follows: grade 0, normal parenchyma; grade 1, minimal change: mild inhomogeneity without anechoic/hypoechoic areas; grade 2, moderate change: moderate inhomogeneity with focal anechoic/hypoechoic areas; grade 3, severe change: diffuse inhomogeneity with anechoic/hypoechoic areas occupying the entire gland surface (79.1% agreement).
Assessment of scoring system reliability using video clips
We included 199 high-quality video clips, 104 of them showing the PGs and 95 the SMGs, respectively. Of the 25 experts, 18 read all 199 video clips within the scheduled time frame. Online supplementary file 1 reports their responses. Of the video clips, 33% were graded 0, 20% were graded 1, 23% were graded 2 and 17% were graded 3. Overall, only 3% of video clips were graded as fatty and 4% as fibrous; these percentages ranged across experts from 0% to 10% and from 0% to 15%, respectively. Online supplementary file 2 reports the grades for the PGs and online supplementary file 3 for the SMGs. Of the PG video clips, 37% were graded 0, 18% were graded 1, 19% were graded 2 and 19% were graded 3; 3% were graded as fatty and 3% as fibrous with ranges across experts of 0%–12% and 0%–12%, respectively. Of the SMG video clips, 29% were graded 0, 23% were graded 1, 27% were graded 2 and 16% were graded 3; 2% were graded as fatty and 4% as fibrous, with ranges across experts of 0%–10% and 0%–18%, respectively. Figure 1 shows the four-grade semiquantitative scoring system with the two qualitative items fatty and fibrous.
The primary analysis showed excellent intrareader reliability for the PGs and for the SMGs (table 2). Reliability was similar when both glands were combined (Light’s kappa 0.81; range 0.6–0.97; 95% CI 0.77 to 0.84). The sensitivity analysis in which fatty glands were graded 1 and fibrous glands were graded 3 showed a lower reliability (Light’s kappa 0.79; range 0.6–0.98; 95% CI 0.76 to 0.83).
In the primary analysis, inter-reader reliability was good for the PGs and SMGs separately (table 3). Similar reliability results were obtained when both glands were combined (Light’s kappa 0.66; range 0.35–0.85; 95% CI 0.61 to 0.70). Reliability was lower in the sensitivity analysis (Light’s kappa 0.62; range 0.29–0.80; 95% CI 0.57 to 0.66).
In this first phase of a multiphase process to develop US as an OMI for scoring SG lesions in patients with pSS, the OMERACT task force on Sjögren’s syndrome developed consensual definitions through a Delphi process and subsequently tested their reliability on a web-based video clip platform showing dynamic video images of SG. In the Delphi survey, existing and novel criteria were rated by a broad panel of international experts. An agreement on defining SGUS lesions was reached after three rounds of the Delphi process and these definitions were finally used to develop a new semiquantitative scoring system for the SGUS assessment of the PGs and SMGs. Intrareader reliability was excellent and inter-reader reliability was good. The experts agreed that the SLGs should not be evaluated by SGUS. To date, there are not enough US and MRI data concerning the morphology and size of these small, challenging glands.22
Previous attempts to reach definitions of SG echostructural lesions were made by the EULAR-US pSS group in 2012. This proved to be difficult, as reliability results showed wide variability. Indeed, the reliability was tested on seven elementary lesions on static and acquisition images of patients with pSS, and not on a comprehensive scoring. At the time, an assessment of patients with pSS by five experts demonstrated that, for homogeneity of the gland, the inter-reader reliability was moderate for the PG and fair for the SMG.23 One of the main goals of the OMERACT SGUS subtask force was to develop a standardised SGUS scanning procedure. Training in the implementation of a standardised procedure is instrumental when striving for robust intrareader and inter-reader reliabilities. Studies of SGUS reliability that relied instead on static and acquisition images have produced variable results.23–26 Some studies involved the separate scoring of the items defined by Hocevar et al (echogenicity, homogeneity, hyperechoic bands, posterior border, presence or absence of calcifications and gland size).27–29 Only the homogeneity item showed good inter-reader reliability.30 Finally, a strong correlation has been demonstrated between homogeneity and the presence of anechoic/hypoechoic foci, indicating that these two items provide similar information.26 This fact should be taken into account when developing a semiquantitative scoring system.31 32
Fatty replacement and fibrosis are not considered in any of the current SGUS scoring systems. Fatty replacement of SGs is seen in healthy elderly individuals and in a small minority of patients with Sjögren’s syndrome. SG fibrosis is common in end-stage pSS and sometimes seen in early-stage pSS. Both fatty replacement and fibrosis were scored in the sensitivity analysis. Intrareader reliability was excellent and inter-reader reliability was good for both items in both the primary and the sensitivity analyses. Our results suggest that both items should be considered for SGUS scoring when the semiquantitative scoring system cannot be applied.
Reliability was lower for the SMGs than for the PGs. This fact may be related to the difference in parenchymal echostructure between these two glands.22 A recent study demonstrated no correlation between findings at the PGs and at the SMGs, whereas correlations were strong between the right and left PGs and between the right and left SMGs.26 Instead of scoring either the PGs or the SMGs we therefore recommend scoring at least one PG and at least one SMG for anechoic/hypoechoic foci, hyperechoic (fibrous) bands (grade 3) and fatty replacement (grade 1) as part of an overall semiquantitative scoring system.26
The main study limitation was that we did not include fatty and fibrous gland statements (ie, based on qualitative grading) in our Delphi survey section on the US-SG scoring system (ie, the current consensus is based only on semiquantitative grading). Only later, after the expert reading of the videos, did we decide to include fatty and fibrous glands in our US-SG scoring system. We believe that, with simplified scoring (only accounting for anechoic/hypoechoic areas), other important data such as the fatty and fibrous characteristics of the gland could be omitted. In other words, our study limitation highlights the weakness of the current US-SG scoring system and the need to update it into a comprehensive scoring that combines all quantitative and qualitative characteristics of SG.
In conclusion, an international expert consensus was reached using OMERACT methodology for the definitions of normal US appearance and abnormalities seen in patients with suspected or confirmed pSS. In a next step, these definitions were tested in video clips of patients with pSS and non-pSS sicca patients on the PGs and SMGs alone, excluding the SLGs. The sharing of video clips via the internet proved to be a simple and feasible method for having a large number of experts evaluate the reliability of an SGUS scoring system. This study constitutes the first step towards a novel, reliable, semiquantitative SGUS scoring system available to clinicians and sonographers assessing the SGs of patients with suspected or confirmed pSS.
The authors thank François Madec and Jacques Bretagnolle for their contribution to the preparation of video analysis and for technical assistance, Emmanuel Nowak, head of Brest CHRU Data Management Unit as well as Antoinette Wolfe for English and Zarrin Alavi for critical reviewing of the manuscript.
Handling editor Josef S Smolen
Presented at The results of this paper were presented as an oral presentation at the EULAR Congress (Amsterdam 2018).
Correction notice This article has been corrected since it was published Online First. The affiliation for the last author has been corrected.
Contributors Conception and design: SJJ, MADA, GAWB, EN, SO, MB, GT, ICV, LT, AI, PC, CHD, FG, WAS, GF, CD, MHS, MAM, AH, SC, GEM, JJA, RT, DKMC, SF, PH, AZ, CG, DSH. Statistical analysis: CN, FG, MADA. Wrote the manuscript: SJJ, MADA, GAWB, ZA. Interpretation of results: SJJ, MADA, GAWB, ZA.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Patient consent for publication Not required.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Data will be available upon reasonable request from the corresponding author.