
The 2019 American College of Rheumatology/European League Against Rheumatism classification criteria for IgG4-related disease
  Zachary S Wallace1,
  Ray P Naden2,
  Suresh Chari3,
  Hyon K Choi4,
  Emanuel Della-Torre5,
  Jean-Francois Dicaire6,
  Phillip A Hart3,
  Dai Inoue7,
  Mitsuhiro Kawano8,
  Arezou Khosroshahi9,
  Marco Lanzillotta10,
  Kazuichi Okazaki11,
  Cory A Perugino12,
  Amita Sharma13,
  Takako Saeki14,
  Nicolas Schleinitz15,
  Naoki Takahashi16,
  Hisanori Umehara17,
  Yoh Zen18,
  John H Stone19,
  Members of the ACR/EULAR IgG4-RD Classification Criteria Working Group

    1 Rheumatology Unit, Massachusetts General Hospital, Boston, Massachusetts, USA
    2 Maternal-Fetal Medicine, McMaster University Faculty of Health Sciences, Hamilton, Ontario, Canada
    3 Gastroenterology & Hepatology, Mayo Clinic, Rochester, Minnesota, USA
    4 Rheumatology, Harvard Medical School, Boston, Massachusetts, USA
    5 Cancer Center, Massachusetts General Hospital, Boston, Massachusetts, USA
    6 Pinnacle Inc, Montreal, Ontario, Canada
    7 Department of Radiology, Kanazawa University, Kanazawa, Japan
    8 Division of Rheumatology, Department of Internal Medicine, Kanazawa University Hospital, Kanazawa, Japan
    9 Rheumatology, Emory University School of Medicine, Atlanta, Georgia, USA
    10 Università Vita-Salute San Raffaele, School of Medicine, Unit of Internal Medicine, Ospedale San Raffaele, Milano, Lombardia, Italy
    11 Third Department of Internal Medicine, Division of Gastroenterology and Hepatology, Kansai Medical University, Osaka, Japan
    12 Rheumatology, MGH, Boston, Massachusetts, USA
    13 Radiology, Massachusetts General Hospital, Boston, Massachusetts, USA
    14 Department of Internal Medicine, Nagaoka Red Cross Hospital, Nagaoka, Japan
    15 Internal Medicine, Groupe hospitalier Timone, Assistance publique-Hôpitaux de Marseille, Aix-Marseille Université, Marseille, France
    16 Department of Radiology, Mayo Clinic, Rochester, Minnesota, USA
    17 Rheumatology and Immunology, Shiritsu Nagahama Byoin, Nagahama, Shiga, Japan
    18 Diagnostic Pathology, Kobe University, Kobe, Japan
    19 Massachusetts General Hospital Rheumatology Unit, Harvard Medical School, Boston, Massachusetts, USA

    Correspondence to Dr John H Stone, Massachusetts General Hospital Rheumatology Unit, Harvard Medical School, Boston, MA 02114, USA; jhstone{at}mgh.harvard.edu

    Abstract

    IgG4-related disease (IgG4-RD) can cause fibroinflammatory lesions in nearly any organ. Correlation among clinical, serological, radiological and pathological data is required for diagnosis. This work was undertaken to develop and validate an international set of classification criteria for IgG4-RD. An international multispecialty group of 86 physicians was assembled by the American College of Rheumatology (ACR) and the European League Against Rheumatism (EULAR). Investigators used consensus exercises; existing literature; derivation and validation cohorts of 1879 subjects (1086 cases, 793 mimickers); and multicriterion decision analysis to identify, weight and test potential classification criteria. Two independent validation cohorts were included. A three-step classification process was developed. First, it must be demonstrated that a potential IgG4-RD case has involvement of at least one of 11 possible organs in a manner consistent with IgG4-RD. Second, exclusion criteria consisting of a total of 32 clinical, serological, radiological and pathological items must be applied; the presence of any of these criteria eliminates the patient from IgG4-RD classification. Third, eight weighted inclusion criteria domains, addressing clinical findings, serological results, radiological assessments and pathological interpretations, are applied. In the first validation cohort, a threshold of 20 points had a specificity of 99.2% (95% CI 97.2% to 99.8%) and a sensitivity of 85.5% (95% CI 81.9% to 88.5%). In the second, the specificity was 97.8% (95% CI 93.7% to 99.2%) and the sensitivity was 82.0% (95% CI 77.0% to 86.1%). The criteria were shown to have robust test characteristics over a wide range of thresholds. ACR/EULAR classification criteria for IgG4-RD have been developed and validated in a large cohort of patients. These criteria demonstrate excellent test performance and should contribute substantially to future clinical, epidemiological and basic science investigations.

    • rheumatoid arthritis
    • Sjøgren's syndrome
    • inflammation

    This criteria set has been approved by the European League Against Rheumatism (EULAR) Executive Committee and the American College of Rheumatology (ACR) Board of Directors. This signifies that the criteria set has been quantitatively validated using patient data, and it has undergone validation based on an independent dataset. All ACR/EULAR-approved criteria sets are expected to undergo intermittent updates. The ACR is an independent, professional, medical and scientific society that does not guarantee, warrant or endorse any commercial product or service.

    Introduction

    IgG4-related disease (IgG4-RD) is an immune-mediated condition associated with fibroinflammatory lesions that can occur at nearly any anatomic site.1 2 It often presents as a multiorgan disease and may be confused with malignancy, infection or other immune-mediated conditions, such as Sjögren’s syndrome or vasculitis associated with antineutrophil cytoplasmic antibodies (ANCAs). Rheumatologists, internists, gastroenterologists, nephrologists, pulmonologists, neurologists, radiologists, pathologists and other practitioners are often involved in the evaluation of patients with this condition. IgG4-RD can lead to organ dysfunction, organ failure and death. Its epidemiology remains poorly described because of its relatively recent recognition as a discrete condition, yet the disease is now seen by both generalists and specialists across the world.

    IgG4-RD was first recognised as a distinct disease in 2003.3 4 Over the next decade, it became clear that although the disease could affect virtually any organ, there are strong predilections for certain organs.1 5 These include the major salivary glands (submandibular, parotid, sublingual), the orbits and lacrimal glands, the pancreas and biliary tree, the lungs, the kidneys, the aorta and retroperitoneum, the meninges and the thyroid gland (Riedel’s thyroiditis).6–8 Many of the early diagnoses of IgG4-RD relied on the pathological assessment of surgical resection specimens.9 These discoveries were often incidental findings made following resections of lesions with suspected malignancy. The large pathological samples available from such procedures generally permitted identification of a full range of findings considered characteristic of IgG4-RD: a lymphoplasmacytic infiltrate, storiform fibrosis, obliterative phlebitis and dramatic IgG4+ plasma cell infiltrates, among others.9 With growing recognition of this condition, however, the diagnosis is now made using increasingly small biopsy samples that frequently do not demonstrate the full spectrum of pathological findings.7 9 10 In a subset of patients with classic combinations of clinical, serological or radiological findings, clinical diagnoses are sometimes made in the absence of biopsy, but the threshold to perform biopsies of accessible sites when there is significant concern about malignancy or infection remains appropriately low.

    Other cases diagnosed early in the course of IgG4-RD were identified because of striking elevations in serum IgG4 concentrations.4 However, it is now recognised that serum IgG4 levels are normal in a substantial percentage of patients with clinicopathological diagnoses of IgG4-RD.6 11 12 Although serum IgG4 concentrations can provide an important clue to the diagnosis and some guidance in the longitudinal assessment of disease activity, the centrality of IgG4 in the overall pathophysiology of this condition has been called into question.13 The presence of an elevated serum IgG4 level is no longer considered essential to the diagnosis of IgG4-RD. Indeed, certain organ systems and anatomic regions (eg, the retroperitoneum) are less likely to be associated with a serum IgG4 elevation than others.6

    Finally, the radiological features of IgG4-RD have also been described with increasing thoroughness. Radiological findings such as a sausage-shaped pancreas and periaortitis affecting the infrarenal aorta are now viewed as being strongly suggestive of IgG4-RD if detected in the proper clinical context.14 15 Nevertheless, radiological findings in isolation—without reference to clinical, serological or pathological data—are never sufficient for either clinical diagnosis or appropriate disease classification.

    In short, although clinical, serological, radiological and pathological features all contribute to the classification of IgG4-RD, none of these approaches alone provides definitive evidence for the accurate classification of patients. The proper categorisation of patients for both research studies and clinical purposes relies on integration of data from all four domains of evidence. Given the recent recognition of IgG4-RD as a distinct condition, along with its multiorgan nature and the absence of a single diagnostic feature, classification criteria are now needed for the conduct of high-quality clinical and epidemiological investigations in this disease.

    Methods

    This study was approved by the Partners HealthCare Institutional Review Board.

    Study overview

    The development and testing of the classification criteria for IgG4-RD was based on consensus-based and data-driven methods using prospectively collected data and decision analytics.16–19

    Investigators

    A Steering Committee composed of investigators from North America, Europe and Asia was established. The Steering Committee directed the entire project and invited other investigators who were assigned to specific Advisory Groups addressing clinical, serological, radiological and pathological issues. In addition to members of the Steering Committee and the Advisory Groups, other investigators were invited to participate by submitting cases of IgG4-RD and of mimicking conditions to be used in the development and testing phases of the study. This full group of investigators is known as the American College of Rheumatology (ACR)/European League Against Rheumatism (EULAR) IgG4-RD Classification Criteria Working Group (Appendix A).

    Item generation

    Each Advisory Group consisted of a Steering Committee member and experts in the field being addressed by the specific Advisory Group. The Advisory Groups were tasked with using evidence-based and consensus-based approaches to identify items that might be relevant to the classification of patients as having or not having IgG4-RD. These items comprised preliminary exclusion criteria and preliminary inclusion criteria. Preliminary exclusion criteria were defined as items that would lead to termination of consideration of the patient as an IgG4-RD case. In contrast, preliminary inclusion criteria could either increase or decrease the likelihood of classification of the patient as an IgG4-RD case. Preliminary inclusion criteria that demonstrated discriminatory ability to increase the likelihood of classification were later selected as inclusion criteria. A 24-member Steering Committee of the ACR/EULAR IgG4-RD Classification Criteria Development Group met in Boston in April 2016 to begin this process. At this initial Steering Committee meeting, 104 rounds of consensus-based decision-making were conducted. Consensus was achieved for 79 (76%) of these decisions, the process of which is described below. Item generation and the subsequent task of item reduction were continued through teleconferences and e-mail discussions.

    Process of consensus

    The rules regarding consensus were set out at the time of the first face-to-face meeting. Consensus was considered to have been reached when 80% of the members of the Steering Committee were in agreement on a given point. Discussion was permitted following achievement of the 80% threshold, however, if individuals in the minority wished to express the rationale behind their opinion. During discussions, evidence was presented by participants to support arguments. Discussants referred to the medical literature when relevant to illuminate a particular question. In some instances, in the setting of a persuasive argument by a member of the minority, discussion led to re-voting and occasionally to a change in the ultimate decision on a particular point.

    Item reduction

    Following item generation, the Steering Committee participated in two exercises to reduce the number of items. First, the Committee reviewed all proposed inclusion and exclusion criteria and reduced the potential criteria into 8–10 domains through the consensus process described above. Related items were clustered within domains that were independent of the other domains; for preliminary inclusion criteria, items contributed positive or negative weights toward classifying cases as IgG4-RD. For instance, biopsy immunohistochemistry results (eg, IgG4+ plasma cells/high-power field (hpf) and IgG4+ IgG+ plasma cells/hpf) were listed under an immunohistochemistry domain. Within each preliminary inclusion criteria domain, items were arranged by group members according to the degree to which they either increased or decreased the likelihood of classification as IgG4-RD (eg, an infiltrate of ≥40 IgG4+ plasma cells/hpf was positioned above an infiltrate of 0–9 IgG4+ plasma cells/hpf). Definitions for each item were determined such that cases could be assigned clearly to only one item in a domain.

    The Steering Committee then ranked each potential preliminary inclusion criteria item on a Likert scale from −5 (‘Highly confident the patient does not have IgG4-RD if this item is present’) to +5 (‘Highly confident the patient has IgG4-RD if this item is present’). Items associated with an average confidence between −2.0 and +2.0 were deemed to have insufficient sensitivity or specificity and were excluded from further consideration.
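The screening rule above can be sketched in a few lines of Python. The item names and ratings are hypothetical and serve only to illustrate the rule; the sketch also assumes that an average of exactly −2.0 or +2.0 counts as "between", which the text does not specify.

```python
from statistics import mean

# Hypothetical Likert ratings (-5..+5) from five raters; not data from the study.
ratings = {
    "item_a": [4, 5, 3, 4, 5],       # mean  4.2 -> retained
    "item_b": [1, -1, 0, 2, 1],      # mean  0.6 -> dropped
    "item_c": [-4, -3, -5, -3, -4],  # mean -3.8 -> retained (negative weight)
}

def retained_items(ratings, cutoff=2.0):
    """Keep items whose mean rating lies outside [-cutoff, +cutoff]."""
    return [name for name, scores in ratings.items() if abs(mean(scores)) > cutoff]

kept = retained_items(ratings)
print(kept)
```

Items with a strongly negative average are retained as well as strongly positive ones, since both carry information about classification.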

    Derivation case collection

    Investigators were invited to submit cases of IgG4-RD or mimicking conditions that they had managed and to report the presence or absence of each preliminary item for each submitted case using standardised data collection forms. No identifying data on these patients were collected. Investigators were encouraged to submit data on a broad range of IgG4-RD cases, including cases in which they were highly confident in the diagnosis as well as those in which they were less confident. The investigator submitting the case proposed the initial classification of the case as IgG4-RD or as a mimicker of IgG4-RD. This initial classification of all cases was reviewed by a subset of the Steering Committee to confirm the appropriateness of the initial designation. Cases that appeared to be inappropriately classified by the investigator, and cases with insufficient information on which to base a classification decision, were discarded.

    Approach to assigning relative weights to inclusion criteria items

    Twenty of the submitted cases representing a combination of IgG4-RD and mimickers were selected for a Steering Committee exercise designed for two purposes. First, the exercise was used to assign preliminary weights to the inclusion criteria. Second, it fostered discussion and facilitated consensus on the definitions of individual items. Only cases who did not fulfil any of the exclusion criteria were selected for this exercise. The cases selected represented a broad range of manifestations to assess the performance of all potential criteria. Investigators were asked to rank all cases in order from most likely to least likely to be classified as IgG4-RD. In addition, investigators were asked to indicate the point at which they would divide the cases into those that should be classified as IgG4-RD and those that were more likely to be mimickers.

    The draft IgG4-RD classification criteria consisted of 8 domains and a total of 29 items. Once preliminary domains and items had been selected, the Steering Committee met in person for a 2-day session employing decision science theory and computer adaptive technology. A computer software program known as 1000minds (http://www.1000minds.com) was used. Investigators participated in a series of discrete, forced-choice experiments through pairwise rankings of alternatives that led to quantified weights for each item.20–22 During this exercise, investigators were presented with a series of paired scenarios (A and B), each of which contained the same two domains (eg, serum IgG4 concentrations and salivary gland disease). Different combinations of the domains’ items were grouped together in each scenario.

    For each paired scenario choice, investigators selected the scenario they believed to contribute more toward the classification of the patient as having IgG4-RD, assuming that all other aspects of the case were the same. The distribution of votes (per cent who voted for A, B or ‘equal probability’) was presented for each pair of scenarios after each vote. Discussions and re-voting were pursued when necessary, using the same process of consensus described above. Consensus was considered to have been achieved when all participants either indicated complete agreement as to which scenario represented a higher probability of IgG4-RD or indicated that they could accept the majority opinion. During this phase of classification criteria development, 160 rounds of consensus-based decision-making were conducted. Based on this voting, the computer software assigned relative weights to each item. The specific weights assigned to each item were not revealed to investigators.

    Scoring of weighted items

    If >1 item was present within a given domain, only the highest weighted item was scored. As an example from the Chest domain, if a patient had peribronchovascular and septal thickening evident on CT of the chest (weighted 4 points) as well as a paravertebral band-like soft tissue mass in the thorax (weighted 10 points), only the weight of the paravertebral band-like soft tissue mass in the thorax would count in the patient’s total classification criteria score.
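The scoring rule can be expressed as a short Python sketch. The Chest-domain weights (4 and 10) come from the example above; the second domain and its weight are hypothetical and are included only to show the summation across domains.

```python
def total_score(findings):
    """Sum the single highest weight within each domain.

    `findings` maps a domain name to the weights of the items present in it.
    Domains with no items present contribute nothing.
    """
    return sum(max(weights) for weights in findings.values() if weights)

# Chest domain: both a 4-point and a 10-point item are present,
# so only the 10 counts. "hypothetical_domain" is illustrative only.
case = {
    "chest": [4, 10],
    "hypothetical_domain": [6],
}
score = total_score(case)
print(score)  # 16
```

Taking the maximum rather than the sum within a domain prevents related findings in the same organ system from being counted twice.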

    Identifying a threshold for classifying IgG4-RD

    Each derivation case that was not removed by an exclusion criterion was assigned a total score based on the aggregation of weighted inclusion criteria present. These cases were ranked and a preliminary threshold was identified based on targets of >90% for specificity and >80% for sensitivity. Cases around the threshold were selected for discussion among the investigators, who reached consensus on a cut-off point between the group of patients who should be classified as having IgG4-RD and those who could not be confidently classified as having IgG4-RD. A preliminary threshold of 20 was selected by two of the investigators (RPN and JHS) after an in-person review of cases around this threshold revealed a common point at which cases were more likely to be classified by investigators as not clearly being IgG4-RD. This preliminary threshold was then tested in the first of 2 validation phases, using newly submitted cases of IgG4-RD and IgG4-RD mimickers. This preliminary threshold was not revealed to other investigators as the cases for the validation phase were collected.
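As an illustration of the data-driven part of this step, the sketch below sweeps candidate thresholds over ranked scores and lists those meeting the stated targets (specificity >90%, sensitivity >80%). The scores are hypothetical toy data, not study data, and the sketch does not capture the in-person case review that informed the final choice of 20.

```python
def sens_spec(cases, threshold):
    """Return (sensitivity, specificity), treating score >= threshold as classified."""
    tp = sum(1 for score, is_case in cases if is_case and score >= threshold)
    fn = sum(1 for score, is_case in cases if is_case and score < threshold)
    tn = sum(1 for score, is_case in cases if not is_case and score < threshold)
    fp = sum(1 for score, is_case in cases if not is_case and score >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical total scores: ten IgG4-RD cases (True) and ten mimickers (False).
toy = (
    [(s, True) for s in (12, 20, 21, 22, 24, 26, 28, 31, 35, 40)]
    + [(s, False) for s in (0, 3, 4, 5, 6, 8, 9, 12, 14, 18)]
)

acceptable = []
for threshold in range(10, 31):
    sens, spec = sens_spec(toy, threshold)
    if spec > 0.90 and sens > 0.80:
        acceptable.append(threshold)

print(acceptable)  # [19, 20]
```

In practice several thresholds may satisfy both targets, which is why cases near the candidate cut-off were reviewed by the investigators before one was fixed.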

    Collection of IgG4-RD cases and mimickers for the first validation phase

    Investigators were invited to submit a second set of data from cases of IgG4-RD or mimicking conditions. None of the cases in this second set had been included in the derivation set. The investigators reported the presence or absence of each finalised item using standardised data collection forms. For each case, investigators reported their confidence in the diagnosis on a scale of 0–3 in which 0=uncertain, 1=slightly confident, 2=confident and 3=very confident.

    Testing of the IgG4-RD classification criteria and other statistical analyses

    We evaluated the performance of the preliminary classification criteria among those cases that fulfilled the entry criteria. To determine the test performance, we only analysed cases in which investigators were ‘confident’ or ‘very confident’ in the diagnosis (IgG4-RD or mimicker); thus, a ‘confident’ or ‘very confident’ diagnosis was considered the gold standard for the purpose of assessing test performance. The number of patients with ‘confident’ or ‘very confident’ designations as either IgG4-RD cases or IgG4-RD mimickers was 771, or 85% of all the patients included in the first validation phase.

    We assessed the test performance of the classification criteria at the preliminary threshold of 20 as well as at a range of thresholds above and below 20. To determine the optimal threshold, we considered the goal of our classification criteria for use in clinical trials (specificity >90% and sensitivity >80%). We also considered other measures such as area under the curve (AUC),23 Youden’s criteria,24 distance from (0,1) on a receiver operating characteristic curve (ROC), difference between sensitivity and specificity and the diagnostic OR (positive likelihood ratio/negative likelihood ratio).25
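Most of these measures follow directly from a 2×2 table, as the Python sketch below shows. The counts are our reading of the first validation cohort figures given in the Results (428 cases, of which 62 did not fulfil the criteria; 260 mimickers meeting entry criteria, of which 258 did not fulfil them), and they reproduce the reported sensitivity, specificity and diagnostic OR at a threshold of 20. The AUC is omitted because it requires per-case scores, and the key names in the returned dictionary are ours.

```python
from math import sqrt

def threshold_metrics(tp, fn, fp, tn):
    """Summary measures for one threshold (assumes fp > 0 and fn > 0)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lr_pos = sens / (1 - spec)   # positive likelihood ratio
    lr_neg = (1 - sens) / spec   # negative likelihood ratio
    return {
        "sensitivity": sens,
        "specificity": spec,
        "youden": sens + spec - 1,
        "distance_to_0_1": sqrt((1 - sens) ** 2 + (1 - spec) ** 2),
        "sens_minus_spec": sens - spec,
        "diagnostic_or": lr_pos / lr_neg,
    }

m = threshold_metrics(tp=366, fn=62, fp=2, tn=258)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The diagnostic OR is equivalent to (TP × TN)/(FP × FN), here (366 × 258)/(2 × 62) ≈ 761.5, so it is very sensitive to small false-positive counts.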

    Sensitivity analyses

    We performed several sensitivity analyses to test the performance of the criteria. These sensitivity analyses included the following considerations: (1) if all cases, regardless of confidence level, were included; (2) if all the exclusion criteria were removed; (3) if information on serum IgG4 concentrations was not available; (4) if biopsies were not available and (5) if the mimickers without data on serum IgG4 concentrations or biopsies were assumed to have the highest values for each item. Chi-square tests, Fisher’s exact tests, t-tests and Wilcoxon rank-sum tests were used to compare subgroups, as appropriate.

    Testing the final threshold in a second validation cohort

    Investigators were invited to submit another set of data from cases of IgG4-RD or mimicking conditions that they had managed but had not yet contributed to the previous derivation or validation cohorts. This second validation cohort was collected because minor changes in some of the definitions of inclusion and exclusion criteria had been made after the derivation set of patients had been collected, in the interest of clarifying definitions for investigators. However, the definitions of inclusion criteria and exclusion criteria used in the two validation cohorts were exactly the same. Using the same approach as above, we assessed the performance of the classification criteria at the identified threshold of 20. We used all cases and mimickers for whom the diagnosis was considered ‘confident’ or ‘very confident’ by the investigator as the gold standard (n=402 (83%)).

    Results

    Research group

    The Steering Committee consisted of investigators from North America, Europe and Asia. There were three Advisory Groups: clinical and serological, radiological and pathological. A total of 86 investigators submitted cases for the derivation and/or validation sets.

    Item generation and reduction

    At the conclusion of item generation, definitions for the entry criteria, exclusion criteria and inclusion criteria were established. The entry criteria were defined as (1) characteristic clinical or radiological involvement of a typical organ (eg, pancreas, bile ducts, orbits, lacrimal glands, major salivary glands, retroperitoneum, kidney, aorta, pachymeninges or thyroid gland (Riedel’s thyroiditis)) or (2) pathological evidence of an inflammatory process accompanied by a lymphoplasmacytic infiltrate of uncertain aetiology in one of these same organs. ‘Characteristic’ involvement generally refers to enlargement of the organ or a tumour-like mass within an affected organ. It also includes three organ-specific features, with reference to (1) the bile ducts, where narrowing tends to occur, (2) the aorta, where wall thickening or aneurysmal dilatation is typical and (3) the lungs, where thickening of the bronchovascular bundles is common.

    Online supplementary tables 1 and 2 list the preliminary exclusion criteria and the preliminary inclusion criteria, respectively. There was initially a total of 78 such criteria (51 preliminary exclusion criteria and 27 preliminary inclusion criteria). The preliminary exclusion criteria and preliminary inclusion criteria demonstrating the highest discrimination of IgG4-RD from disease mimickers were chosen as draft classification criteria. Complete definitions of the exclusion criteria and the inclusion criteria are shown in tables 1 and 2 respectively. Following the consensus exercises and the Likert scale rating of the preliminary inclusion criteria, refined lists of exclusion and positive and negative inclusion criteria were created (online supplementary table 2).

    Table 1

    Exclusion criteria definitions

    Table 2

    Inclusion criteria definitions

    Derivation and validation cohorts

    Table 3 describes the derivation cohort and the first and second validation cohorts used to develop and assess the performance of the classification criteria. A total of 1879 patients were included in the overall IgG4-RD classification criteria effort, including 486 in the derivation cohort (272 IgG4-RD cases, 214 mimickers), 908 in the first validation cohort (493 cases, 415 mimickers) and 485 in the second validation cohort (321 cases, 164 mimickers). The patients’ status as a case or mimicker, proposed by the submitting investigator, was confirmed by the members of the Steering Committee. In both the derivation and validation cohorts, the majority of cases were male patients and typically in their sixth decade of life, consistent with the demographics of IgG4-RD and many of its mimicking conditions.

    Table 3

    Demographic and disease characteristics of the derivation and validation cohorts*

    Classification criteria

    The derivation cohort was used to assess the relative performance of each proposed exclusion and inclusion criterion (table 4). The exclusion criteria are not designed to be a ‘laundry list’ of evaluations that must be checked off as negative before a patient can be classified as having IgG4-RD. Rather, they serve as a reminder to the investigator of evaluations that might be appropriate to consider in specific clinical scenarios.

    Table 4

    The 2019 American College of Rheumatology/European League Against Rheumatism classification criteria for IgG4-RD

    Criteria that did not distinguish IgG4-RD cases from mimickers were eliminated and those that helped distinguish IgG4-RD cases from mimickers were retained. The final entry criteria and items were modified through in-person discussion after completion of the 1000minds program and review of the derivation cases (n=486) ranked in order of points accrued by totaling the weights associated with each inclusion criteria item after cases fulfilling exclusion criteria had been excluded. A preliminary score of 20 was identified as the cut-off point at or above which the majority of investigators considered the patient to have IgG4-RD; with this threshold, a sensitivity of >80% and high specificity were also achieved.
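The overall three-step process (entry criteria, exclusion criteria, weighted inclusion criteria with a cut-off of 20) can be summarised as a minimal Python sketch. The class and field names are hypothetical; the individual criteria and weights are defined in tables 1, 2 and 4 and are not reproduced here.

```python
from dataclasses import dataclass

THRESHOLD = 20  # cut-off reported in the text (score at or above)

@dataclass
class Case:
    meets_entry_criteria: bool     # characteristic involvement of a typical organ
    exclusion_items_present: int   # number of exclusion criteria fulfilled
    inclusion_points: int          # sum of the highest-weighted item per domain

def classify(case: Case) -> str:
    """Apply the three steps in order; stop at the first step that fails."""
    if not case.meets_entry_criteria:
        return "not classified: entry criteria not met"
    if case.exclusion_items_present > 0:
        return "not classified: exclusion criterion present"
    if case.inclusion_points >= THRESHOLD:
        return "classified as IgG4-RD"
    return "not classified: insufficient inclusion points"

print(classify(Case(True, 0, 24)))
print(classify(Case(True, 1, 40)))
print(classify(Case(True, 0, 19)))
```

The ordering matters: a case that fulfils any exclusion criterion is not classified regardless of how many inclusion points it would accrue.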

    Validating the classification criteria

    We then tested the performance of the classification criteria in the first validation cohort (n=908). To determine the optimal cut-off, we assessed the test performance of criteria at various thresholds (table 5). Given that the purpose of the criteria was to identify patients with IgG4-RD for enrolment in research studies, the ideal threshold would have excellent specificity while retaining good sensitivity (>80%). The preliminary threshold of 20 had a specificity of 99.2% (95% CI 97.2% to 99.8%) and a sensitivity of 85.5% (95% CI 81.9% to 88.5%). Moreover, the threshold of 20 had excellent discrimination, with an AUC of 0.924 (95% CI 0.906 to 0.941). A threshold of either 21 or 22 had a specificity identical to that obtained with the threshold of 20, but sensitivity decreased at those thresholds, as reflected in other measures of threshold performance, including the AUC. A threshold of 20 also had the highest diagnostic OR compared with other thresholds.

    Table 5

    Performance of various thresholds of the 2019 American College of Rheumatology/European League Against Rheumatism classification criteria for IgG4-related disease using validation cohort 1

    Because of the emphasis placed on specificity, we considered the test characteristics obtained with a threshold of 20 superior to those of other potential thresholds. Of note, however, a threshold of 16 performed better in certain measures, including sensitivity (88.6%), Youden’s criteria, distance from (0,1) on the ROC curve (0.12) and AUC (0.933 (95% CI 0.916 to 0.950)). The threshold of 16 was associated with a slightly lower specificity: 98.1% vs 99.2%. When comparing a threshold of 20 to a threshold of 16 with regard to the diagnostic OR, a threshold of 20 was associated with superior test performance (761.5 vs 394.5). The consistent performance of these classification criteria across a range of thresholds suggests that the criteria will be robust when used in the clinic for purposes of research.

    Analyses were then performed using the second validation cohort (n=485). In this group, the classification criteria had a specificity of 97.8% (95% CI 93.7% to 99.2%) and a sensitivity of 82.0% (95% CI 77.0% to 86.1%).
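The text does not state how the 95% CIs were computed. As an assumption, the Python sketch below uses the Wilson score interval for a binomial proportion, with counts for the first validation cohort read from the Results (258 of 260 mimickers not classified; 366 of 428 cases classified). Under that assumption the bounds appear to round to the intervals reported for the first validation cohort.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 for roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

spec_lo, spec_hi = wilson_interval(258, 260)   # specificity, first validation cohort
sens_lo, sens_hi = wilson_interval(366, 428)   # sensitivity, first validation cohort

print(f"specificity CI: {spec_lo:.3f} to {spec_hi:.3f}")
print(f"sensitivity CI: {sens_lo:.3f} to {sens_hi:.3f}")
```

Unlike the simple Wald interval, the Wilson interval stays within 0 to 1 and is asymmetric for proportions close to 1, which suits a specificity above 99%.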

    Sensitivity analyses with a threshold of 20

    We performed a number of sensitivity analyses to assess the robustness of the classification criteria at a threshold of 20 in the first validation cohort. If all cases, regardless of confidence in the diagnosis, were included, the classification criteria performed very well, with a sensitivity of 83% and a specificity of 98.9%. The IgG4-RD classification criteria are the first of their kind in any rheumatic disease to incorporate absolute exclusion criteria. In a sensitivity analysis that removed exclusion criteria from the classification algorithm, we found that the specificity of the criteria decreased from 99.2% to 89.2%, while the sensitivity increased from 85.5% to 90.0%. As is typical of clinical practice, serum IgG4 concentrations were not measured, or biopsies not performed, in some cases of IgG4-RD (3% and 15%, respectively) and mimickers (36% and 16%, respectively). When exclusion and inclusion criteria related to biopsy results or serum IgG4 concentrations were removed from the classification algorithm, the classification criteria maintained excellent specificity in both scenarios (98.9% when biopsy criteria were removed and 99.3% when serum IgG4 concentrations were removed). The sensitivity decreased substantially in the absence of pathological data or serum IgG4 concentrations to 48.6% and 75.0%, respectively. When we assumed the worst-case scenario in which all the mimickers without biopsy or serum IgG4 concentration data were assigned the highest weights for each (eg, IgG4 concentrations >5 times the upper limit of normal), the specificity of the classification criteria remained high (92.7%).

    Reasons for cases not achieving a classification of IgG4-RD

    Of the 428 and 267 IgG4-RD cases from the first and second validation cohorts used to test the classification criteria, 62 (14%) and 48 (18%), respectively, did not fulfil the classification criteria. In both the first and second validation cohorts, the majority of these false-negative cases (43 (69%) and 39 (81%), respectively) did not achieve sufficient inclusion criteria points (table 6), partly because they were less likely to have had biopsies compared with true-positive cases (65% vs 91% (p<0.001) and 73% vs 88% (p=0.007), respectively). Twenty false-negative cases in the first validation cohort (32%) and nine in the second validation cohort (19%) met at least one exclusion criterion. Of all the IgG4-RD cases submitted in the first and second validation cohorts, 24 (4.9%) and 42 (8.7%), respectively, did not meet the initial entry criterion (characteristic organ involvement). In addition, 23 (5%) and 13 (4%) of the submitted IgG4-RD cases in the first and second validation cohorts, respectively, fulfilled at least one exclusion criterion, most often a clinical or serological exclusion criterion (table 7).

    Table 6

    Comparison of differences in false-negative and true-positive IgG4-related disease cases from the validation cohorts*

    Table 7

    Percentage of validation cohort cases and mimickers fulfilling exclusion criteria*

    In the first validation cohort, 64 (20%) of 324 mimickers considered when deriving thresholds for the classification criteria did not meet entry criteria. Similarly, in the second validation cohort, 17 (10%) of the 164 mimickers did not meet entry criteria. Of those who met entry criteria in each validation cohort (260 and 147, respectively), 258 (99%) and 144 (98%), respectively, did not fulfil the classification criteria (true negatives). The majority of mimickers in both cohorts (201 (77%) and 93 (65%), respectively) were eliminated at the exclusion criteria stage (table 7). Online supplementary tables 3 and 4 list the inclusion criteria fulfilled by the cases classified as IgG4-RD and cases submitted as mimickers in the first and second validation cohorts.
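
    The counts reported for the first validation cohort can be turned into test characteristics with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that specificity is computed over the 260 mimickers that met entry criteria, an assumption that is ours but that reproduces the reported sensitivity (85.5%), specificity (99.2%) and diagnostic OR (761.5).

```python
def summarize_counts(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float, float]:
    """Return (sensitivity, specificity, diagnostic odds ratio) from a 2x2 table.

    The diagnostic OR is (TP * TN) / (FN * FP); it is undefined when
    either FN or FP is zero, which this sketch does not handle.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    diagnostic_or = (tp * tn) / (fn * fp)
    return sensitivity, specificity, diagnostic_or


# First validation cohort, counts taken from the text.
cases, false_negatives = 428, 62                    # IgG4-RD cases; cases not fulfilling the criteria
mimickers_meeting_entry, true_negatives = 260, 258  # assumed denominator for specificity

true_positives = cases - false_negatives                     # 366
false_positives = mimickers_meeting_entry - true_negatives   # 2

sens, spec, dor = summarize_counts(
    true_positives, false_negatives, true_negatives, false_positives
)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}, diagnostic OR {dor:.1f}")
# prints: sensitivity 85.5%, specificity 99.2%, diagnostic OR 761.5
```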

    Discussion

    The 2019 ACR/EULAR IgG4-RD criteria represent a significant milestone in IgG4-RD, a multiorgan condition with myriad clinical presentations.3 4 Our approach reflects the fact that in clinical practice, information from clinical, serological, radiological and pathological evaluations must be integrated to arrive at a confident decision about whether to classify a patient as having IgG4-RD. The excellent sensitivity and specificity of these criteria will assist in the conduct of clinical trials and other studies of IgG4-RD. The purpose of these classification criteria is to facilitate the identification of more homogeneous groups of subjects for inclusion into clinical trials and observational studies.26–28

    No set of classification criteria can be constructed so as to include all patients within the spectrum of a disease. Accordingly, attempts to include all conceivable patients with clinical diagnoses of IgG4-RD would inevitably involve major sacrifices in specificity that would lead to the unacceptable inclusion of a significant percentage of false-positive cases. Our principal goal in constructing these classification criteria was to create a criteria set with the highest possible specificity while retaining moderately high sensitivity. The specificity of 97.8% achieved at a threshold of ≥20 points will include few false-positive cases: a highly desirable performance measure for clinical trials and other investigations. The sensitivity of 82.0% at this threshold also captures a broad spectrum of the patient population about whose IgG4-RD classification investigators are confident. The classification criteria for IgG4-RD that we have developed demonstrate robust test characteristics across a range of thresholds, suggesting that they will have broad relevance to the field of IgG4-RD investigation.

    These criteria are not intended for use in clinical practice as the basis of establishing the diagnosis of IgG4-RD.29 If the appropriate clinical diagnosis for a patient is IgG4-RD, then failure to fulfil the ACR/EULAR classification criteria should not prevent the management of that patient’s condition accordingly. There might be a substantial likelihood of this when, for example, a representative biopsy sample is difficult to obtain.30 These criteria provide a useful framework for clinicians considering the diagnosis of IgG4-RD in a patient. They highlight findings such as bilateral salivary gland enlargement, common features of IgG4-related kidney disease and typical pancreas abnormalities that increase the likelihood that a patient has IgG4-RD. They also describe findings that suggest alternative diagnoses are more likely, such as primary granulomatous inflammation, ANCA positivity and fevers. However, the exclusion criteria should not be interpreted as a list of studies or tests a clinician must obtain on every patient.

    An important strength of this criteria set is that a patient may be classified accurately as having IgG4-RD in many cases even in the absence of a biopsy. Although biopsies are essential in many settings to establish the diagnosis of IgG4-RD and exclude mimickers, we aimed to develop criteria in which biopsy is not required when the diagnosis of IgG4-RD is straightforward on the basis of clinical, serological and radiological findings. Such criteria are consistent with clinical practice,7 31 compatible with research and essential to the appropriate diagnosis of patients in both clinical and research settings. The fact that the 2019 ACR/EULAR IgG4-RD classification criteria require neither a biopsy nor an elevated serum IgG4 level reflects important changes in the approaches whereby classifications of this disease are now assigned (and clinical diagnoses rendered). Nearly 20% of cases classified as IgG4-RD had a normal serum IgG4 concentration or did not have a serum IgG4 value available. Moreover, 9% of the IgG4-RD cases did not have a biopsy, 37% lacked the classic histopathological findings and >40% did not meet previously defined cut-offs for IgG4+ plasma cell infiltrates.9 These criteria reflect the reality of clinical care and clinical investigation in IgG4-RD; clinicians consider a combination of factors when determining whether to classify a patient as having this disease.10

    The 2019 IgG4-RD classification criteria are one of the first sets of classification criteria in rheumatology to include absolute exclusion criteria that are not based solely on having an alternative diagnosis, but rather focus on clinical, serological, radiological and pathological features. This approach has strong appeal, particularly when the common mimickers of IgG4-RD themselves pose challenges in classification because of their multiorgan nature. Our sensitivity analysis indicated that in the absence of exclusion criteria, the specificity of the classification criteria decreased by nearly 10%, yet was accompanied by only a small improvement in sensitivity.

    Some patients with clinical diagnoses of IgG4-RD will not fulfil these classification criteria. There are several explanations for this. First, we excluded patients with disease that affected only organs or sites that are involved only infrequently in IgG4-RD (eg, patients with pituitary, breast, skin or prostate disease). We focused our classification criteria development efforts on patients with more typical and common manifestations because of the desire to enrol relatively homogeneous populations in clinical trials. Second, some patients were excluded because their clinical evaluations identified exclusion criteria. Again, for the purposes of clinical trials, the exclusion of exceptional cases is usually prudent. Third, some patients met the entry criteria and did not meet exclusion criteria but still failed to accrue sufficient inclusion points to be classified as having IgG4-RD. Patients considered with confidence by their investigators to have IgG4-RD who did not fulfil the classification criteria were significantly less likely to have had a biopsy. It is possible that in some of these cases, a biopsy showing typical features of IgG4-RD might be useful for achieving sufficient points for the patient to be classified as having IgG4-RD.
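
    The three routes to non-classification described above correspond to the three sequential steps of the classification algorithm. A minimal sketch of that flow follows, assuming the inclusion points for a case have already been summed from the weighted items. All class, field and function names are hypothetical; only the order of the steps and the threshold of ≥20 points come from the text.

```python
from dataclasses import dataclass
from enum import Enum

THRESHOLD = 20  # inclusion-point threshold reported in the text (>=20 points)


class Outcome(Enum):
    NO_ENTRY = "entry criterion not met"
    EXCLUDED = "exclusion criterion present"
    INSUFFICIENT_POINTS = "inclusion points below threshold"
    CLASSIFIED = "classified as IgG4-RD"


@dataclass
class Case:
    # Hypothetical fields summarising one patient's evaluation.
    meets_entry_criterion: bool    # characteristic organ involvement
    has_exclusion_criterion: bool  # any clinical, serological, radiological or pathological exclusion
    inclusion_points: int          # total weighted inclusion points, summed beforehand


def classify(case: Case, threshold: int = THRESHOLD) -> Outcome:
    """Apply entry, exclusion and inclusion steps in that order."""
    if not case.meets_entry_criterion:
        return Outcome.NO_ENTRY
    if case.has_exclusion_criterion:
        return Outcome.EXCLUDED
    if case.inclusion_points < threshold:
        return Outcome.INSUFFICIENT_POINTS
    return Outcome.CLASSIFIED


# Illustrative inputs only; these are not patient data from the study.
print(classify(Case(True, False, 24)).value)
print(classify(Case(True, False, 12)).value)
```

    Returning the step at which a case dropped out, rather than a bare yes or no, mirrors the way the reasons for non-classification are tabulated in the Results.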

    Our study has a number of strengths. First, a cohort of nearly 1900 patients with either IgG4-RD or a mimicking condition was assembled by an international group of investigators. Second, the experts involved in the consensus exercises, decision analysis and cohort development represented investigators from a variety of specialties (eg, rheumatology, gastroenterology, pathology and radiology) and from around the world, including the Americas, Europe, Asia and Australia. Moreover, many investigators involved in cohort development were not involved in other aspects of the classification criteria development, minimising any influence of circularity of reasoning. Such a bias can occur when the same investigators who define criteria also develop derivation and validation cohorts.22 Our design prevented this potential bias. Third, we applied multicriteria decision analysis to derive the weights for each inclusion criteria item. These weights can be adjusted easily if or when other tests or information relevant to diagnosis become available.

    Despite these strengths, our study has certain limitations. First, although the derivation and validation sets included a wide range of IgG4-RD mimickers, the performance of these classification criteria might be further evaluated in specific populations enriched for malignant conditions, non-IgG4-RD pancreatobiliary diseases and infections. Because of the specific exclusion criteria intended to address these groups of mimickers, however, the 2019 ACR/EULAR criteria should perform well under such circumstances. Second, the laboratory, imaging and pathological findings were not assessed centrally. Although the sensitivity and specificity of certain results may consequently have varied between investigator sites, this is unlikely to have affected our results significantly because of the expertise of the research group overall.

    In summary, these are the first classification criteria for IgG4-RD, developed and tested using a data-driven approach and multicriterion decision analysis. The criteria perform well over a wide range of thresholds. They represent a significant advance in this rapidly evolving field and should be used in future clinical trials and epidemiological studies of IgG4-RD.

    Footnotes

    • This article is published simultaneously in the January 2020 issue of Arthritis & Rheumatology.

    • Collaborators Members of the ACR/EULAR IgG4-RD Classification Criteria Working Group are as follows: Drs Takashi Akamizu, Mitsuhiro Akiyama, Lillian Barra, Adrian Bateman, Daniel Blockmans, Pilar Brito-Zeron, Corrado Campochiaro, Mollie Carruthers, Tsutomu Chiba, Lynn Cornell, Emma Culver, Saman Darabian, Vikram Deshpande, Lingli Dong, Mikael Ebbo, Andreu Fernández-Codina, Judith A Ferry, George Fragkoulis, Fabian Frost, Luca Frulloni, Gabriela Hernandez-Molina, Haihan Ji, Karuna Keat, Terumi Kamisawa, Shigeyuki Kawa, H. Kobayashi, Yuzo Kodama, Satoshi Kubo, Kensuke Kubota, Haiyang Leng, Markus Lerch, Yanying Liu, Zhifu Liu, Matthias Löhr, Eduardo Martin-Nares, Ferran Martinez-Valle, Chiara Marvisi, Yasufumi Masaki, Shoko Matsui, Ichiro Mizushima, Seiji Nakamura, Jan Nordeide, Kenji Notohara, Sergio Paira, Jovan Popovic, Manel Ramos-Casals, James Rosenbaum, Jay Ryu, Yasuharu Sato, Hiroshi Sekiguchi, Evgeniya V. Sokol, James R Stone, Wenwu Sun, Hiroki Takahashi, Masayuki Takahira, Yoshiya Tanaka, Augusto Vaglio, Alejandra Villamil, Yoko Wada, George Webster, Kazunori Yamada, Motohisa Yamamoto, Joanne Yi, Yinlan Yi, Giuseppe Zamboni, Wen Zhang.

    • Contributors All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. JHS had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Study conception and design; acquisition of data; analysis and interpretation of data: ZSW, RPN, SC, HKC, ED-T, J-FD, PAH, DI, MK, AK, KK, ML, KO, CAP, AS, TS, HS, NS, JRS, NT, HU, GW, YZ and JHS.

    • Funding Supported by the ACR and the EULAR.

    • Competing interests None declared.

    • Patient consent for publication Not required.

    • Ethics approval This criteria set has been approved by the EULAR Executive Committee and the ACR Board of Directors. This signifies that the criteria set has been quantitatively validated using patient data, and it has undergone validation based on an independent data set. All ACR/EULAR-approved criteria sets are expected to undergo intermittent updates. The ACR is an independent, professional, medical and scientific society that does not guarantee, warrant, or endorse any commercial product or service.

    • Provenance and peer review Not commissioned; internally peer reviewed.