Development and preliminary validation of the Sjögren’s Tool for Assessing Response (STAR): a consensual composite score for assessing treatment effect in primary Sjögren’s syndrome

Objective To develop a composite responder index in primary Sjögren’s syndrome (pSS): the Sjögren’s Tool for Assessing Response (STAR).
Methods To develop STAR, the NECESSITY (New clinical endpoints in primary Sjögren’s syndrome: an interventional trial based on stratifying patients) consortium used data-driven methods based on nine randomised controlled trials (RCTs) and consensus techniques involving 78 experts and 20 patients. Based on reanalysis of rituximab trials and the literature, the Delphi panel identified a core set of domains with their respective outcome measures. STAR options combining these domains were proposed to the panel for selection and improvement. For each STAR option, sensitivity to change was estimated by the C-index in nine RCTs. Delphi rounds were run for selecting STAR. For the options remaining before the final vote, a meta-analysis of the RCTs was performed.
Results The Delphi panel identified five core domains (systemic activity, patient symptoms, lachrymal gland function, salivary gland function and biological parameters), and 227 STAR options combining these domains were selected to be tested for sensitivity to change. After two Delphi rounds, a meta-analysis of the 20 remaining options was performed. The candidate STAR was then selected by a final vote based on metrological properties and clinical relevance.
Conclusion The candidate STAR is a composite responder index that includes all main disease features in a single tool and is designed for use as a primary endpoint in pSS RCTs. The rigorous and consensual development process ensures its face and content validity. The candidate STAR showed good sensitivity to change and will be prospectively validated by the NECESSITY consortium in a dedicated RCT.


INTRODUCTION
For decades, evidence-based therapy in primary Sjögren's syndrome (pSS) has largely been based on sicca features or patient-reported outcomes (PROs). Over the past 20 years, work from an international consortium, supported by the European Alliance of Associations for Rheumatology (EULAR), has led to the development and validation of the consensual EULAR Sjögren's Syndrome Disease Activity Index (ESSDAI) and EULAR Sjögren's Syndrome Patient Reported Index (ESSPRI). [1][2][3] Both have emerged as reference standards to measure systemic activity and patients' symptoms, respectively.
Thus, ESSDAI has been used as a primary endpoint in recent randomised controlled trials (RCT) testing biologics, and for the first time in pSS four RCTs have met their primary endpoint. [4][5][6][7] Sjögren's syndrome

ESSDAI has shown promising capability to monitor changes in disease activity and assess therapeutic efficacy. Nonetheless, several trials failed to show improvement in ESSDAI, [8][9][10] perhaps due to inefficacy of the drugs, but also potentially due to the relatively high placebo response rates observed with ESSDAI. The lack of efficacy may also be explained by the absence from ESSDAI of important features, such as patients' symptoms and glandular function. [11] Recent RCTs showed that improvement in ESSDAI does not necessarily translate to improvement in PROs. [4-7][12] Thus, used as the sole primary endpoint, ESSDAI does not capture all important disease features. These limitations are inherent to scale constructs and highlight the need for a composite endpoint able to assess the disease globally. [13] The NECESSITY (New clinical endpoints in primary Sjögren's syndrome: an interventional trial based on stratifying patients) consortium (https://www.necessity-h2020.eu/), which brings together pSS experts from academia, the pharmaceutical industry and patient groups, was formed to develop a new composite responder index, the Sjögren's Tool for Assessing Response (STAR). STAR aims to address the limitations of current outcome measures in pSS and is intended for use in clinical trials as an efficacy endpoint. We herein report its development process.

METHODS
The development and preliminary retrospective validation of STAR followed the OMERACT (Outcome Measures in Rheumatology) guidelines [14] and consisted of three steps (figure 1), combining data-driven methods applied to nine RCTs (table 1) with consensus methods. The Delphi panel comprised 78 international pSS experts (57 clinicians, 21 scientists) and 20 patients with pSS (online supplemental 1).
Step 1: identification of the STAR core set
This step aimed to select the core set of domains of relevance in assessing treatment response in pSS and the measurement tool and definition of response for each domain.
We used data from two rituximab trials because, although they failed to demonstrate treatment efficacy in their primary endpoint relying on PROs, clinical experience suggests that rituximab should work in at least some patients and for some endpoints. [15][16][17] When only a portion of patients respond to treatment, this may preclude the identification of an average treatment effect in the whole population. Many statistical methods exist to maximise the chances of detecting parameters that show differential change between active and placebo arms. We here used the virtual twins approach, which identifies subgroups with enhanced probability of response based on their baseline characteristics and which estimates the treatment effect in each subgroup while correcting for optimism due to the data-driven process. [18]

Identification of subsets of responders
The panellists first agreed, based on expertise, literature and patient feedback, on baseline variables to include in the analyses (ie, main pSS characteristics suspected to be associated with response to treatment), and on the definitions of response to treatment, based on existing outcome measures in pSS and validated cut-offs.
Virtual twins regression trees were computed for each definition of response and each set of baseline variables. Responder subsets (ie, a branch of virtual twins analysis) were selected by the lead team based on statistical criteria (a relative risk of response to treatment vs placebo notably higher than in the whole population and a sufficient number of patients (≥60) for statistical power) and clinical relevance (subset identified by a definition of response including both physician and PROs).

Identification of items sensitive to change
The items most sensitive to change were identified based on their effect size (ES; with their 95% CI) for the between-group difference of change in score from baseline to week 24 and in score at week 24 in each responder subset and the whole population. The ES for the difference between groups was assessed by the Cohen's d measure, assuming a pooled SD. [19] CIs were estimated using the non-centrality parameter approach. This method searches for the best non-central parameter (NCP) of the non-central t distribution for the desired tail probabilities, and these NCPs are then converted to the corresponding ES. [20] The larger the ES, the greater the sensitivity to change. [21] ES values are commonly considered large (>0.8), moderate (0.5-0.8) or small (<0.5). The following outcomes were analysed: specific scores (ESSDAI, ESSPRI, and physician and patient global assessment), dryness (global, oral and ocular), pain and fatigue Visual Analogue Scale, glandular function (Schirmer's test and salivary flow), and biological variables (β2 globulin, serum IgG, γ-globulin, erythrocyte sedimentation rate, rheumatoid factor (RF) and C4 complement).
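As an illustration of the effect size calculation described above, the following minimal sketch computes Cohen's d with a pooled SD for two independent groups and labels it with the thresholds quoted in the text. The function names and the numbers are hypothetical and are not taken from the trials; the non-centrality-based CI is not reproduced here.

```python
import math
from statistics import mean, variance


def cohens_d(treated, control):
    """Cohen's d for two independent groups, using the pooled SD."""
    n1, n2 = len(treated), len(control)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / math.sqrt(pooled_var)


def es_label(d):
    """Label an effect size with the thresholds quoted in the text."""
    size = abs(d)
    if size > 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    return "small"


# Made-up changes from baseline to week 24 (illustrative only, not trial data).
active_change = [2.0, 3.0, 4.0, 5.0, 6.0]
placebo_change = [1.0, 2.0, 3.0, 4.0, 5.0]

d = cohens_d(active_change, placebo_change)
print(round(d, 3), es_label(d))
```

With these made-up values the difference in means is 1 and the pooled SD is about 1.58, so the ES falls in the moderate range.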

Selection of domains, items and definition of response
The results of the analyses on sensitivity to change and relevant literature review were presented to the Delphi panel. The scoping review of the literature on outcome measures in pSS will be published elsewhere. Based on these data and on clinical experience, the Delphi panellists were asked to rate the importance of measuring each outcome in the context of assessing treatment response in clinical trials (from not important (1-3) to critical (7-9) on a 9-point Likert scale) and to provide comments and suggest new domains or measurements. Items scored as critical (score ≥7) by ≥50% of the panellists were selected and were defined as the domains to include in STAR. Several items and definitions of response were selected in each domain.
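The selection rule described above (an item is retained when at least 50% of panellists rate it 7 or higher) can be sketched as follows. The item names and ratings are hypothetical and serve only to show the rule; they are not data from the Delphi survey.

```python
def is_selected(ratings, critical_score=7, min_share=0.5):
    """Return True when at least `min_share` of panellists rate the item as critical."""
    critical_votes = sum(1 for rating in ratings if rating >= critical_score)
    return critical_votes / len(ratings) >= min_share


# Hypothetical ratings on the 9-point Likert scale (not data from the study).
ratings_by_item = {
    "item_a": [9, 8, 7, 7, 5, 3],  # 4 of 6 panellists rate it critical
    "item_b": [6, 6, 5, 7, 8, 2],  # 2 of 6 panellists rate it critical
}

selected = [name for name, ratings in ratings_by_item.items() if is_selected(ratings)]
print(selected)
```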
Step 2: construction of STAR options
The lead team prepared the drafts of the STAR options, combining the items and definitions of response identified previously. These draft options, along with the recently developed concise Composite of Relevant Endpoints for Sjögren's Syndrome (CRESS), [22] were presented to the panellists, who selected by vote which designs would be analysed in the next step. They could also suggest combinations and alternative measurement tools or thresholds. Designs with ≥50% of votes, modified as per experts' suggestions, were selected.
Step 3: evaluation of sensitivity to change of STAR options and selection of the candidate STAR
This phase aimed at selecting the candidate STAR and relied on analysis of nine RCTs completed at the time of analysis (table 1).

Analysis of sensitivity to change of STAR options
The responder rate in each group for binary options (or the mean score for continuous options) was calculated for each STAR option in each RCT. Sensitivity to change was estimated using the concordance (C) index, [23] which is similar to the area under the receiver operating characteristic curve for a binary outcome. It ranges from 0 to 1 and is interpreted as follows: 1, perfectly discriminant; between 0.5 and 1, more discriminant than random; 0.5, no better than random; and <0.5, worse than random.
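To make the interpretation of the C-index concrete, here is a minimal sketch of one common way to compute a concordance probability between two arms: the proportion of active-placebo patient pairs in which the active-arm patient has the better outcome, counting ties as one half. This pairwise formulation is an assumption for illustration and may differ from the exact estimator used in the analyses; the data are made up.

```python
def c_index(treated, control):
    """Probability that a random treated patient has a better outcome than a
    random control patient, counting ties as one half."""
    concordant = 0.0
    for t in treated:
        for c in control:
            if t > c:
                concordant += 1.0
            elif t == c:
                concordant += 0.5
    return concordant / (len(treated) * len(control))


# Hypothetical responder status (1 = responder) in each arm, not trial data.
active = [1, 1, 1, 0]   # 75% responders
placebo = [1, 0, 0, 0]  # 25% responders

print(c_index(active, placebo))
```

For a binary option, this quantity reduces to 0.5 plus half the difference in responder rates, so identical responder rates in both arms give 0.5.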

Voting for top 10
These analyses along with explanations on data interpretation were presented to the expert panel. They were asked to vote for their top 10 options. During a follow-up meeting, the results of the vote were discussed to consensually select the options for the next step.

Meta-analysis of the selected options
To better appraise the sensitivity to change of the remaining STAR options, the Delphi panel decided to perform a meta-analysis of the nine RCTs. The Delphi panel voted on which trials they considered positive, negative or 'in between' with regard to primary but also key secondary endpoints. A study that failed to meet its primary outcome was considered 'in between' if the experts agreed that there was sufficient signal of benefit in the secondary outcomes. Meta-analyses were run for 'positive' and 'in between' trials together, in which positive results were expected, and separately for negative trials, in which no difference between groups was expected. For binary outcomes, meta-analyses were run using the Mantel-Haenszel method with the Paule-Mandel estimator for τ², Q-profile method for the CI of τ² and τ, and continuity correction of 0.5 in studies with zero cell frequencies. [24] For continuous outcomes, the inverse variance method was used with the Paule-Mandel estimator for τ², Q-profile method for the CI of τ² and τ, and Hedges' g.
For binary scores, the treatment effect was expressed as OR, where 1 indicates absence of any effect, while above 1 favours the experimental treatment. For continuous scores, the treatment effect was expressed as standardised mean difference, where 0 indicates absence of any effect, while above 0 favours the experimental treatment. Consequently, a STAR option that is sensitive and specific to change should have a treatment effect close to the null effect for the negative trials and as far as possible from the null effect for the positive trials.
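The pooling of binary outcomes can be illustrated with a minimal sketch of the Mantel-Haenszel pooled OR, including the 0.5 continuity correction for studies with a zero cell. This is a simplified fixed-effect version under stated assumptions: it does not implement the Paule-Mandel estimation of τ² or the Q-profile CIs, and the 2x2 tables are invented for the example.

```python
def mantel_haenszel_or(tables, correction=0.5):
    """Pooled odds ratio across 2x2 tables.

    Each table is (a, b, c, d): responders and non-responders in the active
    arm, then responders and non-responders in the placebo arm.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        if 0 in (a, b, c, d):
            # Continuity correction applied only to studies with a zero cell.
            a, b, c, d = (x + correction for x in (a, b, c, d))
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical trials (not data from the nine RCTs).
trials = [(10, 10, 5, 15), (8, 12, 4, 16)]
pooled_or = mantel_haenszel_or(trials)
print(round(pooled_or, 2))
```

A pooled OR above 1 favours the experimental treatment; for trials classified as negative, a well-behaved option would be expected to give a value close to 1.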

Voting for top 3
The results of the meta-analyses were shared with the Delphi panel, who then voted for their top 3 options. During a follow-up meeting, the results were discussed to consensually select the options for the next step.

Voting for the candidate STAR
A final vote was run to select the candidate STAR based on clinical relevance.

Patient involvement
The NECESSITY Patient Advisory Group (PAG) representatives were involved in all steps and participated in every discussion meeting. Other patients contacted by the PAG representatives participated anonymously in the development of STAR (steps 1 and 3). Only PAG representatives participated in step 2 because this exercise required technical knowledge of endpoint construction. The background information provided in each survey was tailored to the patients.

RESULTS
Step 1: identification of the STAR core set

Identification of subsets of responders
The Delphi panel selected two sets of baseline variables for analyses, one with ESSDAI and ESSPRI total scores (set 1) and one with their subscales/domains (set 2) (online supplemental 2), and proposed 14 definitions of response to treatment (online supplemental 3). Virtual twins regression trees were computed and the lead team selected four responder subsets (online supplemental 4).

Identification of items sensitive to change
Analysis of sensitivity to change of each outcome revealed that some outcomes improved significantly more in the rituximab arms than in the placebo arms in at least one responder subset and/or in the whole population (figure 2): (1) among PROs, dryness (overall, oral or ocular) and ESSPRI; (2) among objective dryness measures, unstimulated whole salivary flow (UWSF) but not Schirmer's test; and (3) among biological markers, serum IgG, γ-globulin and RF levels. By contrast, systemic scores did not improve in any subset, except for physician global assessment in subset 3. The results were similar when analysing the ES for between-group differences of change in score from baseline to week 24 (figure 2) or the final value at week 24 (online supplemental 5). For each domain, voting results, as well as clinical relevance, feasibility at clinical sites, and acceptability for patients and regulatory agencies, were considered when selecting the measurement tools. Thus, the Delphi panel selected either one or two measurement tools per domain (online supplemental 6 and 7).

Selection of domains, items and definition of response
For the systemic domain, clinESSDAI was preferred to ESSDAI to avoid redundant recording of the biological parameter. [25] For each glandular domain, two measurement tools were included to ensure the score could be calculated regardless of equipment availability at clinical sites.

Step 2: construction of STAR options
Various designs for STAR were prepared by the lead team (table 2). The designs were inspired by the Disease Activity Score 28, [26] Systemic Lupus Responder Index, [27] American College of Rheumatology response criteria [28] and the recently developed concise CRESS. [22] Various cut-off values were proposed for each measurement. In some designs, due to their importance, systemic activity and PROs were defined as major domains that must improve to meet the definition of a responder. A total of 227 options were selected after voting and a discussion meeting (online supplemental 8).
Step 3: evaluation of sensitivity to change of STAR options and selection of the candidate STAR
Analysis of sensitivity to change was run for the 227 options in the nine RCTs (online supplemental 9). Options in STAR design 1 were rejected because it was not possible to obtain a stable estimation of domain weights to construct a score. Of the 225 remaining options, 189 were not selected by any panellist and were rejected, and 16 additional options, found to be redundant or less clinically relevant than the others, were rejected during the follow-up meeting. Consequently, 20 options moved to the next step.

Based on the panellists' classification of RCTs (online supplemental 10), meta-analyses were computed separately for trials considered 'positive' and for trials considered 'negative' by the experts (figure 3) to allow for comparison of sensitivity and specificity to change, respectively. Based on these results, the panellists voted for their top 3 options. Five options not selected by any panellist were not included in the final vote.
During the follow-up meeting, the panellists agreed that the selection of the candidate STAR from the remaining 15 options should be based on clinical relevance. The rationale for selection was as follows. A decrease in clinESSDAI was preferred to a set score (<5 points) at the final evaluation, to avoid classifying patients with low baseline activity as responders in this domain when their score had not changed from baseline. ESSPRI was preferred to individual dryness scales because it is a validated score. The panellists selected the published minimal clinically important difference (MCID) as the response cut-off for clinESSDAI (≥3 points) and ESSPRI (≥1 point). In addition, the experts rejected the 'no worsening' clause because there is no published consensual definition for worsening of these outcomes and the options with this clause did not show better discriminative capacity (table 3). Finally, since the other 19 options (online supplemental 11) had good psychometric properties, they will be evaluated as exploratory endpoints in the NECESSITY clinical trial (EudraCT no: 2019-002470-32; online supplemental 12).
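The rationale for preferring a decrease from baseline over a fixed final score can be illustrated with a partial sketch covering only the two major domains. It uses the MCID cut-offs quoted above and assumes, as stated in the Discussion, that response requires improvement in at least one major domain; it is not the full STAR definition (glandular and biological domains and any scoring rules in table 3 are omitted), and all function names are hypothetical.

```python
CLINESSDAI_MCID = 3  # points, cut-off quoted in the text
ESSPRI_MCID = 1      # point, cut-off quoted in the text


def domain_improved(baseline, final, mcid):
    """A domain improves when the score decreases by at least the MCID."""
    return (baseline - final) >= mcid


def major_domain_response(clin_base, clin_final, esspri_base, esspri_final):
    """True when at least one major domain (systemic activity or patient
    symptoms) improves by its MCID. Partial sketch, not the full STAR."""
    return (
        domain_improved(clin_base, clin_final, CLINESSDAI_MCID)
        or domain_improved(esspri_base, esspri_final, ESSPRI_MCID)
    )


# Hypothetical patient with low systemic activity and no change in either score:
# a fixed threshold (<5 points) would count the systemic domain as met,
# whereas the change-based rule does not.
print(major_domain_response(4, 4, 6.0, 5.5))

# Hypothetical patient with low systemic activity whose symptoms improve.
print(major_domain_response(2, 2, 6.0, 4.5))
```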

DISCUSSION
The NECESSITY consortium, supported by an international panel of pSS experts, scientists, methodologists and patients, developed a consensual single tool for pSS that globally assesses all disease features and is intended for use as an efficacy endpoint in RCTs: the composite responder index STAR. STAR fulfils the truth, discrimination and feasibility criteria recommended by OMERACT. The strength of our work lies in a rigorous process combining consensus techniques based on the opinion of a large panel with data-driven methods applied to nine trials. In the analyses performed separately for trials considered negative and positive by expert consensus, the candidate STAR was able to show treatment efficacy in positive trials and, unlike some alternative options, did not erroneously detect significant between-arm differences in trials considered negative (figure 3).
Designing a primary endpoint in pSS is challenging due to the wide spectrum of disease features and the great heterogeneity and complexity of signs and symptoms. Major changes in RCT design recently led to the adoption of ESSDAI as the primary outcome and allowed, for the first time, demonstration of treatment efficacy (table 1). These trials also suggested that other outcomes might improve with treatment, such as ESSPRI, UWSF and biological components (IgG and RF levels). However, recent trials focused on patients with moderate to high systemic disease activity, excluding a large proportion of patients with no systemic complications but with high symptom burden. In pSS, low quality of life is mainly driven by PROs rather than systemic activity, [29] and these two domains correlate poorly. [11][30][31] STAR can evaluate treatment response in the full spectrum of patients with pSS, including those with low systemic activity but high burden of symptoms, for whom there remains an important unmet need. Indeed, to avoid the pitfalls of a data-driven process relying on a single trial, the development of the candidate STAR relied on nine trials, some of which included patients with low systemic disease activity, with various timepoints of evaluation (12-48 weeks, but 24 weeks in most cases). A recent important initiative from a group in the Netherlands, also a NECESSITY partner, proposed the CRESS based on reanalysis of the ASAP-III (Abatacept Sjögren Active Patients Phase III Study) trial. [8][22] The CRESS includes the same five domains as STAR, confirming their clinical relevance in the global assessment of pSS. However, STAR has defined two major domains, systemic activity and patient symptoms, and the definition of response requires improvement of at least one. Thus, unlike CRESS, STAR requires improvement of PROs in patients with low systemic activity.
Also, in negative trials, where no difference between arms is expected, the candidate STAR correctly detected no difference between arms, whereas other options such as the concise CRESS did (figure 3). STAR also includes improvement of glandular function using simple and validated measures, that is, Schirmer's test, sicca ocular staining score (OSS) [32] and UWSF, but also includes salivary gland ultrasound, leaving the door open to more sophisticated tests to evaluate these domains in the future. Lastly, although IgG and RF levels do not reflect patients' perceived disease burden, the experts decided to include them because they considered that, whatever the mechanism of action of the drug, decreasing the levels of these biomarkers, which are signs of activity (IgG) or predictive markers of lymphoma (RF), is a therapeutic goal. [33][34]
Nevertheless, our study has some limitations. The main issue is circular reasoning, since pSS experts may be tempted to define a patient as a responder or a non-responder, or a trial as positive or negative, based on pre-existing indexes. This may give high weight to previous indexes, leaving little room for very innovative items, which by definition were not included in previous RCTs and cannot be evaluated at this stage. Nevertheless, these definitions relied on a high level of consensus (online supplemental 10) after evaluation of multiple independent RCTs. Finally, in most of the trials, OSS, ultrasound data and RF levels were not available and thus the impact of these outcomes on STAR response cannot be evaluated at this stage. The NECESSITY PAG strongly supports the STAR outcome (see letter of support in online supplemental 13). Recommendations from the European Medicines Agency (EMA) were sought through a scientific advice procedure, and the EMA has offered to publish on their website a letter of support for STAR (https://www.ema.europa.eu/en/documents/other/letter-supportsjogrens-tool-assessing-response-star_en.pdf).
Additional steps are also under way, in collaboration with OMERACT, to fulfil all requirements for STAR to be formally endorsed.
Even though this process relied on an almost unprecedented number of experts and RCTs, STAR must, beyond the present retrospective validation, be prospectively validated in an independent population in the NECESSITY RCT (online supplemental 12). The strength of this validation step is its evaluation of the psychometric properties of STAR, in particular its discriminant capacity in an interventional study where active and placebo arms will be compared. Also, patients will be stratified according to systemic activity, allowing the evaluation of the properties of STAR in any patient with pSS who has either high systemic activity or a high level of symptoms. We strongly encourage the use of the candidate STAR to evaluate its properties in diverse patient populations with treatments of various mechanisms of action, in order to definitively validate STAR as a gold standard outcome measure for RCTs in pSS.