Validation of a score tool for measurement of histological severity in juvenile dermatomyositis and association with clinical severity of disease

Objectives To study muscle biopsy tissue from patients with juvenile dermatomyositis (JDM) in order to test the reliability of a score tool designed to quantify the severity of histological abnormalities when applied to biceps humeri in addition to quadriceps femoris. Additionally, to evaluate whether elements of the tool correlate with clinical measures of disease severity.
Methods 55 patients with JDM with muscle biopsy tissue and clinical data available were included. Biopsy samples (33 quadriceps, 22 biceps) were prepared and stained using standardised protocols. A Latin square design was used by the International Juvenile Dermatomyositis Biopsy Consensus Group to score cases using our previously published score tool. Reliability was assessed by intraclass correlation coefficient (ICC) and scorer agreement (α) by assessing variation in scorers' ratings. Scores from the most reliable tool items correlated with clinical measures of disease activity at the time of biopsy.
Results Inter- and intraobserver agreement was good or high for many tool items, including overall assessment of severity using a Visual Analogue Scale. The tool functioned equally well on biceps and quadriceps samples. A modified tool using the most reliable score items showed good correlation with measures of disease activity.
Conclusions The JDM biopsy score tool has high inter- and intraobserver agreement and can be used on both biceps and quadriceps muscle tissue. Importantly, the modified tool correlates well with clinical measures of disease activity. We propose that standardised assessment of muscle biopsy tissue should be considered in diagnostic investigation and clinical trials in JDM.


INTRODUCTION
The idiopathic inflammatory myopathies are rare complex chronic inflammatory disorders affecting muscle, skin and other organs. The most common childhood idiopathic inflammatory myopathy (onset before 16th birthday), juvenile dermatomyositis (JDM), has an incidence of 2-3 cases/million/year. 1 2 The rarity of JDM makes recognition and assessment challenging for clinicians and histopathologists. Until now there has been no standardised histological approach to the assessment of the severity of abnormalities in muscle biopsy specimens from patients with suspected JDM. The JDM Cohort and Biomarker study collects clinical data and samples, including biopsy material, from children with myositis from across the UK and Ireland. 3 The International Juvenile Dermatomyositis Biopsy Consensus Group has previously designed and tested a scoring tool for assessment of the severity of pathological change in biopsy specimens from patients with suspected or proven JDM. 4 This tool assesses features agreed to be characteristic of JDM, organised into four domains (inflammatory, vascular, muscle fibre and connective tissue). The tool also includes an overall score of severity, scored by marking a Visual Analogue Scale (histopathologists' VAS) (1.0-10.0 cm). If particular items assessed within a score tool correlate well with clinical features, disease course or response to treatment, the tool would be a valuable addition to the evaluation of this complex disease. A similar approach to quantify features of renal allograft rejection was refined, validated and tested to produce the Banff scoring system. 5 Retrospective studies of JDM biopsies have suggested that morphological features may correlate with clinical course but standardised scoring systems have not been previously used. 6 To our knowledge there are no similar standardised tools available for assessment of pathological features in muscle biopsies.
The aims of this study were to reassess the reliability of the JDM score tool in quadriceps, including a comparison with our previous results, 4 and to assess reliability of the tool when applied to biceps humeri, since this muscle is regularly sampled in some centres. To support the utility of the tool for multicentre studies we wished to determine the intraobserver agreement of the tool elements. Finally, we evaluated whether items of the score tool are associated with clinical measures of disease severity.

PATIENTS AND METHODS
Patients and biopsy material
Patients were recruited in the UK (through the UK JDM Cohort and Biomarker study) and Brazil. Both studies had full approval from ethical review boards and were carried out according to the Declaration of Helsinki. Criteria for inclusion were that children had definite or probable JDM according to the Bohan and Peter criteria, 7 and biopsy material was available for research. All children had disease duration of <12 months before biopsy and had their biopsy sample taken before use of steroids or disease-modifying agents such as methotrexate or other immunosuppressive agents. A total of 55 cases were available: 33 from the UK, 22 from Brazil (table 1). UK muscle samples were all from the quadriceps femoris (vastus lateralis): 11 of these were reported in a previous study. 4 In this study those 11 were analysed only for the correlation with clinical data. Brazil tissues were all from biceps humeri. We have shown that biceps and quadriceps have subtle differences in fibre size, fast:slow fibre ratio and capillary:fibre ratio; therefore the muscle source of biopsy tissue is an important consideration on assessment. 9 In both cohorts, clinical data at disease onset, serum muscle enzyme levels, erythrocyte sedimentation rate and muscle strength measured by manual muscle testing on the Medical Research Council (MRC) scale 0-5 were recorded. 8 Complications, including calcinosis, skin ulceration, lung, cardiac and gastrointestinal (GI) involvement, were assessed before biopsy. In the UK cohort, data on the Childhood Myositis Assessment Score (CMAS), an assessment of overall strength and stamina, 10 as well as physician's global assessment (PGA, range 0.0-10.0) were also available. 11

Scoring exercises
The International Working Group on JDM Biopsy previously proposed a score tool for assessment of JDM biopsy, designed using samples from quadriceps femoris. 4 In this study the same group of experts reconvened to assess inter- and intraobserver reliability of the tool, its reliability when used to score biceps tissue samples and to test correlation between elements of the score tool and clinical features. All validation and reliability data were generated from 44 new cases not used in our previous study. For the main scoring exercise to assess interobserver reliability, 11 quadriceps samples and 11 biceps samples were selected to include cases in each group demonstrating a range of features and severity (judged by HV and JLH). The quadriceps and biceps samples were allocated by an 11×11 Latin square design for each group, as described previously. 4 A further 22 biopsy samples (11 quadriceps, 11 biceps) were each assessed by five scorers, randomly assigned using a separate partial Latin square design (11×5) for quadriceps and biceps. Scorers did not know to which set of results their scores would contribute. Data from 11 quadriceps cases were available from our previous study. 4 These data allowed inclusion of all 55 cases in the final analysis for association with clinical features. To assess intraobserver agreement, eight quadriceps cases were scored again by eight scorers in an 8×8 Latin square, 3 days after the initial scoring exercise. For each scoring exercise the full panel of stained sections was available as above; scorers were aware of age at time of biopsy and the muscle source of each biopsy.
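The allocation schemes described above can be illustrated with a short sketch. This is not the group's actual allocation procedure, which is described in the earlier publication; it assumes a simple cyclic construction followed by random shuffling, and the function names (`cyclic_latin_square`, `randomise`, `partial_latin_rectangle`, `is_latin`) are hypothetical. For the 11×11 design, rows could stand for scorers, columns for scoring positions and entries for cases; for the 11×5 design, rows could stand for cases and entries for the scorers assigned to them.

```python
import random


def cyclic_latin_square(n):
    """Return an n x n Latin square where entry (row, col) is (row + col) mod n."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def partial_latin_rectangle(n, k):
    """Return the first k columns of the cyclic n x n square (an n x k design)."""
    return [row[:k] for row in cyclic_latin_square(n)]


def randomise(design, seed=None):
    """Shuffle rows, columns and symbol labels; none of these breaks the Latin property."""
    rng = random.Random(seed)
    rows = list(range(len(design)))
    cols = list(range(len(design[0])))
    rng.shuffle(rows)
    rng.shuffle(cols)
    symbols = sorted({s for row in design for s in row})
    relabel = dict(zip(symbols, rng.sample(symbols, len(symbols))))
    return [[relabel[design[r][c]] for c in cols] for r in rows]


def is_latin(design, n_symbols):
    """Check that no symbol repeats within any row or any column."""
    rows_ok = all(len(set(row)) == len(row) for row in design)
    cols_ok = all(len(set(col)) == len(col) for col in zip(*design))
    in_range = all(0 <= s < n_symbols for row in design for s in row)
    return rows_ok and cols_ok and in_range


# Hypothetical usage: an 11 x 11 design and an 11 x 5 partial design.
full_design = randomise(cyclic_latin_square(11), seed=1)
partial_design = randomise(partial_latin_rectangle(11, 5), seed=2)
```

Under these assumptions, every scorer in the 11×5 design is assigned exactly five cases and every case is seen by five different scorers, which balances workload across the panel.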

Data analysis, statistics and decision on most informative items
Data from the scoring exercises were analysed to provide two summary measures, as used previously. 4 We used an intraclass correlation coefficient (ICC) as a measure of reliability, and as a measure of scorer agreement we used the ratio of the estimates of the SD attributable to the scorer to the SD attributable to the cases (α). 11 The ICC and α value for each domain and each item were used to classify the data as good, good* or poor. 4 11 Items reaching an ICC>0.6 were considered to have high reliability while items with an α score<0.4 had high agreement.
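As an illustration of how the two summary measures relate to variance components, the sketch below estimates them from a fully crossed table of scores (every scorer scores every case once) using a two-way analysis of variance. This is an assumption made for illustration only: the text does not state which ICC form or estimation method was used, and the function names are hypothetical. Here the ICC is taken as the case variance divided by the sum of case, scorer and residual variances, and α as the scorer SD divided by the case SD, as defined above.

```python
from math import sqrt


def variance_components(scores):
    """Estimate case, scorer and residual variances from scores[case][scorer].

    Assumes a fully crossed design with no missing values (method of moments
    from a two-way ANOVA without replication). Negative estimates are set to 0.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    case_means = [sum(row) / k for row in scores]
    scorer_means = [sum(col) / n for col in zip(*scores)]
    ss_case = k * sum((m - grand) ** 2 for m in case_means)
    ss_scorer = n * sum((m - grand) ** 2 for m in scorer_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_case - ss_scorer
    ms_case = ss_case / (n - 1)
    ms_scorer = ss_scorer / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    var_case = max((ms_case - ms_error) / k, 0.0)
    var_scorer = max((ms_scorer - ms_error) / n, 0.0)
    var_error = max(ms_error, 0.0)
    return var_case, var_scorer, var_error


def icc_and_alpha(scores):
    """Return (ICC, alpha) under the assumptions stated in the lead-in."""
    var_case, var_scorer, var_error = variance_components(scores)
    total = var_case + var_scorer + var_error
    icc = var_case / total if total > 0 else float("nan")
    alpha = sqrt(var_scorer) / sqrt(var_case) if var_case > 0 else float("inf")
    return icc, alpha


# Hypothetical data: three cases scored by two scorers, the second scorer
# always one point higher than the first.
example_scores = [[0, 1], [1, 2], [2, 3]]
example_icc, example_alpha = icc_and_alpha(example_scores)
```

In this made-up example the scorers rank the cases identically, yet the systematic offset between them lowers the ICC and raises α, which is why the two measures are reported side by side.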
Where both reliability and agreement were high (ICC>0.6, α<0.4), the item was classified as good; where either reliability or agreement was high, but not both, performance of the item was classified as good*. Where agreement and reliability were low (ICC<0.6, α>0.4), the item was classified as poor. In the intrarater exercise we calculated proportional agreement (pA; the number of exact agreements of score divided by the number of biopsies (n=8)) achieved by each scorer and for each item of the tool. The median (and range) pA across all scorers for each element of the tool is reported. 12 To explore associations between clinical measures of disease severity and tool items we used the modal score for each item in the tool, with the exception of domain totals for which we used the median values. Examination of these associations was restricted to tool items which consistently exhibited good* or good rating. Specifically, if they achieved good or good* in our original scoring exercise 4 and in both the 11×11 scoring exercises conducted for this study, they were considered 'informative items' and suitable for further analysis. Ordered categorical and binary variables (eg, MRC score, presence/absence of skin ulceration, biopsy score tool items) were compared between biopsy groups using Pearson's χ² test or Fisher's exact test, as appropriate. Age at onset, age at biopsy, time to biopsy, histopathologists' VAS and modified domain total scores were compared using the Mann-Whitney U test.
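The classification rule and the proportional agreement calculation described above can be expressed as a minimal sketch. The function names and the example scores are hypothetical. The text does not say how values exactly equal to a threshold are handled; here they are assumed not to count as high.

```python
from statistics import median


def classify_item(icc, alpha, icc_cut=0.6, alpha_cut=0.4):
    """Classify a tool item as 'good', 'good*' or 'poor' from its ICC and alpha.

    Assumption: values exactly equal to a cut-off are treated as not high.
    """
    high_reliability = icc > icc_cut
    high_agreement = alpha < alpha_cut
    if high_reliability and high_agreement:
        return "good"
    if high_reliability or high_agreement:
        return "good*"
    return "poor"


def proportional_agreement(first_scores, second_scores):
    """Fraction of biopsies given exactly the same score on both occasions."""
    if not first_scores or len(first_scores) != len(second_scores):
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(a == b for a, b in zip(first_scores, second_scores))
    return matches / len(first_scores)


def median_and_range(pa_values):
    """Summarise pA across scorers as (median, minimum, maximum)."""
    return median(pa_values), min(pa_values), max(pa_values)


# Hypothetical example: one scorer rating eight biopsies on two occasions.
occasion_1 = [0, 1, 2, 1, 0, 2, 1, 1]
occasion_2 = [0, 1, 2, 2, 0, 2, 1, 0]
pa = proportional_agreement(occasion_1, occasion_2)  # 6 of 8 scores match
```

Because pA counts only exact matches, it is a strict measure for ordinal items: a one-point disagreement is penalised as heavily as a larger one.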
The scores for the informative items, modified domain total scores and histopathologists' VAS were assessed for associations with measures of muscle strength by calculating the Spearman's rank correlation coefficient and conducting a test of independence. Pearson's χ² test or Fisher's exact test were used to assess whether scores for informative items were associated with the presence of periungual erythema, skin ulceration, lung or GI involvement. The Kruskal-Wallis test was used to assess whether the modified domain total scores were associated with the presence of periungual erythema, skin ulceration, lung or GI involvement. This test was also used to assess whether the scores for the six informative items were associated with PGA or CMAS in the UK cohort only. Spearman's rank correlation coefficient was used to assess correlation between modified domain total scores and PGA, modified domain total scores and CMAS, histopathologists' VAS and PGA and CMAS. All p values reported are unadjusted for multiple testing.
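For readers unfamiliar with Spearman's rank correlation, the following sketch computes the coefficient as the Pearson correlation of average ranks, which handles the tied values that are common with ordinal scores such as the MRC scale. It is a plain-Python illustration, not the software used in the study; the function names and the example values are hypothetical and are not study data.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (sd_x * sd_y)


# Made-up illustration: higher domain totals paired with lower strength scores.
domain_totals = [2, 5, 7, 9, 12]
mrc_scores = [5, 4, 4, 3, 2]
rho = spearman_rho(domain_totals, mrc_scores)  # negative: more pathology, less strength
```

A negative coefficient here corresponds to the expected direction of association, with more severe biopsy scores accompanying weaker muscles.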

RESULTS
Clinical data
Fifty-five patients with JDM (38 female, 17 male) were included in this study. Table 1 shows the patient demographic and clinical data. Patients had a median age at onset of 6.42 years (IQR 4.04-9.13) and median disease duration of 3.0 months (IQR 2-6) at time of biopsy. There were no significant differences in age at biopsy, duration of disease before biopsy or clinical severity between the two groups of patients, with one exception: at the time of biopsy, the Brazil cohort had six (27%) cases with calcinosis, while the UK cohort had none (p=0.002). Proximal muscle strength as measured by manual muscle testing did not differ between the two groups. CMAS and PGA data were available only from the UK cohort.

Score tool reliability
The score tool and accompanying instructions are shown in online supplementary table S1. 4 Data on score tool reliability were generated from 22 cases (11 quadriceps, 11 biceps), all new cases not included in our previous study. 4 Overall scores for inflammatory and muscle fibre domains, as well as several items from each of these domains and severity assessment by histopathologists, reached high reliability for both quadriceps and biceps samples (table 2). These items were also reliable in our previous study. 4 Intrarater agreement, assessed by pA, was substantial or better (>0.6) in all but one element. The median pA was ≥75% for all the informative items (see online supplementary data, table S2).
Items that can be reliably assessed by the same observer on different occasions and by different observers will be useful in future studies. Therefore we limited further analysis to informative items, that is, those that were the most reliable (shown in bold in table 2). Two of these, overexpression of MHC protein on muscle fibres and infarction, had an α score of 0 indicating high agreement, but low variability since they were either always abnormal (MHC overexpression) or very rarely seen (infarction). These items were excluded from the modified score tool. Selection of an element for further analysis depended on the performance of that element rather than the importance of the pathological feature for diagnostic purposes. Representative examples of items selected for inclusion in the modified score tool, from both biceps and quadriceps biopsies, are shown in figure 1.

Association with disease severity measures
We reasoned that a modified score tool containing the most reliable items would be an appropriate instrument to investigate associations with clinical measures of disease. The most reliable items fell into two domains of the score tool: inflammatory and muscle fibre. Using these items, a modified total score range was calculated for each of these domains. Scores for these informative items, modified domain total scores and overall histopathologists' VAS score data were analysed for all 55 cases (table 3). Comparison of the number of biopsies scoring high or low for each of these items suggested that the biceps samples showed more severe pathology than quadriceps, with differences between scores for the modified muscle fibre domain total, two individual items in the muscle fibre domain, as well as a significantly higher histopathologists' VAS for severity in biceps compared with quadriceps (table 3).
There was evidence to suggest that measures of weakness were associated with biopsy scores for all of the informative items, the modified total domain scores and the histopathologists' overall severity score (table 4). Specifically, a higher modified total for both domains was strongly associated with elbow flexor strength score as assessed by the MRC scale (0-5) (r=−0.59, p<0.0001 and r=−0.60, p<0.0001 for inflammatory and muscle fibre domains, respectively). Within the muscle fibre domain substantial correlations were seen between the neonatal myosin positivity and both measures of strength (r=−0.57, p≤0.001). The histopathologists' VAS was also significantly associated with measures of weakness (table 4). No associations were found between the six informative items and periungual erythema, skin ulceration, lung or GI involvement (data not shown).
For quadriceps biopsies, where data on PGA and CMAS were also available, PGA was associated with the biopsy score for the informative items in the inflammatory domain (CD3+ endomysial, CD3+ perimysial and CD68+ endomysial) and two items In all of the above the direction of the association was as expected; a higher biopsy feature score was associated with higher PGA. Both modified total muscle fibre and modified total inflammatory domains were weakly correlated with CMAS. Details of these correlations are shown in online supplementary table S3.

DISCUSSION
These data provide the first validation of a histological score tool estimating severity in JDM, much needed in this uncommon but potentially devastating autoimmune childhood disease. The tool is designed to measure histological severity using semiquantitative assessment of histological features, rather than to diagnose the condition. This study extends our earlier findings and demonstrates the reliability of the tool, with low inter- and intraobserver variability. Importantly, the most reliable items of the scoring system correlate well with measures of clinical disease activity.
Our study used cases from two different countries, where the muscle used for diagnostic biopsy differs. Although all biopsies were taken early in disease course, calcinosis was more common in cases from Brazil, perhaps reflecting disease severity in that cohort, 13 and biceps biopsy samples were also scored as more severe in several items (table 3). As biceps samples were not available from UK cases, nor quadriceps tissue from Brazilian cases and no case had samples from both muscles, it was not possible to test how the site of the biopsy affects pathological change. It is also possible that there are other differences between the groups of patients, related not to biopsy site, but to differences in clinical care, ethnicity or environment. Despite these potential confounders we found that the score tool functioned equally well on biceps and quadriceps tissue, and the same score items were the most reliable for both sample sets. By incorporating biceps samples we have generated data suggesting that the score tool can be applied to a muscle other than quadriceps. This provides confidence for inclusion in future studies of centres whose biopsy site is routinely either biceps or quadriceps.
After identifying morphological features that proved reliable between different assessors and different muscles, we showed that these items were moderately or strongly correlated with muscle strength, and with the overall PGA and CMAS, where available. Thus the score tool appears to correlate well with muscle disease activity. A limitation to this analysis is that skin score data were not available on a sufficiently large number of cases to compare biopsy assessment with skin disease activity.
The adoption of agreed protocols for histological assessment of tissue has provided important progress in other diseases, especially in conditions where semiquantitative analysis of specific features has been found to correlate with clinical severity and hence influence management. For example the Banff scoring of renal pathology is widely used to quantify allograft rejection, in trials of anti-rejection drugs and in clinical practice. This system has been refined, altered, validated and tested in several stages. 5 Similarly, the BrainNet Europe consortium has tested, standardised and validated assessment of features such as α-synuclein immunoreactive structures and amyloid β, in neurodegenerative diseases. 14 15 In JDM, some evidence suggests that histopathological features indicative of vasculopathy correlate with more aggressive disease, 16 or that features of vasculopathy and necrosis may predict chronicity. 6 However, those studies did not include biopsy analysis by a large group of observers and it is therefore difficult to assess how readily they would translate to multiple

Figure 1 Features of dermatomyositis including the informative score tool items selected for the modified score tool, illustrated in a quadriceps biopsy (A, C, D, F and H) and in a biceps biopsy (B, E, G and I). Perivascular inflammation was seen, often with a perimysial localisation (A and B, arrows indicate vessels). Perifascicular fibre atrophy was a feature of some biopsies, and other fibre abnormalities including basophilia, indicating regeneration, were often more prominent in perifascicular regions (B, double arrow). CD3 immunoreactive T cells were present in the perimysium (C and E, arrows) and also the endomysium (D, arrow). Macrophage infiltrates were identified by CD68 immunohistochemistry in the endomysium (F and G, arrow) and also around vessels (G, double arrow). Neonatal myosin expression could often be seen to have a characteristic perifascicular pattern (H and I). (A and B) haematoxylin and eosin; (C, D and E) CD3 immunohistochemistry; (F and G) CD68 immunohistochemistry; (H and I) neonatal myosin immunohistochemistry. Bars represent: 50 μm in A, B, C, E and G; 25 μm in D and F; 100 μm in H; 260 μm in I.

Table 1
Patient demographics and clinical features at time of biopsy