Validity, reliability, and applicability of seven definitions of hip osteoarthritis used in epidemiological studies: a systematic appraisal
  1. M Reijman1,2,
  2. J M W Hazes2,3,
  3. B W Koes1,
  4. A P Verhagen1,
  5. S M A Bierma-Zeinstra1
  1. 1Department of General Practice, Erasmus MC, Rotterdam, The Netherlands
  2. 2Department of Epidemiology and Biostatistics, Erasmus MC, Rotterdam, The Netherlands
  3. 3Department of Rheumatology, Erasmus MC, Rotterdam, The Netherlands
  1. Correspondence to:
    M Reijman
    MSc, Department of General Practice, Erasmus MC-Faculty, PO Box 1738, 3000 DR Rotterdam, The Netherlands; m.reijman@erasmusmc.nl

Abstract

Objective: To summarise and review articles addressing the quality (validity, reliability, applicability) of seven commonly used definitions of hip osteoarthritis (OA) for epidemiological studies, with a view to using these definitions primarily as classification criteria.

Methods: Medline and Embase were searched and articles studying the validity, reliability, or applicability of the definitions of hip OA were selected. Two reviewers independently extracted data on the quality of the seven definitions.

Results: Review of the literature showed that the validity of the various definitions of hip OA, in particular, has barely been investigated. Minimal joint space (MJS) demonstrated the highest (intra- and interrater) reliability, and showed the highest association with hip pain and restricted internal rotation compared with the other definitions of hip OA. The reliability of the Kellgren and Lawrence grade and the index according to Lane is comparable with that of the MJS, but their construct validity should be investigated more thoroughly. The reliability and validity of the Croft grade were inferior to those of the MJS, the Kellgren and Lawrence grade, and the index according to Lane. Despite precise and extensive development, the ACR criteria showed poor reliability and poor cross-validity (agreement between three ACR criteria sets) in a primary care setting.

Conclusions: The reliabilities of MJS, Kellgren and Lawrence, and the index according to Lane were comparable, but the MJS had the strongest association with hip pain in a male population. Considering how often definitions of hip OA are used, it is surprising that their validity has been so poorly investigated; it needs to be studied more thoroughly.

  • hips
  • osteoarthritis
  • definitions
  • reliability
  • validity
  • review
  • ACR, American College of Rheumatology
  • ICC, intraclass correlation coefficient
  • MJS, minimal joint space
  • OA, osteoarthritis
  • ROM, range of movement

Osteoarthritis (OA) is the most common joint disorder [1] and is a considerable burden on society. Depending on the definition of hip OA used, the prevalence ranges from 7 to 25% in people aged 55 years and over [2]. The hip is particularly interesting because it is often the sole joint affected by OA, suggesting that local biomechanical risk factors are important. In addition, the hip is crucial to independent function [3].

A major problem in studying hip OA is the absence of consensus in defining hip OA for epidemiological and clinical studies [4]. Most epidemiological studies have used a single hallmark of hip OA, namely radiological changes, to define hip OA [5,6].

To investigate (potential) risk factors, a valid and reliable definition of hip OA is required. Therefore we appraised the quality (that is, the validity, reliability, and applicability) of seven definitions of hip OA commonly used for epidemiological studies:

  1. The radiological grading system of Kellgren and Lawrence [7]

  2. Croft’s radiological grading system (a modification of the Kellgren and Lawrence grading system) [8]

  3. Minimal joint space (MJS) according to Croft et al (a measurement of the narrowing of the joint space) [8]

  4. Measurement of the joint space according to Resnick and Niwayama [9]

  5. Three sets of criteria (one clinical, and two combined sets of clinical and radiographic criteria) of the American College of Rheumatology (ACR) [10]

  6. Clinical definition of hip OA: radiological OA combined with pain in the hip region [11,12]

  7. Radiographic index grade according to Lane [13,14].

The objective of our study was to review the quality (reliability, validity, applicability) of these seven definitions of hip OA commonly used in epidemiological studies, with a view to using them primarily as classification criteria [15,16].

METHODS

The literature was searched for all relevant papers containing one of the seven definitions of hip OA. Studies which fulfilled predefined inclusion criteria were identified and subsequently assessed for aspects of reliability, validity, and applicability of the definition of hip OA used in each particular study.

Identification of the literature

To identify the studies a search was made in the following databases: Medline/Pubmed (1966 to March 2002), Cochrane Library, and Embase (1990 to March 2002). The specific keywords were: “osteoarthritis, hip” or “osteoarthritis” and “hip” and “clinical definition”, “radiological definition”, “case definition”, “radiographic grading”, “diagnosis”, “severity”, “index of severity”, “classification criteria”, “radiographic change”, “minimal joint space”, “Kellgren”, “Kellgren and Lawrence”, “reliability”, “reproducibility of results”, “epidemiologic studies”, or “feasibility studies”. The search was extended by screening the reference lists of all relevant articles identified. We repeated the search using the keywords of all selected articles.

Criteria for studies considered for inclusion

A study was included in this review if it fulfilled all of the following criteria: (a) the study group contained people with and without hip OA; (b) it was an original article or a systematic review; (c) at least one of the seven definitions of hip OA investigated here was used; (d) the study described the design, or the reliability, or the validity, or the applicability of at least one of the above mentioned definitions, or the study investigated the risk factors (or determinants) of hip OA and used at least two of the above mentioned definitions.

Critical assessment of OA definitions

Using the information from the criteria of Buchbinder et al [15], Felson et al [16], and Bierma-Zeinstra et al [17], we compiled a list of criteria to evaluate the definitions of hip OA (appendix 1, available on the website at http://www.annrheumdis.com/supplemental). These criteria relate to the reliability, validity, and applicability of the definition of hip OA:

  1. The reliability of the definition, expressed as intra- and interrater reliability

  2. The validity of the definition, expressed as

    1. Criterion validity

      • Expert validity: evaluates the sensitivity and specificity of the classification criteria against a predefined “gold standard” based on expert opinion in a cross-sectional study design [15,16].

      • Predictive validity: evaluates the sensitivity and specificity of the classification criteria against a predefined “gold standard” of “obvious hip OA” (for example, a total hip replacement) after a certain period of follow up [15,16].

    2. Construct validity: evaluates whether the definition correlates with the external variables with which it should correlate [16,17]. In the case of radiological hip OA, the definition should correlate with known symptoms of hip OA (hip pain, disability, limited range of movement (ROM), morning stiffness <1 hour), or with known risk factors of hip OA. If the definition is based on clinical signs, it should correlate with radiological signs of hip OA.

  3. The applicability of the definition of hip OA, expressed in three respects:

    • The ability to discriminate between hip OA and no hip OA

    • The ability to categorise the severity of hip OA

    • The tools and skills needed to define people with hip OA.

  4. A description of the method used to develop the definition of hip OA (content validity).
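
The criterion validity measures in the list above reduce to simple arithmetic on a 2×2 table of the definition against the “gold standard”. The sketch below, in Python, is purely illustrative: the function name and all counts are hypothetical and are not taken from any of the included studies.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity and specificity of a definition against a gold standard.

    tp: definition positive, gold standard positive
    fp: definition positive, gold standard negative
    fn: definition negative, gold standard positive
    tn: definition negative, gold standard negative
    """
    sensitivity = tp / (tp + fn)  # proportion of true cases that are detected
    specificity = tn / (tn + fp)  # proportion of non-cases correctly ruled out
    return sensitivity, specificity


# Hypothetical counts for illustration only
sens, spec = sensitivity_specificity(tp=40, fp=10, fn=10, tn=140)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

For expert validity the gold standard would be the expert judgment at the time of assessment; for predictive validity it would be the later outcome (for example, total hip replacement).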

Two reviewers (MR and SMABZ) independently evaluated the definition of hip OA used in the included articles according to the above criteria. In cases of disagreement, both reviewers tried to achieve consensus. If the disagreement was not resolved, a third reviewer (BWK) was consulted to achieve a final judgment.

Data extraction

In the included studies, data on reliability (various measures of intra- and interrater reliability), construct validity (association measures), and information on the applicability of the seven definitions used for hip OA were collected by two reviewers independently of each other and summarised (descriptive analysis) according to each definition separately.

RESULTS

Identification/selection of the literature

The initial searches resulted in 1170 potentially relevant articles [18]. Of these, 12 articles fulfilled the predefined inclusion criteria. After screening the reference lists of the 12 articles, another two articles were included. Finally, 14 publications were used to extract data on the reliability, validity, or applicability of the definitions for hip OA.

Description of included studies

Of the 14 articles, 13 studied the reliability and 7 the validity of one (or more) of the definitions. Table 1 lists the characteristics of the studies. As can be seen, there is a large difference in the reported prevalence of hip OA, probably due to the large difference in the percentage of men and the different classifications of hip OA used. All studies used a relatively young population (mean age <66 years).

Table 1

Characteristics of the included studies (n = 14)

The 14 studies defined hip OA according to one (or more) of the following seven definitions (appendix 2; available on the website at http://www.annrheumdis.com/supplemental): Kellgren and Lawrence grade (five studies), Croft grade (six), MJS according to Croft et al (eight), MJS according to Resnick and Niwayama (none), the ACR criteria (three), hip pain and joint space narrowing (one), and the index grade according to Lane (two).

RESULTS OF THE INCLUDED STUDIES

Reliability

Of the 14 studies, 13 investigated the reliability of five of the seven definitions of hip OA (table 2).

Table 2

Reliability of the definitions of hip osteoarthritis

The four studies that investigated the reliability of the Kellgren and Lawrence grade reported an intrarater reliability with a κ statistic of 0.76 [19] and a Pearson correlation coefficient of 0.66–0.89 [5], and an interrater reliability with κ statistics of 0.60–0.75 [12–19] and an intraclass correlation coefficient (ICC) of 0.63 [5]. In contrast with more recent studies, the original study of Kellgren and Lawrence showed a relatively lower interrater reliability (correlation coefficient of 0.40) [7].

In five studies the overall grade of Croft (a modification of the Kellgren and Lawrence grade) had an intrarater reliability with κ statistics of 0.49–0.93 [5,8,20,21], but a relatively lower interrater reliability with κ statistics of 0.37–0.79 [5,8,20]. The wide range of intra- and interrater reliability between the studies is mainly explained by the different cut off levels used.

In seven studies the MJS according to Croft et al showed the highest intra- and interrater reliability compared with the other definitions of hip OA. The MJS according to Croft et al showed an intrarater reliability with κ statistics of 0.81–0.85 [5,8,21] and an ICC of 0.83–0.94 [19,20], and an interrater reliability with κ statistics of 0.42–0.84 [5,8,22] and an ICC of 0.75–0.96 [19,20]. Only the study by Hirsch et al [5] described a relatively low interrater reliability, with a κ statistic of 0.42.

Only one study investigated the interrater reliability of the ACR classification(s) [23] and reported a wide range for the clinical set (κ statistics of 0.0–0.65) and for the combined clinical, radiological, and laboratory signs set (κ statistics of 0.31–0.85).

Two studies investigated the reliability of the index according to Lane. These studies reported an intrarater reliability with a κ statistic of 0.83 (⩾ grade 2) and an ICC of 0.70–0.88 [13], and an interrater reliability with κ statistics of 0.72–0.92 (⩾ grade 2) and an ICC of 0.76–0.87 [13,14].
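
As a rough illustration of how an ICC of the kind quoted above can be obtained from repeated joint space readings, the sketch below computes a one-way random-effects ICC in Python. The included studies do not all state which form of the ICC they used, so the one-way form, the function name `icc_oneway`, and the readings (in mm) are assumptions made for illustration only.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for a single reading, often written ICC(1,1).

    ratings: list of per-subject lists, each holding k readings of the
    same radiograph (for example, by k readers or on k occasions).
    """
    n = len(ratings)      # number of subjects (radiographs)
    k = len(ratings[0])   # readings per subject
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-subject and within-subject mean squares from a one-way ANOVA
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical minimal joint space readings (mm) by two readers
readings = [[1.5, 1.6], [2.4, 2.6], [3.1, 3.0], [4.0, 3.8], [2.0, 2.2]]
icc = icc_oneway(readings)
print(f"ICC = {icc:.2f}")
```

The ICC approaches 1 when differences between readers are small relative to differences between subjects, which is why a study group with little spread in joint space width can yield a lower ICC for the same measurement error.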

Validity

None of the screened studies investigated the criterion (expert or predictive) validity of the seven definitions of hip OA. In the 14 studies the construct validity was evaluated by considering two questions: does the radiological definition correlate with known symptoms of hip OA, and does the definition correlate with other definitions of hip OA? Of the 14 studies, seven evaluated the construct validity of two of the definitions of hip OA (MJS and the overall grade of Croft) (table 3). The association between the radiological definition and (known) symptoms of hip OA (hip pain, restricted ROM) was used as a measure of construct validity. The highest association was described between severe radiological hip OA and hip pain, and between severe radiological hip OA and restricted internal rotation of the hip [8,20]. Birrell et al investigated the association between restricted ROM and mild to moderate radiological hip OA, defined as Croft grade ⩾2, and severe OA, defined as MJS ⩽1.5 mm [20]. Internal rotation appeared to be the most discriminating movement for severe hip OA (OR = 46.8 (95% CI 5.2 to 420.0) v 3.6 (95% CI 1.6 to 8.0) for moderate OA). In 1990 Croft et al investigated the association between hip pain and radiological hip OA [8]. Severe hip OA defined by MJS ⩽1.5 mm showed a stronger association with hip pain than severe hip OA defined by the Croft grade (prevalence of 56.0% v 47.5% of those with hip pain). The associations of hip pain with MJS ⩽2.5 mm and with Croft grade ⩾3 were comparable (prevalence of 28.3% v 28.8% of those with hip pain).
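
The odds ratios and confidence intervals quoted above can be read more easily with the underlying arithmetic in view. The Python sketch below computes an odds ratio with a 95% confidence interval on the log scale (the Woolf method) from a 2×2 table. The cited studies may have used a different interval method, and the function name and counts are hypothetical rather than taken from Birrell et al or Croft et al.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a log-scale (Woolf) confidence interval.

    a: sign present, OA present     b: sign present, OA absent
    c: sign absent,  OA present     d: sign absent,  OA absent
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only
or_, lo, hi = odds_ratio_ci(a=20, b=10, c=30, d=140)
print(f"OR = {or_:.1f} (95% CI {lo:.1f} to {hi:.1f})")
```

Because the standard error depends on the reciprocal of each cell count, a table with one very small cell gives a very wide interval, which is the pattern seen in the interval reported for severe hip OA.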

Table 3

Association of definitions with known symptoms of hip osteoarthritis

For the construct validity we also reported the correlation between the different definitions of hip OA. Agreement between the Kellgren and Lawrence definition and the three sets of ACR criteria was very low (κ = 0.03–0.16) [12]. There was moderate agreement between the definition of Kellgren and Lawrence and “hip pain and joint space narrowing” (κ = 0.52) [12]. There was a high association between severe hip OA defined by MJS ⩽1.5 mm and Croft grade ⩾4 (OR = 153.5) [8]. None of the studies compared the association of two or more definitions with known risk factors.

The method of development of the seven definitions of hip OA also differs considerably. The Kellgren and Lawrence grade and the index according to Lane were developed based on the opinion of the researchers. The overall grade of Croft and the MJS were based on a study group, and were developed according to pain within the study group. The ACR criteria sets were also based on a study group, and were developed using regression analysis (classification tree) on the occurrence of hip OA defined by an expert team. The methods of development of the remaining two definitions were not given.

Applicability

The applicability of the definitions of hip OA in the present study was made operational as the ability to discriminate between hip OA and no hip OA, the ability to categorise the severity of hip OA, and the skills and tools needed to classify people according to the respective definitions (table 4). According to their own description, six definitions are used to discriminate between people with and without hip OA, and all six are easy to apply for people at MD level. The Kellgren and Lawrence grade, Croft grade, the MJS and the index grade according to Lane can also categorise the severity of hip OA.

Table 4

Applicability of definitions of hip OA

All definitions include information from a radiograph (except the clinical set of the ACR criteria). The ACR criteria also make use of information from the clinical history and physical examination (restricted ROM).

DISCUSSION

Reviewing the selected literature demonstrates that the validity of the various definitions of hip OA, in particular, has barely been investigated. The highest (intra- and interrater) reliability was reported for the MJS and the index according to Lane, and the MJS showed the highest association with hip pain compared with the other definitions of hip OA.

Despite putting much effort into identifying all relevant articles, some might have been missed because, for example, they used other keywords, had unclear abstracts, or were not indexed in Pubmed or Embase. Although the sensitivity of our search might not be optimal [24,25], we nevertheless believe that we included the most appropriate studies that evaluated aspects of the quality of definitions of hip OA, and assume that the data presented here give a clear insight into the currently available studies on this topic. Only 14 of 1170 potentially relevant articles fulfilled the predefined inclusion criteria. The most restrictive inclusion criterion was that the study group contained people with and people without hip OA.

The problems encountered when comparing the results of the included studies were the differences in study groups (percentage of men), settings (open population, patients with hip pain who consulted their general practitioner), different cut off points for case definitions, and the different or non-transparent statistics used in the studies. For example, the percentage of men in the different studies ranged from 0 to 100%; because sex is a known risk factor for hip OA this will obviously influence the prevalence of hip OA. The prevalence, in turn, will also affect the value of the reliability statistic [26]. One study [23] adjusted the Cohen’s κ value it found for prevalence (prevalence adjusted bias adjusted κ, PABAK [26]); the adjusted κ was much higher than the crude κ.
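
The dependence of κ on prevalence, and the effect of the PABAK adjustment, can be illustrated with a minimal Python sketch for two raters and a yes/no classification. The function name and the agreement table are hypothetical and do not reproduce the data of any included study.

```python
def kappa_and_pabak(a, b, c, d):
    """Cohen's kappa and PABAK for two raters and a binary classification.

    a: both raters say OA           b: rater 1 OA, rater 2 no OA
    c: rater 1 no OA, rater 2 OA    d: both raters say no OA
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement from each rater's marginal proportions
    p_chance = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    kappa = (p_observed - p_chance) / (1 - p_chance)
    # PABAK fixes chance agreement at 0.5, so it depends only on observed agreement
    pabak = 2 * p_observed - 1
    return kappa, pabak


# Hypothetical low-prevalence sample: the raters agree on 92 of 100 hips
kappa, pabak = kappa_and_pabak(a=2, b=4, c=4, d=90)
print(f"kappa = {kappa:.2f}, PABAK = {pabak:.2f}")
```

With few positive cases, chance agreement is high, so the crude κ is low even though the raters agree on most hips; PABAK removes this prevalence effect, which is consistent with the adjusted value being much higher than the crude one.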

In the absence of a “gold standard” for a definition of hip OA, we were particularly careful when evaluating the validity. Two potential solutions for defining a “gold standard”, by expert opinion or by “obvious hip OA” (such as total hip replacement) after a certain period of follow up, were not used in the screened studies. Summarising the available information, it was clear that very few studies investigated the construct validity of the definitions used for hip OA. Of the 14 articles, not one focused on the relationship between risk factors and radiological hip OA, leaving us to evaluate the studies that reported the association between symptoms and radiological hip OA. Croft et al investigated the association between hip pain and radiological hip OA (two definitions of Croft) [8]; in their study group of 1315 men, only 759 completed the questionnaire (243 men died, 152 men were too ill according to the general practitioner). The men excluded were probably older, more disabled, and had more comorbidity than the men included, which might have led to a selection bias; the results of that study should therefore be interpreted with caution. Croft et al also investigated the association between individual radiological features and hip pain [8]; they concluded that MJS (⩽1.5 mm) showed a stronger association with hip pain than osteophytes (56% v 34.4%). Surprisingly, no articles were found that investigated the association between the overall Kellgren and Lawrence grade and hip pain. The validity of three sets of criteria of the ACR was investigated in only one study [12], which concluded that the clinical ACR criteria showed no cross-validity (agreement between three ACR criteria sets) with the two other ACR criteria sets, tested in primary care.

For reliability, the lack of comparability between the different studies is also an important limitation. Different standardisation of the x ray measurements between studies, or a possible difference in MJS between men and women, can influence the reliability results. Only one study directly compared the reliability of the Kellgren and Lawrence grade with MJS (according to Croft) [19]; the MJS showed a better (intra- and interrater) reliability. Five studies directly compared the overall grade of Croft and the MJS [5,8,20,21,27]; all these studies showed a better reliability of the MJS. No studies compared the other definitions. Only three studies reported the time interval between repeated readings: Croft et al 3–5 months [8], Kellgren and Lawrence 1 month [7], and Lane et al 1 month [13]. The length of this interval will probably influence the reliability (a longer time interval between repeat readings will give a lower intrarater reliability) [4]. Overall, we assume that the MJS and the index according to Lane have the highest reliability for epidemiological and clinical studies.

The most commonly used definition of hip OA, the Kellgren and Lawrence grade, is also the one most criticised. Previous criticisms of the Kellgren and Lawrence grade include: inconsistencies in the description of radiographic features of OA [28–30], the prominence awarded to osteophytes at all joint sites [1–30], and a poor interrater and between-centre reliability [1,28–30]. According to the articles included in our review, the interrater reliability was poor only in the original study of Kellgren and Lawrence [7], but much better in three other much larger studies [5,12,19]. Notably, the same description of the Kellgren and Lawrence grade was used in all studies. Therefore in the present study we could not confirm the criticism of inconsistent grades and poor reliability of the Kellgren and Lawrence grade. The main criticism of the Kellgren and Lawrence grade is the importance of the presence of osteophytes. Although it is well known that the association between osteophytes and hip pain is poor [8], not one of the 14 articles investigated the association between the overall Kellgren and Lawrence grade and hip pain. Overall, we assume that the Kellgren and Lawrence grade for hip OA is a useful definition for epidemiological studies.

Summarising the properties of the definitions used for hip OA investigated in the present study, we conclude that:

  1. The MJS showed a good intra- and interrater reliability, a good association with hip pain and restricted internal rotation, and a good applicability; however, the quality (validity, reliability) of this definition should be investigated in an open population.

  2. The Kellgren and Lawrence grade has a reliability comparable to that of the MJS, but the construct validity should be investigated more thoroughly.

  3. The Croft grade appeared to be inferior to the MJS and the Kellgren and Lawrence grade for both reliability and validity.

  4. The ACR criteria (despite their precise and extensive method of development) showed a poor reliability and a poor cross-validity in a primary care setting. Because these data are based on the results of only two studies, more research is needed on the ACR criteria (also in other settings).

  5. The index according to Lane also showed a good intra- and interrater reliability, but no studies were included which investigated the construct validity of this index grading system.

Considering how often the definitions of hip OA are used, it is surprising that the validity has been so poorly investigated. Meanwhile, because of the lack of such validity studies, we recommend that only those definitions with the best construct validity and the best reliability be used in epidemiological studies. We also recommend that the validity, especially the criterion (expert or predictive) validity, of the commonly used definitions be studied more thoroughly.

Acknowledgments

This study was supported by a grant from the Dutch Arthritis Association.

  • Web-only Appendices

    The appendices are available as downloadable PDFs (printer friendly files).

    Files in this Data Supplement:

    • Appendix 1: Criteria used in the present study to evaluate the definitions of hip osteoarthritis used in the literature
    • Appendix 2: Definitions of hip osteoarthritis used in the literature
