Impact of different descriptions of the Kellgren and Lawrence classification criteria on the diagnosis of knee osteoarthritis
Objectives Although the Kellgren and Lawrence (K&L) criteria for defining radiological osteoarthritis are widely used in epidemiological and clinical studies, the authors previously documented the existence of five different versions of these criteria. This study identifies the impact of the use of alternative versions of the K&L criteria and evaluates which description has the highest association with knee complaints.

Methods Two readers scored most radiographs (90%) of the knees of participants of the Rotterdam Study with the original K&L description. In addition, each alternative description was used in a random part (20%) of the radiographs. The authors calculated reproducibility of all descriptions, and compared sensitivity and specificity of the alternative descriptions for three cut-off points (K&L≥1, K&L≥2 and K&L≥3) with the original description as reference standard. The authors calculated κ statistics to compare agreement between the original and alternative descriptions, and evaluated the association with knee complaints.

Results The dataset comprises radiographs of knees of 3071 people. For cut-off K&L≥1 all four alternatives classified more people as having osteoarthritis than the original description; κ was low, and sensitivity and specificity were moderate to good. For cut-offs K&L≥2 and K&L≥3 there was little difference in the number of cases and κ, sensitivity and specificity were good to perfect. The original description and alternative 3 showed the strongest association with knee complaints.

Conclusions The different descriptions of the K&L criteria have an impact on the classification of osteoarthritis in the lowest grade (K&L≥1). All descriptions have strengths and weaknesses. Which description is best depends on the purpose.

Radiological classification of osteoarthritis remains the reference standard despite the emergence of new techniques such as MRI. The explanation is feasibility and tradition, but also the fact that no clear cut-off or overall severity grade exists for osteoarthritis in the MRI classification criteria.1 The most widely used radiological classification criteria for knee osteoarthritis are those developed by Kellgren and Lawrence (K&L) in 1957.2 In epidemiological studies the cut-off point of 2 or more comprises the radiological definition of osteoarthritis.3 Clinical studies also use the K&L criteria to identify and select patients4–7 with a certain grade of osteoarthritis, or to assess osteoarthritis progression.8

A point of concern is the lack of consensus regarding the descriptions and interpretations of the K&L classification criteria.9–11 In a concise report on 18 cohort studies we provided information on no fewer than five different versions/descriptions of the K&L criteria for knee osteoarthritis in use in epidemiological studies.9 Worse, even sequential studies on the same population used different versions. Although the differences between the versions seem small, the impact of this variability remains unclear.9 For instance, the number of cases classified as having osteoarthritis may differ between the descriptions.

Another criticism regarding radiological criteria is that they are not congruent with clinical criteria for osteoarthritis, or with the presence of pain in the knee.1 10 Although this criticism will remain for the different descriptions, the association between knee complaints (pain and/or stiffness) and these descriptions may differ.

This study explores the impact of different descriptions of the K&L criteria on the classification and distribution of severity of knee osteoarthritis, and assesses the association between complaints of the knee and the different descriptions.



The population used in the present study is an extension of the Rotterdam Study (RS-III-1) cohort.12 The Rotterdam Study is a population-based cohort study in which the incidence and risk factors for chronic disabling diseases are investigated. All participants of the RS-III-1 cohort were 45 years and older and living in Rotterdam; the participants were included between 2006 and 2008.12 The medical ethics committee of the Erasmus Medical Center approved the study and all participants gave written consent.

A total of 3071 participants out of the 3932 people included in the RS-III-1 study had radiographs of both knees.

Radiographs and scoring method

All radiographs of the knees were weight-bearing anteroposterior radiographs taken at 70 kV, a focus of 1.8 mm², and focus-to-film distance of 120 cm, using High Resolution G 35×43 cm film (Fujifilm Medical Systems, Stamford, Connecticut, USA). Radiographs of the extended knees were obtained with the patella in the central position. Radiographs were scored by two trained readers, who were blinded to clinical data. Each reader scored half of the radiographs. For the present study, the readers did the scoring using different descriptions of the classification criteria of K&L (table 1). The different descriptions were chosen based on our earlier report9 but with one exception, ie, the cut-off K&L≥2 for definite radiological osteoarthritis in alternative 1 was based on definite osteophytes only (without joint space narrowing (JSN)). This is also an alternative of the K&L criteria used in large studies such as the Framingham13 and the Chingford study.14 In the present study, all radiographs were also scored using semiquantitative separate lesion scoring, eg, osteophytes (grading 0–3) and JSN (grading 0–3).

Table 1

Different descriptions of the classification criteria of Kellgren and Lawrence used in various epidemiological studies

Ninety per cent of all x-rays (n=2772) were scored using the original description. In addition, the radiographs were randomly scored with one of the four alternative descriptions (20% with each alternative description). The original description is the definition as described by Kellgren and Lawrence2 and clarified in the WHO atlas.3

Cut-off points were placed at K&L≥1, K&L≥2 and K&L≥3. Cut-off K&L≥2 is the most important of these because it is the one generally used to define the presence of osteoarthritis and to determine eligibility for studies. A random selection of the radiographs (377 radiographs, 12%) was scored by both readers in order to determine the reproducibility of the scoring. Of this selection, 659 knees were scored twice with the original description and twice with one of the other descriptions (74 with alternative 1; 114 with alternative 2; 152 with alternative 3 and 158 with alternative 4). Of 73 knees only a double score with alternative 1 was present and 22 knees were not scored twice due to the debatable quality of the x-ray.

In the patient interview (held at baseline), all participants answered the question whether or not they had pain or other complaints in the knees. Participants who had experienced pain or complaints in the knee(s) for 1 month or longer were classified as having knee complaints.


For reproducibility of the scoring a 4×4 weighted κ was calculated for each alternative. To show differences between the two readers for each cut-off in the descriptions, a 2×2 κ was calculated with the dichotomous data of each cut-off.
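The two κ calculations described above can be illustrated with a minimal sketch. It assumes linear disagreement weights, which is an assumption because the weighting scheme is not stated in the text. The function names and the example counts are hypothetical and are not data from the study.

```python
from typing import List, Sequence


def weighted_kappa(table: Sequence[Sequence[int]]) -> float:
    """Linearly weighted Cohen's kappa for a square reader-by-reader table.

    table[i][j] is the number of knees graded i by reader A and j by reader B.
    Linear weights are an assumption; the paper does not name the scheme.
    For a 2x2 table this reduces to the ordinary (unweighted) kappa.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def weight(i: int, j: int) -> float:
        # Disagreement weight grows with the distance between the grades.
        return abs(i - j) / (k - 1)

    observed = sum(weight(i, j) * table[i][j]
                   for i in range(k) for j in range(k)) / n
    expected = sum(weight(i, j) * row_tot[i] * col_tot[j]
                   for i in range(k) for j in range(k)) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when all scores fall in one grade")
    return 1.0 - observed / expected


def dichotomise(table: Sequence[Sequence[int]], cutoff: int) -> List[List[int]]:
    """Collapse a k x k table into a 2x2 table at grade >= cutoff."""
    out = [[0, 0], [0, 0]]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            out[int(i >= cutoff)][int(j >= cutoff)] += count
    return out


# Hypothetical 4x4 table of two readers' grades (not study data).
example = [
    [50, 10, 2, 0],
    [8, 20, 5, 1],
    [1, 4, 12, 2],
    [0, 1, 2, 6],
]
print(weighted_kappa(example))                  # agreement over all grades
print(weighted_kappa(dichotomise(example, 2)))  # 2x2 kappa at cut-off >= 2
```

Collapsing the graded table at a cut-off and reusing the same function keeps the per-cut-off κ consistent with the graded κ.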

For each cut-off (K&L≥1, K&L≥2, K&L≥3) the percentage of people classified as having osteoarthritis by each alternative was compared with that of the original description: both directly and by calculating sensitivity and specificity of each alternative description with the original description as the reference standard. Weighted κ compared the agreement between the original description and the alternative descriptions of the K&L criteria: a 4×4 weighted κ to compare all grades of the original and alternative descriptions, and a 2×2 κ to compare these two per cut-off. A κ value above 0.8 is considered very good, between 0.6 and 0.8 good and between 0.4 and 0.6 moderate.14 15 These analyses compared the total number of knees (and not the number per person).
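As an illustration of this comparison, the sketch below computes sensitivity and specificity of an alternative description against the original description at a given cut-off, and maps a κ value to the verbal bands quoted above. The function names are hypothetical. The handling of the boundary values 0.6 and 0.8 and the label for values below 0.4 are assumptions, because the text does not specify them.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(original: Sequence[int],
                            alternative: Sequence[int],
                            cutoff: int) -> Tuple[float, float]:
    """Sensitivity and specificity of the alternative description.

    original and alternative hold the K&L grades of the same knees in the
    same order; the original description is the reference standard.
    """
    tp = fp = tn = fn = 0
    for ref, alt in zip(original, alternative):
        ref_pos, alt_pos = ref >= cutoff, alt >= cutoff
        if ref_pos and alt_pos:
            tp += 1
        elif ref_pos:
            fn += 1
        elif alt_pos:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one reference-positive and one reference-negative knee")
    return tp / (tp + fn), tn / (tn + fp)


def kappa_label(kappa: float) -> str:
    """Verbal band for a kappa value, following the thresholds in the text."""
    if kappa > 0.8:
        return "very good"
    if kappa >= 0.6:  # assumption: boundary values go to the higher band
        return "good"
    if kappa >= 0.4:
        return "moderate"
    return "below moderate"  # assumption: this band is not labelled in the text


# Hypothetical grades for six knees (not study data).
orig_grades = [0, 1, 2, 3, 0, 2]
alt_grades = [1, 1, 2, 2, 0, 1]
print(sensitivity_specificity(orig_grades, alt_grades, cutoff=2))
print(kappa_label(0.66))
```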

Odds ratios (OR) described the association between knee complaints and the different alternatives, adjusted for known risk factors such as age, body mass index and gender. The results are on the person level, which means that the score K&L≥1 is a score of grade 1 or more in one or both knees, K&L≥2 is a score of grade 2 or more in one or both knees, and so on. The results also include percentages per cut-off and per alternative of the grading of osteophytes and JSN.
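The person-level rule can be sketched as follows, together with a crude (unadjusted) OR from a 2×2 table. The OR reported in this study were adjusted for covariates, which requires a regression model that this sketch does not attempt. The function names, the treatment of a missing knee score as `None`, and the example counts are all hypothetical.

```python
from typing import Optional


def person_positive(left: Optional[int], right: Optional[int], cutoff: int) -> bool:
    """True if at least one scored knee has a K&L grade >= cutoff.

    A knee without a score is passed as None and ignored; this handling of
    missing scores is an assumption, not taken from the text.
    """
    grades = [g for g in (left, right) if g is not None]
    return any(g >= cutoff for g in grades)


def crude_odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Unadjusted OR from a 2x2 table of persons.

    a: osteoarthritis and complaints     b: osteoarthritis, no complaints
    c: no osteoarthritis, complaints     d: no osteoarthritis, no complaints
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical examples (not study data).
print(person_positive(0, 2, cutoff=2))   # one knee reaches the cut-off
print(person_positive(1, 1, cutoff=2))   # neither knee reaches the cut-off
print(crude_odds_ratio(30, 70, 20, 80))
```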



The study population had a mean age of 57 years and a mean body mass index of 27.7. Slightly more women than men were included (table 2). The dataset comprised 3071 radiographs of knees (ie, a total of 6142 knees scored), corresponding to 78% of the participants in the RS-III-1 study. Seventeen of these knees had been totally replaced, and no score could be given to 102 knees due to bad quality or a missing image. Approximately 90% of the films were scored by the original K&L description (5378 knees, 87.6%) and approximately 20% of the radiographs with each alternative description. Most (3048) participants supplied information on knee complaints. Although approximately one-third had such complaints, only approximately 5% had radiological osteoarthritis (K&L≥2; table 2).

Table 2

Characteristics of the study population (n=3071)


The radiographs randomly selected to determine the reproducibility of the two readers reflected the distribution of scores of the K&L criteria in the source population. The reproducibility of alternatives 1, 2 and 3 was good (weighted κ 0.66, 0.69 and 0.63, respectively; table 3). In contrast, reproducibility was poor to moderate for the original description and alternative 4 (weighted κ 0.41 and 0.35). For the cut-off K&L≥2 the κ was good. For cut-off K&L≥1 the κ was low for both the original description and for alternative 4, moderate for alternative 3 and good for alternatives 1 and 2 (table 3).

Table 3

Reproducibility of the two readers


The agreement between the original description and the alternatives was moderate (weighted κ approximately 0.50; table 4). For cut-off K&L≥1 all alternatives yield more cases than the original description: for alternatives 1, 2 and 3 this effect is small (24% vs 18%); however, alternative 4 classifies almost 50% of all the knees as osteoarthritis (table 4). Because of these differences, κ is low to moderate for all alternatives (κ<0.45). Sensitivity is moderate for alternatives 2 and 3 (±57%) with good specificity (84%); for alternative 1 sensitivity is 65% and specificity 83%, and for alternative 4 sensitivity is 100% and specificity 61%.
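The link between these figures can be checked with a short calculation. The sketch below takes the rounded values reported above (18% of knees positive by the original description, and the sensitivity and specificity of each alternative) and derives the share of knees each alternative would classify as positive, together with the implied 2×2 κ. It assumes that the 18% applies to each comparison subset, which is an approximation. The function name is hypothetical.

```python
from typing import Tuple


def implied_yield_and_kappa(prevalence: float,
                            sensitivity: float,
                            specificity: float) -> Tuple[float, float]:
    """Share classified positive by the alternative, and the implied kappa.

    prevalence is the share of knees positive by the reference (original)
    description; all inputs are proportions between 0 and 1.
    """
    tp = prevalence * sensitivity
    tn = (1 - prevalence) * specificity
    fp = (1 - prevalence) * (1 - specificity)
    alt_positive = tp + fp
    observed = tp + tn  # proportion of knees on which both descriptions agree
    expected = (prevalence * alt_positive
                + (1 - prevalence) * (1 - alt_positive))  # chance agreement
    kappa = (observed - expected) / (1 - expected)
    return alt_positive, kappa


# Rounded values for cut-off K&L>=1 as reported in the text.
reported = [
    ("alternative 1", 0.65, 0.83),
    ("alternatives 2 and 3", 0.57, 0.84),
    ("alternative 4", 1.00, 0.61),
]
for name, sens, spec in reported:
    share, kappa = implied_yield_and_kappa(0.18, sens, spec)
    print(f"{name}: {share:.0%} classified positive, implied kappa {kappa:.2f}")
```

With these rounded inputs the implied shares are roughly a quarter of the knees for alternatives 1 to 3 and about half for alternative 4, and every implied κ stays below 0.45, in line with the reported results.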

Table 4

Comparison between the four alternative descriptions and the original description, at the three cut-off points

Alternatives 3 and 4 yield approximately the same number of cases as the original description in cut-off K&L≥2, with very good κ and sensitivity. Alternative 2 yields slightly more cases with a good κ, good sensitivity and very good specificity. Alternative 1 yields twice as many cases as the original description, has a slightly lower κ but 100% sensitivity and specificity.

For cut-off K&L≥3 all alternatives yield more cases than the original description: κ is moderate to good, and sensitivity and specificity are 100% for all alternatives.

In conclusion, all four alternatives yield more cases than the original description at all cut-off points; for cut-off K&L≥2 the κ, sensitivity and specificity are good.

Association between knee complaints and different descriptions

For the original description and alternatives 1–3, the presence of osteoarthritis was significantly associated with the presence of knee complaints at all cut-offs (OR with p≤0.001; table 5). Alternative 4 was the exception: the association with knee complaints was not significant at cut-off K&L≥1, and was significant at p≤0.01 at cut-offs K&L≥2 and K&L≥3. Numerically, the original description showed the strongest associations with knee complaints, although the differences were small and not significant compared with alternative 3.

Table 5

Association between the five different descriptions and knee complaints for the three cut-off points

Distributions of osteophytes and JSN grades are shown in figures 1 and 2 (and supplementary table S1, available online only). Whereas alternatives 1–3 show a similar distribution of osteophytes, the distribution is strikingly different in the original description and alternative 4 in cut-off K&L≥1 for grade 0 and grade 1 osteophytes. Also, in the original description in cut-off K&L=0 slightly more grade 1 and 2 osteophytes were scored than in the alternatives. Furthermore, differences were seen in the distribution of JSN in cut-off K&L≥1, where the original description showed considerably more grade 1 JSN than the alternatives (figure 2). Finally, in cut-off K&L≥2 in alternatives 1 and 2 more grade 0 JSN and fewer grade 1 JSN were seen than in the other descriptions.

Figure 1

Distribution of grades of osteophytes in Kellgren and Lawrence (K&L) score per alternative scoring method.

Figure 2

Distribution of grades of joint space narrowing in Kellgren and Lawrence (K&L) score per alternative scoring method.

All other distributions were comparable.


This study shows that the only real impact of variable descriptions of the K&L criteria on the classification of osteoarthritis occurs at the cut-off point K&L≥1, in which all studied alternatives classified more knees as having osteoarthritis. At higher cut-offs the impact was much smaller. At the cut-off of K&L≥1 the reproducibility of most alternatives was much better than that of the original description. The association with knee complaints was slightly weaker for the alternatives than for the original description.

To our knowledge this is the first study to investigate the impact of the use of different descriptions of the K&L classification criteria for all the cut-off points. Felson et al16 compared a modified grade 2 of the K&L classification criteria with the original grade 2 and found no support for their modified grade 2 based solely on JSN.

Reproducibility problems of the original definition and alternative 4 are probably due to differences in the interpretation of K&L grade 1, especially for osteophytic lipping. In addition, the aberrant distribution of osteophytes in K&L≥1 in these two descriptions is due to the possible osteophytic lipping described in this grade and the lack of any description of possible osteophytes in these descriptions. For alternatives 1, 2 and 3 the weighted κ was good, probably because these latter descriptions leave less room for personal interpretation. In the cut-off K&L≥3 of the original description the reproducibility was also low, which might be due to the small number of available cases. We left out the κ for the reproducibility of cut-off K&L≥3, because of the low number of cases.

All four alternatives result in a larger number of cases with K&L grade 1 compared with the original description. The original description of grade 1 is the only one in which ‘doubtful narrowing of joint space’ is needed in grade 1. The distribution of osteophytes in the original description is therefore aberrant compared with the other alternatives in K&L grade 0 as well as the distribution of JSN in cut-off K&L≥1. This is because of the required combination of osteophytes and JSN in cut-off K&L≥1 in the original description.

Whereas JSN is often considered more important for osteoarthritis than osteophytes in joints other than the knee (eg, the hip), three of the five descriptions require possible JSN in K&L grade 2. The original description includes JSN (as doubtful narrowing) in grade 1. Alternatives 1 and 2 do not require JSN at all, which leads to an aberrant distribution of JSN seen in cut-off K&L≥2 in these descriptions compared with the other descriptions. For the association between definite knee osteoarthritis and the presence of knee complaints, possible JSN in grade 2 of the descriptions seems important. The descriptions that include it (ie, the original description and alternative 3) yield the strongest association between definite knee osteoarthritis and the presence of knee complaints. ‘Possible JSN’ therefore needs to be in the description of definite knee osteoarthritis. For cut-off K&L≥1 no additional value was found for doubtful JSN (included in the original description) in the association with knee complaints. For cut-off K&L≥3 the OR are high, probably due to the small number of patients with K&L≥3.

The number of people with knee complaints is in all alternatives approximately 30%, but the actual number of people with knee complaints within the separate alternatives is small. This can lead to a different OR with the same grade description, as seen in cut-off K&L≥2 for alternatives 3 and 4. There are no significant differences between the OR of all descriptions within a cut-off.

This study has some limitations. First, our study population was relatively young, explaining the large number of knees classified as normal or possible osteoarthritis and the low prevalence of severe osteoarthritis. As the K&L criteria are frequently used in such populations, to discriminate between healthy/possible osteoarthritis and mild but definite osteoarthritis, it is particularly important in this context to avoid variability in the descriptions, especially when the cut-off K&L≥1 is applied. Second, we did not score all radiographs with all descriptions, which would have made a comparison possible between all descriptions instead of only a comparison between the alternatives and the original description. A third limitation concerns the radiographs: they were taken with extended knees, and no skyline or lateral radiographs were available to assess patellofemoral osteoarthritis. Semiflexed knees are preferred over extended knees to evaluate structural severity on radiographs, especially for joint space width,17 although the K&L criteria were developed in extended knees. The lack of information about patellofemoral osteoarthritis is a limitation in the association between the different descriptions for the tibiofemoral K&L score and complaints of the knee, because patellofemoral osteoarthritis could also be a cause of pain.18 Moreover, we only had information about the existence of knee complaints per person and not per knee. Therefore, we could not do a knee-specific analysis, which would provide more accurate information.

In conclusion, the different descriptions of the K&L classification criteria have a direct impact on the yield of cases, especially with grade 1 and, to a lesser extent, with grade 2 as the cut-off. All descriptions have strengths and weaknesses. All alternatives yield more cases than the original description in grade 1. The reproducibility of grade 1 of the original description and alternative 4 is low, due to the influence of personal interpretation of possible osteophytic lipping. Alternatives 1 and 2 have an aberrant distribution of JSN in grade 2. Alternative 3 is less extensive in grades 3 and 4 than the original description; both have a high association with knee complaints in all cut-offs. Which description is best depends on the purpose. Based on these results we recommend the use of the original description to distinguish definite/mild osteoarthritis (K&L≥2) from none/possible osteoarthritis (K&L<2), and we recommend the use of alternatives 1, 2 or 3, or a modification of grade 1 of the original description (‘doubtful narrowing of joint space and/or possible osteophytes’) for distinguishing no osteoarthritis (K&L=0) from possible osteoarthritis (K&L=1).


