Rasch analysis of the Western Ontario MacMaster Questionnaire (WOMAC) in 2205 patients with osteoarthritis, rheumatoid arthritis, and fibromyalgia
  1. Frederick Wolfe (a),
  2. Sheldon X Kong (b)
  1. (a) Arthritis Research Center and University of Kansas School of Medicine, Wichita, Kansas, USA; (b) Merck and Co Inc, Whitehouse Station, New Jersey, USA
  1. Dr F Wolfe, Arthritis Research Center, 1035 N Emporia, Suite 230, Wichita, KS 67214, USA.


OBJECTIVE Advances in health measurement have led to the application of Rasch Item Response Theory (IRT) analysis (Rasch analysis) to evaluate instruments measuring health status and quality of life of patients, including the Health Assessment Questionnaire and SF-36. This study investigated the extent to which the Western Ontario MacMaster osteoarthritis questionnaire (WOMAC) satisfies the Rasch model, particularly in respect to unidimensionality, item separation, and linearity.

METHODS The study included a total of 2205 patients, 1013 with rheumatoid arthritis (RA), 655 with osteoarthritis of the knee or hip (OA), and 537 with fibromyalgia. All patients completed the WOMAC as part of a longitudinal study of rheumatic disease outcomes. To examine whether the WOMAC pain and function scales each fits the Rasch model, the Winsteps program was used to assess item difficulty, scale unidimensionality, item separation, and linearity.

RESULTS Although the WOMAC worked best in OA, regardless of disorder, both the pain and function scales were unidimensional, had adequate item separation, and had a long range (25–150) of linearity in the function scale. Several functional items, however, had a high information weight fit (INFIT) statistic, indicating poor fit to the model. These items included “getting in and out of the bath” and “going down stairs.”

CONCLUSION The WOMAC generally satisfies the requirements of Rasch item response theory across all disorders studied, and is an appropriate measure of lower body function in OA, RA and fibromyalgia. Although some individual items do not fit well, it is not likely that removing such items would result in more than overall minimal differences, and it will be difficult to remove traces of multidimensionality while keeping the central constructs of progressive lower body musculoskeletal abnormality intact. In addition, it is possible that a “purer”, still more unidimensional instrument would be less useful in clinical trials and epidemiological studies by restricting the range of the scale.

  • Rasch analysis
  • osteoarthritis
  • rheumatoid arthritis
  • fibromyalgia

Among the major determinants of health related quality of life in osteoarthritis of the knee or hip (OA) are pain and functional loss. The Western Ontario MacMaster (WOMAC) questionnaire was designed to measure these components by assessing 17 functional activities, five pain related activities, and two stiffness items.1 This instrument has been widely studied, and many of its psychometric properties are known.2-6 It has been widely used in clinical trials7 ,8 because of its sensitivity to change and its construct validity.

Previous studies of function in OA have shown that the WOMAC is more sensitive to change and has greater efficiency than most other instruments used to assess OA, including the Health Assessment Questionnaire (HAQ) and the SF-36 Health Survey.1 ,8-10 This is not surprising, as the HAQ was developed and validated in a rheumatoid arthritis (RA) population, and the SF-36 was developed as a generic instrument for use in all populations. Despite this, the WOMAC, like other instruments, does not correlate well with radiographic progression, and is sensitive to influences such as the presence of back pain and the number of somatic symptoms.11

Advances in health measurement have led to the application of Rasch Item Response Theory (IRT) analysis (or Rasch analysis)12 to assess instruments measuring health status and quality of life,13-26 including the HAQ21 and SF-36.25 Although an instrument may measure several health concepts with different subscales, a good subscale should be unidimensional, not have floor or ceiling effects, and have adequate spread along a single dimension. Individual scale items that do not fit the unidimensional model may be measuring a different concept than what was anticipated (non-unidimensionality), may be misunderstood by respondents, or may not apply to all persons under study. An additional problem with measurement scales is that some of the items may be redundant, thereby contributing no additional information.

Studies of the SF-36 physical function scale (SF-10) have shown elements that do not have adequate fit to a unidimensional model.21 ,25 Recent Rasch IRT studies by Tennant et al showed that the HAQ has impaired construct validity in OA, and that in rheumatoid arthritis (RA) it had ceiling effects and inadequate spread.21 This study examined whether similar problems might exist with the WOMAC. In addition, we have used the WOMAC in RA and fibromyalgia. Therefore we also examined whether the Rasch analysis yielded similar results across the three disorders. If results were found to be similar they would offer additional support for the use of the WOMAC in illnesses other than OA, and they would also confirm the results of Rasch analyses in OA.


Data for this study were obtained from patients participating in a long term outcome study of OA, RA and fibromyalgia. Most patients were attending the Arthritis Center in Wichita, Kansas, an outpatient rheumatology clinic and research centre where longitudinal data on OA, RA and fibromyalgia patients had been collected since 1974. As part of the data collection, questionnaires were mailed at six month intervals to patients who chose to participate in the longitudinal study. The characteristics of this data bank and the methods of data collection have been described previously.27 ,28 The current report relied on mailings sent between July 1996 and January 1998, when the WOMAC questionnaire was added to the assessment package, and included 2205 patients: 1013 with RA, 655 with OA, and 537 with fibromyalgia. Among the RA patients, 447 of the 1013 were members of a US inception cohort of RA who were recruited during the study period from the practices of rheumatologists, and who had a disease duration of less than one year when first seen by their rheumatologists. Of the 655 OA patients, 348 were recruited during the study period by media and mailed advertising for participation in an OA outcome project. Seventy five of the 537 fibromyalgia patients were from centres other than Wichita and had participated in previous fibromyalgia outcome studies.29 When more than one WOMAC assessment was available, we made use of the most recent assessment. Thus the analyses in this study are cross sectional, not longitudinal.

Patients with RA and fibromyalgia satisfied published diagnostic/classification criteria.30 ,31 Patients with OA had definite radiographic abnormality and knee pain, and clinically had OA. Although most satisfied published criteria for OA,32 ,33 it was the purpose of this project to identify mild cases so that minimal entry criteria for this study included a clinical diagnosis of OA, definite osteophytes, and characteristic knee pain.

All patients completed the WOMAC. The multidimensional Western Ontario MacMaster osteoarthritis index assesses pain, stiffness, and physical function activities related to OA of the hip or knee.1 ,2 ,6 ,7 ,34 In this study the WOMAC was used in its visual analogue scale (VAS) format (version VA3.0) but was converted to an 11 step 0–10 scale for Rasch analyses. This method of using a VAS scale in Rasch analysis has been previously described.25 In these analyses we studied the 17 item function scale and the 5 item pain scale, but we did not analyse the WOMAC stiffness scale because it had only two items. This version of the WOMAC questionnaire does not refer specifically to any joint.
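The conversion from a VAS reading to an 11 step category can be sketched as follows. This is only an illustration: the exact rule used in the study is the one described in the cited reference, and the 100 mm line length, the rounding rule, and the function names here are assumptions.

```python
def vas_to_11_step(vas_mm: float, vas_length_mm: float = 100.0) -> int:
    """Map a VAS reading to an integer category 0..10.

    Assumption: a 100 mm line and round-half-up to the nearest step.
    The study's actual conversion rule may differ.
    """
    if not 0.0 <= vas_mm <= vas_length_mm:
        raise ValueError("VAS reading outside the line")
    # Rescale to 0-10, then round half up to the nearest whole step.
    return int(vas_mm / vas_length_mm * 10.0 + 0.5)


def scale_total(item_steps: list[int]) -> int:
    """Sum the 0..10 item categories of one subscale."""
    if any(not 0 <= s <= 10 for s in item_steps):
        raise ValueError("item category must be in 0..10")
    return sum(item_steps)
```

Under this scheme the 17 function items sum to a total between 0 and 170, and the 5 pain items to a total between 0 and 50.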


Data were analysed using Stata Version 5.0 (reference 35) for general descriptive statistics. For Rasch analyses we used Winsteps (Bigsteps) Version 2.8 (reference 36).

In the Bigsteps Rasch analysis, patients and item scores are used to “calibrate” items on a logit scale where the midpoint of the scale is 0. Items at one end of the scale are “easier” and items at the other end are more “difficult.” In the current analyses, items with a negative (−) calibration are more difficult. Individual items that are at least 0.15 logits apart represent individual strata.37 It is therefore generally desirable that the separation distance between items be 0.15 logits or more; otherwise one item is not distinctly separate from the next. Another important characteristic of a good instrument is that it has a good overall separation (expressed in logits). The greater the separation, the more distinct strata are identified. But individual items should also not be too far apart, or gaps will occur between them. When the data fit the Rasch model, the information weighted fit statistic (INFIT), expressed as the mean square INFIT (MNSQ) statistic, will lie between a lower limit of 0.7–0.8 and an upper limit of 1.2–1.3. Good fit indicates that the subscale items contribute to a single underlying construct (unidimensionality). When applied to individual items, INFIT statistics > 1.2–1.3 indicate that the item does not contribute to the underlying construct or is “noisy”. An INFIT statistic of 1.3 means that there is 30% more noise than expected. Generally the higher number (1.3) is taken as the upper limit of allowable “noise”. INFIT (MNSQ) values of < 0.7–0.8 indicate items that are muted. This may occur when there are several items that are similar or highly correlated, or when one item is dependent on another. Generally the lower number (0.7) is taken as the lower limit for INFIT. “A mean-square of 0.7 indicates 43% more ambiguity in the inferred measure than modeled.”38 The choice of limiting values is in part a function of the purpose for which the scale will be used. Wright and Linacre indicate that reasonable fit statistics are between 0.7 and 1.3 for “run of mill” tests and between 0.8 and 1.2 for “high stakes” tests.38 Thus the scale that perfectly fits the Rasch model is unidimensional, has adequate separation so that there are sufficient strata, has items that are not calibrated too far apart, and has individual items that all contribute to the underlying construct.
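To make the INFIT mean square concrete, the sketch below computes it for a single item under the simplest (dichotomous) Rasch model: the sum of squared residuals divided by the sum of the modelled variances. It is not the Winsteps (Bigsteps) implementation, and the WOMAC items are rated on 11 steps rather than two, so this is a simplified illustration only. The function names are assumptions, and the sketch uses the conventional orientation in which a larger difficulty value means a harder item, the reverse of the sign convention used in the present analyses.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a positive response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def infit_mnsq(responses: list[int], abilities: list[float], difficulty: float) -> float:
    """Information weighted mean square fit for one item.

    responses: 0/1 answers, one per person
    abilities: person measures in logits, same order as responses
    difficulty: item calibration in logits (assumed: higher = harder)
    """
    squared_residuals = 0.0
    model_variance = 0.0
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)
        squared_residuals += (x - p) ** 2   # observed minus expected, squared
        model_variance += p * (1.0 - p)     # information contributed by this person
    return squared_residuals / model_variance
```

A value near 1.0 means the observed residual variation matches what the model predicts; responses that contradict the model (able persons failing, less able persons succeeding) push the value above 1.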


Figure 1 depicts the distribution of WOMAC function scores for the 2205 study participants. As shown in table 1, WOMAC function and pain were most abnormal in the 537 fibromyalgia patients and least abnormal in 1013 RA patients. The 655 patients with knee and hip OA held a middle position.

Figure 1

Distribution of 2205 WOMAC function scores as modelled in a kernel density plot. Actual range is 0–170.

Table 1

WOMAC Scores in RA, OA and fibromyalgia


Rasch analyses were performed on each patient group separately. For OA patients the overall INFIT and separation statistic for pain was 1.01 (table 2) with an item separation of 13.26. Similar values were 1.02, 9.11 for RA and 1.02, 7.97 for fibromyalgia. More positive scores for average item calibration indicate easier categories, and more negative scores indicate more difficult categories. Table 2 shows that, for OA, individual inter-item differences of at least 0.15 logits were generally identified, indicating satisfactory item separation. Among RA patients, however, items about night pain, pain with walking, and pain standing were not as well separated, and this was true in fibromyalgia patients as well. For fibromyalgia patients the easiest category was walking, in contradistinction to the other groups where sitting and night pain were the easiest categories—fitting the clinical complaints that are often heard. As with RA patients, middle variables had calibrations similar to fibromyalgia. These data suggest that for pain scores the WOMAC performs appropriately in terms of INFIT and separation statistics for all patient groups, but with fewer distinct strata in the RA and fibromyalgia groups as indicated by inter-item differences of less than 0.15 logits.
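The 0.15 logit criterion for inter-item separation can be checked mechanically by ordering the items by calibration and comparing neighbours. The sketch below shows one way to do this; the function name is an assumption, and the calibrations used in the example are invented for illustration and are not the values reported in table 2.

```python
def poorly_separated_pairs(calibrations: dict[str, float], min_gap: float = 0.15):
    """Return adjacent item pairs whose calibrations differ by less than min_gap logits."""
    ordered = sorted(calibrations.items(), key=lambda kv: kv[1])
    pairs = []
    for (name_a, cal_a), (name_b, cal_b) in zip(ordered, ordered[1:]):
        if cal_b - cal_a < min_gap:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical calibrations in logits (NOT data from this study).
example = {
    "stairs": -0.60,
    "walking": -0.10,
    "standing": -0.02,
    "night": 0.05,
    "sitting": 0.67,
}
close_pairs = poorly_separated_pairs(example)
```

With these invented values the middle three items fall within 0.15 logits of their neighbours, which is the pattern described for the RA and fibromyalgia groups.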

Table 2

Average WOMAC pain item calibrations, SE, and INFIT statistic ordered by calibration


As with the pain scale, general INFIT and separation statistics for function were quite satisfactory. INFIT and separation values for OA, RA and fibromyalgia were 1.02 and 12.21; 1.01 and 13.58; and 1.02 and 11.07, respectively.

The individual items of the 17 item function scale were examined to understand their appropriateness as part of a unidimensional scale. Table 3 and figure 2 present these items ordered by their INFIT statistics. As indicated in the methods section, INFIT (MNSQ) values greater than 1.2–1.3 indicate items that have unexpected values or items that may tap additional dimensions. Among OA patients, “getting in and out of the bath”, “going down stairs”, and performing “heavy chores” have scores greater than 1.20. These items also poorly fit the unidimensional model among patients with RA and fibromyalgia, although RA patients also had other poorly fitting items. These items and their relation to diagnostic group membership can be seen clearly in figure 2.

Table 3

Average WOMAC functional item calibrations, SE, and INFIT statistic ordered by misfitting

Figure 2

INFIT (MNSQ) statistics for WOMAC functional scale items. INFIT statistics > 1.2–1.3 indicate that the item does not contribute to the underlying construct. INFIT (MNSQ) values of < 0.7–0.8 indicate items that are muted.

Data in table 3 and figure 2 also indicate items that may be redundant or have dependencies. These items were identified by INFIT statistics below the lower limit of 0.7–0.8. Doing light chores and getting in and out of a car were among those items for all groups, and rising from a bed was such an item for the RA and OA groups. When the INFIT criterion is set at 0.7, only “getting in and out of a car” in the RA group is misfitting.
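The dependence of these classifications on the chosen limits can be illustrated with a small helper. The function name and labels are assumptions, and the default limits are the 0.7 and 1.3 values quoted in the methods for “run of mill” tests.

```python
def classify_infit(mnsq: float, lower: float = 0.7, upper: float = 1.3) -> str:
    """Label an item by its INFIT mean square relative to the chosen limits."""
    if mnsq < lower:
        return "muted"   # possibly redundant or dependent on another item
    if mnsq > upper:
        return "noisy"   # may tap another dimension
    return "fits"
```

An item with an INFIT of 0.75, for example, is labelled as fitting under the 0.7–1.3 limits but as muted under the stricter 0.8–1.2 limits, which is why the number of misfitting items changes with the criterion.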

Table 3 and figure 3 display the item calibrations for the WOMAC function items for the three groups. Going upstairs and performing heavy chores were clearly the most difficult items. The figure demonstrates the general similarity of results among the groups as well as the differences between individual items.

Figure 3

Average calibration in logits for WOMAC functional scale items. Negative calibrations indicate more difficult items. The more positive the score the easier the item.

Graphic examination of item calibrations (fig 4) indicated a large range (25–150) in which the WOMAC function score was linear in OA patients. The equivalent range on a 0–10 scale is 1.5–8.8. There is no substantial ceiling effect, for only two per cent of patients have scores greater than 150. By contrast, 21 per cent of patients had scores less than 25, reflecting the larger number of patients with very mild disease. Although INFIT statistics for these lower ranges (steps) indicate appropriate fit, as with scores over 150 there is also greater distance (or severity) between each step of WOMAC function in the ranges below 25 and above 150 than in the range of 25 to 150. Similar results (not shown) were obtained in RA and fibromyalgia patients.
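The equivalence between the 0–170 function score and the 0–10 scale is a simple rescaling, sketched below. The function name is an assumption; the maximum of 170 is the range given for the function score in figure 1.

```python
def to_0_10(score: float, max_score: float = 170.0) -> float:
    """Rescale a summed WOMAC function score (0..170) to a 0..10 scale."""
    return score / max_score * 10.0


# Bounds of the linear range: 25 -> 1.5 and 150 -> 8.8 after rounding to one decimal.
linear_range = (round(to_0_10(25), 1), round(to_0_10(150), 1))
```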

Figure 4

Plot of severity in logits versus WOMAC functional score for OA. Curves are similar in RA and fibromyalgia. The WOMAC function score is linear over the range of 25–150 (equivalent range on 0–10 scale is 1.5–8.8). Less than two per cent of patients have scores greater than 150. Twenty one per cent have scores less than 25 (see fig 1).


The data presented here indicate that the WOMAC generally performs well in OA, RA and fibromyalgia. The overall fit statistics, scaling and separation are quite satisfactory; and results obtained in the three disorders are quite similar. A few individual items, however, do not have adequate fit statistics. For items with high INFIT statistics, the item may be addressing a different process or content—that is, an item in another dimension. Getting in and out of the bath may require arm strength or general well being; and the item might measure cardiopulmonary problems or weakness, qualities that differ from lower extremity problems. Performing heavy chores or going upstairs might have the same trouble. Going downstairs could tap into problems related to balance or other neurological conditions. Another, and more likely explanation, is that tub baths are not common, showers being used, so that the experience of patients is not uniform. Similarly, problems with stairs may reflect the limited use of stairs by many patients. These items were troublesome across all of the three disorders so that it is likely that they represent true psychometric problems for the scales.

At the other end of the spectrum, where the INFIT statistic is low, redundancies can cause poor fitting. This is true when one item answers another or is strongly correlated with it; or if there are similar items. Getting in and out of a car, doing light chores and rising from a bed were such items for the study populations.

The overall fit statistic was satisfactory in RA and fibromyalgia, although the overall separation for WOMAC pain was greater in OA than the other conditions, indicating that pain assessment was better in OA than RA and fibromyalgia.

A case could be made to reformulate the WOMAC by eliminating redundant items and eliminating items such as getting in and out of the bath, going downstairs, and possibly doing heavy chores. We have recently shown that the presence of low back pain is highly correlated with WOMAC scores, as are general medical complaints.11 It is possible that some of the noise that is present in the WOMAC assessment of OA comes through such mechanisms and is more strongly expressed in items with high INFIT scores.

As we expected, we found evidence of non-linearity (ceiling and floor effects), as shown in figure 4. This non-linear effect was unimportant as a ceiling effect because only two per cent of patients had values above the ceiling. However, 21 per cent had values in the range of the floor effect. The non-linear effects have two practical implications. Firstly, it can generally be assumed that a WOMAC score that is, for example, two or three times greater than another score, and that falls within the zone of linearity, is measuring severity that is approximately two or three times greater. But no such assumption can be made for scores above the ceiling or below the floor level. Secondly, provided there are no missing data that require interpolation, ordinary statistical methods are appropriate to analyse the data when a monotonic relation between the scale and other variables is expected, as is most often the case. One way out of these problems is to apply Rasch transformations to the WOMAC scale results, thereby converting the observed scale into a linear measure. The advantage of doing this is that all of the values are then interval. Clinically, then, you can compare patients on a linear measuring scale. The benefit of doing this must be weighed against the difficulty it imposes, a difficulty that is not inconsiderable. For use in clinical trials in which patients would be expected to have scores within the linear range, floor and ceiling effects would not be expected to be a problem, but this might not be true in observational studies of populations where many patients may have low WOMAC scores.

While items with high INFIT scores may have addressed elements that are not unidimensional, we do not believe this should be interpreted as a central critique of the WOMAC. Not only were the overall fit, scaling and separation of the WOMAC quite good across all of the conditions studied, but numerous studies have also shown that the WOMAC is a sensitive and effective instrument in OA. Testing whether elimination of the non-fitting items improves the usefulness of the WOMAC is relatively easy to do, by analysing a clinical trial or similar study with and without the non-fitting items. But given the general overall good fit, it is not likely that removing such items would result in more than minimal differences. In addition, it will be difficult to remove traces of multidimensionality while keeping the central constructs of progressive lower body musculoskeletal abnormality intact; and it is possible that a “purer”, still more unidimensional instrument would be less useful in clinical trials and epidemiological studies by restricting the range of the scale.

There are a number of items that require comment. The WOMAC was developed on samples that included hip OA and knee OA patients. This study only uses knee OA patients. Therefore our conclusions about the WOMAC in OA only refer to knee OA. It is also true that the WOMAC was not validated nor designed for use in RA or fibromyalgia. This study deals with the psychometric properties of the instrument in these disorders. Additional studies will be required to determine if the WOMAC is useful or valid in these conditions. Even so, the WOMAC taps into dimensions that are key to fibromyalgia and are not fully evaluated by any other current instrument, and it may provide additional information (for research purposes) regarding lower body function in RA.


Funding: supported in part by grants (AR43584) from the National Institutes of Health and Merck and Co, Inc.

