Background Comparing treatment effectiveness over time in observational settings is hampered by several major threats, among them confounding and attrition bias.
Objectives To develop European Alliance of Associations for Rheumatology (EULAR) points to consider (PtC) when analysing and reporting comparative effectiveness research using observational data in rheumatology.
Methods The PtC were developed using a three-step process according to the EULAR Standard Operating Procedures. Based on a systematic review of methods currently used in comparative effectiveness studies, the PtC were formulated through two in-person meetings of a multidisciplinary task force and a two-round online Delphi, using expert opinion and a simulation study. Finally, feedback from a larger audience was used to refine the PtC. Mean levels of agreement among the task force were calculated.
Results Three overarching principles and 10 PtC were formulated, addressing, in particular, potential biases relating to attrition or confounding by indication. Building on the Strengthening the Reporting of Observational Studies in Epidemiology guidelines, these PtC emphasise the need to define the baseline for analysis and treatment effectiveness. They also focus on the reasons for stopping treatment as an important consideration when assessing effectiveness. Finally, the PtC recommend providing key information on missingness patterns.
Conclusion To improve the reliability of an increasing number of real-world comparative effectiveness studies in rheumatology, special attention is required to reduce potential biases. Adherence to clear recommendations for the analysis and reporting of observational comparative effectiveness studies will improve the trustworthiness of their results.
- patient reported outcome measures
- outcome assessment
- health care
Observational data are increasingly used to analyse the safety and effectiveness of new therapies in different subgroups of patients.1 For effectiveness studies, as in randomised controlled trials (RCTs), authors typically report the proportion of patients reaching a defined clinical threshold after a set time (eg, for rheumatoid arthritis (RA): European Alliance of Associations for Rheumatology (EULAR) response rates, EULAR/American College of Rheumatology remission or low disease activity (LDA) rates). Comparing the proportion of responders across treatments is relatively straightforward in head-to-head RCTs, since randomisation makes treatment groups similar in terms of patient characteristics. However, clinical trials have restrictive inclusion criteria and usually short follow-up, and thus do not provide a full picture of clinical responses in the broader patient population seen in clinical practice, especially for chronic diseases.2 Pragmatic RCTs may provide a more real-world picture of comparative effectiveness because of their more liberal inclusion criteria, but they also have short follow-up, at least under full randomisation.3
While comparative effectiveness should also be assessed in observational studies and registers, the interpretation of the results is hampered by the limitations of observational studies,4 in particular two of them. The first limitation is confounding. For example, in RA registers, biological disease-modifying antirheumatic drugs (DMARDs) other than tumour necrosis factor inhibitors (non-TNFi) are often prescribed to older patients with a higher burden of disease than patients receiving TNFi.5 6 Assumed advantages of one treatment may channel patients with particular characteristics towards it, so that the evolution of disease activity can be incorrectly attributed to the treatment itself. This issue is often referred to as confounding by indication or channelling bias. The second limitation is a specific type of selection bias called attrition bias, which occurs when treatment groups differ systematically in the number of patients lost from a study or in the way they are lost.7 Indeed, when considering effectiveness after a certain time, it is necessary to decide how to account for patients who stopped the treatment, for example, because of an adverse event or lack of effect, and for patients lost to follow-up (eg, who stopped participating in the registry). Patients who remain on the same treatment may respond better to it, so analysing only them selects in favour of responders and overestimates effectiveness. If attrition differs between treatments, for example, if one treatment is discontinued more often or for different reasons, the comparative effectiveness analysis will be biased.
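To make channelling bias concrete, the following Python sketch uses invented counts (not register data). Within each age group, TNFi and non-TNFi have identical response rates, but non-TNFi is channelled towards older patients, who respond less often. The crude comparison therefore suggests a difference that disappears after direct standardisation to the pooled age distribution. The age strata, counts and function names are hypothetical, and a real analysis would adjust for many confounders at once, for example with regression or propensity score methods.

```python
from collections import defaultdict

# Hypothetical counts: (age_group, treatment) -> (patients, responders).
# Within each age group both treatments have the same response rate;
# only the age mix differs (non-TNFi is channelled to older patients).
COUNTS = {
    ("younger", "TNFi"): (80, 40),
    ("younger", "non-TNFi"): (20, 10),
    ("older", "TNFi"): (20, 6),
    ("older", "non-TNFi"): (80, 24),
}


def crude_rate(counts, treatment):
    """Response rate ignoring age: responders / patients on this treatment."""
    patients = sum(n for (_, t), (n, _) in counts.items() if t == treatment)
    responders = sum(r for (_, t), (_, r) in counts.items() if t == treatment)
    return responders / patients


def standardised_rate(counts, treatment):
    """Direct standardisation: average the age-specific response rates of
    this treatment, weighted by the age distribution of the pooled cohort."""
    stratum_size = defaultdict(int)
    for (group, _), (n, _) in counts.items():
        stratum_size[group] += n
    total = sum(stratum_size.values())
    rate = 0.0
    for group, size in stratum_size.items():
        n, r = counts[(group, treatment)]
        rate += (size / total) * (r / n)
    return rate


crude_diff = crude_rate(COUNTS, "non-TNFi") - crude_rate(COUNTS, "TNFi")
std_diff = standardised_rate(COUNTS, "non-TNFi") - standardised_rate(COUNTS, "TNFi")
print(f"crude difference (non-TNFi minus TNFi): {crude_diff:+.2f}")
print(f"age-standardised difference:            {std_diff:+.2f}")
```

With these counts, non-TNFi appears 12 percentage points worse in the crude comparison, while the age-standardised difference is zero.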
EULAR has previously published PtC on how to use observational data to analyse and report safety data in biologic registers, and on how to report clinical trial extension studies.8 9 The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines offer a starting framework for reporting studies. For biologic registers, these PtC build on the STROBE guidelines,10 aiming to provide more detailed guidance on reporting complex exposure characteristics, such as time exposed, drop-out and switching from one exposure to another, with a clear focus on effectiveness outcomes and their analyses. There is an unmet need for PtC on the analysis of effectiveness in purely observational real-world data, especially registers, addressing three key aspects of real-world effectiveness. First, the baseline of treatment is often hard to ascertain because patients start and stop different treatments over time. For example, the 1-year follow-up visit for one treatment could take place 3 months after that treatment was stopped at month 9, and coincide with the start of another treatment. Second, visits often occur at variable time points. Third, treatment discontinuation is substantial and may be informative about treatment success, for instance when patients stop for ineffectiveness. A task force was therefore created to develop EULAR PtC to analyse and report comparative effectiveness over time (eg, treatment response rate after a set time) in rheumatology.
After approval by the EULAR Executive Committee, the convenors (DSC and AF) and the fellow (KL) convened a multidisciplinary task force to develop the PtC, guided by the consensus process outlined in the 2014 updated EULAR Standard Operating Procedures (SOPs).7 The task force consisted of: eight rheumatologists, four epidemiologists/rheumatologists, two statisticians (DSC and TF), two patient representatives (MdW and SRS) who were also social sciences researchers, and two health professionals (TS and AS).
Two 1-day face-to-face task force meetings were held. The first meeting was convened in March 2019 to clarify the focus of the task force, identify the scope of methods to be considered in the systematic literature review (SLR), and determine alternative sources of information on accurate analyses for assessing comparative effectiveness. The SLR was performed by the research fellow (KL), with support from two task force members (JK and SAB) and one of the convenors (DSC), to identify relevant peer-reviewed publications in key rheumatology journals (Scientific Journal Ranking >2) published between January 2008 and March 2019, and to examine how analysis and reporting evolved over time. Studies without full text or with fewer than 100 patients were excluded. The aim was to identify studies comparing treatments on various outcomes in longitudinal observational studies of real-world patient populations. Of the 9969 abstracts screened, 305 full-text articles were assessed for eligibility and 211 were included. Only 35% of the included studies mentioned attrition, and most did not use a method that adjusts simultaneously for confounding and attrition when estimating comparative effectiveness over time (for a full description of the SLR, see reference 11). During the first meeting, the task force also decided to perform a statistical simulation study to assess the accuracy of various methods found in the SLR or suggested by task force members.
A first draft of the PtC, including 13 items, was prepared by the fellow (KL) and the two convenors (DSC and AF). The SLR and simulation results were presented to the task force at a second meeting in November 2019, where the task force formulated a set of overarching principles and consensus statements based on the initial draft. Agreement on the inclusion of each item and on its exact wording was then sought through a two-round online Delphi, in which participants could also leave comments; consensus was defined as ≥75% of participants voting ≥8 on a 10-point scale. When no consensus was reached, the statement was reformulated and submitted to a second vote. The mean and SD of the level of agreement among task force members, as well as the percentage of participants voting ≥8/10, were then calculated.
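The Delphi consensus rule and the agreement summaries can be expressed as a short computation. The sketch below is a minimal Python illustration, assuming integer votes on the 10-point scale and a sample standard deviation (the text does not state which SD was used); the function name and the votes are hypothetical.

```python
from statistics import mean, stdev


def summarise_votes(votes, threshold=8, required_share=0.75):
    """Summarise one Delphi item scored on the 10-point agreement scale.

    Consensus: at least `required_share` of participants vote >= `threshold`.
    """
    share_high = sum(v >= threshold for v in votes) / len(votes)
    return {
        "mean": mean(votes),
        "sd": stdev(votes),  # sample SD (assumption; the SD type is not stated)
        "pct_ge_threshold": 100 * share_high,
        "consensus": share_high >= required_share,
    }


# Hypothetical votes from eight task force members on one statement.
item_votes = [10, 9, 9, 8, 10, 7, 9, 8]
summary = summarise_votes(item_votes)
print(summary)
if not summary["consensus"]:
    print("No consensus: reformulate the statement and vote again")
```

Here seven of eight votes are ≥8 (87.5%), so the hypothetical item reaches consensus with a mean agreement of 8.75.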
The final manuscript was reviewed and approved by all task force members and approved by the EULAR Council (formerly EULAR Executive Committee).
Treatment effectiveness relates to how well a treatment performs in routine clinical settings
Although everyone can easily endorse this overarching principle, its value depends on how comprehensively treatment performance is defined by all stakeholders involved, including patients and, potentially, carers (non-professional persons helping patients). It is critical that patients are involved in selecting the outcomes to be measured, because their perspective on which outcomes are important differs from that of researchers, health professionals and other stakeholders. Furthermore, how well a treatment performs is often judged relative to other treatments rather than in absolute terms. In practice, many therapeutic options are available, and a study is more useful if it includes ‘all’ of them rather than comparing just two or three treatments. This makes it easier to evaluate channelling and gives a more complete picture of effectiveness relative to the options that would actually be relevant choices in practice. This is in line with the EULAR 2018–2023 strategy, which aims to deliver a comprehensive quality of care framework for patients with rheumatic and musculoskeletal diseases (RMDs) (https://www.eular.org/eular_strategy_2018.cfm).
Observational studies have several limitations, including confounding and missing data
Observational studies often have longer follow-up than RCTs and represent ‘real-life’ patients as seen in typical clinical practice, with multimorbidities, unscheduled changes in treatment and incomplete adherence. They are also necessary to investigate some exposures that could not, technically or ethically, be randomised. Observational studies are thus invaluable companions to RCTs. However, because patients are not randomised, their results can be biased by confounding.
The main issue with missing data in observational studies is more one of quantity than of quality. Indeed, observational studies often have much more missing data than RCTs, partly because fewer resources are available for data collection and partly because of their longer follow-up. In addition, their design as non-interventional studies mirrors clinical practice. This means that patients may move to another region (and be lost to follow-up), but they could also be lost to follow-up because of the severity of their (comorbid) disease.12 13 Patients may decide to stop participating in the study, or not to fill in specific data. Clinicians, for their part, may also act differently according to specific patient characteristics or routine procedures. Therefore, data may sometimes be missing at random, but not always.
Robust and transparent epidemiological and statistical methods increase the trustworthiness of the results
Evidence-based medicine supports clinical decision-making, allowing results to ‘make sense’, thereby ensuring better adherence to treatments and advice. It may also potentially improve patients’ quality of life by helping them to be confident that they made the best possible choice. For complex observational studies, achieving this trustworthiness of results requires particular attention to robust, transparent and detailed methods.
Points to consider
PtC 1: reporting of comparative effectiveness in observational studies must follow the STROBE guidelines
The STROBE guidelines already provide comprehensive reporting guidelines for observational studies.10 However, they lack specific recommendations for longitudinal analyses.
PtC 2: to provide a more complete picture of effectiveness, several outcomes across multiple health domains should be compared
Effectiveness is a complex construct and cannot be assessed by a single outcome. Though several studies can each look at a different outcome, a more prudent approach is to include several outcomes, across multiple health domains, to acknowledge the variety of interests of the involved stakeholders.
PtC 3: loss to follow-up from the study sample must be reported by treatment
The following two statements aim to address potential attrition bias by providing the information needed on the extent of loss to follow-up and on potential differential loss to follow-up. Loss to follow-up is defined as having no additional information about a patient after a given time point. In contrast, treatment discontinuation is defined as knowing that the patient stopped a specific treatment at a given time point, whether or not there is information after that time point (eg, start of a new treatment). Because treatment regimens often combine several drugs, changes in therapy may need to be described more specifically than as a simple start and stop of the main treatment (eg, start of a conventional synthetic DMARD in addition to a biologic/targeted synthetic DMARD). Loss to follow-up must be reported by treatment or treatment combination, to provide information on potential differential loss to follow-up.
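The distinction between loss to follow-up and treatment discontinuation can be kept explicit when tabulating by treatment. The following minimal sketch assumes a hypothetical record layout with one row per treatment course, storing the month of the last available data and, if known, the month the treatment was stopped; it counts the two events separately at a chosen time point.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Course:
    """One treatment course (hypothetical record layout)."""
    treatment: str                      # treatment or treatment combination
    last_contact_month: float           # months from start to the last available data
    stop_month: Optional[float] = None  # month of a known treatment stop, if any


def follow_up_summary(courses, horizon):
    """Per treatment: courses, known discontinuations and losses to follow-up
    by `horizon` months. The two are counted separately: a course can be
    discontinued (known stop) and still have data afterwards."""
    summary = defaultdict(lambda: {"n": 0, "discontinued": 0, "lost": 0})
    for c in courses:
        row = summary[c.treatment]
        row["n"] += 1
        if c.stop_month is not None and c.stop_month <= horizon:
            row["discontinued"] += 1
        if c.last_contact_month < horizon:  # no information after this point
            row["lost"] += 1
    return dict(summary)


courses = [
    Course("TNFi", 24),
    Course("TNFi", 6),                      # no data after month 6
    Course("TNFi", 18, stop_month=9),       # stopped at month 9, still followed
    Course("non-TNFi", 14),
    Course("non-TNFi", 3, stop_month=3),    # stopped, then no further data
]
for treatment, row in follow_up_summary(courses, horizon=12).items():
    print(treatment, row, f"lost: {row['lost'] / row['n']:.0%}")
```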
PtC 4: the proportion of patients who stop and/or change therapies over time as well as the reasons for treatment discontinuation must be reported
Though the rate of treatment discontinuation may be similar across treatments, the reasons for discontinuation could differ between treatments. Reasons for discontinuation have also changed as treat-to-target approaches, which may call for treatment tapering, especially in patients on combination therapy, have become more frequent. For some RMDs, treatments may sometimes be discontinued when patients are in sustained clinical remission,14 15 in other words, because of effectiveness. Thus, in a worst-case scenario, one treatment could be discontinued only for adverse events, while another could be discontinued for remission. Consider also examining the characteristics of patients who stopped or changed therapies by reason for discontinuation, to gauge the importance of attrition bias (eg, age, gender and baseline disease severity for each reason for treatment discontinuation per treatment).
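A per-treatment table of discontinuation proportions and reasons makes the worst-case scenario above visible. This minimal sketch uses hypothetical records in which two drugs have the same 30% discontinuation rate for opposite reasons; the record format and drug labels are assumptions. Patient characteristics (eg, age or baseline disease severity) could be summarised with the same grouping by treatment and reason.

```python
from collections import Counter, defaultdict


def discontinuation_summary(records):
    """records: (treatment, stop_reason) pairs, with stop_reason None when
    the course was still ongoing at the time point of interest."""
    starts = Counter()
    reasons = defaultdict(Counter)
    for treatment, reason in records:
        starts[treatment] += 1
        if reason is not None:
            reasons[treatment][reason] += 1
    table = {}
    for treatment, n in starts.items():
        stopped = sum(reasons[treatment].values())
        table[treatment] = {
            "n": n,
            "pct_stopped": 100 * stopped / n,
            # share of each reason among the courses that stopped
            "reasons_pct": {r: 100 * k / stopped for r, k in reasons[treatment].items()},
        }
    return table


# Hypothetical worst case: identical discontinuation rates, opposite reasons.
records = (
    [("drug A", None)] * 7 + [("drug A", "adverse event")] * 3
    + [("drug B", None)] * 7 + [("drug B", "remission")] * 3
)
for treatment, row in discontinuation_summary(records).items():
    print(treatment, row)
```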
PtC 5: covariates should be chosen based on subject matter knowledge and model selection should be justified
As for any adjustment for confounding, the list of covariates for effectiveness at a given time point should be determined based on known potential confounders. Indeed, even recent model selection methods can be too data driven,16 leading to biased variable selection, overestimated parameters and inflated type I error.
PtC 6: the study baseline should be at treatment initiation and a description of how covariate measurements relate to baseline should be reported
In open cohort studies, determining baseline may be quite difficult. Efforts should be made to define baseline accurately in each study, and to describe explicitly whether covariates were measured at baseline. For instance, the visit to assess disease activity could have occurred 2 weeks before treatment initiation, while imaging data were obtained at a visit 2 months later. In addition, registers often contain several treatment courses per patient. Consider using data from all treatment courses of the same patient, applying appropriate statistical methods to account for non-independence.
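Reporting how covariate measurements relate to baseline can be as simple as listing, for each covariate, the number of days between its measurement and treatment initiation. The sketch below reproduces the example in the text (disease activity 2 weeks before, imaging 2 months after treatment start) with invented dates. The window of 30 days before to 0 days after baseline used to flag covariates is a hypothetical choice, not part of these PtC.

```python
from datetime import date


def covariate_offsets(baseline, measured_on):
    """Days between each covariate measurement and the study baseline
    (treatment initiation); negative values were measured before baseline."""
    return {name: (day - baseline).days for name, day in measured_on.items()}


def outside_window(offsets, days_before=30, days_after=0):
    """Covariates measured outside a hypothetical baseline window running from
    `days_before` days before to `days_after` days after treatment start."""
    return sorted(
        name for name, d in offsets.items() if d < -days_before or d > days_after
    )


# Invented dates matching the example: disease activity 2 weeks before start,
# imaging about 2 months after start.
baseline = date(2020, 3, 1)
offsets = covariate_offsets(
    baseline,
    {"disease activity": date(2020, 2, 16), "imaging": date(2020, 5, 1)},
)
print(offsets)
print("outside window:", outside_window(offsets))
```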
PtC 7: the analysis should be based on all patients starting a treatment and not limited to patients remaining on treatment at a certain time point
Due to attrition, analysing only patients still on treatment at a certain time point (eg, 1 year) would lead to bias, by considering only those patients for whom the treatment did not need to be discontinued. Complete case (CC) analysis may lead to larger bias as follow-up time and thus attrition increases.
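Why CC bias grows with follow-up can be shown with a small expected-value calculation. The sketch assumes, purely for illustration, that each patient is either a responder or a non-responder from the start, and that non-responders stop treatment with a higher constant monthly probability than responders; the 40% response rate and the 1% and 5% monthly discontinuation probabilities are invented.

```python
def expected_cc_rate(p_responder, h_responder, h_nonresponder, months):
    """Expected response rate among patients still on treatment at `months`.

    Illustrative assumptions: response status is fixed from treatment start,
    and responders and non-responders stop treatment with constant monthly
    probabilities h_responder and h_nonresponder.
    """
    still_on_responders = p_responder * (1 - h_responder) ** months
    still_on_nonresponders = (1 - p_responder) * (1 - h_nonresponder) ** months
    return still_on_responders / (still_on_responders + still_on_nonresponders)


TRUE_RATE = 0.40  # hypothetical response rate among all patients starting treatment
for months in (0, 6, 12, 24):
    cc = expected_cc_rate(TRUE_RATE, h_responder=0.01, h_nonresponder=0.05, months=months)
    print(f"{months:>2} months: CC {cc:.2f} vs {TRUE_RATE:.2f} among all starters")
```

Because non-responders leave faster under these assumptions, the share of responders among patients still on treatment rises with every month, so the CC estimate drifts further above the 40% rate in all starters at 6, 12 and 24 months.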
PtC 8: when treatment discontinuation occurs before the time of outcome assessment, attrition should be taken into account in the analysis
Attrition due to treatment discontinuation is a special case of informative censoring, whereby the patients stopping treatment differ from patients remaining on treatment, for instance by having a smaller decrease in disease activity. Several analysis methods are available to correct this selection bias. However, an increase in the response rate should be interpreted carefully since an apparent increase may represent a selection of patients for whom the treatment worked well instead of an increase of treatment effectiveness over time.
In this point to consider, we encourage researchers to consider multiple imputation techniques and/or causal inference models such as inverse probability weighting (IPW), which have been shown to be more accurate than CC analyses.17 When data are missing at random, that is, when missingness depends on other variables but can be predicted from the available information,18 both methods have been shown to provide reliable estimates.17 19 20 Nevertheless, because correct specification of the missingness model is critical, some simulation studies have found that multiple imputation and IPW performed no better than CC analyses.21 Conversely, other studies showed better results from IPW or multiple imputation when the mechanism of dropout or death was correctly specified.22 23
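IPW can be illustrated with a saturated missingness model, in which the probability of the outcome being observed is estimated separately within each level of one covariate. The following minimal Python sketch uses invented data in which the outcome is missing more often among patients with high baseline disease activity, who also respond less often. If missingness depends only on that stratum (missing at random), weighting each observed course by the inverse of its stratum's observed proportion corrects the overestimation of the CC analysis. In practice the probability of being observed would usually be modelled with logistic regression on several covariates, such as reasons for discontinuation; the single stratum variable here is an assumption made for brevity.

```python
from collections import defaultdict

# Hypothetical courses: (baseline activity stratum, outcome observed?, responder?)
# The outcome is missing more often in the "high" stratum, where response is rarer.
COURSES = (
    [("low", True, True)] * 6 + [("low", True, False)] * 2 + [("low", False, None)] * 2
    + [("high", True, True)] * 1 + [("high", True, False)] * 3 + [("high", False, None)] * 6
)


def ipw_response_rate(courses):
    """Inverse probability weighted response rate.

    P(outcome observed | stratum) is estimated by the observed proportion in
    each stratum (a saturated model); each observed course gets the weight
    1 / P(observed | stratum).
    """
    n = defaultdict(int)
    n_observed = defaultdict(int)
    for stratum, observed, _ in courses:
        n[stratum] += 1
        n_observed[stratum] += observed
    weighted_resp = weighted_total = 0.0
    for stratum, observed, responder in courses:
        if observed:
            w = n[stratum] / n_observed[stratum]
            weighted_total += w
            weighted_resp += w * responder
    return weighted_resp / weighted_total


def complete_case_rate(courses):
    """Response rate among courses with an observed outcome only."""
    observed = [r for _, o, r in courses if o]
    return sum(observed) / len(observed)


print(f"complete case: {complete_case_rate(COURSES):.2f}")
print(f"IPW:           {ipw_response_rate(COURSES):.2f}")
```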
In this framework, the task force was presented with a simulation study that examined the impact of specifying the missingness of the effectiveness outcome due to treatment discontinuation and attrition.24 25 This study used data generated based on a collaboration of registers of biologic DMARDs including ~50 000 patients with RA. The effectiveness measure assessed was the LDA rate at 1 or 2 years. The methods compared included CC, Lundex,9 IPW,17 and a specific multiple imputation model called Confounder-Adjusted Response Rate with Attrition Correction (CARRAC). For both IPW and multiple imputation models, the covariates used to model missingness included reasons for treatment discontinuation, in addition to more usual patient characteristics. The conditions tested included having between 10% and 30% of patients stopping treatment or being lost to follow-up. These percentages were allowed to vary between treatment groups to investigate differential attrition. Furthermore, one condition evaluated the impact of informative attrition, in which the Clinical Disease Activity Index (CDAI) at the time of response assessment (1 year) influenced the chance of having discontinued treatment, thus making the data ‘not missing at random’ (NMAR). Results showed that CC usually overestimated LDA at 1 year and Lundex methods underestimated it, whereas IPW and CARRAC were usually unbiased. Even though the effectiveness estimates from CC or Lundex methods were often quite biased for each treatment, the difference in LDA between two treatments was often closer to the true difference.
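For reference, the LUNDEX index is commonly computed as the fraction of starters still on drug at time t multiplied by the fraction of responders among those still on drug. The sketch below uses invented counts. With complete response data among patients on drug, the product reduces to responders divided by all starters, which is why the approach effectively counts every discontinuation as non-response.

```python
def lundex(n_started, n_on_drug, n_responders_on_drug):
    """LUNDEX at time t: fraction of starters still on drug at t multiplied
    by the fraction of responders among those still on drug at t."""
    adherence = n_on_drug / n_started
    response_on_drug = n_responders_on_drug / n_on_drug
    return adherence * response_on_drug


# Hypothetical cohort: 200 starters, 150 still on drug at 1 year, 90 of them in LDA.
value = lundex(200, 150, 90)
print(f"LUNDEX:        {value:.3f}")
print(f"complete case: {90 / 150:.3f}")
```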
PtC 9: sensitivity analyses should be undertaken to explore the influence of assumptions related to missingness, particularly in the case of attrition
Since assumptions and choices of covariates can strongly affect estimates of effectiveness, sensitivity analyses considering different reasonable alternatives help determine the robustness of the findings. For instance, CC analysis assumes that the treatment was similarly effective in those who remained on treatment and in those who discontinued (eg, for lack or loss of effect). A second analysis considering all patients who discontinued treatment as non-responders provides the opposite viewpoint, that all discontinuations are due to ineffectiveness. Showing the results of both analyses thus indicates how much the estimated effectiveness can vary with the assumptions underlying the analyses.
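The two analyses described above can be run side by side from the same patient-level records. In this minimal sketch, the tuple layout and the data are hypothetical. A third scenario, counting discontinuations for remission as responses, is added as an illustrative middle ground reflecting the discontinuation for sustained remission discussed under PtC 4, not as a method recommended by the task force.

```python
def sensitivity_estimates(courses):
    """courses: (on_treatment, responder, stop_reason) tuples for all starters.
    responder is None for discontinued courses (outcome not assessed on drug)."""
    on_drug = [resp for on, resp, _ in courses if on]
    responders = sum(on_drug)
    remissions = sum(1 for on, _, why in courses if not on and why == "remission")
    return {
        # assumes those who stopped would have responded like those who stayed
        "complete case": responders / len(on_drug),
        # assumes every discontinuation reflects ineffectiveness
        "non-responder imputation": responders / len(courses),
        # hypothetical middle scenario: stops for remission count as response
        "remission stops as responders": (responders + remissions) / len(courses),
    }


courses = (
    [(True, True, None)] * 4 + [(True, False, None)] * 2
    + [(False, None, "lack of effect")] * 2
    + [(False, None, "adverse event")] * 1
    + [(False, None, "remission")] * 1
)
for scenario, estimate in sensitivity_estimates(courses).items():
    print(f"{scenario:<30} {estimate:.2f}")
```

With these invented data, the estimates range from 0.40 under non-responder imputation to 0.67 under CC analysis, which shows how strongly the conclusion can depend on assumptions about discontinuation.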
PtC 10: authors should prepare a statistical analysis plan in advance
Statistical analysis plans protect the analyses from becoming too data-driven, influenced by what is seen in the initial descriptive results. This is particularly important for observational studies since analyses are much less clear-cut than for randomised trials. Consider including details on covariates included for adjustment, how these covariates will be included in the models (eg, age as a continuous linear variable, or as a categorical factor), which outcomes will be considered, which analyses will be done, and which sensitivity analyses will be run.
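A statistical analysis plan can also be kept in a machine-readable form and fixed before any outcome data are examined. The following Python sketch is one hypothetical way to do so: a frozen dataclass whose fields mirror the items listed above and which rejects functional forms other than the two mentioned in the text. The class and field names are assumptions, not a EULAR template.

```python
from dataclasses import dataclass

ALLOWED_FORMS = {"continuous linear", "categorical"}


@dataclass(frozen=True)
class StatisticalAnalysisPlan:
    """Pre-specified analysis choices (hypothetical structure); frozen so the
    plan cannot be edited after it has been fixed."""
    covariates: tuple            # (name, functional form) pairs
    outcomes: tuple
    analyses: tuple
    sensitivity_analyses: tuple

    def __post_init__(self):
        for name, form in self.covariates:
            if form not in ALLOWED_FORMS:
                raise ValueError(f"unsupported form {form!r} for covariate {name!r}")


sap = StatisticalAnalysisPlan(
    covariates=(("age", "continuous linear"), ("sex", "categorical")),
    outcomes=("LDA at 1 year",),
    analyses=("IPW for attrition, adjusted for covariates",),
    sensitivity_analyses=("complete case", "non-responder imputation"),
)
print(sap)
```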
Observational studies are becoming more comprehensive and detailed. Their longer follow-up allows for a better understanding of the long-term effect of treatments. However, researchers need to be mindful of the risk of biased estimations of effectiveness. Since no solution to adjust for this risk will be perfect, guidance on which information should be reported to allow a fair assessment of potential bias is critical. Indeed, these PtC expand on the STROBE guidelines regarding the importance of describing missing data patterns. Similar to STROBE guidelines, they are relevant not only to RMDs, but to most medical fields using cohort studies to assess effectiveness, and especially to chronic disease treatments.
To our knowledge, no other non-governmental organisation representing patients, healthcare professionals and scientific societies has to date developed recommendations for comparative effectiveness studies. Yet the need for guidelines is increasingly evident. First, evidence is accruing from numerous statistical publications, across various medical fields, on missing outcome data over time and how to impute them.11–15 Overall, these studies find that missingness is often informative (ie, associated with either the exposure or the outcome that should have been measured), making the data ‘NMAR’. These results reinforce the message that missing data patterns must be shown, to inform readers about differential attrition bias, which would distort the strength of the association found between treatment and the effectiveness outcome. Second, discontinuation of treatment for remission is now an option, so previous methods such as the simple Lundex approach,9 which considered all patients who stopped treatment as non-responders, are less appropriate than before. Though this trend may be stronger in some countries than in others, depending on local practices or recommendations, standards of care will continue to evolve, and the need for well-documented reporting and analysis of effectiveness will remain.
Unlike previous EULAR-endorsed PtC, Oxford Centre for Evidence-Based Medicine Levels of Evidence were not assigned, because no clinical studies were included. Thus, as recommended by the EULAR SOP, we downgraded our recommendations to ‘PtC’ because of the lack of strong data-driven evidence. However, agreement among task force members was very high. Though the task force included experts from 11 countries, a limitation is that there was only one representative from Eastern Europe.
Finally, as analyses of observational data become more complex to accommodate more intricate research questions and data collection, supporting tools should be provided to researchers. These PtC are one tool to support correct reporting of comparative effectiveness studies. Another is the EULAR Virtual Research Centre, which offers a range of resources including clinical research support. Investigators of future studies should be encouraged to collect the variables needed to adhere to these recommendations, for example, reasons for treatment discontinuation. R packages, SAS procedures or other statistical software should be developed to make state-of-the-art analyses easy to implement, with detailed documentation clarifying the substantive choices that fall to the investigators.
Patient consent for publication
Handling editor Dimitrios T Boumpas
Twitter @delcourvoisier, @pedrommcmachado
Contributors All authors were members of the taskforce and made substantial contributions to the development and interpretation of the points to consider. They also contributed to revising the manuscript critically for important intellectual content. All authors approved the final version to be published, and agree to be accountable for all aspects of the work.
Competing interests DSC has received consulting fees from Abbvie, MSD, and Pfizer outside the submitted work. KL has received speaker fees from Gilead-Galapagos and grant/research support from AbbVie outside the submitted work. ZR has received consulting and speaker fees from Abbvie, Eli Lilly, Novartis, MSD, Pfizer, Roche, Sandoz outside the submitted work. PMM has received consulting/speaker’s fees from Abbvie, BMS, Celgene, Eli Lilly, Janssen, MSD, Novartis, Orphazyme, Pfizer, Roche and UCB, unrelated to the work presented in this manuscript, and is supported by the National Institute for Health Research (NIHR), University College London Hospitals (UCLH), Biomedical Research Centre (BRC). Disclaimer: The views expressed here are those of the authors and do not necessarily represent the views of the (UK) National Health Service (NHS), the National Institute for Health Research (NIHR), or the (UK) Department of Health, or any other organisation. FI has received consulting/speaker’s fees from Abbvie, BMS, Celgene, Eli Lilly, Galapagos, Janssen, MSD, Novartis, Pfizer, SOBI, Roche and UCB, unrelated to the work presented in this manuscript. TAS has received grant/research support from AbbVie and Roche, has been a consultant for AbbVie and Sanofi Genzyme, and has been a paid speaker for AbbVie, Roche, Sanofi and Takeda. AF has received grant/research support from AbbVie, BMS, Eli Lilly, Galapagos, and Pfizer, has been a paid speaker for AbbVie, BMS, Eli Lilly, Novartis, on Novartis. MdW operating for Stichting Tools has received fees for lectures or consultancy provided by MdW from Celgene, Eli Lilly, Pfizer and UCB, over the last three years, unrelated to the work presented in this manuscript. AS has received speaker’s fees from AbbVie, BMS, Celltrion, MSD, Pfizer, and Roche, unrelated to the work presented in this manuscript.
SRS has received consulting/speaker’s fees from 67 Health, Ampersand Health, Envision Pharma Group, Janssen and On The Pulse Consultancy, and is an employee of Envision Pharma Group, unrelated to the work presented in this manuscript. LMØ has received grant/research support from Novartis, unrelated to the work presented in this manuscript.
Patient and public involvement Patients and/or the public were involved in the design, or conduct, or reporting, or dissemination plans of this research. Refer to the Methods section for further details.
Provenance and peer review Not commissioned; externally peer reviewed.