Background Tremendous opportunities for health research have been unlocked by the recent expansion of big data and artificial intelligence. However, this is an emergent area where recommendations for optimal use and implementation are needed. The objective of these European League Against Rheumatism (EULAR) points to consider is to guide the collection, analysis and use of big data in rheumatic and musculoskeletal disorders (RMDs).
Methods A multidisciplinary task force of 14 international experts was assembled with expertise from a range of disciplines including computer science and artificial intelligence. Based on a literature review of the current status of big data in RMDs and in other fields of medicine, points to consider were formulated. Levels of evidence and strengths of recommendations were allocated and mean levels of agreement of the task force members were calculated.
Results Three overarching principles and 10 points to consider were formulated. The overarching principles address ethical and general principles for dealing with big data in RMDs. The points to consider cover aspects of data sources and data collection, privacy by design, data platforms, data sharing and data analyses, in particular through artificial intelligence and machine learning. Furthermore, the points to consider state that big data is a moving field in need of adequate reporting of methods and benchmarking, careful data interpretation and implementation in clinical practice.
Conclusion These EULAR points to consider discuss essential issues and provide a framework for the use of big data in RMDs.
- health services research
- outcomes research
The recent expansion of big datasets and advanced computational techniques has led to tremendous opportunities for health research.1 As elegantly elaborated by Topol, the use of big data in medicine is going to disrupt the medical system as we know it.2 Big data include clinical data (eg, originating from electronic health records, healthcare system claims data or patient-generated data such as from apps), biological data arising from molecular research and leading to complex multi-omics molecular data,3 social data (eg, originating from social networks, Internet of Things, physical social connexions or economic data repositories), imaging data and environmental data (eg, urbanistic data, pollution or atmospheric conditions).4 5 In parallel, artificial intelligence–based methodologies allowing computer systems to ‘learn’ from data (ie, progressively improve performance on a specific task without being explicitly programmed) are increasingly accessible.6 7 The collection of big data combined with such information processing techniques (computational modelling, machine learning) has created an opportunity for progress in medical research, which should ultimately modify patient care and clinical decision-making.
Some recent applications of big data show interesting potential. These include the correct detection of skin lesions suspicious for melanoma,8–10 prediction of cancer treatment response based on imaging11 and the correct interpretation of eye fundus pathologies.11 However, big data is an emergent area in need of guidelines and general recommendations on how to move this field forward in a collaborative and ethical way. The challenges presented by big data and artificial intelligence include data sources and data collection (how to collect and store the data while guaranteeing ethics and data privacy12), the interpretation of models from complex analyses13 14 and the clinical implications of big data (how to go from big data to clinical decision-making).3 15 16
To our knowledge, no academic societies have developed consensus guidelines dealing with big data.17 Very recently, the European Medicines Agency (EMA) released recommendations focused on the acceptability of evidence derived from big data in support of the evaluation and supervision of medicines by regulators18; however, these recommendations deal mainly with the interpretation of drug-related big data. The European League Against Rheumatism (EULAR) has recently formulated, as one of its key strategic objectives, the advancement of high-quality collaborative research and comprehensive quality of care for people living with rheumatic and musculoskeletal disorders (RMDs).19 Thus, EULAR naturally takes an interest in big data and its applications.
The objective of this project was to develop EULAR ‘points to consider’ (PTC) for the collection, analysis and use of big data in RMDs.
After approval by the EULAR Executive Committee, the convenors (LG, TRDJR) and the project fellow (JK) led a multidisciplinary task force guided by the 2014 updated EULAR Standardised Operating Procedures,20 which were modified for this specific task force. In October 2018, the main questions to be addressed in the preparatory work for the task force were defined as (1) data sources and collection, (2) data analyses, and (3) data interpretation and implementation of findings. These questions were addressed in subsequent months leading up to the face-to-face meeting by the project fellow and the convenors. A systematic literature review (SLR) was performed between November 2018 and February 2019, regarding publications employing big data in RMDs, with a comparison in other medical fields.21 Additionally, a narrative review of unpublished data on websites on big data and artificial intelligence was performed to inform the task force12 17 18 22–26 and expert opinions were obtained from four selected persons through individual telephone interviews.
In February 2019, during a 1-day face-to-face task force meeting, overarching principles and PTC were developed. The process was both evidence based and consensus based, through discussions of the international task force of experts from a range of disciplines including computer science and artificial intelligence. The task force consisted of 14 individuals from 8 European countries: 6 rheumatologists, 4 data scientists/big data experts, 1 cardiologist specialised in systems medicine, 1 patient research partner, 1 health professional with expertise in outcomes research and 1 fellow in rheumatology. Furthermore, feedback was obtained after the meeting from two additional experts. This inclusive approach aimed to obtain broad consensus and applicability of the PTC. During the 1-day meeting, the preparatory work was presented and discussed, the target audience of the PTC was defined, then the PTC were formulated and extensively discussed. The PTC were finalised over the subsequent 2 weeks by online discussions, taking into account the publication the same week of an EMA consensus document on big data.18
During the meeting and through online discussions, based on the gaps in evidence and the issues raised among the task force, a research agenda was also formulated. After the PTC were finalised, the level of evidence and strength of each PTC were ascertained according to the Oxford system.27 Finally, each task force member voted anonymously on their level of agreement with each PTC via email (numeric rating scale ranging from 0=do not agree to 10=fully agree). The mean and SD of the level of agreement of task force members were calculated.
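As an illustration of the agreement summary described above, the following is a minimal sketch of how the mean and SD of the 0 to 10 votes for one PTC could be computed. The votes shown are hypothetical, not the task force's data, and the use of the sample SD (rather than the population SD) is an assumption, since the text does not specify which was used.

```python
from statistics import mean, stdev


def summarise_agreement(votes):
    """Return (mean, sample SD) of 0-10 agreement votes for one PTC."""
    if any(not 0 <= v <= 10 for v in votes):
        raise ValueError("votes must lie on the 0-10 numeric rating scale")
    # Assumption: sample SD (n - 1 denominator); the source does not say.
    return mean(votes), stdev(votes)


# Hypothetical votes from 14 task force members (illustrative only).
votes = [9, 10, 8, 9, 10, 10, 7, 9, 8, 10, 9, 9, 10, 8]
m, sd = summarise_agreement(votes)
print(f"mean={m:.1f}, SD={sd:.1f}")
```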
The final manuscript was reviewed and approved by all task force members and approved by the EULAR Executive Committee.
The target audience of these PTC includes researchers in the field of big data in RMDs and researchers outside the field of RMDs; data collection organisations and/or groups collecting data (eg, registries, hospitals, telecommunications operators, search engines, genetic sequencing teams, institutions which collect images etc); data analysts and organisations; people with RMDs, people at risk of developing RMDs and patient associations; clinicians involved in the management of people with RMDs; and other stakeholders such as research organisations and funding agencies, policy-makers, authorities, governments and medical societies outside of RMDs.
Overarching principles and PTC were formulated, which are shown in table 1 and are discussed in detail below.
Definitions of terms
This first point in table 1 proposes a definition of terms relating to big data. Although the term big data is widely used, there is not one commonly accepted definition. When performing the literature review, several definitions were found (box 1).6 21 The first overarching principle defines the term big data, largely based on the EMA definition.18 Big data is defined by its size and diversity—it is diverse, heterogeneous and large and incorporates multiple data types and forms; but also by the specific complexity and challenges of integrating the data to enable a combined analysis.18 The second half of the definition refers to artificial intelligence (AI). AI is defined as the ability of a machine to mimic ‘cognitive’ functions that humans associate with human minds, such as ‘learning’ and ‘problem solving’.6 New computational techniques, such as AI (which includes machine learning and deep learning) are often (but not necessarily) applied to big data.18
Box 1 Some definitions of the term ‘big data’ in the literature
Extremely large sets of information which require specialised computational tools to enable their analysis and exploitation. These data might come from electronic health records from millions of patients, genomics, social media, clinical trials or spontaneous adverse reaction reports.18
Data sets that are too large or complex for traditional data-processing application software to adequately deal with.73
Defined by volume, if log(n∗p) is greater than or equal to 7, where n is the number of rows and p is the number of columns.74
Data sets that are large or complex (multidimensional and/or dynamic) enough to apply complex methods, eg, artificial intelligence.75
Information assets characterised by such high velocity, variety and volume that specific data mining methods and technology are required for its transformation into value.76
A generic and comprehensive definition of big data is based on the five V paradigm, ie, volume of data, variety of data, velocity of processing, veracity and value.77
The term big data refers to the emerging use of rapidly collected, complex data in such unprecedented quantities that terabytes (10^12 bytes), petabytes (10^15 bytes) or even zettabytes (10^21 bytes) of storage may be required.78
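Of the definitions in box 1, the volume-based one is directly computable. The sketch below applies it to a table with n rows and p columns. It assumes the logarithm is base 10, which box 1 does not state, and the function names are illustrative.

```python
import math


def log_volume(n_rows: int, n_cols: int) -> float:
    """Return log10(n * p) for a table with n rows and p columns."""
    if n_rows <= 0 or n_cols <= 0:
        raise ValueError("row and column counts must be positive")
    return math.log10(n_rows * n_cols)


def is_big_by_volume(n_rows: int, n_cols: int, exponent: int = 7) -> bool:
    """Apply the volume criterion log(n * p) >= 7 (base 10 assumed)."""
    if n_rows <= 0 or n_cols <= 0:
        raise ValueError("row and column counts must be positive")
    # Exact integer comparison, equivalent to log10(n * p) >= exponent
    # but free of floating-point rounding at the boundary.
    return n_rows * n_cols >= 10 ** exponent


# A registry of 100 000 patients with 100 variables meets the criterion;
# a cohort of 10 000 patients with 50 variables does not.
print(is_big_by_volume(100_000, 100), is_big_by_volume(10_000, 50))
```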
This next sentence is informative and aims to present the diversity of data sources leading to big data; we listed in a non-exhaustive way some of the sources of big data. The most common sources of big healthcare data found in the SLR were clinical; these include electronic health records, studies and registries, billing and healthcare system claims databases.21 28 29 A more recent source of clinical big data currently underused in RMDs is the Internet of Things (eg, wearables, apps, medical devices and sensors), but also social media, behavioural and environmental data.18 30 31 Imaging is also a growing field of application of big data.10 32 33 Regarding basic and translational research results, -omics such as genomics and bioanalytical omics are an important and rapidly growing field for big data.18 34
Overarching principle A: ethical aspects
This overarching principle addresses ethical issues with big data. The collection, analysis and implementation of big data in RMDs must adhere to all applicable regulations. This covers privacy, confidentiality and security, ownership of data, data minimisation, and flow of data within the EU and with third countries.22 35 This is both a regulatory and legal requirement, and an ethical one.12 In terms of legal requirements, the General Data Protection Regulation (GDPR) has set standards which apply across Europe, but for health-related data, national rules could also apply on top of these.12
In this overarching principle, we also raise the question of the role of the patient and/or carer in big data. Big data enables active participation of patients, but this is not always the case. Participation of patients and patient research partners can be helpful in data interpretation; for big data, the active participation of patients is still a field to be explored.36 This principle highlights issues around information, consent and responsibilities, and also patient rights and participation.35
B: Potential of big data
Big data provides unprecedented opportunities which we wished to highlight in this overarching principle. Maybe even more than other types of data, big data benefits from transversal thinking, by both original ‘outside the box’ approaches and cross-fertilisation approaches taking into account other medical fields and aspects such as comorbidities, psychological, sociological and environmental findings.18 In this regard, collaboration both within the RMD field and in particular with patients, and outside of RMDs, is key, as will be addressed later in these PTC.15 24 37
C: Ultimate goal
This overarching principle states that the ultimate goal is to be of benefit to people with RMDs. This is always a key priority of EULAR and is in keeping with the EULAR Strategic Objectives and Roadmap.19 38
Points to consider
PTC 1: data collection—use of standards
As the amount of big data increases, the need for data harmonisation becomes more apparent, with the possibility for using different data sources through application of global standards. It is essential to ensure that existing and future datasets can be used and, in particular, pooled for big data approaches. To this end, they must be harmonised/aligned to facilitate interoperability of data.18 Where possible, minimising the number of standards and using global data standards would be helpful; as stated by the EMA, standards should be transparent, open to promote widespread uptake and globally applicable.18
In that regard, international consensus efforts on data standards, developed by groups such as the International Consortium for Health Outcomes Measurement, International Council for Harmonisation, Health Level Seven International, International Organization for Standardization and Clinical Data Interchange Standards Consortium (to name a few), are useful.39–42 Some of these groups have developed standards for rheumatology.40 The EULAR dataset for rheumatoid arthritis registries, or other core sets, are also helpful in this regard.43 44 While these standards regulate the way in which the data are recorded and stored, they do not control how efficient the data collection is at the care team level.
PTC 2: data collection and storage—FAIR principle
The FAIR (Findable, Accessible, Interoperable and Reusable) data principles are a measurable set of principles intended to act as a guideline for those wishing to enhance the reusability of their data.45 The FAIR principles are recognised by many actors, including the EMA and the European Commission.18 22 24 46 The FAIR principles are strongly linked to PTC 1 and 3, referring to standardisation, interoperability and data storage. Efforts are ongoing to promote the FAIR principles, such as those of the European Commission through the development of the EU eHealth Digital Service Infrastructure.47
PTC 3: data storage—data platforms
Several platforms have been developed to facilitate big data projects. These platforms are independent, standardised, collaborative and not limited to use for RMDs.48–50 These platforms have been developed with financial support from the EU and therefore adhere to necessary standards. Hence, the use of such platforms should be promoted, as recently stated by the EMA.18 In these PTC, we refer to the use of such platforms for RMD big data, but of course this would also apply to other groups of big data.
Public access to data is an important point, which raised much debate within the task force. Internationally, several groups emphasised the principle that big data should be made publicly available to promote open and reproducible research, in particular when the data are publicly funded.18 26 51 52 Conversely, the downsides of public access to data include the potential loss of momentum in securing intellectual property and scientific publications for the researchers who initially generated the data.53 Given this controversy, data sharing should be achieved in a way that is sustainable for all parties involved.53 How to make not only data but also algorithms openly available is very complex.54 55 The task force consensus was in favour of accessible data, but in the current situation, with limited and supervised access; we also felt that pilot projects to assess the impact of data sharing are needed and that such data sharing should be evidence based.56 This consensus will need to be revised as the situation evolves. The topic of data sharing was also added to the research agenda.
PTC 4: privacy by design
Privacy by design is an important approach which should be followed when managing big data projects. This point insists on the importance of privacy by design at the different levels of big data use, including the collection, processing, storage, analysis and interpretation of big data.17 57 Privacy by design is directly quoted in EU law about personal data.12 This approach prompts thinking on the reasons for collecting, processing, storing and protecting data, from inception to final deletion. Privacy by design also prompts individuals to self-assess the potential risks or weaknesses relating to data, and how best to manage such risks. This PTC is a major challenge for researchers in big data, but it appeared to the task force to be a legal requirement as well as an ethical one, and also an educational one since this practice is not widely understood. For big data projects, the data source is key: either the data are collected for the purpose of the project or data are re-used from existing sources. In the first case, obtaining consent is mandatory and must involve a data officer and follow a transparent and effective process in terms of data governance.35 When data are re-used, the national laws on consent, data sharing and governance must be applied. In this context, the development of common principles for data anonymisation would facilitate data sharing, including regulations for sharing, de-identifying, securely storing, transmitting and handling personal health information.18
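To make the ideas of data minimisation and de-identification concrete, the following is a minimal sketch, not a compliance recipe. It keeps only the fields needed for a hypothetical analysis and replaces the direct identifier with a keyed hash (HMAC-SHA256). The field names, the record and the key are all assumptions made for illustration. Note that keyed hashing is pseudonymisation rather than anonymisation: whoever holds the key can re-link records, so the output would still be treated as personal data under GDPR.

```python
import hashlib
import hmac

# Hypothetical list of the only fields the analysis actually needs.
KEEP_FIELDS = ("age_band", "diagnosis")


def pseudonymise(record: dict, secret_key: bytes) -> dict:
    """Return a minimised copy of record with a keyed-hash pseudonym."""
    token = hmac.new(
        secret_key,
        record["patient_id"].encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    # Data minimisation: copy only the whitelisted fields.
    minimal = {name: record[name] for name in KEEP_FIELDS}
    minimal["pseudonym"] = token
    return minimal


# Illustrative record and key; a real key would be stored securely and
# separately from the data.
record = {
    "patient_id": "P-000123",
    "name": "Jane Doe",
    "age_band": "50-59",
    "diagnosis": "rheumatoid arthritis",
}
safe = pseudonymise(record, secret_key=b"example-key-not-for-real-use")
print(sorted(safe))
```

The same patient always maps to the same pseudonym under a given key, which allows records to be linked across datasets without exposing the original identifier.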
The European regulatory framework around data is currently undergoing change: from May 2019, the circulation of non-identifying data will be facilitated.47 The implications of this change will have to be assessed.
PTC 5: collaboration
While interdisciplinary collaboration is beneficial and required for all research projects, it is even more important in big data projects where expertise is dispersed among different stakeholders. The task force insisted on the importance of collaboration between appropriate stakeholders at the analysis stage, for example, where AI methods require appropriate expertise, and at all phases of a big data project.25 Interdisciplinary collaborations should intervene at different times across a project, to enable the most appropriate design to be chosen, while ensuring that data collection and the type of analysis are fit for purpose. Of note, the statistical methods may be based on AI or may include more traditional statistics and/or computational methodologies, as appropriate. Further knowledge is needed on the comparison of statistical methods, which is discussed in more detail in PTC 7.21 58 The appropriate individuals to collaborate include clinical/biological scientists, computational/data scientists, health professionals and patients; proposals for respective roles are shown in table 2.
PTC 6: data analyses reporting
The methods, parameters and tools used in big data processing must be reported explicitly in any scientific paper. This is pivotal to allow comparison and interpretation of findings. Our SLR found that 8% of papers using AI did not report in any way what artificial intelligence methods were being used.21 Proper reporting is important for all research, but even more so when innovative methods such as artificial intelligence are used, to avoid confusion and to promote reproducibility.14 18 30 59
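One way to make such reporting systematic is to record the analysis settings in a structured, machine-readable form alongside the paper. The sketch below is only an illustration: the field names and example values are hypothetical and do not correspond to any established reporting standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AnalysisReport:
    """Hypothetical minimal record of how an analysis was run."""
    method: str                       # eg, "random forest"
    software: str                     # tool or library used
    software_version: str
    hyperparameters: dict = field(default_factory=dict)
    random_seed: Optional[int] = None


def to_json(report: AnalysisReport) -> str:
    """Serialise the report so it can be shared as supplementary material."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)


# Illustrative values only.
report = AnalysisReport(
    method="random forest",
    software="example-ml-library",
    software_version="1.0.0",
    hyperparameters={"n_trees": 500, "max_depth": 8},
    random_seed=42,
)
print(to_json(report))
```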
PTC 7: benchmarking of data analyses
AI encompasses several techniques which are intended to solve the most difficult problems in computer science: search and optimisation (heuristics), logic (fuzzy logic), uncertain reasoning and learning (machine learning).60 In our SLR, machine-learning methods were the most used AI techniques in RMDs and in other medical fields (98% and 100% of AI papers, respectively). The most used machine-learning algorithms were artificial neural networks (with deep learning as the most advanced version), representing 48% of AI articles.21 61
In addition, comparison of artificial intelligence methods within RMDs should be promoted.17 18 24 62 This is particularly needed because AI is a rapidly growing field; there is an ongoing and unresolved debate as to which methods within AI perform best.63 64 The comparison of AI methods was also added to the research agenda, since it was felt that such comparisons are difficult to perform at this moment in time and that this point is more aspirational.
PTC 8: validation of big data findings
Although there may be a perception that big data are more valid or less subject to bias than traditional studies, model overfitting, inappropriate generalisation of the results and/or bias can in fact lead to inappropriate conclusions.14 18 28 Thus, it is important both to assess and benchmark the quality of the generated data and the methods used, in order to avoid overinterpretation of results, overfitting of the models and inappropriate generalisation of the results when using big data. The task force also felt that it was important to validate results in independent datasets.24 28 Overall, the task force agreed that conclusions drawn from big data need independent validation (in other datasets) to overcome current limitations and to assure scientific soundness. However, a specific challenge for big datasets and the validation of results is the need for other (similar) big datasets; thus, feasibility of validation is a key issue which was discussed at length within the task force.
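The toy example below illustrates why performance measured on the development data alone can mislead. It is a deliberately simple sketch with hand-made data, not a method used by the task force: a one-nearest-neighbour rule memorises a small development set that contains two mislabelled points, and is then scored on an independent set that follows the underlying rule (label 1 when x is above 5).

```python
def predict_1nn(train, x):
    """Predict the label of x by copying its nearest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Fraction of (x, y) pairs in data predicted correctly from train."""
    correct = sum(predict_1nn(train, x) == y for x, y in data)
    return correct / len(data)


# Development set: the points at x=3 and x=7 carry noisy labels.
development = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
# Independent set: follows the underlying rule without noise.
external = [(1.4, 0), (2.9, 0), (3.2, 0), (4.3, 0),
            (5.5, 1), (6.8, 1), (7.1, 1), (8.6, 1)]

apparent = accuracy(development, development)   # scored on its own data
validated = accuracy(development, external)     # scored on new data
print(f"apparent={apparent:.2f}, external={validated:.2f}")
```

On this toy data the memorising model is perfect on its own development set (accuracy 1.0) but reaches only 0.5 on the independent set, because it has learnt the noise as well as the signal.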
PTC 9: implementation of findings
The clinical implementation of big data findings should be considered at the earliest opportunity. The SLR and the wider literature showed that this implementation is currently mostly lacking.21 65 The task force consensus was that researchers using big data should consider implementation of their results in clinical practice; this would include, for example, discussing implementation of findings in clinical practice in the original papers. The task force is well aware that this is a difficult task, since such implementation is complex to set up, costly and potentially not within the scope of the primary study.66 In this regard, the EMA states that regulatory guidance is required on the acceptability of evidence derived from big data sources.18 67 However, taking all these limitations into account, the task force consensus was that implementation of findings should be proactively considered early on.
PTC 10: training
Interdisciplinary training for clinical, biological or imaging researchers, healthcare professionals and computational biologists/data scientists in the field of big data is important and links closely with the need for collaborations in the field of big data (table 2). Indeed, machine-learning methods are becoming ubiquitous and have major implications for scientific discovery26; however, healthcare professionals are not always fully aware of the correct use of these methods, whereas data scientists may lack the clinical knowledge to design studies and interpret the findings (table 2). Given the current relative lack of expertise related to big data in the field of RMDs, and given the rapid changes in this field, certain organisations should set up or facilitate training sessions.18 37 These may include academic institutes, public research bodies and international organisations, such as EULAR. The training is needed for both sides: the healthcare professionals need to learn about the basics of big data, and the data scientists need to better understand the clinical questions and context within which big data have been collected and/or are being applied.68 The training can be performed separately for the different stakeholders, but in some instances, it will require an interdisciplinary educational setting in order to engage multidisciplinary teams and their unique dynamics (eg, the need to set a common vocabulary). The training process should detect skills gaps, identify individuals with bioinformatics/biostatistics/analytics/data science expertise within or outside the field of RMDs and implement appropriate training. The training should also aim for different levels of education provision, ranging from academic taught modules (undergraduate and postgraduate) and academic research modules (PhD) to continuous professional development opportunities (eg, through seminars and workshops). Similar efforts can be observed in Systems Biology and Systems Medicine.18 68–70
Based on the discussions among the task force and the areas of uncertainty identified within the SLR and discussions among expert stakeholders, a research agenda has been proposed, depicted in table 3. This research agenda covers issues related to data collection, data analyses, training, interpretation of findings and implementation of findings.
These are the first EULAR-endorsed PTC for the use of big data within the field of RMDs, which could well be applied by other medical disciplines. These PTC address the core aspects of big data, namely data sources and storage, including ethical aspects, data analyses, data interpretation and implementation. Legal aspects are not clearly mentioned, but these PTC were meant to cover principles and practical aspects of big data; however, the law, and in particular GDPR, applies first.12 For the update of these points to consider in a few years, participants with legal and ethical expertise should be considered.
This consensus effort is original and should help to promote growth and alignment in the field of big data. However, we are aware that this is a rapidly moving field and that the present PTC may quickly become outdated. It is reassuring that our proposals were not in contradiction to other recent recommendations, such as those of the EMA or the National Health Service in the UK.17 18
To our knowledge, no other non-governmental organisation representing patients, healthcare professionals and scientific societies has to date developed recommendations for big data. While the American College of Rheumatology has not published specific guidance relating to big data, it has developed an online patient registry from electronic health records which could potentially be used as a big data source.71
The use of big data is rapidly expanding, as witnessed by the increasing number of organisations, companies and publications/books dealing with this topic. Undoubtedly, the exploration, use and implementation of big data provide opportunities to improve healthcare, but it is also clear that this field is in need of guidelines and criteria. These PTC are a first tool to set those guidelines. With the growth of big data in RMDs, we expect that these PTC will inspire governmental and research organisations, healthcare providers, researchers and patients to increase relevant training of the stakeholders, promote research on interpretation and clinical applications of big data results, and develop benchmarks/guidelines for reproducible research.
Points 8 and 9 referring to validation and implementation raised much debate within the task force since we felt it was important to both insist on the importance of these steps and at the same time aim for applicability/feasibility of the points to consider. The final formulation of the points was thought to encourage progress without being too directive, to allow researchers to move forward as needed. Such elements will have to be updated as more data become available.
The grading of the evidence was a challenge in the present work, as the Oxford level of evidence27 which is used in EULAR task forces is better adapted to therapeutic evidence than to the observational or prognostic evidence often obtained in big data work. However, according to the EULAR Standardised Operating Procedures,20 levels of evidence and strength of recommendations should be rated by the Oxford Levels of Evidence. Moreover, where there is little data-driven evidence, the EULAR Standardised Operating Procedures recommend downgrading the recommendations to the level of ‘points to consider’, which is what was done here.
This work has several limitations: the main one is that the present PTC are not specific to RMDs. However, they are not specific because the aspects of big data that they address are universal, and at present, there is no specific issue related to big data in RMDs, as is also the case in any other medical specialty. Moreover, the experts we consulted consider big data as an opportunity to go beyond the traditional division of medical specialties and allow multidisciplinary approaches. The other main limitation was the extremely low level of evidence for all the PTC, raising the question of the value of grading evidence in this specific field, where the PTC were expert driven. This is often the case on subjects where recommendations are formulated before supportive data are produced.72 It is linked to the novelty of the subject.
In conclusion, it is anticipated that new data in this rapidly moving field will emerge over the next few years and that some of the questions formulated in the research agenda will be answered. Therefore, we will consider an update of these PTC as needed in a few years.
Handling editor Professor Josef S Smolen
LG and JK contributed equally.
Correction notice This article has been corrected since it was published Online First. The equal contribution statement has been added.
Contributors All authors have contributed to this work and have approved the final version.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests LG has published a study for which Orange IMT (telecommunications company) performed machine-learning analyses, without charge to the author. HS is an employee of Sanoïa, Digital CRO providing clinical research services including data science. RC is an employee of Orange Healthcare. There are no competing interests for the other authors.
Provenance and peer review Not commissioned; externally peer reviewed.