Increasing numbers of patients are sharing their health-related experiences online in forums, or on social media websites, such as Twitter and Facebook. This largely untapped source of data about patients’ experience of living with disease and its treatment may be useful in deriving drug safety information such as the occurrence, nature and impact of side effects.
Text mining techniques can transform free text into structured data amenable for analysis by automatically recognising mentions of various health conditions and their relationship to a particular medication. These techniques have been used to identify the occurrence of commonly discussed drug adverse events (AEs) from posts on Facebook and Twitter.1 2 They have also been used to identify discussions about benefits of drugs and how these benefits compared with the AEs, other treatment options, costs and complaints about the product.1
A recently published analysis of Twitter posts mentioning prednisolone or prednisone found insomnia and weight gain to be the most frequently discussed side effects.3 However, with the 140 (or more recently 280) character limit per tweet, any side effect information is limited to what can be included and discussed within this space.
HealthUnlocked (HU), Europe’s largest social media network for health that supports patients and healthcare providers, hosts over 700 communities (including the UK’s National Rheumatoid Arthritis Society (NRAS)) with 4.5 million visits and around 250 000 new posts per month. There is no character limit to these forum posts, allowing patients to describe their experiences more fully. As a result, these richer data may add value beyond simply detecting the occurrence of side effects, for example by capturing the severity of adverse drug reactions (ADRs), the impact of ADRs on quality of life, the strategies patients use to manage side effects and positive experiences with medications. Using the example of glucocorticoid (GC) therapy, this study aimed to explore the potential of HU posts in providing information about the occurrence and nature of drug side effects.
Our objectives were to (1) investigate the capability of machine learning-based systems (trained on publicly available data from other sources) to detect suspected ADRs (sADRs) from HU posts and evaluate how this compared with human annotation and (2) explore themes of discussion about GC-related ADRs within HU posts.
HU provided a dataset of de-identified posts from the NRAS community from December 2015 to December 2016, after the community had been notified of the planned research and reminded of the forum’s terms of use and data sharing opt-out options. Posts mentioning GCs were processed by automated natural language processing software, which identified the drug and health issues, mapped them to the Medical Dictionary for Regulatory Activities (MedDRA)3 and categorised each post as a sADR or not.
Posts were identified as containing a sADR by recognising mentions of health issues and then classifying whether the post also described an ADR, using an indicator score threshold of >0.7. The indicator score reflects how confident the classification model was that the post contains an ADR. The classification model was trained on the AskAPatient4 ADR classification dataset, whereas the model for recognising ADR mentions was trained on the CSIRO Adverse Drug Event Corpus (CADEC)5 data.
A random sample of sADR posts (n=50) was manually reviewed to determine whether they contained true ADRs. Additionally, a sample (n=50) of posts that mentioned GCs and a health issue but were not classified as a sADR was also assessed for true ADRs. Posts identified by manual review as containing GC ADRs were then analysed to identify themes.
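The flagging rule described above can be sketched as follows. This is only an illustration under stated assumptions: the `ScoredPost` structure, its field names and the example scores are hypothetical and do not come from the study's software; only the 0.7 cut-off and the two-step logic (health issue recognised, then classifier score above the threshold) are taken from the text.

```python
from dataclasses import dataclass

THRESHOLD = 0.7  # cut-off stated in the text; everything else here is hypothetical


@dataclass
class ScoredPost:
    post_id: str
    has_health_issue: bool   # assumed output of the mention-recognition step
    indicator_score: float   # assumed classifier confidence that the post describes an ADR


def is_suspected_adr(post: ScoredPost, threshold: float = THRESHOLD) -> bool:
    """Flag a post as a suspected ADR (sADR) only if a health issue was
    recognised AND the indicator score is strictly above the threshold."""
    return post.has_health_issue and post.indicator_score > threshold


# Made-up example posts, for illustration only.
posts = [
    ScoredPost("a", True, 0.91),   # health issue + high score -> flagged
    ScoredPost("b", True, 0.70),   # exactly at the cut-off -> not flagged (strict >)
    ScoredPost("c", False, 0.95),  # no health issue recognised -> not flagged
    ScoredPost("d", True, 0.42),   # low score -> not flagged
]

suspected = [p.post_id for p in posts if is_suspected_adr(p)]
print(suspected)
```

A strict comparison is used because the text specifies a score of >0.7; a post scoring exactly 0.7 would therefore not be flagged in this sketch.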
Of the 35 904 posts from 1998 users, 2409 posts mentioned GCs, of which 324 posts were identified as containing a sADR. On manual review of the 50 sampled sADR posts (of 324), only 36% (18/50) contained a true ADR. Of the 50 sampled posts that included a mention of GCs and a health issue but were not a sADR, 28% (14/50) were found to contain true ADRs.
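The proportions reported above can be reproduced with a few lines of arithmetic. The sketch below uses only the counts given in the text; the variable names are ours, and describing the 36% figure as a sample positive predictive value is our reading, not a term used by the authors.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts taken directly from the text.
total_posts = 35904
gc_posts = 2409
sadr_posts = 324

sampled_sadr = 50           # flagged posts that were manually reviewed
true_adr_in_sadr = 18
sampled_non_sadr = 50       # unflagged posts that were manually reviewed
true_adr_in_non_sadr = 14

gc_share = pct(gc_posts, total_posts)                        # posts mentioning GCs
sadr_share = pct(sadr_posts, gc_posts)                       # GC posts flagged as sADR
ppv_sample = pct(true_adr_in_sadr, sampled_sadr)             # flagged posts with a true ADR
missed_sample = pct(true_adr_in_non_sadr, sampled_non_sadr)  # unflagged posts with a true ADR
true_adr_total = true_adr_in_sadr + true_adr_in_non_sadr     # posts available for thematic analysis

print(gc_share, sadr_share, ppv_sample, missed_sample, true_adr_total)
```

The final sum matches the 32 posts with true GC ADRs that went on to thematic analysis.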
Thematic analysis of the 32 posts containing true GC ADRs found the most frequently mentioned ADRs were fractures (n=6), infection (n=5), headaches (n=3) and weight gain (n=3). Posts included rich descriptions and a number of important themes were identified, illustrated in table 1.
Our analyses show that current machine learning models trained on available annotated data for ADR detection in social media need further improvement before they can reliably identify ADRs in health forum data. Our manual review showed there are important themes relating to patients’ experiences and perceptions of using GCs that may not be obtained using traditional methods such as analysis of health records or spontaneous pharmacovigilance. These themes extended beyond the occurrence of side effects to their nature, impact on quality of life, change over time, importance to the individual and more. With improved automated ADR detection and other feature recognition, this rich data source may be useful to identify the ADRs most important to patients and additional features that will improve future shared informed decision making.
Acknowledgments
MedDRA trademark is owned by The International Federation of Pharmaceutical Manufacturers & Associations (IFPMA) on behalf of International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). The initial findings of this project were presented at the Annual European Congress of Rheumatology (EULAR) conference, Madrid in June 2019 (https://ard.bmj.com/content/78/Suppl_2/80.1).
Footnotes
Handling editor Josef S Smolen
Twitter @Arani_Viv, @WGDixon
Contributors WGD conceived the study. AV performed the analyses and drafted the manuscript. MB contributed to the data analysis. GN provided computer science technical support and expertise regarding automated language processing. All authors reviewed and approved the final version of this text.
Funding The work was supported by the Centre for Epidemiology Versus Arthritis (20380). AV is supported by a National Institute for Health Research (NIHR) funded Academic Clinical Fellowship. MB is funded by an Engineering and Physical Sciences Research Council (EPSRC) PhD fellowship.
Competing interests WGD has received consultancy fees from Google and Bayer, unrelated to this work.
Patient and public involvement Patients and/or the public were involved in the design, or conduct, or reporting, or dissemination plans of this research. Refer to the Methods section for further details.
Patient consent for publication Not required.
Ethics approval University of Manchester Research Ethics Committee, review by Computer Science ethics (ID CS 257).
Provenance and peer review Not commissioned; externally peer reviewed.