Vasileios Pezoulas (1), Themis Exarchos (1,2), Aliki Venetsanopoulou (3), Evangelia Zampeli (4), Saviana Gandolfo (5), Salvatore De Vita (5), Foteini N. Skopouli (6), Athanasios Tzioufas (3), Dimitrios Fotiadis (1)

  1. University of Ioannina, Dept. of Materials Science and Engineering, Ioannina, Greece
  2. Ionian University, Dept. of Informatics, Corfu, Greece
  3. University of Athens, Dept. of Pathophysiology, Athens, Greece
  4. Institute for Systemic Autoimmune and Neurological Diseases, Athens, Greece
  5. Udine University, Dept. of Medical and Biological Sciences, Udine, Italy
  6. Euroclinic Hospital, Dept. of Internal Medicine and Clinical Immunology, Athens, Greece


Background: Primary Sjögren’s Syndrome (pSS) is a chronic systemic autoimmune disease that primarily affects women near menopausal age. It causes exocrine gland dysfunction, with clinical manifestations ranging from dry eyes and mouth to multi-systemic disorders [1]. The lack of automated means for improving data quality in pSS cohorts, together with the considerable time required for manual curation, leaves the data with irrelevant and incomplete entries, which has undesirable implications for their analysis.

Objectives: To enhance the quality of the clinical data in pSS using automated data curation.

Methods: Anonymized clinical data were collected from 380 patients with pSS from the University of Athens (UoA) cohort (300 patients, mean age 68.79 ± 14.84) and the Harokopio University of Athens (HUA) cohort (80 patients, mean age 59.29 ± 13.92). The features consist of SS-related measures (see [2] for details). The curation tool produces 3 files: (i) a quality report that lists, on a per-feature basis, the metadata, the presence of outliers (detected using the z-score [3]), unknown data types, and missing values (data imputation [4] is used to fix features with < 50% missing values); (ii) the curated dataset, in which inconsistencies are marked using color notations; and (iii) a standardization report, in which the features that share common terminology with those of a reference model [5] (i.e., a set of parameters that describe the pSS minimal requirements) are identified using lexical matching [6].
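The per-feature checks described in the Methods can be illustrated with a minimal sketch. It is not the authors' tool: the function names, the z-score cutoff of 3, and the use of median substitution in place of the imputation method of [4] are all assumptions made for illustration, and the IgM values are invented.

```python
import math
import statistics


def _is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def assess_feature(values, z_threshold=3.0, max_missing=0.5):
    """Report missing rate, z-score outliers, and imputability for one feature.

    z_threshold=3.0 is an assumed cutoff; the abstract does not state one.
    max_missing=0.5 mirrors the "< 50% missing values" rule in the Methods.
    """
    observed = [v for v in values if not _is_missing(v)]
    missing_rate = 1.0 - len(observed) / len(values) if values else 1.0

    outliers = []
    if len(observed) >= 2:
        mean = statistics.fmean(observed)
        sd = statistics.stdev(observed)  # sample standard deviation
        if sd > 0:
            outliers = [v for v in observed if abs(v - mean) / sd > z_threshold]

    return {
        "missing_rate": missing_rate,
        "outliers": outliers,
        "imputable": missing_rate < max_missing,
    }


def impute_median(values):
    """Fill missing entries with the observed median (a simple stand-in)."""
    observed = [v for v in values if not _is_missing(v)]
    if not observed:
        return list(values)
    fill = statistics.median(observed)
    return [fill if _is_missing(v) else v for v in values]


# Hypothetical IgM measurements (mg/dL) with one extreme value and 3 gaps.
igm = [90, 110, 95, 105, 100, 98, 102, 97, 103, 99, 101, 1370, None, None, None]
report = assess_feature(igm)
print(report)
if report["imputable"]:
    igm = impute_median(igm)
```

Note that with the plain z-score a single extreme value inflates the standard deviation, so small samples can hide outliers; the robust alternatives discussed in [3] address this.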

Results: For the UoA cohort, out of 167 features, 80 were classified as “bad”, 30 had an unknown data type, and 12 were marked for outliers (Fig. 1). One example of an outlier was an IgM value of 1370 mg/dL. For the HUA cohort, out of 204 features, 69 were classified as “bad”, 5 had an unknown data type, and 13 were marked for outliers. The standardization process successfully matched 82 out of 88 (93.18%) pSS-related terms for the UoA cohort and 61 out of 69 (88.41%) terms for the HUA cohort.
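To make the standardization step concrete, the sketch below shows one possible form of lexical matching between cohort feature names and reference-model terms. The abstract does not specify the matching algorithm, so the normalization rule, the use of `difflib.SequenceMatcher`, the 0.85 similarity threshold, and all term names here are assumptions for illustration only.

```python
import re
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase a term and collapse punctuation/underscores into spaces."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def match_terms(cohort_terms, reference_terms, threshold=0.85):
    """Map each cohort term to its most similar reference term, or None.

    threshold=0.85 is an assumed cutoff on the SequenceMatcher ratio.
    """
    normalized_refs = {normalize(ref): ref for ref in reference_terms}
    matches = {}
    for term in cohort_terms:
        key = normalize(term)
        best_ref, best_score = None, 0.0
        for ref_key, ref in normalized_refs.items():
            score = SequenceMatcher(None, key, ref_key).ratio()
            if score > best_score:
                best_ref, best_score = ref, score
        matches[term] = best_ref if best_score >= threshold else None
    return matches


def match_rate(matches):
    """Percentage of cohort terms that found a reference term."""
    if not matches:
        return 0.0
    matched = sum(1 for ref in matches.values() if ref is not None)
    return 100.0 * matched / len(matches)


# Hypothetical reference-model terms and cohort column names.
reference = ["Anti-Ro/SSA", "Schirmer test", "Lymphoma"]
cohort = ["anti_ro_ssa", "schirmers_test", "patient_id"]
matches = match_terms(cohort, reference)
print(matches, round(match_rate(matches), 2))
```

Applied to the reported counts, the same rate formula gives 100 × 82 / 88 ≈ 93.18% for the UoA cohort.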

Conclusion: Our strategy enhances the quality of pSS clinical data through automated data curation and reduces the time clinicians need to spend on manual curation. The tool produces reusable reports that can be used to fix inconsistencies, outliers, and missing values, and to harmonize pSS clinical data [6].

Figure 1

An instance of (A) the curated dataset, and (B) the quality assessment report, for the UoA cohort.

References

[1] C. P. Mavragani and H. M. Moutsopoulos, “Sjögren syndrome,” Can. Med. Assoc. J., 2014;186(15):E579-86.

[2] S. Fragkioudaki, et al., “Predicting the risk for lymphoma development in Sjogren syndrome: an easy tool for clinical use,” Medicine, 2016;95(25).

[3] P. J. Rousseeuw and M. Hubert, “Robust statistics for outlier detection,” Wiley Interdiscip. Rev. Data Min. Knowl. Disc., 2011;1(1):73-9.

[4] S. Van Buuren, “Flexible imputation of missing data,” Chapman and Hall/CRC, USA, 2018.

[5] V. C. Pezoulas, et al., “Towards the establishment of a biomedical ontology for the primary Sjögren’s Syndrome,” in IEEE Eng. Med. Biol. Soc., 2018;4089-92.

[6] K. Kourou, et al., “Cohort Harmonization and Integrative Analysis from a Biomedical Engineering Perspective,” IEEE Rev. Biomed. Eng., 2018.

Acknowledgement: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731944 and from the Swiss State Secretariat for Education, Research and Innovation SERI under grant agreement 16.0210.

Disclosure of Interests: Vasileios Pezoulas: None declared; Themis Exarchos: None declared; Aliki Venetsanopoulou: None declared; Evangelia Zampeli: Speakers bureau: Roche, AstraZeneca; Saviana Gandolfo: None declared; Salvatore De Vita: Grant/research support from: Roche, Pfizer, Abbvie, Novartis, BMS, MSD, Celgene, Janssen, Consultant for: Roche; Foteini N. Skopouli: None declared; Athanasios Tzioufas: Grant/research support from: ABBVIE, PFIZER, AMGEN, NOVARTIS, GSK; Dimitrios Fotiadis: None declared
