A Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records by Using Bidirectional Encoder Representations From Transformers: Development and Validation Study

Authors:	Maarten Homburg; Eline Meijer; Matthijs Berends; Thijmen Kupers; Tim Olde Hartman; Jean Muris; Evelien de Schepper; Premysl Velek; Jeroen Kuiper; Marjolein Berger; Lilian Peters
Document type:	Article
Publication date:	2023
Series/Periodical:	Journal of Medical Internet Research, Vol 25, p e49944 (2023)
Publisher:	JMIR Publications
Keywords:	Computer applications to medicine. Medical informatics / R858-859.7 / Public aspects of medicine / RA1-1270
Language:	English
Permalink:	https://search.fid-benelux.de/Record/base-26626990
Data source:	BASE; original catalog
Link(s):	https://doi.org/10.2196/49944

Background: Natural language processing (NLP) models such as bidirectional encoder representations from transformers (BERT) hold promise in revolutionizing disease identification from electronic health records (EHRs) by potentially enhancing efficiency and accuracy. However, their practical application in practice settings demands a comprehensive and multidisciplinary approach to development and validation. The COVID-19 pandemic highlighted challenges in disease identification due to limited testing availability and challenges in handling unstructured data. In the Netherlands, where general practitioners (GPs) serve as the first point of contact for health care, EHRs generated by these primary care providers contain a wealth of potentially valuable information. Nonetheless, the unstructured nature of free-text entries in EHRs poses challenges in identifying trends, detecting disease outbreaks, or accurately pinpointing COVID-19 cases.

Objective: This study aims to develop and validate a BERT model for detecting COVID-19 consultations in general practice EHRs in the Netherlands.

Methods: The BERT model was initially pretrained on Dutch language data and fine-tuned using a comprehensive EHR data set comprising confirmed COVID-19 GP consultations and non–COVID-19–related consultations. The data set was partitioned into a training and development set, and the model’s performance was evaluated on an independent test set that served as the primary measure of its effectiveness in COVID-19 detection. To validate the final model, its performance was assessed through 3 approaches. First, external validation was applied on an EHR data set from a different geographic region in the Netherlands. Second, validation was conducted using results of polymerase chain reaction (PCR) test data obtained from municipal health services. Lastly, correlation between predicted outcomes and COVID-19–related hospitalizations in the Netherlands was assessed, encompassing the period around the outbreak of the pandemic in the Netherlands, that is, ...
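
The third validation approach compares model predictions with hospitalization figures. As a rough illustration only, the sketch below computes a Pearson correlation coefficient between two weekly series. The abstract does not state which correlation measure or time granularity the authors used, so the choice of Pearson, the weekly aggregation, the function name `pearson`, and the toy numbers are all assumptions made for this example and are not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series.

    Assumption: Pearson is used here purely for illustration; the study's
    actual correlation measure is not specified in the abstract.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up toy values (NOT study data): weekly counts of consultations the
# model flagged as COVID-19, and weekly COVID-19-related hospitalizations.
predicted_per_week = [3, 8, 21, 40, 35]
hospitalizations_per_week = [1, 4, 10, 22, 18]

r = pearson(predicted_per_week, hospitalizations_per_week)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 would indicate that the predicted consultation counts rise and fall together with hospitalizations; it says nothing about the correctness of individual predictions, which is what the test set and external validation address.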