Record linkage in health data : a simulation study.

Author(s)
Ariel, A. Bakker, B.F.M. Groot, M.C.H. de Grootheest, G. van Laan, J. van der Smit, J. & Verkerk, B.
Year
Abstract

Record linkage is becoming more and more common in statistical and academic research. Linking records makes it possible to combine data from different sources to answer research questions that are very difficult to answer using data from just one source. The advantages of combining different sources have been demonstrated by, among others, Newcombe et al. (1959); Wallgren and Wallgren (2007); and Bakker and Daas (2012). In many situations, record linkage is an efficient way to collect data and can reduce the inconvenience of asking sensitive questions (Fournel et al., 2009; Herings, 1993). The challenge in record linkage is to link records that belong to the same individual from different sources. Missed links lead to the same problems as non-response in surveys. If certain groups of individuals are more difficult to link, estimates could be biased. Similarly, incorrect links, defined as combining the information of two different persons into one record, lead to errors that are similar to measurement error (Bakker and Daas, 2012). The quality of linkage procedures, and thus the reliability of the linked datasets, is difficult to determine, and this constitutes a major issue in record linkage. In health research, linkage has become a popular way to combine data, despite the sensitivity of the information and strict regulations for preventing disclosure of information. Biobanks – collections of biomedical samples with medical, genetic and/or genealogic data (see glossary of terms) – can be greatly enriched by linking them to certain registers, for example for assessing the effect of exposures on health outcomes (Pukkala, 2008). The same holds for longitudinal cohort health data, i.e. information about particular groups of persons. In the Netherlands, many high-quality medical and socioeconomic registers, covering more general population groups, are available for linkage to biobanks and cohorts (Bakker, 2002).
The potential of record linkage in health research has been extensively demonstrated (Vink et al., 2006; Eussen et al., 2010; Bozkurt et al., 2009; Schelleman et al., 2006; Bergman et al., 2000). For example, linkage of the Netherlands Cancer Register to the nationwide Dutch Pathology database (PALGA) has proven useful for studying the risk and prognosis of endometrial cancer after treatment with tamoxifen (Bergman et al., 2000). Also, linking pharmacy records to biobanks has provided an opportunity to investigate the interactions between thiazide diuretics and genetic variation in the renin-angiotensin system on the risk of type 2 diabetes mellitus (Bozkurt et al., 2009). Although record linkage has proven its value in research, it is not simply a matter of following a protocol. Researchers who intend to enrich their data with information from another source need to choose an approach that takes into account the available identifying variables in both sources, a linkage algorithm that combines records based on those variables, and all ethical and legal issues involved. In the present paper, we demonstrate the influence that the choice of variables and linkage algorithms has on linkage results, but also the importance of the properties of the data sources. The current paper was written as part of Biolink NL, one of the so-called rainbow projects funded by the Dutch Biobanking and Biomolecular Resources Research Infrastructure (BBMRI-NL). BBMRI-NL aims to stimulate collaboration and data sharing between research institutes (mostly biobanks), building on existing infrastructures, resources and technologies. The Biolink NL project is a combined effort of researchers from a number of academic research institutes and Statistics Netherlands. In general, universities and research institutes wishing to enrich their data by linking their own source to external data sources face three challenges.
Firstly, privacy laws restrict the use of personal identifiers that can be used to identify the same person in different datasets. Research cohorts are not allowed to store the National Identification Number (NIN, in Dutch: Burger Service Nummer, BSN or Citizen Service Number) in any form. Typically, only personal identifiers such as name, date of birth, etc. are available in their data. Data linkage based on NIN requires permission from the authority concerned and is regulated by a strict protocol to warrant confidentiality. However, even when the NIN can be used for linking, many biobanks or cohorts contain individuals without a NIN. Secondly, as both biobanks and registers are governed by the statutory legal framework, linking them to other registers may be restricted by law. In addition, access to individual medical registers and biobanks is controlled by various parties with different regulations and committees. Some biobanks use an informed consent procedure that allows their data to be linked with other registers, whereas others do not have such an explicit consent procedure. Thirdly, and this is not only a challenge for universities and research institutes but also for statistical offices, there is an emerging need to assess linkage quality. Linkage quality is seldom assessed and almost never on a regular basis after implementation of a new or changed linkage procedure. Linkage quality can be determined by means of a validation, typically by comparing the consistency or the plausibility of research variables (also known as content variables, for example disease history, medicine use, etc.) in the linked records, if the access to such variables is not restricted. While such a procedure is legitimate, one should be aware of potential discrepancies between the content variables due to possible differences in definitions. Moreover, if the target population is changed, the validation must be performed again.
Most of the aforementioned challenges apply to biobanks and research institutes, but not to Statistics Netherlands, which has a unique legal position allowing it to use the NIN for linkage purposes. The compilation of social statistics, including health statistics, is largely based on linked register data (Arts et al., 2000b; de Bruin et al., 2003). These linkages are based on the NIN and therefore linkage quality is high, but Statistics Netherlands is still very interested in the further development of linkage methods. Nowadays, big data mean that large volumes of information are becoming available alongside registers and survey data. The possibilities for linking such data are limited, because the number of potential linkage variables is usually small, and unique identifiers are missing. In this respect, the challenge is to develop new linkage methods that take these aspects into account. The main goal of the present study is to compare the performance of various record linkage methods in health data when only personal identifiers can be used for linkage. The study findings will be published in two white papers. The present – first – paper compares the properties of different linkage methods using simulation datasets, while a second report will be published after real datasets have been linked in three or four demonstration projects. Previous studies have shown that the combination of personal identifiers provides a feasible alternative if a unique identifier such as the NIN is not available (see for example, Van den Brandt et al., 1990a; Pasquali et al., 2010; Meray et al., 2007; Newcombe et al., 1992). The strength of such combinations is determined by the number of identifiers included, as well as their individual discriminative power (Newcombe et al., 1992; Reitsma, 1999).
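As an illustration of how the number of identifiers and their discriminative power interact, the following minimal sketch counts the share of records that a given combination of identifiers identifies uniquely. It is not the method used in the paper: the toy records, the field names (sex, birth_date, postal_code) and the helper unique_fraction are assumptions made purely for this example.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"sex": "F", "birth_date": "1980-01-15", "postal_code": "1011AB"},
    {"sex": "F", "birth_date": "1980-01-15", "postal_code": "2511CD"},
    {"sex": "M", "birth_date": "1980-01-15", "postal_code": "1011AB"},
    {"sex": "M", "birth_date": "1975-06-30", "postal_code": "3011EF"},
    {"sex": "M", "birth_date": "1975-06-30", "postal_code": "3011EF"},
]


def unique_fraction(rows, fields):
    """Share of rows whose combined key value occurs exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Compare single identifiers with increasingly rich combinations.
for fields in [("sex",), ("birth_date",), ("sex", "birth_date"),
               ("sex", "birth_date", "postal_code")]:
    print(fields, unique_fraction(records, fields))
```

In this toy file neither sex nor date of birth alone identifies anyone uniquely, while adding identifiers raises the share of uniquely identified records. The last two records stay indistinguishable even on the full key, which is the situation in which a deterministic key cannot decide between candidates.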
Although it is tempting to increase the power of the linkage key by combining as many identifiers as possible, in practice variable values contain errors and may change in the course of time, leading to discrepancies in the linkage keys. Moreover, some personal information may not be used as a linkage variable because of privacy concerns. For example, the identifier surname can be a powerful linkage variable. At the same time, this identifier is error-prone and considered highly sensitive, which restricts its usage even when encrypted. In most situations, the use of this identifier requires additional work such as pre-processing in order to reduce any inconsistency due to either spelling variation or typographical errors. Therefore, it is necessary to recognize in which situations surname should be included for linking. In this study, we investigate the performance of record linkage methods when certain combinations of personal identifiers are used as linkage variables, taking into consideration that these are not error-free. We select a set of identifiers likely to be present in real data. Another aspect affecting linkage success is the size of the data sources involved. For example, linking large datasets may increase the likelihood of linking the wrong records; it is important to take this into account, as this project comprises data sources of various sizes. The overall goal of our study is to improve existing record linkage practice, with the following sub-goals:
1. To identify which combinations of personal identifiers are indispensable to obtain an acceptable proportion of correct links;
2. To compare the performance of deterministic and probabilistic approaches;
3. To describe the influence of dataset size and quality on linkage results.
For both practical and privacy-related reasons, we first evaluate record linkage methods using simulated datasets, in which the true links are known.
We compare their performance with different combinations of linkage variables, focusing on identifiers commonly available in cohorts and registers. Because in reality very few databases are completely error-free, errors were introduced into the simulation as well. We intend to apply the same linkage methods to real data and work together with researchers who have more detailed knowledge of the research topic in the near future. Using these real datasets, we plan to identify which population subgroups, if any, are more difficult to link than others, and hence could give rise to selection bias and inaccurate research outcomes. The findings of these linkages will be presented in a separate white paper. This paper consists of five chapters. In the following chapter we introduce the basic theory of record linkage methods. Chapter 3 is a short literature review that focuses on record linkage methodology. Chapter 4 describes in detail how the datasets for linkage simulation were created in such a way that these resemble existing data in biobanks and registers, including specific population characteristics and varying data quality. Subsequently, the performance of different linkage approaches is compared, using these simulated files. In short, the following steps were taken:
1. Linkage variables selection. We want to link records using identifiers that are highly discriminative when combined and that are commonly available in registers and biobanks. Content-specific variables, such as types of disease, should be used only as optional linkage variables or as a tool to validate the linking results.
2. Dataset simulation. Different registers and biobanks cover different parts of the population. For example, the general population register (in Dutch: Gemeentelijke Basisadministratie personen, GBA) covers the vast majority of the Dutch population, while a specific disease cohort register covers a specific part of the population and does not necessarily reflect the Dutch population. Because of these differences, a particular linkage strategy may work perfectly for a certain type of register (or combination thereof), but might be less suitable for another type. Because our goal is to examine a linkage strategy that can handle different types of registers and biobanks, it is desirable to test the same methodology on various types of data:
- A dataset covering the population in general (such as the GBA)
- A dataset covering a specific part of the population (such as specific disease registrations)
- A dataset covering a very specific part of the population (such as birth cohort, females, twins register)
We created simulation datasets that have the properties of the specific datasets proposed above. Chapter 4 describes how, and Appendix I contains more details.
3. Data error simulation. To simulate various degrees of data quality, we introduced errors into the identifiers. For example, the postal code may not be up-to-date and the date of birth may not always be known for non-natives (Arts et al., 2000). Furthermore, we introduced realistic typographical errors (Oberaigner, 2007; Christen and Pudjijono, 2009).
4. Record linkage simulation. We evaluated the chosen linkage methods in a number of scenarios based on both availability and quality of the linking variables, as well as different overlaps between data sources.
The final chapter summarises the conclusions from the simulation study. (Author/publisher)
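The error simulation and linkage evaluation steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the simulation design of the paper: the identifiers (sex, date of birth, postal code), the single-character substitution model, the restriction of errors to the postal code, full overlap between the two files, and all function names (introduce_typo, simulate_files, deterministic_link, evaluate) are hypothetical choices made for this example.

```python
import random
import string
from collections import Counter


def introduce_typo(value, rng):
    """Substitute one character at a random position (a simple typo model)."""
    pos = rng.randrange(len(value))
    pool = string.digits if value[pos].isdigit() else string.ascii_uppercase
    replacement = rng.choice([c for c in pool if c != value[pos]])
    return value[:pos] + replacement + value[pos + 1:]


def simulate_files(n, error_rate, seed=0):
    """Create file A and an error-prone copy B; equal ids mark the true links."""
    rng = random.Random(seed)
    file_a = {}
    for person_id in range(n):
        sex = rng.choice("MF")
        birth = f"{rng.randint(1940, 2000)}{rng.randint(1, 12):02d}{rng.randint(1, 28):02d}"
        postal = str(rng.randint(1000, 9999)) + "".join(
            rng.choice(string.ascii_uppercase) for _ in range(2))
        file_a[person_id] = (sex, birth, postal)
    file_b = {}
    for person_id, (sex, birth, postal) in file_a.items():
        # Assumption: only the postal code is perturbed in this sketch.
        if rng.random() < error_rate:
            postal = introduce_typo(postal, rng)
        file_b[person_id] = (sex, birth, postal)
    return file_a, file_b


def deterministic_link(file_a, file_b):
    """Link records whose full key agrees exactly and is unique in both files."""
    index_b = {}
    for id_b, key in file_b.items():
        index_b.setdefault(key, []).append(id_b)
    count_a = Counter(file_a.values())
    links = []
    for id_a, key in file_a.items():
        candidates = index_b.get(key, [])
        if len(candidates) == 1 and count_a[key] == 1:
            links.append((id_a, candidates[0]))
    return links


def evaluate(links, n_true):
    """Count correct, incorrect and missed links against the known truth."""
    correct = sum(1 for id_a, id_b in links if id_a == id_b)
    return {"correct": correct,
            "incorrect": len(links) - correct,
            "missed": n_true - correct}


for rate in (0.0, 0.1, 0.3):
    a, b = simulate_files(1000, rate, seed=1)
    print(rate, evaluate(deterministic_link(a, b), len(a)))
```

Because the true links are known by construction, missed and incorrect links can be counted directly. With exact matching on the full key, every record that receives an error becomes a missed link, which is the kind of sensitivity to data quality that the simulation scenarios are meant to quantify.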

Publication

Library number
20170438 ST [electronic version only]
Source

The Hague, Statistics Netherlands CBS, 2014, 64 p., 77 ref.; 2014 Scientific Paper
