Record linkage for health studies : three demonstration projects.

Author(s)
Grootheest, G. van; Groot, M.C.H. de; Laan, D.J. van der; Smit, J.H. & Bakker, B.F.M.
Year
Abstract

Record linkage is becoming more and more common in statistical and academic research. Linkage of records makes it possible to combine data from different sources to answer research questions that are very difficult or impossible to answer using data from just one source. Although linkage can be regarded as a more efficient way of obtaining data than setting up a new collection, it is important to understand the technical, methodological and legal restrictions that may apply. The project Biolink NL aims to report on the methodological, technical and legal aspects of record linkage of health data in the Netherlands. Biolink NL is one of the so-called rainbow projects funded by the Dutch Biobanking and Biomolecular Resources Research Infrastructure (BBMRI-NL). BBMRI-NL aims to stimulate collaboration and data sharing between research institutes (mostly biobanks), building on existing infrastructures, resources and technologies. The Biolink NL project is a combined effort of researchers from a number of academic research institutes and Statistics Netherlands (CBS).
In a previous paper (Ariel et al., 2014), we reviewed the existing literature and compared the performance of different combinations of data sets, linkage variables, and linkage algorithms in a simulation study. Following this comparison of linkage approaches in a simulated setting where true links are known, the next step is to apply the same linkage methods to real datasets in which error rates and true links are unknown. In the current paper we demonstrate the feasibility of record linkage in a number of demonstration projects that include health care data. These demonstration projects have been chosen in such a way that they differ from each other in terms of population characteristics, time span of data collection, number of records per dataset, and the way the data are collected.
We describe how different approaches perform under these real-life circumstances, compare how much work needs to be invested in each approach, and in the end provide a practical guide for researchers who wish to link their data to external registrations.
There are several advantages of studying linkage methods in a simulation rather than in real datasets. First, a simulation gives full control over the characteristics of the datasets. The researchers can add linkage variables, change the number of records, or introduce any type of error. The second advantage is that the true matches between the created datasets are known, which means that the number of correct links, incorrect links, and missed links can be precisely reported. Based on the simulations that we performed, we described how different characteristics of the datasets influence linkage performance, and that some approaches are more susceptible to variations in these characteristics than others.
One of the determinants of linkage success is the algorithm that is used. The most basic approach is deterministic linkage, which looks for exact matches between variables in two datasets. More flexibility can be reached with a probabilistic algorithm that gives weights to any similarity between record pairs. A pair is considered a link when a certain threshold is reached. A higher sensitivity can be reached by choosing a lower threshold. However, this will also increase the chance of creating false links. Generally speaking, probabilistic methods can identify more links than deterministic linkage, but an appropriate threshold must be chosen to avoid incorrect links.
The second parameter that influences linkage results is the choice of linkage variables. Although the government and health care providers use a national identification number (NIN, in Dutch: Burger Service Nummer, BSN) for each citizen of the Netherlands, research cohorts are not allowed to process this number.
When the use of the BSN is not possible, personal information such as sex, date of birth, name and address must be used.
Thirdly, the number of records in the two datasets and the overlap between them are important factors. A research cohort typically has fewer records than the database from which additional information is retrieved, while the latter does not cover the entire population either. In other words, the overlap is generally less than 100 percent and it is unknown which records should have a match in the other dataset. In general, both the sensitivity and precision of probabilistic linkage decrease as the overlap becomes smaller and the datasets become larger.
Fourthly, no dataset is free of error. Discrepancies between two datasets may be caused either by incorrect data entry or by the change of variables over time. Obviously, sensitivity drops as the error rate increases. The effect of error on precision, however, depends on the size of the datasets and their overlap. Unfortunately, it can be difficult to estimate how much error the linkage variables contain. Best linkage results are achieved if both datasets have been created or updated around the same time, and if the address history is recorded. Additionally, pre-processing can help to standardise variables and remove common spelling mistakes.
In this paper we describe three different linkage projects. The first two consist of an academic research cohort linked to a larger (non-academic) registry; the third entails the linkage of two datasets that both contain millions of records.
1. The Netherlands Twin Register (NTR) linked with the Achmea Health Database (AHD).
2. The KOALA cohort linked with a number of pharmacies in the database of the Stichting Farmaceutische Kengetallen (SFK).
3. The population register (Basisregistratie Personen, BRP, formerly the Gemeentelijke Basisadministratie) linked with the employment register (ER, in Dutch: Werknemersbestand).
The datasets selected for these demonstration projects differ greatly in size and coverage of the general population. Moreover, each of the linkages has certain features that pose a unique challenge. For example, the NTR consists of young twins, who share most of their personal information, such as address and date of birth. The SFK only has access to anonymised data, and linkage of the employment register with the population register is challenging mostly because of the large number of records.
The aim of the current study is to establish whether data linkage can be an effective and efficient way to enrich research cohorts with additional information from external sources. The quality of such enrichment is crucial when addressing research questions that would otherwise be impossible to answer, or could only be addressed in smaller samples or with lower precision. The current paper consists of several chapters, in which we describe how the choice of the combined datasets, linkage methods, and linkage variables affects the feasibility of each linkage project and the reliability of the results. In this light, we not only evaluate the quality of the linked datasets and try to answer the cohort's research question, but also describe the work invested in each demonstration project. In the last chapter we summarise the results and provide a number of recommendations on how to go about record linkage in diverse situations. (Author/publisher)
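
The contrast drawn in the abstract between deterministic linkage, probabilistic linkage with a threshold, and the resulting sensitivity and precision can be illustrated with a minimal Python sketch. It is not the method used in the paper: the records, the field names, the m/u agreement probabilities and the thresholds below are all hypothetical values chosen for illustration, and the weighting follows a simple Fellegi-Sunter-style scheme with exact field comparison only.

```python
from math import log2

# Hypothetical linkage variables with ASSUMED probabilities (not from the paper):
# m = P(field agrees | same person), u = P(field agrees | different persons)
FIELDS = {
    "sex": (0.99, 0.50),
    "birth_date": (0.97, 0.01),
    "surname": (0.90, 0.02),
    "postcode": (0.80, 0.05),
}

def deterministic_link(a, b):
    """Accept a pair only when every linkage variable matches exactly."""
    return all(a[f] == b[f] for f in FIELDS)

def match_weight(a, b):
    """Sum log2 agreement or disagreement weights over all linkage variables."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if a[field] == b[field]:
            total += log2(m / u)
        else:
            total += log2((1 - m) / (1 - u))
    return total

def probabilistic_link(threshold):
    """Build a rule that accepts pairs whose weight reaches the threshold."""
    return lambda a, b: match_weight(a, b) >= threshold

def link(cohort, registry, is_link):
    """Compare all pairs and return the accepted (cohort_id, registry_id) set."""
    return {(a["id"], b["id"]) for a in cohort for b in registry if is_link(a, b)}

def sensitivity_precision(links, true_links):
    """Sensitivity = correct / true links; precision = correct / accepted links."""
    correct = len(links & true_links)
    sensitivity = correct / len(true_links) if true_links else 0.0
    precision = correct / len(links) if links else 0.0
    return sensitivity, precision

def rec(rid, sex, birth_date, surname, postcode):
    return {"id": rid, "sex": sex, "birth_date": birth_date,
            "surname": surname, "postcode": postcode}

# Toy data: c2 has moved (postcode differs); c3 has no counterpart in the registry.
cohort = [
    rec("c1", "F", "1990-03-14", "Jansen", "1011AB"),
    rec("c2", "M", "1985-07-02", "de Vries", "3511CD"),
    rec("c3", "F", "1990-03-14", "Bakker", "2511EF"),
]
registry = [
    rec("r1", "F", "1990-03-14", "Jansen", "1011AB"),
    rec("r2", "M", "1985-07-02", "de Vries", "3512XY"),
    rec("r3", "F", "1990-03-14", "Visser", "9711GH"),
    rec("r4", "M", "1972-11-30", "Smit", "6811JK"),
]
true_links = {("c1", "r1"), ("c2", "r2")}  # known only because the data are simulated

results = {
    "deterministic": link(cohort, registry, deterministic_link),
    "probabilistic, threshold 8": link(cohort, registry, probabilistic_link(8.0)),
    "probabilistic, threshold 0": link(cohort, registry, probabilistic_link(0.0)),
}
for name, links in results.items():
    print(name, sorted(links), sensitivity_precision(links, true_links))
```

On this toy data the deterministic rule misses the person who moved, the probabilistic rule with the higher threshold recovers both true links, and the lower threshold additionally accepts pairs that merely share sex and date of birth, which lowers precision. Such an evaluation is only possible when the true links are known, as in a simulation.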

Publication

Library number
20170439 ST [electronic version only]
Edition

The Hague/Heerlen/Bonaire, Statistics Netherlands CBS, 2015, 71 p., 41 ref.; 2015 Scientific Paper - ISBN 978-90-357-2025-1

Our collection

This publication is one of the other publications that we hold in our collection alongside the SWOV publications.