Using a probabilistic model to assist merging of large-scale administrative records.

Author(s)
Enamorado, T., Fifield, B. & Imai, K.
Year
Abstract

Since most social science research relies upon multiple data sources, merging data sets is an essential part of researchers' workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data sets may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. The authors develop a faster and more scalable algorithm to implement a canonical probabilistic model of record linkage that has many advantages over the deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. The authors conduct comprehensive simulation studies to evaluate the performance of their algorithm in realistic scenarios. They also apply their methodology to merging campaign contribution records, survey data, and nationwide voter files. They provide open-source software for implementing the proposed methodology. (Author/publisher)

Publication

Library number
20200513 ST [electronic version only]
Source

SSRN, published online 8 August 2018, 39 p., ref.
