What does it mean to deidentify data?
Many journals, including PLoS Medicine, have policies regarding the availability of datasets underpinning published research findings. For example, PLoS journal policies require the “…agreement of authors to make freely available any materials and information described in their publication that are reasonably requested by others for the purpose of academic, non-commercial research…”. And the Annals of Internal Medicine publishes a statement with each paper declaring the willingness of authors to share study protocols, statistical code, and underlying datasets.
However, when those datasets result from clinical or public health research, the question of patient confidentiality arises. Attempts to define what is meant by “reproducible research” in epidemiology have led some authors to state that “Under our definition, it would seem impossible to simultaneously honor those promises [of confidentiality] and make one’s research reproducible” (e.g., http://aje.oxfordjournals.org/cgi/content/full/163/9/783).
A recently published paper in the journal Trials attempts to set out a standard for deidentifying datasets (mainly from clinical trials) so that they can be freely published alongside articles. The authors “define a dataset as that containing the minimum level of detail necessary to reproduce all numbers reported in the paper”. In addition, they collate a set of 28 datatypes that are either directly or potentially identifying. The authors propose that if a dataset contains three or more “indirect identifiers”, researchers should seek advice from an appropriate oversight body (such as their ethics committee) before releasing the dataset.
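The rule of thumb above can be sketched as a simple pre-release check. This is an illustrative Python sketch, not the paper's method: the identifier list here is a small invented sample, not the full set of 28 datatypes the authors collate, and the threshold of three follows their proposal.

```python
# Hypothetical pre-release screen: flag a dataset for oversight review
# if it contains three or more indirect identifiers, following the rule
# proposed in the Trials paper. The identifier set below is a small
# illustrative sample, NOT the paper's full list of 28 datatypes.
INDIRECT_IDENTIFIERS = {
    "sex", "age", "ethnicity", "occupation", "postcode",
    "marital_status", "education", "place_of_treatment",
}

def needs_oversight_review(columns, threshold=3):
    """Return the indirect identifiers found among the dataset's
    columns, and whether their count meets the review threshold."""
    found = sorted(c for c in columns if c.lower() in INDIRECT_IDENTIFIERS)
    return found, len(found) >= threshold

# Example: three indirect identifiers present, so review is advised.
found, review = needs_oversight_review(
    ["patient_id", "sex", "age", "postcode", "outcome"]
)
```

In practice such a check would only be a prompt for human judgment; whether a given field is identifying depends on context (sample size, rarity of values) that no column-name lookup can capture.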
Although this guidance offers a clear starting point for researchers, a field of “de-identification science” seems to be rapidly emerging (as highlighted in a recent Guardian article and PNAS paper) in which researchers attempt to tie together public data sources to see whether they can identify the individuals within datasets. These efforts suggest that even removing obvious identifiers may not be enough to protect individuals. One possible way forward may be to ask participants in research studies to consent prospectively to release of the dataset, although even this approach has limitations.
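The core of such re-identification attempts is that a combination of seemingly innocuous fields can be unique within a dataset, and therefore linkable to an outside source that shares those fields. A minimal sketch, using invented data (the records and field names below are hypothetical, not from any study):

```python
# Illustrative sketch with invented data: after names are removed, a
# combination of "indirect" fields may still single out one row, which
# an attacker could then match against an external data source.
from collections import Counter

records = [
    {"sex": "F", "birth_year": 1975, "postcode": "SW1"},
    {"sex": "F", "birth_year": 1975, "postcode": "SW1"},
    {"sex": "M", "birth_year": 1962, "postcode": "EC2"},  # unique combination
]

def unique_combinations(rows, keys):
    """Count how many rows share each combination of the given keys;
    combinations that occur only once are re-identification risks."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]

singletons = unique_combinations(records, ["sex", "birth_year", "postcode"])
```

Here the third record is the only one with its sex/birth-year/postcode combination, so releasing the dataset exposes that individual to linkage even though no name or direct identifier is present.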