Blog | Friday, August 9, 2013

The limits to de-identification of clinical data

There is a wealth of opportunity for putting digital clinical data to use for better understanding health and disease as well as improving health care delivery, consistent with the vision from the Institute of Medicine of the "learning health system" [1]. Yet as we have seen from the recent news around the US government monitoring of phone call metadata and Internet data, there are serious concerns about the misuse of digital data. Those concerns could be a deal-breaker for health-related data if we do not address privacy and security head-on.

Concerns about privacy and security of health data are quite valid. Barely a day goes by without news of another data breach in a health care organization, with those large enough going on the "wall of shame" of the U.S. Department of Health and Human Services Office for Civil Rights (OCR). These concerns are demonstrated well in the famous but eerie ACLU pizza video. A recent study also shows it is quite easy to discern attributes about people, including those that may be health-related, based on what they share on their Facebook wall [2].

If we want to get the health-related benefits of clinical data, we must address privacy and security issues. We need not only strict regulation of what can and cannot be done with data, but also an ethos around its responsible use. However, if we address those issues appropriately, then perhaps we should be, as pointed out by Zak Kohane recently, demanding "more surveillance" of medical records for health-related purposes.

When it comes to protecting data, we need to be realistic about what does and does not work. One solution commonly proposed is "de-identification" of data, i.e., removal of elements that identify individuals. There is certainly a role for the use of de-identified data in many types of analysis of clinical data. There are, however, limits to the use of de-identified data.

The problems of de-identified data are two-fold. First, as famously shown by Latanya Sweeney over a decade ago, data de-identified in one database can be combined with data in other sources to re-identify people, including the Governor of Massachusetts [3]. She recently demonstrated this again with a study of people who volunteered their data for the Personal Genome Project [4]. Cimino has shown that groups of lab test results (e.g., chemistry panels) can allow re-identification of people [5]. The bottom line is that we are awash in data that can allow re-identification of people, and the problem will only be exacerbated by the ultimate personal identifier that will soon be available, namely the variants in our own genome.
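To make the linkage problem concrete, here is a minimal Python sketch of the general idea: joining a "de-identified" table to a named public list on shared fields. Everything in it is an assumption for illustration. The records are made up, the choice of ZIP code, birth date, and sex as quasi-identifiers is mine, and the `k_anonymity` and `link` helpers are hypothetical names, not code from any of the cited studies.

```python
from collections import Counter

# Synthetic, made-up records for illustration only.
# "De-identified" clinical rows: names removed, quasi-identifiers kept.
clinical = [
    {"zip": "02138", "birth_date": "1950-03-02", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1980-01-15", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1980-01-15", "sex": "F", "diagnosis": "diabetes"},
]

# A hypothetical public list (for example, a voter roll) that carries names.
public = [
    {"name": "Person A", "zip": "02138", "birth_date": "1950-03-02", "sex": "M"},
    {"name": "Person B", "zip": "02139", "birth_date": "1980-01-15", "sex": "F"},
    {"name": "Person C", "zip": "02139", "birth_date": "1980-01-15", "sex": "F"},
]

# Assumed quasi-identifiers: fields that are not names but narrow people down.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key(row):
    """Build the quasi-identifier tuple used to compare rows."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def k_anonymity(rows):
    """Return k: the size of the smallest group sharing one quasi-identifier tuple."""
    return min(Counter(key(r) for r in rows).values())


def link(clinical_rows, public_rows):
    """Return (name, diagnosis) pairs whose quasi-identifiers match exactly
    one person in the public list. Assumes the public list is complete."""
    public_counts = Counter(key(r) for r in public_rows)
    names = {key(r): r["name"] for r in public_rows if public_counts[key(r)] == 1}
    return [
        (names[key(r)], r["diagnosis"])
        for r in clinical_rows
        if key(r) in names
    ]


reidentified = link(clinical, public)
print("k =", k_anonymity(clinical))
print(reidentified)
```

In this toy data, one person is unique on the three fields and is re-identified by name, while the two people who share the same values are not. That is the intuition behind k-anonymity [3]: a k of 1 means at least one record stands alone on its quasi-identifiers.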

But perhaps the more important limitation is that data that is truly de-identified, i.e., to the point it cannot be re-identified, may be too incomplete to use in a comprehensive manner. This is mainly because people get health care at different places [5,6]. While we may be able to re-identify data within an organization, it is typically difficult when data goes into multi-organization repositories. Again, while this de-identified data may be perfectly fine for some purposes, it does not give us the longitudinal data of which we might want to ask more complex questions.
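A small Python sketch may help show why truly de-identified data loses the longitudinal picture. It assumes, purely for illustration, that each organization replaces patient identifiers with its own random pseudonyms before contributing to a shared repository. The `deidentify` helper, the field names, and the two made-up visits are all hypothetical.

```python
import secrets


def deidentify(visits):
    """Swap each real patient identifier for a random pseudonym.

    The pseudonym table lives only inside this call, so two organizations
    that each run it independently produce unrelated pseudonyms.
    """
    pseudonyms = {}
    out = []
    for visit in visits:
        pid = pseudonyms.setdefault(visit["patient"], secrets.token_hex(8))
        out.append({"pid": pid, "year": visit["year"], "event": visit["event"]})
    return out


# Synthetic example: the same person seen at two different organizations.
clinic = [{"patient": "patient-1", "year": 2011, "event": "diabetes diagnosed"}]
hospital = [{"patient": "patient-1", "year": 2013, "event": "hospital admission"}]

# Each organization de-identifies on its own, then the data is pooled.
repository = deidentify(clinic) + deidentify(hospital)

# Group events by pseudonym to reconstruct per-patient histories.
histories = {}
for row in repository:
    histories.setdefault(row["pid"], []).append(row["event"])

print(len(histories), "apparent patients for 1 real person")
```

The pooled repository sees two unrelated patients with one event each, so a question such as "how many people diagnosed at one site were later admitted at another" cannot be answered from it.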

While recent events may give us a jaundiced view of the uses of data, hopefully that view will moderate as we start to see the benefits. Many of those benefits may emanate from health care, making it imperative that we both address privacy and security concerns seriously and put such data to beneficial use.


1. Smith M, Saunders R, Stuckhardt L, and McGinnis JM, Best Care at Lower Cost: The Path to Continuously Learning Health Care in America. 2012, Washington, DC: National Academies Press.
2. Kosinski M, Stillwell D, and Graepel T, Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013. 110: 5802-5805.
3. Sweeney L, k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002. 10: 557-570.
4. Sweeney L, Abu A, and Winn J, Identifying participants in the personal genome project by name. Social Science Research Network, 2013.
5. Cimino JJ, The false security of blind dates: chrononymization’s lack of impact on data privacy of laboratory data. Applied Clinical Informatics, 2012. 3(4): 392-403.