A Very Personal Genome Project
The data and samples that participants share in the Personal Genome Project (PGP) are considered highly identifiable. A defining feature of any implementation of the Personal Genome Project is the absence of anonymity.
From our guidelines for PGP implementations:
“Non-anonymous. The risks of participant re-identification are addressed up front, as an integral part of the consent and enrollment process; neither anonymity nor confidentiality of participant identities or their data are promised to research participants.”
We have designed a consent process that includes many layers of upfront and ongoing education about the unique nature of public genomics research studies like the PGP. One of the important messages to participants is that their data are highly identifiable and therefore not “anonymous”. For example, the study guide that accompanies our mandatory entrance exam provides one of the more famous examples of how only a few pieces of demographic data can reveal a person’s identity:
From the PGP Study Guide:
“Identities can be discovered with surprisingly little information — for example, the combination of sex, birth date and ZIP code is specific enough to be uniquely identifying information for 87% of people!”
We know that hands-on demonstrations of otherwise abstract concepts can be extremely valuable for learning. Talking about a “personal genome” in the abstract is a far different experience from wading through the millions of variants contained in your very own personal genome sequence! So, to enhance understanding of identifiability, we invited two research groups to demonstrate how re-identification is possible using public PGP data during GET Labs in Boston (April 25-26).
Latanya Sweeney’s Data Privacy Lab drew upon her pioneering work on the identifiability of demographic data to show how these techniques can be applied to public PGP profiles containing sex, birth date, and ZIP code. It was no surprise to find that many PGP participants are, in fact, identifiable. Indeed, all PGP participants should expect this potential outcome.
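To make the idea concrete, here is a minimal sketch of why the combination of sex, birth date, and ZIP code is so revealing. It does not reproduce the Data Privacy Lab's method; it only counts how many records in a small, made-up set of profiles have a combination of those three fields that nobody else shares. The records and the `unique_fraction` helper are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical, made-up profiles: (sex, birth date, ZIP code).
# These are NOT real PGP records.
profiles = [
    ("F", "1975-03-14", "02139"),
    ("M", "1975-03-14", "02139"),
    ("F", "1982-11-02", "02139"),
    ("F", "1982-11-02", "02139"),
    ("M", "1968-07-30", "02115"),
]


def unique_fraction(records):
    """Return the share of records whose field combination appears exactly once."""
    counts = Counter(records)
    unique = sum(1 for record in records if counts[record] == 1)
    return unique / len(records)


# Three of the five toy profiles have a combination shared by no one else.
print(unique_fraction(profiles))  # 0.6
```

A record that is unique on these three fields can, in principle, be matched against any other source that lists the same fields alongside a name.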
This is important because Harvard PGP participants are able to add ZIP codes to their public profiles in anticipation of research activities that explore how geographic location (and all the associated chemical exposures, microbes, viruses, air quality, allergens, etc.) impacts health. For anyone who was not at the GET Conference, Sweeney’s group has created a tool showing how identifiable you are in your own ZIP code. Check it out here: http://aboutmyinfo.org/
A word of caution is required here about the best way for PGP participants to respond: we strongly advise any participant concerned about the identifiability of their data to reconsider their participation in the Personal Genome Project. An alternative response, one that we find worrisome, is for participants to deploy clever tricks to reduce the identifiability of their public data. As part of their demonstration, the Data Privacy Lab is providing tools that let participants “scrub” their data (e.g. replacing a five-digit ZIP code with a three-digit ZIP code). This may create the impression of privacy, but it will not make participants anonymous.
Earlier this year, an exciting study by Gymrek et al. from Yaniv Erlich’s lab forcefully demonstrated that genome data alone is extremely identifying. Melissa Gymrek also had a table at the GET Conference this year, where she demonstrated the technique to participants. Their research matched Y-chromosome data from whole genome sequences to ancestry databases, which link surnames with Y-chromosome markers. With these surname clues and just a few other pieces of publicly available data, their group was able to identify specific individuals and families from widely distributed “anonymous” cell lines.
Thus, all participants should believe that they are identifiable: there is no such thing as an “anonymous” genome!
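The same caution applies to the ZIP code scrubbing mentioned above. The sketch below is a toy illustration, not the Data Privacy Lab's tool: it truncates ZIP codes to three digits in a small, made-up set of records and counts how many records are still unique afterwards. The records and the `scrub_zip` and `count_unique` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical, made-up records: (sex, birth date, ZIP code).
records = [
    ("F", "1975-03-14", "02139"),
    ("F", "1975-03-14", "02138"),
    ("M", "1968-07-30", "02115"),
    ("M", "1990-01-05", "02116"),
]


def scrub_zip(record, digits=3):
    """Return a copy of the record with the ZIP code cut to its first digits."""
    sex, birth_date, zip_code = record
    return (sex, birth_date, zip_code[:digits])


def count_unique(recs):
    """Count records whose field combination appears exactly once."""
    counts = Counter(recs)
    return sum(1 for rec in recs if counts[rec] == 1)


scrubbed = [scrub_zip(rec) for rec in records]

print(count_unique(records))   # 4: every record is unique before scrubbing
print(count_unique(scrubbed))  # 2: two records are still unique afterwards
```

In this toy data, scrubbing merges only the two records that already shared a sex and birth date. The other two remain unique on the remaining fields, and none of this touches the genome data itself.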
In our experience, many participants want to be identified and are very open about which public profile is theirs. The PGP does not require participants to reveal their names, but with the media coverage of the Sweeney group’s work, we realize that the project appears to outsiders as “anonymous”, even though participants, after passing our enrollment exam, know better (or should)! To meet the desires of some participants and to further clarify the non-anonymous nature of the PGP, we’re going to work on allowing participants to add their photos and/or names to their public PGP profiles. I expect it will make PGP profile pages much more “personal” and create a provocatively different scientific database!