Skip to content

Exploring the Harvard PGP Dataset with Untap

November 19, 2015

Recently, my co-worker Abram Connelly scraped the phenotypes in the Harvard Personal Genome Project and made it available in a small SQLite database, publicly available for anyone to download. He made a small webapp around the database where people can play around with the data directly in their browser.

Current webapp (first page has link to gzip of the database) is linked to at the top of this page.

dataset

011

The dataset consists of people who have completed the enrollment process and were free to upload their own data for public release and also answer the surveys online (but not necessarily people who have donated samples, had the samples sequence, or had the samples released publicly).

To put it concretely, there are around 4000 people enrolled, and around 200 people with whole genome sequences that were  sequenced, interpreted, and returned by Harvard PGP as of August 2015 (though keep your ears open for upcoming news). (Participants may have whole genomes sequenced independently and then elect to upload and donate the data to the Harvard PGP).

webapp

Returning to the webapp, there are a few default tabs, where you can do things like explore what year PGP participants were born, where you can see that our population is mostly young folks…

041

“Summary” Pre-Packaged View of Allergies of Participants

…or with two clicks see what allergies are most common in PGP participants. Note that this is a quick scrape of the Tapestry database and no clean-up has been done, so you’ll notice allergies being listed twice with different spellings.

http://gfycat.com/ifr/EvenEnormousElephantbeetle

SQL Queries for Participants with “Oak” Allergies

On the “queries” tab, you can query the sql database and see the results in neat table form in your browser.

http://gfycat.com/ifr/NiftyExcellentKoala

Additionally, there are some pre-packaged but interactive visualizations, where you can edit the text and have the graph update to reflect your changes / newly requested data.

For instance, here’s a display of the participant gender ratio at different ages which I modify to display information about the allergies at different age buckets

before, displaying gender of participants

271

 

and after, displaying penicillin and house dust allergies

281

 

Obligatory cat statistics

http://gfycat.com/ifr/DearContentAracari

Although one could hope that this graph shows that PGP participants are not more likely to develop allergies to cats as they grow older, we have a lot more younger participants and this is absolute and not percent frequency, so we might have to say the data points to the opposite. Sad!

(Disclaimer: Just for fun, no real thought put into this analysis :] )

Conclusion

Ever wanted a public genotype + phenotype dataset? The Harvard PGP has you covered!

We have phenotype surveys galore (including a recently released one that includes blood type and eye color), with responses available in CSV form. The questions on the survey forms are available on github for now.

I hope you all enjoy! The source code for Untap is on github

https://github.com/abeconnelly/untap

and Abram welcomes feature requests / issue reporting. We hope this is beneficial to the GA4GH working groups specifically and other researchers in general.

Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 129 other followers

%d bloggers like this: