Exploring the Harvard PGP Dataset with Untap

November 19, 2015

Recently, my co-worker Abram Connelly scraped the phenotypes in the Harvard Personal Genome Project and packaged them in a small SQLite database that anyone can download. He also built a small webapp around the database where people can play with the data directly in their browser.

The current webapp is linked at the top of this page; its first page has a link to a gzip of the database.



The dataset consists of people who have completed the enrollment process and were free to upload their own data for public release and to answer the surveys online (but not necessarily people who have donated samples, had the samples sequenced, or had the samples released publicly).

To put it concretely, there are around 4000 people enrolled, and around 200 people with whole genome sequences that were sequenced, interpreted, and returned by Harvard PGP as of August 2015 (though keep your ears open for upcoming news). Participants may also have whole genomes sequenced independently and then elect to upload and donate the data to the Harvard PGP.


Returning to the webapp, there are a few default tabs, where you can do things like explore what year PGP participants were born, where you can see that our population is mostly young folks…


“Summary” Pre-Packaged View of Allergies of Participants

…or with two clicks see what allergies are most common in PGP participants. Note that this is a quick scrape of the Tapestry database and no clean-up has been done, so you’ll notice allergies being listed twice with different spellings.

SQL Queries for Participants with “Oak” Allergies

On the “queries” tab, you can query the SQL database and see the results in neat table form in your browser.
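If you download the SQLite file, the same kind of query can be run locally. Here is a minimal sketch in Python. The table name `allergies` and its columns `participant_id` and `name` are hypothetical stand-ins, not the actual Untap schema, so the example builds a tiny in-memory database with made-up rows instead of opening the real file.

```python
import sqlite3


def find_allergy(conn, term):
    """Return (participant_id, name) rows whose allergy name contains `term`.

    SQLite's LIKE is case-insensitive for ASCII text by default, so "oak"
    also matches "Oak pollen".
    """
    query = (
        "SELECT participant_id, name FROM allergies "
        "WHERE name LIKE ? ORDER BY participant_id, name"
    )
    return conn.execute(query, (f"%{term}%",)).fetchall()


# Hypothetical stand-in for the downloaded database: the schema and the rows
# below are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE allergies (participant_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO allergies VALUES (?, ?)",
    [
        ("p001", "Oak pollen"),
        ("p002", "poison oak"),
        ("p003", "Penicillin"),
    ],
)

oak_rows = find_allergy(conn, "oak")
for participant_id, name in oak_rows:
    print(participant_id, name)
```

Because the data has not been cleaned up, a substring match like this is probably more forgiving than an exact comparison, though it will also pick up unrelated names that happen to contain the search term.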

Additionally, there are some pre-packaged but interactive visualizations, where you can edit the text and have the graph update to reflect your changes / newly requested data.

For instance, here’s a display of the participant gender ratio at different ages, which I modified to display information about allergies in different age buckets.

before, displaying gender of participants



and after, displaying penicillin and house dust allergies



Obligatory cat statistics

Although one could hope that this graph shows that PGP participants are not more likely to develop allergies to cats as they grow older, we have a lot more younger participants and this is absolute and not percent frequency, so we might have to say the data points to the opposite. Sad!

(Disclaimer: Just for fun, no real thought put into this analysis :] )
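The caveat about absolute versus percent frequency can be made concrete with a small sketch. The bucket sizes and allergy counts below are invented purely for illustration and are not taken from the PGP data; `percent_frequency` is a hypothetical helper, not part of Untap.

```python
def percent_frequency(allergic_counts, bucket_sizes):
    """Convert absolute counts per age bucket into percentages of that bucket."""
    rates = {}
    for bucket, total in bucket_sizes.items():
        if total == 0:
            continue  # skip empty buckets rather than divide by zero
        rates[bucket] = 100.0 * allergic_counts.get(bucket, 0) / total
    return rates


# Invented numbers: many more young participants than old ones.
bucket_sizes = {"20-29": 400, "60-69": 50}
cat_allergy_counts = {"20-29": 40, "60-69": 10}

rates = percent_frequency(cat_allergy_counts, bucket_sizes)
for bucket, rate in rates.items():
    print(f"{bucket}: {cat_allergy_counts[bucket]} people, {rate:.1f}% of bucket")
```

In this made-up case the younger bucket has four times as many allergic people in absolute terms, yet the older bucket has twice the rate, which is why a graph of raw counts can point the wrong way.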


Ever wanted a public genotype + phenotype dataset? The Harvard PGP has you covered!

We have phenotype surveys galore (including a recently released one that covers blood type and eye color), with responses available in CSV form. The questions on the survey forms are available on GitHub for now.

I hope you all enjoy! The source code for Untap is on GitHub, and Abram welcomes feature requests and issue reports. We hope this is beneficial to the GA4GH working groups specifically and other researchers in general.

Joining as a PGP Volunteer

November 3, 2015

I’m Nancy. I recently joined the Harvard Personal Genome Project as a volunteer. I think I’ve joined at a great time, when the Harvard PGP has the world’s largest public dataset of whole genome sequences linked with phenotypes.

I’m excited to join what I view as an effort that addresses the inherent ethical issues in genomics research: genomes are as individual as a fingerprint, and, to stretch the analogy a bit, a smudged fingerprint (de-identification) or a summary of many fingerprints (aggregation) is only so useful, especially as the rise of precision medicine has us targeting smaller and smaller subsets of the population.

I think there are many challenges in the HPGP right now, among them challenges in funding and staffing, which contribute to a lot of frustration on the part of participants, many of whom have donated blood and saliva samples and waited months and even years without a returned sample from us.

As I’ve worked with the HPGP staff over the last few months, I’ve come to see that every last one of the staff members is working extremely hard to get samples sequenced and genomes returned. However, none of us works on HPGP full-time, and we also rely on donated effort from other organizations, such as sequencing centers (for which we’re very grateful!). Although our pace may seem slow, I’m still really impressed by how much work has been done already.

I also like to brainstorm about the future. A future where, among other things, you might be able to check on the status of your genome à la Domino’s Pizza instead of having to email us and wait for us to laboriously reply to the many emails we get each week.

(just kidding).

On that note, happy fall everyone!

–Nancy Ouyang

PGP & the Critical Assessment of Genome Interpretation

September 9, 2015

We’re thrilled to announce that data from the Harvard Personal Genome Project is being used in a challenge this year presented by the Critical Assessment of Genome Interpretation (CAGI). CAGI challenges test the ability of researchers to interpret genome data and make phenotype predictions.

PGP data is uniquely valuable for these challenges as it is completely “open source”: the algorithms and data can be completely open. In this challenge, experimenters are asked to predict matching phenotype profiles for a set of genomes. To read more about the challenge, follow this link to CAGI’s website:

The 2015 Harvard PGP conference

August 31, 2015

Next month, the Harvard Personal Genome Project will hold its annual U.S. conference (MindEx 2015) and labs events (PG-Palooza) in Cambridge, MA. The conference will take place on Saturday, September 12 at Harvard University’s famed Sanders Theatre. PG-Palooza labs will be held on Sunday, September 13 at the Cambridge Innovation Center. Thanks to the generosity of our sponsors, all PGP participants will be admitted to both MindEx and PG-Palooza for free!

In years past, the PGP was featured at the GET Conference. This year, the GET Conference is going international. It will take place in Vienna (Sept 17-19) and will feature Genom Austria and other members of the growing international PGP consortium.

For this year’s U.S. MindEx conference, the Harvard PGP is working together with the Mind First Foundation, and a focus of the conference will be the mental realm: mind and brain, cognition and behavior. Still, as in previous years, the U.S. conference and labs will keep their established focus on open source genomics and citizen participatory science.

To register as a PGP participant for MindEx, please click here to visit the MindEx and PG-Palooza page at the Harvard PGP website (you’ll need to log in to your account), and click on the “Participate” button at the bottom of the page, or go straight to the appropriate EventBrite page. We recently made all registration free, so simply use Public Registration. At the conference we’ll register you separately for PG-Palooza, which is open only to those enrolled in the PGP.

More about MindEx and PG-Palooza

Conference speakers will include PGP founder and Harvard Professor Dr. George Church, Dr. Ron Kessler (Harvard Medical School), Dr. Martine Rothblatt (United Therapeutics), Dr. Ed Boyden (MIT Synthetic Neurobiology Group), Dr. Richard Wrangham (Harvard), Dr. Madeleine Price Ball (PGP Harvard and Open Humans Project), Dr. Sasha Wait Zaranek (PGP Harvard and Curoverse), Dr. Jordan Smoller (Broad Institute, Harvard Medical, Massachusetts General Hospital), best-selling psychology author David McRaney, gut microbiome experts Justine Debelius and Dr. Siavosh Rezvan Behbahani, and more. PG-Palooza will feature presentations and collections of specimens and data by the Harvard PGP, American Gut, uBiome, LifeNaut, MindModeling@Home, H-Scan, and more!

For additional details about the conference, labs, speakers, venues, hotels, directions and maps, visit the MindEx conference pages on the Mind First Foundation website.

We hope to see you there!

Oppenheimer Foundation survey results

May 5, 2015

The following is a guest post by Alan Oppenheimer. The Alan and Priscilla Oppenheimer Foundation seeks to advance humanity through scientific research and education and has been a long-time supporter of the Harvard Personal Genome Project. The views of this guest post, and responses from participants reported upon here, do not necessarily reflect the views of the Harvard Personal Genome Project. It is important to keep in mind that the Harvard Personal Genome Project study is not intended nor expected to help participants diagnose or improve personal health issues.

Following up on our previous blog post, here’s a quick summary of the results of the Harvard Personal Genome Project enrollee survey “What are you looking for in your genome, and how can we help you find it?” There were about 280 respondents.

The first questions were about the participant’s background. The “average” participant has been in PGP about 3 years, may or may not have donated a sample, is most interested in inherited disease risk, has 23andMe or FamilyTree/Ancestry DNA data, and is very computer savvy, reading articles/journals.

In terms of the key question in the title of the survey, participants would slightly prefer their genome analysis through either current tools like GET-Evidence and Promethease or an easy-to-use overview tool, versus raw data or a genome browser (see figure below). Primary important factors in exploring their genome include medical analysis and broad, flexible in-depth data, both slightly favored over ease-of-use and accessibility of an overview (and significantly favored over the ability to share/compare with family members).

survey results

The most interesting items from the survey were the comments, mainly in the free-form “What else would you like to tell us” question at the end (entered by about 1/3 of the respondents). Most prevalent of those were:

  • “here’s what’s wrong with me that I’m hoping my genome will help me find/understand/fix” (which is thus the number one answer to the title of the survey)
  • “I wish there was a blood collection event in my area.”

Thanks to everyone who took part. Our next step after this survey: decide on what tool(s) we here at the Oppenheimer Foundation should start building (or assisting the Personal Genome Project with) to best address the survey responses.

VIDEO: Genomics in Medicine Panel at the 2014 GET Conference

April 10, 2015

At the 2014 GET Conference, Robert Green described how medical genetics is being integrated into primary care, Michael Linderman spoke on how to prepare the next generation of genomicists, and Diana Bianchi presented on how prenatal screening using sequencing of cell-free fetal DNA is revolutionizing prenatal care. Afterwards, they were led in a moderated discussion by Boston Globe reporter Carolyn Johnson. Watch the video.

VIDEO: Sporty genomes: Are elite athletes born or made?

March 6, 2015

If a major goal of genomics research is to understand the underlying molecular causes of beneficial phenotypes, for purposes of promoting overall health in society, then perhaps sports, in many regards, can help facilitate this process. The canonical athletic phenotype, with highly desirable physical traits, may serve as a model for understanding optimal fitness. And certainly professional athletes, at the pinnacle of their respective sports, have tremendous social and economic influence by inspiring everyday athletes and fans alike to emulate their performances. Therefore, a deeper understanding (or at least discussion) of what makes an “elite” athlete, or who has the potential to become one, is warranted. With 99% of the human genome being identical, is it plausible to think we all have the inherent ability to become elite athletes? Or do the remaining 30 million divergent nucleotides of our genetic code determine who can or cannot become an Olympian? At the annual Genes, Environment, and Traits (GET) conference, a sports genomics panel was held to discuss this provocative topic. Invited speakers were:

  • David Epstein: Investigative reporter at ProPublica, former senior writer for Sports Illustrated, author of the New York Times best seller The Sports Gene.
  • Heidi Rehm, PhD: Chief Lab Director at Partners’ Laboratory for Molecular Medicine, associate professor of Pathology at Brigham & Women’s Hospital, expert on genomic medicine & integrating genetic discovery into clinics.
  • Mark Gerstein, PhD: Professor of Bioinformatics at Yale University, expert in human genome mining & annotation, author on over 400 computational biology research publications.
  • Jonathan Scheiman (moderator), a research fellow in the genomics laboratory of George Church, former Division I athlete, and NBA correspondent for an international radio show.

In a lively debate at the 2014 GET Conference – which included moments of scientific inquiry, levity, and moral contemplation – panelists engaged in discourse over the inheritability and trainability of athletic traits as well as selective pressure from society to enrich for performance phenotypes. Additional topics discussed included:

  • Evolution of athletic body types
  • Performance enhancing polymorphisms
  • Genetic tests and specialization of athletes at a young age
  • Genetic tests for ensuring athlete health – requisite or optional?
  • Whole genome sequencing of elite athletes for beneficial allele discovery
  • Professional athlete salaries vs. science funding – can we collaborate?!?
  • Quantified self and advanced analytics in professional sports
  • The future and potential of genomics in sports analytics
  • Competition, fairness, and genetic engineering

Practice vs. genetics. Nature vs. nurture. A timeless debate, with a new quantitative spin from cutting-edge advances in next-generation genomics technologies. Never before has society had access to such powerful tools to read and write DNA. And athletes, with a history of transcending sport, are as popular as ever in mainstream culture. Perhaps the next revolution in science will entail a sports star allowing us all to peek into their biological greatness. Scientists vs. athletes? Why? As the sports genomics panel at the GET conference displayed, these are two communities that stand to benefit from playing on the same team.

Watch the video.

(special thanks to moderator Jonathan Scheiman for this written summary)

