PGP Harvard data in Google Cloud Storage
At PGP Harvard, our participants are, by and large, very enthusiastic about understanding genetics and their own genomes. Many participants are programmers or researchers, and often both! It should come as no surprise that our staff are often asked “can I see more of the raw data?”
We’ve always wanted the entire “raw data” to be public, for participants and researchers alike. One issue that stymied us was the intractable size of the data: this sort of data is typically shipped on terabyte disks. I’m happy to share that we now have an answer and a place to find the data, although accessing it requires some familiarity with a command line interface and maybe a smidge of programming.
The full data sets PGP Harvard received from Complete Genomics are now shared in a public bucket on Google Cloud Storage, using credits generously donated by Google. Data is organized by huID, with one top-level directory per participant ID (for example, hu011C57).
The bucket: gs://pgp-harvard-data-public
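To access the bucket, you will need gsutil, Google's command line tool for Cloud Storage. You should read about installing and using gsutil before trying the commands below.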
Some example commands
List contents of bucket top level:
gsutil ls gs://pgp-harvard-data-public
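Since these directories can be large, it may help to check how big a participant's data is before copying anything. Here is a minimal sketch that wraps `gsutil du`. The helper names `pgp_url` and `pgp_size` are made up for this post, and the sketch assumes gsutil is installed and on your PATH.

```shell
# Bucket name comes from this post; the helper names are hypothetical.
PGP_BUCKET="gs://pgp-harvard-data-public"

# Build the bucket URL for one participant directory, e.g. hu011C57.
pgp_url() {
  printf '%s/%s\n' "$PGP_BUCKET" "$1"
}

# Print the total size of a participant's directory in human-readable units.
# -s summarizes the whole directory, -h prints sizes like "40.8 GiB".
pgp_size() {
  gsutil du -s -h "$(pgp_url "$1")"
}

# Usage (needs gsutil and network access):
#   pgp_size hu011C57
```

Nothing is downloaded by this check; it only reads the object listing.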
Download/copy the var file from hu011C57 Complete Genomics data to your current directory (234 MB):
gsutil cp gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/var-GS000015172-ASM.tsv.bz2 .
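If you only want a quick look at a file like this, you do not have to save it first. The sketch below streams an object with `gsutil cat`, decompresses it on the fly, and prints the first few lines. The helper name `pgp_peek` is hypothetical, and the sketch assumes `bzcat` is available on your machine.

```shell
# Hypothetical helper: print the first N lines (default 20) of a
# bzip2-compressed object without writing it to disk.
#   $1 = full gs:// URL of a .bz2 object
#   $2 = number of lines to show (optional)
pgp_peek() {
  peek_url="$1"
  peek_lines="${2:-20}"
  # gsutil cat streams the object to stdout; bzcat decompresses the stream.
  gsutil cat "$peek_url" | bzcat | head -n "$peek_lines"
}

# Usage (needs gsutil and network access):
#   pgp_peek gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/var-GS000015172-ASM.tsv.bz2 30
```

Because `head` stops reading early, gsutil may print a message about a closed pipe once enough lines have arrived; that is expected with this approach.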
With multi-threading (-m) and recursion (-R), copy the whole hu011C57 directory to your current directory (40.8 GB):
gsutil -m cp -R gs://pgp-harvard-data-public/hu011C57 .
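A download of this size can get interrupted. One possible alternative, sketched here rather than prescribed, is `gsutil rsync`, which only copies files that are missing or different at the destination, so rerunning it picks up where it left off. The helper name `pgp_sync` is hypothetical.

```shell
# Hypothetical helper: mirror one participant's directory into a local folder.
#   $1 = huID, e.g. hu011C57
#   $2 = parent directory for the download (optional, defaults to ".")
pgp_sync() {
  sync_huid="$1"
  sync_dest="${2:-.}/$sync_huid"
  # Create the destination directory up front so rsync has a place to write.
  mkdir -p "$sync_dest"
  # -m runs transfers in parallel; -r recurses into subdirectories.
  gsutil -m rsync -r "gs://pgp-harvard-data-public/$sync_huid" "$sync_dest"
}

# Usage (needs gsutil, network access, and plenty of disk space):
#   pgp_sync hu011C57
```

Running the same command again after an interruption should skip the files that already arrived intact.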
Use a Google Compute Engine VM to analyze the data
You can also access this data from virtual machines in Google Compute Engine, which could save you a lot of disk space! Once you have a virtual machine you can, for example, use the Python Client Library to access the data automatically.