Work has begun on the full-scale effort to build a public database containing information from the genomes of 2,500 people from 27 populations around the world.
Since its launch in 2008, the 1000 Genomes Project has conducted three pilot studies to test different strategies for producing a catalogue of genetic variants that are present at a frequency of one per cent or greater in the different populations chosen for study (European, African and East Asian).
Researchers will use the full catalogue, which will be developed over the next two years, to study the contribution of genetic variation to illness. In addition to being distributed on the project’s own web sites, the pilot data set is available via the Amazon Web Services (AWS) computing cloud, enabling anyone to access this unprecedentedly large data set, even if they do not have the capacity to download it locally.
“In the pilot projects we have made significant progress in optimising the use of next generation sequencing platforms to study human genetic variation, and we can now apply what we have learned to accelerate our efforts to sequence this reference collection of human genomes,” said Richard Durbin, of the Wellcome Trust Sanger Institute, co-chair of the consortium.
“Completing the goals of the initial pilot projects has been critical to informing how to apply next-generation sequencing in human genetic research, and provides a solid foundation for the next stage of the project,” said David Altshuler, of the Broad Institute, Cambridge, Massachusetts, the other co-chair. “We are eager to make rapid progress on the full set of 2,500 genomes and to provide the resulting data for use by the disease genetic community. I fully expect that these data will more precisely define genetic risk factors already discovered, and lead to the discovery of many new risk factors for disease.”
The previous public project, the International HapMap Project, provided an initial database of over 3 million human DNA variants present in 270 DNA samples. Information and methods developed by HapMap fuelled the first generation of Genome-Wide Association Studies, which have localised over 600 novel genetic risk factors for common diseases such as diabetes, heart attack, inflammatory bowel disease, breast cancer, schizophrenia, and other disorders. These studies were limited by technology, however, to studying a subset of more common DNA variants, with a frequency greater than five to ten per cent.
The 1000 Genomes Project exploits next-generation DNA sequencing technologies to develop a much more complete database that reaches much lower variant frequencies and extends to more human populations. This database will contain all forms of variation: single-letter changes (SNPs), small insertions and deletions (termed indels), and copy number variations, that is, large changes in the structure and copy number of chromosomes. This integrated map is a novel contribution, as previous studies have focused exclusively on one form of DNA variation (even though each of our genomes contains every kind of variation).
“We are committed to make these data public to make certain that any institution or researcher around the world can access and work with our datasets to better understand common disease,” said Jun Wang, associate director of the Beijing Genomics Institute in Shenzhen, China, and member of the 1000 Genomes Project steering committee. “We must work together if we are going to find those subtle differences in the human genome that lead to diseases like cancer and diabetes.”
The move to make the pilot data public represents the first major release of biomedical data on the Amazon Web Services Cloud. The amount of data produced to date is unprecedented, even for biomedical research. Currently, the database consists of over 50 terabytes of data, corresponding to almost eight trillion DNA base pairs (eight terabases) of sequence data.
Early in the project, merely copying the vast quantities of data between the European Bioinformatics Institute in the UK and the National Center for Biotechnology Information in the US consumed large fractions of both groups’ capacity on the Internet for several days.
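As a rough illustration of the scale quoted above, the two figures can be combined into an average storage cost per sequenced base pair. This is a back-of-the-envelope sketch only: both inputs are approximate ("over 50 terabytes", "almost eight trillion base pairs"), and the use of decimal terabytes (10^12 bytes) is an assumption, not something the project states.

```python
# Figures quoted in the text; both are approximate.
DATABASE_TERABYTES = 50          # "over 50 terabytes"
SEQUENCED_BASE_PAIRS = 8e12      # "almost eight trillion DNA base pairs"

# Assumption: decimal terabytes (1 TB = 10**12 bytes).
BYTES_PER_TERABYTE = 10**12


def bytes_per_base_pair(terabytes: float, base_pairs: float) -> float:
    """Return the average number of bytes stored per sequenced base pair."""
    return terabytes * BYTES_PER_TERABYTE / base_pairs


ratio = bytes_per_base_pair(DATABASE_TERABYTES, SEQUENCED_BASE_PAIRS)
print(f"roughly {ratio:.2f} bytes per base pair")
```

Under these assumptions the result is a little over six bytes per base pair. That is more than the sequence letters alone would need, which would be consistent with the files also holding other information for each read, although the figures quoted here do not break the total down.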
Researchers can access the 1000 Genomes Project pilot data through the 1000 Genomes web site, http://www.1000genomes.org.
Researchers can download the data from NCBI at: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ or from the EBI at: ftp://ftp.1000genomes.ebi.ac.uk/.
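For readers who prefer to browse the download sites programmatically, the following is a minimal sketch using Python's standard `ftplib`. It assumes the two servers listed above accept anonymous FTP logins, which may no longer be the case; the helper names `split_ftp_url` and `list_directory` are illustrative and not part of any project tooling.

```python
from ftplib import FTP
from urllib.parse import urlsplit

# Download locations quoted in the text.
NCBI_URL = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/"
EBI_URL = "ftp://ftp.1000genomes.ebi.ac.uk/"


def split_ftp_url(url: str) -> tuple[str, str]:
    """Split an ftp:// URL into (hostname, directory path)."""
    parts = urlsplit(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    return parts.hostname, parts.path or "/"


def list_directory(url: str, timeout: float = 30.0) -> list[str]:
    """List the entries in the directory an FTP URL points to.

    Assumption: the server allows anonymous login.
    """
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()          # anonymous login
        ftp.cwd(path)
        return ftp.nlst()


# Example (requires network access):
#     for name in list_directory(EBI_URL):
#         print(name)
```

Listing a directory before downloading is a cheap way to check what is available, given that the full data set is far too large to fetch casually.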
The pilot datasets of the 1000 Genomes Project (7.3 terabytes of data) are now available as a public dataset through Amazon Web Services and integrated into the company’s Elastic Compute Cloud.