For more than 20 years, scientists have relied on the human reference genome, an agreed genetic sequence, as a standard against which to compare other genetic data. Used in countless studies, the reference genome has made it possible to identify genes involved in specific diseases and track the evolution of human traits, among other things.
But it has always been a faulty tool. One of its biggest problems is that about 70 percent of its data comes from a single man of predominantly African-European origin whose DNA was sequenced during the Human Genome Project, the first effort to capture all of a person’s DNA. As a result, it can tell us little about the 0.2 to one percent of the genetic sequence that makes each of the seven billion people on this planet different from one another, creating an inherent bias in biomedical data that is believed to be responsible for some of the health disparities affecting patients today. Many genetic variants found in non-European populations, for example, are not represented at all in the reference genome.
For years, researchers have called for a resource more inclusive of human diversity with which to diagnose disease and guide medical treatment. Now, scientists from the Human Pangenome Reference Consortium have made groundbreaking progress in characterizing the fraction of human DNA that varies between individuals. As recently published in Nature, they have assembled genomic sequences from 47 people from around the world into a so-called pangenome in which more than 99 percent of each sequence is represented with great precision.
Superimposed on each other, these sequences revealed almost 120 million base pairs of DNA that had not been seen before.
While still a work in progress, the pangenome is public and can be used by scientists around the world as a new standard reference for the human genome, says Erich D. Jarvis of Rockefeller University, one of the principal investigators.
“This complex genomic collection represents significantly more precise human genetic diversity than ever before,” he says. “With a greater breadth and depth of genetic data at their disposal, and a higher quality of genome assemblies, researchers can refine their understanding of the link between genes and disease traits and accelerate clinical research.”
Diversity of supply
Completed in 2003, the first draft of the human genome was relatively imprecise, but it became sharper over the years as gaps were filled, errors corrected, and sequencing technology advanced. Another milestone was reached last year, when the last eight percent of the genome, primarily tightly coiled DNA that does not code for proteins and repetitive regions of DNA, was finally sequenced.
Despite this progress, the reference genome remained imperfect, especially with respect to the critical 0.2 to one percent of DNA that represents diversity. The Human Pangenome Reference Consortium (HPRC), a government-funded collaboration between more than a dozen research institutions in the United States and Europe, was launched in 2019 to address this issue.
At the time, Jarvis, one of the consortium leaders, was refining advanced sequencing and computational methods through the Vertebrate Genomes Project, which aims to sequence all 70,000 vertebrate species. His lab and other collaborators decided to apply these advances to high-quality diploid genome assemblies to reveal variation within a single vertebrate: Homo sapiens.
To collect a diversity of samples, the researchers turned to the 1000 Genomes Project, a public database of sequenced human genomes that includes more than 2,500 individuals representing 26 geographically and ethnically diverse populations. Most of the samples come from Africa, home to the greatest human diversity on the planet.
“In many other large human genome diversity projects, scientists selected mainly European samples,” says Jarvis. “We made a determined effort to do the opposite. We were trying to counter the prejudices of the past.”
It is likely that among these populations genetic variants could be found that could inform our knowledge of common and rare diseases.
Mom, dad and child
But to expand the gene pool, researchers had to create sharper, clearer sequences from each individual, and approaches developed by members of the Vertebrate Genome Project and associated consortia were used to solve a longstanding technical problem in the field.
Each person inherits one genome from each parent, so we end up with two copies of each chromosome, giving us what’s known as a diploid genome. And when a person’s genome is sequenced, separating the parents’ DNA can be challenging. Older techniques and algorithms have routinely made mistakes by merging the genetic data of an individual’s parents, resulting in a cloudy picture. “The differences between mom’s and dad’s chromosomes are bigger than most people realize,” says Jarvis. “Mom can have 20 copies of a gene and dad only two.”
With so many genomes represented in a pangenome, that cloud threatened to become a storm of confusion. So the HPRC relied on a method developed by Adam Phillippy and Sergey Koren at the National Institutes of Health on “trios” of parents and children: a mother, a father, and a child whose genomes had all been sequenced. Using the data from mom and dad, they were able to clarify the lines of inheritance and come up with a higher-quality sequence for the child, which they then used for pangenome analysis.
New variations
The researchers’ analysis of 47 people yielded 94 distinct genomic sequences, two for each set of chromosomes, plus the Y sex chromosome in males.
They then used advanced computational techniques to align and superimpose the 94 sequences. Of the 120 million base pairs of DNA that had not been seen before or are in a different location than indicated in the previous reference, around 90 million result from structural variations, which are differences in people’s DNA that arise when fragments of chromosomes are rearranged: moved, deleted, inverted, or duplicated into additional copies.
It’s an important discovery, Jarvis notes, because studies in recent years have established that structural variants play an important role in human health, as well as in population-specific diversity. “They can have dramatic effects on differences in traits, diseases, and genetic functions,” he says. “With so many new ones identified, there will be many new discoveries that were not possible before.”
Filling gaps
Pangenome assembly also fills in gaps that were due to repetitive sequences or duplicate genes. One example is the major histocompatibility complex (MHC), a group of genes that code for proteins on the surface of cells that help the immune system recognize antigens, such as those of the SARS-CoV-2 virus.
“They’re really important, but it was impossible to study MHC diversity using the older sequencing methods,” says Jarvis. “We are seeing much greater diversity than we expected. This new information will help us understand how immune responses against specific pathogens vary between people.” It could also lead to better methods of matching organ transplant donors to patients, or identifying people at risk of developing autoimmune diseases.
The team has also discovered surprising new features of centromeres, which lie in the center of chromosomes and drive cell division, pulling apart as cells replicate. Mutations at centromeres can lead to cancers and other diseases.
Despite having highly repetitive DNA sequences, “centromeres are so diverse from haplotype to haplotype that they can account for more than 50 percent of genetic differences between people or maternal and paternal haplotypes even within the same individual,” Jarvis says. “The centromeres appear to be one of the most rapidly evolving parts of the chromosome.”
Building a relationship
However, the current pangenome of 47 individuals is only a starting point. HPRC’s ultimate goal is to produce high-quality, nearly error-free genomes of at least 350 individuals from diverse populations by mid-2024, a milestone that would allow capturing rare alleles that confer important adaptive traits. Tibetans, for example, have alleles related to oxygen use and ultraviolet light exposure that allow them to live at high altitudes.
A major challenge in collecting this data will be gaining the trust of communities that have seen biological data abused in the past. For example, there are no samples in the current study of Native American or Aboriginal peoples, who have long been ignored or exploited by scientific studies. But you don’t have to go far back in time to find examples of the unethical use of genetic data: just a few years ago, DNA samples from thousands of Africans were traded in various countries without the knowledge, consent or benefit of the donors.
These crimes have sown mistrust of scientists among many populations. But by not being included, some of these groups could remain genetically obscure, leading to perpetuation of data biases and continued disparities in health outcomes.
“It’s a complex situation that will require a lot of relationship building,” says Jarvis. “There is a greater sensitivity now.”
And even today, many groups are willing to participate. “There are people, institutions and government agencies from different countries saying, ‘We want to be a part of this. We want our people to be represented,’” Jarvis says. “We are already making progress.”