Publicly available genetic summary data have high utility in research and the clinic, including prioritizing putative causal variants, polygenic scoring, and leveraging common controls. However, summarizing individual-level data can mask population structure, resulting in confounding, reduced power, and incorrect prioritization of putative causal variants.
In an effort to increase the utility and equity of large genetic databases, researchers led by Audrey Hendricks, Ph.D., associate professor of statistics at the University of Colorado Denver, have developed a new method that estimates the genetic ancestry represented in these databases.
Known as Summix, the method adjusts allele-frequency information to match the ancestry of a person or sample of people, with the goal of making large genetic databases more useful for people whose ancestries are traditionally underrepresented in genetic databases and studies.
“Summix is a method to efficiently deconvolute ancestry and provide ancestry-adjusted allele frequencies (AFs) from summary data,” the authors write in their paper in The American Journal of Human Genetics.
The team tested the effectiveness of Summix in 5,000 simulation scenarios and in the Genome Aggregation Database (gnomAD), a publicly available genetic resource. They found that estimates of African, non-Finnish European, East Asian, Indigenous American, and South Asian ancestry proportions were accurate to within 0.1% in all simulation scenarios.
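The estimation step can be pictured as a constrained fit. The sketch below is a minimal illustration, not the Summix implementation: it assumes that the observed allele frequency at each SNP is a weighted average of the reference-ancestry frequencies, and it looks for weights that are non-negative, sum to 1, and minimize the squared error. The optimizer (projected gradient descent), the function names, and the toy frequencies are all assumptions chosen to keep the example dependency-free.

```python
def project_to_simplex(values):
    """Euclidean projection onto {p : p_k >= 0, sum(p) == 1}."""
    ordered = sorted(values, reverse=True)
    running_sum = 0.0
    shift = 0.0
    for count, value in enumerate(ordered, start=1):
        running_sum += value
        candidate = (running_sum - 1.0) / count
        if value - candidate > 0:
            shift = candidate
    return [max(v - shift, 0.0) for v in values]


def estimate_proportions(observed_af, reference_af, iterations=2000):
    """Least-squares ancestry proportions from summary allele frequencies.

    observed_af: list of observed AFs, one per SNP.
    reference_af: list of rows, one per SNP, each holding one AF per
        reference ancestry.
    """
    n_groups = len(reference_af[0])
    # Conservative step size: 1 / (2 * sum of squared reference AFs) never
    # exceeds 1 / L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * sum(af * af for row in reference_af for af in row))
    proportions = [1.0 / n_groups] * n_groups
    for _ in range(iterations):
        gradient = [0.0] * n_groups
        for obs, row in zip(observed_af, reference_af):
            residual = obs - sum(p * af for p, af in zip(proportions, row))
            for k, af in enumerate(row):
                gradient[k] -= 2.0 * residual * af
        stepped = [p - step * g for p, g in zip(proportions, gradient)]
        proportions = project_to_simplex(stepped)
    return proportions


# Toy reference panel (hypothetical values): 6 SNPs x 3 ancestries.
reference = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.3],
    [0.2, 0.7, 0.4],
    [0.3, 0.3, 0.8],
]
true_mix = [0.6, 0.3, 0.1]
observed = [sum(p * af for p, af in zip(true_mix, row)) for row in reference]
estimated = estimate_proportions(observed, reference)
print([round(p, 3) for p in estimated])
```

Because the toy observed frequencies are an exact mixture of the reference columns, the fit should recover the mixing weights used to build them; with real summary data, sampling noise and imperfect reference panels would leave some residual error.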
When Summix was applied to gnomAD v.2.1 exome and genome groups and subgroups, the team found heterogeneous ancestry similar to what is expected in African/African American, American/Latinx, Other, and South Asian groups. This included African/African American (∼84% AFR, ∼14% EUR) and American/Latinx (∼4% AFR, ∼5% EAS, ∼43% EUR, ∼46% IAM). Compared to the unadjusted gnomAD AFs, Summix’s ancestry-adjusted AFs more closely match respective African and Latinx reference samples.
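One simple way to picture an ancestry adjustment for a single SNP is sketched below. It is an illustration under stated assumptions, not code from the Summix software: the observed AF is treated as a mixture of ancestry-specific AFs, reference AFs are assumed known for every ancestry except one, the remaining ancestry's AF is solved for, and the pieces are re-weighted to a target ancestry mix. The function name, its parameters, and the numbers are hypothetical.

```python
def ancestry_adjusted_af(observed_af, observed_props, target_props,
                         reference_af, unknown_group):
    """Re-weight one SNP's allele frequency to a target ancestry mix.

    observed_props / target_props: dicts of ancestry -> proportion (sum to 1).
    reference_af: dict of ancestry -> reference AF for every ancestry
        except `unknown_group`, whose AF is solved for from the mixture.
    """
    if observed_props[unknown_group] <= 0:
        raise ValueError("unknown_group must have a positive observed proportion")
    known = sum(observed_props[g] * reference_af[g] for g in reference_af)
    # Solve observed = known + pi_unknown * af_unknown for af_unknown.
    af_unknown = (observed_af - known) / observed_props[unknown_group]
    af_unknown = min(max(af_unknown, 0.0), 1.0)  # keep a valid frequency
    adjusted = target_props.get(unknown_group, 0.0) * af_unknown
    adjusted += sum(target_props.get(g, 0.0) * reference_af[g] for g in reference_af)
    return adjusted


# Hypothetical numbers for one SNP in an admixed database group.
mix = {"AFR": 0.85, "EUR": 0.15}
adjusted = ancestry_adjusted_af(
    observed_af=0.30,
    observed_props=mix,
    target_props={"AFR": 1.0, "EUR": 0.0},
    reference_af={"EUR": 0.10},
    unknown_group="AFR",
)
print(round(adjusted, 4))
```

Setting the target mix equal to the observed mix returns the original AF, which is a quick sanity check on the arithmetic.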
The Summix reference panel was created from the 1000 Genomes Project (GRCh37/hg19) superpopulations (African, non-Finnish European, East Asian, South Asian) and an Indigenous American population (616,568 SNPs and 43 individuals, GRCh37/hg19). Tri-allelic SNPs and SNPs with missing allele frequency information were removed, leaving 613,298 SNPs across the 22 autosomes.
For the gnomAD analysis, the researchers merged their reference panel with the gnomAD data, checking for allele matches and strand flips. The final dataset included 582,550 genome SNPs and 9,835 exome SNPs across the 22 autosomes.
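Allele matching and strand-flip checks of this kind are a routine part of merging SNP datasets. The sketch below shows one common convention and is not the authors' pipeline: frequencies are aligned to the panel's alternate allele, swapped alleles are converted with 1 - AF, opposite-strand alleles are complemented, and strand-ambiguous (A/T, C/G) or non-matching SNPs are dropped. The function name and return convention are assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def harmonize_af(panel_alleles, data_alleles, data_af):
    """Express data_af in terms of the panel's (ref, alt) alleles.

    Returns the alt-allele frequency aligned to the panel, or None when
    the SNP cannot be matched unambiguously and should be dropped.
    """
    ref, alt = panel_alleles
    d_ref, d_alt = data_alleles
    if any(a not in COMPLEMENT for a in (ref, alt, d_ref, d_alt)):
        return None  # indels or non-ACGT codes are out of scope here
    if COMPLEMENT[ref] == alt:
        # Palindromic SNP (A/T or C/G): strand cannot be inferred from alleles.
        return None
    if (d_ref, d_alt) == (ref, alt):
        return data_af            # same alleles, same order
    if (d_ref, d_alt) == (alt, ref):
        return 1.0 - data_af      # ref/alt swapped
    flipped = (COMPLEMENT[d_ref], COMPLEMENT[d_alt])
    if flipped == (ref, alt):
        return data_af            # opposite strand
    if flipped == (alt, ref):
        return 1.0 - data_af      # opposite strand and swapped
    return None                   # alleles do not match


print(harmonize_af(("A", "G"), ("C", "T"), 0.25))
```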
“It turns out that we can estimate fairly precise ancestry proportions using fairly small numbers of SNPs (~100-500), opening the door for future extensions to local ancestry proportion estimates,” says Hendricks.
“Even on modern, dense panels of summary statistics, Summix yields results in seconds, allowing for estimation of confidence intervals via block bootstrap,” she adds. “Given an accompanying R package, Summix increases the utility and equity of public genetic resources, empowering novel research opportunities.” Summix is available as open-access software.
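The block bootstrap mentioned in the quote can be sketched as follows. This is a minimal illustration, not the R package's code: SNPs are grouped into contiguous blocks, whole blocks are resampled with replacement, the proportion is re-estimated on each resample, and the spread of those estimates gives a percentile interval. To stay short it assumes a two-ancestry mixture with a closed-form least-squares estimator; the block size, replicate count, function names, and simulated data are all assumptions.

```python
import random


def two_way_proportion(observed, ref_a, ref_b):
    """Closed-form least-squares share of ancestry A in a two-way mixture."""
    numerator = sum((o - b) * (a - b) for o, a, b in zip(observed, ref_a, ref_b))
    denominator = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    if denominator == 0:
        raise ValueError("reference groups are identical at these SNPs")
    return min(max(numerator / denominator, 0.0), 1.0)


def block_bootstrap_ci(observed, ref_a, ref_b, block_size=10,
                       replicates=200, alpha=0.05, seed=0):
    """Percentile confidence interval from resampling contiguous SNP blocks."""
    rng = random.Random(seed)
    n_snps = len(observed)
    blocks = [range(start, min(start + block_size, n_snps))
              for start in range(0, n_snps, block_size)]
    estimates = []
    for _ in range(replicates):
        # Draw whole blocks with replacement so nearby SNPs stay together.
        chosen = [i for _ in blocks for i in rng.choice(blocks)]
        estimates.append(two_way_proportion(
            [observed[i] for i in chosen],
            [ref_a[i] for i in chosen],
            [ref_b[i] for i in chosen],
        ))
    estimates.sort()
    lower = estimates[int((alpha / 2) * replicates)]
    upper = estimates[min(replicates - 1, int((1 - alpha / 2) * replicates))]
    return lower, upper


# Simulated data (hypothetical): 300 SNPs, a 70/30 mixture plus small noise.
sim = random.Random(1)
ref_a = [sim.uniform(0.05, 0.95) for _ in range(300)]
ref_b = [sim.uniform(0.05, 0.95) for _ in range(300)]
observed = [0.7 * a + 0.3 * b + sim.gauss(0, 0.01) for a, b in zip(ref_a, ref_b)]
print(block_bootstrap_ci(observed, ref_a, ref_b))
```

Resampling blocks rather than individual SNPs is meant to respect correlation between neighboring variants; the fast point estimate is what makes repeating the fit hundreds of times practical.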