Advertisementspot_imgspot_img
38.1 C
Delhi
Thursday, April 23, 2026
Advertismentspot_imgspot_img

The evolutionary history and unique genetic diversity of Indigenous Americans

Date:


Sample description and DNA sequencing

In this study, we generated new whole-genome sequencing (WGS) data from 128 Indigenous American individuals representing 45 populations and 28 language families across 8 Latin American countries (Extended Data Fig. 1 and Supplementary Note 1). Whole genomes were sequenced at the Beijing Genomics Institute (China) and Dasa Genômica (Brazil) using BGISEQ-500 and Illumina NovaSeq 6000, with an average sequencing depth of about 44×.

Ethical approval for sample collection was granted by local ethics committees in each country: Argentina (Puerto Madryn Zonal Hospital, resolution no. 009/2015; San Carlos de Bariloche Zonal Hospital, resolution no. 1510/2015), Brazil (CONEP, resolution nos. 763, 4599, 3828655, 7107656 and 8273857), Bolivia (Universidad Mayor de San Andrés), Ecuador (Universidad de Las Américas; Consejo Nacional de Ciencia y Tecnología—CONACyT, grant no. 69856; Instituto Nacional de Ciencias Médicas y de la Nutrición Salvador Zubirán, refs. 15,18), Mexico (CONACyT grant no. 69856; Instituto Nacional de Ciencias Médicas y de la Nutrición Salvador Zubirán, refs. 15,18; CNIC Salud 2013-01-201471; Committee of Ethics and Research, UADY, notice F-FENC-SAC-14/REV: 04, registry no. 09/17) and Peru (Universidad San Martín de Porres). Written informed consent was obtained from all participants before sample collection. Logistical support in Brazil was provided by the Fundação Nacional do Índio. All sampling adhered to the Declaration of Helsinki and the relevant national laws and regulations at the time (Supplementary Note 10).

Read mapping, variant calling and annotation

Whole-genome sequence data preprocessing, variant calling and annotation were performed using the Sarek v.3.5.0 pipeline69. Specifically, sequence data in FASTQ format were aligned to the GRCh38 reference genome and preprocessed according to the GATK Best Practices for germline variant discovery and joint variant calling. Variants in the joint cohort variant call format were then normalized and annotated using the Ensembl Variant Effect Predictor (VEP)70 v.113, incorporating several annotation sources. Annotations from dbSNP, ClinVar and more custom annotations were retrieved using SnpSift and VEP plugins. Ancestral alleles were filtered using the VEP Ancestral Allele plugin to improve the specificity of downstream population genetic analyses. New SNVs were identified by comparing their positions and alleles with those of the variants in public databases (1KGP18, HGDP13, gnomAD19 and dbSNP20). Variants absent from the dataset of more than 270,000 individuals were classified as new (Supplementary Note 2).

Site frequency spectra

We estimated the number of segregating single-nucleotide polymorphisms (SNPs) as a function of sample size using a rarefaction approach on the basis of the site frequency spectrum (SFS). Alternate allele counts were computed for biallelic sites and were used to construct the SFS using Scikit-Allel v.1.3.13. To account for different sample sizes and normalize the comparisons, the folded SFS was projected onto smaller sample sizes using a hypergeometric downsampling method.

Dataset assembly and quality control

Genomic coordinates of the newly sequenced individuals were mapped to the hg38 reference genome. The 128 newly generated genomes were merged with the following publicly available WGS databases (Supplementary Note 1): (1) 1KGP High Coverage, (2) HGDP and (3) SGDP. Sites and individuals with more than 5% missing data were eliminated, biallelic SNVs were selected, and positions with significant deviations (P < 10−8) from the Hardy–Weinberg equilibrium expectations were excluded. Ambiguous positions (A-T and C-G) were also removed. The resulting dataset comprised 199 Indigenous American individuals from 31 language families and 53 ethnic groups. It includes 5,308,880 biallelic SNPs and 3,710 individuals from 201 populations worldwide.

After the initial quality assessment, linkage disequilibrium pruning was performed with ‘SNPRelate’ v.1.28.0 R package71 to exclude markers exhibiting a pairwise correlation greater than 20% (r2 > 0.2) in a 50-kb sliding window, advancing in 10-kb steps. This procedure yielded an linkage-disequilibrium-pruned dataset for downstream analyses that required an independent set of markers (for example, PCA and ADMIXTURE).

PCA

PCA was performed using the ‘SNPRelate’ v.1.28.0 R package71 on both the complete dataset and a subset comprising only local ancestry-masked Indigenous American populations to assess potential biases introduced during data merging and quality control, as well as to explore broad patterns of ancestry and genetic differentiation. In the case of the second analysis, positions with more than 10% missing data were eliminated, as well as those with a minor allele frequency below 5%.

Global ancestry inference

We analysed the Indigenous American genomes using the supervised mode of ADMIXTURE72 v.1.3.0 and three putative ancestry components (K = 3) using a reference panel of diverse African (Bantu from Kenya, Bantu from South Africa, Biaka, Dinka, Khomani-San, Mandenka, Mbuti, San and Yoruba), European (Basque, Bergamo Italian, French, Orcadian, Sardinian and Tuscan) and Indigenous American (Karitiana, Surui, Colombian and Pima) populations without evidence of recent admixture with other continental groups, running 10 independent iterations with 100 bootstrap replicates per run. The consensus results of independent runs were obtained using CLUMPP73 v.1.1.2.

We also investigated the genetic structure of present-day Indigenous Americans and their relationship with ancient Indigenous individuals through unsupervised ADMIXTURE analyses, considering 2 to 12 putative ancestry components (K = 2 to K = 12) and visualized the results using PONG74 v.1.5. In the first analysis, we incorporated a reference panel of African, European and East Asian populations to model the non-Indigenous American ancestry. In the second analysis, we masked non-Indigenous American ancestry (following the approach detailed below) in present-day Indigenous American individuals and then combined them with ancient ones.

Relatedness analysis and sample selection

Subsequently, we performed kinship analysis to identify and remove closely related individuals from the dataset, minimizing the bias introduced by close relatives in downstream analyses. Using PLINK v.1.9 (ref. 75), we estimated the IBD between all pairs of individuals, calculated as PI_HAT = P(IBD = 2) + 0.5 × P(IBD = 1). On the basis of these estimates, we identified the largest set of unrelated individuals by applying a first-degree kinship cut-off. Filtering was conducted using PRIMUS76 v.1.9.0.

Haplotypic phase, local ancestry inference and masking

The haplotypic phase of the genomic data was statistically inferred using ShapeIT4 (ref. 77) v.4.2.2, with the 1KGP dataset as the reference panel. The parameters were adjusted for sequencing data using the ‘–sequencing’ option, with the following settings: 15 burn-in iterations, 15 pruning iterations and 100 main iterations. Local ancestry inference was conducted using RFMix78 v.1.5.4, applying a window size of 0.2 cM and a minimum of five reference haplotypes per tree node, using a reference panel of unadmixed (that is, with no evidence of recent admixture) Indigenous Americans, Sub-Saharan Africans and Western Europeans. We then used the inferred local ancestry tracts to mask genomic sites (code to perform local ancestry masking is available at https://github.com/macscastro/lamask), assigning segments with a posterior probability of being ancestrally Indigenous Americans below 99% as missing data (Supplementary Note 3).

Two further datasets were generated: (1) the first dataset was generated by combining the local ancestry-masked dataset with ten Indigenous Andamanese, of which six were Onge and four were Jarawa79; and (2) the second dataset was created by combining the local ancestry-masked dataset with the Allen Ancient DNA Resource39 and ancient DNA data from sambaqui mound builders found in Brazil15.

Effective migration surface modelling

EEMS modelling80 (https://github.com/dipetkov/eems) was applied to subsets of 116 unadmixed and 160 ancestry-masked Indigenous Americans. The model used 1,200 demes and ran for 4 × 106 iterations with a 2 × 105 burn-in period (Supplementary Note 4). Migration and diversity rates were visualized using scripts from EEMS developers.

Patterns of allele sharing

Using the ‘admixtools’ v.2.0.10 R package, we calculated population pairwise f3(Mbuti; X, Y) to investigate genetic similarity patterns among contemporary Indigenous American populations. These patterns were visualized using a neighbour-joining tree and multidimensional scaling. The same method was used to assess spatial and temporal genetic variation, although genetic similarity was estimated for all pairs of contemporary and ancient X and Y groups. Clusters with genetic similarities were identified as clades in a neighbour-joining tree. The 1 − outgroup f3 distances were also used to infer the existence of a correlation with geographic distances, estimated as great circle distances with the R package ‘geosphere’ v.1.5.18.

Admixture graph and ancestry modelling

We used the ‘find_graphs’ function from the ‘admixtools’ v.2.0.10 R package to identify plausible population history models for contemporary Indigenous American populations (Supplementary Note 6). The algorithm ran ten times with 200 iterations for each inferred number of admixture events. The best-fitting admixture graph for each scenario (ranging from zero to five admixture events) was selected on the basis of its highest score. These graphs were constructed using f2 statistics restricted to transversions, including one representative of each genetic cluster identified in the previous step.

To summarize our findings alongside existing evidence from the literature, we manually constructed admixture graphs and tested their fit to the data using the ‘qpgraph’ function from ‘admixtools’ v.2.0.10 R package. Confidence intervals for drift lengths and admixture weights were computed with the ‘qpgraph_resample_snps’ function, which fits the graph several times using random SNP subsets. The results were summarized using the ‘summarize_fits’ function to generate a data frame with parameter estimates. As in previous analyses, these models were inferred using only transversions, thus avoiding systematic errors in ancient DNA data caused by postmortem damage, which induces C-to-T transitions at methylated CpG sites.

We used the qpWave method to estimate the minimum number of independent ancestral sources contributing to third-dispersal populations and the rotating qpAdm approach to model the sources and their proportional ancestry contributions to present-day Indigenous South Americans (Supplementary Note 6), both from the ‘admixtools’ v.2.0.10 R package. Analyses were restricted to transversions with a maximum per-site missing rate of 10%. Two complementary analyses were performed: (1) including representatives from all genetic clusters and (2) focusing on individuals from the first and second dispersals, alongside ancient genomes from the North American Pacific Coast (that is, excluding present-day Indigenous Americans and Ceramic-period Caribbeans). We identified feasible models in which source contributions summed to 100% and plotted their probabilities and admixture proportions (Extended Data Fig. 11). Full model statistics are reported in Supplementary Table 10.

Effective population size history and IBD sharing

We used IBDNe81 v.07May18.6a4 to infer the effective population size (Ne) histories of present-day Indigenous Americans. IBD segments were identified using the Refined IBD82 v.12Jul18.a0b by merging those with short gaps (maximum gap = 0.6, maximum discordant homozygotes = 1). Ne was inferred by pooling individuals by language families or genetic clusters (minimum ten individuals; populations from Mesoamerica and Aridoamerica were pooled together), as previous studies indicate that historical Ne trajectories remain robust with smaller sample sizes (Supplementary Note 5). Segments greater than 2 cM were analysed using the default IBDNe parameters.

IBD sharing patterns were assessed by categorizing segments by length, which reflected the time since a shared ancestor, to examine shared IBD within and between populations over time. The IBD networks were estimated using the ‘as_tbl_graph’ function (directed = FALSE) and visualized with the ‘ggraph’ function (layout = ‘fr’) from the ‘tidygraph’ v.1.3.1 and ‘ggraph’ v.2.2.1 R packages, respectively.

Effective population size, coalescence rates and divergence times

We applied the coalescent-based method Relate83 v.1.2.2 to the phased WGS data to infer historical changes in Ne and estimate divergence times between contemporary populations. Input files were converted from variant call format to the haps/sample format using the RelateFileFormats script, and haplotypes were flipped according to the ancestral genome using the PrepareInputFiles script, which also filtered SNPs and adjusted the distances on the basis of the genomic mappability mask. Input preparation was performed using the GRCh38 ancestral genome (human_ancestor_GRCh38), the GRCh38 genome mask (20160622_genome_mask_GRCh38) and GRCh38 recombination maps provided by the developer.

Ancestral recombination graphs were inferred using the parameters −m = 1.25 × 10−8 (mutation rate) and −N = 30,000 (effective population size), and the coalescence rate trajectories for each population were estimated using the EstimatePopulationSize script. Ne estimates were obtained by calculating the inverted coalescence rate in the form of 0.5/(coalescence rate), and putative divergence times between groups were identified as the time points at which the inverted coalescence rate values in each population started to diverge from one another, whereas the inverted cross-coalescence rate values between populations increased. This indicates a decrease in cross-coalescence rates between populations, indicating genetic separation and increasing divergence over time. We also inferred the Ne history by inferring the ancestral recombination graph, while considering all individuals as a single population.

ROHs

ROHs were inferred for unrelated and unadmixed Indigenous Americans using PLINK v.1.9 with a sliding window of 50 SNPs, allowing for up to one heterozygous site and five missing calls, a minimum density of one SNP per 50 kb, a maximum gap of 100 kb and a minimum ROH length of 500 kb. We compared the total and average ROH lengths per individual across global populations and Indigenous American clusters of genetic similarity. For global populations, we visualized individual total ROH counts and lengths, whereas for Indigenous Americans, we additionally plotted population-wise averages. We estimated the inbreeding coefficient (FROH − ROH-base inbreeding coefficient) and average inbreeding per population and tested the correlations between ROH counts and ancestry proportions. Additionally, ROH hotspots, defined as regions with an above-average ROH occurrence (more than three standard deviations), were identified (Supplementary Note 5).

Excess affinity with Australasian populations

We analysed the excess genetic affinity between Indigenous American and Australasian populations by computing D(Mbuti, Australasian; X, Y), where X and Y are Indigenous American groups, and Australasians are represented by Australian, Bougainville, Jarawa, Onge and Papuan (‘PapuanHighlands’ and ‘PapuanSepik’) populations. This included comparisons between present-day and ancient Indigenous American populations to trace the Ypykuéra ancestry across space and time. To minimize bias from missing data, we used a subset of unrelated and unadmixed Indigenous American individuals to ensure more reliable results when integrating present-day and ancient genomes, the latter generally having high missingness levels. We also investigated the signatures of natural selection in genomic regions with excess genetic affinity for Australasian populations (Ypykuéra ancestry) and their potential functional effects (Supplementary Note 7).

Selection scans

Natural selection analysis was performed on a subset of unrelated individuals, masking segments of non-Indigenous American ancestry. To detect positive selection signals, we used two approaches based on population differentiation (di statistic84 and PBS85) and extended haplotype homozygosity (iHS86 and xpEHH87). For all four statistics, we conducted sliding-window analysis using 200 SNPs per window with a step size of 50 SNPs. We then combined the genome-wide ranks of the four statistics for each window using Fisher’s combined score88. This score is calculated as the sum, over the four statistics, of −ln(rank of the statistic/number of windows tested). Outlier regions were defined as windows with Fisher’s combined score scores in the 99.9th percentile (Supplementary Note 8).

Archaic introgression inference

To identify genomic regions exhibiting signals of archaic introgression in ancestry-masked Indigenous Americans, we used segments detected by the SPrime method89, using Indigenous Americans as targets and African Mbuti from the HGDP as unadmixed outgroups. To enhance robustness, we applied filtering steps following ref. 64, retaining only (1) high-confidence archaic sites and (2) those found at low frequency in Africa (less than 0.01) but present at greater than or equal to 0.01 in at least one non-African population (Supplementary Note 9).

For sites passing through these filters, we classified a match when the Archaic genotype contained the putative Archaic-specific allele (present in both Neanderthals and Denisovans). Additionally, we identified Neanderthal-specific (matching Neanderthals but differing from Denisovans) and Denisovan-specific (matching Denisovans but differing from Neanderthals) sites.

ORA

ORA was performed using the WEB-based GEne SeT AnaLysis Toolkit (WEB-GESTALT)90, focusing on phenotypes and Gene Ontology categories, including Biological Processes, Cellular Components and Molecular Functions. To address redundancy, we applied a weighted set cover approach, which identified the minimum subset of gene sets that covered all genes from the enriched sets. The weight or cost of adding a gene set was based on P.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



Source link

Share post:

Advertisementspot_imgspot_img

Popular

More like this
Related

Advertisementspot_imgspot_img