Southwest China has always been regarded as a multi-ethnolinguistic region, comprising diverse language families, namely, Sino-Tibetan, Tai-Kadai, Hmong-Mien, and Austroasiatic. There was evidence that southwest China was characterized by significantly different genetic lineages from southern and northern East Asia (represented by Qihe and Bianbian, respectively) (Wang T. et al., 2021). The population continuity from earlier Longlin and bidirectional gene flow between southwest and southeast China played a formative role in the demographic history of this region. Furthermore, southwest China was a vital channel of population migration between Southeast Asia and southern China since the Neolithic Age. The ancestry related to Hoabinhian hunter-gatherers extending from Southeast Asia into southern China was detected in Neolithic Baojianshan samples, while Mainland Southeast Asians from Neolithic Age and Bronze Age showed a genetic affinity with modern southern Chinese Tai-Kadai and Hmong-Mien people (Lipson et al., 2018; Mccoll et al., 2018; Wang T. et al., 2021). Besides, the genetic makeup of ancient and modern southern Chinese was influenced by the southward migration of northern millet farming populations throughout history (Wen et al., 2004; Wang et al., 2014, 2022; Wang C. C. et al., 2021; Wang T. et al., 2021; Yao et al., 2017; Guo et al., 2019b; 2022; Xu et al., 2019; Liu Y. et al., 2020; Ning et al., 2020; Yang et al., 2020; He et al., 2021).

The Sui people of southwest China, who speak a Tai-Kadai language, have a unique and ancient culture, “Sui Shu”. However, the origin of the Sui people remains controversial, and several hypotheses have been proposed: (1) the Sui are an ancient indigenous group that has lived in southern China for 6,000 years; (2) the Sui are a branch of the Baiyue people, who inhabited southern China before the Han dynasty; (3) the Sui originated from the Sui River Basin, gradually migrated to southern China, and integrated into the Baiyue tribes. The origin of the Sui population is therefore still in dispute and needs further investigation.

Previous studies of Tai-Kadai populations, including the Sui, had limited resolution because of the low density of the genetic markers used. One study analyzed 15 autosomal STRs in 2,022 unrelated individuals from five ethnic groups (Sui, Han, Gelao, Jing, and Zhuang) from Guangxi and Guangdong and found that Zhuang and Gelao were genetically closer to Han than Sui and Jing were (Yang et al., 2013). A study of 23 Y-STR loci in 200 unrelated Sui individuals from Guizhou showed that the Sui were genetically distant from northern populations such as northern Han, Manchu, Hui, Mongolian, Japanese, and Koreans (Ji et al., 2017). An analysis of 19 X-STRs in 400 unrelated Guizhou Sui individuals suggested that the Sui had greater genetic similarity to Tai-Kadai-speaking populations and southern Han Chinese (Guo et al., 2019a). Analyses of uniparental lineages showed that the Sui carried high frequencies of the southern Chinese-specific paternal Y-chromosome lineage O1a-M119 and of maternal mitochondrial haplogroups such as B, M7, F, and R (Li et al., 2007, 2008). Bin et al. (2021) reported genome-wide SNPs of 68 Sui individuals from seven indigenous populations in Guizhou Province and Guangxi in southwest China and found that the Sui showed a strong genetic affinity with populations from southern China and Southeast Asia, especially Tai-Kadai- and Hmong-Mien-speaking populations, which supported the hypothesis that the Sui originated in southern China.

Although several reports have explored the genetic relationships of the Guizhou Sui using autosomal and X-chromosomal STRs, STRs may not be the best markers for analyzing population genetic structure, origin, and migration because of their high mutation rate of 10⁻³ to 10⁻² (Slooten and Ricciardi, 2013). Insertions and deletions (InDels) are abundant in the human genome, have a low mutation rate, and are beginning to be used in forensic, anthropological, and population genetic studies (Mills et al., 2006). The Investigator® DIPplex kit, which contains 30 InDels, is an established assay that has been used in many genetic studies to investigate allele frequency distributions among ethnically and linguistically diverse populations worldwide (Turrina et al., 2011). However, only a few studies have reported such data for Chinese ethnic groups.

In summary, the Guizhou Sui population is an ancient ethnic group with a unique culture. However, the available genetic information on Sui people is largely limited to only autosomal and X chromosomal STRs (Ji et al., 2017; Guo et al., 2019a). Therefore, we genotyped 30 InDels from 511 Guizhou Sui unrelated individuals and 700,000 genome-wide SNPs from 15 Sui individuals to thoroughly explore the genetic origin, admixture history, and forensic characteristics of the Guizhou Sui population.

Materials and Methods

Sample Collection, DNA Extraction, and Quantification

A total of 511 blood samples were collected on FTA cards, and 15 saliva samples were collected using the DNA Genotek Kit (WeGene, Shenzhen, China), from unrelated individuals of the Guizhou Sui population with written informed consent. The geographical distribution of the Guizhou Sui is shown in Supplementary Figure 1A. Our study and sample collection were reviewed and approved by the Medical Ethics Committee of Guizhou Medical University and followed the recommendations of the revised Helsinki Declaration of 2000. All participants were officially recognized as Sui and came from families with non-consanguineous marriages over three generations. Genomic DNA was extracted using the QIAamp DNA Blood Mini Kit (QIAGEN, Hilden, Germany), bloodstain samples were extracted using Chelex-100 (Walsh et al., 1991), and DNA was quantified using the Nanodrop-2000 spectrophotometer (Thermo Scientific, Wilmington, DE, United States).

PCR Amplification, Capillary Electrophoresis, and Array Genotyping

A total of 30 InDel markers were amplified on the GeneAmp PCR System 9700 Thermal Cycler (Thermo Fisher Scientific, Wilmington, MA, United States) using the Investigator DIPplex Commercial Kit (Qiagen, Hilden, Germany) in a 10 μl reaction volume containing 0.24 μl Multi Taq2 DNA polymerase, 1.0 μl template DNA, 2 μl primer mix, 2 μl reaction mix, and ddH2O to make up the volume. The PCR conditions were 4 min at 94 °C; 30 cycles of 94 °C for 30 s, 61 °C for 120 s, and 72 °C for 75 s; and a final extension at 68 °C for 60 min. The PCR products were separated on an Applied Biosystems 3500 Genetic Analyzer (Thermo Fisher Scientific, Wilmington, MA, United States) with HiDi Formamide (Thermo Fisher Scientific, Wilmington, MA, United States) and DNA size standard 550 (BTO) (Qiagen, Hilden, Germany) and genotyped using GeneMapper ID-X version 1.3 software (Thermo Fisher Scientific, Wilmington, MA, United States). ddH2O and control DNA 9948 (Promega, Madison, WI, United States) were used as the negative and positive controls, respectively. A total of 690,193 SNPs were genotyped on Illumina WeGene arrays at the WeGene genotyping center, Shenzhen.

Reference Datasets and Data Merging

All the allelic frequency data of 30 InDel markers from 100 populations, including African, (Northern and Southern) American, European, and Asian, were collected as the dataset f. Since the genotype data could not be found in some populations in their reports, we sorted out the genotype data from 52 populations as the dataset g, including an African, 3 European, 7 American, 41 Asian populations, and Guizhou Sui (Fondevila et al., 2012; Friis et al., 2012; Kis et al., 2012; Larue et al., 2012; Akhteruzzaman et al., 2013; Carvalho and Pinheiro, 2013; da silva et al., 2013; Martin et al., 2013; Kim et al., 2014; Ferreira Palha et al., 2015; Hefke et al., 2015; Meng et al., 2015; Guo et al., 2016, 2018; Martinez-Cortes et al., 2016; Mei et al., 2016; Du et al., 2017, 2019; Kong et al., 2017; Li et al., 2018, 2019; Ma et al., 2018; Chen et al., 2019; He et al., 2019a,c; Jian et al., 2019; Liu Y. et al., 2020). The genome-wide SNP data of the Sui population were merged with the reference dataset including 1240 k, Human Origin, and so on (Patterson et al., 2012; Lipson et al., 2018; Mccoll et al., 2018; Liu D. et al., 2020; Ning et al., 2020; Yang et al., 2020; Wang C. C. et al., 2021) to generate two combined datasets covering 189,177 and 71,989 SNPs for subsequent analyses. The 1240 k harbored more SNPs than the Human Origin dataset; nevertheless, the Human Origin dataset included more reference peoples than 1240 k, such as Maonan, Tibetan, Dong_Guizhou, Han, and others from China.

Statistical Analysis

Allele frequencies and the corresponding forensic parameters of the 30 InDels, including the power of discrimination (PD), probability of exclusion (PE), probability of match (PM), and polymorphism information content (PIC), were calculated using the online tool STRAF (STR Analysis for Forensics) (Gouy and Zieger, 2017). Linkage disequilibrium (LD), Hardy–Weinberg equilibrium (HWE), observed heterozygosity (Ho), and expected heterozygosity (He) were analyzed using Arlequin version 3.5 (Excoffier and Lischer, 2010). Nei’s and DA genetic distances between the Guizhou Sui and the other 99 reference populations were calculated from dataset f using PHYLIP version 3.52 (Shimada and Nishida, 2017) and the DISPAN program. Principal component analyses (PCAs) of the 100 populations were performed using MVSP version 3.22 (Kovach, 2007), and the results were visualized as heat maps and scatter plots in R version 3.0.2. A phylogenetic tree was constructed from the DA genetic distances using the neighbor-joining algorithm implemented in MEGA 7.0 (Kumar et al., 2016). The ancestral components of individuals in each population were inferred from dataset g using STRUCTURE version (Evanno et al., 2005), with 15 replicates for each K from 2 to 8, a burn-in of 10,000 iterations, and 10,000 MCMC iterations.
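
The distance calculation described above can be illustrated with a short sketch that computes Nei's DA distance and Nei's standard genetic distance from insertion-allele frequencies at biallelic InDel loci. This is a minimal example written for illustration, not the PHYLIP or DISPAN implementation. The function names are hypothetical, and the reference population frequencies are invented; the two Sui values are the lowest and highest insertion frequencies reported in this study (HLD39 and HLD118).

```python
import math


def _check(freqs_x, freqs_y):
    if len(freqs_x) != len(freqs_y) or not freqs_x:
        raise ValueError("need two equally long, non-empty frequency lists")


def da_distance(freqs_x, freqs_y):
    """Nei's DA distance from insertion-allele frequencies at biallelic loci."""
    _check(freqs_x, freqs_y)
    shared = 0.0
    for x, y in zip(freqs_x, freqs_y):
        # Sum over both alleles (insertion, deletion) of sqrt(x_i * y_i).
        shared += math.sqrt(x * y) + math.sqrt((1.0 - x) * (1.0 - y))
    return 1.0 - shared / len(freqs_x)


def nei_standard_distance(freqs_x, freqs_y):
    """Nei's standard genetic distance D = -ln(Jxy / sqrt(Jx * Jy))."""
    _check(freqs_x, freqs_y)
    n = len(freqs_x)
    jx = sum(x * x + (1.0 - x) ** 2 for x in freqs_x) / n
    jy = sum(y * y + (1.0 - y) ** 2 for y in freqs_y) / n
    jxy = sum(x * y + (1.0 - x) * (1.0 - y) for x, y in zip(freqs_x, freqs_y)) / n
    if jxy == 0.0:
        return float("inf")  # no shared alleles at any locus
    return -math.log(jxy / math.sqrt(jx * jy))


sui = [0.0939, 0.9207]    # HLD39 and HLD118 insertion frequencies (this study)
reference = [0.30, 0.70]  # hypothetical reference population, illustration only
print(round(da_distance(sui, reference), 4))
print(round(nei_standard_distance(sui, reference), 4))
```

Both distances are zero for populations with identical allele frequencies, and DA reaches its maximum of 1 when the two populations share no alleles at any locus.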

Genomic-Based Statistical Analysis

We performed population genetic analyses including PCA (Patterson et al., 2006) and ADMIXTURE, after pruning SNPs in strong LD with PLINK using the parameters “--indep-pairwise 200 25 0.4” (Alexander et al., 2009; Chang et al., 2015), as well as f statistics (Patterson et al., 2012), qpWave, and qpAdm (Patterson et al., 2012; Haak et al., 2015). We calculated outgroup f3(Sui_Sandu, Populations; Mbuti) to measure the shared genetic drift between the Sui population and reference groups, and computed admixture f3 statistics of the form f3(Source1, Source2; Sui_Sandu) using the qp3pop program with default parameters in ADMIXTOOLS to explore potential admixture signals. We further estimated f4 statistics of the forms f4(Mbuti, X; Sui_Sandu, Tai-Kadai/Guizhou) and f4(ancient1, ancient2; Sui_Sandu, Mbuti) using the qpDstat program with the default block jackknife to study genetic affinity, continuity, and admixture. We estimated pairwise genetic distances (Fst) using the smartpca program in EIGENSOFT with the parameters fstonly: YES and inbreed: YES. We then inferred a rooted maximum likelihood tree using TreeMix (Pickrell and Pritchard, 2012). We used qpWave, as implemented in ADMIXTOOLS, to explore the minimum number of ancestral sources of the Sui population and used qpAdm to quantify the proportions of ancestral populations. Finally, we used ALDER (Loh et al., 2013) to estimate admixture times, assuming 28 years per generation.
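
To make the f4 statistics and their Z-scores concrete, the sketch below estimates f4(A, B; C, D) as the mean of (a − b)(c − d) over SNPs and obtains a standard error from a simple block jackknife. It is a simplified illustration under stated assumptions, not the qpDstat implementation: blocks here are contiguous runs with equal SNP counts, whereas qpDstat defines its own blocks along the genome, and all allele frequencies are invented. The function name is hypothetical.

```python
import math


def f4_block_jackknife(a, b, c, d, n_blocks):
    """Estimate f4(A, B; C, D) with a block-jackknife standard error.

    a, b, c, d are per-SNP allele frequencies in the four populations,
    ordered along the genome. Returns (estimate, standard_error, z_score).
    """
    n = len(a)
    if not (len(b) == len(c) == len(d) == n):
        raise ValueError("all frequency lists must have the same length")
    if n_blocks < 2 or n_blocks > n:
        raise ValueError("need 2 <= n_blocks <= number of SNPs")

    # Per-SNP contribution (a - b) * (c - d); f4 is the mean over SNPs.
    products = [(ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d)]
    total = sum(products)
    estimate = total / n

    # Leave one contiguous block out at a time and re-estimate f4.
    bounds = [i * n // n_blocks for i in range(n_blocks + 1)]
    leave_one_out = []
    for start, end in zip(bounds, bounds[1:]):
        block_sum = sum(products[start:end])
        leave_one_out.append((total - block_sum) / (n - (end - start)))

    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (x - mean_loo) ** 2 for x in leave_one_out
    )
    se = math.sqrt(variance)
    z = estimate / se if se > 0 else float("nan")
    return estimate, se, z


# Hypothetical frequencies at six SNPs for f4(Mbuti, X; Sui_Sandu, Y).
mbuti = [0.10, 0.80, 0.35, 0.60, 0.05, 0.90]
pop_x = [0.40, 0.55, 0.50, 0.30, 0.20, 0.70]
sui = [0.45, 0.50, 0.55, 0.25, 0.25, 0.65]
pop_y = [0.35, 0.60, 0.45, 0.35, 0.15, 0.75]
print(f4_block_jackknife(mbuti, pop_x, sui, pop_y, n_blocks=3))
```

With this sign convention, a negative f4(Mbuti, X; Sui_Sandu, Y) means that X shares more alleles with Sui_Sandu than with Y, which is how the significantly negative statistics are read in the Results.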

Results

Genetic Diversity, Allele Frequencies, and Forensic Efficiency Parameters Based on the 30 InDels

The genotypes of the 30 InDels are listed in Supplementary Table 1. None of the 30 InDels deviated from HWE after Bonferroni correction (significance threshold p < 0.05/30). In the LD heat map, a deeper red indicates stronger linkage (Supplementary Figure 1B). No pair of loci was in LD at a threshold of r2 > 0.8. When LD was assessed by p-value, no significant LD was observed between any pair of loci after Bonferroni correction (p < 0.05/30) (Supplementary Table 2).
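
The logic of the HWE check with a Bonferroni-corrected threshold can be sketched as follows. This is an illustrative example only: it uses a chi-square goodness-of-fit test with one degree of freedom, which is an assumption made here for simplicity and may differ from the test implemented in Arlequin. The function name and the genotype counts are hypothetical; only the sample size (511) and the threshold 0.05/30 come from the text.

```python
import math


def hwe_chi_square(n_ins_ins, n_ins_del, n_del_del):
    """Chi-square HWE test for one biallelic InDel locus.

    Takes genotype counts (insertion/insertion, insertion/deletion,
    deletion/deletion) and returns (chi_square, p_value) with 1 df.
    """
    n = n_ins_ins + n_ins_del + n_del_del
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_ins_ins + n_ins_del) / (2 * n)  # insertion allele frequency
    q = 1.0 - p
    if p in (0.0, 1.0):
        return 0.0, 1.0  # monomorphic locus: nothing to test
    observed = (n_ins_ins, n_ins_del, n_del_del)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


BONFERRONI_ALPHA = 0.05 / 30  # threshold used for the 30 InDels

# Hypothetical genotype counts for one locus in 511 individuals.
chi2, p_value = hwe_chi_square(128, 255, 128)
print(chi2, p_value, p_value < BONFERRONI_ALPHA)
```

A locus is flagged as deviating from HWE only when its p-value falls below the corrected threshold.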

The insertion allele frequencies ranged from 0.0939 (HLD39) to 0.9207 (HLD118) (Supplementary Table 3). The forensic parameters Ho, PD, PE, PIC, and typical paternity index (TPI) of loci HLD118, HLD64, HLD39, HLD99, HLD111, HLD84, HLD122, HLD114, HLD81, and HLD67 were lower than those of the other InDel loci, indicating low genetic diversity at these loci in the Guizhou Sui population, similar to other Asian ethnic groups reported previously (Guo et al., 2018; He et al., 2019c; Zhang et al., 2019). The combined power of discrimination (CPD) of the 30 InDels was 0.999999999983, suggesting that this assay meets the requirements of individual identification in the Guizhou Sui population, whereas the cumulative probability of exclusion (CPE) was only 0.9986, indicating that the 30 InDels can serve only as a supplementary system for parentage testing.
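
For readers unfamiliar with these parameters, the sketch below shows how per-locus values and the combined statistics can be derived from genotype counts at a biallelic locus. It uses commonly cited textbook formulas (PM as the sum of squared genotype frequencies, PE = h²(1 − 2hH²) with h and H the heterozygote and homozygote proportions, and combined values as 1 − ∏(1 − value)). Whether STRAF applies exactly these formulas is an assumption, and the function names and genotype counts are hypothetical.

```python
def locus_parameters(n_ins_ins, n_ins_del, n_del_del):
    """Forensic parameters for one biallelic InDel from genotype counts."""
    n = n_ins_ins + n_ins_del + n_del_del
    if n == 0:
        raise ValueError("no genotypes supplied")
    genotype_freqs = [n_ins_ins / n, n_ins_del / n, n_del_del / n]
    p = (2 * n_ins_ins + n_ins_del) / (2 * n)  # insertion allele frequency
    q = 1.0 - p
    het = genotype_freqs[1]  # observed heterozygosity (Ho)
    hom = 1.0 - het
    pm = sum(g * g for g in genotype_freqs)  # probability of match
    return {
        "Ho": het,
        "PM": pm,
        "PD": 1.0 - pm,
        "PIC": 1.0 - p * p - q * q - 2.0 * p * p * q * q,
        "PE": het * het * (1.0 - 2.0 * het * hom * hom),
        "TPI": float("inf") if hom == 0 else 1.0 / (2.0 * hom),
    }


def combined(values):
    """Combine independent per-locus PD or PE values: 1 - prod(1 - v)."""
    product = 1.0
    for v in values:
        product *= 1.0 - v
    return 1.0 - product


# Two hypothetical loci with identical genotype counts (illustration only).
loci = [locus_parameters(25, 50, 25), locus_parameters(25, 50, 25)]
print(combined([locus["PD"] for locus in loci]))  # combined PD
print(combined([locus["PE"] for locus in loci]))  # combined PE
```

Because a biallelic marker has at most three genotypes, its per-locus PE is small, which is consistent with the CPE of the 30 InDels remaining well below the CPD.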

Genetic Structure and Genetic Affinity Explorations Among 99 Worldwide Populations via the DIPplex System

We collected the frequency data of the 30 InDels of the Investigator® DIPplex system from 99 published global reference populations (Supplementary Figure 2). The populations fell mainly into four branches, namely African, East Asian, American, and European populations, demonstrating that these InDels differentiate intercontinental populations well. The 30 InDels clustered roughly into three groups: cluster A contained six loci (HLD118, HLD67, HLD84, HLD99, HLD64, and HLD81), cluster B contained five loci (HLD114, HLD128, HLD111, HLD39, and HLD122), and the remaining loci formed a large cluster C. The frequencies of the five loci in cluster B differed most among the 100 populations, which makes them better candidates for ancestry-informative markers for inferring ethnic groups. Guizhou Sui clustered with the Tai-Kadai- and Hmong-Mien-speaking populations of southern China, and its allele frequencies showed no significant difference from those of other East Asian ethnic groups.

To investigate the genetic relationships between the Guizhou Sui and the reference populations, we performed genetic analyses based on the 30 InDel markers. PCA of the frequency data of the 30 InDels in the 100 populations of dataset f showed that populations from each continent clustered together and that Sui clustered with East Asian populations (Figure 1A). After European, African, American, and South Asian populations were removed, three genetic clusters of East Asian populations were observed: high-altitude Tibetans, low-altitude northern populations, and southern populations. Sui samples clustered with the Tai-Kadai populations and were closest to Gelao_Guangxi, Zhuang_Guangxi, and the Austroasiatic-speaking Vietnamese (Figure 1B). We then calculated DA and Nei’s genetic distances (Supplementary Tables 4, 5 and Supplementary Figures 3, 4) and constructed a phylogenetic tree (Figure 2). Population clustering was consistent with geographic and linguistic similarity, and Sui had the closest genetic relationship with other Tai-Kadai-speaking groups and geographical neighbors.

Figure 1. Patterns of genetic relationship between the Guizhou Sui and global reference populations based on the low-density InDel dataset. (A) The first two principal components of the principal component analysis (PCA) of the Guizhou Sui and 99 worldwide reference populations. (B) PCA of East Asian populations, showing the close genetic relationship between Sui and other Tai-Kadai populations.

Figure 2. Neighbor-joining phylogenetic tree constructed from the DA genetic distances among the Guizhou Sui and 99 worldwide reference populations.

We further performed ancestral component analysis using the model-based algorithm of STRUCTURE (Supplementary Figure 5), with the number of hypothetical ancestral populations (K) ranging from 2 to 8, on the 30-InDel genotype data of the Guizhou Sui and 51 populations from different continents. We observed the lowest CV errors at K = 4, with four ancestral components in East Asian populations. Most East Asians possessed two ancestral components, except for Hui, Kazakh, Dongxiang, and Yugur, which carried West Eurasian admixture (Wang C. C. et al., 2021). There was a subtle difference in ancestral composition between Tai-Kadai- and Hmong-Mien-speaking populations. The ancestral composition of the Guizhou Sui was similar to that of Hmong-Mien speakers such as Miao, Yao, and She and of some Tai-Kadai-speaking groups such as Gelao_Guangxi and Zhuang.

Population Genomic Analyses Revealed the Genetic Affinity of Sui Based on Genome-Wide Data

We generated genome-wide data from 15 Sui individuals, and first conducted PCA based on genome-wide data to explore the population genetic structure across East and Southeast Asia. We projected published ancient individuals onto PC plots (Figure 3A). Tai-Kadai-speaking populations clustered at the southern end of the south-north cline and overlapped with some Austronesian individuals. Ancient populations of southern China, such as Late Neolithic Xitoucun samples from Fujian and Iron Age Hanben samples from Taiwan, partly overlapped with the modern Tai-Kadai populations, documenting a long-term genetic continuity. In addition, mainland Southeast Asians from Bronze Age ancient populations to modern Vietnamese such as Muong/Kinh also had a closer genetic relationship with Tai-Kadai speakers. The 15 Sui individuals clustered tightly with other Tai-Kadai-speaking populations.

Figure 3. The population structure of modern and ancient populations in Asia based on genome-wide data. (A) PCA results showed an overview population relationship between modern populations and ancient populations. (B) ADMIXTURE results (the lowest CV errors K = 6): ancestral components among Guizhou Sui and modern and ancient populations in Asia.

The outgroup f3 statistics (Figure 4) showed that the Sui population had the closest genetic affinity with Tai-Kadai-speaking groups and also shared more genetic drift with Hmong-Mien groups in Vietnam and with some Austronesian groups such as Ami and Kankanaey. When compared with ancient populations, Sui shared most alleles with ancient Taiwanese and Southeast Asians such as Malaysia_Historical, Vietnam_BA_NuiNap, and Neolithic Vanuatu (Lipson et al., 2020), followed by Yellow River farming populations from the Late Neolithic to Iron Age. The phylogenetic relationships between the studied Sui population and modern East and Southeast Asian populations were further confirmed by a Treemix-based phylogenetic tree (Supplementary Figure 6). We observed southern populations including Austronesian, Austroasiatic, Hmong-Mien, and Tai-Kadai speaking groups clustered together as a southern branch. Sui first clustered with Maonan of Guangxi in China and then with Nung of Vietnam.

Figure 4. The shared genetic drift between Sui and modern/ancient populations in Asia showed a closer affinity of Sui and Tai-Kadai/Hmong-Mien populations.

We also calculated Wright’s fixation index (Fst) among 87 modern and ancient populations (Supplementary Table 7). Sui had the smallest genetic distances from Zhuang and Dong_Guizhou (Fst = 0.003), followed by Dong_Hunan and Maonan (Fst = 0.004). Overall, Sui showed a close genetic relationship first with Tai-Kadai-speaking groups and then with Austroasiatic-speaking groups such as Kinh, Muong, and Vietnamese and with southern Han. The close genetic affinity of Kinh, Vietnamese, and southern Han with Tai-Kadai groups was also observed for other Chinese Tai-Kadai speakers, indicating genetic exchange among Tai-Kadai groups, Vietnamese Austroasiatic groups, and southern Han. We also found a smaller genetic distance between Sui and ancient millet farming populations of the Yellow River Basin from the Late Neolithic onward.
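
As an illustration of how a pairwise Fst can be computed from allele frequencies, the sketch below implements Hudson's estimator as a ratio of sums over SNPs, with a correction for finite sample size. This is an assumption made for illustration and is not necessarily identical to the calculation inside smartpca. The function name and the frequencies are hypothetical; the sample sizes are numbers of sampled chromosomes (30 for 15 diploid Sui individuals).

```python
def hudson_fst(freqs1, freqs2, n_chrom1, n_chrom2):
    """Hudson's Fst between two populations as a ratio of sums over SNPs.

    freqs1, freqs2: per-SNP allele frequencies in the two samples.
    n_chrom1, n_chrom2: number of sampled chromosomes in each population.
    """
    if len(freqs1) != len(freqs2) or not freqs1:
        raise ValueError("need two equally long, non-empty frequency lists")
    if n_chrom1 < 2 or n_chrom2 < 2:
        raise ValueError("need at least two chromosomes per population")
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        # Squared frequency difference, corrected for sampling noise.
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1.0 - p1) / (n_chrom1 - 1)
            - p2 * (1.0 - p2) / (n_chrom2 - 1)
        )
        # Probability that two alleles, one from each population, differ.
        denominator += p1 * (1.0 - p2) + p2 * (1.0 - p1)
    if denominator == 0.0:
        raise ValueError("no informative SNPs")
    return numerator / denominator


# Hypothetical frequencies at four SNPs; 15 Sui individuals = 30 chromosomes.
sui = [0.20, 0.55, 0.80, 0.35]
other = [0.25, 0.50, 0.75, 0.40]
print(hudson_fst(sui, other, n_chrom1=30, n_chrom2=40))
```

Because of the sampling correction, the estimate can be slightly negative when two samples have nearly identical allele frequencies; such values are usually interpreted as no detectable differentiation.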

We further computed four-population statistics of the form f4(Mbuti, X; Sui_Sandu, CoLao/Dai/Dong_Guizhou/LaChi/Li/Maonan/Mulam/Zhuang) to study the genetic relationship of Sui_Sandu with other Tai-Kadai-speaking groups (Supplementary Table 7). Sui was relatively genetically homogeneous with other Tai-Kadai populations, as reflected in non-significant f4 statistics (|Z| < 2), except for Dai, which was genetically influenced by Austroasiatic speakers. Notably, we observed significant allele sharing between Hmong and Sui_Sandu, as reflected in significantly negative statistics of the form f4(Mbuti, Hmong; Sui_Sandu, LaChi/Li/Maonan/Mulam/Zhuang). Statistics of the form f4(Hmong-Mien/Austronesians/Austroasiatic, Hmong-Mien/Austronesians/Austroasiatic; Sui_Sandu, Mbuti) with Z < −3 (Figure 5) indicated that the levels of allele sharing with Sui were roughly similar among Hmong-Mien populations, Austronesians in and adjacent to Taiwan (Ami, Atayal, and Kankanaey), and Austroasiatic speakers (Kinh, Vietnamese, and Muong). Among ancient populations from the Neolithic to the Iron Age, later populations related to Austronesians in Taiwan shared more alleles with Sui than earlier populations in southern China did, and we observed the same pattern relative to ancient Southeast Asians and northern populations in statistics of the form f4(Taiwan_Hanben/Taiwan_Gongguan, ancients; Sui_Sandu, Mbuti) with Z > 3 (Supplementary Figure 6). Millet farmers of the Yellow River Basin since the Late Neolithic also showed a closer genetic affinity with Sui, as shown by f4(YR_LN/YR_LBIA, ancients; Sui_Sandu, Mbuti) with Z > 3. Late Neolithic southern Chinese populations and Southeast Asian populations since the Neolithic Age shared roughly equal levels of genetic affinity with Sui, as shown by f4(Xitoucun/Tanshishan, Neolithic/Bronze Age/Historical Southeast Asian population; Sui_Sandu, Mbuti) with Z < 2 (Supplementary Figure 7). Statistics of the forms f4(Laos_Hoabinhian, Vietnam_N_ManBac/Vietnam_LN/Vietnam_BA_NuiNap/Vietnam_BA_DongSonCulture/Vietnam_LN_HaLongCulture; Sui_Sandu, Mbuti) and f4(Vietnam_N_ManBac, Vietnam_LN; Sui_Sandu, Mbuti) had Z < −2 (Supplementary Figure 6), suggesting that gene flow from Sui-related populations into Southeast Asia can be detected in Neolithic Vietnamese. Furthermore, we explored the genetic relationship between the Sui and other populations in Guizhou using f4 statistics of the form f4(Mbuti, X; Sui_Sandu, Manchu/Mongolian/Dong) (Supplementary Table 8). Sui showed a genetic affinity pattern similar to that of Dong, as mentioned above, and a significantly different pattern from the Tungusic- and Mongolic-speaking groups in Guizhou, because Sui shared more alleles with southern populations than Manchu and Mongolian did.

Figure 5. Genetic affinity between Guizhou Sui and southern populations shown in f4(population1, population2; Sui_Sandu, Mbuti). Significant Z scores were labeled by “++” “– –” (Z > 6, Z < –6, respectively), and Z-values with 3 < Z < 6, –6 < Z < –3 were labeled by “+,” “–,” respectively.
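
The labeling rule in the caption of Figure 5 can be written as a small function. This is an illustrative sketch with a hypothetical function name. The caption does not say how values exactly equal to ±3 or ±6 are treated, so the sketch assumes that they fall into the weaker category, and it writes the negative labels with ASCII hyphens.

```python
def z_label(z):
    """Map an f4 Z-score to the significance label used in Figure 5."""
    if z > 6:
        return "++"
    if z > 3:
        return "+"
    if z < -6:
        return "--"
    if z < -3:
        return "-"
    return ""  # |Z| <= 3 (boundary handling is an assumption): not labeled


for z in (7.2, 4.1, 0.8, -3.5, -9.0):
    print(z, repr(z_label(z)))
```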

Ancestral Origins and Genetic History Reconstruction of Sui From Modern/Ancient DNA Perspectives

We co-analyzed Sui samples with modern and ancient East and Southeast Asians using ADMIXTURE to infer the population clustering patterns (Figure 3B). We observed that the Sui population harbored four main ancestral components. The first dominant ancestral component was maximized in Lachi/Boy which were the Sino-Tibetan group, the second ancestral component was enriched in Hmong-Mien speaking groups represented by Hmong, the third component was related to Austroasiatic-speaking populations represented by Malaysia_LN, and the fourth component was related to Austronesian-speaking populations represented by Taiwan_Hanben. We also detected a small proportion of northern ancestral components represented by ancient samples from Nepal and Yellow River Basin, present-day Ulchi from northeast Asia. Interestingly, the genetic composition of some Hmong-Mien groups was similar to that of Tai-Kadai populations, such as Dao, which was consistent with previous studies that showed there have been extensive interactions between the Hmong-Mien and Tai-Kadai speaking groups (Liu Y. et al., 2020; Macholdt et al., 2020).

To identify the potential genetic sources of Sui_Sandu, we computed f3 statistics of the form f3(source1, source2; Sui_Sandu) using 147 populations from Asia including 91 modern populations and 46 ancient populations (Supplementary Table 9). After excluding ancient source pairs with fewer than 1,000 overlapping SNPs (a total of 139 samples), we found only 179 source pairs with negative f3 values, all with weak Z-scores (−2.2 < Z < 0). Moreover, the numbers of overlapping SNPs behind these negative results were small, implying either that the Sui population has experienced no recent admixture because of geographic isolation or that the result is a false negative caused by strong genetic drift in Sui.

Z-values were below −2 for f4(Mbuti, YR_LN; Sui_Sandu, Southern ancient Chinese/Southeast Asian) and below −3 for f4(Mbuti, Taiwan_Hanben; Sui_Sandu, YR/Northeast Asian/Southeast Asian/Nepal) and f4(Mbuti, ancient Southeast Asian; Sui_Sandu, YR/Northeast Asian), indicating that the allele frequencies of the Sui people were intermediate between those of northern and southern Chinese populations (Supplementary Table 10). To quantitatively explore plausible admixture models for Sui and other Tai-Kadai groups, we used qpWave to infer the minimum number of ancestral sources and qpAdm to estimate the corresponding ancestry proportions (Supplementary Table 11). The qpWave results showed that two-way or three-way admixture models could explain the genetic formation of the Sui people (χ2 p > 0.05). The Tai-Kadai populations could be modeled as an admixture of southern East Asian and northern East Asian populations. Using qpAdm, we found that Tai-Kadai populations derived ∼61.4–85% of their ancestry from an Atayal-related source and ∼15–38.6% from a source related to Late Neolithic millet farmers of the Longshan Culture in the Yellow River Basin (0.045 < χ2 p < 0.19), in accordance with the historical record of the southward expansion of agriculturists from the Yellow River Basin. Sui_Sandu was estimated to harbor ∼75.2% Atayal-related ancestry and ∼34.8% YR_LN-related ancestry (χ2 p = 0.0498197467). A more complex three-way model of YR_LN + Atayal + Vietnam_LN fitted all populations (χ2 p > 0.05) but with large standard errors for Sui_Sandu, Maonan, and Dong_Guizhou. When we replaced the ancient Southeast Asians with Mang, a modern Austroasiatic-speaking population from Vietnam, Tai-Kadai populations could be modeled as 9.4–34.5% YR_LN + 56.7–74.5% Atayal + 8.8–19.5% Mang, suggesting that Southeast Asian-related ancestry contributed to the gene pool of Tai-Kadai populations. We therefore inferred that interactions among rice farmers of the Yangtze River, earlier Austroasiatic populations, and millet farmers of the Yellow River Basin influenced the formation of Tai-Kadai populations.

We further used ALDER, which is based on weighted LD statistics, to estimate when northern, southern Chinese, or Southeast Asian-related ancestry entered the gene pool of the Sui population. We explored a total of 1,539 pairs of reference groups and found that two pairs gave reasonable admixture dates for the Guizhou Sui (Supplementary Table 12). The Southeast Asian admixture in Sui was dated to 49.65 ± 23.78 generations ago (1,390 ± 666 BP) in the Hmong-Ede model and 42.59 ± 12.46 generations ago (1,193 ± 349 BP) in the Hmong-Cambodian model. We also used the one-reference mode and detected further admixture events in Sui, including a more recent northern admixture ∼31.7 ± 17.0 generations ago using Dongxiang as a single reference and admixture with other Tai-Kadai speakers 79.4 ± 20.3 generations ago using Li as a single reference and 87.6 ± 11.6 generations ago using Dai as a single reference.
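
The conversion from ALDER generations to calendar years follows directly from the 28-year generation length stated in the Methods. The short sketch below reproduces the two reported dates; the function name is hypothetical, and the standard error is simply scaled by the same factor.

```python
YEARS_PER_GENERATION = 28  # generation length assumed in the Methods


def generations_to_years(generations, std_error,
                         years_per_generation=YEARS_PER_GENERATION):
    """Convert an admixture date and its standard error to years."""
    return generations * years_per_generation, std_error * years_per_generation


# Two-reference ALDER estimates reported in the text (generations, SE).
models = {
    "Hmong-Ede": (49.65, 23.78),
    "Hmong-Cambodian": (42.59, 12.46),
}
for name, (generations, std_error) in models.items():
    years, error = generations_to_years(generations, std_error)
    print(f"{name}: {years:.0f} ± {error:.0f} years ago")
# Hmong-Ede: 1390 ± 666 years ago
# Hmong-Cambodian: 1193 ± 349 years ago
```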

The Uniparental Genetic Profile of Guizhou Sui

We obtained uniparental Y chromosomal and mtDNA haplotypes for 15 individuals, of whom 7 were men (Supplementary Table 13). From the maternal perspective, Sui individuals had 10 mitochondrial haplotypes with frequencies ranging from 0.067 in F3a1 to 0.2 in M7b1a1a3, in which M7b was the dominant lineage. The maternal profile was similar to that of the Tai-Kadai populations in southern China, but we also detected a northwestern lineage D4 in Sui samples (Li et al., 2007).

We found 3 different paternal Y chromosomal lineages with frequencies ranging from 0.143 of O1a1a1b2-Z23404 and C2c-F9851 to 0.714 of O2a2a1a2a1a2-N5 in Sui samples. The haplogroup O2a2a1a2a1a2-N5 is a lineage prevalently found in Hmong-Mien-speaking populations (Xia et al., 2019), documenting the paternal genetic affinity of Sui and Hmong-Mien people.
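
The reported Y-chromosome haplogroup frequencies can be reproduced from simple counts. In the sketch below, the counts of five, one, and one carriers among the seven men are inferred from the reported frequencies (0.714 and 0.143) and are therefore an assumption; the individual-level calls are in Supplementary Table 13 and are not reproduced here. The function name is hypothetical.

```python
from collections import Counter


def haplogroup_frequencies(calls):
    """Return the frequency of each haplogroup in a list of per-individual calls."""
    if not calls:
        raise ValueError("no haplogroup calls supplied")
    counts = Counter(calls)
    return {haplogroup: count / len(calls) for haplogroup, count in counts.items()}


# Counts inferred from the reported frequencies among seven men (assumption).
y_calls = ["O2a2a1a2a1a2-N5"] * 5 + ["O1a1a1b2-Z23404", "C2c-F9851"]
for haplogroup, freq in haplogroup_frequencies(y_calls).items():
    print(f"{haplogroup}: {freq:.3f}")
```

With only seven male samples, each individual changes a frequency by about 0.14, so these values should be read as rough indications rather than population estimates.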

Discussion

Guizhou is a multiethnic province in the hinterland of southwest China and a multilingual region comprising Hmong-Mien, Tai-Kadai, and Sino-Tibetan languages. The region is regarded as an indispensable resource for exploring the genetic diversity, origin, migration, and admixture of ethnic groups in China. However, the genetic origin, admixture, and forensic characteristics of Tai-Kadai populations are still unclear and need to be comprehensively explored. We therefore genotyped 30 InDels in 511 unrelated Sui individuals from Guizhou, generated genome-wide SNP data, and merged the latter with publicly available modern and ancient genomes to reconstruct the population history of the Guizhou Sui in southwest China and shed light on their formation.

First, we observed that most of the 30 InDel loci were highly polymorphic in the Guizhou Sui population and can be used for individual identification in Chinese populations. However, the following 10 loci with low polymorphism are not recommended for inclusion in an InDel system for forensic use in Chinese populations: HLD118, HLD64, HLD39, HLD99, HLD111, HLD84, HLD122, HLD114, HLD81, and HLD67. In addition, the CPD shows that the InDel system is a powerful tool for individual identification, but the low CPE means that it can serve only as a supplement for paternity testing.

The origin of the Sui has been controversial because descriptions in historical documents are indistinct. Moreover, the genetic relationships and admixture history of the Sui population have not been well studied because of sparse sampling. In this study, we first used highly polymorphic and informative forensic InDel markers to explore the genetic relationship between Sui and East Asian populations. We found that the allele frequencies of the 30 InDels differed significantly among intercontinental populations and that the frequency characteristics of these markers in the Sui were similar to those of other Chinese groups. Sui had a close genetic relationship with the surrounding Tai-Kadai-speaking groups. Second, our comprehensive genome-wide SNP analysis showed that the genetic formation of the Sui people has been influenced by surrounding populations, including Tibeto-Burman-, Hmong-Mien-, Austronesian-, and Austroasiatic-speaking groups (He et al., 2020). We observed that the Tai-Kadai populations have the closest genetic affinity with Hmong-Mien people. Extensive interaction between Tai-Kadai and Hmong-Mien has also been shown in analyses of uniparental and InDel markers and in linguistics (Chen et al., 2007, 2018; Blench, 2017; He et al., 2019b; Huang et al., 2020; Macholdt et al., 2020). According to linguistic information and historical documents, the Hmong-Mien and Tai-Kadai language families both originated in southern China and then expanded southward into mainland Southeast Asia approximately two millennia ago (Edmondson and Gregerson, 2007; Bellwood and Ness, 2015). We also detected gene flow from Tai-Kadai-related populations into Southeast Asians since the Neolithic Age in the negative f4(Laos_Hoabinhian, Neolithic to present-day Southeast Asians; Sui_Sandu, Mbuti) statistics, which is consistent with previous genomic studies (Lipson et al., 2018; Huang et al., 2020; Liu D. et al., 2020).
We found that the Tai-Kadai people carried a high proportion of Austroasiatic-related ancestry (8.8–19.5%) when Mang was used as the genetic source, indicating extensive contact between ancient Tai-Kadai and Austroasiatic populations. The admixture among Tai-Kadai, Hmong-Mien, and Austronesian/Austroasiatic speakers was consistent with bidirectional interactions between populations in southwest China and Southeast Asia. The admixture among ancient Tai-Kadai, Austronesian, and Austroasiatic speakers was dated to approximately 2,600–1,400 years ago. Within Guizhou, the Sui studied here were genetically similar to previously studied Tai-Kadai populations in the province, such as Sui, Bouyei, and Dong. These populations could be modeled as admixtures of Neolithic Yellow River farmers (15–40%) and ancient coastal southern East Asians (45–85%) (Bin et al., 2021; Wang C. C. et al., 2021). Interestingly, we found that the dominant paternal Y-chromosomal lineage of our Sui samples most likely derived from Hmong-Mien people, whereas the mtDNA profile was similar to that of Tai-Kadai populations in southern China, pointing to a possible sex-biased admixture in the formation of the Guizhou Sui people.

Our findings also showed a phylogenetic relationship between present-day Tai-Kadai- and Austronesian-speaking populations, which corroborated a previous genetic study of the affinity between Tai-Kadai and Austronesian speakers (Wang C. C. et al., 2021) and was also compatible with the linguistic hypothesis of “the common origin of Austronesian-Tai-Kadai” (Sagart, 2004). Meanwhile, genomic analyses of ancient populations in Fujian Province and Taiwan supported the view that the origin of the Austronesian language family can be traced back to southeastern China (Yang et al., 2020; Wang C. C. et al., 2021). We observed a close connection between ancient populations in southeastern China and Tai-Kadai populations, which indicated that the common place of origin of Austronesian- and Tai-Kadai-speaking groups was probably southern China (Huang et al., 2020). According to historical documents, southern China was once inhabited by the ancient Baiyue tribes, which gradually developed into Austronesian and Tai-Kadai groups (Lee, 2011). The Yangtze River rice farmers were hypothesized to have contributed to the gene pool of Austronesian, Tai-Kadai, and Austroasiatic speakers (Lipson et al., 2018; Mccoll et al., 2018). The formation of Tai-Kadai speakers has also been influenced by the southward expansion of northern populations, such as the millet farmers of the Neolithic Yangshao and Longshan cultures in the Yellow River Basin (Yang et al., 2020).


Conclusion

We used both low-density forensic InDels and high-density genome-wide SNPs in this study to infer the genetic characteristics of the Tai-Kadai-speaking Sui people of Guizhou. The 30 InDel loci were found to be forensically informative in the Guizhou Sui and can be used for individual identification. Furthermore, the population genetic landscape and demographic history reconstructed from genome-wide SNP data showed that the Sui people were genetically similar to other Tai-Kadai groups. We found that Tai-Kadai populations had a genetic affinity with Hmong-Mien, Austroasiatic Kinh, Muang, and modern and ancient Austronesian-related populations in southern China, suggesting that interactions among Tai-Kadai/Austronesian-, Hmong-Mien-, and Austroasiatic-speaking populations in southern China during the past 2,000 years played a crucial part in the formation of modern Tai-Kadai speakers. We found a possible sex-biased admixture in the formation of the Guizhou Sui people, with the dominant paternal Y-chromosomal lineage most likely deriving from Hmong-Mien-speaking people. In addition, the southward expansion of millet farmers from the Yellow River Basin has shaped the gene pool of southern populations, including Tai-Kadai-speaking people.

Data Availability Statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Ethics Statement

The studies involving human participants were reviewed and approved by the Ethical Committee on Clinical Trials of Drugs of the Affiliated Hospital of Guizhou Medical University (2020 Ethics Approval No. 85). The patients/participants provided their written informed consent to participate in this study.

Author Contributions

C-CW and JH designed this study. MY, ZR, HaZ, QW, YL, HoZ, JJ, and JC collected the samples and conducted the experiment. XY wrote the manuscript. XY, GH, and JG analyzed the results. C-CW modified the manuscript. All authors reviewed the manuscript.


Funding

This study was supported by the Guizhou Scientific Support Project, Qian Science Support (2021) General 448; Guizhou Province Education Department, Characteristic Region Project, Qian Education KY No. (2021)065; Guizhou “Hundred” High-Level Innovative Talent Project, Qian Science Platform Talents (2020)6012; Guizhou Scientific Support Project, Qian Science Support (2020) 4Y057; Guizhou Science Project, Qian Science Foundation (2020) 1Y353; Guizhou Scientific Support Project, Qian Science Support (2019) 2825; Guizhou Scientific Cultivation Project, Qian Science Platform Talent (2018) 5779-X; Guizhou Engineering Technology Research Center Project, Qian High-Tech of Development and Reform Commission No. (2016) 1345; the Major Project of National Social Science Foundation of China granted to C-CW (21&ZD285) and (20&ZD248); the National Natural Science Foundation of China (31801040); the “Double First Class University Plan” key construction project of Xiamen University (the origin and evolution of East Asian populations and the spread of Chinese civilization, 0310/X2106027); Major Project of Marxist Theoretical Research and Construction Project (2021MZD014); European Research Council (ERC) grant (ERC-2019-ADG-883700-TRAM); and Nanqiang Outstanding Young Talents Program of Xiamen University (X2123302).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.


Acknowledgments

S. Fang and Z. Xu from the Information and Network Center of Xiamen University are acknowledged for their help with high-performance computing.

Supplementary Material

The Supplementary Material for this article can be found online at:

Supplementary Figure 1 | (A) Geographical distribution of the Guizhou Sui in this study. (B) The linkage disequilibrium patterns among the 30 InDel loci of the Guizhou Sui population. (C,D) Allele frequencies and corresponding forensic parameters of the 30 InDels in the Guizhou Sui population.

Supplementary Figure 2 | The heat map on the basis of the deletion allelic frequency distributions for Guizhou Sui and 99 worldwide reference populations.

Supplementary Figure 3 | The heat map of pairwise Nei’s genetic distances among Guizhou Sui and 99 worldwide reference populations.

Supplementary Figure 4 | The heat map of pairwise DA genetic distances among Guizhou Sui and 99 worldwide reference populations.

Supplementary Figure 5 | The results of ancestry component analysis on the 30 InDels with k ranging from 2 to 8 of the Guizhou Sui and 51 worldwide reference populations by STRUCTURE.

Supplementary Figure 6 | The phylogenetic relationships between the Guizhou Sui population and modern Asian populations based on Treemix.

Supplementary Figure 7 | The genetic affinity between Guizhou Sui and ancient populations in Asia. Highly significant Z scores were labeled “++” (Z > 6) or “– –” (Z < −6), and Z scores with 3 < Z < 6 or −6 < Z < −3 were labeled “+” or “–,” respectively.


