Study area and research period

Two villages in China, Hounan (in Dabu County, Meizhou City) and Zhoutian (in Huiyang District, Huizhou City), located in different regions but under the same homologous cultural system, were selected for spatial analysis and comparison.

The Hakka migration in Guangdong Province went through two major phases. During the Song and Yuan Dynasties, the Hakka people lived mainly in southern Jiangxi, western Fujian, and northeastern Guangdong. The first major population flow occurred during 1360-1530, in the Ming Dynasty. Driven by war, famine, and banditry, the Hakka population at the junction of these three provinces moved back and forth frequently. The second major population flow occurred around 1680-1860. Attracted by the government’s immigration and land reclamation policies, large numbers of Hakka people in northeastern and northern Guangdong moved to the hilly areas near the Pearl River Delta [60, 61].

In terms of the Hakka migration process in Guangdong Province from the Ming to the Qing Dynasty (see Fig. 2), Hounan Village lies in the upstream section of the migration route, in a small basin surrounded by mountains on the southern bank of the Meitan River. The Yang clan settled here in the 1500s, and the village has developed over some 500 years. At present, the village covers an area of 7.9 km² and has a total population of about 5400. The Meitan River carries heavy traffic of people and goods. Downstream, it leads east to Fujian Province and to the Chaoshan area on the coast of Guangdong Province. Historically, residents made their living mainly by growing, harvesting, and selling tobacco. At present, the village has a clustered spatial pattern.

Fig. 2. Hakka migration routes in Guangdong Province and the locations of Hounan and Zhoutian

Zhoutian Village is located in a valley in the hilly area at the northeastern edge of the Pearl River Delta, in the lower reaches of the migration route. The Ye clan settled here in the 1660s. With a history of more than 350 years, the village covers an area of 17.8 km² and has a total population of more than 4800. Historically, its residents made their living mainly by growing rice. Currently, the village has a dispersed spatial pattern (see Fig. 3).

Fig. 3. Satellite images of Hounan and Zhoutian in 2021 (Google Earth satellite imagery at level 19, obtained from the map provider TuxinGIS)

Hounan Village has a concentrated layout, while Zhoutian Village has a scattered layout. These two villages were chosen as research samples for the following reasons. First, both villages have been on the list of “Chinese Traditional Villages” since 2013 because of their high historical, contextual, and architectural significance. Second, the two villages share the same Hakka cultural system in Guangdong Province, which avoids the interference of cultural attributes in a comparative study of different spatial forms. Third, with regard to the feasibility of model calibration, both villages are large enough to provide the number of architectural samples that this study requires.

Due to the loss of rural population caused by unbalanced regional development after the Chinese reform and opening up, these two villages did not have significant spatial expansion after the 1970s. Therefore, the research period of this study focuses on the spatial process from the initial stage of village foundation to the middle of the twentieth century.

Data processing

First, for both villages, the patches containing buildings, rivers, canals, ponds, and roads were extracted from the 2021 satellite image supplied by the map provider TuxinGIS, which has a resolution of 5 m, and mapped onto a spatial lattice. We compared the current satellite images of the selected villages with the earliest geodetic survey maps that could be found (around 1930; see Fig. 4). Although the measurement accuracy of the historical maps is limited, the comparison shows that the natural environment of each village has not changed and that the main infrastructure, such as natural channels (adapted to the terrain), main roads, and ponds, shows no obvious change. The number of branch roads, however, has increased: branch roads tend to multiply as buildings proliferate. Other historical changes in the infrastructure are undocumented. Nevertheless, based on the comparison of the village images from the two periods, we can roughly estimate that the changes in infrastructure across periods were not drastic. Because of the limitations of the historical data, and in order to reduce the complexity of the spatial simulation, this study ignored small changes in the main infrastructure of the villages over the historical period. In addition, branch roads, a less influential and unstable factor, were not taken into account in the analysis of spatial driving factors.

Fig. 4. Hounan and Zhoutian Village, as recorded in historic geodetic survey maps, circa 1900 (from the Atlas of mainland China at a scale of 1:25,000, produced by the Japanese Land Survey Bureau, published by Academy of Science in Tokyo, 1990)

The distances between the mapped ground objects and the cells in the spatial lattice were calculated using ArcMap 10.7, a tool used primarily to view, edit, create, and analyze geospatial data (see the raster data in columns 4–7 in Fig. 5). Second, we obtained rasterized elevation data from the 5-m resolution Digital Elevation Model (DEM) generated by TuxinGIS and derived the slope information from it (see the raster data in columns 2–3 in Fig. 5). Then, based on the building construction dates recorded in the village annals, the multi-temporal spatial patterns of the two villages were reconstructed (Fig. 6). Finally, the attribute data at the building locations were extracted. The distances between buildings and the arable land areas were calculated with SCILAB programs. Together, these data form the basic data for model training.
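The two raster derivations described above can be illustrated with a minimal sketch. It is not the ArcMap 10.7 procedure used in the study: the function names `distance_raster` and `slope_degrees` are hypothetical, the distance is found by brute force, and the slope uses simple central differences rather than whatever kernel ArcMap applies. It only assumes a lattice stored as nested lists and a known cell size in metres.

```python
import math


def distance_raster(feature_mask, cell_size):
    """Distance (in metres) from every cell to the nearest feature cell.

    feature_mask[r][c] is truthy where a mapped object (road, river, ...) lies.
    """
    rows, cols = len(feature_mask), len(feature_mask[0])
    features = [(r, c) for r in range(rows) for c in range(cols) if feature_mask[r][c]]
    if not features:
        raise ValueError("feature_mask contains no feature cells")
    return [
        [cell_size * min(math.hypot(r - fr, c - fc) for fr, fc in features)
         for c in range(cols)]
        for r in range(rows)
    ]


def slope_degrees(dem, cell_size):
    """Slope per cell in degrees, from central differences of the elevation grid.

    Border cells fall back to one-sided differences (an assumption of this sketch).
    """
    rows, cols = len(dem), len(dem[0])
    slopes = []
    for r in range(rows):
        row = []
        for c in range(cols):
            r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
            c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
            dz_dx = (dem[r][c1] - dem[r][c0]) / ((c1 - c0) * cell_size) if c1 > c0 else 0.0
            dz_dy = (dem[r1][c] - dem[r0][c]) / ((r1 - r0) * cell_size) if r1 > r0 else 0.0
            row.append(math.degrees(math.atan(math.hypot(dz_dx, dz_dy))))
        slopes.append(row)
    return slopes
```

The brute-force search is quadratic in the number of cells, which is acceptable for illustration but would be replaced by a distance transform on a full village lattice.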

Fig. 5. Analysis of spatial data based on ArcMap 10.7

Fig. 6. Multi-temporal spatial patterns of Hounan and Zhoutian

Simulation process based on CA

The cell size used in the simulation should not exceed the footprint of the smallest house. In addition, considering data accuracy and model efficiency for villages with different areas and building scales, a two-dimensional lattice with 5-m resolution for Hounan and a lattice with 10-m resolution for Zhoutian were used to construct the CA space. The cell states were main road, river, canal, pond, building, and land yet to be developed. Among them, the states representing transportation and water sources were fixed, and only one state transition was allowed: from land yet to be developed to building.
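The state space just described is small enough to write down directly. The sketch below is illustrative only; the names `CellState`, `FIXED_STATES`, and `develop` are hypothetical and do not come from the study's implementation.

```python
from enum import Enum


class CellState(Enum):
    MAIN_ROAD = "main road"
    RIVER = "river"
    CANAL = "canal"
    POND = "pond"
    BUILDING = "building"
    UNDEVELOPED = "land yet to be developed"


# Transportation and water states never change during the simulation.
FIXED_STATES = frozenset(
    {CellState.MAIN_ROAD, CellState.RIVER, CellState.CANAL, CellState.POND}
)


def develop(lattice, row, col):
    """Apply the single allowed transition (undeveloped -> building).

    Returns True if the cell changed, False if the transition is not allowed.
    """
    if lattice[row][col] is CellState.UNDEVELOPED:
        lattice[row][col] = CellState.BUILDING
        return True
    return False
```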

Figure 7 shows the CA simulation process created in reference to the common logic framework of “outlying + edge-expansion”. In the site selection stage of the simulation process, buildings generated in the outlying expansion mode are marked as nodal buildings, while the ones generated in the edge-expansion mode are marked as ordinary buildings.

Fig. 7. The CA simulation process

Selection of spatial constraints

Owing to the inhomogeneity of geographical space, CA models usually work with constraints when applied to geographical spatial simulation [62]. The constraints could be reflected in the cell attributes, which can be expressed by Boolean attribute variables (using 1 or 0 to mark whether the cell has some attributes or not) and floating-point attribute variables (using decimals to mark the cell’s value of certain attributes). All the attribute values will exert a comprehensive influence on the cellular state transformation.

Too many constraints reduce the working efficiency of the model [47], whereas too few lower the accuracy of the simulation. It is therefore important to balance explanatory power against a reasonable number of cell attributes. To ensure sufficient explanatory power, we first analyzed the historical, social, and geographical contexts of the region, which guided the initial selection and exclusion of attributes. For further verification and dimensionality reduction of the attributes, the univariate Gaussian mixture model (UniGMM) was introduced to analyze the probability density distribution of building-cell samples for each attribute:

$$\text{UniGMMEval}(x) = \sum_{j=1}^{k} \omega_{j} N(x \mid \mu_{j}, \sigma_{j}^{2}), \quad \sum_{j=1}^{k} \omega_{j} = 1$$

where there are k components. Each component is a Gaussian distribution parameterized by \(\mu_{j}\) and \(\sigma_{j}^{2}\), with \(\omega_{j}\) as the weight of component j.
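As a worked illustration of the mixture density defined above, the following minimal sketch evaluates a univariate GMM at a single attribute value. The function names `normal_pdf` and `uni_gmm_eval` are hypothetical, and the parameters are assumed to have been fitted elsewhere; fitting itself is not shown.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of a univariate Gaussian with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def uni_gmm_eval(x, weights, means, variances):
    """Weighted sum of k Gaussian densities; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
```

For example, `uni_gmm_eval(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])` evaluates a symmetric two-component mixture at its midpoint.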

The peaks of the GMM curve indicate the feature value of the building-cell attributes, which can also be considered as the spatial preference for village growth. The dispersion degree of the attribute value distribution is expressed as γ, which indicates the sensitivity of a settlement distribution to each attribute:

$$\gamma = \frac{\sum_{i=1}^{k} \omega_{i} \sigma_{i}}{x_{\max} - x_{\min}}, \quad \sum_{i=1}^{k} \omega_{i} = 1$$

where \(\sum_{i=1}^{k} \omega_{i} \sigma_{i}\) is the weighted standard deviation of the attribute values of building cells, and \(x_{\max} - x_{\min}\) is the range of those attribute values. The lower γ is, the more concentrated the attribute values of building cells are, which indicates that the attribute has a stronger impact.
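The dispersion measure γ is a one-line computation once the mixture has been fitted. The sketch below uses a hypothetical function name, `dispersion_gamma`, and takes the component standard deviations (not variances) as input.

```python
def dispersion_gamma(weights, sigmas, x_min, x_max):
    """Weighted component standard deviation divided by the attribute range."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    if x_max <= x_min:
        raise ValueError("x_max must be greater than x_min")
    return sum(w * s for w, s in zip(weights, sigmas)) / (x_max - x_min)
```

With made-up numbers, weights (0.5, 0.5), standard deviations (2, 4), and attribute values spanning 0 to 30 give γ = 3 / 30 = 0.1.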

Calculation of the transition probability

Spatial state transition involves three cases: the growth of single building area, the site selection for new buildings in edge-expansion mode, and the counterpart in outlying mode. Accordingly, the state transition probability of candidate cells was also calculated in all three cases.

For the growth of an existing building area, the neighborhood function was taken into account, whereas the attributes concerning building intervals were not. The transition probability was calculated as follows:

$$\text{Blocks}_{\text{prob}}(i) = \text{GMMEval}_{\text{ATTRI}}(i) \times \text{NeighborEval}(i)$$


In the case of site selection for new buildings in edge-expansion mode, the transition probability was calculated as follows:

$$\text{Blocks}_{\text{prob}}(i) = \text{GMMEval}_{\text{ATTRI}}(i) \times \text{GMMEval}_{\text{Ordi}}(i)$$


In the case of site selection for new buildings in outlying mode, the transition probability was calculated as follows:

$$\text{Blocks}_{\text{prob}}(i) = \text{GMMEval}_{\text{ATTRI}}(i) \times \text{GMMEval}_{\text{Nod}}(i)$$


In the above three formulas, Blocks_prob(i) is the transition probability of candidate cell i, and GMMEval_ATTRI(i) is the contribution of environmental attributes (elevation, slope, the shortest distance to traffic facilities, water sources, etc.) to the transition probability. GMMEval_Ordi(i) and GMMEval_Nod(i) are the contributions of social attributes (the average distance to the two nearest pre-existing nodal buildings, the distance to the nearest pre-existing ordinary building, etc.) to the transition probability of the candidate cell in edge-expansion mode and in outlying expansion mode, respectively. Finally, NeighborEval(i) is the contribution of the neighboring cell states to the transition probability of the central cell.
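The three cases share one structure: an environmental term multiplied by a case-specific second term. A minimal sketch of that dispatch, assuming the individual contributions have already been evaluated for cell i, could look as follows. The names `TransitionCase` and `blocks_prob` and the keyword arguments are hypothetical.

```python
from enum import Enum


class TransitionCase(Enum):
    AREA_GROWTH = "growth of a single building area"
    EDGE_EXPANSION = "new building, edge-expansion mode"
    OUTLYING = "new building, outlying mode"


def blocks_prob(case, gmm_attri, neighbor_eval=None, gmm_ordi=None, gmm_nod=None):
    """Transition probability of a candidate cell for one of the three cases.

    gmm_attri is the environmental contribution; the second factor depends on
    the case: neighborhood effect, ordinary-building term, or nodal-building term.
    """
    second = {
        TransitionCase.AREA_GROWTH: neighbor_eval,
        TransitionCase.EDGE_EXPANSION: gmm_ordi,
        TransitionCase.OUTLYING: gmm_nod,
    }[case]
    if second is None:
        raise ValueError(f"missing second factor for case: {case.value}")
    return gmm_attri * second
```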

The above three formulas touch upon the issue of inferring the transition probability from the cell attribute value. Compared to cities, villages are on a smaller scale, and the data available for machine learning is lacking. In other words, the spatial data of a village for model calibration are not “large” enough. In order to obtain better simulation results with smaller data, to effectively circumvent overfitting [63], and to articulate the complex comprehensive effects of multiple factors, GMMEval_ATTRI(i), GMMEval_Ordi(i), and GMMEval_Nod(i) are calculated through the multivariable Gaussian mixture model (Multi-GMM) as follows:

$$\text{MultiGMMEval}(\vec{X}) = \sum_{j=1}^{k} \omega_{j} N(\vec{X} \mid \vec{\mu}_{j}, \vec{\Sigma}_{j}), \quad \sum_{j=1}^{k} \omega_{j} = 1$$

A Multi-GMM model is composed of \(k\) multivariate Gaussian distributions, where each distribution is considered as a component \(j\) with the mixture weight \(\omega_{j}\) and mean vector \(\vec{\mu}_{j}\); \(\vec{\Sigma}_{j}\) is the d × d covariance matrix of component \(j\), which captures the correlation between different variables; and \(\vec{X}\) is the vector formed by the multivariate attributes of a cell.
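To make the role of the covariance matrix concrete, the sketch below evaluates a Multi-GMM for the special case d = 2, where the inverse and determinant of the covariance matrix can be written out by hand. The restriction to two attributes, and the names `bivariate_normal_pdf` and `multi_gmm_eval`, are assumptions of this illustration; the study's model uses a general d-dimensional attribute vector.

```python
import math


def bivariate_normal_pdf(x, mu, cov):
    """Density of a 2-D Gaussian; cov is a 2 x 2 nested list [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu), using cov^{-1} = [[d, -b], [-c, a]] / det.
    q = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def multi_gmm_eval(x, weights, means, covariances):
    """Weighted sum of k bivariate Gaussian densities; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * bivariate_normal_pdf(x, m, s)
        for w, m, s in zip(weights, means, covariances)
    )
```

Non-zero off-diagonal entries of `cov` are what let a component express that two attributes, for example elevation and slope, vary together.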

The neighborhood of the CA model was defined as the Moore neighborhood (8 cells). Supposing that the neighborhood effect on the central cell i (represented by NeighborEval(i)) was related to its adjacent cellular number (represented by \(NEI_{i}\)), the neighborhood effect could be defined as follows:



Judgement on the area growth and relevant parameters

In the CA model of this study, the growth of a building area was determined by the current size of the building and the area covered by the surrounding constructions. We set two prerequisites, and a building could continue to grow only if both were satisfied. The first prerequisite reflects the importance of building size: the reciprocal of the building area was compared with a random number between 0 and 1. If the former was smaller than the latter, the building stopped growing; otherwise, it continued to grow. The second prerequisite simulates the influence of the amount of surrounding construction: we set a module scale for area control, namely the scale of an economic unit (a circular region centered on the centroid of the building, with radius R), together with an area threshold (AEx) for the construction within the module. If the current construction area around the building within the control module was smaller than the threshold, the building was marked to grow; if not, it stopped growing. To simulate the uncertainty of the real world, random perturbations were applied to the threshold.
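The two prerequisites can be sketched as a single predicate. Several details below are assumptions made for illustration and are not stated in the text: the building area is taken in units where the smallest building has area 1, the random perturbation is a uniform multiplicative factor controlled by a hypothetical `perturbation` parameter, and constructed cells are given as coordinates in metres. The function name `may_grow` is also hypothetical.

```python
import math
import random


def may_grow(building_area, centroid, built_cells, cell_area,
             radius, area_threshold, rng, perturbation=0.1):
    """Return True if a building passes both growth prerequisites.

    building_area  : current area of the building (assumed >= 1)
    centroid       : (x, y) of the building centroid
    built_cells    : iterable of (x, y) centres of constructed cells
    radius         : radius R of the economic unit
    area_threshold : AEx, the construction-area threshold within the unit
    rng            : a random.Random instance
    """
    # Prerequisite 1: larger buildings are less likely to keep growing.
    if 1.0 / building_area < rng.random():
        return False
    # Prerequisite 2: construction area inside the economic unit.
    cx, cy = centroid
    built_area = cell_area * sum(
        1 for x, y in built_cells if math.hypot(x - cx, y - cy) <= radius
    )
    # Assumed form of the perturbation: threshold scaled by a uniform factor.
    threshold = area_threshold * (1.0 + rng.uniform(-perturbation, perturbation))
    return built_area < threshold
```

Under the first prerequisite, a building of area A passes with probability 1/A, so growth slows as a building becomes larger.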

Judgement on the expansion mode and relevant parameters

New building sites were selected in one of two modes: edge-expansion or outlying. The feature that separates edge-expansion (ordinary building) from outlying expansion (nodal building) is the shortest distance from the new building to the pre-existing buildings. Therefore, we distinguished the two modes by a distance threshold (DThres). Furthermore, we found that location choice in the outlying mode occurred approximately periodically, and that the number of spatial nodes was higher in the early stage of village development than in the late stage. To control spatial expansion in different developmental stages, the occurrence period of outlying expansion was set as TInter, and the occurrence probabilities of outlying site selection in different periods were set as Ppro and Pana.
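A minimal sketch of how these parameters could interact is given below. The text does not specify the exact scheduling rule, so the following are assumptions of this illustration: a site farther than DThres from all existing buildings counts as outlying, outlying selection is only attempted at steps that are multiples of TInter, Ppro applies to the early stage and Pana to the late stage, and the boundary between stages is a hypothetical `stage_switch_step`. All function names are hypothetical.

```python
import random

EDGE = "edge-expansion"
OUTLYING = "outlying"


def classify_by_distance(shortest_distance, d_thres):
    """Label a new building by its shortest distance to pre-existing buildings."""
    return OUTLYING if shortest_distance > d_thres else EDGE


def choose_mode(step, t_inter, p_pro, p_ana, stage_switch_step, rng):
    """Pick the site-selection mode for one simulation step.

    Outlying selection is attempted once every t_inter steps, with probability
    p_pro before stage_switch_step and p_ana afterwards (assumed mapping).
    """
    if step % t_inter != 0:
        return EDGE
    probability = p_pro if step < stage_switch_step else p_ana
    return OUTLYING if rng.random() < probability else EDGE
```

Setting `p_pro` higher than `p_ana` reproduces the observation that more spatial nodes appear early in village development.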

Model calibration

Rural building patches are scattered, low in density, and small in scale. Therefore, compared with urban CA models, the simulation results of village models rarely coincide exactly with the real situation [64]. What matters is that the locations are similar and that the overall location patterns resemble each other in relevant ways [65]. CA models should be assessed on the basis of plausibility rather than on one-to-one correspondence or correlation measures [66]. We consider a simulation valid if the geometric distance deviation between the simulation results and the real situation is within an acceptable range. Hence, the Matching Index with Tolerance [67] (denoted as IT) was adopted for model assessment. The acceptable deviation is theoretically related to the scale of the economic unit. Accordingly, we delimited the acceptable deviation range as the union of circular regions centered on the centroids of the real building patches, with the radius of the economic unit (R) as the radius. Simulated cells falling within the acceptable deviation range were considered effective (see Fig. 8). The percentage of effective cells was taken as the simulation accuracy.
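The accuracy measure described above can be sketched in a few lines. This is an illustration of the idea rather than the exact index defined in [67]: it assumes that the denominator is the total number of simulated building cells and that a cell exactly at distance R counts as effective. The function name `matching_index_with_tolerance` is hypothetical.

```python
import math


def matching_index_with_tolerance(simulated_cells, real_centroids, radius):
    """Share of simulated cells within `radius` of any real building centroid.

    simulated_cells : list of (x, y) centres of simulated building cells
    real_centroids  : list of (x, y) centroids of real building patches
    radius          : radius R of the economic unit
    """
    if not simulated_cells:
        raise ValueError("no simulated cells to evaluate")
    effective = sum(
        1
        for sx, sy in simulated_cells
        if any(math.hypot(sx - cx, sy - cy) <= radius for cx, cy in real_centroids)
    )
    return effective / len(simulated_cells)
```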

Fig. 8. Determination of effective simulation

Model comparison

Using the same simulation logic made it easier to compare the models of clustered and dispersed villages and to find the differences and similarities between them. The models were compared in terms of cellular attribute variables and model parameters.

Regarding the cellular attribute variables, we were able to obtain two pieces of information. One was the sensitivity of the settlement distribution to each attribute, which was used for the validation of the constraints; the other was the feature value of each attribute variable that indicated the settlement distribution tendency.

The important model parameters were preset through context analysis and finalized through trial and error. These parameters include the threshold that distinguishes outlying expansion from edge-expansion (DThres), the radius of an economic unit (R), the area threshold within an economic unit (AEx), the occurrence period of outlying expansion (TInter), and the occurrence probabilities of outlying site selection (Ppro and Pana).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

