# Generalization enhancement of artificial neural network for turbulence closure by feature selection – Advances in Aerodynamics

Jan 13, 2022

### 2.1 Workflow

The workflow includes two parts: neural network training and coupling of the N-S equation solver with the constructed model. The first part aims to obtain a neural network model, of which the performance should be satisfying and convergent. The word “convergent” means that the loss function is hard to be lower. This part mainly consists of three aspects: data acquisition, data preprocessing and model training. Model performance is improved mainly by adjusting model parameters and hyperparameters during the training process. If the performance is not adequate, the input features and training samples can be further optimized in data preprocessing. The target of the second part is to substitute the conventional RANS model with the constructed surrogate model and obtain the results that agree well with those calculated by S-A model. The surrogate model acts as a turbulence solver, as shown in the dot box below. The entire process is shown in Fig. 1, and some details will be addressed later.

### 2.2 Modeling strategy

According to the dimensional analysis, the kinetic eddy viscosity can be expressed as the product of mixing velocity umix and mixing length lmix.

$${upsilon}_t={u}_{mix}{l}_{mix}$$

(1)

The “mixing length” derives from the molecular mean free path in gas dynamics, which was proposed by Prandtl [30]. The algebraic model with zero equation can express turbulent eddy viscosity explicitly in the form of Eq. (1). Generally, two length scales are used to characterize the turbulence from wall to the outer edge of the boundary layer, namely the wall friction scale δυ and the boundary layer thickness δ. The δυ and δ are corresponding to the inner and outer layers, respectively, where the boundary is around the outer edge of the logarithmic layer. The mixing length of the inner layer is related to the distance to the wall. There are many researches on the scale analysis of the inner layer, and the universality is good in many cases. However, the outer flow variables are only slightly influenced by the wall, and are more disorderly and irregular [31]. Therefore, the scale analysis of the outer layer is more challenging and of less universality. More information about the mixing length is detailed in Granville’s work [32]. The mixing length adopted in this paper is based on the Prandtl-van Driest formulation [33] and limited by the boundary layer thickness of flat plane,

$${l}_{mix}=min left(kappa dleft(1-{e}^{-frac{mathrm{d}}{d_{y^{+}approx 1}{A}^{+}}}right),0.4{delta}_{flat}right)$$

(2)

where A+ = 26 and κ = 0.4 in this work, ( {mathrm{d}}_{y^{+}approx 1} ) denotes the estimated wall friction scale, calculated by the function at the following address: https://www.cfd-online.com/Tools/yplus.php. Because it is found that the real wall friction scale leads to worse convergence during the iteration process. Referring to Clauser’s constant eddy viscosity hypothesis [34], the mixing length of the outer layer is the same order with the boundary layer thickness. For convenience, the boundary layer thickness here is approximate to that of the flat plane, since the curvature of the airfoil is small. It should be noted that Eq. (2) is effective qualitatively rather than quantitatively, which is a compromise to the lack of robustness of the constructed neural network model. Once the mixing length is determined, the mixing velocity can be deduced according to Eq. (1). The constructed model in this paper is a nonlinear mapping between mean flow variables and the mixing velocity. Since the modeling strategy includes the distance to the wall, which can be less effective in separated flows, the flow cases in this work are all attached flows.

### 2.3 Artificial neural network

The term “neural network” comes from neurology and is often referred to as “artificial neural network” in machine learning. Neuron is the basic unit of a neural network. The number of neurons in each layer is called width and the number of layers is called depth, which are two hyperparameters of a neural network. For a fully-connected neural network, only the neurons in adjacent layers are connected to each other, and the output of the neurons in the previous layer is the input of the neurons in the current layer. The neural network used in this work is shown in Fig. 2.

The first layer is the input layer while the last layer is the output layer, and the rest is the hidden layers. The parameters of the neuron are called weight and bias. Each neuron contains two operations. The first one is the linear addition between the product of input vector X and weight vector W and bias b,

$$z=mathbf{WX}+b$$

(3)

and the second one is the nonlinear activation operation.

The common activation functions f(·) are Tanh, Sigmoid, Relu, LeakRelu and so on. In this work, Tanh is adopted, see Fig. 3. Since the grid search method is time-consuming, the structure is determined by a trial-and-error approach. Firstly, we fixed the width and tested the effect of depth on the model performance. Then, we fixed the optimal depth and tested the effect of width on the model performance. It was found that the structure (20, 20, 20, 20, 20, 1) is enough for reaching the optimal solution. More neurons do not significantly improve model performances but increase computation time.

The error back propagation (BP) algorithm based on gradient descent is used to train the neural network. The gradient is calculated by the chain rule to update the parameter values of the neurons. Because the gradient descent algorithm is prone to stop at the local minimum, the global optimum is generally approximated by several training with different initial values. More information about neural networks can be found in many works [35,36,37].

According to the universal approximation theory, neural networks with enough neurons can approximate any measurable functions [38]. The complexity of the neural network can be adjusted by changing the depth and width. The strength of mapping complex nonlinear functions makes neural networks popular with researchers. It has already been used in turbulence computation [10, 39].

### 2.4 Feature selection

We do not intend to delve into feature construction that depends on specific problems or researchers’ physical view, but rather try to select some features that are helpful to model performance from existing candidate features. Although we assume that the neural network itself can play a role in feature selection more or less in the training process, the important role played by artificial feature selection is irreplaceable according to our own experience.

The proposed feature selection method includes two parts. The first one is measuring the numerical space of the features and kicking off those with obvious extrapolation according to the mean extrapolation distance (MED). The second one is ranking the feature and determining the feature subset according to the model performance. The whole process is shown in Fig. 4 and to be more specifically, it includes the following steps.

• Step 1: Get the dataset ( mathtt{D}=left{left({mathbf{Q}}_{mathtt{D}},{mathbf{y}}_{mathtt{D}}right),{mathbf{Q}}_{mathtt{D}}in {mathrm{mathbb{R}}}^{l_{mathtt{D}}times m},{mathbf{y}}_{mathbf{D}}in {mathrm{mathbb{R}}}^{l_{mathtt{D}}times 1}right} ) and ( mathtt{T}=left{left({mathbf{Q}}_{mathtt{T}},{mathbf{y}}_{mathtt{T}}right),{mathbf{Q}}_{mathtt{T}}in {mathrm{mathbb{R}}}^{l_{mathtt{T}}times m},{mathbf{y}}_{mathtt{T}}in {mathrm{mathbb{R}}}^{l_{mathtt{T}}times 1}right} ), where Q denotes the feature vectors, y the labels, m the number of features, ( {l}_{mathtt{D}} ) and ( {l}_{mathtt{T}} ) the sample number of ( mathtt{D} ) and ( mathtt{T} ), respectively. Normalize each feature and the output of ( mathtt{D} ) to [−1, 1], and normalize the dataset ( mathtt{T} ) according to the corresponding minimum and maximum values from ( mathtt{D} ). Mark the space domain and the domain boundary formed by each feature and the output of ( mathtt{D} ) as Ωi and Ωi, respectively, where i = 1, 2…, m. Put the sample of ( mathtt{T} ) in corresponding domain according to the feature and regard those samples outside the domain boundary as extrapolation samples. Calculate the MED of each feature by Eq. (5) and remove the features with large MED values. Denote the number of the remaining features as n. Extract the remaining features and the output from ( mathtt{D} ) and ( mathtt{T} ), and reshape the datasets as ( overline{mathtt{D}}=left{left({mathbf{Q}}_{overline{mathtt{D}}},{mathbf{y}}_{mathtt{D}}right),{mathbf{Q}}_{overline{mathtt{D}}}in {mathrm{mathbb{R}}}^{l_{mathtt{D}}times n},{mathbf{y}}_{mathbf{D}}in {mathrm{mathbb{R}}}^{l_{mathtt{D}}times 1}right} ) and ( overline{mathtt{T}}=left{left({mathbf{Q}}_{overline{mathtt{T}}},{mathbf{y}}_{mathtt{T}}right),{mathbf{Q}}_{mathtt{T}}in {mathrm{mathbb{R}}}^{l_{mathtt{T}}times n},{mathbf{y}}_{mathtt{T}}in {mathrm{mathbb{R}}}^{l_{mathtt{T}}times 1}right} ), respectively.

$$mathrm{MED}=frac{sum limits_{i=1}^{N_e}{d}_i}{N_e}$$

(5)

where Ne is the number of the extrapolation samples.

• Step 2: Divide the dataset ( overline{mathtt{D}} ) into training set ( {mathtt{D}}_{mathtt{train}} ) and validation set ( {mathtt{D}}_{mathtt{valid}} ) randomly, where ( {mathtt{D}}_{mathtt{train}}=left{left({mathbf{q}}_{mathbf{1}},{mathbf{q}}_{mathbf{2}},dots, {mathbf{q}}_{mathbf{n}},{mathbf{y}}_{mathbf{train}}right),{mathbf{q}}_iin {mathrm{mathbb{R}}}^{l_{mathtt{train}}times 1},{mathbf{y}}_{mathbf{train}}in {mathrm{mathbb{R}}}^{l_{mathtt{train}}times 1},i=1,2,dots, nright} ). Construct a neural network M according to ( {mathtt{D}}_{mathtt{train}} ) and ( {mathtt{D}}_{mathtt{valid}} ). Create a new dataset ( overset{sim }{mathtt{T}}=left{left(overset{sim }{{mathbf{q}}_{mathbf{1}}},overset{sim }{{mathbf{q}}_{mathbf{2}}},dots, overset{sim }{{mathbf{q}}_{mathbf{n}}},{mathbf{y}}_{mathtt{test}}right),overset{sim }{{mathbf{q}}_i}in {mathrm{mathbb{R}}}^{l_{mathtt{train}}times 1},{mathbf{y}}_{mathtt{test}}in {mathrm{mathbb{R}}}^{l_{mathtt{train}}times 1},i=1,2,dots, nright} ) by selecting ( {l}_{mathtt{train}} ) samples from ( overline{mathtt{T}} ). Replace the feature of ( {mathtt{D}}_{mathtt{train}} ) by the same feature of ( overset{sim }{mathtt{T}} ) one by one and combine the output of ( overset{sim }{mathtt{T}} ), thus, we can get n datasets ( {mathtt{Q}}_{mathtt{i}} ), where i = 1, 2…, n.

$${mathtt{Q}}_{mathtt{i}}=Big{left({mathbf{q}}_{mathbf{1}},{mathbf{q}}_{mathbf{2}},dots, overset{sim }{{mathbf{q}}_{mathtt{i}}},dots, {mathbf{q}}_{mathbf{n}},{mathbf{y}}_{mathtt{test}}right)$$

Test the model by datasets ( {mathtt{Q}}_{mathtt{i}} ) and get the test errors. Rank the features according to the test errors. The smaller the error, the more important the feature and the higher the rank.

$${displaystyle begin{array}{l}kern1.25em mathrm{Feature} mathrm{Importance}:1,kern0.75em 2,dots, kern1.5em n\ {}kern2.75em {mathtt{D}}_{mathtt{train}}=left{{mathbf{q}}_{mathbf{1}},{mathbf{q}}_{mathbf{2}},dots, {mathbf{q}}_{mathbf{n}},{mathbf{y}}_{mathbf{train}}right}\ {}kern2.75em overset{sim }{mathtt{T}}=left{overset{sim }{{mathbf{q}}_{mathbf{1}}},overset{sim }{{mathbf{q}}_{mathbf{2}}},dots, overset{sim }{{mathbf{q}}_{mathbf{n}}},{mathbf{y}}_{mathtt{test}}right}end{array}}$$

Create the training set ( overset{sim }{{mathtt{D}}_{mathtt{train}}} ) and validation set ( overset{sim }{{mathtt{D}}_{mathtt{valid}}} ) by selecting the first j features from ( {mathtt{D}}_{mathtt{train}} ) and ( overset{sim }{mathtt{T}} ), respectively. Then, construct a neural network according to ( overset{sim }{{mathtt{D}}_{mathtt{train}}} ) and ( overset{sim }{{mathtt{D}}_{mathtt{valid}}} ), and record the validation error Errj on ( overset{sim }{{mathtt{D}}_{mathtt{valid}}} ). The feature number j increases from min _ Feature _ NO set by the user to n one by one. Finally, select the feature subset with the minimal validation error.

In this paper, datasets ( mathtt{D} ) and ( mathtt{T} ) are composed of 2525 and 5437 samples, respectively. These samples are obtained from different flow cases (see Table 1) using the sample selection algorithm described below. The flow cases are sampled through the Latin hypercube sampling method (LHS), and the sampling space is Ma [0.1, 0.5], Re [2 × 106, 8 × 106], α [0°, 5°]. The total candidate features before feature selection are shown in Table 2, where p denotes the pressure, x and y denote the streamline and normal direction of the freestream respectively and the subscripts denote the partial derivative in the direction. The sign function sgn() is defined as follows.

$$operatorname{sgn}(x)=left{begin{array}{l}-1kern0.5em x<0\ {}0kern0.75em x=0\ {}begin{array}{cc}1& x>0end{array}end{array}right.$$

(6)

It can be found that the candidate features are not limited to those with invariant property. Some features have invariant property, such as ( {u}_ifrac{partial p}{partial {x}_i} ) and S ’ /Ma2, while others don’t, such as velocity vector and its derivatives, which are invariant to translation but not to rotation. According to step 1, the MED of each feature can be calculated (see Fig. 5) and the space domain formed by each feature and the output of dataset ( mathtt{D} ) and ( mathtt{T} ) can also be plotted. For example, the space domain of some features can be seen in Fig. 6. After removing the features with large MED, the remaining features are (q5, q8, q10, q11, q12, q13, q14). For step 2, the mean square error (MSE) of 7 datasets ( {mathtt{Q}}_1-{mathtt{Q}}_7 ) is shown in Fig. 7. Thus, the feature importance is q8 > q13 > q5 > q12 > q14 > q10 > q11 according to the MSE. In step 3, the min _ Feature _ NO is set to be 3, which means that the features q8, q13, q5 are necessary.

Starting from these three features, an individual feature subset can be generated as the feature number increases by one in the order of the feature importance and a corresponding neural network model can be constructed. The best feature subset can be found by the model performance on validation set. In order to avoid the influence of occasionality, the model is trained three times for each feature subset, and the model performance is measured according to the mean value of the errors on validation set. As is shown in Fig. 8 where the horizontal axis means the feature number of the feature subset, the model performance is the best when the first four features are adopted. Therefore, the features used to construct the model hereinafter are (q8, q13, q5, q12).