# Identification of citrus diseases based on AMSR and MF-RANet – Plant Methods

Sep 24, 2022

### Data acquisition

The dataset used in the experiment comes from two sources: dataset websites and orchard collection. The dataset websites are the China Science Data Network [22] and the Digipathos website [23], from which a total of 736 images of 5 common citrus diseases were carefully selected. The second source is a cooperation with Central South University of Forestry Science and Technology: the image data were collected at the economic forest and fruit production, study and research base jointly built by Central South University of Forestry Science and Technology and the Changsha Forestry Bureau. The camera is a Canon EOS R, and the image resolution is 2400 × 1600 pixels. We took optical images of diseased and normal citrus from different angles in the morning, at midday and in the evening under sunny, cloudy and foggy weather conditions. The shooting background is the complex background of the orchard. Such photos reflect many of the complex citrus growth situations found in the orchard, which makes the collected images more representative. A total of 1772 images were collected in the orchard: 482 samples with uniform illumination on sunny days, 531 samples with uneven illumination, 368 samples taken on cloudy days and 391 samples disturbed by clouds and dust. Combining orchard collection and the dataset websites, a total of 2525 images of five diseases and normal citrus were obtained: 687 samples with uniform illumination on sunny days, 534 samples taken on cloudy days, 757 samples with uneven illumination and 557 samples disturbed by clouds and dust. Because network training requires a large amount of data, we augmented the original citrus images by rotating, flipping, random cropping and brightness transformation. A total of 10,100 images were finally obtained in the database.
Table 1 shows the categories and data distribution of citrus diseases selected in this paper, including healthy citrus, Huanglong disease, citrus Corynespora, citrus fat spot yellow spot, citrus scab and citrus canker.
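
The four augmentation operations named above (rotation, flipping, random cropping and brightness transformation) can be illustrated with a minimal NumPy sketch. The function name `augment`, the crop fraction and the brightness range are hypothetical choices made for illustration; the paper does not state the parameters that were actually used.

```python
import numpy as np

def augment(image, rng, crop_frac=0.8, max_brightness_shift=0.2):
    """Return four variants of an H x W x 3 uint8 image (all parameters are assumed)."""
    h, w = image.shape[:2]
    # Rotation by a random multiple of 90 degrees (90, 180 or 270).
    rotated = np.rot90(image, k=int(rng.integers(1, 4)))
    # Horizontal flip.
    flipped = image[:, ::-1]
    # Random crop covering crop_frac of each side.
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    cropped = image[top:top + ch, left:left + cw]
    # Brightness transformation: scale all pixels by a random factor near 1.
    factor = 1.0 + rng.uniform(-max_brightness_shift, max_brightness_shift)
    brightened = np.clip(image.astype(np.float64) * factor, 0, 255).astype(np.uint8)
    return [rotated, flipped, cropped, brightened]
```

Passing a seeded generator such as `np.random.default_rng(0)` makes the augmentation reproducible.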

### Citrus image enhancement based on AMSR

Many citrus plants are grown in mountainous and hilly areas, which are prone to cloudy and rainy weather. When citrus disease images are collected, they may be degraded by dust, clouds, low light and other environmental factors, so some lesions in the dataset are blocked or unclear. Examples in Fig. 3 include citrus Huanglong disease (yellow dragon disease) photographed in fog and Corynespora blight on citrus photographed in low light. The disease characteristics of some citrus are not obvious under low light and cloud occlusion, and these environmental limitations reduce the identification accuracy of the network to a certain extent. To improve citrus disease recognition accuracy in the subsequent network, this paper uses the AMSR algorithm to enhance citrus disease images. The AMSR algorithm can effectively alleviate the interference of clouds and low light with the clarity of citrus disease spots. Figure 3 shows that after image enhancement, the disease spot features in the citrus Huanglong image obscured by clouds and in the Corynespora blight image taken in low light become clear and obvious, which makes it easier for the neural network to extract the characteristics of citrus diseases.

In citrus image enhancement based on AMSR, first, the original image of citrus diseases is decomposed into its three RGB colour channels. Second, the incident component of each channel is estimated: three Gaussian surround functions are constructed from three scale parameters and used to filter the image channels, and a weight coefficient is then introduced to obtain the weighted three-channel incident component of each region. Next, the reflection component is calculated by subtracting the illumination component from the original image in the logarithmic domain. Finally, the R, G and B channels of the whole picture are restored, and a brightness compensation factor is added to repair the colour distortion caused by contrast enhancement in local areas of the image. The schematic diagram of the AMSR algorithm is shown in Fig. 2, in which the original input image in Fig. 2a is a citrus Huanglong disease (yellow dragon disease) image taken on a cloudy day, Fig. 2b is a citrus scab image with uneven illumination on a sunny day, and Fig. 2c is a citrus canker (ulcer disease) image with uniform illumination on a sunny day. The enhanced images in the three cases are obtained by the AMSR algorithm. The operation of the AMSR algorithm is divided into the following steps:

### Image decomposition

The resulting citrus disease image is decomposed into three colour channels: R, G and B. Subsequent calculations are implemented in each channel.

### Incident component estimation

#### (1) Single-channel incident component estimation

The incident component suppresses the high-frequency variation of the original citrus disease image: it covers the disease features, blurs the edge details and appears visually as a large, dark, homogeneous area. Because of the influence of incident light, the captured picture deviates from the inherent properties of the object, so the interference of incident light must be eliminated. To this end, the Gaussian surround function is used to filter the three channels of the citrus disease image and thereby estimate the incident light. The Gaussian surround function applied to each of the three channels can be expressed by the following formula:

$${\text{G}}\left( {x,y} \right) = \frac{1}{{2\pi \sigma^{2} }}e^{{\left( { - \frac{{x^{2} + y^{2} }}{{2\sigma^{2} }}} \right)}}$$

(1)

In the formula, σ is the Gaussian surround scale, which is related to the overall smoothness of the Gaussian function. The incoming and outgoing images are calculated by Fourier transform.
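
As a minimal sketch of this step, the NumPy code below samples the Gaussian surround of Eq. (1) on an image-sized grid and applies it to one channel through the Fourier transform. The function names, the normalisation of the sampled kernel to unit sum and the circular (wrap-around) boundary handling are assumptions made for illustration, not details given in the paper.

```python
import numpy as np

def gaussian_surround(shape, sigma):
    """Eq. (1) sampled on an image-sized grid, centred at pixel (0, 0) with wrap-around."""
    h, w = shape
    # fftfreq(n) * n gives signed pixel offsets: 0, 1, ..., -2, -1.
    y = np.fft.fftfreq(h) * h
    x = np.fft.fftfreq(w) * w
    yy, xx = np.meshgrid(y, x, indexing="ij")
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    # Normalise so that filtering preserves the mean brightness (assumed).
    return g / g.sum()

def estimate_incident(channel, sigma):
    """Circular convolution of one colour channel with the Gaussian surround via FFT."""
    kernel = gaussian_surround(channel.shape, sigma)
    spectrum = np.fft.fft2(channel) * np.fft.fft2(kernel)
    return np.real(np.fft.ifft2(spectrum))
```

A larger `sigma` produces a smoother estimate of the incident light, while a smaller `sigma` keeps more local detail.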

#### (2) Weight analysis of three channels

Three different weights represent different incident components in different regions. Because different pixels are located in different regions, they have different weights. The pixel weight is determined by judging whether the pixel lies in an image edge area carrying high-frequency information or in a homogeneous area carrying low-frequency information. If the pixel is in an edge area (carrying high-frequency information), the filtered image obtained with a smaller \(\sigma\) value receives greater weight. If the pixel is in a homogeneous area (carrying low-frequency information), the filtered image obtained with a larger \(\sigma\) value receives greater weight. This article presents a new weight coefficient, the frequency comparison coefficient, which is calculated as follows:

$$A = a_{1} \cot^{ - 1} \left( \sigma \right) + a_{2}$$

(2)

\(\sigma\) is the standard deviation of the local area where the pixel is located. When the standard deviation of a region is small, the pixel values in the region fluctuate very little, which means that the pixel lies in a homogeneous region. The arccot function is applied to \(\sigma\) so that the coefficient is larger in homogeneous areas, that is, the parts of the homogeneous area with lower brightness are taken into account so that the image information becomes more clearly visible. In contrast, the coefficient is smaller in edge detail areas. The estimated three-channel incident image is obtained by weighting the filtering results at the three scales with their respective weights and summing them.
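
The behaviour of the frequency comparison coefficient can be sketched as follows, taking the arccot description in the text literally. The window radius and the constants `a1` and `a2` are hypothetical placeholders, since the paper does not report their values.

```python
import numpy as np

def local_std(channel, radius=1):
    """Standard deviation over a (2*radius+1)^2 neighbourhood, using edge padding."""
    padded = np.pad(channel.astype(np.float64), radius, mode="edge")
    k = 2 * radius + 1
    h, w = channel.shape
    # Stack every shifted copy of the image so axis 0 enumerates the window.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(k) for dx in range(k)])
    return windows.std(axis=0)

def frequency_coefficient(channel, a1=1.0, a2=0.0, radius=1):
    """A = a1 * arccot(sigma) + a2, with sigma the local standard deviation."""
    sigma = local_std(channel, radius)
    # For sigma >= 0, arccot(sigma) = pi/2 - arctan(sigma), decreasing from pi/2 to 0.
    arccot = np.pi / 2.0 - np.arctan(sigma)
    return a1 * arccot + a2
```

With these placeholder constants the coefficient equals π/2 in perfectly flat regions and falls towards 0 as the local contrast grows, which matches the qualitative description above.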

### Acquisition of the reflection component

A given citrus disease image S(x, y) can be decomposed into two images: the reflected image R(x, y) and the incident image L(x, y). Incident light strikes the object and is reflected by it into the human eye. The resulting image can be represented by the following formula:

$${\text{S}}\left( {x,y} \right){\text{ = R}}\left( {x,y} \right){\text{*L}}\left( {x,y} \right)$$

(3)

where S (x, y) represents the image taken by the camera, L (x, y) represents the irradiation component of external light, R (x, y) represents the reflection component of the photographed object, and * represents simple multiplication. Then, a logarithmic operation is performed on the above formula, and the formula is as follows:

$${\text{Log}}\left[ {R\left( {x,y} \right)} \right] = {\text{Log}}\left[ {S\left( {x,y} \right)} \right] - {\text{Log}}\left[ {L\left( {x,y} \right)} \right]$$

(4)

L (x, y) is approximately replaced by the convolution of S (x, y) and a Gaussian kernel. L (x, y) is the incident component estimated above; then, R (x, y) can be expressed by the following formula:

$$R_{{MSR_{i} }} = \sum\nolimits_{n = 1}^{N} {\omega_{n} R_{ni} } = \sum\nolimits_{n = 1}^{N} {\omega_{n} } \left\{ {{\text{log}} S_{i} \left( {x,y} \right) - {\text{log}}\left[ {G\left( {x,y} \right)*S_{i} \left( {x,y} \right)} \right]} \right\}$$

(5)

In the above formula, S is the original input image, G is the Gaussian filter function, N is the number of scales, \(\omega\) is the weight of each scale, and R represents the output of the image in the log domain. Since the incident component L(x, y) is obtained by Gaussian convolution, the reflected component can be obtained by using the above formula. The filtered results on the 3 scales are then weighted and summed to obtain the estimated illuminance image. \(\omega_{k}\) denotes the weighting coefficient of the k-th scale, and the weights must satisfy:

$$\sum\nolimits_{k = 1}^{N} {\omega_{k} } = 1$$

(6)
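
Equations (5) and (6) can be combined into a short multiscale Retinex sketch for a single channel. The three scale values, the equal default weights, the offset added before taking logarithms and the circular boundary handling are all assumptions made for illustration; the adaptive weights of AMSR are not reproduced here.

```python
import numpy as np

def blur(channel, sigma):
    """Circular Gaussian filtering in the Fourier domain (boundary handling assumed)."""
    h, w = channel.shape
    yy, xx = np.meshgrid(np.fft.fftfreq(h) * h, np.fft.fftfreq(w) * w, indexing="ij")
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    kernel /= kernel.sum()
    return np.real(np.fft.ifft2(np.fft.fft2(channel) * np.fft.fft2(kernel)))

def multiscale_retinex(channel, sigmas=(15.0, 80.0, 250.0), weights=None, offset=1.0):
    """Weighted sum over scales of log S - log(G * S), as in Eq. (5)."""
    s = channel.astype(np.float64) + offset      # offset avoids log(0)
    if weights is None:
        weights = np.full(len(sigmas), 1.0 / len(sigmas))
    weights = np.asarray(weights, dtype=np.float64)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("scale weights must sum to 1 (Eq. 6)")
    out = np.zeros_like(s)
    for w_n, sigma in zip(weights, sigmas):
        incident = np.maximum(blur(s, sigma), 1e-12)   # estimated illumination
        out += w_n * (np.log(s) - np.log(incident))
    return out
```

On a uniformly lit flat region the estimated illumination equals the image itself, so the log-domain reflection component is zero there; only local deviations from the surround survive.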

### Channel merging and brightness compensation

The grey reflection components are combined and brightness compensated to restore the three colour channels R, G and B; the brightness compensation factor introduced is \(\lambda\). The formula is as follows:

$$R_{j} \left( {x,y} \right) = \frac{{I_{j} \left( {x,y} \right)}}{{I\left( {x,y} \right)}}*\lambda$$

(7)

\(I_{j}(x,y)\) refers to the R, G and B channels of the original image, and \(\lambda\) is the factor that adjusts the brightness of the three bands. Experiments show that setting \(\lambda\) to 1 works better. The AMSR algorithm proposed in this article can enhance and preserve edge information under low illumination while maintaining the image colour, and the principle of the algorithm is simple. It resolves the conflict between retaining details and retaining picture colours, and its actual image enhancement effect is significantly better than that of the MSR algorithm.
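
A minimal sketch of the per-channel ratio in Eq. (7) is given below. The text does not define I(x, y); the code assumes it is the per-pixel mean of the three channels, and the small constant `eps` that guards against division by zero is likewise an assumption.

```python
import numpy as np

def colour_ratio(image, lam=1.0, eps=1e-6):
    """Eq. (7): ratio of each channel to the intensity image, scaled by lambda."""
    rgb = image.astype(np.float64)
    # Assumed definition of I(x, y): mean of the R, G and B channels.
    intensity = rgb.mean(axis=2, keepdims=True)
    return rgb / (intensity + eps) * lam
```

For a neutral grey pixel all three ratios are approximately `lam`, so the colour balance is left unchanged; for a saturated pixel the dominant channel receives a ratio greater than `lam`.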

The comparison of citrus disease images in the AMSR enhancement algorithm before and after enhancement is shown in Fig. 3. Figure 3a shows the actual images of five diseases and healthy citrus taken in three scenarios: cloudy, even and uneven light on sunny days; Fig. 3b is the citrus image enhanced by AMSR:

### Identification of citrus diseases based on MF-RANet

Many kinds of citrus diseases have similar characteristics. For example, both Corynespora blight of citrus and fat spot macular disease in Fig. 3 show dark brown, scar-like, concave small particles, and a few yellow-brown spots. Yellow-white raised lesions are found in both citrus scab and citrus canker, and their distribution is relatively concentrated. The same disease also shows different imaging features in its early and late stages: early lesions are hidden and small in area. Recognizing citrus diseases is therefore difficult and requires a deep neural network that extracts more detailed features to achieve higher recognition accuracy. However, once a deep network used for feature extraction exceeds a certain depth, further deepening increases the possibility of gradient degradation and does not improve classification performance. This leads to slower network convergence, lower accuracy and easy loss of the main features. At this point, even enlarging the dataset does not improve classification performance and accuracy. Therefore, we propose a new network structure, MF-RANet, to solve the above problems. MF-RANet is composed of a main feature frame path and a detail feature frame path, which extract the main recognition features and the detailed features of citrus diseases, respectively.

### Main feature frame path

The main feature frame path is composed of the layers of the ResNet50 network and the residual attention module (RAM). The following focuses on the ResNet50 and RAM structures.

#### 1. ResNet50

ResNet is a network model proposed by He et al. [24] in 2015. It has surpassed a series of algorithms, such as VGG [25], R-CNN [26], Fast R-CNN [27], and Faster R-CNN [28], in image classification and has become a basic feature extraction network in the field of general computer vision. ResNet50 uses a residual unit, which reduces the number of parameters and adds a direct channel in the network, increasing the CNN's ability to learn features [29]. It alleviates the vanishing-gradient problem that makes deep networks difficult to train. Through this residual unit structure, the learning goal of the network is simplified and the classification accuracy is improved, and the structure is easily portable.

Each layer of the ResNet50 network contains 2 kinds of modules, an identity block and a convolution block. Convolution blocks can change the network dimension, but they cannot be connected in series; identity blocks are used to deepen the network and can be connected in series. As the network deepens, the learned features become more complex and the number of output channels increases. Therefore, while identity blocks are used to deepen the network, convolution blocks are also needed to convert dimensions so that the features of the earlier layers can be passed to the later feature layers. ResNet50 remains one of the classic and widely used networks because of its few parameters, deep layers and excellent classification and recognition performance. However, because small features are easily ignored and similar features are hard to distinguish in the identification of citrus diseases, a single ResNet50 structure is still not enough. Therefore, this article improves upon ResNet50.

#### 2. RAM

When people observe and recognize a target, they focus on its prominent parts and ignore some global and background information. This selective attention mechanism matches the role of discriminative parts in fine-grained image classification. To focus on the distinguishing features of citrus disease images, this paper interleaves a residual attention mechanism (RAM) with each layer of ResNet50. The RAM assigns higher weights to features that contain disease identification information, which effectively improves fine-grained classification. The RAM has two branches, the mask branch and the trunk branch. The trunk branch performs convolution, and the mask branch processes the feature map to output an attention feature map of the same dimension. The feature maps of the two branches are then combined by element-wise multiplication to obtain the final output feature map. The resulting RAM model is shown in Fig. 5. The structure and model construction of the RAM module are introduced below:

The trunk branch of the RAM contains two convolutions, which process the input features directly to the same size as the output of the mask branch, 7 × 7.

In the mask branch structure, the feature map is processed by a forward downsampling process followed by an upsampling process. Downsampling provides fast encoding and captures the global features of the feature map. Upsampling restores the extracted global high-dimensional features and combines them with the features that were not downsampled, fusing high-dimensional and low-dimensional features. The specific operations are as follows: for a fixed input, the mask branch applies multilayer convolution and then max pooling to reduce the feature map dimension, until the width and height of the feature map reach the minimum size of the network output feature map, 7 × 7. Then, the width and height of the feature map are expanded layer by layer by bilinear interpolation, and the result is added to the earlier features of the same dimension. The mask branch thus combines global and local features to enhance the expressive ability of the feature map.

The RAM model built from these two parts is described below. The output feature map of the trunk branch is \(T_{i,c}(x)\), the output feature map of the mask branch is \(M_{i,c}(x)\), and the output feature map of the attention module is \(H_{i,c}(x)\). The framework formula of the model is:

$$H_{i,c} \left( x \right) = \left[ {1 + M_{i,c} \left( x \right)} \right]*T_{i,c} \left( x \right)$$

(8)

\(M_{i,c}(x)\) takes values in the interval [0, 1]. Adding 1 to it solves the problem of reduced feature values raised in 1, because the trunk features are then scaled by a factor of at least 1 rather than attenuated. In this part, the difference between this article and the residual network is that the residual network formula \(H_{i,c}(x) = x + T_{i,c}(x)\) learns the residual between output and input, while in this article \(T_{i,c}(x)\) is learned and fitted by a deep convolutional neural network structure. Combined with the output of the mask branch, the important features in the output feature map \(T_{i,c}(x)\) are strengthened, while the unimportant features are suppressed. Finally, stacking the residual attention modules with the residual blocks of ResNet50 gradually improves the expressive ability of the network.
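
The combination rule of Eq. (8) can be illustrated with a few lines of NumPy. The use of a sigmoid to map the raw mask output into [0, 1] is an assumption made here for illustration; the text only states that the mask values lie in that interval.

```python
import numpy as np

def residual_attention(trunk, mask_logits):
    """Eq. (8): H = (1 + M) * T, applied element-wise to the trunk features."""
    # Assumed normalisation: a sigmoid squashes the mask branch output into (0, 1).
    mask = 1.0 / (1.0 + np.exp(-mask_logits))
    return (1.0 + mask) * trunk
```

When the mask is close to 0 the module passes the trunk features through unchanged, and when it is close to 1 the features are doubled, so the attention can emphasise features without ever suppressing them below their trunk value.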

### Detail feature frame path (AugFPN)

The detail feature frame path extracts detail features by AugFPN feature fusion. As mentioned above, the network for identifying citrus diseases is based on ResNet50. Although features are extracted by convolution, after repeated resampling some features of small objects are lost and cannot be recognized effectively. To ensure that detailed features are not lost in citrus disease recognition, so that object features of any size can be detected effectively and correct recognition results obtained, this article adds feature fusion to the main feature frame path using AugFPN, an improved version of FPN [30]. First, a consistency monitoring mechanism applies the same supervision signal to the feature maps so that the laterally connected feature maps contain similar semantic information. Second, residual feature enhancement uses ratio-invariant adaptive pooling to extract different context information and reduces, by means of a residual connection, the information loss of the highest-level features in the feature pyramid. Third, soft ROI selection is introduced to make better use of ROI features at different pyramid levels and to provide better ROI features for subsequent location refinement and classification. A schematic diagram of the principle is shown in Fig. 6. B1–4 in the figure represent the four feature layers of ResNet50 with the attention mechanism added. M1–4 represent the auxiliary losses of the 4 feature layers, and P represents the main loss. The same supervision signal is added to the features of each layer. The specific steps are as follows:

#### 1. Consistency monitoring module

First, the feature pyramid is constructed from the multiscale features (B1, B2, B3, B4) of the main feature frame path. The ROI features of each level (M1, M2, M3, M4) are obtained through ROI Align. Then, the ROI features of (M1, M2, M3, M4) are convolved with a 3 × 3 kernel to obtain the feature pyramid (P1, P2, P3, P4) and generate multiple ROIs. A detector and a classifier are attached to each of the features (P1, P2, P3, P4) before fusion. The classification and regression parameters are shared across levels, which further forces the different feature maps to learn similar semantic information in addition to receiving the same supervision signal. For more stable optimization, a weight is used to balance the auxiliary loss introduced by consistent supervision against the original loss. Formally, the final loss function is as follows:

$$L_{rcnn} = \lambda \left[ {L_{cls,M} \left( {p_{M} ,t^{*} } \right) + \beta \left( {t^{*} > 0} \right)L_{loc,M} \left( {d_{M} ,b^{*} } \right)} \right] + L_{cls,P} \left( {p,t^{*} } \right) + \beta \left( {t^{*} > 0} \right)L_{loc,P} \left( {d,b^{*} } \right)$$

(9)

\(L_{cls,M}\) and \(L_{loc,M}\) are the objective functions of the auxiliary losses attached to (M1, M2, M3, M4). \(L_{cls,P}\) and \(L_{loc,P}\) are the original loss functions on the feature pyramid (P1, P2, P3, P4). \(p_{M}\), \(d_{M}\) and \(p\), \(d\) are the predictions of the intermediate layers and the final pyramid layers, respectively. \(t^{*}\) and \(b^{*}\) are the ground-truth class labels and regression targets, respectively. λ is the weight used to balance the auxiliary loss and the original loss, and β is the weight used to balance the classification and localization losses. Sharing the classification and regression parameters across levels further forces the different feature maps to learn similar semantic information in addition to receiving the same supervision signal. As shown in Fig. 6a, the schematic diagram of the AugFPN fusion framework, these measures allow consistency monitoring to reduce the semantic gap between information at different scales.
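
The way Eq. (9) combines its four terms can be made concrete with a small scalar sketch. It assumes that the factor (t* > 0) acts as an indicator that switches the localization loss on only for non-background labels; the function and argument names are hypothetical, and the individual loss values are taken as given rather than computed.

```python
def rcnn_loss(cls_m, loc_m, cls_p, loc_p, t_star, lam, beta):
    """Combine auxiliary (M) and original (P) losses as in Eq. (9)."""
    # Assumed reading of (t* > 0): 1 for foreground labels, 0 for background.
    foreground = 1.0 if t_star > 0 else 0.0
    auxiliary = cls_m + beta * foreground * loc_m
    original = cls_p + beta * foreground * loc_p
    return lam * auxiliary + original
```

For example, with `cls_m=1.0`, `loc_m=2.0`, `cls_p=3.0`, `loc_p=4.0`, `lam=0.5` and `beta=2.0`, a foreground label gives 0.5 × (1 + 4) + (3 + 8) = 13.5, while a background label drops both localization terms and gives 0.5 × 1 + 3 = 3.5.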

#### 2. Residual feature enhancement

AugFPN fusion proposes residual feature enhancement to reduce the loss of semantic information caused by the reduction in the number of channels through spatial information compensation. First, B4 is downsampled into three parts as large as B4 through adaptive pooling. Then, the four layers are fused into P5. The weights of the layers, \(\alpha_{1}\), \(\alpha_{2}\) and \(\alpha_{3}\), are 0.1, 0.2 and 0.3, respectively. After P5 is generated, it is combined with P4 by summation and propagated to the other features at lower levels. The residual feature enhancement structure is shown in Fig. 6b.

#### 3. Soft ROI feature selection

First, because (P1, P2, P3, P4) each layer contains ROI features, we use the adaptive spatial fusion module (ASF) to adaptively fuse ROI features. ASF generates different spatial weight maps for different levels of region of interest features and weights and fuses the region of interest features. The specific fusion process of different levels of features and the framework of adaptive fusion (ASF) are shown in Fig. 7.

Based on the above principles, AugFPN reduces the semantic gap between features at different scales before feature fusion through consistency monitoring. During feature fusion, ratio-invariant context information is extracted by residual feature enhancement to reduce the information loss of the feature map at the highest pyramid level. The soft ROI selection method then improves feature extraction and fusion through adaptive spatial fusion. The resulting features are integrated with the fully connected layer of the network, which effectively solves the common problem of losing small image features in the main feature frame path.

### Mixed activation function—ELU function

The introduction of an activation function increases the nonlinearity of the neural network model. The nonlinear expression ability of the activation function is strong. When the linear input is large, the output will not expand infinitely, which does not easily lead to gradient explosion. In addition, gradient descent can be effectively realized because the nonlinear activation function is differentiable. Traditional saturating activation functions such as sigmoid and tanh suffer from vanishing gradients, which makes the convergence of the training network increasingly slow.

The activation function used in the original ResNet50 model is the ReLU [31] function. The linear and unsaturated form of the ReLU function allows the ResNet50 model to solve the problem of gradient disappearance in the positive region. However, if the input distribution after network initialization is not ideal, or a large gradient suddenly occurs in the training process, which affects the distribution of the next input, the distribution centre becomes negative. Then, most of the inputs in the ResNet50 model are negative. When the negative input is activated and zeroed, the gradient will not be obtained. Finally, the weight of the negative input cannot be updated.

In view of the above shortcomings of the ReLU function, we choose an improved variant of ReLU, the ELU activation function. The function and its derivative are shown in Fig. 8:

The expression of the ELU function is:

$$f( x ) = \left\{ \begin{array}{ll} x, &\quad \text{if } x > 0 \\ \alpha \left( {e^{x} - 1} \right), &\quad \text{otherwise} \\ \end{array} \right.$$

(10)

The ELU function coincides with the ReLU function for inputs greater than 0. For inputs less than 0, the ELU function is \(\alpha (e^{x} - 1)\), so it still produces a nonzero output when the input is negative. First, this ensures that the ELU function inherits the advantages of the ReLU function and addresses the problems of gradient explosion and gradient disappearance in the network. Second, when the input of the ELU function is less than 0, the parameters can still be updated, which effectively solves the problem of neuron death and allows the negative part of the activation function to be used effectively. It can also make the network converge faster.
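
The difference between ReLU and ELU for negative inputs can be checked directly with a scalar sketch of Eq. (10) and its derivative. The default `alpha=1.0` is an assumption; the paper does not state the value used.

```python
import math

def elu(x, alpha=1.0):
    """Eq. (10): identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for x > 0, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)

def relu_grad(x):
    """Derivative of ReLU, shown for comparison: zero for non-positive inputs."""
    return 1.0 if x > 0 else 0.0
```

For a negative input such as `x = -1.0`, `relu_grad` returns 0 so the corresponding weight receives no update, whereas `elu_grad` returns a small positive value and the weight can still be adjusted.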

### Label smoothing regularization

In a neural network, a large number of model parameters easily causes the model to overfit. Typical regularization methods such as L1, L2 and dropout [32] are used to suppress this overfitting. In citrus disease classification, however, we use the softmax function to calculate the probability that the input image belongs to each disease. The category with the highest probability is then taken as the disease category of the input, and the cross entropy is used as the loss function. This gives the maximum reward for correct classification and the maximum penalty for incorrect classification, so overfitting occurs easily in the classification task. Therefore, in the MF-RANet network, this paper uses label smoothing regularization to alleviate overfitting in classification. The specific steps of label smoothing regularization are as follows:

In the citrus disease classification task, the confidence scores of citrus disease images for the various diseases are obtained through the MF-RANet network. These scores are normalized by the softmax function [33] to obtain the probability that the current input belongs to each category. The formula of the softmax function is as follows, where K is the total number of citrus image categories, 6 (5 disease categories and 1 healthy category):

$${\text{q}}_{i} = \frac{{\exp \left( {z_{i} } \right)}}{{\sum\nolimits_{j = 1}^{K} {\exp \left( {z_{j} } \right)} }}$$

(11)
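
A minimal sketch of the softmax normalisation for the six citrus categories is shown below. Subtracting the maximum score before exponentiation is a standard numerical-stability step and is an addition made here; it does not change the result.

```python
import math

def softmax(scores):
    """Eq. (11): turn K confidence scores into probabilities that sum to 1."""
    top = max(scores)                         # shift for numerical stability
    exps = [math.exp(z - top) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With six equal scores every category receives probability 1/6, and raising one score increases its probability at the expense of the others.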

Label smoothing replaces the one-hot label distribution with a smoothed distribution that spreads a small amount of probability uniformly over the other classes. The formula is as follows, where \(\varepsilon\) is a small hyperparameter:

$${\text{P}}_{\text{i}} = \left\{ \begin{array}{ll} (1 - \varepsilon ), &\quad \text{if } ( i = y) \\ \frac{\varepsilon }{K - 1}, &\quad \text{if } ( i \ne y ) \end{array} \right.$$

(12)

The cross entropy is

$${\text{Loss}}_{{\text{ i}}} = \left\{ \begin{array}{ll} (1 - \varepsilon )*{\text{Loss}}, &\quad \text{if } \left( {i = y} \right) \\ \varepsilon *{\text{Loss}} , &\quad \text{if } ( i \ne y) \\ \end{array} \right.$$

(13)
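
The smoothed targets of Eq. (12) and the resulting loss can be sketched in plain Python. The code uses the standard cross entropy between the smoothed targets and the predicted probabilities, which is how the weighted loss is read here; that reading, the function names and the default `eps=0.1` are assumptions.

```python
import math

def smoothed_targets(y, num_classes, eps=0.1):
    """Eq. (12): 1 - eps for the true class, eps / (K - 1) for every other class."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == y else off_value for i in range(num_classes)]

def cross_entropy(targets, probs):
    """Cross entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)
```

For six classes and `eps=0.1` the true class receives a target of 0.9 and each of the other five classes receives 0.02, so an over-confident prediction such as 0.999 for the true class is penalised more than a prediction that matches the smoothed targets.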

When training the MF-RANet network, the cross entropy of the prediction probability and label real probability is minimized to obtain the optimal prediction probability distribution. The prediction probability distribution of the optimal fitting effect of label smoothing is:

$${\text{Z}}_{i} = \left\{ \begin{array}{ll} \log \frac{{\left( {K - 1} \right)\left( {1 - \varepsilon } \right)}}{\varepsilon } + \alpha , &\quad \text{if } \left( {i = y} \right) \\ \alpha , &\quad \text{if } \left( {i \ne y} \right) \\ \end{array} \right. \quad \left( {\alpha \ {\text{can be any real number}}} \right)$$

(14)
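
The optimal scores can be checked numerically. The sketch below assumes the optimal score of the true class has the form log((K − 1)(1 − ε)/ε) + α and that every other class has score α, and it confirms that applying softmax to these scores gives 1 − ε for the true class and ε/(K − 1) for each other class, for any choice of α.

```python
import math

def optimal_logits(y, num_classes, eps, alpha=0.0):
    """Scores whose softmax reproduces the smoothed label distribution."""
    gap = math.log((num_classes - 1) * (1.0 - eps) / eps)
    return [gap + alpha if i == y else alpha for i in range(num_classes)]

def softmax(scores):
    """Softmax with a max shift for numerical stability."""
    top = max(scores)
    exps = [math.exp(z - top) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the gap between the true-class score and the other scores is finite, the network is no longer pushed to make the true-class score arbitrarily large, which is the regularising effect of label smoothing.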

The essence of label smoothing regularization is to suppress the output difference between positive and negative samples and smooth the label. The smoothed label can prevent the network from overlearning. This can effectively alleviate the overfitting phenomenon.

ResNet50 is the basic network of the MF-RANet network. Before the hidden layers of the ResNet50 network, there is a batch normalization layer that normalizes the data. Therefore, overfitting of the network when training on citrus disease samples is inhibited by batch normalization. Unlike batch normalization, label smoothing regularization effectively alleviates overfitting in the classification process.