# SeparaFill: Two generators connected mural image restoration based on generative adversarial network with skip connect – Heritage Science

#### By Chaohui Lv, Zilu Li, Yinghua Shen, Jinghua Li and Jin Zheng

Aug 31, 2022

An analysis of the painting characteristics of murals shows that they are dominated by contour lines. The pixels of a mural image can therefore be divided into contour-line parts and color-block parts separated by the contour lines. The contour-line parts consist of many thin, narrow and continuous lines. Their color range is limited and usually consists of dark colors such as black, brown and red. The color-block parts, by contrast, contain rich color information. Their size and shape depend on the contour lines, but the gray scale inside a color block is continuous and the texture information is simple. Based on these characteristics, a mural image restoration network with two connected generators, built on the U-Net architecture, is proposed. The method restores the mural contour and its internal color blocks separately, which reduces the difficulty of restoration. After training, the network obtains better restoration results than the other algorithms compared.

The mural restoration network consists of three main parts: a contour restoration generator network, a content completion restoration generator network, and a global and local discriminator network.

### Contour restoration network

The generator of the contour restoration network is an improvement on the EdgeConnect model [20]. The network takes 9 input channels, which comprise the damaged contour image in three RGB (red, green, blue) channels, the damaged image, and the edge feature map extracted by Sobel edge detection. Feeding the damaged image together with the edge image gives the network rich information and guides the contour restoration. The network consists of a down-sampling convolution block containing one 5 × 5 convolution layer and three 3 × 3 convolution layers, 8 Res2Net blocks, and an up-sampling convolution block that recovers the resolution. Skip connections are added between the four down-sampling and up-sampling convolution layers. They reuse the edge features of lower layers and retain more dimension information, so the up-sampling network of the generator can select between shallow and deep features, which enhances the robustness of the network. In the shallow feature extraction network and the up-sampling convolution network, gated convolution replaces ordinary convolution.
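The Sobel edge map mentioned above can be illustrated with a small pure-Python sketch. It applies the standard 3 × 3 Sobel kernels to a toy grayscale image; the function name `sobel_magnitude`, the zero-valued border handling and the toy image are assumptions made for illustration and are not taken from the paper's implementation.

```python
# Standard Sobel kernels for horizontal (GX) and vertical (GY) intensity changes.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(gray):
    """Return the gradient magnitude of a 2-D grayscale image (list of rows).

    Border pixels are left at 0 because the 3 x 3 window does not fit there
    (an assumption of this sketch, not a detail from the paper).
    """
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0.0
            for dr in range(3):
                for dc in range(3):
                    pixel = gray[r + dr - 1][c + dc - 1]
                    gx += GX[dr][dc] * pixel
                    gy += GY[dr][dc] * pixel
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out


# Toy image: dark left half, bright right half, i.e. one vertical contour.
toy = [[0, 0, 1, 1]] * 4
edges = sobel_magnitude(toy)
```

On this toy image the interior pixels next to the dark-to-bright transition get a magnitude of 4.0, while a uniform image yields zeros everywhere, which is the kind of contour cue the edge feature map contributes to the network input.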

The Res2Net block replaces the residual block structure in the contour restoration network. It modifies the structure of the Residual Network (ResNet) block [31], as shown in Fig. 2. The input features first pass through a 1 × 1 convolution layer; the output features are then divided equally by channel count, and the segmented features are fused across the different channel blocks. The expression is as follows:

$$y_{i} = \begin{cases} x_{i}, & i = 1; \\ K_{i}(x_{i} + y_{i-1}), & 1 < i \le s. \end{cases}$$

(1)

where \(x_{i}\) represents the \(i\)-th of the equally divided feature blocks, \(s\) represents the number of equally divided blocks, \(y_{i}\) represents the output of the convolution, and \(K_{i}(\cdot)\) represents a 3 × 3 convolution. The residual structure of Res2Net retains ResNet's ability to avoid vanishing and exploding gradients, and realizes channel-block multiplexing of the 3 × 3 convolution layer in the ResNet block. This multi-scale channel fusion of the input features strengthens feature extraction without adding network parameters.
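The recursion in Eq. (1) can be traced on toy data with the following sketch. The learned 3 × 3 convolutions are replaced by a hypothetical stand-in function `double`, and the names `res2net_split` and `x_blocks` are illustrative assumptions; only the order of operations follows the formula.

```python
def res2net_split(x_blocks, kernels):
    """Apply Eq. (1): y_1 = x_1, y_i = K_i(x_i + y_{i-1}) for 1 < i <= s.

    x_blocks: list of s equally sized channel groups (each a list of floats).
    kernels:  list of s callables; kernels[0] is unused because y_1 = x_1.
    """
    outputs = [x_blocks[0]]
    for x_i, k_i in zip(x_blocks[1:], kernels[1:]):
        # Add the previous group's output before applying this group's kernel.
        summed = [a + b for a, b in zip(x_i, outputs[-1])]
        outputs.append(k_i(summed))
    return outputs


def double(block):
    # Hypothetical stand-in for a learned 3 x 3 convolution.
    return [2.0 * v for v in block]


x = [[1.0], [2.0], [3.0], [4.0]]          # s = 4 groups, one value each
y = res2net_split(x, [None, double, double, double])
```

With these inputs `y` is `[[1.0], [6.0], [18.0], [44.0]]`: each later group sees information from all earlier groups, which is how the block obtains multi-scale features from a single set of 3 × 3 kernels.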

### Content completion restoration network

The aim of the second stage is to complete the color blocks between the contour lines. The network inputs are the contour map generated in the first stage and the corresponding damaged image after its contour lines have been repaired. Each content completion restoration block includes a U-shaped image repair branch network and a convolution branch without down-sampling. Stacking multiple such modules allows the missing areas to be repaired finely. The U-shaped image repair branch consists of four down-sampling convolution layers with 3 × 3 kernels and a stride of 2, a feature extraction network with two self-attention blocks, and up-sampling layers of 3 × 3 convolutions that correspond to the down-sampling layers. Features are fused by direct addition, so the feature map of each dimension contains more features. This operation reduces network parameters and memory footprint, which leaves room for module stacking. Furthermore, an accumulation feature extraction mechanism is proposed: the feature map output by each convolution layer is superimposed on the feature map output by the preceding layer, which realizes multi-level feature fusion at different resolutions. The formula is as follows:

$$y^{l} = \frac{y^{l}}{2^{l}} + \frac{1}{2^{l}}\sum\limits_{i = 1}^{l} 2^{i-1} y^{i},$$

(2)

where \(y^{l}\) represents the convolution fusion output of layer \(l\), and \(y^{i}\) represents the fusion output of convolution layer \(i\). The formula shows that, with the superposition fusion mechanism, the features extracted by a fusion convolution layer accumulate the outputs of all preceding convolution layers. Since feature fusion by direct addition requires input feature maps of the same size, feature maps with different resolutions are matched by reducing the resolution through a 1 × 1 convolution layer.
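One way to read the weights in Eq. (2) is as the unrolled form of a running average: each fused output is the mean of the previous fused output and the current convolution output. The sketch below checks that reading on scalar toy values. It is an interpretation, not something the text states: the starting feature `y0` (standing in for the first term of the formula), the function names and the numbers are all assumptions.

```python
def fuse_recursive(y0, conv_outputs):
    """Average the running fused feature with each new convolution output."""
    fused = y0
    for c in conv_outputs:
        fused = (fused + c) / 2.0
    return fused


def fuse_closed_form(y0, conv_outputs):
    """Closed form with the weights of Eq. (2): 1/2^l and 2^(i-1)/2^l."""
    l = len(conv_outputs)
    total = sum(2 ** (i - 1) * c for i, c in enumerate(conv_outputs, start=1))
    return y0 / 2 ** l + total / 2 ** l


y0 = 8.0                       # hypothetical starting feature (scalar toy value)
convs = [4.0, 2.0, 6.0]        # hypothetical outputs of layers 1..3
```

Both functions return 5.0 for these values. Under this reading, the most recent layer carries weight 1/2 and earlier layers decay geometrically, so every preceding layer still contributes to the fused feature.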

Because a convolution kernel operates on a local region of the image and represents local features, the influence of global features on the current region becomes very small as the convolution network deepens. The self-attention mechanism [32] can capture long-distance dependencies, that is, it attends to global characteristics and so enlarges the receptive field of the network. After the feature accumulation layer, the self-attention mechanism is employed to capture both the overall features and the detail features of the mural image, which makes the generated image more detailed (Fig. 3).

The other branch does not down-sample while processing the input image information and keeps the resolution of the original input, so that the input information is reused and helps refine the texture of the restored image.

Dilated convolutions are used in the content completion restoration network. The dilation rate cycles through 1, 2 and 5 to enlarge the receptive field of the convolution. Since the large damaged area has been divided into small pieces once the contour lines are repaired in the first stage, restoration becomes easier. Therefore, partial convolutions, which have fewer parameters, are used to update the mask and perform detailed restoration through the stacking of modules.
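The effect of the 1, 2, 5 dilation cycle on the receptive field can be estimated with a short calculation. The sketch assumes stride-1 convolutions with 3 × 3 kernels, which the text does not state for these layers; `receptive_field` is a hypothetical helper name.

```python
from itertools import cycle, islice


def receptive_field(dilations, kernel_size=3):
    """Receptive field (in pixels, per axis) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# First six layers of the repeating 1, 2, 5 dilation schedule.
schedule = list(islice(cycle([1, 2, 5]), 6))
```

Under these assumptions one pass through the cycle covers 17 pixels per axis, compared with 7 for three undilated 3 × 3 layers, and two passes cover 33, all without adding parameters.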

### Loss function

The loss function in the contour inpainting phase is expressed as:

$$L_{s_G} = \lambda_{adv} L_{adv} + \lambda_{rec} L_{rec} + \lambda_{FM} L_{FM},$$

(3)

where \(L_{adv}\) is the adversarial loss based on the discriminator, \(L_{rec}\) is the \(L_{1}\) reconstruction loss, \(L_{FM}\) is the feature-matching loss, and \(\lambda_{adv}\), \(\lambda_{rec}\) and \(\lambda_{FM}\) are the weights of the respective losses.

A GAN obtains the optimal solution by optimizing the value function, which is expressed as:

$$\min_{G} \max_{D} V\left( D,G \right) = E_{x\sim P_{data}\left( x \right)} \left[ \log D\left( x \right) \right] + E_{i\sim P_{out}\left( i \right)} \left[ \log \left( 1 - D\left( G\left( i \right) \right) \right) \right],$$

(4)

where \(x\) represents the input data, \(P_{data}\left( x \right)\) represents the distribution of the real data, \(P_{out}\left( i \right)\) represents the distribution of the image generated by the generator, \(D\) represents the discriminator, which outputs the probability that its input is real data, and \(G\) represents the generator, which outputs the generated image. The goal of the discriminator is to maximize the value function.

The reconstruction loss constrains the restoration at the pixel level, so as to optimize the network's ability to restore contour details. It is expressed as follows:

$$L_{rec} = \left\| I_{recover} - I_{gt} \right\|_{1} * \lambda_{rec} + \left\| I_{recover} \odot masks - I_{gt} \odot masks \right\|_{1} * \lambda_{rec},$$
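The value function in Eq. (4) can be estimated from discriminator outputs on a batch. The sketch below uses sample means in place of the expectations; the function name `gan_value` and the probability values are hypothetical.

```python
import math


def gan_value(d_real, d_fake):
    """Sample-mean estimate of V(D, G) in Eq. (4).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(i)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


confident = gan_value([0.9, 0.8], [0.1, 0.2])   # discriminator separates well
fooled = gan_value([0.5, 0.5], [0.5, 0.5])      # discriminator cannot tell
```

Here `confident` is larger than `fooled`, which equals 2·log(0.5): a discriminator that separates real from generated images pushes the value up, while a generator that fools it pulls the value down.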

(5)

where \(masks\) is the binary mask image and \(\odot\) is the Hadamard product. The two terms compute the global reconstruction loss for the generated image and the local reconstruction loss for the hole area under the mask constraint, respectively, and \(\lambda_{rec}\) is the weight of the loss function.

The feature-matching loss is used to compare the feature maps in the intermediate layers of the discriminator. The feature-matching loss is expressed as follows:

$$L_{FM} = \mathrm{E}\left[ \sum\limits_{i = 1}^{L} \frac{1}{N_{i}} \left\| D_{1}^{\left( i \right)} \left( I_{gt} \right) - D_{1}^{\left( i \right)} \left( I_{recover} \right) \right\|_{1} \right] * \lambda_{FM},$$
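The two terms of Eq. (5) can be computed on a toy image as follows. This is a minimal sketch: the L1 norm is taken as a plain sum of absolute differences, the mask convention (1 marks the hole) and the default weight `lam_rec=1.0` are assumptions, and all function names are illustrative.

```python
def l1_norm(a, b):
    """Sum of absolute differences between two equally sized 2-D images."""
    return sum(abs(p - q) for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))


def hadamard(img, mask):
    """Element-wise (Hadamard) product of an image and a binary mask."""
    return [[p * m for p, m in zip(row_i, row_m)]
            for row_i, row_m in zip(img, mask)]


def reconstruction_loss(recovered, gt, mask, lam_rec=1.0):
    """Eq. (5): global L1 term plus L1 term restricted to the masked area."""
    global_term = l1_norm(recovered, gt)
    local_term = l1_norm(hadamard(recovered, mask), hadamard(gt, mask))
    return lam_rec * global_term + lam_rec * local_term


gt = [[1.0, 1.0], [1.0, 1.0]]
recovered = [[1.0, 0.5], [1.0, 0.0]]
mask = [[0, 1], [0, 0]]          # assumed convention: 1 marks the hole
loss = reconstruction_loss(recovered, gt, mask)
```

For this example the global term is 1.5 and the masked term is 0.5, so `loss` is 2.0. Errors inside the masked region are counted in both terms, which weights the hole area more heavily than the rest of the image.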

(6)

where \(L\) is the number of convolution layers of the discriminator, \(N_{i}\) is the number of feature maps in the activation layer of layer \(i\), and \(D_{1}^{(i)}\) is the activation of layer \(i\) of the discriminator. \(\lambda_{FM}\) is the regularization parameter.

The content restoration network needs to restore the texture of the image and maintain semantic consistency between the restored image and the ground truth image. The loss function consists of the adversarial loss, reconstruction loss, perceptual loss [33] and structural similarity loss. It is expressed as follows:

$$L_{G} = \lambda_{adv} L_{adv} + \lambda_{rec} L_{rec} + \lambda_{SSIM} L_{MS\text{-}SSIM} + \lambda_{style} L_{style},$$
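Eq. (6) for a single image pair can be sketched as below. The discriminator activations are given directly as toy nested lists rather than computed by a network, the expectation is dropped because only one pair is used, and the names `feature_matching_loss`, `feats_gt` and `feats_rec` are hypothetical.

```python
def feature_matching_loss(feats_gt, feats_rec, lam_fm=1.0):
    """Eq. (6) for one image pair.

    feats_gt, feats_rec: per-layer discriminator activations, each layer given
    as a list of feature maps, each feature map flattened to a list of floats.
    """
    total = 0.0
    for layer_gt, layer_rec in zip(feats_gt, feats_rec):
        n_i = len(layer_gt)                      # number of feature maps N_i
        l1 = sum(abs(a - b)
                 for map_gt, map_rec in zip(layer_gt, layer_rec)
                 for a, b in zip(map_gt, map_rec))
        total += l1 / n_i
    return lam_fm * total


# Toy activations: layer 1 has two feature maps, layer 2 has one.
feats_gt = [[[1.0, 2.0], [3.0, 4.0]], [[0.0]]]
feats_rec = [[[1.0, 1.0], [3.0, 2.0]], [[0.5]]]
fm = feature_matching_loss(feats_gt, feats_rec)
```

Layer 1 contributes 3.0 / 2 = 1.5 and layer 2 contributes 0.5, so `fm` is 2.0. Dividing by \(N_{i}\) keeps layers with many feature maps from dominating the sum.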

(7)

The reconstruction loss \(L_{rec}\) and adversarial loss \(L_{adv}\), together with their weights, are the same as in the contour restoration stage. To better ensure that the texture and color of the restored area fit the original mural, and to keep the style of the whole restored image consistent, the perceptual loss function is introduced. The perceptual loss is divided into a content loss and a style loss, and compares high-level abstract features extracted by the pre-trained VGG-19 model. The formula is as follows:

$$\ell_{feat}^{\varphi ,j} \left( \hat{y}, y \right) = \frac{1}{C_{j} H_{j} W_{j}} \left\| \varphi_{j} \left( \hat{y} \right) - \varphi_{j} \left( y \right) \right\|_{2}^{2},$$

(8)

where \(C_{j}\), \(H_{j}\) and \(W_{j}\) represent the number of channels, height and width of the feature map respectively, \(j\) denotes the \(j\)-th layer of the network, and \(\varphi\) represents the output of the convolution network. The content loss gives the generated image a better visual effect, but a large loss weight produces texture that does not conform to the original image, so the weight of the content loss needs to be reduced in the later stage of training.

The multi-scale structural similarity loss function [34] is also introduced. Combining the structural similarity loss with the \(L_{1}\) loss balances the brightness and color of the image, making the restored image more detailed. The function is expressed as follows:

$$L_{MS\text{-}SSIM} \left( P \right) = 1 - MS\text{-}SSIM\left( \widetilde{p} \right),$$
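The normalization in Eq. (8) can be checked on toy feature maps. The sketch takes the features \(\varphi_{j}\) as given nested lists instead of extracting them with VGG-19; the function name `feature_loss` and the values are assumptions made for illustration.

```python
def feature_loss(phi_hat, phi_ref):
    """Eq. (8): squared L2 distance between feature maps, divided by C*H*W.

    phi_hat, phi_ref: features of shape [C][H][W] as nested lists.
    """
    c, h, w = len(phi_hat), len(phi_hat[0]), len(phi_hat[0][0])
    squared = sum((a - b) ** 2
                  for ch_hat, ch_ref in zip(phi_hat, phi_ref)
                  for row_hat, row_ref in zip(ch_hat, ch_ref)
                  for a, b in zip(row_hat, row_ref))
    return squared / (c * h * w)


# Toy features with C = 2, H = 1, W = 2.
phi_hat = [[[1.0, 2.0]], [[3.0, 4.0]]]
phi_ref = [[[1.0, 0.0]], [[1.0, 4.0]]]
content = feature_loss(phi_hat, phi_ref)
```

The squared differences sum to 8.0 over 4 elements, so `content` is 2.0. Because of the division by \(C_{j} H_{j} W_{j}\), the loss is a per-element average and stays comparable across layers of different size.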

(9)

where \(MS\text{-}SSIM(\widetilde{p})\) is the SSIM computed on images at different resolutions after scaling, which gives better results than a simple SSIM loss.
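The "1 minus similarity" form of Eq. (9) can be illustrated with a deliberately simplified, single-scale SSIM. The sketch below is not MS-SSIM: it uses whole-image statistics instead of sliding windows and a single resolution. The constants 0.01 and 0.03 are the values commonly used for SSIM, and the function names and toy pixel values are assumptions.

```python
def global_ssim(x, y, dynamic_range=1.0):
    """SSIM computed from whole-image statistics (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2       # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2       # stabilizes the contrast term
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


def ssim_loss(x, y):
    """Single-scale analogue of Eq. (9): 1 - SSIM."""
    return 1.0 - global_ssim(x, y)


image = [0.1, 0.4, 0.7, 0.9]            # flattened toy image in [0, 1]
inverted = [1.0 - v for v in image]
```

The loss is essentially zero when an image is compared with itself and exceeds 1 for the inverted image, whose structure is anti-correlated with the original. A multi-scale version would repeat the comparison on progressively down-scaled copies of both images.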

### Training and testing procedures

Because the mural image dataset is small, the batch size used in training affects the training results. The batch size is therefore set to 5, each batch having 3000 data, and num_workers is set to 16 so that the batch data for the next iteration are preloaded into memory.

The specific algorithm steps are listed in Table 1.