# CAMFuzz: Explainable Fuzzing with Local Interpretation – Cybersecurity

#### ByJi Shi, Wei Zou, Chao Zhang, Lingxiao Tan, Yanyan Zou, Yue Peng and Wei Huo

Sep 1, 2022

Overview. The overall architecture of our framework is shown in Fig. 1. We have four core modules: Model Training, Explanation, Node Selection, and Magic Value Identification. The term “Magic Value” in this paper not only means the signature of a file format, but it also represents constant values(e.g., enumeration type) in other fields. The core idea of our fuzzer is to utilize the local explanation technique in XAI to recognize critical input bytes related to code blocks and guide the fuzzer to focus on a worthy part of the input to mutate first. Since a neural network is the premise of explanations, we need to train a model to map an input to the coverage map. Besides, we use a Node Selection module to help the fuzzer focus on the more critical code, which may bring new coverage. Finally, we use the Magic Value Identification module to extract particular values that will guide the fuzzer on how to fuzz during mutation.

Training. We aim to train a model that can simulate the behavior of a program: given a seed input, the model can predict the code coverage map. We use a classification model to achieve the goal. First, we collect the seeds and their corresponding coverage map to preprocess. Then we train a convolution neural network with global average pooling (GAP) instead of a fully connected layer next to the last convolution layer.

Local Explanation. The local explanation is the core module we use to locate critical positions of the input to mutate. Here we consider the critical positions as the part of the input, which could unlock unvisited CFG nodes with mutations. In this module, we use the GAP layer to help us in interpretation. GAP brings the ability to localize an object in the input (Zhou et al. 2016). In the fuzzing task, since we have trained a model to map the input (similar to pictures in computer vision) to the coverage map (similar to labels), our intuition is to feed the explanation module with a block in the coverage map. Then the module may guide us with positions which have a tight connection with it. Once the input positions are known, we can guide the fuzzer to focus on mutations to this part with higher priority.

Node Selection. In the Local Explanation module, we intend to feed the model with a code block. But which block should be chosen first? We have two considerations when choosing the code to explore:

(i):

Many nodes in the CFG have not been visited, and we cannot choose them directly because the explanation module does not understand how to explain this “label” since the trained model does not know this feature.

(ii):

Some CFG nodes are hard to reach in fuzzing. For example, in a call to malloc with a fixed size, the failure branch may never be reached in fuzzing since the memory might be big enough to ensure the malloc a success. Focus on these hard-to-reach nodes first will waste time.

Based on the above considerations, our idea is to feed the model with visited nodes. These nodes should have more unvisited code blocks connected to them and are easier to reach. We have two main steps to select code to be explored.

The first step is inspired by CollAFL. We analyze the CFG of the program and choose the nodes which directly connect unvisited child nodes. Then we initialize them with weights depending on how many untouched child nodes they have. The nodes chosen in this step are visited ones, but they have unvisited nodes directly connected to, which means mutation to input bytes related to this code may affect the conditional branch of the node. Therefore we are more likely to reach a new path.

The second step is to consider the difficulty of reaching the code. During the mutation process by local explanation, we found some nodes are hard to be explored. It means even if we mutate the positions extracted by local interpretation many times, we still cannot cover its directly connected children nodes. We apply the Dynamic Weight Adjustment (DWA) algorithm to decrease the weight of these nodes to focus on easier-to-reach nodes first.

Magic Value Identification. Since we have figured out which code blocks should be explored and which bytes of the input have a strong relationship with the blocks, we use a lightweight static analyzer to identify if there are constant values we can extract to help the fuzzer know how to fuzz, for instance, file signature, enumeration value, and loop count.

### Training

#### Data preprocess

Data preprocessing is fundamental in the training process as the training data quality will directly affect the performance of the model (García et al. 2015). We need to collect two data types from the original fuzzing progress: seed file and coverage map. For the seed file, since we cannot clarify every file structure, we treat the input as raw bytes, the value of each byte is from 0 to 255. For the coverage map, we use binary instrumentation to obtain the map. If a specific block is covered, we set it to 1, otherwise 0.

We describe the algorithm of preprocessing in Algorithm 1 . We first figure out the largest file in the training data, set the file size with (S_{max}), then pad the rest of the data to (S_{max}) with byte 0x00. Note the sequence of line 5 and line 6 in Algorithm 1. After padding the file, we run the program because the code coverage may be different while parsing the padded file and original file.

After running all the padded files and get code coverage, we then prune the code coverage using the under-sampling method and directly discard some coverage labels which has few seeds covered. This step is essential because the original code coverage is usually imbalanced. We choose readelf as a target to illustrate, we randomly choose 120 ELF files as seeds, then run readelf with the seeds. The total block number we traced is 4001. We then draw a histogram of code coverage distribution in Fig. 3. The X-axis indicates the intervals that show the count of test cases. The max value is 120, the same as the total amount of the seeds. The Y-axis indicates the count of distinct blocks. Before preprocessing, the sum of all the columns in Fig. 2a equals to 4001, which is the total number of basic blocks covered by the seeds. In Fig. 2a, we observe that the distribution data is hugely imbalanced. At the very left of Fig. 2a, There are almost 1600 blocks covered by a small number (about less than 10) of cases. At the right of Fig. 2a, some nodes are covered by almost all cases. It is easy to uncover the reason: some common blocks can be covered by most of the seeds while some of the unique seeds may cover a number of rare blocks.

Imbalanced class distribution data will negatively affect the training model. There are many studies to deal with such problem, such as Castillo et al. (1997) and Provost (2000). We follow the basic principles to process the imbalanced data: we use the under-sampling (Barandela et al. 2004) approach to reduce the data in majority classes and discard some minority classes data to rebalance the training data.

Figure 2b shows the distribution of the coverage after being pruned. The data is more balanced than those in Fig. 2a and the training model will benefit from the pruning process.

#### Model

We use a convolution neural network (CNN) model in the training process, and the model is used to map seed input to coverage map. That is to say, given a well-trained model f and a corresponding input (consists of bytes (x_1,ldots x_ildots ,x_n)), the model can predict if a node (cov_i) can be covered or not.

Denote that (x_i) is between [0x0, 0xFF] and (cov_i) is the probability which is between [0, 1].

begin{aligned} fleft( frac{x_1}{255},ldots frac{x_i}{255}ldots ,frac{x_n}{255}right) = (cov_0,ldots ,cov_i,ldots ,cov_m) end{aligned}

(1)

Here we detail three critical points in our model.

(i):

We use a GAP layer instead of a fully connected layer.

(ii):

We set the shape of our feature map produced by the last convolution layer to 128.

(iii):

As we are clear that the count of code blocks may be huge when we run the seeds. To improve efficiency, we design a segmental training algorithm shown in Algorithm 2.

In this algorithm, we firstly count the preprocessed nodes and sort them by the number of occurrences. Then for each turn, we select the top window of labels to train the model. Note that the sort procedure is essential since each turn, there might be new edges after sorting, which will be added to the tail, and these features can be trained later.

We choose CNN as our model for the following two reasons: First, a CNN is usually used to do classification tasks. For example, given several panda pictures, a CNN automatically learns the hidden feature in these pictures. Later, when we feed the model with a new image, the model may tell us whether there is a panda in the picture. In our task, given a seed input, we want the model to predict whether a specific node can be covered or not. Both of these two scenarios are classification tasks.

Second, a CNN can easily filter out the background information in the target input. The model will focus on the main object by using the convolution and pooling operation, which means the background has less effect on the final decision. That is very similar to program analysis: the input file usually consists of metadata and data. Mutations to metadata will significantly influence the code coverage, whereas mutating data may bring less coverage variance. CNN has been demonstrated its ability to filter out the background in the picture. We utilize this characteristic in fuzzing: filter out useless parts of data and focus on those which may have a significant influence on code coverage.

### Local interpretation

Local interpretation helps us understand why the model gives the decision in classification tasks, especially which part of input contributes to the prediction. In the fuzzing progress, we can collect a large amount of training data, which are input files and the corresponding covered code blocks. We can train a model to map the input to visited code blocks. After the model is well trained, we provide a block that is visited but has unvisited nodes connected. We expect local interpretation to tell us which bytes are highly related to this block, then by mutating the input bytes instructed by the interpretation, we have a higher possibility to reach those unvisited nodes.

Note that, although the model only learns the mapping between input bytes and visited blocks, it can report input bytes related to visited blocks, including those related to the entry condition and the exit condition of the visited block. Mutating input bytes related to the entry condition may lead the program to explore the sibling blocks of the visited block, while mutating input bytes related to the exit condition may lead the program to explore children blocks of the visited block, including those unvisited children.

If we know which byte is the variable related to the branch, we can reach the unvisited branch by mutating the enumeration bytes. For example, the header parser function will generally process the header part of the input file rather than the data section. If we feed the local interpretation module with a block in this function, this block has unreached nodes connected. It will guide the fuzzer to mutate bytes in the header to the unvisited blocks covered.

We will review one of the local explanation techniques named CAM (Zhou et al. 2016). The core ideas behind CAM are:

(i):

Use a GAP layer to replace the fully connected layer.

(ii):

Take advantage of partial information in the feature map produced by the last convolution layer behind GAP and the weight produced by GAP to generate localization information.

We will look at how the localization information is calculated: suppose we have an input with shape (hw). The last convolution layer locates at (L_{th}) layer in the network, and its output shape is (mnk), where k means the number of features, (m,n) means the shape of one feature map. The GAP accepts the (mnk) feature maps and produces (1, 1, k) values which represent the global average value of each feature in the (L_{th}) layer. (w_i^c) is the weight regarding class c. The final score, S, of class c is calculated as:

begin{aligned} S=sum _{i=1}^kF_i*w_i^c end{aligned}

(2)

In this formula, (F_i) is the feature map produced by the (L_{th}) layer. Note that after the CAM is calculated, the size of the heatmap is (mn), which means we should restore this map to the shape of the original input. The general way is to upsampling the CAM to the input size, which is the final heatmap. In our framework, we design the network and set the size of the last convolution layer as 128 of the input size. We have the following considerations when we choose the size. Foremost, we do not have to keep the shape of the feature map the same as raw input since only parts of the input significantly influence the code coverage. Simultaneously, if we keep the original size, there would be lots of convolution calculations, which would cause the training of the network to slow down.

### Node selection

By using the local interpretation, we can guide the fuzzer on the part of the input strongly related to the given code. Here we aim to use the Node Selection to choose the more valuable code as an explanation target.

Consider the sample code in Fig. 3a and its CFG in Fig. 3b. The buggy function will be called if the initialization of object o fails. In the CFG, black nodes are covered while white ones have not been covered yet. Node B, D, and G will be called if the malloc fails, Node F represents the buggy function call. We should focus on C more than A because it is almost impossible to cover B, D, and G, although more new nodes are connected to A than C.

Initial Selection. We will first choose those nodes with more untouched nodes connected through initial selection. As general knowledge, a trained model can only map and explain the features that it has learned. For those unknown features, the model cannot do anything. That is, we cannot use the model to help us increase the code coverage if we feed the model with unvisited nodes.

To solve the gap, we select the nodes which are near unvisited ones. In Figure 3b, we will choose nodes A and C (untouched node B directly connected to A and F directly connected to C). Inspired by CollAFL, we first count the unvisited nodes connected to A and C. Then assign them with the initial weight based on the number of untouched nodes.

Dynamic Weight Adjustment (DWA). By using the DWA algorithm, we intend to decrease the weight of the hard-to-reach code. Considering the code snippet in Fig. 3a and its CFG in Fig. 3b. Nodes B, D, and G represent lines 3-5, and the code will be executed if malloc fails. Generally, almost all the cases will succeed in malloc operation during fuzzing. Thus Node B, D, G are very hard to be touched. If we only use the initial weight to evaluate the importance of the node, we may waste the power on useless nodes. So our solution is to decrease the weight of node A dynamically even it has more untouched child nodes at the beginning.

Suppose we have N cases that can cover node A and node C at the beginning. The initial weights for these two nodes are 3 and 2. Choosing a case (N_i) to mutate, we feed the interpretation module with node A and get the related positions in the input. If node B is not covered after the fuzzer mutates the positions, we will subtract the current weight with (frac{1}{N}), then the weight of node A will be updated as (3-frac{1}{N}). As fuzzing goes on, the weight of node A will be decreased during mutation. Note that the N keeps the same during one iteration, which means it will only change after the model is retrained.

### Magic value identification

We augment the original mutator of AFL with external magic values embedded in CFG nodes. The term “magic value” in this paper not only means the signature of a file format, but it also represents constant values(e.g., enumeration type) of other fields in the file. In our design, magic values can be extracted in two ways: constant values and switch cases. For constant values, we implement a static analyzer similar to Vuzzer (Rawat et al. 2017). For switch cases, since the assembly code will use a jump table rather than cmp instructions, we design a method to identify switch cases in the assembly code and extract the magic values. Different as REDQUEEN (Aschermann et al. 2019b): they try to find subtractions to identify switch-case, which can be improved by a more generalized way to extract case values in switch-case.

Here we detail the idea on how to extract case values embedded in switch-case. First, choose the nodes which have more than three successors since we consider switch-case usually has more than three branches. Then, convert the code in the node to intermediate representation and find the base address of the jump table. At last, for each successor of this node, calculate the difference between the offset of the successor and the offset of the base address of the jump table. The results are the case values, which can guide the fuzzer on how to mutate. The extraction algorithm is shown in Algorithm 3.

From line 18 to line 21 in Algorithm 3, instead of just using the constant values directly, we set a threshold to mutate sufficiently. The reason behind this is that if the value represents a loop count or a size, we may want to mutate the value around.

## Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.