In this section, we introduce an NLP-based anomaly detection framework for the post-analysis of LTTng traces. It is designed to help developers efficiently find the root causes of abnormal behaviors in microservice environments. We aim to provide a general framework applicable to microservice-based applications with different settings.

Figure 1 presents the architecture of our approach along with its three main modules, i.e., the tracing module, the data extraction module, and the analysis module. We discuss this architecture in detail in the following subsections.

Fig. 1

The architecture of our proposed anomaly detection method for microservice environments

Tracing module

Tracing is an efficient way of collecting information from a system for later analysis and debugging while keeping the monitoring overhead low. Distributed tracing extends traditional tracing so that it can be employed within distributed systems. Distributed tracing technologies provide a view of the communication among microservices [6]. Microservices mostly use Representational State Transfer (REST) to communicate with other microservices.

We aim to provide a general anomaly detection framework that can be easily applied to any microservice-based application in practice and subsequently leads to the discovery of the causes of the identified anomalies. We describe how to analyze an application and prepare its associated dataset, instead of relying on pre-existing datasets, which do not inherently contain the information needed to extract spans and their associated sequences of events.

As our case study, we created our dataset by tracing distributed software developed at Ciena Corporation. The developers of this company deliver many new releases of this software every day. Thus, we collected traces from different releases to compose the dataset. We denote the set of all traces collected from different releases as Γ={T1,T2,…,Tn}, where n indicates the number of collected traces.

Figure 2 illustrates the structure of our tracing module, which makes use of the LTTng open-source tool. As presented in this figure, LTTng is deployed on each node to send the tracing data to the manager. The LTTng-relayd daemon running on the manager collects the tracing data received from the nodes. Later, Trace Compass integrates the traces obtained from the different nodes to form a trace Ti={e1,e2,…,eg(Ti)}, where g(Ti) is the number of events associated with Ti. Thus, Ti is represented as an enumerated collection of events sorted by their timestamps.

Fig. 2

The overview of our distributed tracing module

During the execution of a microservice application, many tasks, or spans, such as opening a web page, are performed. In fact, a trace can be divided into a set of spans, where each span consists of a sequence of events that are invoked in a specific order to perform the desired task. It should be noted that spans cannot be directly retrieved using LTTng. In the following, we discuss in detail how to extract spans from tracing data.

Data extraction module

We implemented the data extraction module within the Trace Compass open-source tool, which offers scripting capabilities [40] and visualization mechanisms that support our analysis. LTTng generates a CTF (Common Trace Format) file for every node in the microservice environment. CTF is a file format optimized for the production and analysis of large tracing data [7]. After generating the CTF files, Trace Compass reads these files and integrates them into a trace Ti, where i indicates the index of this trace in Γ. The result of this process is an enumerated collection of events sorted by their timestamps.

An event is composed of well-defined fields that are common to all events, such as a name, a timestamp, and a process ID. However, the delivered sequence of time-ordered events does not provide the spans that reflect separate tasks. In order to extract spans and their sub-spans, Γ is scanned with respect to the tags of request/response events. The other events are then processed so that each event is assigned to the span it belongs to. In our framework, events are stored by means of their associated keys, composed of the name of the event and its arguments.

In order to train a model that is able to detect performance anomalies as well as release-over-release degradations, a massive training dataset is required to cover as many normal patterns of keys as possible. The training data Γ correspond to traces obtained from the execution of previous stable releases of an application. Figure 3 summarizes how such a dataset is created. After collecting n different traces, each of them is processed so that all the spans associated with each trace are individually extracted from Γ. Next, for each span, its sequence of events is collected and stored in Si, for i=1,…,m. In our framework, each sequence Si is represented by its corresponding keys κ_i^1,…,κ_i^{h(Si)}, where κ_i^k represents the k-th key in the sequence Si, and h(Si) indicates the length of sequence Si.

Fig. 3

Illustration of the process for creating the training dataset from multiple traces

Extracting spans

In the following, we describe how spans are extracted from an LTTng trace. LTTng uses tracepoints, hook mechanisms placed in the code that can be activated at runtime to record information about a program's execution. Tracepoints are placed inside the code by developers to extract useful information without requiring in-depth knowledge of the code. Hence, one can expect to encounter different event types in trace data, indicating the beginning or the end of a span, or any other operation.

Requests and responses are the two types of events we consider for extracting spans. Each span starts with a request and ends with a response, and the request and response associated with a span carry the same tag. For example, a request with tag 00 indicates the start of a span, whereas a response with the same tag marks its end. Moreover, many sub-spans may be generated during a span's lifetime, since a service may communicate with other services to answer a demand. Like spans, sub-spans are delimited by a request and a response that share the same tag. In addition, the tag of each sub-span embeds its parent's tag; for example, 00/01 indicates a sub-span whose parent is represented by the 00 tag. As shown in Fig. 4, each span and its sub-spans form a tree. Equivalently, each span can be displayed as a sequence of requests and responses sorted by their timestamps. In the example of Fig. 4, this sequence would be S={Req,Req,Resp,Req,Resp,Resp}.

Fig. 4

The structure of a span and its sub-spans in a distributed trace
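The tag-based extraction described above can be sketched in a few lines. The event structure below (dictionaries with `type`, `tag`, and `ts` fields) is a simplified stand-in for real LTTng event payloads, not the actual trace schema.

```python
# Minimal sketch: group request/response events into spans by their
# root tag. Sub-spans such as "00/01" embed the parent tag "00", so
# splitting on "/" recovers the span each event belongs to.

def extract_spans(events):
    """Return {root_tag: [event types sorted by timestamp]}."""
    spans = {}
    for ev in sorted(events, key=lambda e: e["ts"]):
        root = ev["tag"].split("/")[0]      # parent span tag
        spans.setdefault(root, []).append(ev["type"])
    return spans

events = [
    {"type": "Req",  "tag": "00",    "ts": 1},
    {"type": "Req",  "tag": "00/01", "ts": 2},
    {"type": "Resp", "tag": "00/01", "ts": 3},
    {"type": "Req",  "tag": "00/02", "ts": 4},
    {"type": "Resp", "tag": "00/02", "ts": 5},
    {"type": "Resp", "tag": "00",    "ts": 6},
]
print(extract_spans(events)["00"])
# ['Req', 'Req', 'Resp', 'Req', 'Resp', 'Resp']
```

The printed sequence matches the one read off the span tree of Fig. 4.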

Construction of sequences of keys

In addition to requests and responses, many other userspace and kernel events happen during each span. After collecting all spans, all events in Γ are processed and assigned to the span to which they belong. The appropriate span for each event is found by comparing the event's arguments (e.g., TID and PID) with the arguments of the events that have already been assigned to the spans. Once the appropriate span is identified, the event is placed in its sequence according to its timestamp. In the example of Fig. 4, if an event happens right after the first request, the resulting sequence becomes S={Req,Event,Req,Resp,Req,Resp,Resp}. This process is repeated for all events, so that a set containing all sequences is obtained, where each sequence refers to a span.
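The event-to-span assignment step can be sketched as follows. The matching is reduced to a single shared TID and the field names are illustrative assumptions; the real framework compares several arguments.

```python
import bisect

def assign_events(spans, events):
    """Route each general event to the span whose TID matches and
    insert it into that span's (ts, label) sequence by timestamp.
    spans: {tag: {"tid": int, "seq": [(ts, label), ...]}}, seq sorted.
    """
    for ev in events:
        for span in spans.values():
            if span["tid"] == ev["tid"]:
                bisect.insort(span["seq"], (ev["ts"], ev["name"]))
                break
    return {tag: [label for _, label in s["seq"]] for tag, s in spans.items()}

spans = {"00": {"tid": 7, "seq": [(1, "Req"), (2, "Req"), (3, "Resp"),
                                  (4, "Req"), (5, "Resp"), (6, "Resp")]}}
events = [{"tid": 7, "ts": 1.5, "name": "Event"}]
seq = assign_events(spans, events)["00"]
print(seq)
# ['Req', 'Event', 'Req', 'Resp', 'Req', 'Resp', 'Resp']
```

As in the text's example, an event with a timestamp right after the first request lands in the second position of the sequence.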

The previous paragraph explained how sequences are extracted from a trace. We now explain how the arguments of events are used. Whenever a specific tracepoint is encountered at runtime, an event is produced with arguments such as a name, a timestamp, and possibly many others. Event arguments such as the process name, message, and event type contain important information that increases detection quality.

The scope of this work is limited to the arguments that are common to all events. In our experimental traces, the event name, process name (procname), thread ID (TID), process ID (PID), timestamp, message, and event type are present in most events. We divided the events into two categories: 1) requests/responses, and 2) other events. Table 1 lists the arguments we selected for each category of events. The key for requests and responses is created using the name, type, tag, and procname arguments. The event type specifies whether the event is a request or a response, and the tag specifies the span or sub-span to which the event belongs. The second category, general events, uses the event name, procname, and message arguments to compose the keys. Thus, the resulting keys are all textual strings, and V={v1,v2,…,vd} denotes the set of all possible unique keys.
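Key construction per Table 1 amounts to concatenating the selected arguments per event category. The field values and the `:` separator below are illustrative assumptions.

```python
def make_key(event):
    """Build a textual key: requests/responses use name, type, tag,
    and procname; general events use name, procname, and message."""
    if event["type"] in ("Req", "Resp"):
        parts = (event["name"], event["type"], event["tag"], event["procname"])
    else:
        parts = (event["name"], event["procname"], event["message"])
    return ":".join(parts)

req = {"name": "http_request", "type": "Req", "tag": "00", "procname": "gateway"}
ev  = {"name": "sched_switch", "type": "other", "procname": "gateway",
       "message": "next=worker"}
print(make_key(req))   # http_request:Req:00:gateway
print(make_key(ev))    # sched_switch:gateway:next=worker
```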

Table 1 The categories of events and the arguments used by our framework

Although simple, this key construction scheme implies that extending our framework with a new argument may require a much larger dataset, depending on the number of values that argument may take. To illustrate, suppose we use only one argument to create keys, and that this argument has β1 different values. In this case, only β1 unique keys are created (d=β1). If another argument with β2 different values is then used, β1×β2 unique keys are obtained (d=β1×β2). Thus, each time a new argument with βi different values is considered, the number of unique keys is multiplied by βi.
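The multiplicative growth argument above in code form; the value counts are made-up numbers for illustration.

```python
from math import prod

def key_space_size(value_counts):
    """d = beta_1 * beta_2 * ... for the chosen arguments."""
    return prod(value_counts)

print(key_space_size([12]))      # one 12-valued argument: d = 12
print(key_space_size([12, 5]))   # adding a 5-valued argument: d = 60
```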

Analysis module

In microservice environments such as our experimental application, events are expected to occur in a particular order. Consequently, the keys in the sequences obtained by the data extraction module follow specific patterns and grammar rules similar to the ones found in natural languages. Hence, only a few keys can appear as the next key in a sequence following a specific set of keys. The training dataset in our experiments includes normal sequences of keys obtained from previous stable releases of the application. In this section, we review the machine learning model we propose to distinguish normal patterns from abnormal ones. We adopted an LSTM network to model this sequence-to-word problem, given its success in text prediction and other similar natural language processing tasks. This model learns which keys are probable at time t according to the previously observed sequences of keys. Later, in the detection phase, the model determines which events in a sequence do not conform to the normal patterns.

We modeled the anomaly detection problem on our sequences of keys as a multi-class classification problem for which the input length α is fixed. Note that the sequences obtained by the data extraction module are of different lengths. Multiple sub-sequences of fixed size are hence obtained by sliding a window of size α over the larger sequences. It should be noted that sequences smaller than α are very rare in our dataset. These small sequences are related to light operations that are rarely prone to performance anomalies. Consequently, they are simply ignored by the analysis module. Let V={v1,v2,…,vd} be the set of all possible unique keys, where each key vi defines a class. From a sequence of size h(Si), h(Si)−α subsequences are analyzed. Thus, for each sequence Si, the input of the model is denoted by X_i^j = {κ_i^j, κ_i^{j+1}, …, κ_i^{j+α−1}} and the output is expressed by Y_i^j = κ_i^{j+α}, where j ∈ {1, …, h(Si)−α}. Sequences represent a part of a task's execution path in which keys happen in a particular order. Hence, for each X_i^j, Y_i^j can only take a few of the d possible keys from V and depends on the sequence X_i^j that appeared before it. In other words, the input of the model is a sequence of α recent keys, and the output is a probability distribution over the d keys from V, expressing the probability that the next key in the sequence is v_r ∈ V. Eventually, training produces a model of the conditional probability distribution Prob(κ_i^{j+α} = v_r | {κ_i^j, κ_i^{j+1}, …, κ_i^{j+α−1}}), v_r ∈ V. Figure 5 shows an overview of the described anomaly detection model.

Fig. 5

The overview of our anomaly detection model
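The windowing step described above can be sketched directly: a sequence of length h(Si) yields h(Si)−α (input, target) pairs, and sequences no longer than α are discarded. The key names below are placeholders.

```python
def make_samples(sequence, alpha):
    """Slide a window of size alpha over the sequence, producing
    (X, Y) pairs where X is the window and Y is the following key."""
    if len(sequence) <= alpha:
        return []                    # short sequences are ignored
    return [(sequence[j:j + alpha], sequence[j + alpha])
            for j in range(len(sequence) - alpha)]

seq = ["k1", "k2", "k3", "k4", "k5"]
samples = make_samples(seq, 3)
for x, y in samples:
    print(x, "->", y)
# ['k1', 'k2', 'k3'] -> k4
# ['k2', 'k3', 'k4'] -> k5
```

With h(Si)=5 and α=3, exactly h(Si)−α=2 subsequences are produced, as in the text.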

An LSTM network is employed to learn the probability distribution Prob(κ_i^{j+α} = v_r | {κ_i^j, κ_i^{j+1}, …, κ_i^{j+α−1}}) that maximizes the probability of the training sequences. The architecture of this LSTM network is shown in Fig. 6. Each layer contains α LSTM blocks, where each block processes one key of the input sequence. An LSTM block maintains a cell state vector C and a hidden vector H, and both values are passed to the next block to initialize its state. The values of the input κ_i^q and of H_i^{q−1}, for q ∈ {j, j+1, …, j+α−1}, determine how the current input and the previous output affect that state. They indicate how much of C_i^{q−1} (the previous cell state) is retained in the state C_i^q, and they also influence the construction of the output H_i^q. Our deep LSTM neural network architecture includes two hidden layers, in which the hidden state of each block in the previous layer is used as the input of the corresponding LSTM block in the next layer.

Fig. 6

The architecture of the LSTM network we used in our anomaly detection framework
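To make the role of C and H concrete, here is a toy numpy sketch of a single LSTM block's forward step; the weight shapes, stacked-gate layout, and random inputs are illustrative assumptions, not our actual network or its trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM block: gates computed from the current input x and the
    previous hidden state h_prev decide how much of c_prev is kept in
    the new cell state c and how the output h is built."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # all four gates, stacked
    f = sigmoid(z[:n])                  # forget gate
    i = sigmoid(z[n:2 * n])             # input gate
    o = sigmoid(z[2 * n:3 * n])         # output gate
    g = np.tanh(z[3 * n:])              # candidate cell state
    c = f * c_prev + i * g              # keep part of the old state
    h = o * np.tanh(c)                  # build the block's output
    return h, c

rng = np.random.default_rng(0)
d_in, n = 4, 3                          # toy input and hidden sizes
W = rng.normal(size=(4 * n, d_in))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(5, d_in)):    # a sequence of 5 key embeddings
    h, c = lstm_step(x, h, c, W, U, b)  # C and H flow to the next block
print(h.shape)                          # (3,)
```

In the full network of Fig. 6, α such blocks form each layer, and a second layer consumes the hidden states of the first.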

During training, appropriate weights are learned so that the final output of the LSTM provides the desired key. The categorical cross-entropy loss [41] is used as the loss function for this multi-class classification task. A standard multinomial logistic (softmax) function is then applied to translate the last hidden state into the probability distribution Prob(κ_i^{j+α} = v_r | {κ_i^j, κ_i^{j+1}, …, κ_i^{j+α−1}}), v_r ∈ V.
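The softmax translation and the cross-entropy loss mentioned above are simple to compute directly; the logits below are made-up numbers standing in for the projected last hidden state of the network.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into a probability distribution over the d keys."""
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(p, target_index):
    """Loss is the negative log-probability assigned to the true key."""
    return -np.log(p[target_index])

logits = np.array([2.0, 0.5, -1.0])   # scores for d = 3 keys (illustrative)
p = softmax(logits)
loss = categorical_cross_entropy(p, 0)  # suppose the true next key is v_1
print(p.sum())                          # probabilities sum to 1
```

A lower loss means the model assigned more probability to the key that actually followed the window.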

In the detection phase, the trained model is used to analyze unseen tracing data. Such a trace can be obtained from an old or a new release of the software. As in the collection of the training data, spans are extracted and then converted into sequences of different lengths. From a sequence of size h(Si), h(Si)−α subsequences are obtained, and h(Si)−α probability distributions are predicted. The model predicts the probability distribution Prob(κ_i^{j+α} | {κ_i^j, κ_i^{j+1}, …, κ_i^{j+α−1}}) = {v_1: p_1, v_2: p_2, …, v_d: p_d}, where p_r describes the probability of v_r appearing as the next key value. Then, κ_i^{j+α} is marked as an unexpected key if the probability of its actually observed value is less than the confidence threshold of 0.5.
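The detection rule above amounts to a one-line check against the predicted distribution; the distribution and key names below are fabricated for illustration.

```python
def is_unexpected(prob_dist, observed_key, threshold=0.5):
    """Flag the observed key as unexpected when the model assigned it
    less probability than the confidence threshold."""
    return prob_dist.get(observed_key, 0.0) < threshold

# Hypothetical model output over d = 3 keys for one window.
prob_dist = {"v1": 0.7, "v2": 0.2, "v3": 0.1}
print(is_unexpected(prob_dist, "v1"))   # False: the key conforms
print(is_unexpected(prob_dist, "v3"))   # True: flagged as anomalous
```

Keys absent from the distribution default to probability 0 and are therefore always flagged.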

