On one hand, identifying an endpoint application passively from network traffic requires no interaction with the endpoints. If accurate passive identification is possible, it could be the most beneficial method in the context of an NGFW. With passive identification, the protection scope is also the greatest: it encompasses all endpoint types including mobile devices as well as desktop and server machines.
On the other hand, passive identification exposes a security issue: if the NGFW can passively identify an endpoint application from network traffic, so can anyone else with visibility into the traffic. With the same information that the NGFW uses to protect vulnerable endpoints, a malicious entity can likewise hunt for vulnerable targets. The security foundation of an NGFW should not conflict with the overall security of the endpoint.
To perform passive identification as described in this section, an NGFW needs to be able to perform deep packet inspection on the addressed network protocols. Different network protocols have different issues regarding false positive and false negative identifications. As an example, many network protocols have the ability to transport files, and files are one source of potential false positive and false negative identifications. However, different NGFW vendors have different levels of support for extracting and analyzing files from different network protocols. Moreover, vendors do not fully disclose this kind of vendor-specific information. Therefore, in this paper, we do not examine which protocols different NGFW vendors are able to perform deep packet inspection on, or to what depth. This level of detail is out of the scope of this study.
In this section, we consider the current research on passive endpoint application identification from the network traffic and the existing methods. In section “HTTP”, we briefly consider identifying the endpoint application based on the User-Agent string in the HTTP header. In section “TLS”, we take a deeper look into the current research on identifying the endpoint application from TLS traffic. In section “QUIC”, we take a look at the QUIC protocol. In section “Other protocols”, we consider the current research regarding other protocols. In section “Hash fingerprinting”, we consider the recent development in the field of endpoint application identification from network protocols using hash fingerprints. Finally, in section “Mobile application”, we survey the current research on mobile application identification.
Since the context of this study is to find ways for providing endpoint-aware security for vulnerable endpoint applications, using the User-Agent string in the HTTP header for identifying the browser application (Mozilla Foundation 2020a; Microsoft Corporation 2020a; Google 2020a) can be considered a sufficient method for identifying the endpoint application from plain HTTP. Although various browsers allow spoofing the User-Agent field (Microsoft Corporation 2020b; Google 2020b), this can be considered an intentional attempt to evade security, which is not in the scope of this study.
The most popular browser, Google Chrome (Wikimedia Foundation 2020; Net Applications 2020; StatCounter 2020), has expressed plans to freeze the User-Agent string (Google 2020c). This means removing platform- and version-specific information from the User-Agent but leaving the browser-specific information. It remains to be seen how this will affect the content of the User-Agent field on other platforms.
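The User-Agent-based identification described above can be illustrated with a minimal sketch. It assumes plain HTTP/1.x requests. The function names and the token patterns are illustrative assumptions rather than part of any cited work, and the patterns cover only a few common browsers.

```python
import re

# Order matters: Edge User-Agents also contain "Chrome/" and "Safari/",
# and Chrome User-Agents also contain "Safari/".
# These patterns are illustrative assumptions, not an exhaustive list.
BROWSER_PATTERNS = [
    ("Edge", re.compile(r"Edg(?:e|A|iOS)?/(\d+)")),
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
    ("Safari", re.compile(r"Version/(\d+).*Safari/")),
]


def extract_user_agent(request: bytes):
    """Return the User-Agent header value of a raw HTTP/1.x request, or None."""
    head = request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "user-agent":
            return value.strip()
    return None


def identify_browser(user_agent: str):
    """Map a User-Agent string to (browser, major_version), or None if unknown."""
    for browser, pattern in BROWSER_PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return browser, int(match.group(1))
    return None
```

Because the User-Agent can be spoofed, a result from such a lookup is only as trustworthy as the client that sent it.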
TLS is the most relevant subject for research due to its popularity. The latest reports from Google (2020d) show that, on average, over 80% of the pages loaded with the Chrome browser use HTTPS. Because of their wide usage, web browsers are also a major target for attackers. The number of published security vulnerabilities in 2019 for the Google Chrome browser was 177 (Google 2020e) and for Mozilla Firefox 105 (Mozilla Foundation 2020b).
Identifying the endpoint application from encrypted TLS traffic can be useful in the context of an NGFW, especially when considering routing or basic access control. When considering the issue of providing the best security for a vulnerable endpoint application, in most cases it can be assumed that the traffic can be decrypted by the NGFW and the tunneled traffic can be inspected as is. Some exploits, however, such as FREAK (miTLS Team 2020), Logjam (Adrian et al. 2015) and POODLE (The OpenSSL Project 2020), aim at vulnerabilities in the TLS implementation on the endpoint application. In this section, we will review the current research on identifying the application from encrypted TLS.
Identifying the network application from encrypted TLS traffic, especially in the case of HTTPS, has been extensively studied. A wide range of machine learning and deep learning techniques have been shown to be highly effective in classifying TLS traffic (McCarthy and Zincir-Heywood 2011; Rezaei and Liu 2019; Shbair et al. 2016). Although identifying the network application is not in the scope of this study, it should be noted that similar methods may be useful for identifying endpoint applications in future research.
Identifying the endpoint application from TLS traffic has not been as extensively studied as identifying the network application. Identification of the endpoint client application, most often the web browser, has been explored in some papers. A common element in almost all papers on TLS identification is the use of supported cipher suites, either alone or together with other identifiers, to identify the underlying endpoint application.
In an early study on SSL identification from 2007, Bernaille and Teixeira investigate identifying the application layer protocol inside an SSL tunnel based on the first few packets (Bernaille and Teixeira 2007). The method relies on the size of the first three encrypted application packets, taking into account that different encryption algorithms result in different packet sizes. The researchers used encrypted traffic of HTTP, POP, FTP, BitTorrent, and eDonkey. The accuracy exceeded 90% for all protocols except BitTorrent, which reached 78% accuracy.
The research in Bernaille and Teixeira (2007) only focuses on SSL, so its applicability to current network traffic using TLS is unclear and requires further research. Despite the research being partially outdated, it introduces a simple but interesting method for separating encrypted web traffic from encrypted non-web traffic. This can help an NGFW to separate HTTPS traffic generated by a web browser from other traffic that is tunneled over TLS, and thus to target its inspection features more accurately.
Husák et al. (2016) explore the method of mapping the list of supported cipher suites provided in the Client Hello message to the User-Agent in the HTTP message. To achieve this, the authors used two methods: first, they created a test TLS server for harvesting accurate results by collecting the User-Agent information from decrypted TLS connections, and second, they observed student network traffic and correlated HTTP and HTTPS connections initiated from the same endpoint roughly at the same time.
Husák et al. first observed that a small number of unique lists cover a great part of all TLS traffic: the top 31 of the observed 1598 unique lists cover over 90% of all traffic. They also found that several User-Agent strings often map to one list of cipher suites. For each list, the authors selected the most commonly matching endpoint application and endpoint OS. Using this method, the authors were able to identify at least one of the two with 95.4% accuracy. For only 4.6% of the observed TLS traffic, neither the endpoint OS nor the endpoint application could be identified. However, the endpoint application alone could not be identified for about 16% of the traffic, yielding about 84% accuracy for endpoint application identification with this method.
The disadvantage of the research by Husák et al. (2016) is the inaccuracy of the second method for mapping the cipher suite lists to the HTTP User-Agents: several client applications usually initiate HTTP and HTTPS connections from one client machine, which can lead to false correlations. Nevertheless, this method is very simple, and the results obtained in their study show good promise for this method to be easily applicable in the context of an NGFW. This method, however, provides no visibility into the server application.
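The dictionary approach described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that observations are already available as pairs of a cipher suite list and an application label derived from the User-Agent. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_dictionary(observations):
    """Build {cipher_suite_list: most common application label}.

    observations: iterable of (cipher_suites, label) pairs, where
    cipher_suites is the ordered list of suite IDs from the Client Hello.
    """
    counts = defaultdict(Counter)
    for suites, label in observations:
        # The order of the suites is kept, since it reflects client preference.
        counts[tuple(suites)][label] += 1
    return {suites: c.most_common(1)[0][0] for suites, c in counts.items()}


def identify(dictionary, suites):
    """Return the label for an observed cipher suite list, or None if unseen."""
    return dictionary.get(tuple(suites))
```

Since several User-Agent strings often map to one list, the lookup returns only the most frequent label, which is one source of the identification errors reported above.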
A similar, but more detailed, method for endpoint client application identification from network traffic alone is presented by Muehlstein et al. (2017). They explore the passive identification of the operating system, browser and network application from HTTPS traffic. They derive a detailed feature set from a TLS connection. It includes information on the TLS connection such as the supported cipher suites, compression methods and extension count, but also information on the traffic flow such as the number of packets and bytes and information on traffic bursts. This feature set accompanied by machine learning [Support Vector Machine (SVM) (Cortes and Vapnik 1995) with Radial Basis Function (RBF)] provides an identification accuracy of 96.06%. For generating the dataset, the authors used the Selenium web automation tool (The Selenium Project 2019).
The research focuses on TLS traffic on port TCP/443 and on the identification of browsers, with references to “Non-Browser” traffic for so-called Microsoft-Background traffic. Nearly 10% of the traffic in their test is produced by these non-browser endpoint applications. Because of the selected feature set, this method is applicable only after the whole connection has been processed. In the context of an NGFW, post-processing may be useful for traffic analyses such as network traffic reports, but it is not a feasible solution for inline security. Again, the research does not cover identifying the server application.
The QUIC protocol is an encrypted multiplexed stream transport protocol over UDP, originally designed by Google (Roskind 2020). The usage of the QUIC protocol on the internet is constantly growing. In Rüth et al. (2018), the researchers show that QUIC accounted for 2.6–9.1% of the traffic on the internet in 2017, with Google using QUIC for 42.1% of its traffic, and Sy et al. (2019) mention that approximately 7% of global internet traffic in 2018 was QUIC. In Sy et al. (2019) it is also observed that 186 of the Alexa Top Million sites (Alexa Internet Inc 2020) had QUIC support in 2018.
As the QUIC protocol was adopted by IETF and standardized in RFC 9000 (Iyengar and Thomson 2021), the term gQUIC became a common term for referencing the original protocol specification from Google. The IETF standardized version of QUIC utilizes TLS 1.3 inside the QUIC packets, and the use of HTTP inside the IETF standardized QUIC has been titled HTTP/3 (Bishop 2019).
Despite being an encrypted protocol, the gQUIC protocol introduces the client’s User Agent ID value in the unencrypted Client Hello. Shah (2018) finds that it is possible to identify the endpoint applications and operating systems from the network based on the gQUIC User Agent value. It is noteworthy that this field is an optional field, and most implementations do not include it in the Client Hello (Lastovicka et al. 2018). However, in the latest version of Chrome browser at the time of writing (86), the User Agent value is included in the unencrypted Client Hello, as seen in Fig. 2.
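To illustrate how a passive observer could read the User Agent ID, the following sketch parses a tag/value message and looks up the UAID tag. The message layout used here is an assumption based on the gQUIC crypto handshake format and should be checked against the specification. The assumed layout is a 4-byte message tag, a little-endian 16-bit entry count, two padding bytes, an index of 4-byte tags with 32-bit end offsets, and then the concatenated values. The helper `build_chlo` is hypothetical and only creates synthetic input for the demonstration.

```python
import struct


def build_chlo(entries):
    """Build a synthetic CHLO tag/value message (demonstration input only)."""
    index, data = b"", b""
    for tag, value in entries:
        data += value
        # Each index entry holds the tag and the END offset of its value.
        index += tag.ljust(4, b"\x00") + struct.pack("<I", len(data))
    return b"CHLO" + struct.pack("<HH", len(entries), 0) + index + data


def parse_tag_values(message: bytes):
    """Parse a CHLO message into {tag: value} under the assumed layout."""
    if message[:4] != b"CHLO":
        raise ValueError("not a CHLO message")
    (count,) = struct.unpack_from("<H", message, 4)
    data_start = 8 + 8 * count  # header (8 bytes) plus index (8 bytes per entry)
    values, start = {}, 0
    for i in range(count):
        tag, end = struct.unpack_from("<4sI", message, 8 + 8 * i)
        values[tag.rstrip(b"\x00")] = message[data_start + start:data_start + end]
        start = end
    return values


def extract_uaid(message: bytes):
    """Return the User Agent ID as text, or None when the optional tag is absent."""
    value = parse_tag_values(message).get(b"UAID")
    return value.decode("ascii", errors="replace") if value is not None else None
```

The lookup returns None for implementations that omit the optional field, which matches the observation that many clients do not send it.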
Because QUIC is a relatively new addition to the common web protocols, little academic research has yet been published on identifying endpoint information from QUIC traffic beyond the User Agent value. This is a significant gap in the current research, given the rapidly growing popularity of the protocol. In addition, as HTTP/3 will use QUIC, the importance of further research on QUIC increases.
HTTP and TLS comprise a large part of all internet traffic, which explains their great coverage in current research. An NGFW is, however, often used for inspecting many other protocols as well, and for protecting vulnerable endpoint applications that process network traffic using these protocols. Identifying the endpoint applications from protocols other than HTTP and TLS is also a major area of interest.
Many protocols contain short simple byte patterns, or magic bytes, that provide a sufficient way of identifying the protocol itself. In addition, there are many papers providing other methods for identifying the application layer protocol. In Yun et al. (2016), the researchers present a tool called Securitas, which they show to be able to identify the protocol with 98.4% accuracy. It does not, however, identify the endpoint application producing the traffic.
Identifying the endpoint application from protocols other than HTTP or TLS has not attracted much academic interest. Many common protocols, such as DNS, FTP, and SMB, are unencrypted, yet they offer no simple method, like the User-Agent for HTTP, for identifying the endpoint application. The need for identifying the underlying endpoint application, however, still exists in the context of an NGFW. Vulnerabilities in the client and server implementations of certain protocols, such as SMB or DNS, may lead to a massive system compromise. A good example is the WannaCry ransomware campaign in 2017 (Mohurle and Patil 2017), which propagated by exploiting a vulnerability (Microsoft Corporation 2020c) in the older implementations of the SMB protocol on Windows systems. Although there is not much academic research on endpoint application identification for other protocols, several methods for fingerprinting the endpoint application from multiple different protocols have recently been developed. These methods are explored in the next section.
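As a brief illustration of magic-byte matching, the sketch below checks the first bytes of a TCP payload against a few well-known patterns. The set of protocols and the function name are illustrative assumptions, and the approach is unrelated to how Securitas works. As noted above, such matching identifies only the protocol, not the endpoint application.

```python
HTTP_METHODS = {b"GET", b"POST", b"HEAD", b"PUT", b"DELETE", b"OPTIONS"}


def identify_protocol(payload: bytes):
    """Guess the application protocol from the first bytes of a TCP payload."""
    # SSH peers open with an identification string such as "SSH-2.0-...".
    if payload.startswith(b"SSH-"):
        return "SSH"
    # A TLS handshake record starts with content type 0x16 and major version 3.
    if len(payload) >= 3 and payload[0] == 0x16 and payload[1] == 0x03:
        return "TLS"
    # SMB over TCP: a 4-byte session header, then 0xFF "SMB" or 0xFE "SMB".
    if payload[4:8] in (b"\xffSMB", b"\xfeSMB"):
        return "SMB"
    # An HTTP/1.x request line starts with a method token and a space.
    if payload.split(b" ", 1)[0] in HTTP_METHODS:
        return "HTTP"
    return None
```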
Endpoint application fingerprinting, by calculating a hash value from certain protocol details, is a fairly recent, but active field of study. An established method for creating a fingerprint is to generate an MD5 hash from a suitable set of significant and distinctive values from the protocol fields. Several methods for different protocols have been proposed, and more methods are constantly being developed, but little academic research has been published on their effectiveness. However, case studies have shown that the proposed methods show enough promise to justify the need for further research. In this section, we present the current status of endpoint application fingerprinting utilizing this method.
JA3 and JA3S fingerprinting
The first endpoint application fingerprinting methods utilizing the hash fingerprinting method were the JA3 and JA3S fingerprints. JA3 and JA3S are methods for fingerprinting TLS handshakes, developed by Salesforce employees John Althouse, Jeff Atkinson and Josh Atkins, and open-sourced by the company in 2017 (Althouse et al. 2020a). The method is based on TLS fingerprinting research by Lee Brotherston, presented in 2015 at DerbyCon (Brotherston 2015).
In essence, the JA3 fingerprint uses methods similar to those of the studies discussed in section “TLS”. To generate a JA3 fingerprint from the Client Hello message of a TLS handshake, the following values are stored in a specific format: Version, Accepted Ciphers, List of Extensions, Elliptic Curves, and Elliptic Curve Formats. An MD5 hash value is then taken of the stored value to obtain the JA3 fingerprint. Similarly, the JA3S fingerprint is generated from the Server Hello message by storing the values for Version, Accepted Cipher, and List of Extensions, and taking an MD5 hash value of the result.
Despite being a somewhat recent invention, there has already been some academic research published on the JA3 and JA3S fingerprints. Truong (2019) evaluates the accuracy of JA3 and JA3S for identifying TLS connections. The research focuses on separating malicious traffic from normal traffic, but it also makes some valid points on JA3 accuracy in general. The research shows that, to separate malware connections from legitimate traffic, more accurate results are obtained by combining the JA3 and JA3S fingerprints than by using JA3 alone. It also shows that using machine learning techniques, more specifically decision trees, on the values that are stored for the JA3 and JA3S fingerprints provides more accurate results than the fingerprints themselves. The research concludes that a big issue with JA3/JA3S fingerprinting is that many applications use the same underlying TLS libraries for generating the TLS connections, which is the main reason for false identifications.
The main weakness of the JA3 and JA3S fingerprints, which causes the large number of collisions, is that they leave out most of the information that could be gained from the extensions in the TLS handshake messages. Some extensions, such as the Server Name Indication extension in the Client Hello message, are clearly web service specific, and using their value when generating the fingerprints would only reduce the usability of the fingerprints. Many of the extensions are, however, not web service specific in the same way, and could be used to make the fingerprints more precise.
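The construction of the two fingerprints can be sketched as follows. The sketch assumes that the handshake fields have already been parsed into integers. It follows the published JA3 string layout, in which fields are separated by commas and the values inside a field by dashes. The function names are illustrative. The filtering of GREASE values (RFC 8701) reflects the behavior of the reference JA3 implementation and is not described in the text above.

```python
import hashlib


def _is_grease(value: int) -> bool:
    # GREASE values have the form 0x?A?A with two identical bytes (RFC 8701).
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def _join(values):
    """Join decimal values with dashes, skipping GREASE values."""
    return "-".join(str(v) for v in values if not _is_grease(v))


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string from parsed Client Hello fields."""
    return ",".join([str(version), _join(ciphers), _join(extensions),
                     _join(curves), _join(point_formats)])


def ja3s_string(version, cipher, extensions):
    """Build the JA3S input string from parsed Server Hello fields."""
    return ",".join([str(version), str(cipher), _join(extensions)])


def fingerprint(value: str) -> str:
    """Return the MD5 hex digest that constitutes the fingerprint."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


# Example with illustrative field values.
client = ja3_string(769, [47, 53, 5, 10], [0, 10, 11], [23, 24, 25], [0])
print(client, fingerprint(client))
```

Because only extension types and not their contents enter the string, two clients built on the same TLS library with the same configuration produce the same fingerprint.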
The JA3 fingerprinting method is, however, a lot more thorough than the method of only taking the list of the supported cipher suites used in Husák et al. (2016). This leads to the assumption that JA3 fingerprints should lead to more accurate results than what was achieved in Husák et al. (2016). The JA3S fingerprint is also the only method for identifying the server application that was encountered during this study. It is clear that more research on the accuracy and applicability of the JA3 and JA3S fingerprints is needed.
Hash fingerprinting for other protocols
After the release of the JA3 and JA3S fingerprinting methods, many similar fingerprinting methods have been introduced for different protocols. Many of them show promising preliminary results. Due to the cutting-edge nature of these methods, however, no academic research has yet been published on their effectiveness.
Methods for fingerprinting SSH client and server applications, called HASSH and HASSHServer, were developed by Salesforce employee Ben Reardon and open sourced by the company in 2018 (Reardon 2020). The HASSH and HASSHServer fingerprints are calculated from the clear-text SSH_MSG_KEXINIT messages sent by the client and the server, respectively. The fingerprints utilize the set of supported and preferred algorithms listed in these messages. These values are first stored in a specific format, and an MD5 hash value is then taken of the stored value. The author presents preliminary results which indicate that the method can be especially useful for identifying malicious connections, but they also show that it can be used for identifying legitimate connections.
A method for fingerprinting gQUIC clients, called CYU, was developed by Salesforce employee Caleb Yu and open sourced by the company in 2019 (Yu 2020). To generate the CYU fingerprint, the version and a list of the tag values are collected from the gQUIC Client Hello message. An MD5 hash is then calculated from the collected values, which constitutes the CYU fingerprint. A simple use case presented in the publication indicates that the CYU fingerprints can be used for identifying malicious actors, but further research is needed to verify whether the method can be used for providing endpoint-aware inspection in the context of an NGFW.
RDP fingerprinting is explored by Karimishiraz (2020) in work published in 2019. The author notes that RDP clients that use the Enhanced Security mode can be fingerprinted using the JA3 fingerprinting method. For fingerprinting RDP clients that use the Standard Security mode, they introduce a preliminary method called RDFP. The method is described as experimental and subject to modification before final release. The method utilizes certain values collected from the clientSecurityData, clientClusterData, clientNetworkData and clientCoreData structures sent during the Basic Settings Exchange phase of the connection, and calculates an MD5 hash of these values. Because the method is in the early stages of development, no further publications on RDFP were found during this study. A similarly preliminary method for fingerprinting SMB, called SMBFP, was introduced by Torres (2020) in 2020. The author describes this method as incredibly bleeding edge, and no preliminary results on the method’s effectiveness have yet been published.
Identifying mobile applications based on the network traffic is explored in several papers. Although identifying mobile applications is useful for an NGFW, the main focus of this study is on local workstations and laptops. In this section we briefly touch on the current research on mobile application identification.
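A minimal sketch of the HASSH construction is given below. It assumes the algorithm name lists have already been extracted from SSH_MSG_KEXINIT. It follows the published layout, in which the four lists (key exchange, encryption, MAC, compression) are joined with semicolons and the names inside each list with commas. The function names are illustrative.

```python
import hashlib


def hassh_string(kex, encryption, mac, compression):
    """Build the HASSH input string from four algorithm name lists.

    For the client fingerprint, the client-to-server lists are used;
    for HASSHServer, the server-to-client lists are used.
    """
    return ";".join(",".join(names) for names in (kex, encryption, mac, compression))


def hassh(kex, encryption, mac, compression) -> str:
    """Return the MD5 hex digest of the HASSH input string."""
    value = hassh_string(kex, encryption, mac, compression)
    return hashlib.md5(value.encode("ascii")).hexdigest()
```

The order of the names is preserved, so two clients that support the same algorithms but prefer them in a different order receive different fingerprints.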
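The CYU construction can be sketched in the same way. The exact string layout used here, with the version followed by a comma and the tag names joined with dashes, is an assumption modeled on the JA3 convention and should be checked against the CYU publication. The function names and the example tags are illustrative.

```python
import hashlib


def cyu_string(version, tags):
    """Build the CYU input string from the gQUIC version and ordered tag names."""
    return f"{version}," + "-".join(tags)


def cyu(version, tags) -> str:
    """Return the MD5 hex digest of the CYU input string."""
    return hashlib.md5(cyu_string(version, tags).encode("ascii")).hexdigest()


# Example with an illustrative version number and tag list.
print(cyu_string(46, ["PAD", "SNI", "STK", "VER"]))
```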
In Shbair et al. (2016), the researchers introduce NetworkProfiler, a tool for identifying a mobile application based on HTTP traffic. The tool utilizes UI fuzzing for profiling the application traffic. UI fuzzing is the method of automatically generating different UI events that trigger new behavior, and thus new network traffic patterns, from the application. NetworkProfiler suffers from being limited to unencrypted traffic. In Taylor et al. (2016) the researchers present AppScanner, which also uses UI fuzzing for traffic generation. AppScanner focuses on TCP traffic, but it does not look into the packet payloads. This makes the tool application-layer protocol agnostic: AppScanner is not restricted to HTTP or HTTPS traffic. It analyzes traffic bursts, meaning all network packets from the device within a certain time frame, and traffic flows, meaning sequences of packets within a burst with the same destination IP address and port number. Based on these, AppScanner generates application fingerprints. The paper explores several different classification methods, and provides performance numbers for the different methods. Despite being presented as a mobile application identifier, Taylor et al. point out that due to its modular design, it should be easily portable to other platforms as well.
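The burst and flow segmentation described above can be sketched as follows. This is one plausible reading rather than the AppScanner implementation. Bursts are delimited here by idle gaps longer than a threshold, and the threshold value, the `Packet` type and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    timestamp: float  # seconds
    dst_ip: str
    dst_port: int
    size: int         # bytes; payload contents are never inspected


def split_bursts(packets, threshold=1.0):
    """Group packets into bursts separated by idle gaps longer than threshold."""
    bursts, current = [], []
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        if current and pkt.timestamp - current[-1].timestamp > threshold:
            bursts.append(current)
            current = []
        current.append(pkt)
    if current:
        bursts.append(current)
    return bursts


def split_flows(burst):
    """Split one burst into flows keyed by (destination IP, destination port).

    Each flow is returned as the sequence of its packet sizes, which could
    then be summarized into statistical features for a classifier.
    """
    flows = defaultdict(list)
    for pkt in burst:
        flows[(pkt.dst_ip, pkt.dst_port)].append(pkt.size)
    return dict(flows)
```

Because a burst is complete only after the idle gap has passed, features derived this way are available only after the traffic has been observed.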
In Taylor et al. (2018) the researchers further expand the research on AppScanner by analyzing its accuracy across different application versions and over time. They also present a method for separating ambiguous traffic, or traffic that is common among different applications, from the dataset. The research shows that, if no post-processing or removal of ambiguous traffic is done, the performance of the method is modest. The best result was achieved when the device and application version remained the same over the course of six months, yielding a 40.9% detection accuracy. The accuracy decreased when the device or the application version changed. After applying post-processing and removal of ambiguous traffic, the method reached 73% accuracy in the worst-case test and 96% accuracy in the best-case test.
Because the research is restricted to mobile devices, it is difficult to estimate how the methods used by AppScanner would work for an NGFW. In the context of an NGFW, fingerprinting traffic bursts and traffic flows in this way again restricts the method to post-processing tasks such as traffic reports. For inline processing, the identification needs to happen earlier in the process. The UI fuzzing methods introduced in Shbair et al. (2016) and Taylor et al. (2016), however, provide an interesting ground for further research on endpoint application identification on other platforms.