# kngMap: Sensitive and Fast Mapping Algorithm for Noisy Long Reads Based on the K-Mer Neighborhood Graph Ze-Gang Wei, et al.

May 6, 2022

## 1 Introduction

Single molecular sequencing (SMS) developed by Pacific Biosciences and Oxford Nanopore Technologies has been increasingly applied in DNA sequencing studies since its emergence (Ono et al., 2012; Rhoads and Au, 2015). Compared to next-generation sequencing (NGS), SMS can produce considerably longer reads with less sequencing bias and lower cost (Yang et al., 2017; Marchet et al., 2019). Recently, the N50 and maximum lengths of the reads generated by the MinION nanopore sequencer can achieve more than 100 kbp and 1 Mbp, respectively (Chen et al., 2021). Such long read lengths offer new solutions to bioinformatic questions that are hard to be resolved by NGS, such as the de novo genome assembly, genome resequencing, structural variant (SV) discovery, and transcriptome analysis (Stöcker et al., 2016; Wei and Zhang, 2018; Liu et al., 2019; Cao et al., 2021).

Broadly speaking, most existing mapping methods or tools for SMS reads adopt the typical seed-chain-align strategy (Liu et al., 2016a). In brief, they first find the matched k-mers (also called seeds) of query reads to the reference genome; then, the candidate regions (also called alignment skeleton) for aligning in the query read and the reference genome are chosen based on the seeds. Finally, the local subsequences surrounding the seeds are base-to-base aligned to compose the final read alignment with the seeds. Based on the index technique of seed searching, SMS mappers can be categorized into three groups: Burrows–Wheeler Transform full-text minute-space (BWT-FM) index-based (Langmead and Salzberg, 2012), hash table index-based, and short read aligner-based methods.

BWT-FM index-based methods find the matched seeds by constructing the BWT-FM index of the reference genome; such methods include BLASR (Chaisson and Tesler, 2012), BWA-MEM (Li, 2013), lordFAST (Haghshenas et al., 2018), and smsMap (Wei et al., 2020). BLASR (Chaisson and Tesler, 2012) initially builds the BWT-FM index of the genome to find short exact matches; then, the candidate aligned region is generated by using sparse dynamic programming (SDP), and the final detailed alignment within the area defined by SDP is performed by dynamic programming. BWA-MEM (Li, 2013) first searches the supermaximal exact matches according to the BWT-FM index of the reference and detects a group of seeds that are colinear and close to each other by a chaining algorithm, which then extends the seed with a banded affine-gap-penalty dynamic programming. lordFAST (Haghshenas et al., 2018) selects the candidate alignment regions in the genome using the longest matches identified by the BWT-FM index; then, the base-to-base alignment between consecutive seeds is obtained by performing dynamic programming. Recently, the smsMap (Wei et al., 2020) mapper proposed by us also utilizes the BWT-FM index to quickly find the matches in the reference genome and then locates the best starting positions in genome and query read by defining a credibility function. Finally, the detailed alignment result is obtained with the column reduction banded alignment method.

Hash-based mapping methods search the matched seeds through a hash table index of the genome. YAHA (Faust and Hall, 2012) utilizes a hash table index to store the k-mer locations in the reference and then extends seeds to generate fragments that contain contiguous matching bases between the query sequence and the reference genome. Next, it combines the fragments to select potential regions for alignment; lastly, it completes the full alignment by applying a modified version of Smith–Waterman algorithm to the unmatched regions. rHAT (Liu et al., 2015) starts to build the regional hash table (RHT) of the genome; then, the matches of the k-mers within the query read are retrieved through RHT and a direct acyclic graph is built to compose the skeleton of the alignment. At the end, unaligned pairs of segments in the skeleton are aligned with the banded Smith–Waterman algorithm. GraphMap (Ivan et al., 2016) implements the q-gram seeding strategy that allows for fast lookup of inexact matches by constructing a hash index of the reference sequence, which then generates alignment anchors through a fast graph-based ordering of seeds and extends anchors to achieve final alignments. minimap2 (Li, 2018) indexes minimizers in a hash table that allows for fast lookup of exact matches and then identifies colinear anchor sets; final base-level alignments are obtained by applying dynamic programming to regions between adjacent anchors. conLSH (Chakraborty and Bandyopadhyay, 2020) computes context-based locality-sensitive hashing values of genomes to facilitate seed search and then generates a series of sites for candidate alignment after extension; finally, it produces the best possible alignment results by applying the sparse dynamic programming-based approach. Later, S-conLSH (Chakraborty et al., 2021), an improved version of conLSH, was developed by introducing the spaced context-based locality-sensitive hashing for mapping long noisy SMS reads.

Short read aligner-based methods retrieve matches by using extant mappers of short reads. For example, LAMSA (Liu et al., 2016b) first looks for long approximate matches through short read aligner GEM (Marco-Sola et al., 2012), finds a set of possible alignment skeletons, and fills the gaps within the skeletons to generate valid alignments for the whole read. NGMLR (Sedlazeck et al., 2018) identifies similar segments in read and genome by short aligner of NextGenMap (Sedlazeck et al., 2013); then, it extracts the sub-sequences in read and reference to compute a pairwise sequence alignment using a convex gap cost model. Lastly, it reports the set of linear alignments with the highest joint score as the mapping results.

To address the aforementioned issues, in this study, we propose a k-mer neighborhood graph mapper (named kngMap), a novel long-read mapping algorithm which is specifically designed to improve mapping sensitivity and deal with SV events. Overall, kngMap works in four main stages. It initially constructs a searching index for the reference genome to quickly find matched k-mers for query reads. Such matches are then used to construct a k-mer d-neighborhood graph where matched k-mers are viewed as vertices and each pair of matched k-mers is connected by a direct edge. Third, a high quality of alignment skeleton is identified by designing a chaining approach. At the end, each unaligned gap in the alignment skeleton is classified into several categories of SV events and filled with a specific alignment method according to its category. The whole read alignment is accomplished by integrating the skeletons and the alignments of the gaps. We benchmarked kngMap on simulated dataset reads with different types of SVs and real-life datasets generated from PacBio SMRT and Oxford Nanopore platforms. The experimental results demonstrated that kngMap has superior ability in terms of base-level sensitivity and end-to-end alignment, which can produce consecutive alignments for the whole read; meanwhile, it can deal with different categories of SVs in the reads.

## 2 Methods

The main motivation of kngMap is to efficiently improve mapping sensitivity and simultaneously have a superior ability of dealing with SVs for long reads generated by SMS sequencing technologies. The underlying design principle is to effectively find one high-quality alignment skeleton in the reference genome for each read, even though the read contains SVs and more sequencing errors, before the costly procedure of base-to-base alignment to the reference genome. An overview of the kngMap algorithm is depicted in Figure 1. kngMap mainly includes four stages: 1) building a searching index of the reference genome in advance, which is used to quickly find matched k-mers for a query read Figure 1A; 2) constructing a k-mer d-neighborhood graph where matched k-mers are viewed as vertices and each pair of matched k-mers is connected by a direct edge based on the positions in genome and the query read Figure 1B; 3) building and refining the high quality of the alignment skeleton by designing a chaining approach Figure 1C; and 4) classifying the unaligned gaps between pairs of consecutive matched k-mers in the alignment skeleton into several categories of SV events and handling each of them with a specific alignment method according to its category Figure 1D. A more detailed illustration of each step is provided later.

FIGURE 1. Schematic representation of stages in kngMap. (A) Building a searching index of the reference genome. (B) Constructing a k-mer d-neighborhood graph where matched k-mers (also called anchors) are viewed as vertices and each pair of matched k-mers is connected by a direct edge based on the positions in genome and the query read. (C) Building and refining the high quality of alignment skeletons by designing a chaining approach. (D) Classifying the unaligned gaps between pairs of consecutive anchor matches in the selected chain into several categories of SV events and handling each of them with a specific alignment method according to its category. (E) Generating the detailed base-to-base alignment by performing dynamic programming between consecutive anchor matches in the selected chain.

### 2.1 Constructing Reference Genome Index

In order to quickly find the occurrence positions of a k-mer (subsequence with length of k) in the reference genome, a lookup index for the reference genome is usually needed to be built. Currently, two superior index techniques, BWT-FM index (Lippert, 2005) and hashing table (Ning et al., 2001), are successfully used by most extant mapping methods (Lindner and Friedel, 2012). BWT-FM index allows long reference genome to be searched efficiently with low memory usage (Hayashi and Taura, 2013). Hash table takes linear time to find the positions in the genome when a certain k-mer is given (Berlin et al., 2015). Compared with BWT-FM index, the indexing speed based on the hash table is faster (Li, 2016; Prezza et al., 2016), but it will consume more computational memory. Both index techniques have their own advantages (Alser et al., 2021). Therefore, kngMap utilizes these two strategies [the BWT-FM technique implemented in combined-index (Haghshenas et al., 2018) and the hash index implemented in minimizers (Li, 2016)] to construct the index for the reference genome (the default index module is hash).

### 2.2 Generating k-Mer d-Neighborhood Graph

Once the index of the reference genome is constructed, all the matches of the k-mers and their matched positions for a query read can be retrieved through the genome index; these matches are subsequently used to form a k-mer d-neighborhood graph. Specifically, given a set of long reads, kngMap constructs the d-neighborhood anchor graph for each read at a time in two steps as follows:

##### Step 1: Extracting the Matched k-Mers

KngMap extracts all k-mers for one query read and finds the matched positions through the reference genome index built in the aforementioned stage. Each of the matches [also called anchors in some references (Chaisson and Tesler, 2012; Haghshenas et al., 2018)] of the k-mers can be denoted as a tuple:

$M(x,y,l)$

, where x and y are the matched positions on the read and the reference genome, respectively, and l is the length of the match (here l = k). In other words,

$M(x,y,l)$

indicates that the substring interval [xk + 1, x] on the reference matches the interval [yk + 1, y] on the query exactly.

##### Step 2: Generating the Anchor d-Neighborhood Graph

A d-neighborhood anchor graph is formed according to the list of anchors ordered by their positions on the reference genome and then on the read. In this graph, each anchor is viewed as nodes; two nodes (ViVj) are connected with a direct edge if the pair of vertices meets the following condition:

${Vi(x)−Vj(x)≤d, x is the node position in the genomeVi(y)−Vj(y)≤d, y is the node position in the query read,(1)$

where

$Vi(x)$

is the position of node i in the reference genome,

$Vj(x)$

is the position of node j in the query read, and d is a constraint parameter which is applied to model the maximum distance between two anchors in the alignment for a read. The edge direct (from Vi to Vj) denotes that Vi is the ancestor (also called precursor or predecessor) node of Vj. The setting of parameter d is motivated by the fact that, given a certain error rate, the distance between two anchors on the read can be modeled by a geometric distribution (Chaisson and Tesler, 2012). In kngMap, this property is utilized to describe the maximum allowed distance on the read between two vertices connected by an edge in considering the error rate of the read. Theoretically, two continuous k-mers in an error-free read are also matched in the continuous positions in the genome (Supplementary Figure S1). However, in fact, with the sequence errors derived from SMS sequencing, they can be matched with two discontinuous positions in the genome (Supplementary Figure S2), and the distance between these two discontinuous positions is usually lower than d. Under this circumstance, d can prevent a lot of unnecessary connections; furthermore, it can facilitate building the skeleton in the next step due to the fact that some vertices are lack of precursors. Here, we set

$d=λ×L(r)$

, which is a variable value that depends on the length of the query read, where

$λ$

(default value is 1.2) is an empirical value according to the length ratio (defined as the length of the aligned region in reference divided by the length of the aligned region in read) distribution of the aligned results shown in Supplementary Figure S3. With the setting of d, it has a high probability that the distance between two neighboring true positive nodes can be successfully connected with an edge.

### 2.3 Generating the Skeleton of Alignment

Building the skeleton of alignment is a core step for a mapping algorithm since it will dramatically facilitate the base-to-base alignment. The alignment skeleton (Liu et al., 2021) is a set of nonoverlapping anchors that have the highest possibility to form the candidate aligned region in the read and reference genome. In consideration of the potential breakpoints within reads, we proposed a specific chaining strategy to find the skeleton of alignment in the following two steps:

##### Step 1: Generate the Initial Alignment Skeleton

KngMap finds the optimal path connecting from Vstart to Vend, which maximizes the total number of matched bases as the skeleton of alignment. This is implemented by scoring the vertices in the d-neighborhood graph with the following recurrence equation:

$score(Vi)={maxVj∈pre(Vi){score(Vj)+α}, if pre(Vi)≠∅0,otherwise,(2)$

where

$score(Vi)$

is the score assigned to the vertice Vi in the graph,

$pre(Vi)$

is the precursor set of Vi, and

$α$

(default value is 1) is the reward score between the two vertices linked with an edge. With Eq. 2, each vertex can find a precursor maximizing its score, and the node (Vend) with the highest score among all nodes is regarded as the ending node. As a result, the path from Vend to a starting node (Vstart backtracking from Vend to a node without precursor) formulates the initial alignment skeleton.

From Eq. 2, we can see that the parameter of reward score (

$α$

) is set with a constant value; this is critical to deal with some regions containing SV events (e.g., longer insertion and deletion gaps) or high sequencing errors, especially when the lengths of these regions are long. By setting to a constant value, two anchors can successfully span the region with SV or high sequencing errors, as shown in Supplementary Figure S4; otherwise, two chains will be generated if a low score is assigned to an anchor when the distance between this anchor and its ancestor is too large (Li, 2018). Therefore, with a constant scoring parameter

$α$

, it can make sure that the edge connecting the two matches flanking the SV or regions with high errors can be detected. Thus, the scoring scheme in Eq. 2 helps recover longer insertions and deletions.

##### Step 2: Refining the Alignment Skeleton

Next, kngMap refines the skeleton by pruning the initial alignment skeleton. This pruning process aims to avoid some false positions caused by sequencing errors or repeat regions. As shown in Figure 1C, the skeleton is generated by the previous step. Obviously, the two ending anchors in Figure 1C should not be considered for downstream detail alignment. Thus, an increased refining procedure is carefully designed to deal with such situations. In the skeleton, kngMap only chooses the chains (a subset of the anchors in the initial skeleton) with the highest increased score using the following equation:

$arg maxVi, Vj{score(Vi)−score(Vj)}, Vi(x)−Vj(x)

where l is the constraint length parameter (default value l = len(r)), which guarantees that the increased score is calculated in a fixed window length in the path. This setting is motivated by observing that the anchors of a read alignment are fallen in the region with a length of l, which is found by the statistics based on the alignment results (Supplementary Figure S5). After the pruning, the two nodes that have the maximum increased scores are selected as the final starting and ending nodes (denoted as

$V′start$

and

$V′end$

), and the path between

$V′start$

and

$V′end$

is regarded as the final refined skeleton of alignment.

### 2.4 Filling the Gaps Between Anchors

With the refined skeleton of alignment, the numbers of pairs of unaligned segments can be directly partitioned by the anchors, which is shown in Supplementary Figure S6. In order to obtain the whole base-to-base alignment of the read, kngMap distinguishingly performs a detailed alignment between each pair of unaligned segments. As illustrated in Supplementary Figure S6, the skeleton (from M1 to M7) partitions the read and the reference into eight paired segments. Each pair of unaligned segments, that is, (SRi, SGi), i = 1, 2, 3, 4, 5, 6, 7, will be categorized into one of the following conditions, which carefully take the SV into consideration according to the length of segments.

1) Match case:

$|L(SRi)−L(SGi)|≤L(SRi)×μ$

.

2) Deletion case:

$L(SRi)−L(SGi)

.

3) Insertion case:

$L(SRi)−L(SGi)>L(SRi)×μ$

.

Here, SRi and SGi are the paired partitioned subsequence in the read and genome, respectively,

$L(SRi)$

is the length of SRi,

$L(SGi)$

is the length of SGi, and

$μ$

is a user-defined parameter categorizing the cases. The parameter of

$μ$

is used to model specific categories of SVs and the default value of

$μ$

is set with 2, which is an empirical value based on the alignment statistics (Supplementary Figure S7). A schematic illustration of the three categories of unaligned segments is shown in Figure 2.

FIGURE 2. Schematic illustration of the three categories of unaligned segments in the read and genome. (A) Match, (B) deletion, and (C) insertion. In the figure, the light blue, yellow, and green bars present the read, the reference genome, and the anchored k-mers, respectively. The dashed lines connecting the anchored k-mers indicate the matched positions of the k-mers on the reference genome and the read. The classification conditions are based on the lengths of pair of unaligned segments. The second and third rows give an example on how to generate the alignment for each case.

KngMap then fills the paired segments within the skeletons to solve the breakpoints of SVs and generate valid alignments for the whole read. Each of the gaps is filled by one of the following strategies according to its category. For the match case, as shown in Figure 2A, kngMap directly performs an end-to-end alignment based on the NW (Needleman–Wunsch) algorithm between the read segment (SRj) and the reference genome segment (SGj) to fill the gap; an example of dealing with this match case is also provided in Figure 2A. For a deletion case (Figure 2B), there is a potential deletion in SRj; moreover, we do not know if the SV position exists. In order to effectively deal with such cases, a semi-global alignment is performed, which is beneficial for handling multiple breakpoints within SGj. For the example in Figure 2B, there are two deletions within the SRj; the semi-global alignment can effectively recognize them since it can find the most similar regions in SGj for SRj. In the implementation of the semi-global alignment, the genome segment (SGj) was chosen to be the target sequence and read segment (SRj) to be the query sequence as the alignment gaps at the query start and end are not penalized. Similarly, for an insertion case (Figure 2C), there is a potential insertion in SRj, and we do not know if its positions exist. For this case, the semi-global alignment is also applied to recognize the insertion. The read segment (SRj) was chosen to be the target sequence and genome segment (SGj) to be the query sequence in the implementation of the semi-global alignment.

### 2.5 Extending the Boundaries of the Skeletons

All the operations mentioned previously fill the inner gaps of the skeleton, which are anchored by two matched k-mers. For the outer boundaries of the refined skeleton, that is, the two pairs of unsigned segments at the starts and ends of the skeletons as shown in Supplementary Figure S8, kngMap assumes that these boundaries of the genomic region is SV-free and directly extends the boundaries by performing a modified global alignment to obtain the base-to-base alignment. To be more specific, for the alignment between the suffix of the read and the reference following the last anchor, kngMap first extracts the SRend as the query sequence and the SGend (with two times length than SRend) as the target sequence. Then, the modified global alignment (similar to the global NW alignment method but with a small twist gap at the query end that is not penalized) is performed for SRend and SGend; the modified global alignment can find out how well SRend fits at the beginning of SGend. A toy example of it is illustrated in Supplementary Figure S8. Similarly, the alignment between the prefix of the read and the reference prior to the first anchor can be computed in an identical fashion. Finally, the whole read alignment is accomplished by integrating the skeletons and the alignments of the gaps Figure 1E. Overall, the pseudo-code for the kngMap method procedure is described in Algorithm 1:

Algorithm 1. kngMap Algorithm.

## 3 Results

The performance of kngMap was evaluated against 10 state-of-the-art long read aligners: BLASR (v) (Chaisson and Tesler, 2012), minimap2 (v2.12-r829) (Li, 2018), BWA-MEM (v0.7.17-r1194) (Li, 2013), GraphMap (v0.5.2) (Ivan et al., 2016), rHAT (v0.1.2) (Liu et al., 2015), NGMLR (v0.2.7) (Sedlazeck et al., 2018), lordFAST (v0.0.9) (Haghshenas et al., 2018), smsMap (Wei et al., 2020), conLSH (v0.0.1) (Chakraborty and Bandyopadhyay, 2020), and S-conLSH (v0.0.1) (Chakraborty et al., 2021). All methods were benchmarked on simulated and real-life SMS sequencing datasets. All the experiments were performed on a Linux machine server running Ubuntu 14.04 system equipped with two twelve-core (two threads per core) Intel Xeon Gold 5118 CPU @ 2.30GHz and 256 gigabytes (GB) RAM. The run command lines and parameters of each mapping tools are available in Supplementary Table S1.

It is to be noted that with the rapid development of SMS sequencing technology, more than 99% of the raw sequencing data produced by the SMS sequencing platform has a read length of 1,000 bp or longer (Haghshenas et al., 2018). Herein, reads with lengths shorter than 1,000 bp were filtered so that the remaining reads that have 1,000 bp or longer are aligned in the following experiments.

### 3.1 Experiment on Simulated Datasets

##### 3.1.1 Simulation Without Structural Variations

We first adopted the simulated sequence datasets without structural variations (SVs) to evaluate the performance of kngMap algorithm against the aforementioned mapping methods. The simulated datasets were generated by PBSIM2 (Ono et al., 2020), a new simulator that can capture the characteristics of errors in reads for both PacBio and Nanopore sequencer. The H. sapiens (CHM1) genome sequences were downloaded from NCBI (assembly accession: GCA_000306695.2) and fed into PBSIM2 software using default error parameters (8% deletions, 7% insertions, and 1% substitutions). As a result, 100,000 simulated sequences with an average read length of 10,157 bp were generated by PBSIM2. The detailed running commands of PBSIM2 and some statistics of simulated datasets can be found in Supplementary Tables S2 and S3.

For the simulated dataset, we know exactly the true mapped region and bases in the reference genome for each read, so the correctly mapped reads (CMRs) and correctly mapped bases (CMBs) were applied to investigate the overall quality of the alignments. A read r is called CMR if this read is aligned to the derived genome with the correct strand, and the overlap between the aligned subsequence on the reference and the true mapping subsequence has at least p bases (p = 0.9 × len(r), where len(r) is the read length of r). A base is called CMB if it is located within T bp (here T = 5) of the corresponding truth position on the genome. We further compared the aligned coverage (proportion of aligned bases for one read) to assess the alignment for different methods. A method with higher aligned coverage means that it can align more bases to the reference genome for a read. In addition, base-level sensitivity and precision (Marco-Sola et al., 2012) are used to evaluate the performance of different mappers for the simulated sequence dataset. Sensitivity is defined as the number of correct matched bases divided by the total number of bases, and precision is defined as the number of correct matched bases divided by the number of mapped bases.

Based on the aforementioned metric definition, Table 1 shows the evaluation result of kngMap, rHAT, smsMap, lordFAST, BLASR, BWA-MEM, GraphMap, minimap2, NGMLR, conLSH, and S-conLSH on the simulated datasets without structural variations. As can be seen from Table 1, kngMap not only correctly mapped more reads than any other mapper but also correctly aligned 99.82% of the total number of bases, which improves the sensitivity by 0.1%–2.4% over its competitors, except the conLSH and S-conLSH methods. For the base sensitivity and precision in Table 1, we can observe that kngMap still performed the best, indicating that most bases are correctly mapped in the alignments generated by kngMap. It is worth noting that the alignment results of conLSH and S-conLSH are obviously worse than those of other methods, so we do not show the results of these two methods in the following comparisons.

TABLE 1. Mapping results of different mapping tools on the simulated human dataset. This dataset contains 10,000 reads and 1.015 billion bases. The best results are labeled with a bold typeface.

Importantly, kngMap achieved 100% aligned coverage, demonstrating that all mapped reads by kngMap are completely aligned, while the alignment by other aligners cannot cover the whole sequence, which means that these methods always output local mapping results and cannot obtain the end-to-end alignment for the whole read. After checking the details of the alignments, soft clipping at the starting or ending of sequences is usually produced by other methods. These results in aligned coverage explain that kngMap has a higher base-level sensitivity. This 100 percent coverage achieved by kngMap is beneficial for SMRT data analysis since higher aligned coverage is a key requirement for mapping tools and downstream mapping-based applications (Schmieder and Edwards, 2012; Laver et al., 2015).

In addition, we inspected the reads of the simulated human dataset which were incorrectly aligned by kngMap and found that 255 (88.29%) of the reads have a higher similarity than the original similarity simulated by the PBSIM2. To give an example, Supplementary Material (named as S9_46798.txt) reports the alignment result for one simulated read with a length of 6,692 bp; the mapped alignment identity obtained by kngMap is 90.74%, while the simulated identity generated by PBSIM2 is 87.50%, which is lower than the similarity produced by kngMap. Therefore, this further indicates that kngMap can find the most similar mapped region in the reference sequence for each query read.

##### 3.1.2 Simulation in the Presence of Structural Variations

For assessing the ability of handling SVs for different methods, a simulated sequence dataset with different types of SVs was generated. Specifically, 16 SVs (i.e., five insertions, eight deletions, and three inversions) with different sizes were first added into the reference chr1 (chromosome 1 of CHM1) genome by the RSVSim (Bartenhagen and Dugas, 2013) simulator, an R package tool for the simulation of SVs with various sizes and SV types in any genome available as the FASTA file. Then, the chr1 genome of CHM1 with SVs was fed into PBSIM2 to produce the final simulation read set. Among the simulated reads, a total of 129 reads cover the SV breakpoints. The detailed SVs and its breakpoints are listed in Supplementary Table S4.

Here, we provide the number of mapped reads that span SVs (#SVs) to evaluate the performance of different mapping tools. For one mapped read, if the start and end alignment coordinates in the genome cover the actual simulated breakpoints, we consider this read spanning SVs. Specifically, each of the breakpoints described by the ground truth in the genome can be denoted as a tuple:

$BPiT(PGstartT, PGendT), i=1, 2, ⋯ , NBPT$

, where

$PGstartT$

and

$PGstartT$

are the starting and ending positions of the breakpoint on the reference genome, respectively, and

$NBPT$

is the total number of ground truth breakpoints. Similarly, each of the alignment coordinates mapped by a mapping tool for a simulated read with SV can be denoted as a tuple:

$MCi(MGstart, MGend), i=1, 2, ⋯ , NreadSV$

, where

$MGstart$

and

$MGend$

are the starting and ending mapped coordinates on the reference genome, respectively, and

$NreadSV$

is the total number of reads covering the SV breakpoints (

$NreadSV$

= 129).

With the ground truth breakpoints and the mapped coordinates obtained by a mapping tool, we can assess the number of ground truth breakpoints being recovered. For a certain read, a ground truth breakpoint,

$BPiT(PGstartT, PGendT)$

, is considered being recovered, only if there is at least one mapped coordinate,

$MCi(MGstart, MGend)$

, meeting the following condition:

${MGstartPGendT.(4)$

Finally, the number of ground truth breakpoints being recovered from our kngMap and other seven mapping programs is listed in Table 2, from which we can see that kngMap can map more reads with SVs on the genome than the other approaches, which demonstrate that our kngMap is capable to handle the SV-spanning reads.

TABLE 2. Number of mapped reads that span SV breakpoints for different mapping methods.

### 3.2 Experiment on Three Real Datasets

In this experiment, three datasets (generating from PacBio SMRT and Oxford Nanopore platforms) of A. thaliana, E. coli UTI89, and H. sapiens (CHM1) were used to benchmark kngMap against other methods on real sequencing data. Among these datasets, A. thaliana and H. sapiens (CHM1) were generated from the PacBio SMRT platform, and E. coli UTI89 was generated from the Oxford Nanopore MinION sequencer. The availability of these datasets and their reference genomes are provided in Supplementary Tables S5 and S6, respectively, and the detail statistics (e.g., read number and length distributions) related to these datasets are given in Supplementary Figure S7. In the absence of true mapping locations for these real-life datasets, the mappers were compared based on different metrics as follows.

TABLE 3. Number of mapped reads, mapped bases, mapped matched bases, and the alignment score of nine methods on the real A. thaliana dataset. The best results are labeled with a bold typeface.

Next, we investigated the consecutive alignments, which can reflect the details of the alignments (Liu et al., 2015). To compare the consecutiveness of the alignments, we set four covering thresholds (i.e.,

$c1=80%$

,

$c2=85%$

,

$c3=90%$

, and

$c4=95%$

) to inspect the aligned proportion of the read which has at least one alignment covering at least ci (i = 1,2,3,4) proportion of the whole read. If one alignment covers more than ci of the read, the alignment is defined as the consecutive alignment at threshold ci. Figure 3 describes the results of the consecutive alignment for nine methods, and the detailed values shown in Figure 3 can be seen in Supplementary Table S10. From Figure 3, we can clearly see that kngMap and smsMap always achieved 100% consecutive alignments at all thresholds, indicating that kngMap and smsMap can achieve the end-to-end alignment for each read, which will facilitate downstream analysis in practice since each mapped read is completely aligned and not split aligned. GraphMap also achieved high consecutive alignment, while other methods, especially BWA-MEM and NGMLR, tend to produce more reads that have a large proportion of bases being clipped. Similar results can be found in Supplementary Figures S9 and S10 for the real dataset of E. coli and H. sapiens. The consecutive results in Figure 3 and Supplementary Figures S9 and S10 demonstrate that kngMap can provide better consecutive alignments, which is the main reason for the higher base-level sensitivity of kngMap.

FIGURE 3. Consecutiveness alignment of different methods on the A. thaliana dataset. The percentage values labeled on the horizontal axis are the thresholds of the proportion of covered bases, that is, ci (i = 1,2,3,4), for considering if the alignment result for a read is consecutive. The vertical axis bars indicate the proportions of reads consecutively mapped by different mappers. kngMap always achieved 100% consecutive alignments with all thresholds, while GraphMap achieved 99.91%, 99.88%, 99.84%, and 99.77% consecutive alignment at 80%, 85%, 90%, and 95% coverage thresholds, respectively. Therefore, kngMap is superior to GraphMap in providing better consecutive alignments. Each bar also corresponds to a value line in Supplementary Table S10.

Furthermore, the agreement between different methods based on their alignment results was measured. Supposing that alignments x1 and x2 are two alignment results obtained by two mapping methods for a query read, we define that x1 has an agreement with x2 if and only if the mapped region on the reference genome covered by x1 overlaps with at least 90% of the mapped region on the reference genome covered by x2. Figure 4 describes some toy examples to illustrate the covering and non-covering alignments. As a result, the agreement alignments between each pair methods on the A. thaliana dataset are shown in Table 4. Each row in Table 4 denotes the percentage of the alignments produced by the corresponding method that covers alignments generated by other tools listed in the column. For example, among all mapped reads for kngMap and minimap2 in Table 4, kngMap covers 91.85% of the alignments produced by the minimap2 method, while minimap2 only covers 78.82% of the alignments generated by kngMap. Supplementary Tables S12 and S13 report the agreement between different methods on E. coli and H. sapiens datasets, respectively. We can see that the mapping results of kngMap have a high coverage of the alignments obtained by other mapping methods.

FIGURE 4. Examples of covering and non-covering alignment results. x1, x2, x3, and x4 are different alignment results reported by four mapping methods for the same query read. The dotted line indicates the mapped region in the reference genome sequence for the corresponding alignment result. In this figure, x1 and x2 alignments cover each other as they span the mapped regions on the reference genome that have at least 90% overlap. The alignments x1 and x2 cover x3 alignment but not the x4 alignment, while x3 does not cover x1 and x2 alignments. On the other hand, the alignment x4 does not cover either alignment x1, x2, or x3.

TABLE 4. Agreement of different alignment methods for the real A. thaliana dataset.

Finally, we compared the performance of the tools on reads for which their alignments do not agree. The number of disagreeing alignments and average identity difference between the two methods for the inconsistent alignments are listed in Table 5. The numbers in each row indicate the number of reads for which the corresponding method reports alignments that do not cover alignments outputted by the other methods listed in the column. The positive/negative percentage in parentheses means that the average identity of the disagreeing alignments for the corresponding method is higher/lower than the average identity of the disagreeing alignments for the other methods listed in the column. For instance, there are 1,997 reads for which BLASR does not cover the alignments of kngMap. For those reads, BLASR produced alignments with an average of 5.83% lower identity than kngMap. On the contrary, there are 5,497 reads for which kngMap does not cover BLASR’s alignments. For those reads, on average, kngMap alignments have 24.74% higher identity than those of BLASR. This indicates that kngMap can find more similar alignments than BLASR. The performance of the tools on E. coli and H. sapiens reads is provided in Supplementary Tables S13 and S14. With a lack of the true mappings for the real dataset, the information in Table 5 and Supplementary Tables S13 and S14 are some extra support for the fact that the alignments of our kngMap are reliable.

TABLE 5. Performance of each pair methods on the A. thaliana dataset for which their alignments do not agree.

## 4 Discussion

For SMS read mapping, the seed-chain-align procedure (Lin and Hsu, 2020), as is used by most mapping methods, is a classical and effective strategy to obtain the superior alignment results since it can help to seek the alignment skeleton consisting of matched segment pairs between the long SMS read and the reference genome. Generally, the seed-chain-align procedure can be summarized as three phases: first, matched k-mers are found as seeds in one read and in the reference sequence. Then, a group of seeds that are colinear or close to each other is chained as the candidate alignment skeleton. Finally, the non-seed fragments within the alignment skeleton are being extended to generate the base-level alignment. In the stage of generating the alignment skeleton, these methods are usually difficult to capture the matches with long distance, especially for the read parts with SVs and high sequencing errors since that there are few or no matched k-mers in the parts. Thus, most methods failed to generate the alignment skeleton of the query read or just build an alignment skeleton that covers a small part of the read. As a result, these methods cannot obtain the alignment result or just report a local mapping result for the query read, other than obtaining the whole end-to-end alignment, leading to low mapping sensitivity and aligned coverage.

To address the aforementioned challenge, here, we developed kngMap to increase the mapping sensitivity and with 100% aligned coverage for long noisy alignment. kngMap is also a seed-chain-align method using the hashing index technique, compared with other methods. There are three main key features of kngMap: 1) kngMap proposes a scoring strategy in the chaining procedure to choose a group of anchors to form the initial alignment skeleton for each read, even in the situation that the read contains SVs or some regions that matched seeds are dispersedly distributed caused by high sequencing errors; 2) defining an increased credibility function to refine the alignment skeleton, and then a high quality of alignment skeleton is subsequently obtained; and 3) for each of the gaps within the skeleton, kngMap classifies it into one of three categories and implements a specific alignment strategy to fill the corresponding gaps. The scoring strategy and increased credibility function ensures that kngMap can effectively find the alignment skeleton for every query read, even in the situation that the matched seeds are dispersedly distributed in the reference genome. Thus, kngMap can obtain higher mapping sensitivity, that is, align more reads and bases. The last feature can guarantee that the whole end-to-end alignment is obtained, not local alignment achieved by other methods. Therefore, the aligned coverage of kngMap is higher than that of other methods. In addition, in order to further evaluate the performance of the kngMap for a higher error rate up to ∼15%, we applied the PBSIM2 simulator to generate another two simulated datasets with 20% (parameter: –accuracy-mean 20) and 25% (parameter: –accuracy-mean 25) error rates; the other parameter settings are similar to those given in Supplementary Table S2. Supplementary Table S14 shows the evaluation result of kngMap, rHAT, smsMap, lordFAST, BLASR, BWA-MEM, GraphMap, minimap2, and NGMLR on the simulated datasets with 20% error rate. As can be seen from Supplementary Table S14, kngMap correctly mapped the maximum number of reads and bases to the reference genome. Importantly, kngMap achieved 100% aligned coverage, demonstrating that all mapped reads by kngMap are completely aligned. For the base sensitivity and precision in Supplementary Table S14, we can see that kngMap still performed the best, indicating that most bases are correctly mapped in the alignments generated by kngMap. Similar mapping results can be found in Supplementary Table S15 with simulated dataset with 25% error rate. Therefore, these results in Supplementary Tables S14 and S15 demonstrate that our kngMap is more robust to sequencing errors, and it can obtain better mapping results for a higher error rate up to ∼15%.

The massive amount of long noisy read data produced by SMS technologies brings some challenges to existing mapping approaches. In addition to accuracy, computational complexity is another important issue that needs to be considered. The time complexity of kngMap has four main components: 1) In the phase of constructing the reference genome index, it needs to extract all k-mers of the reference genome sequences to build the hash table index; thus, the maximum complexity is in the order of O(G), where G is the length of the genome sequences. 2) In the phase of generating k-mer d-neighborhood graph, it needs to retrieve all the matches of the k-mers for each query read, so the maximum complexity is O(N*L), where N is the number of reads, and L is the average read length. 3) In the phase of generating the skeleton of alignment, it needs to recurrently calculate the score of each node, so the maximum complexity is O(N*M), where M is the average number of matched k-mers for all reads. 4) In the phase of filling the gaps between anchors, it needs to obtain the detailed base-to-base alignment, so the maximum complexity is O(K*Q2), where K is the average number of gaps, Q is the average length of gaps, and Q<<L. In summary, the total complexity for kngMap is O(G+N*L+N*M+K*Q2). Since Q<<L, kngMap has a time complexity of the order of G and L. In order to graphically evaluate the computational efficiency of our kngMap, we compared kngMap with other mapping tools on the real H. sapiens datasets. Figure 5 shows the relative running time (wall-time) by using the nine tools. In terms of computational efficiency, we can see that the speed of kngMap is almost as fast as minimap2 and is several folds faster than other mapping methods. For example, kngMap is overall about 2-folds faster than rHAT and 6- to 9-folds faster than smsMap, lordFAST, BLASR, BWA-MEM, and GraphMap mappers. The speed comparison result described in Figure 5 indicates that kngMap is efficient to align SMS reads.

FIGURE 5. Relative speed of the aligners benchmarking on the real H. sapiens datasets. The relative speed of one mapper is defined as T(aln)/T(kngMap), where T(aln) and T(kngMap) are the alignment times of a compared aligner and kngMap, respectively. The lower the height of the bar, the faster the speed of the corresponding method. We can see that kngMap is faster than rHAT, smsMap, lordFAST, BLASR, BWA-MEM, GraphMap, and NGLMR mappers.

## 5 Conclusion

Since the infancy of SMS technologies (e.g., PacBio and Oxford Nanopore MinION) that produce longer but higher sequencing error reads, mapping these reads to a reference genome is often the most basic and computing-expensive step for downstream genome sequence analysis. Developing novel long read mapping tools is on demand for improving the mapping sensitivity and effectiveness of SMS read alignment.

In this study, we present kngMap, a new, fast, and highly sensitive mapping algorithm for long noisy reads. Mainly, kngMap contains three key characteristics: 1) kngMap proposes a scoring strategy in the chaining procedure to choose a group of anchors to form the initial alignment skeleton for each read. 2) kngMap designs an increased credibility function to refine the alignment skeleton. The scoring strategy and increased credibility function progressively ensure that kngMap can locate the aligned region for every query read, even in the situation that the matched seeds are dispersedly distributed in the reference genome. 3) For each of the gaps within the skeleton, kngMap classifies it into one of three categories and implements a specific alignment strategy to fill the corresponding gaps, which can help to robustly handle potentially different types of SVs in the reads. kngMap was benchmarked on simulated and real datasets across various genomes with other state-of-the-art mappers. The experimental results demonstrated that kngMap has higher accuracy and sensitivity that can correctly map more sequences and bases to the reference genome and achieves 100% aligned read coverage ratio; meanwhile, it also has good ability to span different types of SVs within the reads.

## Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors.

## Author Contributions

Z-GW wrote the draft manuscript and programmed the C++ source codes. X-GF and HZ participated in the benchmarking experiments, prepared several figures, and collected some tables. X-DZ helped download the source codes of other compared mappers and install them. FL designed the overall study and reviewed the manuscript. YQ and S-WZ helped in improving and revising the manuscript. All authors contributed to the conception and design of the study, participated in the analysis of the experimental results, and edited the manuscript. All authors read and approved the final manuscript.

## Funding

This study was supported by the Scientific Research Program Funded by Shaanxi Provincial Education Department (No. 21JK0486), the Natural Science Basic Research Plan in Shaanxi Province of China (Nos. 2021JQ-811 and 2022JZ-03), and the National Natural Science Foundation of China (Nos. 61873202, 61473232, and 91430111).

## Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.