Efficient Alignment of Very Long Sequences

Advances in Science, Technology and Engineering Systems Journal, Vol. 3, No. 2, 329-345 (2018)

Article history: Received: 20 March 2018; Accepted: 16 April 2018; Online: 30 April 2018


Introduction
Sequence alignment is a fundamental and well-studied problem in the biological sciences. In this problem, we are given two sequences A[1 : m] = a_1 a_2 ··· a_m and B[1 : n] = b_1 b_2 ··· b_n, and we are required to find the score of the best alignment and possibly also an alignment with this best score. When aligning two sequences, we may insert gaps into the sequences. The score of an alignment is determined using a matching (or scoring) matrix that assigns a score to each pair of characters from the alphabet in use, as well as a gap penalty model that determines the penalty associated with a gap sequence. In the linear gap penalty model, the penalty for a gap sequence of length k > 0 is kg, where g is some constant; in the affine model, this penalty is g_open + (k − 1) · g_ext. The affine model more accurately reflects the fact that opening a gap is more expensive than extending one. Two versions of sequence alignment, global and local, are of interest. In global alignment, the entire A sequence is to be aligned with the entire B sequence, while in local alignment we wish to find substrings of A and B that have the highest alignment score. The alphabets for DNA, RNA, and protein sequences are, respectively, {A, C, G, T}, {A, C, G, U}, and the set of 20 amino-acid codes. The alignment of Figure 1(a) is a global alignment and that of Figure 1(b) is a local one. To score the alignments, we have used the linear penalty model with g = −2; the scores for pairs of aligned characters, taken from the BLOSUM62 matrix in [1], are c(T, T) = 5, c(A, A) = 4, c(C, C) = 9, c(G, G) = 6, and c(C, T) = −1. The score for the shown global alignment is 17 while that for the shown local alignment is 23. If we were using an affine penalty model with g_open = −4 and g_ext = −2, then the penalty for each of the gaps in positions 1 and 8 of the global alignment would be −4 and the overall score for the global alignment would be 13.
In [2], the authors proposed an O(mn)-time algorithm, the Needleman-Wunsch (NW) algorithm, for global alignment using the linear gap model. This algorithm requires O(n) space when only the score of the best alignment is to be determined and O(mn) space when the best alignment is also to be determined. In [3], the authors proposed the Smith-Waterman (SW) algorithm, which modifies the NW algorithm so as to determine the best local alignment. In [4], the author proposed a dynamic programming algorithm, the Gotoh algorithm, for sequence alignment using an affine gap penalty model. The asymptotic complexity of the SW and Gotoh algorithms is the same as that of the NW algorithm.
When mn is large and the best alignment is sought, the O(mn) space required by the NW, SW, and Gotoh algorithms exceeds what is available on most computers. The best alignment for these large instances can be found using sequence alignment algorithms derived from Hirschberg's linear-space divide-and-conquer algorithm for the longest common subsequence problem [5]. In [6], Myers and Miller developed a linear-space, O(mn)-time version of Hirschberg's algorithm for global sequence alignment using an affine gap penalty model, and in [7], the authors do this for local alignment.
In an effort to speed up sequence alignment, fast sequence-alignment heuristics have been developed; BLAST, FASTA, and Sim2 [8, 9, 10] are a few examples of software systems that employ such heuristics. Another direction of research, also aimed at speeding up sequence alignment, has been the development of parallel algorithms. Parallel algorithms for sequence alignment may be found in [11]-[19], for example.
In this paper, we focus on reducing the number of cache misses that occur in the computation of the score of the best alignment as well as in determining the best alignment. Although we explicitly consider only the linear gap penalty model, our methods readily extend to the affine gap penalty model. Our interest in cache misses stems from two observations: (1) the time required to service a last-level-cache (LLC) miss is typically 2 to 3 orders of magnitude more than the time for an arithmetic operation, and (2) the energy required to fetch data from main memory is typically 60 to 600 times that needed when the data is on the chip. As a result of observation (1), cache misses dominate the overall running time of applications for which the hardware/software cache prefetch modules on the target computer are ineffective in predicting future cache misses. The effectiveness of hardware/software cache prefetch mechanisms varies with application, computer, and compiler. So, if we are writing code that is to be used on a variety of computer platforms, it is desirable to write cache-efficient code rather than to rely exclusively on the cache prefetching of the target platform. Even when the hardware/software prefetch mechanism of the target platform is very effective in hiding memory latency, observation (2) implies excessive energy use when there are many cache misses.
This paper is an extension of work originally presented by us at the 2015 IEEE 5th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) [20]. The main contributions are: 1. cache-efficient single-core and multi-core algorithms to determine the score of the best alignment; 2. cache-efficient single-core and multi-core algorithms to determine the best alignment.
The rest of the paper is organized in the following way. In Section 2, we describe our cache model. Our cache-efficient algorithms for scoring and alignment are developed and analyzed in Section 3. Experimental results are presented in Section 4. In Section 5, we present a discussion of these results and in Section 6, we present the limitations of our work. Finally, we conclude in Section 7.

Cache Model
For simplicity in the analysis, we assume a single cache comprised of s lines, each of size w words (a word is large enough to hold a piece of data, typically 4 bytes). So, the total cache capacity is sw words. The main memory is partitioned into blocks, also of size w words each. When the program needs to read a word that is not in the cache, a cache miss occurs. To service this cache miss, the block of main memory that includes the needed word is fetched and stored in a cache line, which is selected using the LRU (least recently used) rule. Until this block of main memory is evicted from this cache line, its words may be read without additional cache misses. We assume a write-back cache with write allocate. Write allocate means that when the program needs to write a word of data, a write miss occurs if the corresponding block of main memory is not currently in the cache. To service the write miss, the corresponding block of main memory is fetched and stored in a cache line. Write back means that the word is written to the appropriate cache line only; a cache line with changed content is written back to main memory when it is about to be overwritten by a new block from main memory.
Rather than directly assess the number of read and write misses incurred by an algorithm, we shall count the number of read and write accesses to main memory.
Every read or write miss makes a read access. A read or write miss also makes a write access when the data in the replacement cache line must be written back to main memory. We emphasize that the described cache model is a very simplified one. In practice, modern computers commonly have two or three levels of cache and employ sophisticated adaptive cache replacement strategies rather than the LRU strategy described above. Further, hardware and software cache prefetch mechanisms are often deployed to hide the latency involved in servicing a cache miss. These mechanisms may, for example, attempt to learn the memory access pattern of the current application and then predict the future need for blocks of main memory. The predicted blocks are brought into cache before the program actually tries to read/write from/into those blocks, thereby avoiding (or reducing) the delay involved in servicing a cache miss. Actual performance is also influenced by the compiler used and the compiler options in effect at the time of compilation. As a result, actual performance may bear little relationship to the analytical results obtained for our simple cache model. Despite this, we believe the simple cache model serves a useful purpose in directing the quest for cache-efficient algorithms that eventually need to be validated experimentally.
Scoring Algorithms

Let H_ij denote the score of the best global alignment of A[1 : i] and B[1 : j]. The Needleman-Wunsch equations for global alignment using the linear gap penalty model are H_i0 = i · g, H_0j = j · g, and, when i > 0 and j > 0,

H_ij = max{H_(i−1)(j−1) + c(a_i, b_j), H_(i−1)j + g, H_i(j−1) + g},

where c(a_i, b_j) is the match score between characters a_i and b_j and g is the gap penalty. For local alignment, H_ij denotes the score of the best local alignment for A[1 : i] and B[1 : j]. In [3], the Smith-Waterman equations for local alignment using the linear gap penalty model are H_i0 = H_0j = 0 and, when i > 0 and j > 0,

H_ij = max{0, H_(i−1)(j−1) + c(a_i, b_j), H_(i−1)j + g, H_i(j−1) + g}.

Several authors ([5, 6], for example) have observed that the score of the best local alignment may be determined using a single array H[0 : n], as in algorithm Score (Algorithm 1); the scoring algorithm for the Needleman-Wunsch recurrence is similar. It is easy to see that the time complexity of Algorithm 1 is O(mn) and its space complexity is O(n).

For the (data) cache miss analysis, we focus on read and write misses of the array H and ignore misses due to reads of the sequences A and B as well as of the scoring matrix c (notice that there are no write misses for A, B, and c). Figure 2 shows the memory access pattern for H by algorithm Score. The first row denotes the initialization of H and subsequent rows denote accesses for different values of i (i.e., different iterations of the for i loop). The initialization of H makes n/w read accesses; each iteration of the for i loop also results in n/w read accesses and approximately n/w write accesses. So, the total number of read accesses is (m + 1)n/w ≈ mn/w and the number of write accesses is also ≈ mn/w; together, the number of read and write accesses is ≈ 2mn/w when n is large. We note, however, that when n is sufficiently small that H[] fits into the cache, the number of read accesses is n/w (all occur in the initialization loop) and there are no write accesses. In practice, especially in the case of local alignment involving a very long sequence, one of the two sequences A and B is small enough to fit in the cache while the other may not be. In these cases, it is desirable to ensure that A is the longer sequence and B is the shorter one so that H fits in the cache entirely; this is accomplished by swapping the A and B sequences when necessary.

Diagonal Algorithm
An alternative to computing the score by rows is to compute it by diagonals. While this uses two one-dimensional arrays rather than one, it is more readily parallelized than Score, as all elements on a diagonal can be computed at the same time, whereas elements on a row must be computed in sequence. On diagonal d, the elements computed have indices in the range [x, y], where x = (d ≤ m ? 0 : d − m) and y = (d ≤ n ? d : n), as in Algorithm 2. The total number of read accesses is mn/w for each diagonal array and the total number of write accesses is mn/w for both arrays combined. So, the number of main memory accesses for Diagonal is approximately 3mn/w when n is large.

Strip Algorithm
When neither H[1 : m] nor H[1 : n] fits into the cache, accesses to main memory may be reduced by computing H_ij by strips of width q such that q consecutive elements of H[] fit into the cache. Specifically, we partition H[1 : n] into n/q strips of size q (except possibly the last strip, whose size may be smaller than q) as in Figure 3. First, all H_ij in strip 0 are computed, then those in strip 1, and so on. When computing the values in a strip, we need those in the rightmost column of the preceding strip. So, we save these rightmost values in a one-dimensional array strip[0 : m]. The algorithm is given in Algorithm 3. We note that sequence alignment by strips has been considered before. For example, the authors of [12] use a similar approach in their GPU algorithm. Their use differs from ours in that they compute the strips in pipelined fashion, with each strip assigned to a different pipeline stage in round-robin fashion, and within a strip the computation is done by anti-diagonals in parallel. We, on the other hand, do not pipeline the computation among strips and, within a strip, our computation is by rows. For each strip, the algorithm makes:

1. q/w read accesses for the appropriate set of q entries of H for the current strip and q/w write accesses for the cache lines whose data are replaced by these H values. The write accesses are, however, not made for the first strip.

2. m/w read accesses for strip[] and m/w write accesses. The number of write accesses is less by s for the last strip.
So, the overall number of read accesses is m/w + (q/w + m/w) · (n/q) = m/w + n/w + mn/(wq), and the number of write accesses is approximately the same. The total number of main memory accesses is therefore ≈ 2mn/(wq) when m and n are large.

Alignment Algorithms
In this section, we examine algorithms that compute the alignment that results in the best score rather than just the best score. While in the previous section we explicitly considered local alignment and remarked that the results readily extend to global alignment, in this section we explicitly consider global alignment and remark that the methods extend to local alignment.

Myers and Miller's Algorithm
When aligning very long sequences, the O(mn) space requirement of the full-matrix algorithm exceeds the available memory on most computers. For these instances, we need a more memory-efficient alignment algorithm. In [6], Myers and Miller adapted Hirschberg's linear-space algorithm for the longest common subsequence problem to find the best global alignment in linear space; its time complexity is O(mn). However, this linear-space adaptation performs about twice as many operations as does the full-matrix algorithm. In [11], the authors developed a hybrid algorithm, FastLSA, whose memory requirement adapts to the amount of memory available on the target computing platform. In this section and the next, we focus on the adaptation of the Myers-Miller algorithm.
It is easy to see that an optimal (global) alignment is comprised of an optimal alignment of A[1 : m/2] with some prefix B[1 : j] followed by an optimal alignment of A[m/2 + 1 : m] with the remaining suffix B[j + 1 : n]; the column j at which this happens is an optimal crossover point for row m/2. Hence, an optimal alignment is comprised of a sequence of optimal crossover points. This is depicted visually in Figure 4. Figure 4(a) shows alignments using 3 possible crossover points at row m/2 of H. Figure 4(b) shows the partitioning of the alignment problem into 2 smaller alignment problems (shaded rectangles) using the optimal crossover point (the meeting point of the 2 shaded rectangles) at row m/2. Figure 4(c) shows the partitioning of each of the 2 subproblems of Figure 4(b) using the optimal crossover points for these subproblems (note that these crossovers take place at rows m/4 and 3m/4, respectively). Figure 4(d) shows the constructed optimal alignment, which is presently comprised of the 3 determined optimal crossover points.
We use a modified version of Score, MScore, that differs from Score only in that MScore returns the entire array H rather than just H[n]. Using the H arrays returned for the forward and reverse alignments, the optimal crossover point for the best alignment is computed as in algorithm MM (Algorithm 4). Once the optimal crossover point is known, two recursive calls are made to optimally align the top and bottom halves of A with the left and right parts of B. The approximate work at recursion level k ≥ 1 is O(2mn/2^k); summing over all levels, the total time is roughly 2mn.
In each level of recursion, the number of main memory accesses is dominated by those made in the calls to MScore. From the analysis for Score, it follows that when n is large, the number of accesses to main memory is ≈ 2mn/w · (1 + 1/2 + 1/4 + ···) ≈ 4mn/w. From the analysis for Diagonal, it follows that when m and n are large, the number of accesses to main memory is ≈ 3mn/w · (1 + 1/2 + 1/4 + ···) ≈ 6mn/w. From the analysis for Strip, it follows that when m and n are large, the number of accesses to main memory is ≈ 2mn/(wq) · (1 + 1/2 + 1/4 + ···) ≈ 4mn/(wq).

Parallel Score Algorithm
As remarked earlier for the Score algorithm, the elements in a row of the score matrix need to be computed sequentially from left to right because of data dependencies. So, we are unable to parallelize the inner for loop of Score (Algorithm 1). Instead, we adopt the unusual approach of parallelizing the outer for loop while computing each inner loop sequentially on a single processor. Initially, processor s is assigned to do the outer loop computation for i = s, 1 ≤ s ≤ p, where p is the number of processors. Processor s begins after a suitable time lag relative to the start of processor s − 1 so that the data it needs for its computation has already been computed by processor s − 1. That is, processor 1 begins the inner loop computation for i = 1 at time 0; then, with a suitable time lag, processor 2 begins the outer loop computation for i = 2; then, with a further lag, processor 3 begins the i = 3 computation, and so on. When a processor has finished its iteration-i computation, it starts on iteration i + p of the outer loop. Synchronization primitives are used to ensure suitable time lags. The time complexity of the resulting p-core algorithm PP_Score is O(mn/p).

Parallel Diagonal Algorithm
The inner for loop of Diagonal (Algorithm 2) is easily parallelized, as the elements on a diagonal are independent and may be computed simultaneously. So, in our parallel version, we divide diagonal d into p blocks, where p is the number of processors, and assign a block to each processor from left to right as in Figure 5. The time complexity of the resulting p-core algorithm PP_Diagonal is O(mn/p).

Parallel Strip Algorithm
In the Strip scoring algorithm, we partition the score matrix H into n/q strips of size q (Figure 3) and compute the strips one at a time from left to right. Inside a strip, scores are computed row by row from top to bottom. Observe that the computation of one strip can begin once the first row of the previous strip has been computed. In our parallel version of this algorithm, processor i is initially assigned to compute strip i. When computing a value in its assigned strip, a processor needs to wait until the values (if any) needed from the strip to its left have been computed. When a processor completes the computation of strip j, it proceeds with the computation of strip j + p. Figure 6 shows a possible state in the described parallel strip computation strategy. We maintain an array signal[] such that signal[r] = s + 1 iff the row r computation for strips 1 through s has been completed. This array enables the processor working on the strip to the right of strip s to determine when it can begin the computation of its r'th row. The time complexity of the resulting parallel strip algorithm, PP_Strip, is O(mn/p).

Parallel Alignment Algorithms
In the single-core implementation, we divide the H matrix into two equal-size parts and apply the scoring algorithm to each part. Then, we determine the optimal crossover point, where the sum of the scores from both directions is maximum. This crossover point is used to divide the matrix into two smaller score matrices, to which this decomposition strategy is recursively applied. The first application of this strategy yields two independent subproblems and, following an application of the strategy to each of these subproblems, we have 4 even smaller subproblems. Following k rounds, we have 2^k independent subproblems.
For the parallel versions of the alignment algorithms, we employ the following strategies: • When the number of independent matrices is small, each matrix is computed using the parallel scoring algorithms PP_Score, PP_Diagonal, and PP_Strip, with all p processors assigned to the computation of a single matrix. In other words, the matrices are computed in sequence.
• When the number of independent matrices is large, each matrix is computed using the single-core algorithms Score, Diagonal and Strip. Now, p matrices are concurrently computed.

Experimental Settings and Test Data
We implemented the single-core scoring and alignment algorithms in C and the multi-core scoring and alignment algorithms in C and OpenMP. The relative performance of these algorithms was measured on the following platforms: 1. Intel Xeon CPU E5-2603 v2 Quad-Core processor 1.8GHz with 10MB cache.
For convenience, we will, at times, refer to these platforms as Xeon4, Xeon6, and Xeon24 (i.e., the number of cores is appended to the name Xeon).
All codes were compiled using the gcc compiler with the O2 option. On our Xeon4 platform, we used the "perf" [21] software to measure energy usage through the RAPL interface. So, for this platform, we report cache misses and energy consumption as well as running time. For the Xeon6 and Xeon24 platforms, we provide the running time only.
For test data, we used randomly generated protein sequences as well as real protein sequences obtained from the Globin Gene Server [22] and DNA/RNA/protein sequences from the National Center for Biotechnology Information (NCBI) database [23]. We used the BLOSUM62 [1] scoring matrix for all our experiments. The results for our randomly generated protein sequences were comparable to those for similarly sized sequences from the two databases [22] and [23], so we present only the results for the latter data sets here.

Table 1 gives the number of cache misses on our Xeon4 platform for different sequence sizes. The last two columns of Table 1 give the percent reduction in the observed cache miss count of Strip relative to Score and Diagonal. Strip has the fewest cache misses, followed by Score and Diagonal (in this order). Strip reduces cache misses by up to 86.2% relative to Score and by up to 92.3% relative to Diagonal.

Table 2 gives the running times of our scoring algorithms on our Xeon4 platform. In the figure, the time is in seconds, while in the table, the time is given using the format hh:mm:ss. The table also gives the percent reduction in running time achieved by Strip relative to Score and Diagonal. As can be seen, on our Xeon4 platform, Strip is the fastest, followed by Score and Diagonal (in this order). Strip reduces the running time by up to 17.5% relative to Score and by up to 22.8% relative to Diagonal. The reduction in running time, while significant, is not as large as the reduction in cache misses, possibly due to the effect of cache prefetching, which reduces cache-induced computational delays.

Figure 9 and Table 3 give the CPU and cache energy consumed, in joules, on our Xeon4 platform. On our datasets, Strip required up to 18.5% less CPU and cache energy than Score and up to 25.5% less than Diagonal.
It is interesting to note that the energy reduction is comparable to the reduction in running time, suggesting a close relationship between running time and energy consumption for this application.

Parallel Scoring Algorithms

Figure 10 and Table 4 give the number of cache misses on our Xeon4 platform for our parallel scoring algorithms. PP_Strip has the fewest cache misses, followed by PP_Score and PP_Diagonal (in this order). PP_Strip reduces cache misses by up to 98.1% relative to PP_Score and by up to 99.1% relative to PP_Diagonal. We observe also that the total cache miss count for PP_Score is slightly higher than for Score on smaller instances and lower on larger instances. PP_Diagonal, on the other hand, consistently has more cache misses than Diagonal. PP_Strip exhibits a significant reduction in cache misses because we chose the strip width so that p strip rows fit in the cache. Most of the cache misses in Strip come from the vector that transfers boundary results from one strip to the next; when p strips are being worked on simultaneously, the inter-strip data to be transferred is often already in the cache, and so many of the cache misses incurred by the single-core algorithm are avoided. The remaining two algorithms do not allow this flexibility in choosing the segment size a processor works on; this size is fixed at O(n/p).

Figure 11 and Table 5 give the running times for our parallel scoring algorithms on our Xeon4 platform. In the figure, the time is in seconds, while in the table, the time is given using the format hh:mm:ss. As the table shows, PP_Strip is the fastest algorithm in practice; it is up to 40.0% faster than PP_Score and up to 38.4% faster than PP_Diagonal. Table 6 gives the speedup of each of our parallel scoring algorithms relative to its sequential counterpart.
As can be seen, the speedup of PP_Strip (i.e., Strip/PP_Strip) is between 3.92 and 3.98, which is quite close to the number of cores (4) on our Xeon4 platform. PP_Score achieves a speedup in the range 2.82 to 2.94, and the speedup for PP_Diagonal is in the range 3.12 to 3.21.

The excellent speedup exhibited by PP_Strip is due largely to our ability to greatly reduce cache misses for this algorithm. Figure 12 and Table 7 give the CPU and cache energy consumed, in joules, on our Xeon4 platform. On our datasets, PP_Strip required up to 41.2% less CPU and cache energy than PP_Score and up to 45.5% less than PP_Diagonal. Compared to the sequential scoring algorithms, the multi-core algorithms use more CPU power but less running time; since the power increase is less than the decrease in running time, energy consumption is reduced.

Table 8 gives the number of cache misses of our single-core alignment algorithms on our Xeon4 platform. Figure 14 and Table 9 give the running times of our single-core alignment algorithms on our Xeon4 platform. Figure 16 and Table 11 give the number of cache misses of our multi-core alignment algorithms on our Xeon4 platform. Figure 17 and Table 12 give the running times for our parallel alignment algorithms on the Xeon4 platform. PP_MMStrip is faster than PP_MM by up to 37.4% and faster than PP_MMDiagonal by up to 40.3%.

Table 13 gives the speedup of each parallel alignment algorithm relative to its single-core counterpart. The speedup achieved by PP_MMStrip (relative to MMStrip) ranges from 3.56 to 3.94, while that for PP_MM is in the range 2.77 to 2.88 and that for PP_MMDiagonal is in the range 2.53 to 2.81. Table 14 gives the CPU and cache energy consumption, in joules, of our multi-core alignment algorithms. On our datasets, PP_MMStrip required up to 29.9% less CPU and cache energy than PP_MM and up to 42.1% less than PP_MMDiagonal. Once again, the energy reduction is comparable to the reduction in running time, suggesting a close relationship between running time and energy consumption for this application.
Table 15 gives the running times of our single-core scoring algorithms on our Xeon6 platform. As can be seen, Strip is the fastest, followed by Score and Diagonal (in this order). Strip reduces running time by up to 14.3% relative to Score and by up to 22.4% relative to Diagonal. Figure 20 and Table 16 give the running times for our parallel scoring algorithms on our Xeon6 platform. As with Xeon4, PP_Strip is faster than PP_Score and PP_Diagonal and reduces the running time by up to 42.5% and 55.6%, respectively. Table 18 gives the speedup of each of our parallel algorithms relative to their single-core counterparts. PP_Strip achieves a speedup of up to 5.89, which is very close to the number of cores. The maximum speedups achieved by PP_Score and PP_Diagonal were 4.09 and 4.25, respectively. Figure 22 and Table 20 give the running times of our parallel alignment algorithms on the Xeon6. PP_MMStrip is faster than PP_MM and PP_MMDiagonal and reduces the running time by up to 39.9% and 44.8%, respectively. Table 21 gives the speedup of each of our parallel alignment algorithms relative to their single-core counterparts.

Figure 23 and Table 19 give the running times of our single-core scoring algorithms on our Xeon24 platform. As was the case on our other test platforms, Strip is the fastest, followed by Score and Diagonal (in this order). Strip reduces running time by up to 19.7% relative to Score and by up to 35.1% relative to Diagonal. Figure 24 and Table 22 give the running times for our parallel scoring algorithms on our Xeon24 platform. PP_Strip is faster than PP_Score and PP_Diagonal and reduces the running time by up to 61.4% and 76.2%, respectively. Table 23 gives the achieved speedup. PP_Strip scales quite well and achieves a speedup of up to 22.22.
The maximum speedups provided by PP_Score and PP_Diagonal are 11.36 and 9.56, respectively. Table 24 gives the running times of our single-core alignment algorithms on our Xeon24 platform. Figure 26 and Table 25 give the running times of our parallel alignment algorithms on Xeon24. As can be seen, PP_MMStrip is faster than PP_MM and PP_MMDiagonal; it reduces the running time by up to 47.3% and 84.6%, respectively. Table 26 gives the speedup of our parallel algorithms. PP_MMStrip achieves a speedup of up to 16.2 while PP_MM and PP_MMDiagonal have maximum speedups of 9.79 and 6.58, respectively.

Discussion
By accounting for the presence of caches in modern computers, we are able to arrive at sequence alignment algorithms that are considerably faster than those that do not take advantage of computer caches. Our benchmarking demonstrates the value of optimizing cache usage. Our cache-efficient algorithms Strip and MMStrip were the best-performing single-core algorithms, and their parallel counterparts were the best-performing parallel algorithms. Strip reduced running time by as much as 19.7% relative to the classical scoring algorithm Score due to Smith and Waterman [3].

Limitations
Our cache miss analyses assume a simple cache model in which there is a single LRU cache. In practice, computers have multiple levels of cache and employ sophisticated and proprietary cache replacement strategies. Despite the use of a simplified cache model for analysis, the developed cache-efficient algorithms perform very well in practice.

Conclusion
The main contributions of this paper are: 1. cache-efficient single-core and multi-core algorithms to determine the score of the best alignment; 2. cache-efficient single-core and multi-core algorithms to determine the best alignment.
The effectiveness of our cache-efficient algorithms has been demonstrated experimentally using three computational platforms. Future work includes developing the cache-efficient algorithms for other problems in computational biology.

Conflict of Interest
The authors declare no conflict of interest.