Computación y Sistemas

On-line version ISSN 2007-9737; Print version ISSN 1405-5546

Comp. y Sist. vol.23 n.1 Ciudad de México Jan./Mar. 2019  Epub Feb 26, 2021

https://doi.org/10.13053/cys-23-1-2999 


A Comparative Study of Sorting Algorithms with FPGA Acceleration by High Level Synthesis

Yomna Ben Jmaa 1,2,*

Rabie Ben Atitallah 2

David Duvivier 2

Maher Ben Jemaa 1

1 ENIS, University of Sfax, ReDCAD, Tunisia. Yomna.BenJmaa@etu.univ-valenciennes.fr, maher.benjemaa@enis.rnu.tn

2 Polytechnical University Hauts-de-France, France. Rabie.BenAtitallah@univ-valenciennes.fr, David.Duvivier@univ-valenciennes.fr


Abstract:

Nowadays, sorting is an important operation for several real-time embedded applications and one of the most widely studied problems in computer science. Applications such as avionic systems and decision support systems rely on a sorting algorithm in their implementation. However, sorting a large number of elements and/or real-time decision making requires high processing speed. Therefore, accelerating sorting algorithms using FPGAs is an attractive solution. In this paper, we propose efficient hardware implementations of different sorting algorithms (BubbleSort, InsertionSort, SelectionSort, QuickSort, HeapSort, ShellSort, MergeSort and TimSort) from high-level descriptions on the Zynq-7000 platform. In addition, we compare the performance of the different algorithms in terms of execution time, standard deviation and resource utilization. From the experimental results, we show that SelectionSort is 1.01-1.23 times faster than the other algorithms when N < 64; otherwise, TimSort is the best algorithm.

Keywords: FPGA; sorting algorithms; heterogeneous CPU/FPGA architecture; Zynq platform

1 Introduction

At present, Intelligent Transportation Systems (ITS) are advanced applications combining transport engineering, communication technologies and geographical information systems. These systems [15] play a significant part in minimizing the risk of accidents, traffic jams and pollution. ITS also improve transport efficiency and the safety and security of passengers. They are used in various domains such as railways, avionics and automotive technology. At different steps, several applications need sorting algorithms, such as decision support systems, path planning [6], scheduling and so on.

However, the complexity and the targeted execution platform(s) are the main performance criteria for sorting algorithms. Different platforms can be used, such as CPUs (single- or multi-core), GPUs (Graphics Processing Units), FPGAs (Field Programmable Gate Arrays) [14] and heterogeneous architectures.

Firstly, the FPGA is a highly attractive platform for implementing the sorting process. Industry frequently uses FPGAs in many real-time applications to improve performance in terms of execution time and energy consumption [4,5]. Moreover, an FPGA board makes it possible to build complex, high-performance applications: the large amount of available programmable fabric provides a substrate for massively-parallel architectures [5]. The increase in the complexity of applications has led to high-level design methodologies [10]. Hence, High-Level Synthesis (HLS) tools have been developed to improve the productivity of FPGA-based designs.

However, the first generation of HLS failed to meet hardware designers' expectations; several factors have nevertheless motivated researchers to continue producing powerful hardware design tools. Among these factors, we cite the sharp increase in silicon capability: recent designs tend to employ heterogeneous Systems on Chips (SoCs) and accelerators. The use of behavioral designs in place of Register-Transfer-Level (RTL) designs improves design productivity, reduces time-to-market constraints, and decouples the algorithm from the architecture to allow a wide exploration of implementation solutions [11]. Xilinx created the Vivado HLS tool [34] to generate hardware accelerators from a high-level programming language (C/C++).

Sorting is an important process for several real-time embedded applications. There are many sorting algorithms [17] in the literature, such as BubbleSort, InsertionSort and SelectionSort, which are simple to develop and implement but have weak performance (their time complexity is O(n²)). Several researchers have used MergeSort, HeapSort and QuickSort, with O(n log(n)) time complexity, to overcome the limited performance of these algorithms [17]. HeapSort starts by constructing a heap over the data; it then repeatedly removes the greatest element and puts it at the end of the partially-sorted array. QuickSort is very efficient in the partition phase for dividing the array in two; however, the selection of the pivot value is an important issue. MergeSort compares the elements at the current index of each sub-array, selects the smaller one, appends it to the output array, and thereby merges two sorted arrays.

Furthermore, other sorting algorithms improve on the previous ones. For example, ShellSort is an enhancement of the InsertionSort algorithm: it divides the list into sub-lists, which are sorted with the InsertionSort algorithm. Hybrid sorting algorithms emerged as a mixture of several sorting algorithms. For example, TimSort combines MergeSort and InsertionSort. These algorithms choose InsertionSort if the number of elements is lower than an optimal parameter (OP), which depends on the target architecture and the sorting implementation; otherwise, MergeSort is used, with some additional steps integrated to improve the execution time.

The main goal of our work is to create a new simulator which reproduces the behavior of elements/components related to intelligent transportation systems. In [2], the authors implemented an execution model for a future test bench and simulation; to this end, they proposed a new hardware and software execution support for the next generation of test and simulation systems in the field of avionics. In [33], the authors developed an efficient algorithm for helicopter path planning. They proposed different scheduling methods for optimizing the process on a real-time, high-performance heterogeneous CPU/FPGA architecture. The authors in [25] presented a new method of 3D path design using the concept of the "Dubins gliding symmetry conjecture". This method has been integrated in a real-time decisional system to solve the security problem.

In our research group, several researchers proposed a new adaptive approach for 2D path planning according to the density of obstacles in a static environment. They extended this approach into a new method of 3D path planning with multi-criteria optimization. The main objective of this work is to propose optimized hardware-accelerated versions of sorting algorithms (BubbleSort, InsertionSort, SelectionSort, HeapSort, ShellSort, QuickSort, TimSort and MergeSort) from high-level descriptions for avionic applications. We use several optimization steps to obtain an efficient hardware implementation in two different cases, software and hardware, for different vectors and permutations.

The paper is structured as follows: Section 2 presents several studies of different sorting algorithms using different platforms (CPU, GPU and FPGA). Section 3 shows the design flow of our application. Section 4 gives an overview of the sorting algorithms. Section 5 describes our architecture and a variety of optimizations of the hardware implementation. Section 6 shows the experimental results. We conclude in Section 7.

2 Background and Related Work

Sorting is a common process and one of the best-known problems in the computational world. Several algorithms are available in different research works. They can be organized in various ways:

  • - Depending on the algorithm time complexity: Table 1 presents three different cases (best, average and worst) of the complexity of several sorting algorithms. QuickSort, HeapSort, MergeSort and TimSort have an average-case complexity of O(n log(n)). Among these four algorithms, the worst-case performance is the O(n²) of QuickSort. A simple pretest (checking whether the input is already sorted) brings all algorithms to O(n) complexity in the best case.

  • - Depending on the target implementation platform: each platform, such as CPUs, GPUs, FPGAs and hybrid platforms, has a specific advantage. The FPGA is the best platform in terms of power consumption, while the CPU is the simplest platform for programmability; the GPU appears as a middle ground.

  • - The authors in [9] used MergeSort to sort up to 256M elements on a single Intel Q9550 quad-core processor with 4GB RAM in single-thread and multi-thread programs, whereas the authors in [29] considered the hybrid platform SRC6 (CPU Pentium 4 + FPGA Virtex-2) for implementing different sorting algorithms (RadixSort, QuickSort, HeapSort, Odd-EvenSort, MergeSort, BitonicSort) using 1,000,000 elements encoded in 64 bits [17].

  • - Depending on the number of elements to be sorted, we choose the corresponding sorting algorithm. For example, if the number of elements is small, then InsertionSort is selected; otherwise, MergeSort is recommended.

Table 1 Complexity of the sorting algorithms 

Algorithm | Best | Average | Worst
BubbleSort | O(n) | O(n²) | O(n²)
InsertionSort | O(n) | O(n²) | O(n²)
SelectionSort | O(n²) | O(n²) | O(n²)
HeapSort | O(n log(n)) | O(n log(n)) | O(n log(n))
ShellSort | O(n log(n)) | O(n (log(n))²) | O(n (log(n))²)
QuickSort | O(n log(n)) | O(n log(n)) | O(n²)
MergeSort | O(n log(n)) | O(n log(n)) | O(n log(n))
TimSort | O(n) | O(n log(n)) | O(n log(n))

Żurek et al. [35] proposed two different implementations: a parallel quick-merge algorithm and a hybrid algorithm (parallel bitonic sort on the GPU + sequential MergeSort on the CPU), using the OpenMP framework and CUDA. They compared the two new implementations for different numbers of elements. The obtained results show that the multicore sorting algorithms are the most scalable and the most efficient. The GPU sorting algorithms are up to four times faster than the optimized quicksort algorithm on a single core. The implemented hybrid algorithm (executed partially on CPU and GPU) is more efficient than algorithms run only on the GPU (despite transfer delays), but a little slower than the most efficient algorithm, the quick-merge parallel CPU algorithm.

Abirami et al. [1] presented an efficient hardware implementation of the MergeSort algorithm by designing digital circuits. They measured the efficiency, reliability and complexity of the MergeSort algorithm as a digital circuit, used only 4 inputs, and compared the efficiency of MergeSort to the bubble sort and selection sort algorithms. The limitation of this work is that only the MergeSort algorithm was implemented on the FPGA.

Pasetto et al. [26] presented parallel implementations of five sorting algorithms (MapSort, MergeSort, Tsigas-Zhang's Parallel QuickSort, Alternative QuickSort and STLSort) for large numbers of elements. They measured the performance of these algorithms using two different techniques (direct sorting and key/pointer sorting) on two machines: Westmere (Intel Xeon 5670, 2.93 GHz, 24 GB of memory) with 6 cores and Nehalem (Intel Xeon 5550, 2.67 GHz, 6 GB of memory) with 4 cores. They then compared the algorithms in terms of CPU affinity, throughput, microarchitecture analysis and scalability.

Based on the obtained results, the authors recommended MergeSort and MapSort when memory is not an issue. Both algorithms are advised for up to 100,000 elements, although they need an intermediate array. Danelutto et al. [12] presented an implementation of the Divide and Conquer pattern using OpenMP, the Intel TBB framework and the FastFlow parallel programming framework running on multicore platforms. They proposed a high-level tool for the fast prototyping of parallel Divide and Conquer algorithms. The obtained results show that the prototyped parallel algorithms reduce execution time while requiring minimal programming effort compared to hand-made parallelization.

Jan et al. [19] presented a new parallel algorithm, named the min-max butterfly network, for finding the minimum and maximum in large collections of elements. They presented a comparative analysis of the new parallel algorithm and three parallel sorting algorithms (odd-even sort, bitonic sort and rank sort) in terms of sorting rate, sorting time and speed, running on CPU and GPU platforms. The obtained results show that the new algorithm performs better than the three other algorithms.

Grozea et al. [25] accelerated existing comparison-based algorithms (MergeSort, bitonic sort, parallel insertion sort) (see, e.g., [22,28] for details) to work at the typical speed of a 1 Gbit/s Ethernet link by using parallel architectures (FPGAs, multi-core CPU machines and GPUs). The obtained results show that the FPGA platform is the most flexible, but it is less accessible. The GPU is very powerful but less flexible, difficult to debug, and its required data transfers increase the latency. The CPU is sometimes too slow in spite of the multiple cores and multiple CPUs, but it is the easiest to use.

Konstantinos et al. [16] proposed efficient hardware implementations of three image and video processing algorithms on Virtex-7 FPGAs, using high-level synthesis tools to improve performance. They focused only on the MergeSort algorithm for calculating the Kendall correlation coefficient. The obtained results show that the hardware implementation is 5.6x better than the software implementation. Janarbek et al. [16] proposed a new framework which provides ten basic sorting algorithms for different criteria (speed, area, power…) with the ability to produce hybrid sorting architectures. The obtained results show that these algorithms match the performance of existing RTL implementations when the number of elements is lower than 16K, whereas they outperform them for large arrays (16K-130K). We are not in this context because avionic applications need to sort at most 4096 elements issuing from previous calculation blocks.

Chen et al. [8] proposed a methodology for the hardware implementation of the bitonic sorting network on FPGA, optimizing energy and memory efficiency, latency and throughput to generate high-performance designs. They proposed a streaming permutation network (SPN) obtained by "folding" the classic Clos network. They showed that the SPN is programmable to achieve all the interconnection patterns in the bitonic sorting network. Reusing the SPN leads to a low-cost sorting design that uses a minimal amount of resources.

Koch et al. [21] proposed the implementation of a highly scalable sorter, after a careful analysis of existing sorting architectures, to surpass the performance of CPUs and GPUs. Moreover, they showed how partial run-time reconfiguration can improve resource utilization and processing rate.

Purnomo et al. [27] presented an efficient hardware implementation of the bubble sort algorithm, in both a serial and a parallel version. They compared the serial and parallel bubble sort in terms of memory, execution time and resource utilization (slices and LUTs). The experimental results show that the serial bubble sort uses less memory and fewer resources than the parallel bubble sort; in contrast, the parallel bubble sort is faster on the FPGA platform.

Other works on high-speed parallel schemes for data sorting on FPGA are presented in [13,4]. Parallel sorting was also studied by Sogabe [32] and Martinez [24]. Finally, [7] provides a comparative study of several sorting algorithms on FPGA, covering parallel MergeSort, parallel counting sort and parallel bubble sort.

The results obtained in a number of works that parallelize sorting algorithms on several architectures (CPU, GPU and FPGA) show that the speedup is proportional to the number of processors. These studies show that hardware implementation on FPGA gives better performance in terms of time and energy. In our case, however, we cannot parallelize to the maximum and reach such speedup values, because the HLS tool cannot extract enough parallelism when only a small number of elements needs to be sorted.

Hence, the HLS tool improves hardware accelerator productivity while reducing design time. A major advantage of HLS is the quick exploration of the design space to find the optimal solution: HLS optimization directives produce hardware IPs that trade off area against performance. We notice that several algorithms have not been implemented on FPGA. To this end, we choose in this work the FPGA platform and the HLS tool to improve performance and to select the best sorting algorithm.

We proposed in our previous work [20] an efficient hardware implementation of MergeSort and TimSort from high-level descriptions using the heterogeneous CPU/FPGA architecture. These algorithms are considered as part of a real-time decision support system for avionic applications. We compared the performance of the two algorithms in terms of execution time and resource utilization. The obtained results show that TimSort is faster than MergeSort when using an optimized hardware implementation.

In this paper, we propose an improvement of the previous work, based on the permutation generator proposed by Lehmer, to generate the different vectors and permutations used as input. We compare the 8 sorting algorithms (BubbleSort, InsertionSort, SelectionSort, ShellSort, QuickSort, HeapSort, MergeSort and TimSort) implemented on FPGA using 32- and 64-bit encoded data. Table 2 presents a comparative analysis of the discussed studies.

Table 2 Different context analysis approaches 

Approach | Algorithms (complexity) | Parallelization | High performance | CPU | GPU | FPGA | Avionic application | High-level tool
Grozea et al. [10] | MergeSort O(n log(n)); Bitonic sort O(n (log(n))²); Parallel InsertionSort O(n²) | Yes | Yes | Yes | Yes | Yes | No | No
Chhugani et al. [11] | MergeSort O(n log(n)) | Yes | Yes | Yes | No | No | No | No
Satish et al. [12] | RadixSort O(nK); MergeSort O(n log(n)) | Yes | Yes | No | Yes | No | No | No
Zurek et al. [13] | Parallel quick-merge; Hybrid algorithm (-) | Yes | Yes | Yes | Yes | No | No | Yes
Abirami et al. [14] | MergeSort O(n log(n)) | Yes | Yes | No | No | Yes | No | No
Pasetto et al. [15] | MapSort O(n/P log(n)); MergeSort O(n log(n)); Tsigas-Zhang's O(n log(n)); Alternative QuickSort O(n log(n)); STLSort (-) | Yes | Yes | Yes | No | No | No | No
Danelutto et al. [16] | Divide and conquer O(n log(n)) | Yes | Yes | Yes | No | No | No | Yes
Jan et al. [17] | min-max butterfly (-) | Yes | Yes | Yes | Yes | No | No | No
Konstantinos et al. [20] | MergeSort O(n log(n)) | Yes | Yes | Yes | No | Yes | No | Yes
Chen et al. [21] | BitonicSort O(n (log(n))²) | Yes | Yes | No | No | Yes | No | No
Koch et al. [22] | Highly scalable sorter (-) | Yes | Yes | Yes | Yes | No | No | No
Purnomo et al. [23] | BubbleSort O(n²) | Yes | Yes | No | No | Yes | No | No
[24, 25, 26, 27, 28] | MergeSort O(n log(n)); Parallel CountingSort O(n+k); Parallel BubbleSort O(n²) | Yes | Yes | No | No | Yes | No | No
Previous work [29] | MergeSort O(n log(n)); TimSort O(n log(n)) | Yes | Yes | Yes | No | Yes | Yes | Yes
Our work | BubbleSort O(n²); InsertionSort O(n²); SelectionSort O(n²); QuickSort O(n log(n)); HeapSort O(n log(n)); MergeSort O(n log(n)); ShellSort O(n (log(n))²); TimSort O(n log(n)) | Yes | Yes | Yes | No | Yes | Yes | Yes

3 Design Flow

To accelerate the different applications (real-time decision system, 3D path planning), Figure 1 presents the general structure of the design organization. Firstly, we have a set of tasks programmed in C/C++ to be executed on a heterogeneous CPU/FPGA platform. After that, we optimize the application via High-Level Synthesis (HLS) optimization directives for an efficient hardware implementation. In addition, we optimize these algorithms by another method that runs the same function in parallel on different inputs. Next, the optimized program is divided into software and hardware tasks using different metaheuristics or a Modified HEFT (Heterogeneous Earliest-Finish Time, MHEFT) heuristic, while respecting application constraints and resource availability; this step is proposed in [33]. The hardware tasks are generated from C/C++ using the Vivado HLS tool. Finally, the software and hardware tasks communicate over a bus when running on the FPGA board. This communication is set up with the ISE or Vivado tools to obtain an efficient implementation.

Fig. 1 General Design flow 

4 Overview of Sorting Algorithms

4.1 BubbleSort

Bubble sort [31,24] is a simple and well-known algorithm in the computational world.

However, it is inefficient for sorting a large number of elements because its complexity is high: O(n²) in the average and worst cases. Bubble sort can be decomposed into four parts. The first is the top level, which iterates over the whole input. The second swaps two inputs if tab[j] > tab[j+1]. The comparator part compares two inputs, and finally a subtractor subtracts two inputs to determine the larger number inside the comparator component. These steps are repeated until the sorted array is obtained (Figure 2).

Fig. 2 BubbleSort algorithm 
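As an illustration, here is a minimal C++ sketch of this scheme; the function and variable names (bubble_sort, tab, n) are ours, not the paper's exact HLS code.

```cpp
// Minimal BubbleSort sketch (C-style, as typically fed to an HLS tool).
void bubble_sort(int tab[], int n) {
    for (int i = 0; i < n - 1; i++) {          // top-level pass over the input
        for (int j = 0; j < n - 1 - i; j++) {
            if (tab[j] > tab[j + 1]) {         // comparator part
                int tmp = tab[j];              // swap part
                tab[j] = tab[j + 1];
                tab[j + 1] = tmp;
            }
        }
    }
}
```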

4.2 SelectionSort

The selection sort algorithm [18] is simple to analyze compared to other algorithms. It is a very easy sorting algorithm to understand and is very useful when dealing with a small number of elements. However, it is inefficient for sorting a large number of elements because its complexity is O(n²), where n is the number of elements in the table. This algorithm is called SelectionSort because it works by selecting the minimum element at each step of the sort. It first fixes the minimum value at index 0: it searches for the minimum element in the remaining list and swaps it with the value at the current position. After that, it increments the current index and repeats until the sorted array is obtained (Figure 3).

Fig. 3 SelectionSort Algorithm 
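A minimal sketch of this selection step, under the same illustrative naming assumptions as above:

```cpp
// Minimal SelectionSort sketch; fixes the minimum at index i on each pass.
void selection_sort(int tab[], int n) {
    for (int i = 0; i < n - 1; i++) {
        int min_idx = i;
        for (int j = i + 1; j < n; j++) {
            if (tab[j] < tab[min_idx]) min_idx = j;  // select the minimum
        }
        int tmp = tab[i];                            // swap it into position i
        tab[i] = tab[min_idx];
        tab[min_idx] = tmp;
    }
}
```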

4.3 InsertionSort

InsertionSort [31] is another simple algorithm, used for sorting a small number of elements, as shown in Figure 4. It nevertheless performs better than BubbleSort and SelectionSort. InsertionSort is less efficient when sorting a large number of elements, for which more advanced algorithms such as QuickSort, HeapSort and MergeSort are required, because its complexity is O(n²) in the average and worst cases. The InsertionSort algorithm integrates a new element at each iteration and compares it with the values of the elements already in the sorted part of the list; elements greater than the new one are shifted to make room for it. This step is repeated for the n-1 remaining items.

Fig. 4 InsertionSort Algorithm 
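A minimal sketch of the insertion step described above (again with illustrative names, not the paper's code):

```cpp
// Minimal InsertionSort sketch; inserts tab[i] into the sorted prefix tab[0..i-1].
void insertion_sort(int tab[], int n) {
    for (int i = 1; i < n; i++) {
        int key = tab[i];
        int j = i - 1;
        while (j >= 0 && tab[j] > key) {  // shift larger elements right
            tab[j + 1] = tab[j];
            j--;
        }
        tab[j + 1] = key;                 // insert the new element
    }
}
```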

4.4 ShellSort

ShellSort [30] is a very efficient sorting algorithm for a moderate number of elements and is an improvement of the InsertionSort algorithm, as it allows elements positioned further apart to be exchanged. The average-case and worst-case complexities of this algorithm are O(n (log(n))²). The algorithm computes a gap value h and divides the list into sub-lists of elements h positions apart. After that, it sorts each sub-list using InsertionSort. Finally, it repeats this step with decreasing gaps until a sorted list is obtained. ShellSort is not widely used in the literature.

Fig. 5 ShellSort Algorithm 
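A minimal sketch, assuming the classic gap sequence h = n/2, n/4, ..., 1 (the paper does not state which gap sequence it uses):

```cpp
// Minimal ShellSort sketch; each gap-h pass is an insertion sort over
// elements h positions apart.
void shell_sort(int tab[], int n) {
    for (int h = n / 2; h > 0; h /= 2) {   // decreasing gap sequence
        for (int i = h; i < n; i++) {
            int key = tab[i];
            int j = i;
            while (j >= h && tab[j - h] > key) {  // gapped insertion
                tab[j] = tab[j - h];
                j -= h;
            }
            tab[j] = key;
        }
    }
}
```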

4.5 QuickSort

QuickSort [23] is based on a partitioning operation: the algorithm divides a large array into two sub-arrays holding the lower and the higher elements. It proceeds in several steps:

  1. Select an element from the array, named the pivot.

  2. Partitioning: move all smaller elements' values before the pivot and all bigger elements' values after it. After this step the pivot is in its final position; this is called the partition operation.

  3. Recursively repeat these steps for the two sub-arrays of smaller and greater elements.

Therefore, it is a divide and conquer algorithm. QuickSort is faster in practice than algorithms such as BubbleSort or InsertionSort because its complexity is O(n log(n)). The implementation of QuickSort is not stable and it is a complex sort, but it is among the fastest sorting algorithms in practice.

The most complex problem in QuickSort is selecting a good pivot element. Indeed, if at each step QuickSort selects the median as the pivot it will obtain a complexity of O(n log n), but a bad selection of pivots can always lead to poor performance (O(n 2) time complexity), cf. Figure 6.

Fig. 6 QuickSort Algorithm 
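A minimal recursive sketch using the Lomuto partition with the last element as pivot; this pivot choice is purely illustrative, since, as noted above, the paper treats pivot selection as an open issue:

```cpp
// Minimal QuickSort sketch (Lomuto partition, last element as pivot).
int partition(int tab[], int lo, int hi) {
    int pivot = tab[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        if (tab[j] < pivot) {                       // smaller values go left
            int tmp = tab[i]; tab[i] = tab[j]; tab[j] = tmp;
            i++;
        }
    }
    int tmp = tab[i]; tab[i] = tab[hi]; tab[hi] = tmp;  // pivot to final place
    return i;
}

void quick_sort(int tab[], int lo, int hi) {
    if (lo < hi) {
        int p = partition(tab, lo, hi);
        quick_sort(tab, lo, p - 1);    // recurse on the smaller values
        quick_sort(tab, p + 1, hi);    // recurse on the greater values
    }
}
```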

4.6 HeapSort

HeapSort [3] is based on the same principle as SelectionSort: it searches for the maximum element in the list and places this element at the end, then repeats this procedure for the remaining elements. It is a good in-place sorting algorithm because its complexity is O(n log(n)). The algorithm is divided into two basic steps:

  • - Creating a Heap data structure (Max-Heap or Min-Heap) whose first element is the greatest or the smallest (depending on Max-Heap or Min-Heap).

  • - Repeating this step on the remaining elements, each time selecting the first element of the heap and placing it at the end of the array, until a sorted array is obtained. HeapSort is fast but it is not a stable sorting algorithm. It is very widely used to sort a large number of elements.

Fig. 7 HeapSort Algorithm 
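A minimal Max-Heap sketch of these two steps (illustrative names, not the paper's code):

```cpp
// Minimal HeapSort sketch: build a Max-Heap, then repeatedly move the
// maximum to the end of the array and restore the heap property.
void sift_down(int tab[], int start, int end) {
    int root = start;
    while (2 * root + 1 <= end) {
        int child = 2 * root + 1;                          // left child
        if (child + 1 <= end && tab[child] < tab[child + 1]) child++;
        if (tab[root] >= tab[child]) return;               // heap restored
        int tmp = tab[root]; tab[root] = tab[child]; tab[child] = tmp;
        root = child;
    }
}

void heap_sort(int tab[], int n) {
    for (int i = n / 2 - 1; i >= 0; i--) sift_down(tab, i, n - 1);  // build heap
    for (int end = n - 1; end > 0; end--) {
        int tmp = tab[0]; tab[0] = tab[end]; tab[end] = tmp;  // max to the end
        sift_down(tab, 0, end - 1);                           // restore heap
    }
}
```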

4.7 MergeSort

MergeSort [20] was introduced by John von Neumann in 1945. The implementation of this algorithm preserves the relative order of equal elements from input to output; it is therefore an efficient and stable algorithm. It is based on the famous divide and conquer paradigm. The steps of this algorithm are: 1- divide the array into two sub-arrays, 2- sort these two sub-arrays recursively, 3- merge the two sorted arrays to obtain the result. It compares favorably with the HeapSort, QuickSort, ShellSort and TimSort algorithms because its complexity is O(n log(n)) in the average and worst cases, and also O(n log(n)) in the best case, as shown in Figure 8.

Fig. 8 MergeSort Algorithm 
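A minimal top-down sketch of these three steps over the half-open range [lo, hi), using an auxiliary buffer tmp (our names); stability comes from taking the left element on ties:

```cpp
// Minimal MergeSort sketch; call as merge_sort(tab, tmp, 0, n) with a
// caller-provided buffer tmp of at least n elements.
void merge_sort(int tab[], int tmp[], int lo, int hi) {
    if (hi - lo < 2) return;             // 0 or 1 element: already sorted
    int mid = (lo + hi) / 2;
    merge_sort(tab, tmp, lo, mid);       // 1- sort the left half
    merge_sort(tab, tmp, mid, hi);       // 2- sort the right half
    int i = lo, j = mid;                 // 3- merge the two sorted halves
    for (int k = lo; k < hi; k++)
        tmp[k] = (j >= hi || (i < mid && tab[i] <= tab[j])) ? tab[i++] : tab[j++];
    for (int k = lo; k < hi; k++) tab[k] = tmp[k];
}
```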

4.8 TimSort

TimSort [20] is based on the MergeSort and InsertionSort algorithms. Its principle is to switch between these two algorithms depending on the value of the optimal parameter (OP), which is fixed to 64 for the Intel i7 processor architecture. The execution time is almost unchanged on parallel architectures when the value of OP is varied; we therefore set OP to 64 in this work, as several studies use this standard value. The algorithm follows one of two paths depending on the number of elements to be sorted: if the size of the array is greater than or equal to 64 elements, then MergeSort is applied; otherwise, InsertionSort is selected in the sorting step, as shown in Figure 9.

Fig. 9 TimSort Algorithm [20]  
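A simplified sketch of this OP-based switch, reusing the InsertionSort and MergeSort sketches from Sections 4.3 and 4.7; real TimSort additionally detects natural runs and uses galloping merges, which are omitted here:

```cpp
// Simplified TimSort-style dispatch on the optimal parameter OP = 64.
void insertion_sort(int tab[], int n);                  // see Section 4.3
void merge_sort(int tab[], int tmp[], int lo, int hi);  // see Section 4.7

const int OP = 64;  // optimal parameter, fixed to 64 as in the paper

void tim_sort(int tab[], int tmp[], int n) {
    if (n < OP)
        insertion_sort(tab, n);      // small arrays: InsertionSort
    else
        merge_sort(tab, tmp, 0, n);  // large arrays: MergeSort-based path
}
```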

5 Optimized Hardware Implementation

In this section, we present the different optimizations applied to the sorting algorithms defined in the previous section using HLS directives, with 64-bit data. Then, we explain our execution architecture. Finally, we propose several input data sets generated with the Lehmer method in order to reach a final decision.

5.1 Optimization of Sorting Algorithms

In order to obtain an efficient hardware implementation, we applied the following optimization steps to the C code of each sorting algorithm (a sketch combining them is given after the list):

  • - Loop unrolling: the elements of the table are stored in BRAM memory, which exposes two physical ports. The two ports can be configured as dual write ports, dual read ports, or one port for each operation. We exploit this by unrolling the loops in the design by a factor of 2. For example, if only write operations are executed in a loop, the two ports of the array can be configured as write ports, and consequently the loop can be unrolled by a factor of 2.

  • - Loop pipelining: loop iterations are pipelined in order to reduce the execution time. In our design, successive iterations start with only one clock cycle difference in between, by applying loop pipelining with an Initiation Interval (II) of 1. The tool schedules the loop execution so as to satisfy this condition.

  • - Input/output interface: input/output ports are configured to use the AXI-Stream protocol for data transfer with a minimum of communication signals (DATA, VALID and READY). Also, the AXI-Lite protocol is employed for design configuration purposes, for example to determine the system's current state (start, ready, busy).
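The sketch below shows how these three directives can be combined in a Vivado HLS top function. The pragma spellings are standard Vivado HLS directives, but the function name, buffer size and stream typing are our assumptions rather than the paper's code:

```cpp
// Illustrative Vivado HLS top function combining the three optimizations.
#include "ap_axi_sdata.h"
#include "hls_stream.h"

#define N 4096                         // assumed maximum array size
typedef ap_axis<64, 0, 0, 0> axi_t;    // 64-bit AXI-Stream beat

void sort_top(hls::stream<axi_t> &in, hls::stream<axi_t> &out) {
#pragma HLS INTERFACE axis port=in          // AXI-Stream data ports
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE s_axilite port=return // AXI-Lite control (start/ready/busy)
    ap_int<64> buf[N];
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=2  // two accesses/cycle

read_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1                   // one iteration per clock cycle
        buf[i] = in.read().data;
    }

    // ... sorting kernel here, with its inner loops unrolled by a factor
    //     of 2 via "#pragma HLS UNROLL factor=2" to exploit both BRAM ports ...

write_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        axi_t v;
        v.data = buf[i];
        v.last = (i == N - 1);              // mark the end of the stream
        out.write(v);
    }
}
```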

5.2 Hardware Architecture

Today, heterogeneous architectures show a lot of promise for extracting high performance by combining a reconfigurable FPGA hardware accelerator with a classic architecture. The use of such an architecture brings a significant improvement in performance and energy efficiency. In the literature, different heterogeneous architectures such as Intel HARP, Microsoft Catapult and Xilinx Zynq are used. This kind of architecture promises massive parallelism for further improvement in hardware acceleration with FPGA technology. In fact, FPGA technology can offer better performance (power consumption, time…), up to 10 times better compared to the CPU. We use in this work a Xilinx Zynq board, which offers two different modes of communication: AXI4-Lite and AXI4-Stream.

We choose the AXI4-Stream protocol in this paper because it is one of the AMBA protocols designed to carry streams of arbitrary-width data, here 32/64 bits, in hardware. These data are generally represented as vectors on the software side and can be transferred 4 bytes per cycle. The AXI4-Stream protocol is designed for high-speed streaming data: this mode of transfer supports unlimited data burst sizes and provides point-to-point streaming without using addresses. However, it is necessary to fix a starting address to begin a transaction between the processor and the hardware IP. Typically, the AXI4-Stream interface is used with a DMA (Direct Memory Access) controller to transfer bulk data from the processor to the FPGA, as presented in Figure 10. An interrupt signal is raised when a packet of data has been transferred, so that the associated channel can initialize a new transfer. In our design, the scatter-gather engine option of the AXI DMA interface is disabled.

Fig. 10 Hardware architecture 

Figure 10 shows how the HLS IP (HLS is sorting) is connected to the design. The input data are stored in the memory of the processing element (DDR Memory). They are transferred from the processing system (ZYNQ) to the HLS core (HLS Sorting Algorithm) through AXI-DMA communication. After the data are sorted, the result is written back through the reverse path.

5.3 Choice of the Input Data

In this part, we propose different input data sets, which are stored in several file systems. Firstly, we present the management of files on an SD card (Secure Digital Memory Card), which retains strong portability and practicality from the FPGA side. SD cards are easily interfaced with FPGAs and are a widely used portable storage medium. Nowadays, several studies show that an SD card controller used with an FPGA plays an important role in different domains. These works are based on the use of an API (Application Programming Interface), the AHB bus (Advanced High-performance Bus), etc., and are dedicated to achieving ultra-high-speed communication between the SD card and the upper systems. All the communication is synchronous to a clock provided by the host (FPGA). The file system design and implementation on an SD card provides three major innovations:

  • - The integration and combination of the SD card controller and the file system gives a system which is highly integrated and convenient.

  • - The utilization of file management makes processing easier. In addition, it improves the overall efficiency of the systems.

  • - The digital design provides a high performance and it allows a better portability since it is independent of the platform.

In this paper, we implemented the different algorithms using data encoded in 8 bytes, which is another option for optimizing the performance. Several studies in computer science on sorting algorithms use the notion of permutations as input.

5.3.1 Permutation

Firstly, permutations are used in different combinatorial optimization problems, especially in the field of mathematics. Generally, a permutation is an arrangement of a set of n objects {1, 2, 3, ..., n} into a specific order, in which each element occurs exactly once. For example, there are six permutations of the set {1, 2, 3}, namely (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2) and (3,2,1). The number of permutations depends only on n: there are exactly n! of them. Secondly, a permutation π is a bijective function from the set {1, 2, ..., n} to itself (i.e., each element i of a set S has a unique image j in S and appears exactly once as an image value). We also denote by pos_π(i) the position of the element i in the permutation π, by π(i) the element at position i in π, and by S_n the set of all permutations of size n. Among the many methods used to generate permutations, the Lehmer method is used in this paper.
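As a sketch of the idea, the following function decodes an index k in [0, n!) into the k-th permutation via the factorial number system, which is the usual reading of Lehmer's generator; the function name and container choice are our assumptions, not the paper's implementation:

```cpp
#include <vector>

// Decode index k in [0, n!) into the k-th permutation of {1, ..., n}:
// the factorial-base digits of k select, one by one, which remaining
// element comes next (Lehmer code decoding).
std::vector<int> lehmer_permutation(unsigned long long k, int n) {
    std::vector<int> pool(n), perm(n);
    for (int i = 0; i < n; i++) pool[i] = i + 1;   // remaining elements 1..n

    unsigned long long fact = 1;
    for (int i = 2; i < n; i++) fact *= i;          // fact = (n-1)!

    for (int i = 0; i < n; i++) {
        int digit = (int)(k / fact);                // next factorial-base digit
        k %= fact;
        perm[i] = pool[digit];                      // pick and remove it
        pool.erase(pool.begin() + digit);
        if (n - 1 - i > 1) fact /= (n - 1 - i);     // (remaining-1)!
    }
    return perm;
}
```

For example, with n = 3 the indices k = 0, ..., 5 yield exactly the six permutations listed above, from (1,2,3) to (3,2,1).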

6 Experimental Results

In this section, we present our results on execution time and resource utilization for the sorting algorithms on the software and hardware architectures. We compare our results for different cases: software and optimized hardware, for several permutations and vectors. A set of R=50 replications is obtained for each case and each permutation/vector. The array size ranges from 8 to 4096 integers encoded in 4 and 8 bytes. As previously mentioned, we limited the size to 4096 elements because the best sorting algorithm is mainly intended for real-time decision support systems in avionic applications, which sort at most 4096 actions issuing from previous calculation blocks.

We developed our hardware implementation on a ZedBoard platform. The hardware architecture was synthesized using the Vivado suite 2015.4 with the default synthesis/implementation strategies. Firstly, we compared the execution times of the sorting algorithms on a software architecture (an Intel Core i3-350M processor). The frequency of this processor is 1.33 GHz. Table 3 reports the average execution time of each algorithm for array sizes ranging from 8 to 4096 elements with 50 replications (R=50).

Table 3 Execution time of different algorithms 

N BubbleSort (µs) InsertionSort (µs) SelectionSort (µs) HeapSort (µs) QuickSort (µs) ShellSort (µs) MergeSort (µs) TimSort (µs)
8 1.57 1.074 1.243 2.296 2.316 1.514 2.639 1.818
16 5.725 2.781 3.730 5.040 4.725 4.471 4.253 2.870
32 24.375 9.011 15.293 12.371 13.197 12.184 14.641 13.084
64 96.871 28.516 41.079 48.304 41.555 42.830 32.052 34.702
128 432.990 98.884 164.989 83.592 93.241 122.110 52.753 86.434
256 1037.802 403.147 580.504 296.508 299.465 329.927 159.046 192.016
512 3779.371 1394.601 2127.602 468.886 1136.260 1145.126 275.380 507.135
1024 12290.912 4548.036 7249.245 1504.325 3291.473 3528.034 774.106 1140.902
2048 47304.790 24829.288 27876.093 2093.825 10978.576 12180.592 1554.525 1911.241
4096 181455.365 66851.367 107568.834 4281.869 41848.195 46796.158 2985.149 4098.597

Figure 11 and Table 3 show that the BubbleSort, InsertionSort and SelectionSort algorithms have a high execution time when the array size is greater than 64; below that, InsertionSort is the best algorithm for arrays of 64 elements or fewer.

Fig. 11 Execution time in software (µs) 

Hence, we compared the execution times of only five algorithms (ShellSort, QuickSort, HeapSort, MergeSort and TimSort) for the average case, as shown in Figure 12. We conclude that MergeSort is 1.9x faster than QuickSort, 1.37x faster than HeapSort, 1.38x faster than TimSort and 1.9x faster than ShellSort when running on an Intel Core i3 processor (Figure 12).

Fig. 12 Execution time of the HeapSort, QuickSort, MergeSort, ShellSort and TimSort algorithms on a processor 

Second, we compared the performance of the sorting algorithms in terms of standard deviation, as shown in Table 4. Finally, we measured the resource utilization of the sorting algorithms (Table 5) and noted that BubbleSort, InsertionSort and SelectionSort consume few resources; however, since these three algorithms have a high time complexity, ShellSort offers the best trade-off in terms of resource utilization (Slice, LUT, FF and BRAM) (Figures 14, 15). From these results, it is concluded that MergeSort is the best algorithm in software.

Table 4 Standard deviation of different algorithms 

N BubbleSort InsertionSort HeapSort MergeSort QuickSort SelectionSort ShellSort TimSort
8 2753 2173 3636 3198 2989 2486 2770 2911
16 4786 3654 4524 4102 4413 3941 4410 3660
32 7310 5671 6075 6713 6301 9772 6123 6400
64 112167 9209 23719 9919 58629 9895 69133 10317
128 36686 10277 13835 7573 13875 42599 104016 10721
256 182061 302001 557507 19687 431402 83059 74621 75778
512 844048 757201 168523 116047 757716 284322 389055 782516
1024 2519380 676104 1239939 409297 437650 1854144 1362460 1164588
2048 4825091 1993282 210328 85225 2207375 2400166 585655 243134
4096 9895405 3179333 756228 159427 3462193 6339409 1431203 539362

Table 5 Utilization resources of the sorting algorithms 

Algorithm Slices LUT FF BRAM
BubbleSort 57 173 159 2
InsertionSort 55 183 132 2
SelectionSort 95 306 238 2
ShellSort 107 321 257 2
QuickSort 188 546 512 6
HeapSort 211 569 576 2
MergeSort 278 706 797 6
TimSort 998 3054 2364 69

Fig. 13 Standard deviation of different algorithms running on Software 

Fig. 14 Resources utilization of different algorithms 

Fig. 15 BRAM memory of different algorithms 

After that, we used HLS directives in order to improve the performance of the different algorithms. We measured the execution time of each optimized hardware implementation of the sorting algorithms. We compared the algorithms using different array sizes (8-4096) and several permutations (47 permutations) and vectors (47 vectors), generated with the permutation generator proposed by Lehmer and encoded in 32 and 64 bits, to analyze the performance of each algorithm. Hence, we address multicriteria sorting for use in real-time decision support systems.

We measured the execution time of the sorting algorithms for array sizes ranging from 8 to 4096 elements. The permutations are generated using the Lehmer code and encoded in 4 or 8 bytes. Table 6 and Table 7 report the minimum, average and maximum execution times for each algorithm in 4 bytes and 8 bytes respectively. Figure 16 shows the execution time of the sorting algorithms for the average case using elements encoded in 4 bytes; for N smaller than 64, it includes a zoom of the part framed in red. SelectionSort is 1.01x-1.23x faster than the other sorting algorithms when N ≤ 64.

Table 6 Execution time of the sorting algorithms encoded in 4 bytes (µs) 

BubbleSort (µs) InsertionSort (µs) SelectionSort (µs) ShellSort (µs) QuickSort (µs) HeapSort (µs) MergeSort (µs) TimSort (µs)
N Min Avg Max Min Avg Max Min Avg Max Min Avg Max Min Avg Max Min Avg Max Min Avg Max Min Avg Max
8 11.244 17.302 23.181 11.036 17.106 22.98 10.78 16.84 22.29 11.816 17.88 23.768 12.55 18.68 24.6 11.903 17.99 23.8 12.84 18.93 24.847 12.12 18.12 24
16 15.934 22.009 27.897 14.11 20.206 26.102 13.914 20 25.88 16.1 22.25 28.1 17.16 23.22 29.11 15.77 21.85 27.73 17.52 23.6 29.5 16 22.07 27.95
32 32.317 38.763 44.647 24.02 30.15 36.03 23.85 29.9 35.25 27.12 33.25 39.14 28.3 34.3 40.2 24.97 31.1 37.02 28.36 34.48 40.36 27.51 31.18 37
64 97.11275 103.255 109.136 59.44 65.56 77.185 59.13 65.25 71.13 54.12 60.24 66.12 59.6 62.8 68.6 46.66 52.78 58.66 52.33 58.465 64.34 44.8 50.89 56.7
128 348.937 358.255 360.94 190.94 196.96 202.84 191 197.2 203 120.2 126.3 132.1 131.7 137.9 143.7 96.535 102.665 108.54 105.71 111.88 117.76 88.39 94.5 100.3
256 1344.71 1350.84 1356.72 708.1 714.21 720.039 701 707.2 713 281.8 293.55 293.8 342.7 351.2 357.07 222.47 228.57 234.45 223.1 231.3 237.25 182.9 189 197.9
512 5304.07 5310.11 5315.92 2711.18 2717.29 2723.17 2698.4 2711.4 2717.1 662.16 668.2 674.1 984.1 1001.3 1007 459.92 466.05 471.93 476.4 782.5 788.4 387 393.17 399
1024 21095.1 21101.2 21107.01 10629.9 10635.7 10645.4 10649 10655 10661.1 1607.6 1625.1 1803 2881 3121 3127 1001.8 1015.5 1024.6 1025.4 1031.5 1037.4 826.75 832.85 837.2
2048 32767 83852.9 84177.8 31733.1 42209.7 42215.5 32767 42288 42294 4021.6 4027.67 4033.48 4232 10660 10665 2221.04 2227.1 2233 2182.1 2211.5 2217.3 1763.4 1769.5 1775
4096 32767 333077 336274.5 31744.8 168307 168313 32767 168531 168536 9750.4 9756.4 9762 20630 101507 101512 4810.5 4848.5 4854.2 4728.9 4734.9 4740.7 3750.1 3756 3761.9

Table 7 Execution time of the sorting algorithms encoded in 8 bytes (µs) 

BubbleSort (µs) InsertionSort (µs) SelectionSort (µs) ShellSort (µs) QuickSort (µs) HeapSort (µs) MergeSort (µs) TimSort (µs)
N Min Avg Max Min Avg Max Min Avg Max Min Avg Max Min Avg Max Min Avg Max Min Avg Max Min Avg Max
8 11.312 17.447 23.385 11.042 17.15 22.95 10.94 17.01 22.88 11.92 18 23.87 12.56 18.67 24.57 11.87 17.96 23.85 12.79 18.89 24.78 12.12 18.13 24.01
16 16.32 22.419 28.341 14.38 20.48 26.37 14.36 20.5 26.48 16.2 22.5 28.5 17.4 23.5 29.38 16.048 22.16 28.08 17.91 24.02 29.91 16.38 22.43 28.29
32 33.433 39.53 45.416 24.73 30.82 36.69 24.55 30.31 36.5 27.8 33.9 39.78 28.9 34.9 40.8 25.673 31.76 37.636 28.95 35.05 40.9 25.74 31.8 40.86
64 98.482 104.75 110.45 60.79 66.89 72.03 60.5 66.61 72.48 55.49 61.57 67.47 58 64.14 70.02 48.05 54.15 60.027 53.75 59.86 65.74 46.18 52.28 58.14
128 353.24 359.35 365.236 193.64 199.75 205.63 193.8 199.9 205.8 123 129.1 135 134.6 140.6 146.5 99.36 105.47 111.35 108.4 114.5 120 91.22 97.3 103.21
256 1350.31 1356.42 1362.3 710.56 717.07 802.5 704.7 718.2 724.1 292.9 299 304.9 345.5 351.7 357.5 222.47 228.57 234.45 228.1 234.45 240.12 188.5 194.5 200.4
512 5315.43 5321.45 5327.25 2722.62 2728.8 2716.5 2722.5 2728.6 2735.6 670.4 676.6 682.5 482.5 489 494.8 471.27 477.38 484.32 484.7 490.8 496.7 398.3 404.5 410.3
1024 20926.15 21243.01 21140.12 10652.9 10656.11 10664.8 10672 10678.2 10684 1642.5 1648.7 1654.6 2893 3132.8 3138.7 1035.47 1041.58 1047.7 1048.1 1054.2 1060.1 849.6 855.7 861.8
2048 32767 84217.69 84223.49 31734.6 41802.5 42264.7 32767 41525.3 46392 4067 4073 4078.8 4254.9 10682 10688 2244.7 2272.5 2278.43 2250.9 2257 2262.9 1809 1815 1821
4096 32767 336356 336361.8 30728 168554 168578 32767 168618 168624 9814 9820.5 9826.3 67153 38599 38684 4930 4936.1 4941.95 4812.4 4819.5 4825.3 3837 3843.9 3849.7

Fig. 16 Comparison between the sorting algorithms using data encoded in 4 bytes 

In contrast, Figure 17 shows the execution time of the sorting algorithms when N > 64. We note that BubbleSort, InsertionSort, SelectionSort and QuickSort have a high execution time. Thereafter, we compared only the other four algorithms to choose the best algorithm in terms of execution time and standard deviation. In addition, we calculated the standard deviation of the different sorting algorithms for N ≤ 64; Tables 8 and 9 show that the standard deviations are almost the same. Having rejected the BubbleSort, InsertionSort, SelectionSort and QuickSort algorithms for N > 64, we compared the standard deviation and execution time of HeapSort, MergeSort, ShellSort and TimSort. Consequently, TimSort has the best standard deviation both for N ≤ 64 and for N > 64, as shown in Table 9 and Figure 18.

Fig. 17 Comparison between sorting algorithms when N > 64 

Table 8 Standard deviation for sorting algorithms (N ≤ 64) 

N BubbleSort InsertionSort SelectionSort QuickSort HeapSort MergeSort ShellSort TimSort
8 3.47244 3.474 3.485 3.504 3.486 3.491 3.4752 3.471
16 3.4789 3.481 3.507 3.4849 3.4777 3.483 3.4742 3.4712
32 3.474 3.474 3.476 3.4757 3.475 3.4749 3.4763 3.4731
64 3.475 3.4743 3.4762 3.4739 3.474 3.475 3.4747 3.4724

Table 9 Standard deviation for sorting algorithms with N > 64 

N HeapSort MergeSort ShellSort TimSort HeapSort MergeSort ShellSort TimSort
128 3.475 3.4752 3.4745 3.4736 3.486 3.491 3.4752 3.471
256 3.474 3.4737 3.4758 3.4731 3.4777 3.483 3.4742 3.4712
512 3.6695 3.707 3.472 3.4714 3.475 3.4749 3.4763 3.4731
1024 4.082 3.479 3.56 3.4714 3.474 3.475 3.4747 3.4724

Fig. 18 Comparison of execution time between TimSort, MergeSort, HeapSort and ShellSort 

Figure 18 shows the execution time of the TimSort, MergeSort, HeapSort and ShellSort algorithms for the average case. For example, when N=4096, the execution times were 3756 µs, 4734.9 µs, 4848.5 µs and 9756.4 µs for TimSort, MergeSort, HeapSort and ShellSort respectively. For a large number of elements, we conclude that TimSort is 1.12x-1.21x faster than MergeSort, 1.03x-1.22x faster than HeapSort and 1.15x-1.61x faster than ShellSort running on the FPGA (Figure 18). Table 7 presents the execution time of each algorithm using elements encoded in 8 bytes; the results are almost the same as those obtained with elements encoded in 4 bytes. From Tables 6 and 7, it is observed that TimSort has the smallest execution time in every case. Thus, the performance of the sorting algorithms depends on the permutations or vectors used.

Moreover, we notice that the execution times of Tables 6 and 7 for the hardware implementation are much lower than those on the Intel processor (Table 3) once the difference in operating frequency is taken into account. For example, when N=2048, the hardware implementation of TimSort took 1815 µs at 50 MHz, whereas the software implementation took 61,926 µs at 2260 MHz. In other words, even though the FPGA runs at 50 MHz, far below the 2260 MHz of the processor, the FPGA implementation remains considerably faster.

From the software and hardware implementations, we conclude that BubbleSort, InsertionSort and SelectionSort have a high execution time for a large number of elements. In addition, we conclude that MergeSort is the best algorithm in software execution and TimSort on the FPGA when N > 64; otherwise, InsertionSort is much faster than the other algorithms running on the processor, and SelectionSort is the fastest on the hardware platform. Table 10 shows the resource utilization of BubbleSort, InsertionSort, SelectionSort, ShellSort, QuickSort, HeapSort, MergeSort and TimSort. We can see that TimSort consumes more resources than the other algorithms: around 44% more slice LUTs and around 9% more slice registers in the hardware implementation. We conclude that increasing the performance of an algorithm in terms of execution time increases its resource utilization on the FPGA.

Table 10 Utilization of resources of the sorting algorithms 

BubbleSort InsertionSort SelectionSort ShellSort QuickSort HeapSort MergeSort TimSort
N LUT FF BRAM LUT FF BRAM LUT FF BRAM LUT FF BRAM LUT FF BRAM LUT FF BRAM LUT FF BRAM LUT FF BRAM
8 242 121 1 219 131 0 336 203 1 356 224 0 733 541 2 524 444 0 955 877 0 5299 2480 35
16 176 136 1 226 139 0 333 214 1 355 232 0 750 552 2 526 459 1 985 883 0 5326 2498 35
32 189 136 1 187 115 0.5 351 225 1 314 208 0.5 776 564 2 541 474 1 856 809 1.5 5293 2482 35
64 199 149 1 201 123 0.5 363 236 1 319 216 0.5 814 608 2 556 489 1 871 815 1.5 5320 2499 35.5
128 215 162 1 212 131 0.5 373 247 1 337 224 0.5 820 588 2 583 504 1 885 821 1.5 5487 2516 35.5
256 228 175 1 221 139 0.5 365 258 1 345 232 0.5 835 600 2 606 519 1 893 827 1.5 5468 2535 35
512 238 188 1 239 147 0.5 394 269 1 359 240 0.5 841 612 2 611 534 1 925 849 1.5 5494 2552 35.5
1024 245 201 1 248 155 1 407 280 1 365 248 1 860 624 3 651 551 1 936 855 3 5545 2569 36
2048 259 216 2 265 163 2 418 293 2 367 258 2 909 638 6 669 566 2 942 861 6 5668 2586 38
4096 291 229 4 274 171 4 434 304 4 384 266 4 976 682 15.5 689 581 4 954 867 12 5841 2602 44

7 Conclusion

In this paper, we presented optimized hardware implementations of sorting algorithms to improve the performance in terms of execution time, using different numbers of elements (8-4096) encoded in 4 and 8 bytes. We used a High-Level Synthesis tool to generate the RTL design from a behavioral description. We also used multi-criteria sorting inputs, which contain several actions in rows and different criteria in columns.

A comparative analysis of the various results obtained by these sorting algorithms was carried out for two different cases (software and optimized hardware) and two parameters (execution time and resource utilization) on a Zynq-7000 platform. From the results, it is concluded that BubbleSort, InsertionSort and QuickSort have a high execution time. In addition, TimSort is 1.12x-1.21x faster than MergeSort, 1.03x-1.22x faster than HeapSort and 1.15x-1.61x faster than ShellSort when N ≥ 64, using the optimized hardware with many permutations/vectors. In contrast, when N < 64, SelectionSort is 1.01x-1.23x faster than the other sorting algorithms.

As future work, we plan to use the hardware TimSort algorithm in the avionics field and in the hardware implementation of a decisional system. The next step is to include the best solution in hardware path planning algorithms.

References

1. Abirami, R. (2014). VHDL implementation of merge sort algorithm. International Journal of Computer Science and Communication Engineering, pp. 1-6.

2. Afonso, G. (2013). Vers une nouvelle génération de systèmes de test et de simulation avionique dynamiquement reconfigurables. Ph.D. thesis, Université des Sciences et Technologie de Lille - Lille I.

3. Al-Jaloud, E., Al-Aqel, H., & Badr, G. (2014). Comparative performance evaluation of heap-sort and quick-sort algorithms. International Journal of Computing Academic Research, pp. 39-57.

4. Alquaied, F., Almudaifer, A. I., & AlShaya, M. (2011). A novel high-speed parallel sorting algorithm based on FPGA. IEEE International Conference on Electronics, Communications and Photonics (SIECPC), pp. 1-4.

5. Baklouti, M., Marquet, P., Aydi, Y., & Dekeyser, J. (2010). Scalable mpNoC for massively parallel systems - design and implementation on FPGA. Journal of Systems Architecture, pp. 278-292.

6. Bárcenas, E., Benítez-Guerrero, E., Benitez, A., De la Calleja, J., Medina, M., et al. (2017). Branching path planning with modal logics. Computación y Sistemas, Vol. 21, No. 3, pp. 407-418.

7. Bique, S., Anderson, W., Lanzagorta, M., & Rosenberg, R. (2008). Sorting using the Xilinx Virtex-4 field programmable gate arrays on the Cray XD1. Proc. Cray User Group, pp. 1-12.

8. Chen, R., Siriyal, S., & Prasanna, V. (2015). Energy and memory efficient mapping of bitonic sorting on FPGA. Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 240-249.

9. Chhugani, J., Nguyen, A., Lee, V., Macy, W., Hagog, M., Chen, Y., Baransi, A., Kumar, S., & Dubey, P. Efficient implementation of sorting on multi-core SIMD CPU architecture.

10. Cong, J., Liu, B., Neuendorffer, S., Noguera, J., Vissers, K., & Zhang, Z. (2011). High-level synthesis for FPGAs: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 473-491.

11. Coussy, P., Gajski, D., Meredith, M., & Takach, A. (2009). An introduction to high-level synthesis. IEEE Design & Test of Computers, pp. 8-17.

12. Danelutto, M., De Matteis, T., Mencagli, G., & Torquati, M. (2016). A divide-and-conquer parallel pattern implementation for multicores. ACM International Workshop on Software Engineering for Parallel Systems, pp. 10-19.

13. Dong, S., Wang, X., & Wang, X. (2009). A novel high-speed parallel scheme for data sorting algorithm based on FPGA. IEEE International Congress on Image and Signal Processing, pp. 1-4.

14. Fernández de Bulnes, D. R. & Maldonado, Y. (2018). Comparación de algoritmos evolutivos multi-objetivo para síntesis de alto nivel en dispositivos FPGA. Computación y Sistemas, Vol. 22, No. 2.

15. Figueiredo, L., Jesus, I., Machado, J., Ferreira, J., & Martins de Carvalho, J. (2001). Towards the development of intelligent transportation systems. IEEE Intelligent Transportation Systems, pp. 1206-1211.

16. Georgopoulos, K., Chrysos, G., Malakonakis, P., Nikitakis, A., Tampouratzis, N., Dollas, A., Pnevmatikatos, D., & Papaefstathiou, Y. (2016). An evaluation of Vivado HLS for efficient system design. IEEE International Symposium ELMAR, pp. 195-199.

17. Harkins, J., El-Ghazawi, T., El-Araby, E., & Huang, M. (2005). Performance of sorting algorithms on the SRC 6 reconfigurable computer. IEEE International Conference on Field-Programmable Technology, pp. 295-296.

18. Jadoon, S., Solehria, S. F., Rehman, S., & Jan, H. (2011). Design and analysis of optimized selection sort algorithm. International Journal of Electrical & Computer Sciences (IJECS-IJENS), pp. 16-22.

19. Jan, B., Montrucchio, B., Ragusa, C., Khan, F., & Khan, O. (2012). Fast parallel sorting algorithms on GPUs. AIRCC International Journal of Distributed and Parallel Systems, pp. 107-118.

20. Jmaa, Y. B., Ali, K., Duvivier, D., Jemaa, M. B., & Atitallah, R. B. (2017). An efficient hardware implementation of TimSort and MergeSort algorithms using high level synthesis. IEEE International Conference on High Performance Computing & Simulation (HPCS), pp. 580-587.

21. Koch, D. & Torresen, J. (2011). FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 45-54.

22. Krueger, T., Gehl, C., Rieck, K., & Laskov, P. (2008). An architecture for inline anomaly detection. IEEE International Conference on Computer Network Defense, pp. 11-18.

23. Manca, E., Manconi, A., Orro, A., Armano, G., & Milanesi, L. (2016). CUDA-quicksort: an improved GPU-based implementation of quicksort. Concurrency and Computation: Practice and Experience, pp. 21-43.

24. Martinez, J., Cumplido, R., & Feregrino, C. (2005). An FPGA-based parallel sorting architecture for the Burrows-Wheeler transform. IEEE International Conference on Reconfigurable Computing and FPGAs, pp. 7-13.

25. Nikolajevic, K. (2016). Système décisionnel dynamique et autonome pour le pilotage d'un hélicoptère dans une situation d'urgence. Ph.D. thesis, Valenciennes.

26. Pasetto, D. & Akhriev, A. (2011). A comparative study of parallel sort algorithms. ACM International Conference Companion on Object Oriented Programming Systems Languages and Applications Companion, pp. 203-204.

27. Purnomo, D. J., Arinaldi, A., Priyantini, D. T., Wibisono, A., & Febrian, A. (2016). Implementation of serial and parallel bubble sort on FPGA. Jurnal Ilmu Komputer dan Informasi, pp. 113-120.

28. Rieck, K. & Laskov, P. (2007). Language models for detection of unknown attacks in network traffic. Journal in Computer Virology, pp. 243-256.

29. Satish, N., Harris, M., & Garland, M. (2009). Designing efficient sorting algorithms for manycore GPUs. IEEE International Symposium on Parallel & Distributed Processing, pp. 1-10.

30. Shahzad, B. & Afzal, M. (2007). Enhanced shell sorting algorithm. Computer Journal of Enformatika, pp. 66-70.

31. Sodhi, T., Kaur, S., & Kaur, S. (2013). Enhanced insertion sort algorithm. International Journal of Computer Applications.

32. Sogabe, Y. & Maruyama, T. (2014). FPGA acceleration of short read mapping based on sort and parallel comparison. 24th IEEE International Conference on Field Programmable Logic and Applications (FPL), pp. 1-4.

33. Souissi, O. (2015). Planification de chemin d'hélicoptères sur une architecture hétérogène CPU FPGA haute performance. Ph.D. thesis, Valenciennes.

34. Srivastava, A., Chen, R., Prasanna, V. K., & Chelmis, C. (2015). A hybrid design for high performance large-scale sorting on FPGA. IEEE International Conference on ReConFigurable Computing and FPGAs (ReConFig), pp. 1-6.

35. Żurek, D., Pietroń, M., Wielgosz, M., & Wiatr, K. (2013). The comparison of parallel sorting algorithms implemented on different hardware platforms. Computer Science, pp. 679-691.

Received: August 15, 2018; Accepted: August 27, 2018

* Corresponding author is Yomna Ben Jmaa. Yomna.BenJmaa@etu.univ-valenciennes.fr

This is an open-access article distributed under the terms of the Creative Commons Attribution License.