1 Introduction
The computational problem of solving large sparse linear systems of equations shows up in several computer science subdisciplines. An emblematic application of sparse linear systems occurs in the process of finding the PageRank vector of the so-called Google matrix, which is a sparse Markov matrix with about ten billion (≈2^{33}) rows and columns. Besides its famous Google application, the PageRank algorithm has found its way into problems in the areas of biology and social network analysis, among others [24].
Cryptography is another important application where the problem of solving large sparse linear systems shows up.
Indeed, when one wants to factorize extremely large integers, or to solve the Discrete Logarithm Problem (DLP) using state-of-the-art index-calculus methods, one needs to solve high-dimensional linear algebra problems [1, 4].
For example, solving the DLP over small-characteristic finite fields can lead to thousands of linear algebra problems, some of them containing up to 266,086 equations and variables and up to 289 million (≈2^{28.1}) non-zero entries in the corresponding sparse matrix.
Further, the sought solution for this system of equations must be exact because the matrices arising from these cryptanalysis applications are defined over the integers modulo a prime number [2, 10, 9, 13, 16].
In the context of cryptographic applications the aforementioned linear algebra problem can be formally stated as follows. Let ℬ be an N×N square matrix defined over a finite field F_p, where p is a large odd prime. The linear algebra problem addressed in this work consists of finding a non-trivial vector w defined over F_p (with w≠0) such that:

ℬ ⋅ w = 0,   (1)
where both w and 0 in Equation (1) are considered to be N×1 vectors with integer entries in the range [0, p−1]. We stress that Eq. (1) has non-zero solutions if and only if ℬ is a singular matrix, i.e., the determinant of ℬ is zero. Equivalently, Eq. (1) has non-zero solutions if and only if ℬ does not have full rank.
The space of solutions of Eq. (1) is sometimes referred to as the kernel or null-space of the matrix ℬ. The families of matrices studied in this paper will always have a kernel of dimension one, which means that, up to scalar multiples, there is only one non-trivial solution vector w that annihilates ℬ. In other words, there is an essentially unique vector w such that the matrix-vector product ℬ⋅w is equal to the zero vector.
Moreover, the cases of interest in this paper are matrices with a dimension N of tens, and even hundreds, of thousands of rows and columns.
At the same time, the average per-row density of non-zero elements in these matrices will be very low. In the examples analyzed in this work this non-zero row density, which in the remainder of this paper will be denoted by λ, will always be less than one thousand non-zero entries, where an overwhelming majority of these entries take a small integer value.
Eq. (1) can be solved using standard Gaussian elimination at a computational cost of O(N^3).
The main drawback of this approach is that, as the Gaussian elimination algorithm proceeds, the matrix being processed becomes denser and denser, which implies that its sparseness gets completely destroyed after a few iterations. This situation negatively affects both the computational and the memory cost of solving this problem. For instance, a dense version of the matrices computed in this paper may need up to 8 terabytes of memory per matrix.
Fortunately, several algorithms have been reported in the literature that can find the kernel of a matrix without manipulating it, accessing it instead through a so-called black-box model [19]. Some of the most relevant algorithms that solve Eq. (1) without perturbing the sparseness of the matrix ℬ include: structured Gaussian elimination (also known as intelligent Gaussian elimination) [15], the Conjugate Gradient and Lanczos algorithms [21, 29], and Wiedemann algorithm [17, 19, 20, 28].
In this paper we will study the latter approach, which can find the solution vector w of Eq. (1) at a computational cost of about 3N matrix-vector products, where each such product costs about λ⋅N integer additions and multiplications. This yields an approximate overall cost of some 3λ⋅N^2 integer additions and multiplications.
Furthermore, Wiedemann algorithm has the additional advantage of being amenable to parallelization, and it is Las Vegas randomized, in the sense that it never outputs an incorrect solution, although it might sometimes fail to provide a non-trivial one [19].
Without loss of generality, in this work we consider that all the arithmetic is performed modulo an 804-bit prime number p. Since contemporary processor architectures have a 64-bit word size, it becomes necessary to represent the operands as an array of n = ⌈804/64⌉ = 13 sixty-four-bit words. Moreover, every arithmetic operation (addition, multiplication, etc.) must give an integer result within the range [0, p−1].
Hence, modular reduction should be periodically applied after one or more integer arithmetic operations have been performed [12].
As mentioned above, the core operation of Wiedemann algorithm is the multiplication of a random-looking column vector of dimension N by each row of the matrix being handled.
Since the matrix-vector product is the single most demanding operation in Wiedemann algorithm, we carefully optimized the computation of this operation both for CPU and for GPU platforms.
Our contributions can be described as follows. We present the efficient computation of the kernels associated with 21,870 linear algebra systems that happen to be large and sparse and that are defined over an 804-bit finite field. To solve this challenging task, we describe the specifics of our Wiedemann algorithm implementation in a multi-core environment hosted by the supercomputer ABACUS, and in a many-core GPU architecture. The results achieved in this work were a crucial building block for the record computation of the discrete logarithm problem over the field F_{3^{6·509}} reported by Adj et al. in [2].
The remainder of this paper is organized as follows. In Sec. 2, the operand representation and field arithmetic employed in this work are briefly explained. In Sec. 3, a basic version of Wiedemann algorithm and several parallel variants are discussed. Then, in Sec. 4, we report the main implementation aspects related to the computation of the kernels of thousands of large sparse matrices on the supercomputer ABACUS and other platforms. Finally, we draw our concluding remarks in Sec. 5.
2 Finite Field Arithmetic
2.1 Field Element Representation
As mentioned in the Introduction, all the arithmetic operations are performed over a finite field F_p. Following the setting specified in [2], we use the 804-bit prime p = (3^{509} − 3^{255} + 1)/7. Since our CPU servers have a 64-bit word size, we represent the field elements using an integer array of ⌈804/64⌉ = 13 words, as depicted in Figure 1. Notice that the most significant word of this array only utilizes 36 bits.
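For concreteness, a field element can be pictured in C as follows; this is only an illustrative sketch, and the type and constant names (fp_t, FP_NWORDS) are ours, not identifiers taken from the actual source code.

```c
#include <stdint.h>

/* Illustrative sketch: a field element a in [0, p-1] is stored in radix 2^64
 * as 13 limbs, with a = sum_{i=0}^{12} limb[i] * 2^(64*i).  Since p is an
 * 804-bit prime, the most significant limb, limb[12], uses only 36 of its
 * 64 bits. */
#define FP_NWORDS 13

typedef struct {
    uint64_t limb[FP_NWORDS];   /* little-endian limbs: limb[0] is the least significant */
} fp_t;
```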
2.2 Field Multiplication
Let a, b ∈ F_p. A classical method for obtaining the modular multiplication defined as c = (a⋅b) mod p consists of performing the integer multiplication t = a⋅b first, followed by a reduction step c = t mod p. It may appear that this reduction step involves a division by p, which is a relatively expensive operation.
One can do better by using the Barrett reduction algorithm [5]. The Barrett reduction algorithm is based on the following observation: given t = Q⋅p + R, where 0 ≤ R < p, the quotient Q = ⌊t/p⌋ can be written as:

Q = ⌊ (t / b^{k−1}) ⋅ (b^{2k} / p) ⋅ (1 / b^{k+1}) ⌋.
For efficiency reasons, the method requires precomputing a per-modulus constant parameter:

μ = ⌊ b^{2k} / p ⌋,
where b is usually selected as a power of two close to the processor word size, and k = ⌊log_b p⌋ + 1.
Notice that upon finding Q, the remainder R can be obtained as R = t − Q⋅p ≡ t mod p.
Algorithm 1 presents the Barrett reduction, where the remainder R is computed at a cost of no more than two 804-bit integer multiplications.
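To make the flow of Algorithm 1 concrete, the following toy C routine applies the same steps to a single-word modulus (b = 2^{32}, k = 1); it is only an illustration of the method on scaled-down operands, not the 13-word routine used in our code, and the identifiers are ours.

```c
#include <stdint.h>

/* Toy Barrett reduction for a single-word modulus p < 2^32, with b = 2^32 and
 * k = 1 (so b^(k-1) = 1 and b^(k+1) = 2^64).  The constant mu = floor(2^64 / p)
 * is precomputed once per modulus.  The 804-bit code follows the same steps
 * on 13-word operands. */
static uint32_t barrett_reduce(uint64_t t, uint32_t p, uint64_t mu)
{
    uint64_t q1 = t;                                   /* q1 = floor(t / b^(k-1))            */
    uint64_t q  = (uint64_t)(((unsigned __int128)q1 * mu) >> 64);
                                                       /* q = floor(q1*mu / b^(k+1)), an
                                                          approximation of floor(t / p)      */
    uint64_t r  = t - q * p;                           /* 0 <= r < 3p                        */
    while (r >= p)                                     /* at most two corrective subtractions */
        r -= p;
    return (uint32_t)r;
}
```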
2.3 Main Matrix-Vector Product Arithmetic Operations
The core arithmetic operation needed to perform a matrix-vector product is v_i = (v_i + α⋅w_i) mod p. This operation involves the computation of additions, multiplications, and a modular reduction over the integers.
For the set of matrices considered in this work, the coefficients α happen to be small values. Indeed, in some matrices, for any given row, all the non-zero entries but the first one are equal to one, while the first entry is an 804-bit integer.
We took advantage of this fact (as shown in Fig. 2) by means of the function row_multqu, which starts a row-vector multiplication by computing a field multiplication with the function Fp_mulC. The result of this multiplication is stored in the accumulator C. Then, all the remaining non-zero values are added to the accumulator C using the in_addN function.
Taking advantage of the so-called lazy reduction technique, this addition function does not perform any modular reduction at all. Therefore, at the end of row_multqu's loop, our Barret_Reduction function is invoked to guarantee that the final result (labelled out) is an integer in [0, p−1].
For the families of matrices where α has a small value different from one, we compute the multiplication α⋅w_i using a shift-and-add strategy. This way, we avoid costly full field multiplications as much as possible.
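To illustrate the lazy-reduction pattern just described, the following scaled-down sketch performs a row-vector product for a word-size modulus, reusing the toy barrett_reduce shown in the previous subsection; the identifiers are illustrative, and the actual row_multqu of Fig. 2 works on 804-bit operands with a multi-word accumulator, but the control flow is the same.

```c
#include <stdint.h>

/* Toy row-vector product with lazy reduction for a modulus p < 2^32.  The row
 * is given by its non-zero column indices mat_ind[0..row_len-1]; as in the
 * Cubics/Quartics matrices, every entry but the first equals one, and
 * first_val holds the first entry. */
static uint32_t row_vector_product(const uint32_t *mat_ind, uint32_t first_val,
                                   uint32_t row_len, const uint32_t *w,
                                   uint32_t p, uint64_t mu)
{
    /* the only full modular multiplication of the row */
    uint64_t acc = barrett_reduce((uint64_t)first_val * w[mat_ind[0]], p, mu);

    /* lazy reduction: the remaining entries are 1, so we simply accumulate
     * the corresponding vector coordinates without reducing modulo p */
    for (uint32_t j = 1; j < row_len; j++)
        acc += w[mat_ind[j]];

    /* a single Barrett reduction at the very end of the row */
    return barrett_reduce(acc, p, mu);
}
```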
3 Wiedemann Algorithm
In this section, we describe our strategy for computing the kernel of thousands of sparse matrices.
We begin by describing the mathematics behind the Wiedemann algorithm, and then we present the main parallel strategies followed to compute the kernels of thousands of large sparse matrices.
We target multi-core CPU servers, the supercomputer ABACUS [7], and GPU Kepler architecture TITAN cards.
3.1 Basic Version of Wiedemann Algorithm
Let w be a randomly chosen column N-vector defined over F_p, and let u be the unit row N-vector whose first entry is equal to one and all other entries are zero. Then, given an N×N matrix ℬ defined over F_p, a basic version of Wiedemann algorithm [28] computes the 2N-term Krylov sequence S defined as:

S = { u⋅ℬ^i⋅w },  for i = 1, 2, …, 2N.
Notice that each one of the coefficients in the sequence S is an integer in the field F_p. In a second phase, the Berlekamp-Massey algorithm [6, 22] can be employed to find the minimal polynomial f_{ℬ,w} from the sequence S with high probability, at a computational cost of O(N^2) field operations. The monic polynomial:

f_{ℬ,w}(x) = x^l + f_{l−1}⋅x^{l−1} + ⋯ + f_{δ+1}⋅x^{δ+1} + f_δ⋅x^δ,  with f_δ ≠ 0,   (2)
is known as the minimum l-degree polynomial of w, with l < N [20]. In particular, the polynomial f_{ℬ,w} satisfies the following relations:

f_{ℬ,w}(ℬ)⋅w = ℬ^δ⋅(ℬ^{l−δ} + f_{l−1}⋅ℬ^{l−1−δ} + ⋯ + f_δ⋅I)⋅w = 0.
The exponent δ is strictly greater than zero for singular matrices ℬ. Then, defining ŵ = (ℬ^{l−δ} + f_{l−1}⋅ℬ^{l−1−δ} + ⋯ + f_δ⋅I)⋅w, there exists an integer t with 1 ≤ t ≤ l such that ℬ^t⋅ŵ = 0 and w = ℬ^{t−1}⋅ŵ ≠ 0, which implies that:

ℬ^t⋅ŵ = ℬ⋅(ℬ^{t−1}⋅ŵ) = 0;
hence, ℬ⋅w=0.
Therefore, in a third phase of Wiedemann algorithm, the minimum polynomial f_{ℬ,w} shown in Eq. (2) is evaluated, and this evaluation yields the desired solution w. Summarizing, Wiedemann algorithm requires the computation of three main phases [8, 9]:
1. Krylov sequence: To obtain the Krylov sequence S as defined above, we must compute 2N matrix-vector products of column N-vectors by the matrix ℬ. When we compute the first of such products, we get as output a vector w′, whose first entry corresponds to the first coefficient of S. The column N-vector w′ then becomes the input vector for the next matrix-vector product. After computing the second product, its first entry is obtained and stored in the second position of the sequence S, and so on, until the 2N coefficients of the Krylov sequence S are obtained. The Krylov phase just described is illustrated in Figure 3 (see also the toy sketch after this list). We stress that S is computed without manipulating the matrix ℬ. Furthermore, Appendix A depicts the C coded function krylov, used to compute this phase of the algorithm, where the product ℬ^{2N}⋅w is computed by iteratively exploiting the identity

ℬ^i⋅w = ℬ⋅(ℬ^{i−1}⋅w),  for i = 1, 2, …, 2N.

Once again, the first coefficient of each one of the 2N matrix-vector products shown above must be sequentially appended to the Krylov sequence S.
2. Obtaining the minimum polynomial f_{ℬ,w}: From the Krylov sequence S, the Berlekamp-Massey algorithm [22] can be used to find the corresponding minimum polynomial f_{ℬ,w}. The Berlekamp-Massey algorithm is a classical procedure that has been extensively studied in the literature (see for example [6, 8, 26]). The complexity of the Berlekamp-Massey algorithm is O(N^2). However, a fast version based on a variant of the extended greatest common divisor algorithm, known as the half-gcd algorithm, can compute the minimum polynomial with a complexity of only O(N⋅log(N)^2) [15, §3.5.2].
3. Minimum polynomial evaluation: This phase finds the solution ŵ (see Eq. (2)). If ŵ ≠ 0, then for some integer t with 1 ≤ t < N, ℬ^t⋅ŵ = 0 but w = ℬ^{t−1}⋅ŵ ≠ 0, which implies that ℬ⋅w = 0. Let us recall that the minimum polynomial is evaluated by iteratively performing matrix-vector multiplications until the output vector is equal to zero. This process executes at most N−1 matrix-vector multiplications.
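As a complement to the function krylov of Appendix A, the following toy routine shows the data flow of the Krylov phase (phase 1 above) for a small dense matrix modulo a word-size prime; all identifiers are illustrative, and the actual code of course operates on the sparse 804-bit systems.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy Krylov phase: B is a dense N x N matrix (row-major) over Z/pZ with
 * p < 2^63, w is the random start vector (overwritten at every iteration),
 * and S receives the 2N coefficients of the Krylov sequence, i.e. the first
 * entry of each successive matrix-vector product. */
static void krylov_sequence(uint64_t *S, const uint64_t *B, uint64_t *w,
                            uint32_t N, uint64_t p)
{
    uint64_t *v = malloc((size_t)N * sizeof *v);        /* scratch vector v = B*w  */
    for (uint32_t i = 0; i < 2 * N; i++) {
        for (uint32_t r = 0; r < N; r++) {
            uint64_t acc = 0;
            for (uint32_t c = 0; c < N; c++)            /* row-vector product mod p */
                acc = (acc + (uint64_t)(((unsigned __int128)B[(size_t)r * N + c] * w[c]) % p)) % p;
            v[r] = acc;
        }
        S[i] = v[0];                                    /* first entry extends S    */
        memcpy(w, v, (size_t)N * sizeof *w);            /* feed the next iteration  */
    }
    free(v);
}
```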
Hence, the computational cost of Wiedemann algorithm is dominated by phases 1 and 3, with an approximate overall cost of about 3N matrix-vector multiplications. The basic version described here can be accelerated by applying parallel strategies, as explained next.
3.2 Exploiting Parallelism Opportunities in Wiedemann Algorithm
In 1994, Coppersmith famously presented a parallel version of Wiedemann algorithm [8]. In the following years, Kaltofen and Villard published in [18, 27] a comprehensive analysis of Coppersmith's block version of this procedure. Arguably the first full implementation of this algorithm was reported by Kaltofen and Lobo in [19].
In a nutshell, Coppersmith's idea was to simultaneously process n random vectors w and to store m coefficients of the corresponding matrix-vector product, with m ≥ n. To this end, the vector w becomes an N×n matrix of n column vectors, whereas the vector u becomes a canonical (or highly sparse) matrix of dimension m×N. Coppersmith then ingeniously generalized the Berlekamp-Massey algorithm, observing that the linear recurrence can be determined by the first ⌈N/m⌉ + ⌈N/n⌉ coefficients of the sequence S.
This setting has two advantages. Firstly, the total number of iterations is reduced from 3N matrix-vector products (for the sequential version) to ⌈N/n⌉ + ⌈N/m⌉ matrix-block-of-n-vectors products [15, §3.5.3]. Secondly, the computation of each one of the matrix-block-of-n-vectors products can be readily parallelized using n different cores, with or without shared memory capabilities.2 If, however, only one core is available to process the kernel of a matrix ℬ, then the advantage of this approach significantly diminishes.
Let us recall that for the main case study of this work, we are interested in computing the kernels of thousands of matrices that were obtained from the discrete logarithm attack over the finite field F_{3^{6·509}} reported in [2]. Furthermore, we had access to several multi-core servers as well as the supercomputer ABACUS [7] (see Table 1 for more details). Under this scenario, we found that the following parallelization setting was simpler and gave us a competitive performance compared with the parallel strategies followed in other works such as [15].
Table 1 Hardware used for solving the linear algebra problems
Platform | Nodes | Cores | Frequency (GHz) | Compiler |
Pakal | 1 | 20 | 2.4 | gcc 4.8.2 |
Abacus I | 318 | 8904 | 2.6 | gcc 4.8.3 |
Titan | 1 | 2688 | 0.876 | CUDA assembly |
The matrix-vector product ℬ⋅w can be trivially parallelized on k cores by partitioning the rows of ℬ into k submatrices ℬ_1, ℬ_2, …, ℬ_k and computing ℬ_i⋅w on the i-th core.
Figure 4 depicts such partitioning for the case when four processing cores can be assigned to compute the kernel of a matrix ℬ. In § 4.2 we discuss in detail how this strategy was coded in C using the OpenMP environment.
4 Implementation Aspects
Wiedemann algorithm was successfully implemented for computing the kernels of 21,870 large and sparse linear algebra problems.
To this end, we used the Pakal CPU server,3 the supercomputer ABACUS and a GPU Kepler architecture TITAN card, whose salient features are summarized in Table 1.
We begin this section by presenting the families of matrices considered in this work. Then, we give our theoretical cost estimates for computing the kernels associated with those matrices, along with the actual computational times obtained from our experiments.
4.1 Matrix Representation
In this work we only stored the non-zero values of a sparse matrix ℬ. This was accomplished by adopting a row representation made of ordered pairs of the form (mat_ind_i, mat_val_i), which indicate the column of a non-zero entry in the i-th row of ℬ and its integer value, respectively. Using this representation, a matrix ℬ is stored using about λ⋅N pairs, where λ is the per-row average number of non-zero entries of ℬ.
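A possible C rendering of this storage scheme is sketched below; the type and field names are illustrative (they do not come from our code), and the 804-bit values are kept as 13-word arrays as in Sec. 2.1.

```c
#include <stdint.h>

#define FP_NWORDS 13                  /* an 804-bit value as 13 64-bit words (Sec. 2.1) */

/* one non-zero entry of a row: its column index and its value */
typedef struct {
    uint32_t mat_ind;                 /* column of the non-zero entry                  */
    uint64_t mat_val[FP_NWORDS];      /* its value; for most entries a small integer   */
} entry_t;

/* a sparse row is the list of its non-zero entries (about lambda of them) */
typedef struct {
    uint32_t  len;                    /* number of non-zero entries in the row         */
    entry_t  *entry;                  /* array of len (mat_ind, mat_val) pairs         */
} sparse_row_t;

/* a sparse matrix is its N rows, for a total of about lambda*N stored pairs */
typedef struct {
    uint32_t      N;
    sparse_row_t *row;
} sparse_matrix_t;
```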
For a comprehensive analysis of different techniques to store a sparse matrix, the interested reader is referred to [15, § 5.2].
4.2 Parallel Programming API
There exist several software tools for performing parallel computing, such as the Message Passing Interface (MPI), CUDA for GPU platforms, and OpenMP, among others. In this work, we chose to use OpenMP [23] for our CPU server Pakal and the ABACUS supercomputer, and CUDA for our GPU Kepler architecture Titan card (refer to Table 1).
4.2.1 OpenMP for CPU Platforms
OpenMP defines a set of directives for the C/C++ and Fortran programming languages, which make it easier to exploit shared-memory parallelism.
A typical program using OpenMP directives starts with a single active master thread. Then, when the master thread reaches an OpenMP directive, such as the pragma directive omp parallel, it creates the number of threads specified by the programmer, and each of these threads processes a data set. Once all the threads have performed their job, the master thread destroys the threads so created and then continues the sequential processing.
The code segment that is processed in parallel depends on the given directive. For example, the directive sections divides the code into blocks and distributes them among the participating threads, which execute them concurrently rather than sequentially.
The parallel matrix-vector product strategy depicted in Figure 4 was implemented in C using OpenMP, as reported in Appendix A. Indeed, the C coded function vector_matrix_mult_thread performs a matrix-vector multiplication, but only in a sub-region of the matrix ℬ that starts at row ini and ends at row final. Then, the C coded function vector_matrix_mult_multicore_main4c issues the OpenMP pragma directives that evenly distribute the matrix-vector computation among four different cores.
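The following simplified sketch conveys the same idea with a single OpenMP parallel for over the rows, for a word-size modulus and a CSR-like row-pointer layout; it is only a stand-in for the two functions of Appendix A, whose identifiers and interfaces differ.

```c
#include <stdint.h>
#include <omp.h>

/* Simplified OpenMP version of the row-partitioned matrix-vector product of
 * Figure 4, for a modulus p < 2^63.  The sparse matrix is given in a CSR-like
 * form: row r owns the pairs (mat_ind[j], mat_val[j]) for
 * row_ptr[r] <= j < row_ptr[r+1].  OpenMP splits the rows among the cores. */
static void matvec_parallel(uint64_t *v, const uint64_t *row_ptr,
                            const uint32_t *mat_ind, const uint64_t *mat_val,
                            const uint64_t *w, uint32_t N, uint64_t p)
{
    #pragma omp parallel for schedule(static)
    for (int64_t r = 0; r < (int64_t)N; r++) {
        uint64_t acc = 0;
        for (uint64_t j = row_ptr[r]; j < row_ptr[r + 1]; j++)
            acc = (acc + (uint64_t)(((unsigned __int128)mat_val[j] * w[mat_ind[j]]) % p)) % p;
        v[r] = acc;
    }
}
```

With a static schedule each core receives a contiguous band of rows, which matches the four-core partitioning illustrated in Figure 4.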
4.2.2 CUDA for GPU Platforms
Graphics Processing Units (GPUs) are massively parallel processors consisting of hundreds of cores. By taking advantage of their highly parallel architecture, one can speed up several computations where high computing power is required.
A typical GPU is composed of several processors. Each processor has a large number of cores, and each core executes threads. The GPU architecture groups the threads into blocks, and the blocks into a grid.
In the case of the implementation reported in this work, for each block we assigned 32 threads to process 32 adjacent rows of the sparse matrix being handled. This process was repeated until all the rows of the matrix had been assigned. We implemented all the field arithmetic (addition, subtraction, multiplication and reduction) using CUDA assembly.4
4.3 Families of Matrices
We considered three different families of matrices. These matrices arise from the index-calculus method used in [2] to attack the discrete logarithm problem over the field F_{3^{6·509}}, where the authors classified the matrices generated by their attack into three classes, namely the Quadratics, Cubics and Quartics families of matrices.
— Family of Quadratics: A single N-dimensional sparse matrix with N = 266,086. Its non-zero elements can take the integer values ±1, 2, 3 or 4.
— Family of Cubics: 728 N-dimensional sparse matrices with N = 177,391. For each one of these matrices, all the non-zero entries of any given row have a value equal to one, except for the first entry, which has a random-looking 804-bit integer value.
— Family of Quartics: 21,141 N-dimensional sparse matrices with N ≈ 55,000. For each one of these matrices, all the non-zero entries of any given row have a value equal to one, except for the first entry, which has a random-looking 804-bit integer value.
Following the convention used in [2, 3], the first column of each one of the Cubics and Quartics matrices contains a large 804-bit non-zero entry. The salient features of these three systems are summarized in Table 2.
Table 2 Main features of the three families of linear algebra systems under study
Quadratics | Cubics | Quartics | |
Number of matrices | 1 | 728 | 21,141 |
Dimension N | 266,086 | 177,391 | ≈55 K |
Per-row density of non-zero entries (λ) | 1,086 – 1,096 | 226 – 262 | 112 |
Number of non-zero entries | ≈290M | ≈43M | ≈6M |
From the characteristics of each matrix family, one can estimate the number of operations that a sequential execution would require for solving each linear algebra system.
4.3.1 Finding the Kernel of the Quadratics Matrix
Since the non-zero values of the Quadratics matrix are at most four in absolute value, each scalar multiplication can be performed using a shift-and-add approach. As a first-order approximation, let us assume that the cost of each one of these scalar multiplications is equivalent to one addition. Moreover, at the end of a row-vector multiplication, a Barrett reduction using Algorithm 1 is performed, with an associated cost of at most two multiplications.
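As an illustration, multiplying a single 64-bit word by such a small coefficient requires only shifts and additions; the sketch below ignores the carries into the next word, which the multi-word code propagates, and negative coefficients can be handled by a subsequent subtraction (not shown).

```c
#include <stdint.h>

/* Toy shift-and-add multiplication of a single 64-bit word by a coefficient
 * alpha in {1, 2, 3, 4}, as found in the Quadratics matrix; carries into the
 * next word are ignored here, whereas the multi-word code propagates them. */
static uint64_t small_scalar_mul(uint64_t w, unsigned alpha)
{
    switch (alpha) {
    case 1:  return w;
    case 2:  return w << 1;          /* 2w                */
    case 3:  return (w << 1) + w;    /* 3w = 2w + w       */
    case 4:  return w << 2;          /* 4w                */
    default: return 0;               /* not used here     */
    }
}
```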
Then, the cost of a matrix-vector multiplication is of about (λ−1)⋅N ≈ 2^{28.11} additions and 2N ≈ 2^{19.02} multiplications. Since the cost of the sequential version of Wiedemann algorithm is dominated by the computation of about 3N matrix-vector multiplications, the estimated cost of finding the kernel of the single Quadratics matrix is of about 2^{47.72} additions and 2^{37.63} multiplications.
The Quadratics matrix was solved using our C implementation of Wiedemann's algorithm; the computation took 4,320 CPU hours on the server Pakal. Using the twenty cores available in this server, the elapsed time for this computation was 216 hours. This matrix was also processed on a GPU GeForce GTX TITAN card (Kepler architecture) running at 876 MHz. Using 3,584 GPU threads, the linear algebra computation took 223.2 hours.
4.3.2 Finding the Kernels of the Cubics Matrices
Since, except for the first entry, the per-row non-zero values of the Cubics matrices are all equal to one, the cost of a row-vector multiplication is λ−1 additions plus five field multiplications. Hence, the cost of a matrix-vector multiplication can be estimated at about (λ−1)⋅N ≈ 2^{25.36} additions and 5N ≈ 2^{19.75} multiplications. Since the cost of the sequential version of Wiedemann algorithm is dominated by the computation of about 3N matrix-vector multiplications, the estimated cost of finding the kernels of all the 728 Cubics matrices is of about 2^{53.89} additions and 2^{48.28} multiplications.
The 728 Cubics matrices were solved using our C implementation of Wiedemann’s algorithm. Each linear system was solved in parallel on 7 ABACUS cores. The 728 linear systems were solved simultaneously using 5096 ABACUS cores. The total execution time was 379,142 CPU hours.
This time, and also the time for the linear algebra of the Quartics (see §4.3.3), was larger than expected, in part because ABACUS was still running in an experimental phase and the machine was under-clocked to prevent over-heating. The increased CPU time did not have a significant impact on the total calendar time because of the large number of cores that were at our disposal. The 728 solution vectors found were stored in files whose total size is 26.4 gigabytes.
Using seven cores per matrix, the elapsed time for the linear algebra computation of each one of the Cubics matrices was about 74.4 hours. Notably, the Titan implementation of this problem took only 24.3 hours using 3,584 GPU threads.
4.3.3 Finding the Kernels of the Quartics Matrices
Since, except for the first entry, the per-row non-zero values of the Quartics matrices are all equal to one, the cost of a row-vector multiplication is λ−1 additions plus five field multiplications.
Hence, the cost of a matrix-vector multiplication can be estimated at about (λ−1)⋅N ≈ 2^{22.54} additions and 5N ≈ 2^{18.06} multiplications. Since the cost of the sequential version of Wiedemann algorithm is dominated by the computation of about 3N matrix-vector multiplications, the estimated cost of finding the kernels of all the 21,141 Quartics matrices is of about 2^{54.24} additions and 2^{49.75} multiplications.
The 21,141 Quartics matrices were solved using our C implementation of Wiedemann’s algorithm in 829,573 CPU hours on ABACUS. Each linear system was solved in parallel on 2 cores. We used approximately 5000 cores to solve all 21,141 linear systems. The solution vectors were stored in files whose total size is 618 gigabytes.
Using two cores per matrix, the elapsed time for the linear algebra computation of each one of the Quartics matrices was about 19.62 hours. Notably, the Titan implementation of this problem took only 0.96 hours using 3,584 GPU threads.
4.4 Comparison
Table 3 reports a comparison of the elapsed times required by the CPU and GPU implementations of our linear algebra solver. It is interesting to note that the GPU solution is remarkably faster than the CPU implementation for the Cubics and Quartics families. However, the CPU implementation outperforms its GPU counterpart for the computation of the Quadratics matrix.
Table 3 A comparison of the GPU and CPU elapsed time for computing the kernels of the three linear algebra system families
Quadratics | Cubics | Quartics | |
Matrix size (N) | ≈266K | ≈177K | ≈55K |
# Matrices | 1 | 728 | 21,141 |
Cores (CPU) | 20 | 7 | 2 |
Threads (GPU) | 3584 | 3584 | 3584 |
Linear Algebra CPU (hours) | 216 | ≈ 74.4 | ≈ 19.62 |
Linear Algebra GPU (hours) | 223.2 | ≈ 24.3 | ≈ 0.9605 |
Speedup | - | 3.06 | 20.42 |
MinPoly (hours) | 64.8 | ≈ 39.6 | ≈ 2.64 |
We believe that this behavior is due to the large size of this matrix, which causes a costly divergence among the thread computations.
Due to the unique characteristics of the matrices derived in [2], it is in general difficult to compare our implementation with other related works. However, for the sake of completeness, we report in Table 4 a comparison of our GPU implementation against the one reported in [14]. The state-of-the-art software for computing Wiedemann algorithm is the CADO-NFS C implementation of the Number Field Sieve (NFS) algorithm for integer factorization and for computing discrete logarithms over finite fields [25]. However, we stress that the software in [25] specializes in the computation of usually one huge sparse matrix, but not in the simultaneous computation of thousands of large sparse matrices as is the case addressed in this work. Moreover, the GPU-based implementation of the CADO-NFS software basically corresponds to the work reported in [14], which is included in Table 4.
Table 4 A comparison of linear algebra solvers in GPU platforms
Author | GPU | Matrix size | Modulus size | Timing (hours) |
[14] | GeForce GTX 580 @1544MHz, 512 cores | 650K | 217 bits | 16 |
This work | Titan, 2688 cores | 531K | 155 bits | 18.8 |
5 Conclusion
In this work, we reported the successful computation of the null-spaces associated with 21,870 large sparse linear algebra systems.
The CPU years associated with this task are summarized in Table 5, where the CPU frequency column lists the average clock speed of the cores used. In order to accomplish this task in a reasonable calendar time, we used several parallel and supercomputing techniques, both on CPU and on GPU platforms.