Computación y Sistemas

On-line version ISSN 2007-9737; Print version ISSN 1405-5546

Comp. y Sist. vol. 24 no. 1, Ciudad de México, Jan./Mar. 2020; Epub Sep 27, 2021

https://doi.org/10.13053/cys-24-1-2979 

Articles

Towards BIMAX: Binary Inclusion-MAXimal Parallel Implementation for Gene Expression Analysis

Alejandra Serrano Rubio1  * 

Amilcar Meneses Viveros1 

Guillermo B. Morales Luna1 

Mireya Paredes Lopez2 

1 CINVESTAV-IPN, Computer Science Department, México. aserrano@computacion.cs.cinvestav.mx

2 Universidad de las Américas Puebla, Departamento de Computación, Electrónica y Mecatronica, México


Abstract

Differential gene expression analysis and clustering techniques are standard tools for studying the relation between a gene and biological processes. Since a group of genes may show co-expression under certain conditions, biclustering techniques have been used to find sets of genes sharing similar expression patterns. We present an analysis of the performance of the sequential BIMAX (Binary Inclusion-MAXimal) biclustering algorithm, evaluated on synthetic datasets. Finally, we propose a parallelization strategy to optimize the performance of BIMAX using parallel programming techniques.

Keywords: Biclustering; clustering; gene expression; high-performance computing; parallelism

1 Introduction

Data Mining is useful for analyzing the massive volumes of data produced by sophisticated new processing technologies. Bioinformatics focuses on the research and development of new computational methodologies for the organization and analysis of information associated with the "omics" sciences [14]. The organization and analysis of biological data at the level of DNA and RNA sequences generates information related to cellular mechanisms and processes [15]. One of the main aims is the analysis of gene expression [10].

Hybridization-based techniques, or microarrays, have achieved high throughput in quantifying the level of gene expression. However, such analyses are typically performed using hypothesis tests, which require a relatively small number of conditions, and only genes that have been preselected can be analyzed.

Clustering describes patterns by classifying the information through unsupervised methods. Biclustering techniques allow the clustering of genes with similar expression profiles across subsets of experimental conditions, thus improving on traditional clustering techniques by simultaneously grouping genes and conditions (e.g., diseases) and by allowing biclusters to overlap.

In this paper we focus on BIMAX, described by Prelic et al. [21], which is based on a divide-and-conquer strategy to determine optimal biclusters in reasonable time.

In general, biclustering techniques face the same problem as the clustering techniques proposed in the literature: the data being analyzed are high-dimensional. Consequently, noise-robust algorithms must be proposed that minimize runtime while producing high-quality solutions. A high-performance computing system is essential to improve the performance of such algorithms. We describe the implementation of a parallel BIMAX algorithm.

Section 2 describes the background, and Section 3 presents the BIMAX algorithm. Finally, Section 4 describes the analysis of the sequential algorithm and Section 5 presents strategies for parallelizing BIMAX.

2 Background

The main application of Bioinformatics within the biological context is the use of Data Mining techniques for the analysis of information obtained from the study of molecules relevant to life [18]. However, before applying computational algorithms, it is necessary to adapt existing models and develop new models and methodologies that fit the demands of the problem under study [20,22].

Clustering and biclustering techniques make it possible to perform transcriptome analysis for the detection of genes differentially expressed across a set of experimental conditions. The patterns can be identified from datasets obtained through microarray experiments, aiming to infer the biological mechanisms modeling the genotype-phenotype relationship and to support decision-making. The information from microarray analysis is organized in a Gene Expression Matrix:

W_{m,n} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix}.  (1)

The i-th row contains the data of a specific gene g_i, while the j-th column represents an experimental condition c_j. Let G = {g_1, g_2, …, g_m} and C = {c_1, c_2, …, c_n} be the collections of genes and conditions, respectively. For each i, j, the value w_{ij} represents the expression level of gene i at condition j.

The gene expression matrix may contain noise, null values and systematic variations produced during the execution of the experiments. Usually, there are many more genes than conditions. Pre-processing of the data is essential before applying any bioinformatic analysis technique, in order to form hypotheses about the potential pathways of information flow between the genes involved.
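Since the BIMAX algorithm analyzed later (Section 3) operates on a binarized gene expression matrix, one typical pre-processing step is to threshold the real-valued matrix. The C function below is a minimal sketch of such a step; the thresholding rule (a fraction of each gene's maximum expression) and the function name are assumptions made only for illustration, not the pre-processing prescribed in this work.

#include <stdlib.h>

/* Illustrative pre-processing step (not the one used in this work): turn a
 * real-valued m x n expression matrix w (row-major) into the binary matrix
 * required by BIMAX. An entry becomes 1 ("level change") when it exceeds a
 * per-gene threshold, here a fraction of the gene's maximum expression. */
int *binarize(const double *w, int m, int n, double fraction) {
    int *b = malloc((size_t)m * n * sizeof(int));
    if (b == NULL) return NULL;
    for (int i = 0; i < m; i++) {
        double max = w[i * n];
        for (int j = 1; j < n; j++)
            if (w[i * n + j] > max) max = w[i * n + j];
        double threshold = fraction * max;
        for (int j = 0; j < n; j++)
            b[i * n + j] = (w[i * n + j] > threshold) ? 1 : 0;
    }
    return b;
}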

2.1 Clustering Techniques

Clustering techniques aim to group data containing common characteristics; they identify densely populated regions called clusters, that is, a partition {U_1, U_2, …, U_k} of a universe U is built.

Gene-based clustering obtains functional relations between genes based on their expression levels across experimental conditions. Such relations consider genes as the data to be grouped and the conditions as attributes.

Instead, sample-based-clustering [8] matches each group of experiments with a phenotype.

A good clustering solution must consider the maximization of homogeneity and separation metrics, which act in opposition. There are many proposals to solve grouping problems, and most of these problems are NP-hard; thus heuristics and approximations are used. Five clustering algorithms and their characteristics are summarized in Table 1.

Table 1 Clustering algorithms

K-means [12]: Method of vector quantization whose objective is to classify n observations into k clusters. The time complexity is O(lkn), where l is the number of iterations and k is the number of clusters.

Self-Organizing Maps [28]: Algorithm based on unsupervised competitive learning that classifies a set of nearby observations through a graph. In some cases it may fail because interesting patterns can be classified in several ways [26].

CAST [2]: Algorithm based on the notion of the corrupted clique graph data model. The input data set is assumed to come from an underlying cluster structure with "contamination" due to random errors caused by the complex process of gene expression measurement.

CLICK [4]: Algorithm that identifies highly connected components in the proximity graph as clusters, using probabilistic assumptions so that two criteria are satisfied: homogeneity, because mates are highly similar to each other, and separation, because non-mates have little similarity to each other.

Hierarchical Clustering [7]: Algorithm that generates a hierarchical series of nested clusters, which can be graphically represented by a tree named a dendrogram.

Clustering algorithms show both advantages and shortcomings when identifying highly correlated gene sets in gene expression data. K-means, SOM, and hierarchical clustering have shown high performance [31], but their purpose is general and they may fail to address particular challenges of gene expression analysis. On the other hand, CLICK and CAST may solve this problem [24].

2.2 Biclustering Techniques

A gene expression matrix W_{m,n} as in eq. (1) can be seen as indexed by the set G × C, where G = {g_1, g_2, …, g_m} and C = {c_1, c_2, …, c_n} are the sets of considered genes and conditions, respectively. The ij-th value w_{ij} represents the expression level of gene i at condition j. Accordingly, from now on we write W_{GC} instead of the previously introduced notation W_{m,n}. For any subsets F ⊆ G and B ⊆ C, the submatrix W_{FB} = (w_{ij})_{i∈F, j∈B} consists of the expression levels corresponding to genes in F under conditions in B, and can be seen as the gene expression matrix of the bicluster F × B (in the cases in which F = G or B = C it is conventional to refer to clusters instead of biclusters).

A bicluster may satisfy a homogeneity property, e.g., it may have constant entries, or constant entries by rows or columns, or have additive or multiplicative coherent values or coherent evolutions, etc.
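As a concrete illustration of the simplest such property, the following C sketch checks whether the submatrix W_{FB} induced by given row and column index sets has constant entries; the function name and the index-array representation are assumptions made for this example only.

#include <stdbool.h>

/* Illustrative check: does the submatrix of the m x n matrix w (row-major)
 * induced by the row indices F[0..nf-1] and the column indices B[0..nb-1]
 * have constant entries? */
bool is_constant_bicluster(const double *w, int n,
                           const int *F, int nf, const int *B, int nb) {
    if (nf == 0 || nb == 0) return true;   /* an empty bicluster is trivially constant */
    double ref = w[F[0] * n + B[0]];       /* value every entry must match */
    for (int a = 0; a < nf; a++)
        for (int b = 0; b < nb; b++)
            if (w[F[a] * n + B[b]] != ref) return false;
    return true;
}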

The Biclustering Problem (BP) receives as input a gene expression matrix W_{GC} and a collection H of homogeneity properties. The goal is to find a covering of G × C consisting of maximal biclusters (F_ι × B_ι)_{ι∈I} such that W_{GC} = ⋃_{ι∈I} W_{F_ι B_ι} and each bicluster W_{F_ι B_ι} satisfies a homogeneity property in H.

The BP can be hardened by requiring that the cover solution is indeed a partition. Namely, its sets are pairwise disjoint.

BP, in general, is NP-complete [19], hence the vast majority of algorithms prioritize the reduction of computing costs, opting for the use of heuristic, stochastic search, divide and conquer and pre-processing techniques to make the patterns of interest more evident.

Most biclustering techniques are used for the analysis of transcriptomic data. The choice of an algorithm will implicitly favor getting a particular type of grouping. Five biclustering algorithms have been selected. Table 2 gives a brief description of them.

Table 2 Biclustering algorithms

Cheng & Church [3]: The first application of biclustering to the analysis of expression profiles; the aim is to find more delicate signals than clustering algorithms by using the Mean Squared Residue score.

SAMBA [5]: SAMBA emerged as a biclustering method that produces statistically significant results and also involves a normalization of the expression matrix that preserves the essential characteristics of the data.

CTWC [30]: The idea of this algorithm is to identify subsets of genes and conditions such that some of these subsets are used to define new groups that yield stable and significant partitions. It should be noted that the number of submatrices grows exponentially with the size of the problem.

Plaid Models [16]: This algorithm considers a matrix of genes and conditions as a superposition of layers, each being a subset of rows and columns, which are rearranged to obtain a matrix formed by blocks, where each block is a bicluster.

BIMAX [23]: This algorithm is based on the divide-and-conquer technique. Specifically, the algorithm begins by dividing the matrix W_{m×n} into sets of columns based on one of the rows taken as reference. Next, the rows (genes) are rearranged according to the sets of conditions previously obtained.

2.3 Related Work

Improving the computational efficiency of Data Mining algorithms by introducing some method of parallel computation becomes significant as the dimension of the datasets to be analyzed increases. In this sense, several robust parallel methodologies have been proposed that model the biological system under study and also support decision-making efficiently.

Metaheuristics are techniques that help in the solution of combinatorial optimization problems. In [9], Gomez-Pulido et al. propose a fine-grained parallelization of a biclustering algorithm based on evolutionary algorithms (EAs). The methodology selects the section that takes the longest computation time; copies of that section are then processed on different processing units. The results showed high efficiency in computing time and power consumption when using reconfigurable hardware instead of multiprocessor architectures.

Lin et al. proposed the parallelization of the Large Average Submatrices (LAS) biclustering algorithm based on the MapReduce technique. Intuitively, the algorithm is organized in two phases: 1) search for the k rows with the largest sum over the columns; this phase consists of one map function and two reduce functions; and 2) an adaptive row search that sums over the columns indicated for each row and then sorts all the row sums sequentially; this phase contains only a map function [13]. This algorithm proved to have better performance in the quantitative characteristics of each cluster compared to other algorithms such as BIMAX.

On the other hand, Ardaneswari et al. propose a two-phase grouping method to determine a bicluster [1]. During the first phase, the parallel k-means algorithm is used to classify a matrix W_{m×n}. In the second phase, the algorithm proposed by Cheng & Church is used. Also, the benefits and limitations related to the design of a parallel biclustering algorithm on GPUs are presented in [17]. That work proposes minimizing latency using coarse-grained parallelism to maximize energy efficiency and performance through the use of parallel patterns.

Sarazin et al. [25] propose an implementation of the self-organizing maps algorithm using MapReduce on the Spark platform. This application is focused on fault correction, information management and distribution on a distributed architecture. The main idea is based on launching two map-reduce functions, which manage the iterations over the rows and the columns.

A parallel biclustering algorithm must be robust, that is, it must produce relevant results in a reasonable amount of time, and it must also be scalable on the target architecture. In the following section, we describe the sequential BIMAX (Binary Inclusion-MAXimal) algorithm, which assumes two possible levels of expression: level change and no change, with respect to a control experiment.

3 BIMAX: Binary Inclusion-MAXimal

Heuristics are used to solve problems that have been proved to be NP-complete. The advantage of one algorithm over another may be due to an optimization method more suitable for a specific dataset. In this sense, BIMAX was chosen considering the number of references within the scientific community and the ease of reconstructing the code from the original publication.

The BIMAX algorithm uses a search strategy based on divide and conquer to determine, within a reasonable time, all maximal biclusters of a binary gene expression matrix. Each gene under a condition assumes two possible values: 1 if the gene responds differentially to the condition and 0 if it does not, with respect to a control condition [21].

Let G = {g_1, g_2, …, g_m} and C = {c_1, c_2, …, c_n} be the sets of genes and conditions.

Find a gene g_i ∈ G such that the following two condition subsets B_i^0, B_i^1 are both non-empty, where for each k ∈ {0,1}, B_i^k consists of the conditions c_j ∈ C such that w_{ij} = k.

Let W_{G B_i^0} and W_{G B_i^1} be the submatrices whose columns are indexed by these sets.

Let us partition the gene (row) index set into three subsets:

G_i^0 = {g_{i'} ∈ G | the restriction of the i'-th row of W_{G B_i^k} is constant with value k, for some k ∈ {0,1}},

G_i^1 = {g_{i'} ∈ G | the restriction of the i'-th row of W_{G B_i^k} is constant with value 1 − k, for some k ∈ {0,1}},

G_i^2 = G − (G_i^0 ∪ G_i^1),

and let F_i^1 = G \ G_i^1 be the complement of G_i^1 in the gene index set. Form the submatrices:

W^0 = W_{G B_i^1},   W^1 = W_{F_i^1 C}.

The procedure is repeated on each of these submatrices, W^0 and W^1, until the above non-emptiness condition on B_i^0 and B_i^1 fails.

3.1 Algorithm (Reference Method)

The algorithm listed in [21] realizes the divide-and-conquer strategy. Note that individual operations are required for processing the W_{FB} submatrices. The algorithm must guarantee that only optimal, i.e., inclusion-maximal, biclusters are generated [21].
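Since the reference listing itself is given in [21], the following C code is only a simplified sketch of the divide-and-conquer recursion described above: it splits the current columns according to a reference row and recurses on the two resulting column groups, reporting the all-ones blocks it reaches. It deliberately omits the bookkeeping that BIMAX uses to keep biclusters spanning both column groups and to guarantee inclusion-maximality, and all identifiers are our own.

#include <stdio.h>
#include <stdlib.h>

/* Simplified, illustrative divide-and-conquer recursion on a binary matrix b
 * (m x n, row-major); rows/cols hold the gene and condition indices of the
 * current submatrix. This is NOT the reference method of [21]: it drops the
 * bookkeeping needed to keep biclusters that span both column groups and to
 * guarantee inclusion-maximality, and only reports the all-ones blocks it
 * reaches along the recursion. */
static void bimax_sketch(const int *b, int n,
                         const int *rows, int nr, const int *cols, int nc) {
    if (nr == 0 || nc == 0) return;

    /* Find a reference row whose restriction to the current columns contains
     * both 0s and 1s; if none exists, every row is constant here. */
    int ref = -1;
    for (int r = 0; r < nr && ref < 0; r++) {
        int ones = 0;
        for (int c = 0; c < nc; c++) ones += b[rows[r] * n + cols[c]];
        if (ones > 0 && ones < nc) ref = r;
    }
    if (ref < 0) {
        int nr1 = 0;                             /* count the all-ones rows */
        for (int r = 0; r < nr; r++)
            if (b[rows[r] * n + cols[0]] == 1) nr1++;
        if (nr1 > 0) printf("bicluster: %d genes x %d conditions\n", nr1, nc);
        return;
    }

    /* Split the current columns according to the reference row (B_i^1, B_i^0). */
    int *cu = malloc(nc * sizeof(int)), *cv = malloc(nc * sizeof(int));
    int ncu = 0, ncv = 0;
    for (int c = 0; c < nc; c++) {
        if (b[rows[ref] * n + cols[c]] == 1) cu[ncu++] = cols[c];
        else cv[ncv++] = cols[c];
    }

    /* In each branch keep only the rows with at least one 1 in that column group. */
    int *ru = malloc(nr * sizeof(int)), *rv = malloc(nr * sizeof(int));
    int nru = 0, nrv = 0;
    for (int r = 0; r < nr; r++) {
        int inu = 0, inv = 0;
        for (int c = 0; c < ncu; c++) inu |= b[rows[r] * n + cu[c]];
        for (int c = 0; c < ncv; c++) inv |= b[rows[r] * n + cv[c]];
        if (inu) ru[nru++] = rows[r];
        if (inv) rv[nrv++] = rows[r];
    }

    bimax_sketch(b, n, ru, nru, cu, ncu);        /* recurse on the 1-columns */
    bimax_sketch(b, n, rv, nrv, cv, ncv);        /* recurse on the 0-columns */

    free(cu); free(cv); free(ru); free(rv);
}

A caller would start the recursion with rows = {0, …, m−1} and cols = {0, …, n−1} of the binarized matrix; each recursive call strictly reduces the number of columns, so the recursion terminates.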

4 Analysis of Sequential BIMAX

During the parallelization process of an algorithm, it is necessary to analyze the performance of the algorithm. This analysis helps to identify the segments of code that can potentially be parallelized. In this section, we present an analysis of the sequential BIMAX algorithm.

The sequential BIMAX algorithm was designed and implemented in the C language based on the algorithm presented in [21]. For the experiments, we used an Intel Xeon E3 v5 processor. For the performance analysis of the sequential BIMAX, we used three in silico1 datasets: Group A, Group B and Group C. Group A was used to evaluate the effectiveness of the algorithm, which is measured by counting the number of correct biclusters present in the output of the program. Group B was built to measure the effectiveness of the algorithm in the presence of noise, which was simulated by randomly reordering the rows of the matrix. Finally, Group C was used to determine the maximum fraction of the algorithm that can potentially be parallelized.

  • — Group A consists of five binary matrices constructed in the absence of noise, into which a defined number of biclusters has been implanted according to the size of the matrix. These datasets represent the best-case scenario because the elements belonging to a bicluster are contiguous and there is no overlap between them. Table 3 describes the characteristics of each of these matrices, the processing time expressed in milliseconds, and the percentage of effectiveness obtained by comparing the features of the implanted biclusters against those of the obtained biclusters.

    We represent the percentage of effectiveness through the convention α0 - α1(%), where α0 and α1 represent the number of implanted and detected biclusters, respectively, while (%) represents the percentage of effectiveness for each case study. The percentage of effectiveness was obtained by comparing the characteristics of the implanted and obtained biclusters. The results show that the algorithm has good correctness and efficiency, being able to determine 100% of the biclusters implanted in the proposed datasets.

  • — Group B consists of the five binary matrices of Group A, to which noise was added by randomly reordering the rows. Table 4 describes the processing time expressed in milliseconds and the percentage of effectiveness for each matrix. One more column has been added to represent the percentage growth rate of the execution time needed to find biclusters in the presence of noise relative to its absence. The growth rate was measured using the equation Growth rate = ((t_0 − t_1)/t_1) × 100, where t_0 and t_1 represent the execution time in the presence and absence of noise, respectively.

    The results show that although the biclusters are not well defined, the algorithm does not lose precision or quality. One more experiment was added by randomly swapping the order of the columns of each test matrix in this dataset.

    These experiments (see Tables 4 and 5) show that the growth rate of the execution time varies between 20% and 30% in the presence of noise. A matrix M50×50 was constructed to compare the impact that the presence of noise has on the performance of the algorithm. A noise value simulated by a Gaussian distribution was added to each element of M, and the order of the rows and columns was randomly exchanged.

    The execution time obtained by processing matrix 5 of the experiments shown in Tables 3, 4 and 5 was compared. The results show that the more noise the gene expression matrix contains and the less obvious the biclusters are, the more time the algorithm takes to present a result to the user (see Table 6).

  • — Group C consists of eleven binary matrices constructed by implanting a defined number of overlapping biclusters. Next, a simulated noise value was added to each element of the matrix using a Gaussian distribution. Finally, the order of the rows and then the order of the columns were randomly exchanged. This dataset represents the worst-case scenario.

    Table 7 shows the dimension of each of the matrices built in silico, as well as the execution time, in this case expressed in minutes. It is necessary to highlight that the algorithm depends on two parameters that the user needs to define at the beginning of the computation: the minimum number of genes for a bicluster that contains more than one condition, and the minimum number of conditions for a bicluster that contains more than one gene. We set both parameters to 1, so that our analysis shows the greatest possible number of results.

    An analysis was made of the time the algorithm takes to read the gene expression matrix, to process it, and to write the results; a minimal timing sketch is shown after this list. Table 8 shows the percentage of time spent in each phase for each case study. In the first four cases, the percentage corresponding to the reading and writing time is higher than the percentage of the processing time. However, as the size of the matrix increases, the processing-time share begins to converge to 68%, which represents the fraction of the algorithm that can potentially be parallelized.
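The sketch below shows, in C, how the three phases could be timed with clock_gettime to produce a breakdown like that of Table 8; the functions read_matrix, run_bimax and write_results are stubs standing in for the corresponding stages of the actual implementation, not functions from the paper's code.

#include <stdio.h>
#include <time.h>

/* Stubs standing in for the three stages of the actual implementation. */
static void read_matrix(const char *path)   { (void)path; /* load the binary matrix */ }
static void run_bimax(void)                 { /* sequential BIMAX processing */ }
static void write_results(const char *path) { (void)path; /* write the biclusters */ }

/* Elapsed wall-clock seconds between two timespecs. */
static double elapsed(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec t0, t1, t2, t3;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    read_matrix("matrix.txt");                   /* reading phase */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    run_bimax();                                 /* processing phase */
    clock_gettime(CLOCK_MONOTONIC, &t2);
    write_results("biclusters.txt");             /* writing phase */
    clock_gettime(CLOCK_MONOTONIC, &t3);

    double r = elapsed(t0, t1), p = elapsed(t1, t2), w = elapsed(t2, t3);
    double total = r + p + w;
    if (total > 0.0)
        printf("reading %.4f%%  processing %.4f%%  writing %.4f%%\n",
               100.0 * r / total, 100.0 * p / total, 100.0 * w / total);
    return 0;
}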

Table 3 Effectiveness of the BIMAX sequential biclustering algorithm without noise 

Matrix Size Time (ms) Effectiveness
1 10 × 10 0.4802 3-3(100%)
2 20 × 20 0.6608 4-4(100%)
3 30 × 30 0.7990 5-5(100%)
4 40 × 40 0.9938 6-6(100%)
5 50 × 50 1.1459 7-7(100%)

Table 4 Effectiveness of the BIMAX sequential biclustering algorithm in the presence of noise (random reordering of rows)

Matrix Time (ms) Effectiveness Rate
1 0.5952 3-3(100%) 23.94%
2 0.8210 4-4(100%) 24.24%
3 0.9819 5-5(100%) 22.89%
4 1.2376 6-6(100%) 24.53%
5 1.4688 7-7(100%) 28.18%

Table 5 Effectiveness of the BIMAX sequential biclustering algorithm in the presence of noise (random reordering of rows and columns)

Matrix Time (ms) Effectiveness Rate
1 0.6198 3-3(100%) 29.07%
2 0.8659 4-4(100%) 31.03%
3 1.0578 5-5(100%) 32.39%
4 1.2919 6-6(100%) 29.99%
5 1.4770 7-7(100%) 28.89%

Table 6 Analysis of the impact that the presence of noise has on the performance of the BIMAX algorithm 

Matrix Size Time (ms) Effectiveness
1 50 × 50 1.1459 7-7(100%)
2 50 × 50 1.4688 7-7(100%)
3 50 × 50 1.4770 7-7(100%)
4 50 × 50 47.6542 7-4(57.1%)

Table 7 Total execution time in minutes (mins) to evaluate the performance of the BIMAX sequential biclustering algorithm 

Matrix Size Time (mins)
1 10 × 10 0.0048
2 50 × 50 0.0535
3 100 × 100 1.9749
4 150 × 150 23.2889
5 200 × 200 185.9743
6 250 × 250 762.6570
7 300 × 300 3191.0777
8 350 × 350 10906.2004
9 400 × 400 39018.2120
10 450 × 450 84378.7566
11 500 × 500 131627.8000

Table 8 Percentages of the execution time of the three phases of the BIMAX sequential biclustering algorithm: reading, writing and processing 

Matrix Reading (%) Writing (%) Processing (%)
1 2.8014 94.0423 3.1563
2 1.8938 67.1074 22.2813
3 0.5095 59.1921 40.2983
4 0.0679 57.3385 42.5936
5 0.0103 44.3639 55.6258
6 0.0030 38.0696 61.9274
7 0.0008 35.0811 64.9181
8 0.0003 33.0724 66.9273
9 0.0001 32.0714 67.9285
10 0.0001 32.0153 67.9846
11 0.0000 32.0076 67.9924

We analyze the maximum performance that a parallel implementation of the BIMAX algorithm can provide. Given the sequential BIMAX program, of which 68% of the code is perfectly parallelizable, the performance and efficiency are calculated for {1, 2, 4, 8, 16} processors, assuming a sequential runtime of 100 units of time (seconds). The system's performance improvement factor (speed-up) is defined as S(p) = T(1)/T(p), while the efficiency of the system with p processors is defined as E(p) = S(p)/p, where T(p) represents the runtime with p processing units.

The estimation of the performance of the algorithm as the number of processors increases is shown in Table 9. A system is scalable for a specific range of processors if E(p) remains constant and above a factor of 0.5 [6]; it can be appreciated that moving to 8 processors reduces the efficiency considerably.

Table 9 Estimation of the performance and efficiency of the BIMAX algorithm with different number of processors 

Processors 1 2 4 8 16
S(p) 1 1.5151 2.0408 2.4691 2.7586
E(p) 1 0.7575 0.5102 0.3086 0.0946

Consequently, the limiting performance of the BIMAX algorithm as p → ∞ is:

\lim_{p \to \infty} S(p) = \lim_{p \to \infty} \frac{T(1)}{T(1)\,[(1-\phi)+\phi/p]} = \frac{1}{1-0.68} = 3.125,

where ϕ = 0.68 is the fraction of potentially parallelizable code. This represents an approximation to the maximum speed-up that can be obtained from parallelizing the sequential code of the algorithm.
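As a quick check of these estimates, the short C program below evaluates S(p) = 1/((1 − ϕ) + ϕ/p) and E(p) = S(p)/p for ϕ = 0.68 and p ∈ {1, 2, 4, 8, 16}, together with the p → ∞ limit; it only reproduces the Amdahl-style calculation, and small discrepancies with respect to Table 9 may reflect rounding or additional assumptions made there.

#include <stdio.h>

int main(void) {
    const double phi = 0.68;                       /* parallelizable fraction of the code */
    const int procs[] = {1, 2, 4, 8, 16};
    const int np = sizeof(procs) / sizeof(procs[0]);

    for (int i = 0; i < np; i++) {
        int p = procs[i];
        double s = 1.0 / ((1.0 - phi) + phi / p);  /* speed-up S(p) */
        double e = s / p;                           /* efficiency E(p) = S(p)/p */
        printf("p = %2d   S(p) = %.4f   E(p) = %.4f\n", p, s, e);
    }
    printf("limit as p -> infinity: S = %.4f\n", 1.0 / (1.0 - phi));
    return 0;
}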

5 Strategies for Parallelizing BIMAX

Figure 1 shows that the processing time, characterized by ϕ, decreases as p increases. In contrast, the sequential time, represented by (1 − ϕ) = r(p) + w(p), where r(p) and w(p) denote the time spent reading the data and writing partial results by each processing unit, remains constant in all the study cases; in particular, when p ≥ 4, (1 − ϕ) > ϕ.

Fig. 1 Analysis of the performance of the BIMAX sequential algorithm 

Based on the results in Table 8, it can be said that:

\lim_{p \to \infty} r(p) = 0, \qquad w(p) \approx \sigma(p).

Therefore, it is proposed to use a parallel file system in a message passing environment, in order to optimize the access time to the data and consequently improve the overall performance of the algorithm.
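A minimal sketch of this idea, assuming an MPI environment with MPI-IO available, is shown below: each process writes its partial results to a shared file at an offset derived from its rank, so that writing does not serialize on a single process. The fixed record size, the file name, and the assumption that every rank writes the same number of records are illustrative choices only.

#include <mpi.h>

#define RECORD_SIZE 128   /* assumed fixed size, in bytes, of one partial-result record */

/* Each process writes `count` fixed-size records of partial results to a shared
 * file at an offset derived from its rank, using collective MPI-IO. The layout
 * assumes every rank writes the same number of records; a real implementation
 * would exchange the counts (e.g., with MPI_Exscan) to compute the offsets. */
void write_partial_results(const char *buf, int count) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "biclusters.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * count * RECORD_SIZE;
    MPI_File_write_at_all(fh, offset, buf, count * RECORD_SIZE,
                          MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}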

Due to the complexity of BIMAX, which is O(mnβ min{m,n}), where β is the number of inclusion-maximal biclusters of W_{m,n} [21], the divide-and-conquer scheme it uses, and the disk space required to process and store the results, the algorithm turns out to be a suitable candidate for the application of a parallelization technique.

The parallelization of biclustering algorithms has been difficult due to their inherent characteristics, which require repetitively reading the same data or distributing it among different devices. These data-intensive characteristics can limit current parallel architectures.

Nevertheless, some biclustering algorithms have been parallelized, including novel algorithms using parallel genetic algorithms, parallel evolutionary learning, and the parallel large average submatrices algorithm based on MapReduce [27, 11, 13], running on multicore systems or clusters. Other algorithms have been parallelized for graphics processing units (GPUs), requiring more specialized parallel programming, as in [17].

Although the BIMAX algorithm has been taken as a baseline for comparison with other biclustering algorithms, the only parallel version, to the best of our knowledge, is the one presented by Voggenreiter et al. [29]. That parallelization consists of a straightforward strategy using a job pool of threads. [29] states that using a single pool leads to contention between threads, which increases as the number of threads grows. Thus, the more threads running BIMAX, the slower the program performs. To alleviate this contention, a parallelization of BIMAX without a job pool was proposed. However, that implementation was found not to be effective for larger datasets.

In this work, we aim to go further in the BIMAX parallelization by partitioning the input matrix up to a certain level. This level is limited by the number of processors in the architecture. After reaching the last level, BIMAX is executed independently by each process on its submatrix. The program ends when all the processes have finished.

Figure 2 illustrates the proposed parallelization of the BIMAX algorithm. It basically consists of dividing the input matrix among the total number of processors available in the system. This division is made at the beginning by the parallel BIMAX program. Since BIMAX follows a divide-and-conquer approach, it generates a tree of processes, as can be seen in Figure 2. The number of levels in that tree depends on the total number of processors. For instance, with six available processors, the tree can only have two levels of the BIMAX recursion. Once the last level of the tree is reached, in this case the second level, each processor starts the execution of the BIMAX algorithm with its respective input matrix.

Fig. 2 Example of the partitioning of the input matrix created by BIMAX algorithm 
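The following C/MPI skeleton sketches this strategy under assumptions of our own: the number of splitting levels is derived from the number of processes, the root process performs those levels and scatters one packed submatrix to every rank, and each rank then runs the sequential BIMAX on its part. The helpers split_to_level and run_sequential_bimax, the buffer sizes and the matrix dimensions are hypothetical stand-ins, not code from this work.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* Hypothetical stubs standing in for the real implementation:
 *  - split_to_level: applies the BIMAX splitting `levels` times to the global
 *    binary matrix and packs one submatrix per process into fixed-size buffers;
 *  - run_sequential_bimax: runs the sequential BIMAX on one packed submatrix. */
static void split_to_level(const int *matrix, int m, int n, int levels,
                           int *packed, int packed_size) {
    (void)matrix; (void)m; (void)n; (void)levels; (void)packed; (void)packed_size;
}
static void run_sequential_bimax(const int *packed, int packed_size) {
    (void)packed; (void)packed_size;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int m = 500, n = 500;                    /* assumed global matrix size */
    const int packed_size = m * n / size + 2;      /* assumed per-process buffer size */
    int levels = (int)floor(log2((double)size));   /* splitting levels limited by #processes */

    int *packed_all = NULL;
    int *packed = malloc((size_t)packed_size * sizeof(int));

    if (rank == 0) {
        int *matrix = calloc((size_t)m * n, sizeof(int));
        /* ... read or generate the binarized gene expression matrix here ... */
        packed_all = malloc((size_t)size * packed_size * sizeof(int));
        split_to_level(matrix, m, n, levels, packed_all, packed_size);
        free(matrix);
    }

    /* Distribute one packed submatrix to every process, then run BIMAX locally. */
    MPI_Scatter(packed_all, packed_size, MPI_INT,
                packed, packed_size, MPI_INT, 0, MPI_COMM_WORLD);
    run_sequential_bimax(packed, packed_size);

    /* The program ends when every process has finished its submatrix. */
    MPI_Barrier(MPI_COMM_WORLD);
    free(packed);
    free(packed_all);
    MPI_Finalize();
    return 0;
}

Since the submatrices produced by the splitting generally have different sizes, a real implementation would use MPI_Scatterv with per-rank counts instead of the fixed packed_size assumed here.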

6 Conclusion

Biclustering is a powerful unsupervised technique to uncover patterns in gene expression data. Three main phases of the BIMAX algorithm were described: data reading, processing, and writing of results. Our results suggest that the algorithm is potentially parallelizable in the processing and writing phases, because the share of the reading phase tends to zero as the number of processors increases.

The parallelization of the BIMAX algorithm is proposed in a message-passing environment, together with a parallel file system. This is associated with a lower computation cost when obtaining partial results for each processing unit during the analysis of a section of the gene expression matrix.

Certainly, there are differences in the specific criteria used to parallelize the algorithm, and consequently differences in speed between the present study and those proposed in the literature. In this sense, a prior analysis of the performance of the BIMAX algorithm has been carried out to identify the potentially parallelizable sections and thus be able to propose a good design of the parallel algorithm on distributed-memory platforms.

In this study, we used only gene expression matrices designed in silico, for which the characteristics of the implanted biclusters were known, in order to minimize confounding variables. Future research should include gene expression matrices obtained from biological experiments, which would also allow the results to be verified in vitro.

It must be taken into account that the performance improvement obtained by parallelizing the algorithm will depend on the algorithm itself, its sequential component, and the overhead of communication and synchronization between processes. However, the objective will always be to increase the speed of processing without altering the effectiveness of the algorithm, independently of the error coming from the different noise sources of the experiment.

Acknowledgement

The authors would like to thank the financial support given by the Mexican National Council of Science and Technology (CONACyT), as well as ABACUS: Laboratory of Applied Mathematics and High-Performance Computing of the Mathematics Department of CINVESTAV-IPN. They also thank Advanced Studies and Research Center of National Polytechnic Institute (CINVESTAV-IPN), for encouragement and facilities provided to accomplish this publication.

References

1. Ardaneswari, G., Bustamam, A., & Siswantining, T. (2017). Implementation of parallel k-means algorithm for two-phase method biclustering in Carcinoma tumor gene expression data. AIP Conference Proceedings, volume 1825, AIP Publishing, pp. 020004.

2. Bellaachia, A., Portnoy, D., Chen, Y., & Elkahloun, A. G. (2002). E-cast: a data mining algorithm for gene expression data. Proceedings of the 2nd International Conference on Data Mining in Bioinformatics, Springer-Verlag, pp. 49-54.

3. Cheng, Y. & Church, G. M. (2000). Biclustering of expression data. ISMB, volume 8, pp. 93-103.

4. Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., others, & Mortazavi, A. (2016). A survey of best practices for RNA-seq data analysis. Genome Biology, Vol. 17, No. 1, pp. 13.

5. Fiannaca, A., La Rosa, M., La Paglia, L., Rizzo, R., & Urso, A. (2015). Analysis of miRNA expression profiles in breast cancer using biclustering. BMC Bioinformatics, Vol. 16, No. 4, pp. S7.

6. Foster, I. (1995). Designing and Building Parallel Programs, volume 78. Addison Wesley Publishing Company, Boston.

7. Galili, T. (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, Vol. 31, No. 22, pp. 3718-3720.

8. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., others, & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, Vol. 286, No. 5439, pp. 531-537.

9. Gomez-Pulido, J. A., Cerrada-Barrios, J. L., Trinidad-Amado, S., Lanza-Gutierrez, J. M., Fernandez-Diaz, R. A., Crawford, B., & Soto, R. (2016). Fine-grained parallelization of fitness functions in bioinformatics optimization problems: gene selection for cancer classification and biclustering of gene expression data. BMC Bioinformatics, Vol. 17, No. 1, pp. 330.

10. Griffiths, A. J. F., Gelbart, W. M., Miller, J. H., & Lewontin, R. C. (2003). Genética moderna.

11. Huang, Q., Tao, D., Li, X., & Liew, A. (2012). Parallelized evolutionary learning for detection of biclusters in gene expression data. IEEE/ACM Trans. Comput. Biol. Bioinformatics, Vol. 9, No. 2, pp. 560-570.

12. Levine, J. H., Simonds, E. F., Bendall, S. C., Davis, K. L., El-ad, D. A., Tadmor, M. D., others, & Finck, R. (2015). Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell, Vol. 162, No. 1, pp. 184-197.

13. Lin, Q., Xue, Y., Chen, W., Ye, S., Li, W., & Liu, J. (2015). Parallel large average submatrices biclustering based on MapReduce. Computational Intelligence and Security (CIS), 2015 11th International Conference on, IEEE, pp. 134-137.

14. Luscombe, N. M., Greenbaum, D., & Gerstein, M. (2001). What is bioinformatics? An introduction and overview. Yearbook of Medical Informatics, Vol. 1, No. 1, pp. 83-99.

15. Mount, D. W. (2004). Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, Vol. 2.

16. Oghabian, A., Kilpinen, S., Hautaniemi, S., & Czeizler, E. (2014). Biclustering methods: biological relevance and application in gene expression analysis. PLoS ONE, Vol. 9, No. 3, pp. e90801.

17. Orzechowski, P. & Boryczko, K. (2015). Effective biclustering on GPU - capabilities and constraints. Przegląd Elektrotechniczny, Vol. 1, pp. 133-136.

18. Perezleo Solózano, L., Arencibia Jorge, R., Conill González, C., Achón Veloz, G., & Araujo Ruiz, J. A. (2003). Impacto de la bioinformática en las ciencias biomédicas. Acimed, Vol. 11, No. 4.

19. Pontes, B., Giraldez, R., & Aguilar-Ruiz, J. S. (2015). Biclustering on expression data: A review. Journal of Biomedical Informatics, Vol. 57, pp. 163-180.

20. Pontes, B., Giraldez, R., Divina, F., & Martinez-Alvarez, F. (2007). Evaluación de biclusters en un entorno evolutivo. IV Taller Nacional de Minería de Datos y Aprendizaje (TAMIDA), pp. 1-10.

21. Prelic, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., & Zitzler, E. (2006). A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, Vol. 22, No. 9, pp. 1122-1129.

22. Rossi, F., Van Beek, P., & Walsh, T. (2006). Handbook of Constraint Programming. Foundations of Artificial Intelligence. Elsevier Science.

23. Roy, S., Bhattacharyya, D. K., & Kalita, J. K. (2016). Analysis of gene expression patterns using biclustering. Microarray Data Analysis: Methods and Applications, pp. 91-103.

24. Salazar, G., Bellocchi, C., Todoerti, K., Saporiti, L., F. and Piacentini, Scorza, R., & Colombo, G. I. (2016). Gene expression profiling reveals novel protective effects of aminaphtone on ECV304 endothelial cells. European Journal of Pharmacology, Vol. 782, pp. 59-69.

25. Sarazin, T., Lebbah, M., & Azzag, H. (2014). Biclustering using Spark-MapReduce. Big Data, IEEE International Conference on, IEEE, pp. 58-60.

26. Sathishkumar, K., Thiagarasu, V., & Balamurugan, E. (2015). An analysis on clustering based gene selection and classification for gene expression data. International Journal of Innovative Trends in Engineering (IJITE), Vol. 11, No. 01, pp. 55-60.

27. Shen, W., Xie, C. J., Liu, G. X., Xing, C., Wang, M. Q., & Zhou, Y. (2011). A novel biclustering with parallel genetic algorithm. International Conference on Human Health and Biomedical Engineering.

28. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., others, & Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences, Vol. 96, No. 6, pp. 2907-2912.

29. Voggenreiter, O. (2014). Biclustering for the Analysis of Global Regulatory Patterns in Large-Scale Gene Expression Data. ETH Zurich.

30. Wani, M. A. & Riyaz, R. (2017). A novel point density based validity index for clustering gene expression datasets. International Journal of Data Mining and Bioinformatics, Vol. 17, No. 1, pp. 66-84.

31. Weber, L. M. & Robinson, M. D. (2016). Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytometry Part A, Vol. 89, No. 12, pp. 1084-1096.

1 generated through a computational simulation

Received: July 03, 2018; Accepted: November 22, 2019

* Corresponding author is Alejandra Serrano Rubio. aserrano@computacion.cs.cinvestav.mx

This is an open-access article distributed under the terms of the Creative Commons Attribution License.