<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>0016-7169</journal-id>
<journal-title><![CDATA[Geofísica internacional]]></journal-title>
<abbrev-journal-title><![CDATA[Geofís. Intl]]></abbrev-journal-title>
<issn>0016-7169</issn>
<publisher>
<publisher-name><![CDATA[Universidad Nacional Autónoma de México, Instituto de Geofísica]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S0016-71692015000100003</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Forward modeling of gravitational fields on hybrid multi-threaded cluster]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Couder-Castañeda]]></surname>
<given-names><![CDATA[Carlos]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Ortiz-Alemán]]></surname>
<given-names><![CDATA[José Carlos]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Orozco-del-Castillo]]></surname>
<given-names><![CDATA[Mauricio Gabriel]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Nava-Flores]]></surname>
<given-names><![CDATA[Mauricio]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
<xref ref-type="aff" rid="A02"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Mexican Petroleum Institute]]></institution>
<addr-line><![CDATA[Ciudad de México ]]></addr-line>
</aff>
<aff id="A02">
<institution><![CDATA[Universidad Nacional Autónoma de México, Facultad de Ingeniería, División de Ingeniería en Ciencias de la Tierra]]></institution>
<addr-line><![CDATA[México Distrito Federal]]></addr-line>
<country>México</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>03</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>03</month>
<year>2015</year>
</pub-date>
<volume>54</volume>
<numero>1</numero>
<fpage>31</fpage>
<lpage>48</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S0016-71692015000100003&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S0016-71692015000100003&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S0016-71692015000100003&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="es"><p><![CDATA[La solución analítica de las componentes del tensor gravimétrico, utilizando la ecuación del potencial gravitacional para un ensamble volumétrico compuesto de prismas de densidad constante, requiere un alto costo computacional. Esto se debe a que el potencial gravitacional de cada uno de estos prismas tiene que ser calculado para todos los puntos de una malla de observación previamente definida, lo cual resulta en una carga computacional de gran escala. En este trabajo introducimos un diseño híbrido y su implementación paralela basada en OpenMP y MPI, para el cálculo de las componentes vectoriales del campo gravimétrico (Gx, Gy, Gz) y las componentes del tensor gravimétrico (Gxx, Gxy, Gxz, Gyy, Gyz, Gzz). El rendimiento obtenido conlleva a óptimas relaciones del speed-up, ya que el tiempo de cómputo es drásticamente reducido. La técnica de paralelización aplicada consiste en descomponer el problema en grupos de prismas y utilizar diferentes espacios de memoria por núcleo de procesamiento, con el fin de evitar los problemas de cuello de botella cuando se accesa a la memoria compartida de un nodo del cluster, que se producen generalmente cuando varios hilos de ejecución acceden a la misma región en OpenMP. Debido a que OpenMP solo puede utilizarse en sistemas de memoria compartida es necesario utilizar MPI para la distribución del cálculo entre los nodos del cluster, dando como resultado un código híbrido OpenMP+MPI altamente eficiente con un speed-up prácticamente perfecto. 
Adicionalmente los resultados numéricos fueron validados con respecto a su contraparte secuencial.]]></p></abstract>
<abstract abstract-type="short" xml:lang="en"><p><![CDATA[The analytic solution of the gravimetric tensor components, using the gravitational potential equation for a three-dimensional volumetric assembly composed of unit prisms of constant density, has a high computational cost. This is because the gravitational potential of each prism must be calculated at every point of a previously defined observation grid, which results in a large-scale computational load. In this work we introduce a hybrid design and its parallel implementation, based on OpenMP and MPI, for the calculation of the vector components of the gravimetric field and the components of the gravimetric tensor. The computing time is drastically reduced, and the performance obtained is close to optimal speed-up ratios. The applied parallelization technique consists of decomposing the problem into groups of prisms and using separate memory allocations per processing core, to avoid the bottlenecks that generally arise when too many execution threads access the same region of main memory in a cluster node under OpenMP. Because OpenMP can only be used on shared-memory systems, MPI is needed to distribute the calculation among cluster nodes, resulting in a highly efficient hybrid code (OpenMP+MPI) with nearly perfect speed-up. Additionally, the numerical results were validated against their sequential counterpart.]]></p></abstract>
<kwd-group>
<kwd lng="es"><![CDATA[gravedad]]></kwd>
<kwd lng="es"><![CDATA[gradiometría]]></kwd>
<kwd lng="es"><![CDATA[OpenMP]]></kwd>
<kwd lng="es"><![CDATA[MPI]]></kwd>
<kwd lng="es"><![CDATA[hyper-threading]]></kwd>
<kwd lng="es"><![CDATA[clusters]]></kwd>
<kwd lng="en"><![CDATA[gravity]]></kwd>
<kwd lng="en"><![CDATA[gradiometry]]></kwd>
<kwd lng="en"><![CDATA[OpenMP]]></kwd>
<kwd lng="en"><![CDATA[MPI]]></kwd>
<kwd lng="en"><![CDATA[hyper-threading]]></kwd>
<kwd lng="en"><![CDATA[clusters]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[        <p align="justify"><font face="verdana" size="4">Original paper</font></p>       <p align="justify">&nbsp;</p>       <p align="center"><b><font face="verdana" size="4">Forward modeling of gravitational fields on hybrid multi&#45;threaded cluster</font></b></p>       <p align="justify">&nbsp;</p>       <p align="center"><font face="verdana" size="2"><b>Carlos Couder&#45;Casta&ntilde;eda*, Jos&eacute; Carlos Ortiz&#45;Alem&aacute;n*, Mauricio Gabriel Orozco&#45;del&#45;Castillo* and Mauricio Nava&#45;Flores**</b></font></p>       <p align="justify">&nbsp;</p>       <p align="justify"><font face="verdana" size="2"><i>* Mexican Petroleum Institute, Eje Central L&aacute;zaro C&aacute;rdenas, 152, San Bartolo Atepehuacan, Gustavo A. Madero, 07730, Ciudad de M&eacute;xico.</i> * Corresponding author: <a href="mailto:ccouder@esfm.ipn.mx">ccouder@esfm.ipn.mx</a></font></p>         <p align="justify"><font face="verdana" size="2"><i>** Divisi&oacute;n de Ingenier&iacute;a en Ciencias de la Tierra, Facultad de Ingenier&iacute;a, Universidad Nacional Aut&oacute;noma de M&eacute;xico, Ciudad Universitaria, Delegaci&oacute;n Coyoac&aacute;n, 04510, M&eacute;xico D.F., M&eacute;xico.</i></font></p>         <p align="justify">&nbsp;</p>         ]]></body>
<body><![CDATA[<p align="justify"><font size="2" face="verdana">Received: October 18, 2013;    <br>     Accepted: March 11, 2014;    <br>     Published on line: December 12, 2014</font></p>     <p align="justify">&nbsp;</p>       <p align="justify"><font face="verdana" size="2"><b>Resumen</b></font></p>     <p align="justify"><font face="verdana" size="2">La soluci&oacute;n anal&iacute;tica de las componentes del tensor gravim&eacute;trico, utilizando la ecuaci&oacute;n del potencial gravitacional para un ensamble volum&eacute;trico compuesto de prismas de densidad constante, requiere un alto costo computacional. Esto se debe a que el potencial gravitacional de cada uno de estos prismas tiene que ser calculado para todos los puntos de una malla de observaci&oacute;n previamente definida, lo cual resulta en una carga computacional de gran escala. En este trabajo introducimos un dise&ntilde;o h&iacute;brido y su implementaci&oacute;n paralela basada en OpenMP y MPI, para el c&aacute;lculo de las componentes vectoriales del campo gravim&eacute;trico (<i>G<sub>x</sub></i>, <i>G<sub>y</sub></i>, <i>G<sub>z</sub></i>) y las componentes del tensor gravim&eacute;trico (<i>G<sub>xx</sub></i>, <i>G<sub>xy</sub></i>, <i>G<sub>xz</sub></i>, <i>G<sub>yy</sub></i>, <i>G<sub>yz</sub></i>, <i>G<sub>zz</sub></i>). El rendimiento obtenido conlleva a &oacute;ptimas relaciones del speed&#45;up, ya que el tiempo de c&oacute;mputo es dr&aacute;sticamente reducido. La t&eacute;cnica de paralelizaci&oacute;n aplicada consiste en descomponer el problema en grupos de prismas y utilizar diferentes espacios de memoria por n&uacute;cleo de procesamiento, con el fin de evitar los problemas de cuello de botella cuando se accesa a la memoria compartida de un nodo del cluster, que se producen generalmente cuando varios hilos de ejecuci&oacute;n acceden a la misma regi&oacute;n en OpenMP. 
Debido a que OpenMP solo puede utilizarse en sistemas de memoria compartida es necesario utilizar MPI para la distribuci&oacute;n del c&aacute;lculo entre los nodos del cluster, dando como resultado un c&oacute;digo h&iacute;brido OpenMP+MPI altamente eficiente con un speed&#45;up pr&aacute;cticamente perfecto. Adicionalmente los resultados num&eacute;ricos fueron validados con respecto a su contraparte secuencial.</font></p>        <p align="justify"><font face="verdana" size="2"><b>Palabras clave:</b> gravedad, gradiometr&iacute;a, OpenMP, MPI, hyper&#45;threading, clusters.</font></p>       <p align="justify">&nbsp;</p>       <p align="justify"><font size="2" face="verdana"><b>Abstract</b></font></p>     <p align="justify"><font face="verdana" size="2">The analytic solution of the gravimetric tensor components, using the gravitational potential equation for a three&#45;dimensional volumetric assembly composed of unit prisms of constant density, has a high computational cost. This is because the gravitational potential of each prism must be calculated at every point of a previously defined observation grid, which results in a large&#45;scale computational load. In this work we introduce a hybrid design and its parallel implementation, based on OpenMP and MPI, for the calculation of the vector components of the gravimetric field and the components of the gravimetric tensor. The computing time is drastically reduced, and the performance obtained is close to optimal speed&#45;up ratios. The applied parallelization technique consists of decomposing the problem into groups of prisms and using separate memory allocations per processing core, to avoid the bottlenecks that generally arise when too many execution threads access the same region of main memory in a cluster node under OpenMP. 
Because OpenMP can only be used on shared&#45;memory systems, MPI is needed to distribute the calculation among cluster nodes, resulting in a highly efficient hybrid code (OpenMP+MPI) with nearly perfect speed&#45;up. Additionally, the numerical results were validated against their sequential counterpart.</font></p>        ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2"><b>Keywords:</b> gravity, gradiometry, OpenMP, MPI, hyper&#45;threading, clusters.</font></p>       <p align="justify">&nbsp;</p>       <p align="justify"><font size="2" face="verdana"><b>Introduction</b></font></p>     <p align="justify"><font face="verdana" size="2">The shared memory architecture is becoming more common every day in the high&#45;performance computing market. Advances in hardware technology allow a large number of cores to access the same memory locations, so systems with forty or sixty cores using shared memory are no longer that expensive. OpenMP is now a standard for symmetric multiprocessing (SMP) systems, sustained by a combination of functions and compiler directives (Dagum and Menon, 1998; Curtis&#45;Maury <i>et al.</i>, 2008); it can even be used transparently on the Xeon Phi architecture (Calvin <i>et al.</i>, 2013). OpenMP has proven to be a powerful tool for SMP for several reasons: it is highly portable; it allows fine and medium granularity; each thread can access the same global memory and also has its own private memory; and it offers a higher level of abstraction than the MPI model (Brunst and Mohr, 2008).</font></p>        <p align="justify"><font face="verdana" size="2">MPI is a library based on the Single Program, Multiple Data (SPMD) model and on the message passing model, with explicit control of the parallelism. The processes can only read and write in their respective local memories, and the data in these memories are transferred through calls to functions or procedures which implement the message passing model. 
Among the principal characteristics of MPI are that it can run on both shared and distributed memory architectures, that it is convenient for medium to coarse granularity, and that it is widely used, which makes it extremely portable among platforms (Krpic <i>et al.</i>, 2012).</font></p>        <p align="justify"><font face="verdana" size="2">Using a hybrid programming model we can take advantage of the benefits of both programming models, OpenMP and MPI. MPI is normally used to control the parallelism among cluster nodes, while OpenMP is applied to create threads for fine&#45;granularity tasks within each node. Most applications developed in the hybrid model involve a hierarchical scheme: MPI at the higher level and OpenMP at the lower one (Smith, 2000).</font></p>        <p align="justify"><font face="verdana" size="2">One of the potential benefits of hybrid programming is overcoming the scaling barrier that each model has. Generally, in MPI the scaling is limited by the communication cost, because an application is affected by communication overhead when the number of processes is increased. In OpenMP the performance of an application is affected by cache coherence problems and by access to shared memory, which may lead to bottlenecks between the execution threads when they try to access memory. By mixing these parallel programming methodologies (OpenMP and MPI), we can obtain a more diverse granularity of the application and therefore better performance than by using each one on its own.</font></p>        <p align="justify"><font face="verdana" size="2">Several applications use this programming paradigm of OpenMP with MPI. 
Examples include the solution of sparse linear systems (Mitin <i>et al.</i>, 2012), graph&#45;coloring algorithms (Sariyuce <i>et al.</i>, 2012), some models of fluid dynamics (Amritkar <i>et al.</i>, 2012; Couder&#45;Casta&ntilde;eda, 2009), finite element methods (Boehmer <i>et al.</i>, 2012), the simulation of turbulent fluids (Jagannathan and Donzis, 2012), and even the simulation of combustion chambers (K&ouml;rnyei, 2012) and the implementation of neural networks (Gonzalez <i>et al.</i>, 2012). As can be observed, there are numerous computational implementations using OpenMP with MPI; nevertheless, this type of design relies on a natural, data&#45;based decomposition of the domain (Carrillo&#45;Ledesma <i>et al.</i>, 2013). In our particular problem, by contrast, each processing unit accesses all of the points of the computational domain.</font></p>        <p align="justify"><font face="verdana" size="2"><a href="/img/revistas/geoint/v54n1/a3f1.jpg" target="_blank">Figure 1</a> depicts a domain decomposition, where each task (process or thread) is given a subset of the data on which to work. This domain decomposition is commonly used, for example, in finite difference problems, where the computational domain is divided disjointly among the different tasks.</font></p>        <p align="justify"><font face="verdana" size="2">On the other hand, in the forward modeling of gravimetric data, an initial model for the source body is constructed from geological&#45;geophysical information. The anomaly of this model is calculated and compared to the observed anomaly, after which the model parameters are adjusted to improve the fit between them. These three steps (<i>anomaly calculation, comparison and adjustment</i>) are repeated until the observed and calculated anomalies are similar enough.</font></p>        ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2">A mass volume can be approximated by a set of rectangular prisms; if chosen sufficiently small, each prism can be considered to have a constant density. Because of the superposition principle, the gravitational anomaly of a body can be approximated at any point by summing the effects of all the prisms at that point. Even though this methodology appears simple, reducing the size of the prisms to better fit the source body considerably increases the computing time. Other methods of approximating the gravitational anomaly can simplify the required computation (mass point or tesseroid approximations); however, they may complicate the construction of the geological model (Heck and Seitz, 2007).</font></p>       <p align="justify">&nbsp;</p>       <p align="justify"><b><font face="verdana" size="2">Application design</font></b></p>     <p align="justify"><font face="verdana" size="2">The application consists of calculating the gravimetric anomaly produced by a rectangular prismatic body with constant density with respect to a group of observation points (see <a href="/img/revistas/geoint/v54n1/a3f2.jpg" target="_blank">Figure 2</a>). The set of prisms is known as an ensemble of prisms, which is not necessarily regular. A set of irregular prisms can be configured as long as the prisms do not overlap. 
Because the gravitational field complies with the superposition principle with respect to the observation points, if &#402; is the calculated response of a single prism at a point (<i>x</i>, <i>y</i>), then the observed response at the point (<i>x</i>, <i>y</i>) is given by:</font></p>      <p align="center"><img src="/img/revistas/geoint/v54n1/a3e1.jpg"></p>        <p align="justify"><font face="verdana" size="2">where <i>M</i> is the total number of prisms and &#961; is the density of the prism.</font></p>        <p align="justify"><font face="verdana" size="2">It is well known that the function that calculates the anomaly for a given prism from an observation point is written as follows (Nagy <i>et al.</i>, 2000):</font></p>        <p align="center"><img src="/img/revistas/geoint/v54n1/a3e2.jpg"></p>      <p align="justify"><font face="verdana" size="2">where (<i>x<sub>l</sub></i>, <i>y<sub>l</sub></i>, <i>z<sub>l</sub></i>) is the top left vertex of the prism, (<i>x<sub>r</sub></i>, <i>y<sub>r</sub></i>, <i>z<sub>r</sub></i>) is the bottom right vertex of the prism, (<i>x<sub>p</sub></i>, <i>y<sub>p</sub></i>, <i>z<sub>p</sub></i>) is the observation point and &#961; is the density, as shown in <a href="#f3">Figure 3</a>.</font></p>     <p align="center"><a name="f3"></a></p>     ]]></body>
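<body><![CDATA[<p align="justify"><font face="verdana" size="2">As an illustration of the superposition described above, the following minimal sketch evaluates one component, the vertical attraction <i>g<sub>z</sub></i>, of a homogeneous prism and sums it over an ensemble. It uses the closed&#45;form prism expression commonly attributed to Nagy <i>et al.</i> (2000); the function names <code>prism_gz</code> and <code>anomaly_at</code>, the choice of <i>z</i> positive downward and the SI value of the gravitational constant are assumptions made for this example, and the sketch is not the routine used in this work.</font></p>
<pre>
```python
import math

G = 6.674e-11  # gravitational constant, SI units (assumed value for the example)


def prism_gz(xl, yl, zl, xr, yr, zr, xp, yp, zp, rho):
    """Vertical attraction at (xp, yp, zp) of one prism of density rho.

    Assumes z positive downward and a prism lying strictly below the
    observation point. Hypothetical helper, not the authors' routine.
    """
    # Shift the prism limits so the observation point becomes the origin.
    xs = (xr - xp, xl - xp)
    ys = (yr - yp, yl - yp)
    zs = (zr - zp, zl - zp)
    total = 0.0
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            for k, z in enumerate(zs):
                r = math.sqrt(x * x + y * y + z * z)
                kernel = (x * math.log(y + r) + y * math.log(x + r)
                          - z * math.atan2(x * y, z * r))
                # Alternating signs evaluate the triple integration limits;
                # the overall minus sign makes attraction positive downward.
                total -= (-1.0) ** (i + j + k) * kernel
    return G * rho * total


def anomaly_at(point, prisms):
    """Superposition: sum the response of every prism at one observation point.

    prisms is a list of ((xl, yl, zl, xr, yr, zr), rho) pairs.
    """
    xp, yp, zp = point
    return sum(prism_gz(*bounds, xp, yp, zp, rho) for bounds, rho in prisms)
```
</pre>
<p align="justify"><font face="verdana" size="2">Calling <code>anomaly_at</code> once per observation point reproduces the cost structure discussed in the text: the number of kernel evaluations is the number of prisms multiplied by the number of observation points.</font></p>]]></body>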
<body><![CDATA[<p align="center"><img src="/img/revistas/geoint/v54n1/a3f3.jpg"></p>      <p align="justify"><font face="verdana" size="2">This is a large&#45;scale problem: for example, a synthetic problem composed of a set of 300 &times; 300 &times; 150 = 13,500,000 prisms, evaluated against an observation grid of 100 &times; 100 = 10,000 points, requires the calculation of 135,000,000,000 integrals or differentials to solve the entire problem. The formulations we used are included in <a href="/img/revistas/geoint/v54n1/html/a3appe.html" target="_blank">appendix A</a>.</font></p>        <p align="justify"><font face="verdana" size="2">Reducing the computing time of a numerical simulation is of great importance to diminish research costs. A simulation which lasts a week is likely to be costly, not only because the machine time is expensive, but also because it prevents results from being obtained quickly enough to make modifications and predictions.</font></p>        <p align="justify"><font face="verdana" size="2">In many parallelization projects, the serial algorithm often does not exhibit a natural decomposition that allows it to be ported easily to a parallel environment, or the trivial decomposition does not yield good performance. For such reasons it is convenient to use a hybrid programming methodology, such as the one developed and presented in this paper. This methodology provides an adequate programming design to obtain superior performance.</font></p>        <p align="justify"><font face="verdana" size="2">To develop a parallel program it is fundamental to search for the finest granularity, as in the methodology proposed by Foster (Foster, 1995). In this case it is possible to parallelize by prisms or by observation points. One of the requirements of the design is that it must be scalable; therefore the use of hybrid systems is quite appropriate, and these systems are the most commonly used nowadays. 
Following Foster's methodology, it is necessary to begin with the finest granularity, which in this case corresponds to OpenMP because it operates at the lowest level. The implementation then continues with MPI, due to its coarse granularity.</font></p>        <p align="justify"><font face="verdana" size="2"><i>Implementation in OpenMP</i></font></p>        <p align="justify"><font face="verdana" size="2">We started our design with OpenMP because it handles shared memory and it also has the finest granularity. First we partitioned the domain into prisms, and for each prism we parallelized the calculation by observation points, as shown in <a href="#f4">Figure 4</a>.</font></p>       <p align="center"><a name="f4"></a></p>       <p align="center"><img src="/img/revistas/geoint/v54n1/a3f4.jpg"></p>     <p align="justify"><font face="verdana" size="2">This parallelization by observation points is trivial and does not pose a great design challenge, since we simply partition the calculation with respect to the observation grid for each prism (see pseudo&#45;code 1). However, this scheme has several drawbacks. One of them is that the performance is not optimal, since the number of prisms is much greater than the number of observation points. In other words, this partitioning is efficient only as long as there are not too many threads working on the observation grid; otherwise a bottleneck arises because the threads work on the same memory allocation. Perhaps the worst drawback is that the parallel environment is repeatedly created and closed: for each prism, a function which calculates the anomalies in parallel is executed, the parallel environment is closed once the execution is over, and it is reopened for the following prism, which results in unnecessary overhead and therefore decreases the performance.</font></p>     ]]></body>
<body><![CDATA[<p align="center"><img src="/img/revistas/geoint/v54n1/a3l1.jpg"></p>      <p align="justify"><font face="verdana" size="2">The other parallelization option is to partition by prisms, i.e., to make the threads divide the work by the number of prisms (see pseudo&#45;code 2). To avoid cache coherence problems it is necessary to create a different memory space for each execution thread, because it is not feasible to use a single memory space holding a unique observation grid shared by all the threads.</font></p>     <p align="center"><img src="/img/revistas/geoint/v54n1/a3f5.jpg"></p>        <p align="justify"><font face="verdana" size="2">As shown in <a href="#f6">Figure 6</a>, an observation grid must be created for each execution thread to avoid memory consistency problems. Memory access bottlenecks are avoided since every thread writes to a different address of the memory space. If only one grid were used, there would be access conflicts on the shared grid, which would create numerical inconsistencies.</font></p>       <p align="center"><a name="f6"></a></p>       <p align="center"><img src="/img/revistas/geoint/v54n1/a3f6.jpg"></p>        <p align="justify"><font face="verdana" size="2">One of the characteristics of OpenMP is that the computation is distributed implicitly; therefore the partitioning of the <i>M</i> prisms which compose the problem is done automatically using a balancing algorithm included in OpenMP. In this case the decision is left to the compiler, which is optimal in 99% of the cases (Zhang <i>et al.</i>, 2004).</font></p>        <p align="justify"><font face="verdana" size="2"><i>OpenMP+MPI Implementation</i></font></p>        <p align="justify"><font face="verdana" size="2">One of the advantages of the parallelization by prisms is that it is easier to implement in MPI, producing tasks of coarse granularity using the same design previously applied in OpenMP. 
Partitioning the observation grid instead would result in a more complicated and less efficient MPI design. Since the parallelization in MPI is explicit, we need to manually distribute the number of prisms through a modular expression. If <i>M</i> is the number of prisms to calculate and <i>p</i> is the MPI process number (numbered from 0 to p<sub><i>n</i></sub>-1), then for each process <i>p</i> we define the first and last prisms to be processed by <i>p</i> as p<sub>start</sub> and p<sub>end</sub>, respectively. We define the integer <i>s</i> as the quotient of the number of prisms <i>M</i> divided by the total number of processes p<sub><i>n</i></sub>, and <i>r</i> as the remainder; p<sub>start</sub> and p<sub>end</sub> are then determined as follows:</font></p>      <p align="center"><img src="/img/revistas/geoint/v54n1/a3e3.jpg"></p>     ]]></body>
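<body><![CDATA[<p align="justify"><font face="verdana" size="2">A distribution of this kind can be sketched as follows. The sketch assumes a common block distribution in which the first <i>r</i> processes receive one extra prism and the prisms are numbered from 1, as in Fortran; the function name <code>prism_range</code> is hypothetical, and the exact expression used in this work may assign the remainder differently.</font></p>
<pre>
```python
def prism_range(M, n_procs, p):
    """Return (p_start, p_end), inclusive and 1-based, for process p.

    Assumes the first r processes take one extra prism, where r is the
    remainder of M divided by n_procs. Illustrative only.
    """
    s, r = divmod(M, n_procs)        # quotient s and remainder r
    extra_before = min(p, r)         # extra prisms held by lower-ranked processes
    p_start = p * s + extra_before + 1
    size = s + (1 if r > p else 0)   # this process takes an extra prism if p is below r
    p_end = p_start + size - 1
    return p_start, p_end
```
</pre>
<p align="justify"><font face="verdana" size="2">With this choice the ranges are contiguous, cover prisms 1 to <i>M</i> exactly once, and differ in size by at most one prism, which keeps the load balanced when all prisms cost the same.</font></p>]]></body>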
<body><![CDATA[<p align="justify"><font face="verdana" size="2">This way we can distribute the <i>M</i> prisms over p<sub><i>n</i></sub> processes in a balanced manner; once this distribution is made, we can use the OpenMP implementation in each node. In other words, we use MPI to distribute the prisms among the nodes, and within each node we employ OpenMP to reduce the number of MPI processes, which reduces communication time.</font></p>      <p align="justify"><font face="verdana" size="2">Consequently, the application is partitioned by the number of prisms <i>M</i>, both in OpenMP and in MPI. Another option is to parallelize by prisms in MPI and by observation points in OpenMP. Even though this is a viable option, it is not very scalable due to the drawbacks discussed in the previous subsection.</font></p>        <p align="justify"><font face="verdana" size="2">Basically, the design consists of allocating an observation grid per execution thread and a global observation grid in the master thread of each computing node. The per&#45;thread grids are then summed (reduced) and the result is stored in the global grid held by the master thread. Finally, at the end of the parallel calculation, every master thread adds its grid values to the grid of the master thread of the master node using an MPI reduction (see <a href="#f7">Figure 7</a>).</font></p>       <p align="center"><a name="f7"></a></p>       <p align="center"><img src="/img/revistas/geoint/v54n1/a3f7.jpg"></p>      <p align="justify"><font face="verdana" size="2">The code was implemented following the Fortran 2003 specification, using the Intel Cluster Toolkit version 2013 from Intel Corporation as the development tool.</font></p>     <p align="justify">&nbsp;</p>     <p align="justify"><font face="verdana" size="2"><b>Performance experiments</b></font></p>     <p align="justify"><font face="verdana" size="2">For the synthetic experiment we 
used a case composed of a cube of 700 &times; 700 &times; 50 prisms, with 7 contrasting spheres of variable density (see <a href="/img/revistas/geoint/v54n1/a3f8.jpg" target="_blank">Figure 8</a>). The spheres were composed of 251,946 prisms, and the observation grid had 150 &times; 100 = 15,000 points at an elevation of 100 m. Therefore, the number of calls to the procedure that calculates a vector/tensor component of the gravity is 3,779,190,000, which makes the experiment a high&#45;performance computing problem.</font></p>        <p align="justify"><font face="verdana" size="2">We tested the code parallelized by observation points against the version parallelized by prisms, both using OpenMP. The first parallel scheme is technically easier to implement because, for each prism, the loops that traverse the observation grid are parallelized. The second scheme has a more complex implementation because it requires separate memory allocations. The performance experiments that calculate the components of the gravimetric tensor <i>G<sub>xx</sub></i>, <i>G<sub>yy</sub></i>, <i>G<sub>zz</sub></i>, <i>G<sub>xy</sub></i>, <i>G<sub>xz</sub></i>, <i>G<sub>yz</sub></i> using both versions were carried out on the server described below. We did not include the performance analysis for the vector components <i>G<sub>x</sub></i>, <i>G<sub>y</sub></i> and <i>G<sub>z</sub></i>, since their behavior is very similar.</font></p>      ]]></body>
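<body><![CDATA[<p align="justify"><font face="verdana" size="2">The scheme partitioned by prisms, with one private observation grid per worker followed by a reduction, can be sketched as below. This is only an illustration of the data layout: it uses Python threads, which do not provide the speed&#45;up of OpenMP threads, a cyclic split of the prisms instead of the OpenMP scheduler, and a stand&#45;in point&#45;mass kernel instead of the prism formulas. The names <code>partial_grid</code> and <code>forward_by_prisms</code> are hypothetical.</font></p>
<pre>
```python
from concurrent.futures import ThreadPoolExecutor


def partial_grid(prisms, points):
    """Accumulate the response of a group of prisms into a private grid.

    Each prism is reduced here to (cx, cy, cz, mass); the kernel is a
    stand-in point-mass term, not the prism formula of the paper.
    """
    grid = [0.0] * len(points)
    for cx, cy, cz, mass in prisms:
        for idx, (x, y) in enumerate(points):
            d2 = (x - cx) ** 2 + (y - cy) ** 2 + cz ** 2
            grid[idx] += mass * cz / d2 ** 1.5
    return grid


def forward_by_prisms(prisms, points, n_workers):
    """Partition by prisms: one private grid per worker, then a reduction."""
    # Cyclic split of the prisms into n_workers groups (assumed for simplicity).
    groups = [prisms[w::n_workers] for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        grids = list(pool.map(partial_grid, groups, [points] * n_workers))
    # Reduction: the final grid is the element-wise sum of the private grids.
    return [sum(values) for values in zip(*grids)]
```
</pre>
<p align="justify"><font face="verdana" size="2">Because no worker ever writes to a grid owned by another worker, no synchronization is needed until the final sum, which is the property the design described above relies on.</font></p>]]></body>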
<body><![CDATA[<p align="justify"><font face="verdana" size="2">The characteristics of the server where the OpenMP tests took place are as follows:</font></p>        <blockquote>         <p align="justify"><font face="verdana" size="2">&bull; 4 Intel(R) Xeon(R) E7&#45;4850 processors</font></p>         <p align="justify"><font face="verdana" size="2">&bull; 10 processing cores per processor</font></p>         <p align="justify"><font face="verdana" size="2">&bull; Hyperthreading Technology deactivated</font></p>         <p align="justify"><font face="verdana" size="2">&bull; 512 GB of RAM memory</font></p>         <p align="justify"><font face="verdana" size="2">&bull; Red Hat 6.3 as operating system</font></p> </blockquote>      <p align="justify"><font face="verdana" size="2">To interfere as little as possible with the processes of the operating system, we used 35 of the 40 cores available in the server. The implementation partitioned by prisms, with independent memory per core, was 3.22X faster than its counterpart partitioned by observation points: while the observation points version takes 757 s, the version partitioned by prisms takes only 235 s.</font></p>        <p align="justify"><font face="verdana" size="2">The comparison of the computing times per thread in the partition by prisms against the partition by observation points is shown in <a href="/img/revistas/geoint/v54n1/a3f9.jpg" target="_blank">Figure 9</a>.</font></p>        <p align="justify"><font face="verdana" size="2">In <a href="/img/revistas/geoint/v54n1/a3f9.jpg" target="_blank">Figure 9</a> it can be seen that the performance behavior remains stable in both types of partitioning; however, the partitioning by prisms gives the best reduction in time. To show that the partitioning by prisms keeps the time reduction practically linear, we plotted the <i>speed&#45;up</i> of the version by prisms.</font></p>        ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2">For the <i>speed&#45;up</i> shown in <a href="/img/revistas/geoint/v54n1/a3f10.jpg" target="_blank">Figure 10</a>, we considered a serial fraction of 5% (&#402; = 0.05). This fraction accounts for the reductions needed to sum the grid points of each core; the total anomaly is calculated as:</font></p>        <p align="center"><img src="/img/revistas/geoint/v54n1/a3e4.jpg"></p>       <p align="justify"><font face="verdana" size="2">where, for each (<i>i</i>,<i> j</i>), <i>O<sub>&#402;</sub></i> is the final observation, <i>O<sub>t</sub></i> is the grid calculated by core <i>t</i> and <i>I</i> is the total number of cores. Therefore, we considered that 95% of the code is parallel, and according to Gustafson's law, the maximum <i>speed&#45;up</i> that can be obtained with 35 processing units, in this case cores, is 35 + (1&#45;35) &times; (0.05) = 33.30. The experimentally obtained <i>speed&#45;up</i> was 31.31, which represents an absolute difference of 1.99 and a relative difference of 0.06, and shows the efficiency of the implementation.</font></p>     <p align="justify"><font face="verdana" size="2">Another indicator which must be considered is the efficiency <i>E</i>, defined as:</font></p>        <p align="center"><img src="/img/revistas/geoint/v54n1/a3e5.jpg"></p>      <p align="justify"><font face="verdana" size="2">where <i>S</i>(<i>n</i>) is the speed&#45;up obtained with <i>n</i> tasks; <i>E</i> indicates how busy the processors or cores are during execution. <a href="/img/revistas/geoint/v54n1/a3f11.jpg" target="_blank">Figure 11</a> shows that the efficiency of the partitioning by prisms is high, since on average every processing core is kept busy 94% of the time. The efficiency <i>E</i> also indicates that the partitioning by prisms is scalable, which means that we can increase the number of processors to further reduce the time without losing efficiency in the use of many cores. 
The scalability must be contemplated as a good design of the parallel program since it allows scaling the algorithm, so we could expect when the number of processing units is increasing the performance is not affected.</font></p>  	    <p align="justify"><font face="verdana" size="2">The design using OpenMP is limited to architectures of machines of shared memory, therefore we are now making experiments using a hybrid machine commonly known as <i>cluster</i>, mixing OpenMP+MPI with the methodology described in subsection 2.2.</font></p>  	    <p align="justify"><font face="verdana" size="2">The characteristics of the cluster where the numerical experiments were carried out are as follows:</font></p>  	    <blockquote> 	      <p align="justify"><font face="verdana" size="2">&bull; Node: Intel(R) Xeon(R) model X5550 processors with four physical cores processor.</font></p> 	      ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2">&bull; 44 processing nodes</font></p> 	      <p align="justify"><font face="verdana" size="2">&bull; Hyperthreading Technology enabled</font></p> 	      <p align="justify"><font face="verdana" size="2">&bull; 40 GB of RAM per node</font></p> 	      <p align="justify"><font face="verdana" size="2">&bull; Red Hat 6.3 as operating system</font></p> 	      <p align="justify"><font face="verdana" size="2">&bull; InfiniBand 300 Gbps</font></p> </blockquote>      <p align="justify"><font face="verdana" size="2">We started by evaluating the performance of a single cluster node. Unlike the experiments on the 40&#45;core server, where hyperthreading technology (HT) was disabled, here HT is enabled, so each node reports 8 execution threads instead of 4, although only 4 physical floating point units (FPUs) are available. Since our program is computationally intensive, we need to determine whether it benefits from HT; some studies have reported that the use of HT in numerical applications can change the performance by 30% (Curtis&#45;Maury <i>et al.</i>, 2008).</font></p>  	    <p align="justify"><font face="verdana" size="2">The behavior of one node containing 1 processor with four real cores, with HT enabled and disabled, is shown in the computing time graph of <a href="/img/revistas/geoint/v54n1/a3f12.jpg" target="_blank">Figure 12</a>. The problem analyzed consists of 13,997 prisms forming a sphere, with a mesh of 150 &times; 100 observation points.</font></p>  	    <p align="justify"><font face="verdana" size="2">As can be observed, with HT enabled the best run time is not obtained with 4 execution threads but with 8, although the time is not halved. This occurs because two threads share the same FPU and HT is designed to switch quickly between threads; the performance gain is approximately 30%, which means that two threads make better use of the FPU. It is therefore necessary to create two threads per core to obtain the maximum performance when HT is enabled. With HT disabled, the run time levels off after 4 threads and does not reach the performance obtained with HT enabled.</font></p>  	    <p align="justify"><font face="verdana" size="2">In <a href="/img/revistas/geoint/v54n1/a3f13.jpg" target="_blank">Figure 13</a> it can be observed that when HT is enabled we obtain a linear <i>speed&#45;up</i> up to 4 execution threads, as expected since there are only 4 physical FPUs. Nevertheless, HT allows a better use of the FPUs, raising the <i>speed&#45;up</i> to 5.60, that is, the equivalent of 1.6 additional processing units. With HT disabled, a similar behavior is observed up to 4 threads, although the performance is below that obtained with HT enabled. For more than 4 threads, the performance with HT disabled begins to decrease.</font></p>  	    <p align="justify"><font face="verdana" size="2">The efficiency corresponding to the speed&#45;up shown in <a href="/img/revistas/geoint/v54n1/a3f13.jpg" target="_blank">Figure 13</a> is plotted in <a href="/img/revistas/geoint/v54n1/a3f14.jpg" target="_blank">Figure 14</a>; notice how HT is able to increase the efficiency of some floating&#45;point&#45;intensive applications by up to 30% when the number of threads equals the number of physical cores. The best efficiency is obtained with 4 threads because there are 4 FPUs; nevertheless, better performance is obtained by creating 4 additional threads on the virtual processors provided by HT.</font></p>  	    ]]></body>
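<body><![CDATA[<p align="justify"><font face="verdana" size="2">The speed&#45;up and efficiency values quoted in this section follow from simple ratios. The sketch below is illustrative only: it assumes the efficiency has the usual form <i>E</i> = <i>S</i>(<i>n</i>)/<i>n</i> (the defining equation appears here only as an image), uses the Gustafson form <i>n</i> + (1 &#45; <i>n</i>)&#402; applied above for 35 cores, and the function names are hypothetical.</font></p>
<pre>
```python
def speedup(serial_time, parallel_time):
    """Ratio of a reference run time to the improved run time."""
    return serial_time / parallel_time


def efficiency(observed_speedup, units):
    """Speed-up per processing unit, assuming E = S(n) / n."""
    return observed_speedup / units


def gustafson_bound(units, serial_fraction):
    """Scaled speed-up n + (1 - n) * f, the form applied in the text."""
    return units + (1 - units) * serial_fraction


if __name__ == "__main__":
    # Server experiment: 35 cores, serial fraction 0.05, observed 31.31.
    bound = gustafson_bound(35, 0.05)
    observed = 31.31
    print("Gustafson bound:", round(bound, 2))
    print("absolute gap:", round(bound - observed, 2))
    print("relative gap:", round((bound - observed) / bound, 2))
    print("efficiency at 35 cores:", round(efficiency(observed, 35), 3))

    # Prisms (235 s) against observation points (757 s).
    print("prisms vs. points:", round(speedup(757.0, 235.0), 2))

    # One node with HT: speed-up 5.60 with 8 threads on 4 physical cores.
    print("per thread:", round(efficiency(5.60, 8), 2))
    print("per physical core:", round(efficiency(5.60, 4), 2))
```
</pre>
<p align="justify"><font face="verdana" size="2">Dividing the HT speed&#45;up by the number of threads or by the number of physical cores gives two different efficiency figures; which one is meaningful depends on whether a virtual processor is counted as a processing unit.</font></p>
]]></body>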
<body><![CDATA[<p align="justify"><font face="verdana" size="2">To analyze the performance of a node with the original problem (shown in <a href="/img/revistas/geoint/v54n1/a3f8.jpg" target="_blank">Figure 8</a>), we added a processor in the second socket of one of the nodes. In other words, we created a node with eight real cores to compare it against a node with four real cores with HT enabled. The execution times are shown in <a href="/img/revistas/geoint/v54n1/a3f15.jpg" target="_blank">Figure 15</a>.</font></p>  	    <p align="justify"><font face="verdana" size="2">It must be taken into consideration that our cluster nodes have a single processor with HT enabled; we added another processor in the second socket of one node only for experimental purposes. To give a better perspective of the performance, the speed&#45;up of both node configurations is shown in <a href="/img/revistas/geoint/v54n1/a3f16.jpg" target="_blank">Figure 16</a>. A nearly perfect speed&#45;up can be observed for the node with 8 real cores, and an increase of 1.8 processing units for the node with 4 real cores with HT enabled. If we enabled HT on the machine with 8 real cores, it would report 16 processors, and 16 threads would be needed to reach its maximum performance. However, the experiments with 8 real cores were only for comparison purposes, since each cluster node has 4 real cores with HT enabled. 
It can also be observed that each node of the cluster reduces the time by a factor of 5.8X with respect to the serial version.</font></p>  	    <p align="justify"><font face="verdana" size="2">Once it is known that the best node performance is achieved with 8 execution threads for a node with 4 real cores with HT enabled and with the partition by prisms, we can consider each node as a processing unit and distribute the computation with MPI, obtaining a code with a hybrid programming model.</font></p>  	    <p align="justify"><font face="verdana" size="2">The <i>speed&#45;up</i> results using 25 cluster nodes are displayed in <a href="/img/revistas/geoint/v54n1/a3f17.jpg" target="_blank">Figure 17</a>; a serial fraction of 5% (&#402; = 0.05) is considered, since MPI requires reductions to sum the results of each node. The results show that a nearly perfect <i>speed&#45;up</i> is obtained up to 22 nodes. From this point on, the speed&#45;up starts declining because the application performance is affected by the communication time between nodes. In other words, the granularity of the tasks begins to decrease for this problem of 249,946 prisms for 30 nodes. This implies that by increasing the granularity of the problem (increasing the number of prisms), the <i>speed&#45;up</i> also increases until it becomes stable, and decreases again later on.</font></p>  	    <p align="justify"><font face="verdana" size="2">The efficiency graph related to the <i>speed&#45;up</i> of <a href="/img/revistas/geoint/v54n1/a3f17.jpg" target="_blank">Figure 17</a> is shown in <a href="/img/revistas/geoint/v54n1/a3f18.jpg" target="_blank">Figure 18</a>. Notice how the efficiency falls below 90% after node 23. 
If we consider that each node is 5.8 times faster than the serial version (from <a href="/img/revistas/geoint/v54n1/a3f15.jpg" target="_blank">Figure 15</a>), then the optimum speed factor for this cluster (for a problem of 251,946 prisms) is approximately 5.8 &times; 22 = 127.6X, i.e. about 127 times faster than the serial version. As previously stated, if we increase the granularity (number of prisms), the efficiency increases as well. In fact, we reduced the computation time of the spheres problem from 1 h 34 m 56 s to 34 s.</font></p>  	    <p align="justify"><font face="verdana" size="2"><i>Comparison with similar programs</i></font></p>  	    <p align="justify"><font face="verdana" size="2">To provide a better perspective of the performance obtained with our parallel implementation, we compared it against an open source code called tesseroids (Uieda <i>et al.</i>, 2011), which can be downloaded from http://dx.doi.org/10.6084/m9.figshare.786514. We chose the problem of 13,997 prisms forming a sphere against 10,000 observation points, since tesseroids is not distributed (it cannot be executed on a cluster) and can only accelerate the computation on shared memory machines. The execution times are shown in the bar chart of <a href="/img/revistas/geoint/v54n1/a3f19.jpg" target="_blank">Figure 19</a>, where it can be observed that our code is 2.14X faster than tesseroids with HT disabled and 2.51X faster with HT enabled. This improvement arises because our program design takes better advantage of the processor technology and keeps the cores fully occupied by using a prisms parallelization scheme based on separate memory allocations. 
This can be observed in the CPU history graph shown in <a href="/img/revistas/geoint/v54n1/a3f20.jpg" target="_blank">Figure 20</a>.</font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><b><font face="verdana" size="2">Numerical code validation</font></b></p>     <p align="justify"><font face="verdana" size="2">The main challenge of parallel programming is to decompose the program into components that can be executed simultaneously to reduce computing time. The level of decomposition is strongly influenced by the architecture of the parallel machine. In this case the design followed a hybrid programming strategy to get the most out of the architecture. Although reducing the execution time is the main objective of parallel programming, the code must also be validated, since programming errors inherent to parallelism can occur.</font></p>  	    ]]></body>
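<body><![CDATA[<p align="justify"><font face="verdana" size="2">One way to carry out such a validation is to compare the grid computed in parallel with the grid computed serially, for example through a root&#45;mean&#45;square difference. The following is a minimal sketch that assumes the common RMS form, the square root of the mean squared difference over all grid points; the exact normalization used in this paper is not reproduced here, and the function name is hypothetical.</font></p>
<pre>
```python
import math


def rms_difference(parallel_grid, serial_grid):
    """Root-mean-square difference between two grids given as lists of rows.

    Assumption: RMS = sqrt(sum((gp - gs)^2) / N) over all N grid points.
    """
    if len(parallel_grid) != len(serial_grid):
        raise ValueError("grids have different numbers of rows")
    total = 0.0
    count = 0
    for row_p, row_s in zip(parallel_grid, serial_grid):
        if len(row_p) != len(row_s):
            raise ValueError("rows have different lengths")
        for gp, gs in zip(row_p, row_s):
            total += (gp - gs) ** 2
            count += 1
    if count == 0:
        raise ValueError("grids are empty")
    return math.sqrt(total / count)


if __name__ == "__main__":
    # Hypothetical 2 x 3 grids of one tensor component.
    serial = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    parallel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    print("RMS difference:", rms_difference(parallel, serial))
```
</pre>
<p align="justify"><font face="verdana" size="2">Because a parallel reduction may add the partial grids in a different order than the serial loop, differences at the level of floating&#45;point rounding can appear even in a correct implementation, so the result is usually compared against a small tolerance rather than against exact zero.</font></p>
]]></body>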
<body><![CDATA[<p align="justify"><font face="verdana" size="2">To measure the error, we compared the parallel results with those of the sequential counterpart, which had previously been validated against the analytical solution in the synthetic experiment. We used the L2 norm error or RMS (Mickus and Hinojosa, 2001; Menke, 1989), defined as:</font></p>  	    <p align="center"><img src="/img/revistas/geoint/v54n1/a3e6.jpg"></p>      <p align="justify"><font face="verdana" size="2">where <img src="/img/revistas/geoint/v54n1/a3e6a.jpg" align="middle"> is the tensor component computed in parallel, and <img src="/img/revistas/geoint/v54n1/a3e6b.jpg" align="middle"> is the serially calculated component.</font></p>      <p align="justify"><font face="verdana" size="2"><a href="#t1">Table 1</a> shows the errors of the gravimetric tensor components calculated in parallel with respect to the serial version.</font></p>     <p align="center"><a name="t1" id="t1"></a></p>     <p align="center"><img src="/img/revistas/geoint/v54n1/a3t1.jpg"></p>      <p align="justify"><font face="verdana" size="2">From the errors obtained it can be seen that there is no numerical difference; therefore, the parallel version is correctly implemented.</font></p>  	    <p align="justify"><font face="verdana" size="2">The surface plots of the gravitational fields are shown in <a href="/img/revistas/geoint/v54n1/a3f21.jpg" target="_blank">Figure 21</a>. These plots correspond to the components of the gravimetric tensor calculated for the synthetic case shown in <a href="/img/revistas/geoint/v54n1/a3f8.jpg" target="_blank">Figure 8</a>.</font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><b><font face="verdana" size="2">Conclusions</font></b></p>     ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2">A parallel design for the calculation of the vector and tensor components of the gravity anomaly was implemented and validated using a hybrid methodology with OpenMP and MPI. The numerical experiments and the indicators obtained show that the implementation is very efficient and that its results agree well with the numerical solution.</font></p>  	    <p align="justify"><font face="verdana" size="2">We showed that the simplest or most trivial form of parallelization does not yield the best performance or the best exploitation of the platform. In our case, even though the partitioning by prisms requires a greater investment in design and implementation, it was the most advantageous in terms of performance.</font></p>  	    <p align="justify"><font face="verdana" size="2">HT can improve some numerically intensive applications by up to 30%; nevertheless, to get the best performance it is necessary to create two threads per core when HT is enabled.</font></p>  	    <p align="justify"><font face="verdana" size="2">We also conclude that this design can serve as a benchmark for problems that require parallelizing schemes where the decomposition of the domain is not trivial or where the domain is shared by the processing units, as is the case of the observation grid. Finally, the correct joint use of OpenMP and MPI can become a fundamental tool for parallel programming on clusters.</font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><b><font face="verdana" size="2">Future work</font></b></p>     <p align="justify"><font face="verdana" size="2">As future work we intend to implement the code in NVIDIA CUDA with TESLA technology and compare the results with the cluster performance presented in this paper, as well as to measure the error introduced by CUDA in single and double precision. 
The CUDA implementation is of interest because the reduction of variable values in CUDA is very complicated when the variables are shared, as is the case with the observation grid.</font></p> 	    <p align="justify">&nbsp;</p> 	    <p align="justify"><b><font face="verdana" size="2">Acknowledgment</font></b></p>     <p align="justify"><font face="verdana" size="2">The authors are grateful for the support provided by the Mexican Institute of Petroleum (IMP, www.imp.mx) in allowing access to its computing equipment, as well as for the financial support through project Y.00107, jointly created by IMP&#45;SENER&#45;CONACYT, number 128376. We would also like to express our gratitude to the two anonymous reviewers for their helpful comments.</font></p>     ]]></body>
<body><![CDATA[<p align="justify">&nbsp;</p>     <p align="justify"><b><font face="verdana" size="2">References</font></b></p>     <!-- ref --><p align="justify"><font face="verdana" size="2">Amritkar A., Tafti D., Liu R., Kufrin R., Chapman B., 2012, OpenMP parallelism for fluid and fluid&#45;particulate systems. <i>Parallel Computing</i>, 38, 9, 501&#45;517.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Boehmer S., Cramer T., Hafner M., Lange E., Bischof C., Hameyer K., 2012, Numerical simulation of electrical machines by means of a hybrid parallelisation using MPI and OpenMP for finite&#45;element method. <i>Science, Measurement &amp; Technology</i>, IET, 6, 5, 339&#45;343.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Brunst H., Mohr B., 2008, Performance analysis of large&#45;scale OpenMP and hybrid MPI/OpenMP applications with Vampir NG. In OpenMP Shared Memory Parallel Programming (pp. 5&#45;14). Springer Berlin Heidelberg.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Calvin C., Ye F., Petiton S., 2013, October, The Exploration of Pervasive and Fine&#45;Grained Parallel Model Applied on Intel Xeon Phi Coprocessor. In P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2013 Eighth International Conference on (pp. 166&#45;173). IEEE.<!-- end-ref --></font></p>  	    ]]></body>
<body><![CDATA[<!-- ref --><p align="justify"><font face="verdana" size="2">Carrillo&#45;Ledesma A., Herrera I., de la Cruz L.M., 2013, Parallel algorithms for computational models of geophysical systems. <i>Geof&iacute;sica Internacional</i>, 52, 3, 293&#45;309.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Couder&#45;Casta&ntilde;eda C., 2010, Simulation of supersonic flow in an ejector diffuser using the jpvm. <i>Journal of Applied Mathematics</i>, 2009.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Curtis&#45;Maury M., Ding X., Antonopoulos C.D., Nikolopoulos D.S., 2008, An evaluation of OpenMP on current and emerging multithreaded/multicore processors. In OpenMP Shared Memory Parallel Programming (pp. 133&#45;144). Springer Berlin Heidelberg.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Dagum L., Menon R., 1998, OpenMP: an industry standard API for shared&#45;memory programming. <i>Computational Science &amp; Engineering</i>, IEEE, 5(1), 46&#45;55.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Foster I., 1995, Designing and building parallel programs (pp. 83&#45;135). Addison Wesley Publishing Company.<!-- end-ref --></font></p>  	    ]]></body>
<body><![CDATA[<!-- ref --><p align="justify"><font face="verdana" size="2">Gonzalez B., Donate J.P., Cortez P., S&aacute;nchez G., De Miguel A., 2012, May, Parallelization of an evolving Artificial Neural Networks system to Forecast Time Series using OPENMP and MPI. In Evolving and Adaptive Intelligent Systems (EAIS), 2012 IEEE Conference on (pp. 186&#45;191). IEEE.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Heck B., Seitz K., 2007, A comparison of the tesseroid, prism and point&#45;mass approaches for mass reductions in gravity field modelling. <i>Journal of Geodesy</i>, 81, 2, 121&#45;136.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Jagannathan S., Donzis D.A., 2012, July, Massively parallel direct numerical simulations of forced compressible turbulence: a hybrid MPI/OpenMP approach. In Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond (p. 23). ACM.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Krpic Z., Martinovic G., Crnkovic I., 2012, May, Green HPC: MPI vs. OpenMP on a shared memory system. In MIPRO, 2012 Proceedings of the 35th International Convention (pp. 246&#45;250). IEEE.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Kornyei L., 2012, May, Parallel implementation of a combustion chamber simulation with MPI&#45;OpenMP hybrid techniques. In MIPRO, 2012 Proceedings of the 35th International Convention (pp. 356&#45;361). IEEE.<!-- end-ref --></font></p>  	    ]]></body>
<body><![CDATA[<!-- ref --><p align="justify"><font face="verdana" size="2">Menke W., 2012, Geophysical data analysis: discrete inverse theory. Academic press.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Mickus K.L., Hinojosa J.H., 2001, The complete gravity gradient tensor derived from the vertical component of gravity: a Fourier transform technique. <i>Journal of Applied Geophysics</i>, 46, 3, 159&#45;174.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Mitin I., Kalinkin A., Laevsky Y., 2012, A parallel iterative solver for positive&#45;definite systems with hybrid MPI&#45;OpenMP parallelization for multi&#45;core clusters. <i>Journal of Computational Science</i>, 3, 6, 463&#45;468.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Nagy D., Papp G., Benedek J., 2000, The gravitational potential and its derivatives for the prism. <i>Journal of Geodesy</i>, 74(7&#45;8), 552&#45;560.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Sariyuce A.E., Saule E., Catalyurek U.V., 2012, May, Scalable hybrid implementation of graph coloring using mpi and openmp. In Parallel and Distributed Processing Symposium Workshops &amp; PhD Forum (IPDPSW), 2012 IEEE 26th International (pp. 1744&#45;1753). IEEE.<!-- end-ref --></font></p>  	    ]]></body>
<body><![CDATA[<!-- ref --><p align="justify"><font face="verdana" size="2">Smith L.A., 2000, Mixed mode MPI/OpenMP programming. <i>UK High&#45;End Computing Technology Report</i>, 1&#45;25.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Uieda L., Bomfim E., Braitenberg C., Molina E., 2011, July, Optimal forward calculation method of the Marussi tensor due to a geologic structure at GOCE height. In Proceedings of GOCE User Workshop 2011.<!-- end-ref --></font></p>  	    <!-- ref --><p align="justify"><font face="verdana" size="2">Zhang Y., Burcea M., Cheng V., Ho R., Voss M., 2004, September, An Adaptive OpenMP Loop Scheduler for Hyperthreaded SMPs. In ISCA PDCS (pp. 256&#45;263).<!-- end-ref --></font></p>      ]]></body><back>
<ref-list>
<ref id="B1">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Amritkar]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Tafti]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Liu]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Kufrin]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Chapman]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[OpenMP parallelism for fluid and fluid-particulate systems]]></article-title>
<source><![CDATA[Parallel Computing]]></source>
<year>2012</year>
<volume>38</volume>
<numero>9</numero>
<issue>9</issue>
<page-range>501-517</page-range></nlm-citation>
</ref>
<ref id="B2">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Boehmer]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Cramer]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Hafner]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Lange]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Bischof]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Hameyer]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Numerical simulation of electrical machines by means of a hybrid parallelisation using MPI and OpenMP for finite-element method]]></article-title>
<source><![CDATA[Science, Measurement & Technology]]></source>
<year>2012</year>
<volume>6</volume>
<numero>5</numero>
<issue>5</issue>
<page-range>339-343</page-range><publisher-name><![CDATA[IET]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B3">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Brunst]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
<name>
<surname><![CDATA[Mohr]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Performance analysis of large-scale OpenMP and hybrid MPI/OpenMP applications with Vampir NG.]]></article-title>
<collab>OpenMP Shared Memory Parallel Programming</collab>
<source><![CDATA[]]></source>
<year>2008</year>
<page-range>5-14</page-range><publisher-loc><![CDATA[Berlin Heidelberg]]></publisher-loc>
<publisher-name><![CDATA[Springer]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Calvin]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Ye]]></surname>
<given-names><![CDATA[F.]]></given-names>
</name>
<name>
<surname><![CDATA[Petiton]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The Exploration of Pervasive and Fine-Grained Parallel Model Applied on Intel Xeon Phi Coprocessor]]></article-title>
<collab>P2P, Parallel, Grid, Cloud and Internet Computing</collab>
<source><![CDATA[]]></source>
<year>2013</year>
<month>October</month>
<page-range>166-173</page-range><publisher-name><![CDATA[IEEE]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B5">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Carrillo-Ledesma]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Herrera]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[de la Cruz]]></surname>
<given-names><![CDATA[L.M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Parallel algorithms for computational models of geophysical systems]]></article-title>
<source><![CDATA[Geofísica Internacional]]></source>
<year>2013</year>
<volume>52</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>293-309</page-range></nlm-citation>
</ref>
<ref id="B6">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Couder-Castañeda]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Simulation of supersonic flow in an ejector diffuser using the JPVM]]></article-title>
<source><![CDATA[Journal of Applied Mathematics]]></source>
<year>2010</year>
<month>20</month>
<day>09</day>
</nlm-citation>
</ref>
<ref id="B7">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Curtis-Maury]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Ding]]></surname>
<given-names><![CDATA[X.]]></given-names>
</name>
<name>
<surname><![CDATA[Antonopoulos]]></surname>
<given-names><![CDATA[C.D.]]></given-names>
</name>
<name>
<surname><![CDATA[Nikolopoulos]]></surname>
<given-names><![CDATA[D.S.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[An evaluation of OpenMP on current and emerging multithreaded/multicore processors]]></article-title>
<collab>OpenMP Shared Memory Parallel Programming</collab>
<source><![CDATA[]]></source>
<year>2008</year>
<page-range>133-144</page-range><publisher-loc><![CDATA[Berlin, Heidelberg]]></publisher-loc>
<publisher-name><![CDATA[Springer]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B8">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Dagum]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Menon]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[OpenMP: an industry standard API for shared-memory programming]]></article-title>
<source><![CDATA[Computational Science & Engineering]]></source>
<year>1998</year>
<volume>5</volume>
<numero>1</numero>
<issue>1</issue>
<page-range>46-55</page-range><publisher-name><![CDATA[IEEE]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Foster]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
</person-group>
<source><![CDATA[Designing and building parallel programs]]></source>
<year>1995</year>
<page-range>83-135</page-range><publisher-name><![CDATA[Addison Wesley Publishing Company]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B10">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Gonzalez]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Donate]]></surname>
<given-names><![CDATA[J.P.]]></given-names>
</name>
<name>
<surname><![CDATA[Cortez]]></surname>
<given-names><![CDATA[P.]]></given-names>
</name>
<name>
<surname><![CDATA[Sánchez]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[De Miguel]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Parallelization of an evolving Artificial Neural Networks system to Forecast Time Series using OPENMP and MPI]]></article-title>
<collab>Evolving and Adaptive Intelligent Systems</collab>
<source><![CDATA[]]></source>
<year>2012</year>
<month>, </month>
<day>Ma</day>
<page-range>186-191</page-range><publisher-name><![CDATA[IEEE]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B11">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Heck]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Seitz]]></surname>
<given-names><![CDATA[K.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A comparison of the tesseroid, prism and point-mass approaches for mass reductions in gravity field modelling]]></article-title>
<source><![CDATA[Journal of Geodesy]]></source>
<year>2007</year>
<volume>81</volume>
<numero>2</numero>
<issue>2</issue>
<page-range>121-136</page-range></nlm-citation>
</ref>
<ref id="B12">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Jagannathan]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Donzis]]></surname>
<given-names><![CDATA[D.A.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Massively parallel direct numerical simulations of forced compressible turbulence: a hybrid MPI/OpenMP approach]]></article-title>
<source><![CDATA[Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond]]></source>
<year>2012</year>
<month>, </month>
<day>Ju</day>
<page-range>23</page-range><publisher-name><![CDATA[ACM]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B13">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Krpic]]></surname>
<given-names><![CDATA[Z.]]></given-names>
</name>
<name>
<surname><![CDATA[Martinovic]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Crnkovic]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Green HPC: MPI vs. OpenMP on a shared memory system]]></article-title>
<collab>MIPRO</collab>
<source><![CDATA[Proceedings of the 35th International Convention]]></source>
<year>2012</year>
<month>, </month>
<day>Ma</day>
<page-range>246-250</page-range><publisher-name><![CDATA[IEEE]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B14">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kornyei]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Parallel implementation of a combustion chamber simulation with MPI-OpenMP hybrid techniques]]></article-title>
<collab>MIPRO</collab>
<source><![CDATA[Proceedings of the 35th International Convention]]></source>
<year>2012</year>
<month>, </month>
<day>Ma</day>
<page-range>356-361</page-range><publisher-name><![CDATA[IEEE]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B15">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Menke]]></surname>
<given-names><![CDATA[W.]]></given-names>
</name>
</person-group>
<source><![CDATA[Geophysical data analysis: discrete inverse theory]]></source>
<year>2012</year>
<publisher-name><![CDATA[Academic Press]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B16">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Mickus]]></surname>
<given-names><![CDATA[K.L.]]></given-names>
</name>
<name>
<surname><![CDATA[Hinojosa]]></surname>
<given-names><![CDATA[J.H.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The complete gravity gradient tensor derived from the vertical component of gravity: a Fourier transform technique]]></article-title>
<source><![CDATA[Journal of Applied Geophysics]]></source>
<year>2001</year>
<volume>46</volume>
<numero>3</numero>
<issue>3</issue>
<page-range>159-174</page-range></nlm-citation>
</ref>
<ref id="B17">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Mitin]]></surname>
<given-names><![CDATA[I.]]></given-names>
</name>
<name>
<surname><![CDATA[Kalinkin]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[Laevsky]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[A parallel iterative solver for positive-definite systems with hybrid MPI-OpenMP parallelization for multi-core clusters]]></article-title>
<source><![CDATA[Journal of Computational Science]]></source>
<year>2012</year>
<volume>3</volume>
<numero>6</numero>
<issue>6</issue>
<page-range>463-468</page-range></nlm-citation>
</ref>
<ref id="B18">
<nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Nagy]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Papp]]></surname>
<given-names><![CDATA[G.]]></given-names>
</name>
<name>
<surname><![CDATA[Benedek]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[The gravitational potential and its derivatives for the prism]]></article-title>
<source><![CDATA[Journal of Geodesy]]></source>
<year>2000</year>
<volume>74</volume>
<numero>7-8</numero>
<issue>7-8</issue>
<page-range>552-560</page-range></nlm-citation>
</ref>
<ref id="B19">
<nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Sariyuce]]></surname>
<given-names><![CDATA[A.E.]]></given-names>
</name>
<name>
<surname><![CDATA[Saule]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Catalyurek]]></surname>
<given-names><![CDATA[U.V.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Scalable hybrid implementation of graph coloring using MPI and OpenMP]]></article-title>
<collab>Parallel and Distributed Processing Symposium Workshops & PhD Forum</collab>
<source><![CDATA[]]></source>
<year>2012</year>
<month>, </month>
<day>Ma</day>
<edition>26th International</edition>
<page-range>1744-1753</page-range><publisher-name><![CDATA[IEEE]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B20">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Smith]]></surname>
<given-names><![CDATA[L.A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Mixed mode MPI/OpenMP programming]]></source>
<year>2000</year>
<page-range>1-25</page-range></nlm-citation>
</ref>
<ref id="B21">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Uieda]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Bomfim]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
<name>
<surname><![CDATA[Braitenberg]]></surname>
<given-names><![CDATA[C.]]></given-names>
</name>
<name>
<surname><![CDATA[Molina]]></surname>
<given-names><![CDATA[E.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Optimal forward calculation method of the Marussi tensor due to a geologic structure at GOCE height]]></article-title>
<source><![CDATA[Proceedings of GOCE User Workshop 2011]]></source>
<year>2011</year>
<month>, </month>
<day>Ju</day>
</nlm-citation>
</ref>
<ref id="B22">
<nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Zhang]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Burcea]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Cheng]]></surname>
<given-names><![CDATA[V.]]></given-names>
</name>
<name>
<surname><![CDATA[Ho]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
<name>
<surname><![CDATA[Voss]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[An Adaptive OpenMP Loop Scheduler for Hyperthreaded SMPs]]></article-title>
<collab>ISCA PDCS</collab>
<source><![CDATA[]]></source>
<year>2004</year>
<month>September</month>
<page-range>256-263</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
