1 Introduction
The quantity of information that society produces far exceeds the human capacity to categorize it. Machine Learning has therefore developed algorithms capable of processing large quantities of documents and associating them with certain topics. This type of analysis is called Probabilistic Topic Modeling. In these models, statistical methods are used to analyze the words of the texts and to identify the groups of words that constitute the topics of which the documents are composed. One of the methods that precedes the statistical approaches is the Latent Semantic Analysis (LSA) model, which applies the algebraic principle of singular value decomposition, that is, the factorization of a matrix [19].
Hofmann [15, 16], starting from the LSA model, developed a new algorithm, Probabilistic Latent Semantic Analysis (pLSA), that moved away from linear algebra and centered on statistical analysis. The modification introduced by this new approach was the creation of topics, a new relation between words and documents: words are distributed across the topics, may be shared among them, and each word has a different relevance within each document.
As can be expected, the topics must be inferred from the text. This unsupervised learning method uses the iterative Expectation Maximization (EM) algorithm to determine the distribution of the latent variables.
Blei [8] takes up the idea of a latent structure of topics and designs a generative model of documents called Latent Dirichlet Allocation (LDA). It is based on the concept that a collection of documents presents a logical structure of words that belong, in different proportions, to each of the topics responsible for generating the documents. Instead of directly estimating the word-topic and topic-document distributions, it takes another approach: estimating the posterior distribution given the observed words.
Currently, this algorithm is one of the simplest and most widely used in Topic Modeling.
The main objective of this work is to develop a system based on the Latent Dirichlet Allocation (LDA) model that automatically selects the corpus words and, based on their frequency in the documents, indicates whether or not they belong to a certain topic.
The system is implemented as a Serial-Parallel Algorithm (SPA) in the C/C++ programming language, with parallel programming using OpenMP on a shared memory architecture.
This document is organized as follows. Section 2 reviews the state of the art in the field of topic modeling and LDA. Section 3 presents the design and development of the LDA model. Section 4 describes the experimental part. Section 5 presents the results and discussion of applying the LDA model. Finally, Section 6 presents final comments and the future work of this research.
2 State of the Art
In this section, the main contributions of works related to the application of topic models and LDA are reviewed, as described below.
Important contributions to statistical analysis have shaped what is known as Probabilistic Topic Modeling: statistical algorithms that discover latent semantic structures in extensive bodies of text. Some interesting language properties have been modeled in this way, such as hierarchical semantic relations between words [8] and the interaction between syntax and semantics [13]. The technical note of [20] reviews the foundations of Bayesian parameter estimation in the discrete domain, which is necessary to understand the inner workings of topic-based text analysis approaches such as Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA) [6], and other mixture methods for count data.
2.1 Topic Modeling
The Latent Dirichlet Allocation (LDA) model is a generative probabilistic model. The basic idea is that the documents to be classified are represented as random mixtures of hidden topics, where each topic is characterized by a distribution over words. The distribution of categories has a Dirichlet prior; this model is the most commonly used in the works described below.
Ruiz and Campos [25] present a study where they measure the performance of an algorithm that combines non-linear kernels, co-occurrences of skull shape features, a variable selection method, and a standard dimensionality reduction algorithm to characterize and classify malformations caused by primary craniosynostosis. The classification technique used in this work assumes that a particular class of skull shapes can be represented through characteristic patterns. The implemented dimensionality reduction model is based on the LDA algorithm: LDA is used as a generative model that samples skull shape features from a mixture of geometric topics. The results of that study suggest that the combination of shape descriptor calculations (of the skulls) and the LDA technique results in low classification error rates.
In the work of Bisgin, Liu, Fang, Xu, and Tong [7], the relationship between medicines and drugs approved by the FDA is determined. Each drug is associated with its most probable topic; the model used for topic extraction is LDA.
Hu [17] employs unsupervised learning in the study of two books of the Bible (Proverbs and Psalms) due to the information they share. The chapters of each book are grouped by content. The model that determines the association between the books is the LDA algorithm: the topics extracted from each document identified by the model are used to define a correlation and measure the similarity between both books.
Rodríguez [24] presents tools that analyze sport videos using computer vision and pattern recognition techniques. Currently, to improve sport results, clubs keep staff in charge of analyzing matches, both their own and those of rival teams, to find patterns in their gameplay and thus study the best way to obtain competitive advantages. Once the correct functioning of the algorithm was evaluated, the next step was to test the model on sport videos and see whether it is capable of breaking them down into topics. In the discovery phase, the LDA unsupervised learning method is used to discover activities and interactions in very crowded places.
Dueñas and Velásquez [10] present an alternative methodology to detect Web trends through the use of information retrieval techniques, topic modeling, and opinion mining. Given a set of Websites, the topics mentioned in the retrieved texts are extracted and, later, social networks are used to obtain the opinion of their users in relation to these topics. LDA is used to extract the important topics from the opinions recorded in blogs and news sites.
Seiter, Amft, Rossi, and Tröster [26] compare the results of three unsupervised models (LDA, n-gram TM, and Correlated TM (CTM)) on three public activity data sets in order to obtain guidelines for selecting TM parameters depending on the properties of the data set. In that work, it is determined that the main limitation of the unigram model is that it assumes that all documents are simply collections of homogeneous words, that is to say, that every document presents one unique topic. The experimental results on the selected data sets showed that the LDA approach is less sensitive to noise when discovering information. The LDA method was initially developed to model discrete data sets, typically text documents. In the task of modeling a document, LDA performs better than LSI and a mixture of unigram models: LSI overfits the document modeling probabilities used to determine the topics of a new document. Predictably, the great advance of LDA over LSI is that probabilities are easily assigned to a new document. Its application to document classification shows that the LDA model can be useful as a screening algorithm for the topic selection function.
LDA is a robust and generic model that is easily extendable beyond the empirical data of a small discussed set. Diverse articles have been published on the application of LDA to a wide range of areas; it has been applied to tasks ranging from fraud screening in telecommunications to error detection in source code. Despite this wide range of applications, LDA has not been applied to automatic document summarization, although the possibility is quite feasible. The LDA model has been shown to be a method that allows a class or category to be assigned to an object in order to create learning, generating good separability between classes and good cohesion within the same class.
The words of the documents are the observable variables, while the topics, the relation between each document and the topics, and the correspondence between each word and each topic are non-explicit distributions. Considering the words observed in a set of documents, the set of topics most likely to have generated the data must be determined. This implies inferring the probability distribution over words associated with each topic and the distribution over the topics of each document.
3 Design and Development of LDA Model
3.1 LDA Generative Model
In LDA (Latent Dirichlet Allocation), if the observations are words in documents, each document can be seen as a mixture of several topics. The distribution of topics is assumed to have a Dirichlet prior [20]:

$$ \mathrm{Dir}(p;\alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_{k}\right)}{\prod_{k=1}^{K}\Gamma(\alpha_{k})} \prod_{k=1}^{K} p_{k}^{\alpha_{k}-1}, $$

where p = (p1, ..., pK) with pk ≥ 0 and Σk pk = 1, and α = (α1, ..., αK) is a vector of positive reals. Since Dir(p;α) is a density function, it satisfies:

$$ \int \mathrm{Dir}(p;\alpha)\, dp = 1, $$

which implies that:

$$ \int \prod_{k=1}^{K} p_{k}^{\alpha_{k}-1}\, dp = \frac{\prod_{k=1}^{K}\Gamma(\alpha_{k})}{\Gamma\!\left(\sum_{k=1}^{K}\alpha_{k}\right)}. $$
3.2 Development of LDA Model
The list of symbols of the LDA model used in the following development is described in Table 1.
Variables | Meaning |
N | Number of words in the corpus. |
W = {wi} | The corpus; wi denotes a word. |
Z = {zi} | Latent topics; zi is the topic assigned to the i-th word. |
W-i = W - {wi} | The corpus excluding wi. |
Z-i = Z - {zi} | The topic assignments excluding zi. |
K | Number of topics. |
V | Number of words (terms) in the vocabulary. |
D | Number of documents. |
Nd | Number of words in document d. |
v | Denotes a word (term) in the vocabulary. |
α | K-dimensional vector of positive reals. |
β | V-dimensional vector of positive reals. |
θ | K-dimensional vector of probabilities. |
φ | V-dimensional vector of probabilities. |
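As an illustration, the quantities of Table 1 map directly onto data structures in C/C++. The following is a minimal sketch; the struct and field names are our own and are not taken from the system described later:

```cpp
#include <vector>
#include <string>

// Sketch of the main LDA quantities from Table 1 (illustrative only).
struct LdaModel {
    int K;                                   // number of topics
    int V;                                   // vocabulary size (terms)
    int D;                                   // number of documents
    std::vector<std::vector<int>> W;         // W[d][i]: term index of the i-th word of document d
    std::vector<std::vector<int>> Z;         // Z[d][i]: topic assigned to the i-th word of document d
    std::vector<double> alpha;               // K-dimensional Dirichlet hyperparameter
    std::vector<double> beta;                // V-dimensional Dirichlet hyperparameter
    std::vector<std::vector<double>> theta;  // D x K: topic proportions per document
    std::vector<std::vector<double>> phi;    // K x V: word probabilities per topic
};
```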
The model specifies the following distribution over the words within a document:

$$ p(w_i) = \sum_{j=1}^{K} p(w_i \mid z_i = j)\, p(z_i = j), $$

where p(zi = j) is the probability that the j-th topic was chosen for the i-th word and p(wi | zi = j) is the probability of drawing the word wi given topic j.
Table 2 is an example of the distribution of words in a corpus.
Term_1 | Term_2 | Term_3 | Term_4 | Term_5 | |
Doc_1 | 0.2 | 0.1 | 0.1 | 0.4 | 0.2 |
Doc_2 | 0.5 | 0.2 | 0.1 | 0.1 | 0.1 |
Doc_3 | 0.0 | 0.2 | 0.4 | 0.0 | 0.4 |
Doc_4 | 0.2 | 0.0 | 0.3 | 0.2 | 0.3 |
The probability of a word wi within a document d can then be written as:

$$ p(w_i \mid d) = \sum_{j=1}^{K} \varphi^{(j)}_{w_i}\, \theta^{(d)}_{j}, $$

where φ(j) = p(w | z = j) denotes the distribution over terms for topic j and θ(d) = p(z | d) denotes the distribution over topics for document d.
Table 3 is an example of the distribution of terms in the topics.
Term_1 | Term_2 | Term_3 | Term_4 | Term_5 | |
Top_1 | 0.3 | 0.1 | 0.0 | 0.4 | 0.2 |
Top_2 | 0.1 | 0.4 | 0.1 | 0.2 | 0.2 |
Top_3 | 0.0 | 0.2 | 0.4 | 0.3 | 0.1 |
The LDA model introduces a Dirichlet prior on θ [20]:

$$ \theta \sim \mathrm{Dir}(\alpha). $$

The hyperparameter α is the K-dimensional vector of positive reals of Table 1 and smooths the topic proportions of each document. A Dirichlet prior is likewise introduced on φ:

$$ \varphi \sim \mathrm{Dir}(\beta). $$

Table 4 is an example of the distribution of topics in the documents.
Top_1 | Top_2 | Top_3 | Top_4 | |
Doc_1 | 0.4 | 0.2 | 0.1 | 0.3 |
Doc_2 | 0.8 | 0.0 | 0.2 | 0.0 |
Doc_3 | 0.3 | 0.2 | 0.1 | 0.4 |
Doc_4 | 0.0 | 0.3 | 0.4 | 0.3 |
The hyperparameters α and β can be interpreted as prior observation counts: α smooths the distribution of topics within each document, while β smooths the distribution of terms within each topic. Good choices for the hyperparameters depend on the number of topics K and on the size of the vocabulary V.
The main variables of interest in the model are the word-topic distributions φ and the topic-document distributions θ.
In this work, an algorithm based on Gibbs sampling is applied; Gibbs sampling is a form of Markov chain Monte Carlo that is easy to implement and provides a relatively efficient method to extract a set of topics from a large corpus [9, 11].
Gibbs sampling (also known as alternating conditional sampling) is a specific form of MCMC that simulates a multidimensional distribution by sampling lower-dimensional subsets of variables, each subset conditioned on the value of all the others. Sampling proceeds iteratively until the sampled values approximate the target distribution. The Gibbs process does not provide direct estimates of φ and θ; however, these can be estimated from the sampled topic assignments Z.
3.3 Gibbs Sampling
To obtain a sample of the posterior distribution p(Z | W), Gibbs sampling requires the full conditional distribution p(zi | Z-i, W) of each topic assignment given all the other assignments and the observed words.
On the other hand, it is known that:
and
Using (4), (5), and (3), we obtain:
Now,
Substituting (11) and (12) into (9), we obtain:
As
But it is known that:
Doing the same as for obtaining (13), we have:
where
Finally, using (13), (14), (15), and (16), expression (8) can be written as:
Then:
Using the equalities
we obtain:
Finally, the Bayes estimates of the parameters of the posterior distribution are:

$$ \hat{\varphi}_{k,v} = \frac{n_{k}^{(v)} + \beta_{v}}{\sum_{v'=1}^{V}\left(n_{k}^{(v')} + \beta_{v'}\right)}, \qquad \hat{\theta}_{d,k} = \frac{n_{d}^{(k)} + \alpha_{k}}{\sum_{k'=1}^{K}\left(n_{d}^{(k')} + \alpha_{k'}\right)}, $$

where n_k^(v) is the number of times the term v has been assigned to topic k and n_d^(k) is the number of words of document d assigned to topic k (these counts correspond to the matrices ntk and nkm used in Section 4.2).
Table 5 shows how each word in a document is assigned to a topic.
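The derivation above leads to the standard collapsed Gibbs full conditional, in which the probability of assigning topic k to the i-th word is proportional to a term-given-topic factor multiplied by a topic-given-document factor. The following sketch shows one way such an update could be coded in C++, using the count matrices ntk (topics by vocabulary) and nkm (documents by topics) introduced in Section 4.2; the function name, the symmetric scalar hyperparameters alfa and beta, and the auxiliary row-sum vector are assumptions made here for illustration only:

```cpp
#include <vector>
#include <random>

// Sample a new topic for term v of document d from the collapsed Gibbs
// full conditional, using the count matrices ntk (K x V) and nkm (D x K).
// The counts of the current assignment are assumed to have been removed
// already. Symmetric scalar hyperparameters are assumed for simplicity.
int sample_topic(int d, int v,
                 const std::vector<std::vector<double>>& ntk,
                 const std::vector<std::vector<double>>& nkm,
                 const std::vector<double>& ntk_rowsum,   // per-topic totals over all terms
                 double alfa, double beta, int K, int V,
                 std::mt19937& rng) {
    std::vector<double> p(K);
    double total = 0.0;
    for (int k = 0; k < K; ++k) {
        // (term | topic) factor times (topic | document) factor
        p[k] = (ntk[k][v] + beta) / (ntk_rowsum[k] + V * beta)
             * (nkm[d][k] + alfa);
        total += p[k];
    }
    std::uniform_real_distribution<double> u(0.0, total);
    double r = u(rng), acc = 0.0;
    for (int k = 0; k < K; ++k) {
        acc += p[k];
        if (r <= acc) return k;
    }
    return K - 1;  // numerical safety
}
```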
4 Experimental Part
The LDA system is executed in parallel form, on several processors and cores, so that the execution time is greatly reduced. In the ideal parallelization there must be data independence between tasks, that is to say, task A does not need information from task B to do its work, and vice versa. However, there are cases in which tasks, despite being able to execute in parallel, need to communicate with each other, either by message passing or because they share variables; thus, there must be a synchronization mechanism [2, 11, 22].
4.1 Programming Language
Nowadays there is a wide variety of programming languages for parallel or concurrent computing; among them C/C++, Java, Python, and Julia stand out. However, what is sought for the implementation of the LDA system is a high-performance language. On the Julia web page [6], the graphic shown in Fig. 14 presents benchmarks for different programming languages, with results compared against C, where it can be clearly seen that C is rarely outperformed. For these reasons it was decided to program the LDA system in C/C++.
4.2 Programming of the LDA System
For the programming of the LDA system, a Serial-Parallel Algorithm (SPA) has been used, which can be seen graphically in Fig. 2.
To carry out the parallel programming, it was decided to use OpenMP, because in the parallel stages all created threads must share certain variables; the fact that this is not possible with MPI causes it to be discarded immediately. Thus, the memory architecture used is shared memory [4, 5, 3, 21, 27].
4.2.1 Initialization
This stage initializes the system: the number of topics for the corpus (k) and the number of iterations (iter) of the LDA system are defined, and the values of alfa and beta used in the model equations are set.
Once this is done, the names of the files that belong to the corpus must be read and stored, in order to later load the whole vocabulary, along with its frequency, into a dictionary. When the dictionary is created, all words whose frequency in the corpus is less than three are deleted: words with a very low frequency cannot define topics and would only add noise to the final results.
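A minimal sketch of this dictionary construction step is shown below; the function and container choices are illustrative and are not taken from the original code:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Build the vocabulary from already tokenized documents and drop rare words.
// The threshold of 3 follows the rule described in the text.
std::unordered_map<std::string, int> build_vocabulary(
        const std::vector<std::vector<std::string>>& documents) {
    std::unordered_map<std::string, int> freq;
    for (const auto& doc : documents)
        for (const auto& word : doc)
            ++freq[word];

    std::unordered_map<std::string, int> vocab;  // word -> term id
    int id = 0;
    for (const auto& [word, count] : freq)
        if (count >= 3)                           // discard low-frequency words
            vocab[word] = id++;
    return vocab;
}
```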
Knowing the number of files that belong to the corpus and the number of words in the dictionary, two matrices of real numbers are created: the first represents topics by vocabulary (called ntk) and the second represents documents by topics (called nkm).
Special care must be taken in this part, since the created matrices can become very large. For example, suppose 1,000 topics are chosen, the vocabulary with frequency greater than or equal to 3 has a size of 500,000 words, there is a total of 100,000 files, and, as the matrices are real-valued, double is used as the data type.
The following formula can be used to estimate the RAM needed to execute the system (the result is in Gigabytes):

$$ \mathrm{RAM} = \frac{8\,(K \cdot V + D \cdot K)}{1024^{3}}, $$

where K is the number of topics, V the size of the vocabulary, D the number of documents, and 8 is the size in bytes of a double.
So, for this situation, 4.47 Gigabytes are necessary. This formula is only a reference for the creation of these matrices, since in the parallel stages each thread uses certain variables of its own to perform its work; therefore, it is recommended that the available RAM be at least double the result of this formula.
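The estimate can be computed directly, as in the following small sketch (assuming 8-byte doubles for both matrices):

```cpp
// Approximate RAM, in gigabytes, needed for the ntk (K x V) and nkm (D x K)
// matrices of doubles. For K = 1000, V = 500000, D = 100000 this gives ~4.47 GB.
double matrices_ram_gb(long long K, long long V, long long D) {
    return 8.0 * (K * V + D * K) / (1024.0 * 1024.0 * 1024.0);
}
```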
4.2.2 First Iteration
As the name of this stage indicates, it consists of the first iteration over the corpus. To carry out the parallelization, the files are spread among the threads, and the matrices ntk and nkm are shared by all of them, together with k, alfa, and beta. In this first pass, each word of the corpus receives an initial topic assignment and the corresponding counts in ntk and nkm are accumulated.
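A possible sketch of this first pass is shown below; the random initialization of topic assignments, the per-thread random number generators, and the atomic count updates are our own choices for illustration and need not match the original system:

```cpp
#include <omp.h>
#include <random>
#include <vector>

// First pass: assign an initial (random) topic to every word and accumulate
// the counts. Documents are distributed among threads; ntk and nkm are shared.
// Z must be pre-sized with the same shape as docs.
void first_iteration(const std::vector<std::vector<int>>& docs,   // term ids per document
                     std::vector<std::vector<int>>& Z,            // topic assignments
                     std::vector<std::vector<double>>& ntk,
                     std::vector<std::vector<double>>& nkm,
                     int K) {
    #pragma omp parallel
    {
        std::mt19937 rng(12345 + omp_get_thread_num());   // one RNG per thread
        std::uniform_int_distribution<int> pick(0, K - 1);
        #pragma omp for schedule(dynamic)
        for (int d = 0; d < (int)docs.size(); ++d) {
            for (size_t i = 0; i < docs[d].size(); ++i) {
                int k = pick(rng);
                Z[d][i] = k;
                #pragma omp atomic
                ntk[k][docs[d][i]] += 1.0;
                #pragma omp atomic
                nkm[d][k] += 1.0;
            }
        }
    }
}
```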
4.2.3 Iterations
This stage is perhaps the most important and the reason the system is developed in parallel, since the number of iterations determines the number of times each file has to be opened; that is to say, if there is a corpus with 100,000 files and 1,000 iterations are desired, then a total of 100,000,000 file openings are required, which is not a small quantity. As in stage 1, in each iteration the files are divided among all threads, and the matrices ntk and nkm are shared again, along with k, beta, and alfa. The update of both matrices is done using Gibbs sampling.
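A sketch of one such iteration is shown below, reusing the hypothetical sample_topic routine sketched at the end of Section 3.3. This is a simplified view: the shared count updates are left unsynchronized here for brevity, whereas a production implementation (and possibly the original system) would guard them or partition the work differently:

```cpp
#include <omp.h>
#include <random>
#include <vector>

// One Gibbs sweep over the corpus: documents are split among threads,
// ntk and nkm are shared, and each word's topic is resampled in turn.
// NOTE: for brevity, the shared count updates are not synchronized here.
void gibbs_iteration(const std::vector<std::vector<int>>& docs,
                     std::vector<std::vector<int>>& Z,
                     std::vector<std::vector<double>>& ntk,
                     std::vector<std::vector<double>>& nkm,
                     std::vector<double>& ntk_rowsum,
                     double alfa, double beta, int K, int V) {
    #pragma omp parallel
    {
        std::mt19937 rng(777 + omp_get_thread_num());
        #pragma omp for schedule(dynamic)
        for (int d = 0; d < (int)docs.size(); ++d) {
            for (size_t i = 0; i < docs[d].size(); ++i) {
                int v = docs[d][i], old_k = Z[d][i];
                // Remove the current assignment from the counts...
                ntk[old_k][v] -= 1.0; nkm[d][old_k] -= 1.0; ntk_rowsum[old_k] -= 1.0;
                // ...sample a new topic from the full conditional...
                int new_k = sample_topic(d, v, ntk, nkm, ntk_rowsum,
                                         alfa, beta, K, V, rng);
                // ...and add the new assignment back.
                ntk[new_k][v] += 1.0; nkm[d][new_k] += 1.0; ntk_rowsum[new_k] += 1.0;
                Z[d][i] = new_k;
            }
        }
    }
}
```

Distributing whole documents among threads matches the description above: each file is processed by a single thread, while the global count matrices remain shared.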
4.2.4 Creation of Theta
Once the computationally heaviest part of the system is finished, theta is calculated. At the end of the calculations, theta contains, for each document, the probability of each topic. In this stage, parallelization is achieved by spreading the rows of the theta matrix among the threads.
4.2.5 Creation of Phi
Stage 5 is very similar to stage 4, with the difference that now the probabilities that each word belongs to each topic are stored in phi. Similarly, parallelization is achieved by dividing the rows of phi among the threads.
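Stages 4 and 5 amount to normalizing the count matrices row by row. A sketch of both is shown below, again assuming symmetric scalar hyperparameters rather than the vector-valued α and β of Table 1:

```cpp
#include <omp.h>
#include <vector>

// Stage 4: theta[d][k] = probability of topic k in document d.
// Stage 5: phi[k][v]   = probability of term v in topic k.
// Rows are distributed among threads; each row is independent.
void compute_theta_phi(const std::vector<std::vector<double>>& nkm,   // D x K
                       const std::vector<std::vector<double>>& ntk,   // K x V
                       std::vector<std::vector<double>>& theta,
                       std::vector<std::vector<double>>& phi,
                       double alfa, double beta, int K, int V) {
    #pragma omp parallel for
    for (int d = 0; d < (int)nkm.size(); ++d) {
        double nd = 0.0;
        for (int k = 0; k < K; ++k) nd += nkm[d][k];
        for (int k = 0; k < K; ++k)
            theta[d][k] = (nkm[d][k] + alfa) / (nd + K * alfa);
    }
    #pragma omp parallel for
    for (int k = 0; k < K; ++k) {
        double nk = 0.0;
        for (int v = 0; v < V; ++v) nk += ntk[k][v];
        for (int v = 0; v < V; ++v)
            phi[k][v] = (ntk[k][v] + beta) / (nk + V * beta);
    }
}
```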
4.2.6 Most Representative Topics of a Document
At this point of the program, only the topics that best describe each document/file are of interest, so in this part the 10 topics with the highest probabilities for each file are sought in theta. Once found, they are stored in a text file.
4.2.7 Representative Words of a Topic
Similar to stage 6, in this stage the interest is in the words that best represent the topics, so the 20 words with the highest probabilities for each topic are sought in phi. Once found, they are stored in a text file.
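Stages 6 and 7 are both top-N selections over the rows of theta and phi. One possible sketch, using std::partial_sort (illustrative only):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Return the indices of the n largest entries of a probability row,
// in descending order. Used with n = 10 on theta rows (top topics per
// document) and with n = 20 on phi rows (top words per topic).
std::vector<int> top_n(const std::vector<double>& row, int n) {
    std::vector<int> idx(row.size());
    std::iota(idx.begin(), idx.end(), 0);
    n = std::min<int>(n, (int)idx.size());
    std::partial_sort(idx.begin(), idx.begin() + n, idx.end(),
                      [&row](int a, int b) { return row[a] > row[b]; });
    idx.resize(n);
    return idx;
}
```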
4.2.8 Entropy Calculation
This last stage has been added to measure the entropy of the results. It serves to identify an optimal number of topics: at the beginning the exact number of topics in the corpus may not be known; nevertheless, after executing the system with different numbers of topics, a graphic can be constructed with the results of this stage and, based on that graphic, the optimal number of topics for the corpus can be observed.
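The text does not spell out the exact entropy formula used; one plausible reading, assumed here purely for illustration, is the average entropy (in bits) of the per-topic word distributions stored in phi:

```cpp
#include <cmath>
#include <vector>

// Average entropy, in bits, of the rows of phi (one row per topic).
// This is only one possible interpretation of the entropy measure
// described in the text.
double average_topic_entropy(const std::vector<std::vector<double>>& phi) {
    double total = 0.0;
    for (const auto& row : phi) {
        double h = 0.0;
        for (double p : row)
            if (p > 0.0) h -= p * std::log2(p);
        total += h;
    }
    return phi.empty() ? 0.0 : total / phi.size();
}
```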
4.3 Execution of the LDA System
The corpus used as a base is the Spanish Wikicorpus [23], which contains about 120 million words.
4.3.1 Pre-Processing of Data
This part is fundamental in the execution of the system, since it is where the input data are "cleaned" so that good results are obtained. Abella and Medina [1] discuss data pre-processing, from which the following steps stand out (a small code sketch is given after the list):
— Elimination of punctuation marks.
— Reduction of capital letters (conversion to lowercase).
— Elimination of tags if the document comes from an XML-type format.
— Elimination of empty words: those that do not contribute any meaning to the text. Examples are articles, prepositions, conjunctions, etc.
— Extraction of stems (etymons) and lemmas.
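A minimal sketch of the first of these cleaning steps (lowercasing, punctuation removal, and stopword filtering); lemmatization is not shown, and the stopword list is only a placeholder:

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Lowercase a text, strip punctuation, and drop stopwords.
// The stopword set below is only a tiny placeholder example.
std::vector<std::string> clean_text(const std::string& text) {
    static const std::unordered_set<std::string> stopwords =
        {"el", "la", "de", "que", "y", "en"};   // illustrative Spanish stopwords
    std::string lowered;
    for (unsigned char c : text)
        lowered += std::ispunct(c) ? ' ' : (char)std::tolower(c);
    std::vector<std::string> tokens;
    std::istringstream in(lowered);
    for (std::string w; in >> w; )
        if (!stopwords.count(w)) tokens.push_back(w);
    return tokens;
}
```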
Following this pre-processing, it was found in the state of the art that the Polytechnic University of Catalonia and Pompeu Fabra University have already processed the Wikicorpus: this group lemmatized and tagged the corpus. This already-lemmatized corpus is the one applied to the LDA system [9].
4.3.2 Execution
The LDA system has been executed on the normal nodes of the National Supercomputing Laboratory of the Southeast of Mexico [18], where each node has 12 cores with a frequency of 2.5 GHz and 126 Gigabytes of RAM.
Recall that the system has been programmed in the C++ language with the OpenMP API; in each parallel stage, 24 threads are created so that each thread is executed by a different core.
5 Results and Discussion
The LDA system was executed with 300 iterations and a variable number of topics, ranging from 100 to 4,000. The results can be seen in Table 6.
Topics | Entropy | Topics | Entropy |
100 | 9.11148 | 1200 | 8.63675 |
200 | 9.01459 | 1300 | 8.61248 |
300 | 8.94128 | 1400 | 8.59993 |
400 | 8.88203 | 1500 | 8.58606 |
500 | 8.83977 | 1600 | 8.57487 |
600 | 8.79597 | 1700 | 8.55774 |
700 | 8.76163 | 1800 | 8.54389 |
800 | 8.72798 | 1900 | 8.52989 |
900 | 8.70400 | 2000 | 8.51713 |
1000 | 8.67401 | 3000 | 8.43023 |
1100 | 8.66157 | 4000 | 8.36932 |
With these data, the graphic of Fig. 8 was obtained, where a steady decrease can be seen and, after 1,300 topics, the entropy decreases only gradually. On this basis, 1,300 was chosen as the number of topics.
Once the value of 1,300 topics was chosen, the LDA system was executed again with 1,000 iterations.
Next, some topics, with their words and the respective probabilities within each topic, generated by the LDA system for 1,300 topics and 1,000 iterations, are presented in Table 7.
Topic_334 | Topic_413 | Topic_418 | |||
roberto | 0.01567 | game | 0.06414 | play | 0.02811 |
year | 0.00948 | videogame | 0.00929 | club | 0.02808 |
no | 0.00741 | character | 0.00728 | football | 0.02726 |
gómez | 0.00625 | power | 0.00706 | team | 0.02686 |
rodríguez | 0.00487 | saga | 0.00669 | selection | 0.01437 |
all | 0.00478 | no | 0.00659 | first | 0.01362 |
chespirito | 0.00475 | cisneros | 0.00594 | goal | 0.01352 |
first | 0.00443 | developers | 0.00588 | division | 0.01312 |
program | 0.00429 | also | 0.00585 | match | 0.01212 |
city | 0.00422 | player | 0.00542 | tournament | 0.01073 |
other | 0.00420 | all | 0.00481 | season | 0.00950 |
iso | 0.00415 | console | 0.00478 | footballer | 0.00927 |
jesús | 0.00399 | other | 0.00472 | player | 0.00879 |
aramón | 0.00356 | version | 0.00472 | year | 0.00862 |
where | 0.00343 | mode | 0.00446 | champion | 0.00799 |
screenwriter | 0.00341 | gamecube | 0.00439 | national | 0.00785 |
character | 0.00332 | wii | 0.00413 | dispute | 0.00743 |
fernández | 0.00324 | protocol | 0.00382 | win | 0.00694 |
power | 0.00315 | software | 0.00380 | debute | 0.00662 |
mexican | 0.00309 | year | 0.00378 | championship | 0.00627 |
club | 0.04666 | orchestra | 0.03887 | cv | 0.03641 |
team | 0.04375 | piano | 0.03185 | motor | 0.02912 |
play | 0.03234 | music | 0.02398 | cylinder | 0.01561 |
season | 0.03169 | concert | 0.01781 | four | 0.01438 |
match | 0.02597 | oeuvre | 0.01712 | automobile | 0.01204 |
football | 0.02417 | violin | 0.01525 | line | 0.01057 |
link | 0.02372 | composer | 0.01511 | generation | 0.00983 |
cup | 0.01963 | op | 0.01448 | year | 0.00857 |
selection | 0.01558 | guitar | 0.012823 | external | 0.00748 |
division | 0.01502 | musical | 0.01169 | game | 0.00706 |
goal | 0.014972 | symphonic | 0.01065 | five | 0.00694 |
player | 0.01456 | composition | 0.00993 | first | 0.00692 |
dispute | 0.01235 | director | 0.00928 | bodywork | 0.00666 |
champion | 0.011898 | opus | 0.00818 | traction | 0.00653 |
first | 0.01004 | pianist | 0.00805 | version | 0.00642 |
football | 0.00900 | musician | 0.00740 | gasoline | 0.00641 |
spanish | 0.00881 | major | 0.00639 | power | 0.00598 |
athletic | 0.00850 | year | 0.00563 | diesel | 0.00581 |
final | 0.00824 | piece | 0.00535 | injection | 0.00540 |
trainer | 0.00761 | sonata | 0.00514 | maximum | 0.00512 |
patient | 0.03165 | candidate | 0.03174 | island | 0.04056 |
disease | 0.02630 | election | 0.02674 | tropical | 0.03674 |
symptom | 0.01986 | vote | 0.02366 | hurricane | 0.02882 |
disorder | 0.01595 | party | 0.01459 | storm | 0.02735 |
doctor | 0.01256 | political | 0.00764 | atlantic | 0.01935 |
power | 0.01105 | coalition | 0.00734 | wind | 0.01643 |
pain | 0.00952 | candidacy | 0.00671 | ice | 0.01511 |
no | 0.00927 | electoral | 0.00668 | coast | 0.01461 |
other | 0.00706 | obtain | 0.00666 | ocean | 0.01368 |
medicament | 0.00632 | seat | 0.00619 | soil | 0.01147 |
affect | 0.00556 | no | 0.00592 | km | 0.01125 |
mental | 0.00513 | method | 0.00527 | north | 0.01033 |
effect | 0.00443 | choose | 0.00520 | season | 0.00928 |
medicine | 0.00418 | jaime | 0.00467 | rain | 0.00862 |
studio | 0.00405 | deputy | 0.00438 | damage | 0.00772 |
duty | 0.00396 | all | 0.00424 | west | 0.00762 |
nervous | 0.00388 | year | 0.00422 | cause | 0.00740 |
thumb | 0.00383 | result | 0.00417 | sea | 0.00735 |
produce | 0.00381 | other | 0.00403 | south | 0.00730 |
also | 0.00381 | power | 0.00371 | cyclone | 0.00729 |
It must be highlighted that the words that best describe a topic are those with the highest probabilities, which are listed in descending order.
6 Conclusions and Future Work
As can be seen in the topics, good results have been obtained: although there are some words that do not belong to the topic, many topics are coherent and are easily linked to specific themes.
Obtaining these topics is important, for example, when one wants to know what a file talks about: it is only necessary to look at the topics to which it is linked, and in this way it is possible to grasp the idea of the content of the file without reading it completely.
On the other hand, not all 1,300 topics represent different themes; it is also possible to find similar topics, as well as topics that contain words from several themes, which makes it difficult to associate them with a specific theme. LDA is one of the most used algorithms in the field of Topic Modeling and is based, as described in this work, on the concept of a latent structure of topics within a large collection of documents. Gibbs sampling is the most used algorithm within the family of sampling methods. However, there are other types of algorithms that could be applied in future work, which do not approximate the posterior distribution with samples but are of a deterministic type, such as Variational Bayesian Inference (VB), Expectation Propagation (EP), and Variational EM, among others, that could be implemented to compare results.