1 Introduction
Probabilistic topic models such as Latent Dirichlet Allocation (LDA) [2] are widely used to uncover hidden topics within text corpora. In LDA, each document may be viewed as a mixture of latent topics, where each topic is a distribution over words. With statistical inference algorithms, LDA reveals latent topics from document-level word co-occurrence.
In recent years, a number of extended topic models have been proposed; in particular, Gaussian LDA (G-LDA) [3], which integrates LDA with word embeddings, has gained much attention. G-LDA uses a Gaussian distribution as the topic distribution over words.
Furthermore, Batmanghelich et al. [1] proposed the spherical Hierarchical Dirichlet Process (sHDP), which uses the von Mises-Fisher distribution as the topic distribution to model the density of words over the unit sphere. They used the Hierarchical Dirichlet Process (HDP) as their base topic model and applied Stochastic Variational Inference (SVI) [5] for efficient inference.
They showed that sHDP is able to exploit the semantic structures of word embeddings and flexibly discover the number of topics. Hu et al. [7] proposed the Latent Concept Topic Model (LCTM), which introduces latent concepts into G-LDA. LCTM models each topic as a distribution over latent concepts, where each concept is a localized Gaussian distribution in word embedding space.
They reported that LCTM is well suited for extracting topics from short texts with diverse vocabulary, such as tweets. Xun et al. [15] proposed a correlated topic model using word embeddings. Their model exploits the additional word-level correlation information in word embeddings and directly models topic correlation in the continuous word embedding space.
Nguyen et al. [12] proposed Latent Feature LDA (LF-LDA), which integrates word embeddings into LDA by replacing the topic-word Dirichlet multinomial component with a mixture of a Dirichlet multinomial component and a word embedding component. They compared LF-LDA to vanilla LDA on topic coherence, document clustering, and document classification, and showed that LF-LDA improves both topic-to-word mapping and document-topic assignments, especially on datasets with few or short documents. Kumar et al. [14] presented an unsupervised topic model for short texts that performs soft clustering over the word embedding space.
They modeled the low-dimensional semantic vector space represented by word embeddings using Gaussian mixture models (GMMs) whose components capture the notion of latent topics. Their framework outperforms vanilla LDA on short texts in both subjective and objective evaluations, and they showed its usefulness for learning topics and classifying short texts on Twitter data in several languages.
Zhao et al. [17] proposed a focused topic model in which the way a topic focuses on words is informed by word embeddings. Their models discover more informed and focused topics with more representative words, leading to better modelling accuracy and topic quality. Moody [10] proposed lda2vec, a model which learns dense word vectors jointly with Dirichlet-distributed latent document-level mixtures of topic vectors.
His method is simple to incorporate into existing automatic differentiation frameworks and produces unsupervised document representations geared for use by scientists, while simultaneously learning word vectors and the linear relationships between them. Yao et al. [16] proposed Knowledge Graph Embedding LDA (KGE-LDA), which combines a topic model with knowledge graph embeddings.
KGE-LDA models document-level word co-occurrence together with knowledge encoded by entity vectors learned from external knowledge graphs, and extracts more coherent topics and better topic representations. In this paper, we use G-LDA as our base topic model. Compared with vanilla LDA, G-LDA produces higher Pointwise Mutual Information (PMI) scores for each topic because it incorporates the semantic information of words as prior knowledge.
In addition, because G-LDA operates on a continuous vector space, it can handle out-of-vocabulary (OOV) words in held-out documents, whereas conventional LDA cannot. On the other hand, estimating the posterior probability distribution over latent topics in word embedding space is costly because it involves high-dimensional representations of words. It is therefore impractical to use methods that take a long time to estimate the posterior distribution, such as Gibbs sampling. To reduce this cost, G-LDA exploits the Cholesky decomposition of the covariance matrices and applies alias sampling [8].
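To illustrate why the Cholesky factorization helps, the sketch below evaluates a multivariate Gaussian log-density without forming an explicit inverse or determinant of the covariance matrix. This is a minimal illustration of the idea rather than the authors' implementation; the function name and the use of SciPy are our own choices.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_logpdf_chol(x, mu, cov):
    """Log N(x | mu, cov) evaluated via a Cholesky factorization of cov.

    Reusing the factor avoids the explicit inverse and determinant that make
    naive per-word topic assignments expensive in high-dimensional space.
    """
    d = x.shape[0]
    factor = cho_factor(cov, lower=True)                 # cov = L L^T
    diff = x - mu
    alpha = cho_solve(factor, diff)                      # solves cov * alpha = diff
    logdet = 2.0 * np.sum(np.log(np.diag(factor[0])))    # log|cov|
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + diff @ alpha)
```

G-LDA additionally maintains the factor through rank-one updates as words are reassigned, but the basic saving already comes from reusing the factorization as above.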
A similar difficulty arises with massive document collections: estimating latent topics with sampling methods does not scale. To deal with this problem, Hoffman et al. [4] developed online Variational Bayes (VB) for LDA, which is readily applied to massive and streaming document collections.
Their proposed method, online variational Bayes, has become widely known as "Stochastic Variational Inference" [13, 5]. Following their approach, in this paper we propose a method to efficiently estimate latent topics in the high-dimensional word embedding space by adopting SVI.
2 LDA and Gaussian LDA
2.1 Latent Dirichlet Allocation (LDA)
LDA [2] is a probabilistic generative model of document collections. In LDA, each topic $k$ has a multinomial distribution $\beta_k$ over the vocabulary, drawn from a Dirichlet prior $\mathrm{Dir}(\eta)$. Each document $d$ has a distribution over topics $\theta_d$ drawn from $\mathrm{Dir}(\alpha)$; for each word position $n$ in document $d$, a topic $z_{dn}$ is drawn from $\mathrm{Mult}(\theta_d)$ and the word $w_{dn}$ is drawn from $\mathrm{Mult}(\beta_{z_{dn}})$.
The graphical model for LDA is shown in the left side of Figure 1.
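With this notation, the joint distribution of LDA factorizes in the standard way (stated here for reference):

$$
p(\beta, \theta, z, w \mid \alpha, \eta)
 = \prod_{k=1}^{K} p(\beta_k \mid \eta)\;
   \prod_{d=1}^{D} \Bigl( p(\theta_d \mid \alpha)
   \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \beta_{z_{dn}}) \Bigr).
$$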
2.2 Gaussian LDA (G-LDA)
Hu et al. [6] proposed a new method for modeling latent topics in the task of audio retrieval, in which each topic is directly characterized by a Gaussian distribution over audio features.
Das et al. [3] presented an approach for accounting for semantic regularities in language that integrates the model proposed by Hu et al. [6] with word embeddings. They use word2vec [9] to generate skip-gram word embeddings from an unlabeled corpus.
In this model, they characterize each topic $k$ by a multivariate Gaussian distribution $\mathcal{N}(\mu_k, \Sigma_k)$ in the word embedding space.
Because the observations are no longer discrete values but continuous vectors, word vectors are sampled from continuous topic distributions. They reported that G-LDA produced a higher PMI score than conventional LDA in their experiments, which means that topic coherence was improved.
Because G-LDA uses continuous distributions as the topic distributions over words, it can assign latent topics to OOV words without retraining the model, whereas the original LDA cannot deal with such words. The generative process is as follows: for each topic $k = 1, \dots, K$, draw a topic covariance $\Sigma_k \sim \mathcal{W}^{-1}(\Psi, \nu)$ and a topic mean $\mu_k \sim \mathcal{N}(\mu, \frac{1}{\kappa}\Sigma_k)$; for each document $d$, draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$, and for each word index $n$ in $d$, draw a topic $z_{dn} \sim \mathrm{Mult}(\theta_d)$ and a word vector $v_{dn} \sim \mathcal{N}(\mu_{z_{dn}}, \Sigma_{z_{dn}})$.
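As a concrete illustration, the following sketch samples a toy corpus from this generative process. It is a self-contained example with made-up hyperparameter values (the numbers of topics and documents, and the normal-inverse-Wishart parameters), not the authors' code.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
K, D, dim = 5, 3, 50                  # topics, documents, embedding dimension
alpha = np.full(K, 0.1)               # Dirichlet prior on topic proportions
nu, kappa = dim + 2, 0.1              # illustrative NIW hyperparameters
Psi, mu0 = np.eye(dim), np.zeros(dim)

# Draw each topic's covariance and mean from the normal-inverse-Wishart prior.
Sigma = [invwishart(df=nu, scale=Psi).rvs() for _ in range(K)]
mu = [rng.multivariate_normal(mu0, S / kappa) for S in Sigma]

docs = []
for d in range(D):
    theta = rng.dirichlet(alpha)                  # per-document proportions
    n_words = rng.poisson(20) + 1
    z = rng.choice(K, size=n_words, p=theta)      # topic assignments
    # Word vectors are drawn from the assigned topic's Gaussian.
    vecs = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    docs.append((z, vecs))
```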
3 Posterior Inference with SVI
Sampling methods such as the Gibbs sampler are widely used to perform approximate inference in topic modeling. Although the Gibbs sampler is easy to implement, it takes a long time to estimate a posterior distribution.
Hence, we employ an efficient inference algorithm based on VB, i.e., Stochastic Variational Inference (SVI) [5], to estimate the posterior probability distributions of the latent variables. SVI is an efficient algorithm for large datasets because it can sequentially process batches of documents.
With VB inference, the true posterior probability distribution is approximated by a simpler distribution $q(z, \theta, \beta)$, which is indexed by a set of free variational parameters. These parameters are optimized to maximize the Evidence Lower BOund (ELBO), a lower bound on the logarithm of the marginal probability of the observations:

$$\log p(w) \ge \mathcal{L}(q) = \mathbb{E}_q[\log p(w, z, \theta, \beta)] - \mathbb{E}_q[\log q(z, \theta, \beta)].$$

Based on the assumption that variables are independent in the mean-field family, the approximate distribution factorizes as

$$q(z, \theta, \beta) = \prod_k q(\beta_k \mid \lambda_k)\, \prod_d q(\theta_d \mid \gamma_d) \prod_n q(z_{dn} \mid \phi_{dn}).$$

Let $\lambda$ denote the global variational parameters (for the topic distributions) and let $\gamma_d, \phi_d$ denote the local variational parameters for document $d$.
SVI need not analyze the whole data set before improving the global variational parameters and can incorporate new data as it arrives, whereas batch VB requires a full pass through the entire corpus at each iteration.
Thus, we apply stochastic natural gradient descent to update the global parameters $\lambda$. At iteration $t$, we sample a minibatch of documents, compute their local variational parameters, and form the intermediate global parameters $\hat{\lambda}$ as if the whole corpus consisted of that minibatch replicated; the global parameters are then updated as

$$\lambda^{(t)} = (1 - \rho_t)\,\lambda^{(t-1)} + \rho_t \hat{\lambda}.$$

Here, the weight $\rho_t$ is a step size; a common choice is

$$\rho_t = (t + \tau)^{-\kappa},$$

where $\tau \ge 0$ is a delay parameter that down-weights early iterations and $\kappa \in (0.5, 1]$ is the forgetting rate, which controls how quickly old information is forgotten.
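The resulting update loop can be summarized as follows. This is a schematic sketch: `local_step` and `intermediate_global` are hypothetical callables standing in for the model-specific computations, and the default hyperparameters are illustrative.

```python
import numpy as np

def svi(corpus, lam0, local_step, intermediate_global,
        n_iters=1000, batch_size=64, tau=1.0, kappa=0.7):
    """Schematic SVI loop: a local step on a minibatch of documents, then a
    stochastic natural-gradient step on the global variational parameters.

    local_step(doc, lam) returns that document's sufficient statistics;
    intermediate_global(stats, scale) maps rescaled minibatch statistics to
    the intermediate global parameters lambda-hat.
    """
    D = len(corpus)
    lam = np.array(lam0, dtype=float)
    rng = np.random.default_rng(0)
    for t in range(1, n_iters + 1):
        idx = rng.choice(D, size=batch_size, replace=False)
        # Local step: fit per-document parameters with the global ones fixed.
        stats = sum(local_step(corpus[i], lam) for i in idx)
        # Rescale as if the corpus were this minibatch replicated D/|batch| times.
        lam_hat = np.asarray(intermediate_global(stats, scale=D / batch_size))
        rho = (t + tau) ** (-kappa)   # step size rho_t = (t + tau)^(-kappa)
        lam = (1.0 - rho) * lam + rho * lam_hat
    return lam
```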
4 Experiments
We construct a model that integrates SVI into the word vector topic model following Algorithm 1, and conduct topic extraction experiments. In this paper, we evaluate whether our model is able to find coherent and meaningful topics compared with conventional LDA.
4.1 Experimental Setting
We perform experiments on two different text corpora: 18,846 documents from 20Newsgroups and 1,740 documents from the NIPS corpus. We utilize 50-dimensional word embeddings trained on text from Wikipedia using word2vec and run the model with varying numbers of topics.
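Comparable embeddings can be trained with, for example, gensim's word2vec implementation; the snippet below is a hedged sketch with a toy corpus standing in for the tokenized Wikipedia sentences, since the training tool and hyperparameters beyond the dimensionality are not specified here.

```python
from gensim.models import Word2Vec

# Toy stand-in for an iterable of tokenized Wikipedia sentences.
sentences = [["topic", "models", "uncover", "latent", "themes"],
             ["word", "embeddings", "capture", "semantic", "regularities"]]

# sg=1 selects the skip-gram architecture; vector_size=50 matches the setting above.
model = Word2Vec(sentences=sentences, vector_size=50, sg=1,
                 window=5, min_count=1, workers=4)
vec = model.wv["topic"]   # a 50-dimensional embedding vector
```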
4.2 Evaluation
We use the PMI score to evaluate the quality of the topics learnt by our model, as it was used to evaluate G-LDA [3]. Newman et al. [11] showed that PMI has relatively good agreement with human scoring.
We use a reference corpus of documents from Wikipedia and compute co-occurrence statistics over pairs of words $(w_i, w_j)$:

$$\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\,p(w_j)},$$

where the probabilities are estimated from document-level co-occurrence counts in the reference corpus.
We use the average of the scores over the top 10 words of each topic. A higher PMI score implies a more coherent topic, as it means the topic words usually co-occur in the same document.
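The coherence computation itself is straightforward; the sketch below is our own illustrative implementation over a generic tokenized reference corpus, with a small smoothing constant added to avoid taking the log of zero.

```python
import itertools
import numpy as np

def topic_pmi(top_words, ref_docs, eps=1e-12):
    """Average PMI over all pairs of a topic's top words, estimated from
    document-level co-occurrence counts in a reference corpus."""
    docs = [set(doc) for doc in ref_docs]
    n_docs = len(docs)

    def p(*words):
        # Fraction of reference documents containing all given words.
        return sum(all(w in doc for w in words) for doc in docs) / n_docs

    scores = [np.log((p(wi, wj) + eps) / ((p(wi) + eps) * (p(wj) + eps)))
              for wi, wj in itertools.combinations(top_words, 2)]
    return float(np.mean(scores))

# Example with a tiny toy reference corpus:
# topic_pmi(["physics", "astronomy"],
#           [["physics", "astronomy", "star"], ["physics", "lab"]])
```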
4.3 Result
The experimental results in terms of PMI on the 20Newsgroups and NIPS datasets are shown in Figure 2, where we plot the average of the PMI scores for the top 10 words in each topic.
It is clearly seen that our model outperforms conventional LDA in terms of PMI score. Some examples of top topic words are listed in Table 1 and Table 2. The parameter settings are the same as above, and we present the top 10 topics in descending order of PMI.
Table 1 (20Newsgroups). Gaussian LDA topics:
cie | geophysics | manning | authenticicity | beasts | ton | acts | disasters | provoke | normals |
informatik | astrophysics | neely | veracity | creatures | tons | exercising | disaster | provocation | histograms |
nos | physics | carney | credence | demons | gallon | coercion | hazards | futile | gaussian |
gn | meteorology | brady | assertions | monsters | mv | act | catastrophic | suppress | linear |
nr | astronomy | wilkins | inaccuracies | eleves | cargo | enforcing | devastation | resorting | symmetric |
sta | geophysical | brett | particulars | spirits | cruiser | collective | dangers | threatening | histogram |
vy | geology | seaver | texttual | unicorns | pound | proscribed | pollution | aggression | vectors |
gl | astrophysical | reggie | merits | denizens | pounds | regulating | destruction | urge | inverse |
cs | chemistry | ryan | substantiate | magical | corvettes | initiating | impacts | inflict | graphs |
ger | microbiology | wade | refute | gods | guns | involving | destructive | expose | variables |
6.6429 | 6.2844 | 5.3070 | 5.0646 | 4.3270 | 3.6760 | 3.1671 | 2.8486 | 2.7723 | 2.7408 |
Multinomial LDA topics:
drive | ax | subject | data | south | la | supreme | writes | key | goverment |
disease | max | lines | doctors | book | goal | bell | article | code | law |
ard | a86 | server | teams | lds | game | at&t | organization | package | gun |
scsi | 0d | organization | block | published | cal | zoology | senate | window | clinton |
drives | 1t | spacecraft | system | adl | period | subject | subject | data | congress |
disk | giz | spencer | spave | armenian | bd | covenant | dod | information | clipper |
subject | 3t | program | output | books | roy | suggesting | lines | anonymous | key |
daughter | cx | space | pool | documents | 55.0 | lines | income | ftp | clayton |
unit | bh | software | resources | isubject | its | off | deficit | program | federal |
organization | kt | graphic | bits | information | season | origins | year | source | constitution |
2.3514 | 2.2500 | 1.3700 | 1.1216 | 1.0528 | 0.8338 | 0.7092 | 0.4531 | 0.4501 | 0.4355 |
Table 2 (NIPS). Gaussian LDA topics:
topological | ginzburg | mitsubishi | negation | m.s | generalize | vx | gcs | behaviors | describes |
projective | goldmann | vw | predicate | ms | analytically | xf | dcs | behaviours | describing |
subspaces | jelinek | gm | disjunction | m/s | generalizations | vf | tss | behavior | interprets |
symplectic | kolmogorov | motors | predicates | bd | intuitively | vz | rbp | behaviour | discusses |
homotopy | markov | flat | propositional | tat | generalizing | r4 | sdh | biases | illustrates |
topology | pinks | dyna | priori | dd | computable | xr | modulators | arousal | relates |
euclidian | christof | integra | reflexive | stm | theretic | rx | signalling | behavioural | identifies |
integrable | koenig | combi | duality | bs | solvable | tlx | mds | behaviorally | characterizes |
subspace | engel | gt | categorical | lond | generalization | t5 | analysers | attentional | demonstrates |
affine | lippmann | suzuki | imperfect | bm | observable | spec | bss | predisposition | observes |
11.3463 | 9.4716 | 6.8211 | 6.7072 | 4.8832 | 4.4388 | 4.2489 | 3.6088 | 3.4125 | 3.3666 |
Multinomial LDA topics:
model | network | learning | network | neural | network | network | learning | function | network |
figure | model | network | algorithm | networks | networks | neural | neural | network | model |
neural | input | figure | neural | model | neural | funtion | figure | model | input |
learning | learning | data | learning | input | input | input | network | neural | learning |
input | neural | units | training | data | learning | learning | data | training | data |
network | networks | model | input | learning | data | model | input | learning | system |
output | output | input | output | function | training | networks | training | set | training |
number | function | set | networks | figure | output | figure | function | algorithm | neural |
function | data | neural | set | units | number | output | model | data | function |
data | figure | output | function | output | set | training | output | figure | output |
0.4945 | 0.4302 | 0.3506 | 0.3232 | 0.2280 | 0.1759 | -0.0473 | -0.1412 | -0.1784 | -0.2415 |
In the last row of each table, we present the PMI score of each of the 10 topics, for both our model and traditional LDA. We see that the topics of our model appear more coherent than those of the baseline model.
In addition, our model is able to capture several intuitive topics in the corpora, such as natural science, mythology, and cargo in Table 1, and mathematics and cars in Table 2. In particular, our model discovered a collection of person names, which was not captured by traditional LDA.
5 Conclusions and Future Work
Traditional topic models do not account for semantic regularities in language, such as the contextual relations of words expressed in word embedding space. G-LDA therefore integrates the conventional topic model with word embeddings.
However, dealing with high-dimensional data such as word vectors in embedding space requires costly computation, so G-LDA employs faster sampling using the Cholesky decomposition of the covariance matrices and alias sampling.
Stochastic Variational Inference, on the other hand, is a much faster inference method than Markov chain Monte Carlo (MCMC) samplers such as Gibbs sampling and can deal with enormous datasets. Hence, we turned our attention to SVI, expecting it to also be effective for handling high-dimensional data.
In this paper, we have proposed to apply an efficient inference algorithm based on SVI to a topic model with word embeddings. As a qualitative analysis, we have verified the coherence of the extracted latent topics through experiments and confirmed that our model is able to extract meaningful topics, as G-LDA does.
In future work, we will examine perplexity convergence to evaluate the inference speed and the soundness of our model.