1 Introduction
On the modern Web, with the rise of social interactions between users and full-scale mining of all user-related data, user profiling has become a crucially important problem. In this context, user profiling means converting recorded user behavior into a set of labels or probability distributions that capture the most important aspects of the user and can further be used for making recommendations, providing targeted advertising, and so on.
One could expect that user profiling can be significantly augmented with natural language processing. Much of what goes on in social networks has the form of text, and one can use texts generated by a user, such as wall posts or statuses, to mine his or her interests and demographic information. The recent development of deep learning techniques for natural language processing has led to state-of-the-art models that operate in a basically unsupervised fashion and do not require much linguistic insight; one such direction of study deals with word embeddings, vector representations of words that capture semantic relations between words and can serve as an intermediate step for other models.
The contributions of this work are twofold. First, we concentrate on a novel application of word embeddings to user profiling, with the specific example of improving user age prediction with full-text items; user profiles can incorporate demographic information (age, gender, location, etc.) known in advance or attempt to infer it from user behavior. Second, we turn to the holy grail of user profiling: concisely representing a user’s topical interests, preferably as narrowly as possible, with obvious applications to targeted advertising and new recommendations. The methods of summarizing information about users lie at the core of many personalized search and advertisement engines and various recommender systems. Being able to make predictions based on appropriately summarized prior user-system interaction allows us, among other things, to alleviate the so-called cold start problem, one of the main problems of recommender systems: how do you recommend a new item that has not been rated before or has very few ratings? Given user profiles and a way to match a new item to these profiles, one can make recommendations when collaborative filtering is inapplicable.
We believe that huge corpora of user-generated texts stored on forums and in social networks can be used to produce interpretable, semantic user profiles and improve interest-based recommendations for full-text items. We develop new age prediction methods and algorithms for users interacting with full-text items, based on distributed word representations and a novel approach to constructing user interest profiles. We show improvements in demographic user profiling for these algorithms, improved results in item recommendation over collaborative recommenders, and, most interestingly, a way to extract an interpretable profile of user interests.
The paper is organized as follows. In Section 2, we survey related work on user profiling with textual information, especially as inferred from social media; Section 3 briefly reviews word embeddings.
Section 4 is devoted to the background of our experimental study, detailing the dataset collected from the Russian social network Odnoklassniki and the word2vec models we have trained for this study.
In Section 5, we present our proposed algorithms for age prediction together with comprehensive experimental results that demonstrate improved age prediction.
Section 6 defines models for semantic user profiling that are also based on word embeddings; we show sample user interest profiles and present an experimental evaluation in terms of recommending new items to the users based on their textual preferences.
We conclude with Section 7. This work is a significantly extended joint journal version of two conference papers [3,58].
2 Related Work
2.1 User Profiling with NLP
User profiling by user behavior has had a long history in many different contexts. Previous attempts at big data user profiling without deep neural networks have leaned upon external knowledge in the form of ontologies [50] and presented a general framework for using NLP in profiling [14]. There is a large classical field of authorship analysis, attribution and author verification studies [38,91]; we refer to surveys [16, 78,79] for details and references.
Some works use natural language processing to perform or augment user profiling. In particular, there have been several works, closer to social sciences and based on available anonymized datasets, that address tasks similar to user profiling, usually mining demographic information from texts generated by a user; there have also been attempts to mine text to establish new information about a user or relations between users, a field known as social media personal analytics. Below, we highlight some of this research.
In [43], anonymized text messaging datasets are used to investigate the demographics of texting, while in [26], author profiling for English emails uncovers basic demographic traits (gender, age, geographic origin, level of education, and native language) and five psychometric traits based on email texts.
Several Twitter-based studies have focused on mining demographic features based on tweets [24, 32]; the work [42], for instance, does it in a weakly supervised fashion, using Facebook or Google+ profiles as distant supervision. The work [59] detects personality traits from weblog texts, while the work [5] explicitly studies lexical predictors of personality type, [10] determines demographic information by social media texts, and [67] mines user relations from online discussions; an interesting extension is [28] which attempts personality profiling of fictional characters based on the texts about them.
In [70], author profiles in social media are mined to get hidden user profile information, while in [60] metadata is used to mine author profiles; the work [85] attempts automatic collection and summarization of personal profiles from various social networks and other sources, while [20] proposes linguistic features that help determine the natural language of a person writing in English (on a dataset of the First NLI Shared Task) and [66] determines a user’s occupation by his or her tweets.
In [21, 82], the user’s political preferences are determined by his or her tweets, and [40] drives it further to get the user’s actual voting intentions. This kind of profiling even extends to medical issues: the work [64] attempts to screen Twitter users for depression based on their tweets. Numerous works on the topic have been published based on the results of the shared Author Profiling Tasks at the PAN digital text forensics events [8, 29, 71–74, 83]; we specifically note the work [8] that uses word2vec clustering to get features for author profiling. Finally, there are quite a few works on determining the geographical location of a user from his or her textual activity in social networks [9,33,44,68,69,87].
As for neural NLP models, one recent work that actually uses modern neural network-based NLP to automatically construct user profiles is [80]. There, convolutional neural networks are used to construct a joint representation of users, products, and their reviews, in particular user profiles. This results in semantic user profiles that are then used to improve sentiment classification but can probably be used for other purposes as well. A recent work [58] has used word embeddings to construct user profiles from the texts they liked in a social network; the profiles were constructed as logistic regression weights of word clusters (clustered in the semantic space of word embeddings), with a special mechanism to reduce the weights of clusters with common words and bring topical clusters to the top. In [30], a deep semantic similarity model (DSSM) is trained to model the “interestingness” of documents. The purpose of the model is to recommend target documents that might interest a user based on a source document which she is reading at the moment. This is mostly an information retrieval model, trained on click transitions between source and target documents; this work is similar to [81] and also uses convolutional architectures. The hierarchical neural language model from [25] with a document level and a token level can also be extended to learning user-specific vectors to represent individual preferences, which can be used to give personalized recommendations.
User profiling is a special case of user modeling. For general reviews of the field and key papers, we refer to [12,18,27,36,86]. Specific techniques that have been applied to represent user interests in content-based and hybrid recommender systems include, for example, relevance feedback and Rocchio’s algorithm [47, 63], where a user profile is represented as a set of words and their weights, penalized if a retrieved textual item is uninteresting, as in [62]. Ontologies and encyclopaedic knowledge sources have been used, e.g., in the Quickstep and Foxtrot systems [51] that recommend papers based on browsing history: they automatically classify the topics of a paper and make use of relations between the topics in the ontology to obtain their similarity; the rank is computed based on the correlation between user profile topics and estimated paper topics. Nearest neighbors are often used in such systems; e.g., DailyLearner [11] stores tf-idf representations of recently liked stories in a short-term memory component, using it to recommend new stories [47, 63]. Decision rules have been used, e.g., in the RIPPER system [7, 22], where rules are conjunctions of several tests against item features. Interpretable predicted user characteristics are also often utilized in practice; cf., e.g., Yandex.Crypta.
3 Word Embeddings
Recent advances in distributed word representations have made them a method of choice for modern natural language processing [31]. Distributed word representations are models that map each word in the dictionary to a Euclidean space, attempting to capture semantic relationships between words as geometric relationships between their vectors. In a classical word embedding model, one first constructs a vocabulary with one-hot representations of individual words, where each word corresponds to its own dimension, and then trains representations for individual words starting from there, essentially as a dimensionality reduction problem. For this purpose, researchers have usually employed a model with one hidden layer that attempts to predict the next word based on a window of several preceding words; the representations learned at the hidden layer are then taken to be the word’s features.
The word2vec embeddings come in two flavors, both introduced in [52]: Continuous Bag-of-Words (CBOW) and skip-gram. During training, a CBOW model tries to reconstruct a word from its context, while the skip-gram model operates in the opposite direction, predicting the context from the word. The idea of word embeddings has been applied back to language modeling in [53, 54, 56], and starting from the works of Mikolov et al. [52, 55], word representations have been used for numerous NLP problems, including text classification, extraction of sentiment lexicons, part-of-speech tagging, syntactic parsing, etc.
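As an illustration (not the training setup used later in this paper), here is a minimal sketch of both flavors with the gensim library; the toy corpus and parameter values are assumptions for the example.

```python
# A minimal sketch of the two word2vec flavors using gensim; the toy corpus
# and all parameters here are illustrative assumptions only.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "chased", "the", "cat"]]

# sg=0: CBOW, reconstruct a word from its context;
# sg=1: skip-gram, predict the context from a word.
cbow = Word2Vec(sentences, sg=0, vector_size=100, window=5,
                negative=5, min_count=1)
skipgram = Word2Vec(sentences, sg=1, vector_size=100, window=5,
                    negative=5, min_count=1)

print(cbow.wv["cat"].shape)                      # (100,)
print(skipgram.wv.most_similar("cat", topn=3))   # nearest words in the space
```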
Another important model for word embeddings is GloVe (Global Vectors for word representation) [65].
Efficient and/or more stable algorithms for training word embeddings have been developed in [48,49,52,57].
4 Background
4.1 Datasets
For this project, we have obtained a large dataset from the Odnoklassniki social network. The dataset has been created as follows:
the dataset began with 486 seed users;
for these users, their sets of friends have been extracted;
then, the friends of these friends were extracted; as a result, the dataset contains a depth-2 neighborhood in the social graph around the original seed users.
As a result, the dataset contains information on 868,126 users of the Odnoklassniki social network. In particular, it contains the following data:
demographic information on 868,126 users of the network: gender, age, and region (region info may be imprecise: since there is no explicit region field in the user’s profile, the region is determined by the IP addresses from which the user has logged in most often);
the social graph that defines the “friendship” relation and contains (and indicates) several different types of links: “friend”, “love”, “spouse”, “parent”, and so on; all users with known demographic data are also present in the social graph;
history of logins for individual users;
data on the “likes” (“class!” marks) a user has given to other users’ statuses and posts in various groups;
texts of user posts and group statuses that have been liked by these selected users.
The mean age of all users was 31.39 years; the age distribution is shown in Fig. 1. Note that there are quite a few users with implausible ages (ages 2 and 3, ages higher than 100 years); since users specify the age themselves, this probably represents missing, incorrect, or purposefully distorted data. Note that this is an important point for the relevance of our research: when a user has not specified his or her age, or has specified an obviously incorrect one, we still need to predict it in order to give age-related recommendations and enroll the user into age cohorts. For the experiments, however, we have removed from the dataset all ages below 10 and above 80 since they are likely to correspond to faulty or missing information.
Fig.2 shows the distribution of the number of friends in the Odnoklassniki dataset; interestingly, while the usual Pareto distribution (straight line on a log-log plot) picks up after about 100 friends, it actually increases before that point. This is probably an artifact of the data collection: naturally, the social circle (neighborhood of depth 2) of a predefined set of seed users will contain few isolated or nearly isolated users.
We began evaluation with the entire dataset as outlined above, which we call the “extended” dataset below. However, in order to perform more experiments, be more flexible, and not get bogged down in the technicalities of fitting huge datasets into available hardware, we have also prepared a smaller “basic” dataset on which we performed some of the experiments.
The basic dataset preserves most properties of the extended dataset; the only difference is that we have filtered the users to have at least 5 and at most 300 statuses. This has let us cut off a relatively small number of highly prolific writers (or, to be more precise, prolific reposters), significantly reducing the total number of statuses, and cut off the long tail of users with very few statuses, while still preserving important properties of the data.
The basic statistics for the two datasets are shown in Table 1, and Fig. 3 indicates that all basic distributions such as age and number of friends are very similar for the two datasets, except, naturally, the distribution of the number of statuses. Both datasets were split into training and test sets randomly in the 80:20 proportion.
4.2 Word2vec Models
As a dataset for word embeddings, we have used a large Russian-language corpus (the largest we know of) with about 14G tokens in 2.5M documents [4,61]. This corpus includes the Russian Wikipedia (1.15M documents, 238M tokens), automated Web crawl data (890K documents, 568M tokens), as its main part, a huge lib.rus.ec library corpus (234K documents, 12.9G tokens), and, finally, user statuses and group posts from the Odnoklassniki social network, as described above (excluding test data used later). All of this has let us obtain what we believe to be an unprecedented quality of the resulting representations. We refer to [4, 61] for more details on the training data.
We have used continuous bag-of-words (CBOW) and skip-gram word2vec models trained on a single NVidia Titan X GPU with the fastest word2vec implementation currently available, ported to CUDA (https://github.com/ChenglongChen/word2vec_cbow). Our previous experiments have suggested that vector sizes in the low hundreds and a window size of 11 words are the best parameters on this dataset. In total, we have used eight different word2vec models in the experiments so far; their parameters are shown in Table 2. The models differ in the type (CBOW or skip-gram), dimension of word vectors d, window size w (later omitted since w = 11 in all models), number of negative samples n in training, and vocabulary threshold v that controls the size of the vocabulary (a lower threshold means that more words get vectors, but words with few occurrences will not have enough training data and might end up with a random-like, meaningless vector). Note also that every model can come in a “raw” form, as trained, and a normalized form where all vectors are normalized to unit Euclidean length.
5 Demographic User Profiling with Word Embeddings
5.1 User Age Prediction Algorithms
In this section, we propose several relatively simple algorithms that operate on word embeddings of the words in social network statuses of the users, aiming to predict a user’s age from his or her writing.
First, folklore among social network researchers says that to predict a user’s age it usually suffices to take the mean age of his or her friends: it will predict the age with outstanding accuracy. We have tested this hypothesis on the Odnoklassniki dataset by training the following models (a minimal implementation sketch follows the list):
(1) MEANAGE: predict age as the mean of friends’ ages, falling back to the global mean if no friends’ ages are known;
(2) LINEARREGR: linear regression with a single feature (mean friends’ age);
(3) ELASTICNET: an elastic net regressor with a single feature (mean friends’ age);
(4) GRADBOOST: gradient boosting with a single feature (mean friends’ age).
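As an illustration of these baselines, here is a hedged scikit-learn sketch on synthetic data; the arrays `age` and `maf` (mean age of friends, zero when unknown) are stand-ins for the real dataset.

```python
# Hedged sketch of the four baseline age regressors on synthetic stand-in
# data; in the paper the single feature is the mean age of a user's friends.
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
age = rng.integers(10, 80, size=1000).astype(float)
# mean friends' age: noisy copy of the age, zero for ~10% of users
maf = np.where(rng.random(1000) < 0.1, 0.0, age + rng.normal(0, 9, 1000))
X = maf.reshape(-1, 1)

# (1) MEANAGE: the feature itself, falling back to the global mean age
meanage_pred = np.where(maf > 0, maf, age.mean())
print("MEANAGE", round(np.abs(meanage_pred - age).mean(), 2))

# (2)-(4): regressors trained on the single feature
for name, model in [("LINEARREGR", LinearRegression()),
                    ("ELASTICNET", ElasticNet()),
                    ("GRADBOOST", GradientBoostingRegressor())]:
    model.fit(X, age)
    print(name, round(np.abs(model.predict(X) - age).mean(), 2))
```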
Results of these simple models are shown in Table 3 in two variations: “basic”, where we substitute zeros for missing features (when no friends’ ages are known), and “nonzero”, where we train and test only on the subset of data with nonzero features (at least one friend with known age). It appears that LINEARREGR performs worse than MEANAGE in the first variation because linear regression cannot implement the condition “if the feature is zero (the default value in the absence of neighbors), do something completely different”, while GRADBOOST is noticeably better because it is powerful enough to handle such case-by-case conditions.
However, we should note that the errors here are quite significant: in terms of MAE, we are more than nine years off on average even if we restrict ourselves to cases with friends with known ages. Hence, we expect that subsequent work is not meaningless and can bring substantial improvements.
Note that while using the sum and/or mean of word embeddings to represent a sentence or paragraph is indeed the simplest possible representation of a larger chunk of text, due to the geometric properties of the word2vec and GloVe models this idea is not as naive as it sounds. This approach has been used as a baseline in [41] but was proposed as a reasonable method for short phrases in [55] and has been shown to be effective for document summarization in [37].
Thus, we propose three basic algorithms:
(1) MEANVEC: train on mean vectors of all statuses for a user;
(2) LARGESTCLUSTER: train on the centroid of the largest cluster of statuses;
(3) ALLMEANV: train on every status independently, with the mean vector of a specific status and the mean age of friends as features and the user’s demographic attribute (here, age) as the target; at the testing stage, we compute predictions for every status and average them.
The MEANVEC algorithm simply computes the mean vector of all statuses and adds it as features to the classification/regression model. Formally speaking, we introduce the following notation:
— W is the vocabulary, with words w ∈ W ;
— U is the set of users, a user will usually be denoted as u ∈ U;
— Su is the set of texts “belonging to” user u (either written by u or liked by him/her), with a single text usually denoted as s ∈ Su; the s stands for either “string” or, more specifically, “status”;
— v_w^(m) ∈ ℝ^d is the vector (word embedding) of word w in model m (we will omit the superscript when it is not important or clear from context);
— MAF_u is the mean age of the friends of a user u ∈ U; in the algorithms, this is the only feature we use from the social graph.
In this notation, the MEANVEC algorithm operates as follows: for a machine learning (regression for age) algorithm ML,
(1) for every user u ∈ U, compute the status vectors v_s = (1/|s|) ∑_{w∈s} v_w and their mean v_u = (1/|S_u|) ∑_{s∈S_u} v_s;
(2) train ML with features (v_u, MAF_u) for every u ∈ U.
The LARGESTCLUSTER algorithm operates as follows: for a machine learning algorithm ML,
(1) for every user u ∈ U, cluster the status vectors v_s, s ∈ S_u, and take the centroid c_u of the largest cluster;
(2) train ML with features (c_u, MAF_u) for every u ∈ U.
Prior experiments (see, e.g., Section 6.3) showed that clustering word2vec representations may yield semantically related groups of words and n-grams, and it appears natural to try a similar approach for status representations. Hence, the largest cluster could be expected to be the most descriptive.
The ALLMEANV algorithm operates as follows: for a machine learning algorithm ML,
(1) for every user u ∈ U and every status s ∈ S_u, compute the status vector v_s = (1/|s|) ∑_{w∈s} v_w;
(2) train ML on every status independently, with features (v_s, MAF_u) and the user’s age as the target;
(3) at the testing stage, compute predictions for every status of a user and average them.
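To make the feature construction concrete, here is a minimal sketch of MEANVEC in the notation above; `wv` (a gensim KeyedVectors model), `statuses[u]` (the tokenized statuses S_u), and `maf[u]` (MAF_u) are assumed to be built elsewhere.

```python
# A minimal sketch of MEANVEC feature construction; wv, statuses, and maf
# are assumptions standing in for the trained embeddings and the dataset.
import numpy as np

def status_vector(tokens, wv):
    # v_s: mean of the word vectors of a single status
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def meanvec_features(u, statuses, maf, wv):
    # v_u: mean of the status vectors over S_u, plus MAF_u as an extra feature
    v_u = np.mean([status_vector(s, wv) for s in statuses[u]], axis=0)
    return np.append(v_u, maf.get(u, 0.0))
```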
5.2 Evaluating User Age Prediction
In the first experiment, we took the simplest MEANVEC algorithm and compared how various word2vec models perform. The results are shown in Table 4. We can draw the following conclusions:
— naturally, the MEANAGE algorithm does not care about word2vec at all; it is only included as a sanity check;
— word2vec models do help all models, both linear and GRADBOOST – compare these results with Table 3;
— it appears that CBOW models outperform skip-gram models in this task (quite significantly);
— by increasing the dimension d, we also get some improvements, but these improvements are rather small;
— a decrease in v, although it makes the word2vec model significantly larger and longer to train, has absolutely no effect on the end result.
Generally speaking, these conclusions mean that for the purposes of demographic analysis and similar problems we can concentrate on relatively small word2vec models, with dimensions 100 or 200, and perhaps further increase v, which would lead to much smaller models and faster training.
In the second experiment here, we have compared raw and normalized word2vec models in the same setting; some of the results are shown in Table 5, where raw and normalized versions are shown immediately next to each other for convenience. The results are rather interesting: the “farther” the classifier is from linear models, the better the normalized versions perform. For LINEARREGR, raw vectors slightly outperform normalized ones; for ELASTICNET, there is almost no difference; and GRADBOOST makes (sometimes significantly) better use of the normalized versions. This result can probably be attributed to the fact that while normalized vectors are indeed usually recommended, raw vectors can have larger absolute values, including rather large outliers, and simple linear models are better at picking up on larger absolute values. Still, the conclusion is to mostly use normalized models in the future, since we are after the best model rather than the best linear regression.
The next step was to compare the baseline algorithms with each other. Table 6 shows the comparison between the MEANVEC and LARGESTCLUSTER algorithms (marked MV and LC) on the original (extended) dataset, for a selection of normalized word2vec models.
Interestingly, the LARGESTCLUSTER algorithm invariably loses to MEANVEC in all experiments. One possible reason might be that the largest cluster of all statuses often turns out to be the least meaningful one (e.g., consisting of similar reposts from an online game or of extremely brief statuses such as a single smiley); we have verified this idea by direct examination of the data, but we believe that variations on the idea of clustering statuses might yet prove useful in the future.
The next comparison was performed on the smaller “basic” dataset presented above. Results are shown in Table 7, which marks the MEANVEC, LARGESTCLUSTER, and ALLMEANV algorithms as MV, LC, and AV respectively.
As for the results, the LARGESTCLUSTER algorithm again loses in almost all cases to both MEANVEC and ALLMEANV. What is much more interesting, however, is that ALLMEANV, while performing roughly on par with MEANVEC for LINEARREGR and ELASTICNET, begins to lose significantly to MEANVEC and even LARGESTCLUSTER when we use GRADBOOST as the classifier. This result was quite surprising, since we expected that more data and more detailed status vectors (individual for each status rather than averaged over all statuses of a user) would actually bring an improvement. One possible reason for this behavior is that in passing from MEANVEC to ALLMEANV we have, in essence, “moved” the averaging from the semantic space of word embeddings to averaging prediction results. Hence, this result can be interpreted as showing that simple averaging works very well in the semantic space (not surprising given that many semantic relations become linear in the space of embeddings), even better than building an ensemble of predictions from individual statuses afterwards.
5.3 Word2vec Trained on Different Data
Another interesting question is whether to use generic word2vec models trained externally on large text corpora or to train word embeddings specifically for this problem. To answer this question, we have trained word2vec models on the user statuses and group posts themselves with the gensim library. Table 8 shows a comparison of our three basic classifiers, LINEARREGR, ELASTICNET, and GRADBOOST, for the MEANVEC algorithm with these “local” word embeddings and the “global” word embeddings trained externally (the latter were used in all previous experiments). We see that while the difference for ELASTICNET is nonexistent, both LINEARREGR and GRADBOOST consistently make better use of the “local” word2vec models. Hence, in future studies we recommend training word embeddings locally or fine-tuning global embeddings on the local dataset.
6 Mining User Interests
6.1 Problem Setting
Apart from demographic predictions, a harder and arguably even more commercially attractive task of user profiling is to concisely represent a user’s topical interests, preferably as narrowly as possible. As we discussed in the introduction, summarizing prior user-system interaction lies at the core of personalized search, advertisement engines, and recommender systems, and helps alleviate the cold start problem: given user profiles and a way to match a new item to them, one can make recommendations when collaborative filtering is inapplicable.
This motivation ties in well with full-text recommendations. When users interact with items that have actual texts associated with them, it becomes possible to infer topical user profiles by automatically mining the texts they interact with. This problem has become especially relevant in recent years due to the growth of the social Web, where users interact with various texts all the time, not only reading but actively rating them.
As for possible solutions, recent advances in natural language understanding, especially in distributional semantics, provide many promising new methods for this problem. This is precisely the path that we take in the second part of the work, using topical clusters based on distributed word representations to construct user profiles.
In this work, we propose a novel method for user profiling in full-text recommender systems, constructing a user profile as an interpretable summary of the user’s interests that can also be utilized for recommending new items solely based on the prior state of the system.
6.2 Brief Outline of the Approach
First, we cluster all word representations trained on an external corpus. We have obtained high-quality clusters that are easy to interpret as possible indicators of users’ interests, so they were chosen to serve as the basis for user profiling: a user is characterized by his or her affinity to these clusters.
For the recommender system, we used a large dataset from the “Odnoklassniki” online social network, with group posts (texts written in online communities by their members) and individual user posts (texts published by a user on his/her profile page) as full-text items and user likes for these posts as ratings.
There are two important obstacles along the way.
First, the dataset contains only positive signals from the users (likes), which is common in real-life recommender systems but makes training hard. While recommender systems based on such implicit information do exist, e.g., systems based on max-margin non-negative matrix factorization [39], it is unclear how to adapt them to full-text recommendations and user profiles in the semantic space.
Second, whatever technique one tries for this problem, user profiles always tend to be dominated by clusters/topics consisting of common words that occur often in texts on various topics but are useless for recommendations.
The second problem was especially hard to solve; we solved it with a novel approach to user profiling based on logistic regression trained multiple times on random subsets of the dataset; this approach is described in detail below.
6.3 Clustering Word Vectors
In our experiments, we use a skip-gram word2vec model of dimension 500 trained on a large Russian language corpus [4,61].
To get a finite set of possible user interests or document topics, we clustered the word vectors directly. Note that while for some other applications topic modeling [13, 84] might prove to be more useful, in our case the underlying texts were too short and of too poor quality to hope for a good topic model: decisions regarding the topics would often have to be made on the basis of one or two keywords. Besides, we wanted to develop a top-down general approach that would be applicable even without a large and all-encompassing collection of texts available directly in the recommender system.
The embeddings of terms that occurred in our social network posts dataset resulted in 111,281 vectors to be clustered in the ℝ^500 space. We have tried several methods for large-scale clustering, including Birch [90], DBSCAN [76], and mean shift clustering [23], but despite being generally able to process 100K+ items, these methods proved to be not fast enough for high-dimensional data (dimension 500 in our case) coupled with 2000 clusters.
Hence, the best option turned out to be classical k-means clustering. We applied mini-batch k-means, which samples subsets of data (mini-batches) and then applies standard k-means steps to them: points are assigned to centroids, and centroids are “moved” to the actual centers; the updates are done stochastically, after every mini-batch [77]. For initialization, we used the k-means++ approach that initializes cluster centers as far from each other as possible and then applies standard k-means to a random data subset to refine the initialization [6].
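For concreteness, a short scikit-learn sketch of this clustering step follows; the random matrix below stands in for the actual 111,281 × 500 matrix of word vectors, and the batch size is an illustrative assumption.

```python
# Sketch of the clustering step: mini-batch k-means with k-means++
# initialization via scikit-learn; the random matrix is a placeholder
# for the real word vector matrix.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

vectors = np.random.randn(20000, 500).astype(np.float32)  # placeholder data

km = MiniBatchKMeans(n_clusters=2000, init="k-means++",
                     batch_size=1024, n_init=3, random_state=0)
labels = km.fit_predict(vectors)
centroids = km.cluster_centers_   # shape (2000, 500)
```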
Table 9 shows sample clusters together with their idf (inverse document frequency) values. It is clear that the most frequent clusters largely consist of common words that do not represent any specific topic that could be used for recommendations; they will be our major problem in the next section.
We begin with the following notation:
— D is the set of documents;
— C is the set of clusters;
— T is the set of all words, and T_c is the set of words in a cluster c ∈ C;
— word2vec : T → ℝ^d is the function assigning each word its embedding;
— df(t) is the number of documents a word t ∈ T occurs in;
— clust : T → C is the function returning the cluster of a word;
— Like(u) is the set of all items user u liked.
To produce user profiles, we first constructed fixed-dimensional vector representations of documents v_doc ∈ ℝ^d for each document doc ∈ D, representations of clusters v_c ∈ ℝ^d for each cluster c ∈ C, and finally representations of users v_u ∈ ℝ^d for each user u ∈ U based on the representations of the documents they liked and the corresponding clusters; in our experiments, d = 500. To build vector representations, we used a straightforward approach based on averaging and idf-like weighting. Suppose that we know word2vec embeddings word2vec(t) for a large proportion of the words t ∈ T in our data (not all of them, due to typos, proper names, and the like). Then we define
v_doc = (1/Z) ∑_{t∈doc} idf(t) · word2vec(t), where idf(t) = log(|D| / df(t)),
and
v_c = (1/Z) ∑_{t∈T_c} word2vec(t).
Finally, the user representation is
v_u = (1/Z) ∑_{doc∈Like(u)} v_doc,
where Z is the corresponding normalization value in each case.
Then we constructed a new representation of a document, designed as a vector of cluster likelihoods p(c | doc); namely, for every document doc ∈ D and every cluster c ∈ C we computed
p(c | doc) = ∑_{t∈doc, clust(t)=c} idf(t) / ∑_{t∈doc} idf(t).
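The following sketch reflects our reading of the definitions above and is not the exact production code; `wv`, `idf` (word → idf value), and `clust` (word → cluster index) are assumed to be precomputed.

```python
# Hedged sketch of the idf-weighted document vectors and cluster affinities
# p(c | doc); wv, idf, and clust are assumptions built elsewhere.
import numpy as np

def doc_vector(tokens, wv, idf):
    # v_doc: idf-weighted sum of word embeddings, normalized by Z
    vecs = [idf[t] * wv[t] for t in tokens if t in wv and t in idf]
    if not vecs:
        return np.zeros(wv.vector_size)
    v = np.sum(vecs, axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def cluster_affinities(tokens, idf, clust, n_clusters):
    # p(c | doc): share of the document's idf mass falling into cluster c
    p = np.zeros(n_clusters)
    for t in tokens:
        if t in idf and t in clust:
            p[clust[t]] += idf[t]
    total = p.sum()
    return p / total if total > 0 else p
```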
Then, to construct the profile of a user u from his or her set of liked items Like(u), we repeated the following procedure N times independently (in the experiments below, we used N = 100):
(i) on step k, draw a random sample from the documents the user u did not like, taking the size of the sample equal to the number of documents the user actually liked; we denote this sample by Dis_k(u);
(ii) train logistic regression on the following data: documents from Like(u) are the positive examples, documents from Dis_k(u) are the negative examples, and the features are the document affinities to clusters p(c | doc), c ∈ C;
(iii) as a result of this logistic regression, we get a set of weights w_u,c,k, one for each cluster c.
Then, for every user u, his or her profile is defined as the parameters of a normal distribution for every cluster weight, {(c, µ_u,c, σ_u,c) | c ∈ C}, where each pair (µ_u,c, σ_u,c) is estimated from the sample {w_u,c,k | k = 1, …, N}.
In other words, logistic regression here is used to approximate the probability of a like; it trains a hyperplane separating liked items from items that have not been liked in the semantic feature space. This simple approach would be sure to fail if we simply trained liked documents against non-liked documents since the dataset is vastly imbalanced (a single user can be expected to view but a tiny fraction of all items); hence the random sampling of non-liked documents.
However, there is one more purpose to the random sampling apart from balancing the problem. We would like to solve the problem of common-word clusters: clusters that contain common words are ubiquitous in the dataset and tend to dominate all user profiles simply because, by random chance, a user will like more than his or her fair share of some common-word clusters. In randomly sampling the negative examples, we get a certain distribution of “concentrations” of different clusters in the negative examples. Note that:
— “topical” clusters that contain rare words will seldom occur in the negative examples, and thus the variance σ_u,c of the resulting weight distribution for these clusters will be low;
— “common word” clusters, whose words are widely distributed across the entire dataset, will appear sometimes more and sometimes less often in the negative examples, and thus the variance σ_u,c of the resulting weight distribution for these clusters will be high.
Hence, this approach lets us distinguish between common-word clusters and topical clusters by the value of σ_u,c: the higher the standard deviation, the more likely it is that the cluster consists of common words. As the final scoring metric for the user profile, we propose to use the mean weight penalized by its standard deviation; we used µ − 2σ as the final score in the examples below. This scoring metric can also be thought of as the lower bound of a confidence interval for the cluster affinity. Figure 4 shows sample results for two user profiles. Note how common-word clusters have high average affinity but also high standard deviation, which drags them down in the final scoring and lets topical clusters come out on top.
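A hedged sketch of the whole procedure, steps (i)–(iii) plus the µ − 2σ scoring, is given below; the matrix `P` of affinities p(c | doc) and the index set `liked` are assumptions standing in for the real data structures.

```python
# Hedged sketch of the resampled logistic regression profiling procedure;
# P is a (documents x clusters) matrix of p(c | doc), liked is the array of
# document indices the user liked (both are assumptions for the example).
import numpy as np
from sklearn.linear_model import LogisticRegression

def user_profile(P, liked, n_rounds=100, seed=0):
    rng = np.random.default_rng(seed)
    liked = np.asarray(liked)
    others = np.setdiff1d(np.arange(P.shape[0]), liked)
    weights = []
    for _ in range(n_rounds):
        neg = rng.choice(others, size=len(liked), replace=False)  # step (i)
        X = np.vstack([P[liked], P[neg]])                         # step (ii)
        y = np.r_[np.ones(len(liked)), np.zeros(len(neg))]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        weights.append(clf.coef_[0])                              # step (iii)
    W = np.vstack(weights)                 # shape: (n_rounds, n_clusters)
    mu, sigma = W.mean(axis=0), W.std(axis=0)
    return mu, sigma, mu - 2 * sigma       # final per-cluster scores
```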
6.4 Recommender Algorithm and Evaluation
Here, we present an actual recommender algorithm based on the user profiles mined as shown above. This serves as both a sample application for our user profiling system and as a way to evaluate our results numerically, by comparing it to baseline recommender algorithms.
We propose the following item-based algorithm to make recommendations based on a user profile in the form {(c, µ_u,c, σ_u,c) | c ∈ C}: rank every candidate document doc by the variance-penalized score ∑_{c∈C} (µ_u,c − 2σ_u,c) · p(c | doc), recommending the top-ranked documents.
Note that this is a cold start algorithm for the items: it does not use an item’s likes at all, only the likes of a user to construct his or her profile.
We have conducted experimental evaluation with a large dataset provided by the “Odnoklassniki” social network. For the experiment, we have chosen to use posts in groups (online communities) and likes provided by the users for these posts since a post in a group, as opposed to a post in a user’s profile, is likely to be evaluated by many users with different backgrounds, and the users are more likely to like it based on its topic and content rather than the person who wrote it.
Thus, the dataset consists of texts of posts in the communities (documents) and lists of users who liked these posts. The dataset contains 286K words in the vocabulary (after stemming and stop-word removal), 14.3M documents (group posts), and 284.6M total tokens in these documents.
As the user set U, we chose the top 2000 users with the most likes from a randomly sampled subset of users (so that we get users with many likes but not outliers with a huge number of likes, which most probably correspond to bots or very uncharacteristic users). We divided their likes into disjoint training and test sets; there were 16,000 likes by these users in the training set and 4797 likes in the test set.
We carried out our evaluation procedures on three algorithms: two baseline collaborative algorithms and the new algorithm described above.
The user-based collaborative algorithm finds k nearest neighbors for a user and recommends documents according to these neighbors’ likes. Specifically, for each user u we build a list of its k nearest neighbors N(u) by cosine distance (via LSHForest) in the space of user vector representations in ℝ^d. Then we set the affinity between users to the cosine similarity of their vector representations:
aff(u, u′) = cos(v_u, v_u′).
Documents are ordered by the following ranking function:
Score_user(u, doc) = ∑_{u′ ∈ N(u) ∩ Liked(doc)} aff(u, u′),
where Liked(doc) is the set of users who liked document doc. Thus, we rank documents according to the summed affinities of the neighbors who liked them.
The item-based collaborative algorithm finds k nearest neighbors for a document and recommends documents similar to the ones a user liked. Specifically, for each doc ∈ D we build the set of its k nearest neighbors N(doc) by cosine distance in the space of document vector representations, compute similarities between documents as sim(doc, doc′) = cos(v_doc, v_doc′), and rank documents as
Score_item(u, doc) = ∑_{doc′ ∈ N(doc) ∩ Like(u)} sim(doc, doc′),
where Like(u) is the set of documents user u liked.
In all algorithms that use the k nearest neighbors approach, we used an empirically chosen k = 5.
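As a sketch of the item-based ranking (under our reading of the formulas above), with a precomputed document-document cosine similarity matrix `S` and neighbor lists `neighbors[d]` for k = 5 assumed:

```python
# Hedged sketch of the item-based ranking; S is a precomputed document-
# document cosine similarity matrix and neighbors[d] lists the k = 5 nearest
# neighbors N(d) of document d (all of these are assumptions for the example).
def item_based_score(d, neighbors, liked, S):
    # Score_item(u, d): summed similarity over N(d) ∩ Like(u)
    return sum(S[d, d2] for d2 in neighbors[d] if d2 in liked)
```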
Finally, in our regression-based algorithm we recommend according to the negative-biased posterior: given a user profile {(c, µ_u,c, σ_u,c) | c ∈ C}, we rank documents according to
Score_regr(u, doc) = ∑_{c∈C} (µ_u,c − 2σ_u,c) · p(c | doc).
All user and document vector representations are normalized before applying each of the algorithms above. Each of the evaluated recommender algorithms provides a ranking of documents for a given user. We build a set of all likes from the test set together with the same number of unliked documents for each user. The liked documents are expected to be ranked higher than the others on average, which is a standard ranking task. Hence, we used standard ranking evaluation metrics to evaluate the algorithms (a minimal sketch of the metrics follows the list):
— NDCG (Normalized Discounted Cumulative Gain) is a unified metric of ranking quality [35]; the discounted cumulative gain is defined as
DCG_p = ∑_{i=1}^{p} liked_i / log₂(i + 1),
where liked_i = 1 iff item i in the ranked list is recommended correctly, and NDCG normalizes this value by the maximal possible one: NDCG_p = DCG_p / IDCG_p, where IDCG_p is the DCG of a perfect ranking with all correct items on top;
— Top1, Top5, and Top10 metrics show the share of liked documents at the first place, among the top five, and among the top ten recommendations respectively; these metrics are important for real-life recommender systems since an average user commonly views only a very small number of recommendations.
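A minimal sketch of these metrics for binary relevance follows; `ranked` is assumed to be a 0/1 array of like indicators in ranked order.

```python
# Minimal sketch of the evaluation metrics: binary-relevance NDCG and the
# Top-k share; ranked is an assumed 0/1 array of likes in ranked order.
import numpy as np

def ndcg(ranked):
    ranked = np.asarray(ranked, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(ranked) + 2))
    dcg = np.sum(ranked * discounts)
    ideal = np.sort(ranked)[::-1]          # all correct items on top
    idcg = np.sum(ideal * discounts)
    return dcg / idcg if idcg > 0 else 0.0

def top_k(ranked, k):
    # share of liked documents among the top k recommendations
    return float(np.mean(np.asarray(ranked, dtype=float)[:k]))
```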
Results of our experimental evaluation are shown in Table 10. We see that the simple cold start recommender algorithm based on our user profiles performs virtually on par with collaborative algorithms that actually take into account the likes already assigned to this item. These are very good results for a cold start algorithm; note, however, that actual recommendations of full-text items in the same system are not the only or even the main purpose of our approach: the ultimate goal would be to employ user profiles to make outside recommendations for other items with textual content or tags that could be related to the interest profile, such as targeted advertising.
Another way to demonstrate that the regression method learns new things about the users and items being recommended is to show its contribution to the performance of ensembles of rankers. We used the following blending method: first, we normalized the scores obtained by the methods in the blend (Score_m for each ranking method m):
NScore_m(doc) = (Score_m(doc) − min_doc′ Score_m(doc′)) / (max_doc′ Score_m(doc′) − min_doc′ Score_m(doc′))
for every document doc and ranking method m, and then constructed the final scoring function as
Score(doc) = ∑_m α_m · NScore_m(doc),
where α_m are blending weights to be found. We use hill climbing to tune the parameters α_m ∈ [−1, 1], maximizing average NDCG on a separate validation set and finally testing performance on a production set that constituted 20% of the data. Rows 4 and 5 of Table 10 show that the blends noticeably improve upon the performance of both our regression-based approach and the collaborative algorithms.
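Below is a hedged sketch of this blending scheme with min-max score normalization (our assumption for the normalization step) and a simple hill-climbing search; `scores` maps each method to its raw document scores, and `evaluate` is assumed to compute average NDCG on the validation set.

```python
# Hedged sketch of score blending with hill climbing; scores and evaluate
# are assumptions standing in for the real rankers and validation metric.
import numpy as np

def normalize(s):
    # min-max normalization of one method's raw scores (our assumption)
    lo, hi = s.min(), s.max()
    return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)

def blend(scores, alphas):
    return sum(a * normalize(scores[m]) for m, a in alphas.items())

def hill_climb(scores, evaluate, steps=200, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    alphas = {m: 0.0 for m in scores}
    best = evaluate(blend(scores, alphas))
    for _ in range(steps):
        m = rng.choice(list(alphas))                 # perturb one weight
        trial = dict(alphas)
        trial[m] = float(np.clip(trial[m] + rng.choice([lr, -lr]), -1.0, 1.0))
        score = evaluate(blend(scores, trial))
        if score > best:                             # keep only improvements
            alphas, best = trial, score
    return alphas, best
```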
7 Conclusion and Future Work
In this work, we have prepared and preprocessed a huge Russian-language free-text dataset drawn from a number of different sources, ranging from literature to user statuses in social networks, and trained a number of word2vec models on it. We have obtained and preprocessed a large user profiling dataset from the social network Odnoklassniki, suggested a number of user profiling algorithms based on word2vec embeddings, and performed a large-scale comparison of these algorithms and different word2vec models, drawing conclusions important for subsequent work on user-generated texts. We have also presented a new approach to user profiling based on logistic regression over randomly resampled subsets of items, which leads to readily interpretable user profiles; our experiments have shown that a simple cold start recommender algorithm based on these profiles produces results comparable to collaborative approaches and can be blended with them for further improvement.
While the proposed age prediction algorithms did bring certain improvements over the “zero baseline” of training with the mean age of a user’s friends, these improvements were not huge in absolute terms: we have been able to shave off about 0.2 years in terms of mean absolute error. Nevertheless, we remain optimistic that these results can be much improved in the future. In further work, we plan to
(1) develop new features for user profiling algorithms based on text embeddings (embedding larger portions of text than a word); here we hope to train a deep text understanding model for the Russian language and apply it to user profiling,
(2) develop and train a character-level word embedding model for the Russian language; we expect this model to be very important for studies of user-generated texts replete with typos, intentional misspellings, and so on.
Also, apart from developing new user profiling algorithms, we plan to investigate other variations of word embeddings. For example, one such variation is given by the Polyglot system [2], and a completely different, graph-based direction is proposed in [1].
We also note recent efforts in word sense disambiguation for word embeddings: the same word can have several very different meanings, and it would be natural to model them with several vectors in the semantic space [15, 17, 19, 34, 45, 46, 75, 88, 89]. In further work, we plan to perform an even more extensive comparison of various word embedding variants; a comparison across these models might provide valuable insight into the use of word2vec models for downstream applications such as user profiling, sentiment analysis, or full-text recommendations.
The work of Anton Alekseev was supported by the Russian Federation grant 14.Z50.31.0030. The work of Sergey Nikolenko was supported by the Basic Research Program at the National Research University Higher School of Economics (HSE) in 2017.