Enhanced Microblog Network Representation with User-Generated Content

Real-world networks are so sparse that the limited edges alone cannot fully capture valuable user features. However, existing network embedding methods usually consider only the network structure. Inspired by the assumption that latent relationships exist between users who generate similar text, we propose to enhance the original network structure by incorporating latent relationships extracted from user-generated content into the raw network topology, and then to learn a representation of the revised network. We evaluate the proposed method on two inference tasks, and the experimental results show that the network representation produced by our enhanced network embedding method outperforms the baseline methods on a dataset provided by Sina Microblog.


Introduction
With the widespread popularity of social network services such as Sina Microblog, networks are ubiquitous in daily life and play an increasingly important role in the study of society. However, the diversity and complexity of network information make it hard to capture valuable user features, and therefore hard to support downstream applications such as interest recommendation and user profiling. A straightforward idea is to learn user representations from the available network information. A large-scale network constitutes a high-dimensional space, so network representation learning seeks to embed the network into a low-dimensional space. In recent years, network representation learning has drawn extensive attention.

Related Work
Network representation is essentially network embedding in a low-dimensional space, and many research efforts have been devoted to it. Early manifold learning methods [1], such as Isomap, Laplacian Eigenmaps (LE) and Locally Linear Embedding, can be considered its origin. Inspired by word embedding [2], DeepWalk [3] used random walks [4] to generate vertex sequences that play the role of sentences in a corpus, and was the first to introduce Skip-Gram [5] into representation learning on binary networks. However, DeepWalk did not establish a unified objective function and, worse, its randomly generated vertex sequences are affected by considerable noise. Hence, Tang et al. [6] proposed to model the first-order and second-order proximity separately and presented a simple method to combine the two resulting representation vectors. Furthermore, GraRep [7], proposed by Cao et al., took the N-order network structure into consideration, where N > 2. To address the randomness of vertex sequences in DeepWalk, node2vec [8] added flexibility in exploring neighborhoods to learn richer network representations. All of these methods learn network representations from structure information only, yet the available network information goes beyond the network topology. Thus, Yang et al. presented TADW [9], which models text features and network structure simultaneously with an inductive matrix completion algorithm to learn better vertex representations. GENE [10], proposed by Chen et al., incorporated group information into network embedding. Besides, Li et al. [11] proposed a multi-faceted representation learned jointly from diverse information such as user-generated content, user attributes and user-user network graphs.
However, real-world networks are usually sparse; that is, only a small number of vertex pairs are connected by edges. With such limited structure information alone, it is difficult to learn a network representation that performs well in downstream tasks. Social networks produce not only the user-user network topology but also diverse information from each user, such as text, images, videos and emoji. We assume that users with similar user-generated text tend to have common interests and concerns, which indicates a latent relationship between them.
Hence, motivated by the sparseness problem and inspired by this assumption, a feasible approach is to enhance the original network structure with rich user-generated content. Our work shows how to incorporate the latent relationships extracted from text information into the original network structure and then learn the network representation of the revised network topology with the classic network embedding model LINE [6]. Finally, we conduct several experiments on gender and age inference tasks to evaluate the representation vectors learned by our method against the baseline methods.
In summary, our major contribution is to extract latent relationships from user-generated text, integrate them into the raw network structure, and then learn the network representation with LINE [6], which addresses the sparseness problem of real-world networks.

Method
In this section, we first define the problem. We then describe the latent relationships extracted from user-generated content and the relationships that integrate the first-order and second-order proximity of the given network structure. Finally, we introduce how the initial network is revised on the basis of these relationships.

Definition
The vertex corresponding to a user in a microblog network usually carries rich text information, namely everything the user has published so far.
A network with rich text information is denoted as $G = (V, E, T)$, where $V = \{v_i\}$ is the set of user vertices and $E = \{(v_i, v_j)\}$ is the set of edges, each associated with a weight $w_{ij} \in \{0, 1\}$. $T = \{t_i\}$ is the set of user-generated blogs, where each entry $t_i$ is the text paragraph of user $v_i$.
Our study therefore aims to capture the latent information in user-generated content and to learn, for each vertex $v_i$, a low-dimensional representation $y_i \in \mathbb{R}^k$ from the revised network $G''$, where $k$ is expected to be much smaller than $|V|$.

Latent Relationships from User-generated Content
Users who publish similar blogs usually share several interests; that is, they are likely to be friends or have the potential to become friends. We therefore call the relationship extracted from user-generated content a latent relationship.
Extracting latent relationships is essentially a text similarity problem. However, each blog is at most 140 characters long, so all the historical blogs of each user are merged into a single text paragraph. Because the colloquial style of blog text makes text analysis vulnerable to noisy data, text preprocessing is crucial to the accuracy of subsequent tasks, and targeted measures such as stop-word filtering, nonstandard-word replacement and word segmentation must be taken. The specific preprocessing measures for user-published blogs are as follows:
Text enclosed by the symbol "#" in a microblog is the topic of the blog and reflects the user's interests, so these topic phrases are extracted and used directly as keywords for the corresponding user-generated text.
Content after the symbol "@" usually represents a user name and does not need to be segmented further.
Useless noise in the raw text, such as punctuation and special symbols, is filtered out.
All nonstandard words are replaced by looking them up in a nonstandard-word vocabulary. Nonstandard words are widely recognized Internet slang; for example, "thanks" may be written as "3Q" or "3q". Users may also split one Chinese character into two or more parts to avoid certain expressions.
Complex-form (traditional) Chinese characters are replaced with their simplified forms according to a simplified-character list.
Stop words are removed.
TF-IDF is calculated over the corpus, and low-frequency words are filtered out according to a preset threshold.
HanLP is used to segment the remaining text after the above preprocessing measures.
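The cleaning steps listed above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `SLANG` and `STOP_WORDS` tables are hypothetical stand-ins for the vocabularies mentioned in the text, and a plain whitespace split stands in for a real segmenter such as HanLP.

```python
import re

# Hypothetical lookup tables; the actual vocabularies are not given in the text.
SLANG = {"3q": "thanks"}
STOP_WORDS = {"the", "a", "to"}


def preprocess(blog):
    """Return (topics, mentions, tokens) for one raw blog post."""
    # Topic phrases are enclosed in '#' and are kept whole as keywords.
    topics = re.findall(r"#([^#]+)#", blog)
    text = re.sub(r"#[^#]+#", " ", blog)
    # User names follow '@' and are not segmented further.
    mentions = re.findall(r"@(\w+)", text)
    text = re.sub(r"@\w+", " ", text)
    # Drop punctuation and special symbols.
    text = re.sub(r"[^\w\s]", " ", text)
    # Assumption: whitespace splitting replaces a real word segmenter here.
    tokens = [SLANG.get(tok.lower(), tok.lower()) for tok in text.split()]
    # Remove stop words after slang replacement.
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return topics, mentions, tokens


topics, mentions, tokens = preprocess(
    "#machine learning# 3Q @alice, the talk was great!"
)
print(topics, mentions, tokens)
```

The TF-IDF filtering step is left out to keep the sketch short; it would run over the token lists of the whole corpus rather than over a single post.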
We then adopt the Latent Dirichlet Allocation (LDA) [2] model, which identifies latent topic information in a large corpus, to extract a feature vector for each user. Finally, the cosine similarity between any two feature vectors gives the weight of the latent edge between the corresponding two users.
LDA is a generative probabilistic model with document, topic and word layers. It is based on the idea that documents are represented as random mixtures over $K$ latent topics, where each topic is characterized by a multinomial distribution over words and each document follows a multinomial distribution over the $K$ latent topics. The generation process for each document $\mathbf{w}$ in a corpus $\mathcal{D}$ consisting of $M$ documents is described as follows. For each document, choose $\theta \sim \mathrm{Dir}(\alpha)$, where $\mathrm{Dir}(\alpha)$ is the Dirichlet distribution with parameter $\alpha$ and $\theta$ is a topic vector, each component of which is the probability that the corresponding topic appears in the document. Then, for each of the $N$ words in the document, choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$ and choose a word $w_n$ from $p(w_n \mid z_n, \beta)$.
Given the parameters $\alpha$ and $\beta$, the joint distribution of the model is defined as follows:
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
Here $\mathbf{w}$ is the observed variable, while $\theta$ and $\mathbf{z}$ are hidden variables. We then use the Expectation Maximization (EM) algorithm to learn the parameters $\alpha$ and $\beta$.
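To make the generative view concrete, the following sketch samples one document under the standard LDA process: a topic mixture drawn from a Dirichlet distribution, then a topic and a word for each position. The vocabulary, the topic-word matrix `beta` and the value of `alpha` are toy assumptions chosen for illustration, and the Dirichlet draw is obtained by normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document: theta ~ Dir(alpha), then z_n and w_n per word."""
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(alpha)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]   # z_n ~ Multinomial(theta)
        w = rng.choices(vocab, weights=beta[z])[0]     # w_n ~ p(w | z_n, beta)
        words.append(w)
    return theta, words


rng = random.Random(0)
vocab = ["music", "film", "goal", "match"]            # toy vocabulary (assumed)
beta = [[0.5, 0.5, 0.0, 0.0],                         # topic 0: entertainment
        [0.0, 0.0, 0.5, 0.5]]                         # topic 1: sport
theta, doc = generate_document([0.5, 0.5], beta, vocab, 8, rng)
print(theta, doc)
```

Fitting LDA reverses this process: given only the words, EM (or another inference method) estimates the parameters that make the observed corpus most likely.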
Suppose that the top $T$ topics are retained. Each text paragraph is then embedded into a feature vector $t_{v_i} = \{w_1, w_2, \cdots, w_T\}$, where $w_j$ is the weight of the $j$-th topic. Each feature vector thus describes the topics associated with a user's content; in other words, the user's concerns or interests can be obtained from the published blogs. We extract the latent relationships from these representation vectors with cosine similarity, although other similarity measures such as the Pearson correlation coefficient or the Jaccard coefficient could also be used. Given two representation vectors $t_{v_i}$ and $t_{v_j}$, $w'_{ij}$ denotes the latent relationship between user $v_i$ and user $v_j$, and it is defined as follows:
$$w'_{ij} = \frac{t_{v_i} \cdot t_{v_j}}{\lVert t_{v_i} \rVert \, \lVert t_{v_j} \rVert}$$
Accordingly, the latent adjacency matrix extracted from user-generated blogs is the matrix $W' \in \mathbb{R}^{|V| \times |V|}$, where each entry $w'_{ij} \in [0, 1]$.
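As an illustration of how the latent adjacency matrix can be built from topic vectors, here is a small pure-Python sketch. The three topic vectors are made-up values, and leaving the diagonal at zero (no self-loops) is an assumption of this sketch rather than something stated in the text.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def latent_adjacency(topic_vectors):
    """Symmetric matrix of pairwise cosine similarities, with a zero diagonal."""
    n = len(topic_vectors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = cosine(topic_vectors[i], topic_vectors[j])
    return W


# Toy topic-weight vectors for three users (assumed values).
vectors = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.0, 0.1, 0.9]]
W_latent = latent_adjacency(vectors)
print(W_latent)
```

Because topic weights are non-negative, every similarity falls in [0, 1], which matches the stated range of the latent weights.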

Integrated Relationships from Network Structure
Real-world social networks are usually sparse because only a small number of edges denote friendships between users. Direct friendships are established voluntarily by users according to their own preferences, so they play a significant role in network embedding methods that consider only the network structure. However, direct friendship alone is not enough to describe the network structure, since users who are not friends may still have something in common.
In fact, users who share similar friends tend to have common interests or similar characteristics in social media. LINE [6] took these facts into consideration and introduced the first-order and second-order proximity to characterize the local and global structure information as fully as possible.
First-order Proximity: Given the edge set $E$, for each pair of vertices in $E$, the weight on the corresponding edge is the first-order proximity. Let $w^{1}_{ij}$ be an entry of the first-order proximity matrix $W^{1}$; then
$$w^{1}_{ij} = \begin{cases} \text{weight of edge } (v_i, v_j), & (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases}$$
Second-order Proximity: The number of common neighbors of a pair of vertices defines the second-order proximity, which describes the similarity of the two users' neighborhood structures in the social network. Given the neighbor sets $\mathcal{N}_{v_i}$ and $\mathcal{N}_{v_j}$ of users $v_i$ and $v_j$ respectively, the number of common neighbors is calculated and the second-order proximity between $v_i$ and $v_j$ is
$$w^{2}_{ij} = \lvert \mathcal{N}_{v_i} \cap \mathcal{N}_{v_j} \rvert$$
We now combine the first-order and second-order proximity into the adjacency matrix extracted from the network structure. Let $W$ denote the integrated adjacency matrix, where each entry $w_{ij}$ is composed of the two corresponding proximities:
$$w_{ij} = \lambda w^{1}_{ij} + \mu w^{2}_{ij}$$
where $\lambda$ and $\mu$ are normalization coefficients whose values are adjusted iteratively until the result is optimal.
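The integration of the two proximities can be sketched as follows, under several stated assumptions: the network is treated as undirected and binary, the second-order proximity is the raw count of common neighbors, and the two are combined as a weighted sum with coefficients `lam` and `mu`. The coefficient values below are arbitrary illustrations, not tuned settings.

```python
def integrated_adjacency(n, edges, lam, mu):
    """Combine first-order (edge) and second-order (common-neighbor) proximity."""
    # Build neighbor sets, treating each edge as undirected (an assumption).
    neighbors = [set() for _ in range(n)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            first = 1.0 if j in neighbors[i] else 0.0       # binary edge weight
            second = len(neighbors[i] & neighbors[j])       # common-neighbor count
            W[i][j] = lam * first + mu * second             # assumed weighted sum
    return W


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
W = integrated_adjacency(4, edges, lam=0.5, mu=0.25)
for row in W:
    print(row)
```

Vertices 0 and 3 are not friends, yet they receive a non-zero weight because they share neighbor 2, which is exactly the effect the second-order term is meant to capture.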

Revised Network with Latent Relationships
In this section, we first describe how to revise the network structure with the latent relationships extracted from user-generated content, and then use LINE [6] to learn the embedding of the extended network topology. Revising the original adjacency matrix changes the original network $G$ in two ways. First, latent friendships appear between users who were not originally friends; from the perspective of graph theory, the weight $w_{ij}$ on the edge between the vertex pair $(v_i, v_j)$ changes from 0 to $w$ ($w \in [0, 1]$). Second, the relationships between some users become stronger, so the weights of some edges increase.
ITA 2017
Let $W''$ be the adjacency matrix of the revised network, where each entry $w''_{ij}$ is obtained by combining the integrated structural weight $w_{ij}$ with the latent weight $w'_{ij}$. However, some entries in the revised adjacency matrix are too small to be meaningful, and it is more reasonable to remove them. We therefore take the final revised adjacency matrix as the input of LINE [6] to calculate the low-dimensional representation. LINE [6] introduced the first-order and second-order proximity, learned a representation vector for each vertex from each proximity separately, and then combined the two vectors into the final vertex representation.
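The revision and pruning step can be sketched as below. The exact combination rule is not recoverable from the text, so this sketch assumes a simple additive rule, which is consistent with the two described effects (new edges appear, existing edges get heavier); the pruning `threshold` is likewise a hypothetical value.

```python
def revise_network(W_struct, W_latent, threshold):
    """Add latent weights to structural weights and drop very small entries."""
    n = len(W_struct)
    revised = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            weight = W_struct[i][j] + W_latent[i][j]   # assumed additive rule
            if weight >= threshold:                    # prune negligible entries
                revised[i][j] = weight
    return revised


# Toy matrices for three users (assumed values).
W_struct = [[0.0, 1.0, 0.0],
            [1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0]]
W_latent = [[0.0, 0.5, 0.25],
            [0.5, 0.0, 0.03125],
            [0.25, 0.03125, 0.0]]
W_revised = revise_network(W_struct, W_latent, threshold=0.125)
for row in W_revised:
    print(row)
```

In this toy case the existing edge between users 0 and 1 becomes stronger, a new latent edge links users 0 and 2, and the very weak latent link between users 1 and 2 is removed.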
Essentially, the first-order proximity is the weight of the edge between two vertices in the network. To model it, LINE [6] used the edge weights to build an empirical probability and used the joint probability built from the representation vectors to establish the objective function. Suppose that $\vec{v}_i^{1}$ and $\vec{v}_j^{1}$ are the vector representations of vertices $v_i$ and $v_j$ respectively; the joint probability between $v_i$ and $v_j$ is defined as follows:
$$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{v}_i^{1} \cdot \vec{v}_j^{1})}$$
Meanwhile, the empirical probability is defined as follows, where $w_{ij}$ here denotes the weight of edge $(v_i, v_j)$ in the network given to LINE:
$$\hat{p}_1(v_i, v_j) = \frac{w_{ij}}{\sum_{(v_m, v_n) \in E} w_{mn}}$$
LINE [6] then adopted the Kullback-Leibler (K-L) divergence to establish the objective function and learned the network representation by minimizing
$$O_1 = d(\hat{p}_1, p_1)$$
where $d(\cdot, \cdot)$, which measures the difference between the two probability distributions $\hat{p}$ and $p$, is defined as follows:
$$d(\hat{p}, p) = \sum_{(v_i, v_j) \in E} \hat{p}(v_i, v_j) \log \frac{\hat{p}(v_i, v_j)}{p(v_i, v_j)}$$
Omitting constants, this gives $O_1 = -\sum_{(v_i, v_j) \in E} w_{ij} \log p_1(v_i, v_j)$. To establish the second-order proximity model, LINE [6] assumed that each vertex plays two roles: the object vertex itself and the context of other vertices. Hence $\vec{v}_i^{2}$ denotes the representation of $v_i$ as an object vertex, while $\vec{v}_i^{2\prime}$ denotes its representation as the context of other vertices. Similarly, LINE [6] defined the conditional probability of the "context" vertex $v_j$ generated by the "object" vertex $v_i$ as follows:
$$p_2(v_j \mid v_i) = \frac{\exp(\vec{v}_j^{2\prime} \cdot \vec{v}_i^{2})}{\sum_{m=1}^{|V|} \exp(\vec{v}_m^{2\prime} \cdot \vec{v}_i^{2})}$$
The empirical probability is defined as follows:
$$\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{d_i}$$
where $d_i$ is the out-degree of vertex $v_i$. LINE [6] then established the objective function for the second-order proximity by minimizing the K-L divergence to obtain the corresponding network representation. Afterwards, these two representation vectors, optimized with the Negative Sampling algorithm (NEG) [13], are put together to form the final representation $\vec{v}_i$ of each vertex $v_i$ in the LINE model, where $\vec{v}_i = \vec{v}_i^{1} + \vec{v}_i^{2}$.
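To show how the two LINE objectives behave numerically, the sketch below evaluates them on a toy weighted edge list. It computes the full softmax for the second-order term, whereas LINE itself approximates it with negative sampling; the embeddings, context vectors and edge weights are arbitrary illustrative values, and no training loop is included.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def first_order_loss(edges, emb):
    """O_1 = -sum over edges of w_ij * log sigmoid(u_i . u_j)."""
    return -sum(w * math.log(sigmoid(dot(emb[i], emb[j]))) for i, j, w in edges)


def second_order_prob(i, j, emb, ctx):
    """p_2(v_j | v_i): softmax over all context vectors for object vertex i."""
    scores = [math.exp(dot(c, emb[i])) for c in ctx]
    return scores[j] / sum(scores)


def second_order_loss(edges, emb, ctx):
    """O_2 = -sum over edges of w_ij * log p_2(v_j | v_i)."""
    return -sum(
        w * math.log(second_order_prob(i, j, emb, ctx)) for i, j, w in edges
    )


# Toy inputs (assumed values): (source, target, weight) triples and 2-d vectors.
edges = [(0, 1, 1.0), (1, 2, 0.5)]
emb = [[0.1, 0.2], [0.0, 0.3], [-0.2, 0.1]]
ctx = [[0.05, 0.1], [0.2, -0.1], [0.0, 0.0]]
loss1 = first_order_loss(edges, emb)
loss2 = second_order_loss(edges, emb, ctx)
print(loss1, loss2)
```

Both losses shrink as the embeddings of connected vertices become more aligned, which is what gradient-based training on these objectives exploits.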

Experiments
In this section, we conduct several experiments on the proposed method and report the results on a real dataset. We use inference tasks to evaluate the quality of the vector representations and compare our method with the baseline methods.

Datasets
We conduct the experiments on a dataset extracted from Sina Microblog and provided by SMP CUP 2016. The dataset involves about 2.5 million users and 550 million friendships; among them, 4, 4000 users have microblog texts and 3000 users have labels. For the inference tasks, the dataset is divided into two subsets, one for training and the other for testing. The data are of four kinds: user information, labels, relationships and user-published content.

Baseline Methods
Since our proposed method exploits the rich text information of each vertex to extend the original network structure, it is necessary to demonstrate the improvement over network embedding that considers only the network topology. The real-world social network structure is a binary network; that is, the weight of each edge is either 0 or 1. Thus, we compare our method with DeepWalk [3] and LINE [6], both based only on the network structure, to validate the performance.

Experiment Settings
For our proposed method, the number of vertices $|V|$ in the dataset is 1560. The number of topics in the LDA model is set to 60; in other words, the dimension of the user-generated text vector is 60. For DeepWalk [3], the number of available vertices is 1155. We use the default parameters: the window size is 5, the walk length is 40, the number of random walks started at each node is 10, and the number of latent dimensions learned for each node is 64. For negative sampling in the LINE model, the number of negative samples is set to 5.

Experiment Results and Analysis
The representation vectors extracted from user-generated content with LDA indicate the latent relationships.
Here, the text representation corresponding to user $v_i$ is described as follows. To visualize the representation vectors, we choose the top three topics of each user's blogs and compute the coordinate on each of the three axes by adding the weight value to the topic number. Figure 3 shows the distribution of the user-generated text representations.
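The coordinate rule described above can be sketched in a few lines. This is an interpretation of the text: topics are numbered from zero by their index in the weight vector, and the three axes are ordered from the highest-weighted topic to the third highest; both choices are assumptions.

```python
def top3_coordinates(topic_weights):
    """Map a topic-weight vector to 3-d coordinates: topic number + weight."""
    # Indices of the three topics with the largest weights, heaviest first.
    ranked = sorted(
        range(len(topic_weights)),
        key=lambda k: topic_weights[k],
        reverse=True,
    )[:3]
    return tuple(k + topic_weights[k] for k in ranked)


# Toy topic distribution over six topics (assumed values).
coords = top3_coordinates([0.05, 0.5, 0.0, 0.25, 0.125, 0.075])
print(coords)
```

Users whose dominant topics coincide land in the same unit cube of the plot, and the weights place them within that cube, which is why similar texts appear clustered.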
We also visualize the network structure to compare the network topology before and after revision, as shown in Figure 4 and Figure 5.
Statistically, only five percent of the users in the dataset are friends with each other. Figure 6 shows that second-order proximity is more prevalent than first-order proximity, which agrees with our earlier observation about real-world networks.

Gender Inference
Gender inference is a supervised binary classification problem, which we solve using the representation vectors generated by our proposed method.
We used an SVM with a linear kernel and took the final representation vectors as the features to train the gender classifier. We used accuracy, precision, recall and F1-measure to analyze the classification results and evaluate the performance. Accuracy measures the fraction of test samples that are classified correctly. For classification tasks, precision is also called the positive predictive value, while recall is also called the true positive rate or sensitivity. The F1-measure is the harmonic mean of precision and recall.
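For reference, the four evaluation metrics can be computed from predicted and true labels as in the following sketch. The label vectors are made-up examples, not experimental results, and the choice of `1` as the positive class is arbitrary.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Made-up labels for eight test users.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```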

Each weight in a user's topic vector denotes the likelihood that the texts generated by that user belong to the corresponding topic.

Figure 1. Graphical model representation of LDA, which is composed of three layers. α and β are sampled only once during the generation process. The outer plate represents documents; the parameter θ is sampled for each document, and therefore the probability that each document generates topic z is different. The inner plate represents the repeated choice of topics and words within a document, and the word w is generated by z and β.

Figure 2. Schematic of the enhanced network topology. The subgraph of gray nodes is the original topology, and the nodes in other colors are isolated at first. After revision, the edges drawn as dotted lines are newly generated, and the thickness of each line represents its weight value.
Inspired by Word2Vec [2], DeepWalk [3] treats a vertex sequence as a sentence in a corpus. Random walks are used to generate the input sequences, and the Skip-gram model is then trained on these sequences to learn a latent representation for each vertex in the network.

Figure 3. The distribution of user-generated text representations. The clustering of the representation vectors indicates the similarity of the user-generated texts.

Figure 6. The number of common friends for each vertex pair. The number of common friends varies between user pairs, and for several pairs it exceeds 200. Although half of the vertex pairs have no common friends, the second-order proximity plays a significant role in the network topology.