Learning Word Subsumption Projections for the Russian Language

The semantic relations of hypernymy and hyponymy are widely used in natural language processing tasks for modelling subsumptions in common sense reasoning. Since the popularisation of distributional semantics, significant attention has been paid to applying word embeddings for inducing relations between words. In this paper, we show our preliminary results on adopting the projection learning technique for computing hypernyms from hyponyms using word embeddings. We also conduct a series of experiments on the Russian language and release open source software for learning hyponym-hypernym projections on both CPUs and GPUs, implemented with the TensorFlow machine learning framework.


Introduction
In linguistics, hyponymy denotes the asymmetric relationship between a generic term (hypernym) and a specific instance of this term (hyponym). These relations are similar to the relations between genus and species in biology and are called "subsumptions". For instance, the word "cat" is a hyponym of the word "feline". Traditionally, dictionaries of hypernyms and hyponyms are created manually by expert lexicographers or extracted automatically from a large collection of documents using lexico-syntactic patterns [1].
Since the inception of efficient methods for computing low-dimensional word embeddings by Mikolov et al. [2], significant attention has been paid to how distributional semantics can model relations of specific types, such as hypernymy or synonymy. One way to specify the type of relation between word vectors, investigated in this paper, is to induce a matrix such that multiplying a hyponym vector by it yields a hypernym vector. In particular, we investigate this approach in the context of the Russian language.
In this paper, we briefly review the related studies in Section 2 and describe our approach to learning word subsumptions in Section 3, providing an open source implementation. We conduct a performance study along with a quality evaluation in Section 4. Then, we discuss the obtained results in Section 5. Finally, we conclude with final remarks in Section 6.

Related Work
Currently, the most widely used method for detecting hypernyms and hyponyms is Hearst patterns [1]. These lexico-syntactic patterns, e.g., "Y such as X1 and X2", have found a substantial number of applications, including ontology learning [3]. However, these patterns operate on a sparse representation of words that is inconvenient to work with, a problem that is nowadays addressed using word embeddings [2].
Fu et al. [4] proposed the projection learning approach to learning hypernyms for the Chinese language. This approach learns a projection matrix such that multiplying a hyponym vector by it yields a hypernym vector. The learning problem is posed as a linear regression problem that is solved numerically using stochastic gradient descent. In addition, the k-means clustering algorithm is used to split the embedding space into several subspaces to make the model more flexible.
Levy et al. [5] observed the lexical memorization effect when using hyponym and hypernym embeddings for the subsumption classification task. However, they conclude that it is still possible to learn "prototypical hypernyms", i.e., the word categories, due to the reported effect.
Kutuzov et al. [6] showed that word embeddings can serve as informative language-independent semantic fingerprints when exploited in the problem of multilingual text clustering. In particular, a projection learning method similar to the one presented in our paper, trained on a bilingual dictionary, was used to translate words from Russian to Ukrainian.
The method of Kutuzov et al. mentioned above stems from the original publication of Mikolov et al. [7], where projection learning was used to translate words from English to Spanish. More recently, Vulic et al. [8] presented a systematic study of four classes of methods for learning bilingual embeddings. The authors find the approach based on linear projection, similar to the one we use in our method, to be the most practical and efficient.
Vylomova et al. [9] evaluated several popular approaches for computing semantic relations and found that in word embeddings, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.
Shwartz et al. [10] developed an integrated method that combines syntactic parsing features with word embeddings using a long short-term memory network. The resulting method, called HypeNET, is implemented as a recurrent neural network that encodes the patterns together with the embeddings.

Method
In the baseline setting proposed by Fu et al. [4], the projection matrix is obtained similarly to the linear regression problem, i.e., for the given row vectors x and y representing the hyponym and hypernym embeddings correspondingly, the |x| × |y| matrix Φ* is numerically approximated:

$$\Phi^{*} = \arg\min_{\Phi} \frac{1}{N} \sum_{(x, y)} \mathrm{dist}(x\Phi, y), \qquad (1)$$

where N is the number of training examples and dist(xΦ, y) is the distance between a pair of row vectors xΦ and y. In the original method, the Euclidean distance (L2 distance) is used. However, in distributional semantics, the cosine distance and similarity are the more widely used measures [2], so it is reasonable to study their performance. Distributed word representations tend to place synonyms and other related words among the nearest neighbours alongside the hypernyms [11], although the hypernyms are of primary interest. Thus, it also seems reasonable to provide examples of undesired relations to refine the matrix being approximated.
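The baseline objective can be illustrated with a minimal NumPy sketch. It is not the released implementation: the names `l2_loss` and `fit_projection` are hypothetical, the squared L2 distance is assumed for dist, and a closed-form least-squares solver stands in for the stochastic gradient descent used in [4].

```python
import numpy as np

def l2_loss(X, Y, Phi):
    """Mean squared L2 distance between projected hyponyms X @ Phi and hypernyms Y."""
    diff = X @ Phi - Y
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def fit_projection(X, Y):
    """Least-squares estimate of Phi; rows of X and Y are embeddings."""
    Phi, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Phi

# Toy data (assumption): 200 pairs of 5-dimensional vectors related by a known matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Phi_true = rng.normal(size=(5, 5))
Y = X @ Phi_true

Phi = fit_projection(X, Y)
print(l2_loss(X, Y, Phi))  # close to zero on noise-free data
```

On real embeddings the relation is noisy and far from linear, so the loss does not vanish; the toy data only shows that the objective recovers a known matrix.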

Variations
Here, we propose three variations of the above-mentioned method: hyponymy penalization, synonymy penalization, and hypernymy promotion. Each variation modifies the loss function by introducing an additional term weighted by a constant, α or β, that controls the balance between the two components of the loss function (in our experiments, we used α = 0.01 and β = 0.3). To prevent the difference from being negative, we use the absolute value.

Hyponymy Penalization
Our first variation is designed to enforce the asymmetry of the projection matrix, given that subsumption is an asymmetric relation. Thus, applying the same transformation to the predicted hypernym vector xΦ as to the hyponym vector x should not yield the initial hyponym vector x.
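A minimal sketch of such a loss is given below. The exact form is an assumption: the baseline term minus α times the distance between the re-projected vector xΦΦ and x, wrapped in the absolute value. The function names are hypothetical, the squared L2 distance is assumed, and Φ is taken to be square (no bias dimension) so that it can be applied twice.

```python
import numpy as np

def mean_sq_dist(A, B):
    """Mean squared L2 distance between corresponding rows of A and B."""
    return float(np.mean(np.sum((A - B) ** 2, axis=1)))

def hyponymy_penalized_loss(X, Y, Phi, alpha=0.01):
    """|dist(X Phi, Y) - alpha * dist(X Phi Phi, X)| for a square Phi (assumed form)."""
    projected = X @ Phi            # predicted hypernyms
    reprojected = projected @ Phi  # the same transformation applied once more
    return abs(mean_sq_dist(projected, Y) - alpha * mean_sq_dist(reprojected, X))

# Toy data (assumption): random 4-dimensional "embeddings".
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
Y = rng.normal(size=(8, 4))
Phi = rng.normal(scale=0.1, size=(4, 4))
print(hyponymy_penalized_loss(X, Y, Phi))
```

With α = 0, or when the second projection returns exactly x, the additional term vanishes and the value equals the baseline loss.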

Synonymy Penalization
Our second variation introduces negative sampling, i.e., it explicitly provides examples of synonyms z and penalizes the matrix for producing vectors similar to them.
The main obstacle to realizing this loss function is the introduction of the z term representing a synonym of the given word x, because certain words might have no synonyms. In such cases, we substitute z with x, gracefully reducing to the previous variation. Otherwise, on each batch, we sample a random synonym of the given word.
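The sampling rule described above can be sketched in a few lines of plain Python. The function name `sample_negatives` and the toy synonym dictionary are hypothetical and serve only as an illustration.

```python
import random

def sample_negatives(batch, synonyms, rng):
    """For each hyponym in the batch, pick a random synonym z, or the word itself if it has none."""
    return [rng.choice(synonyms[x]) if synonyms.get(x) else x for x in batch]

# Hypothetical toy synonym dictionary (word -> list of synonyms).
synonyms = {"cat": ["kitty", "puss"], "dog": ["hound"]}
rng = random.Random(0)
print(sample_negatives(["cat", "dog", "axolotl"], synonyms, rng))
```

Calling the function anew for every batch means that a word with several synonyms contributes different negative examples over the course of training.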

Hypernymy Promotion
Our third variation is designed to encourage the projection matrix to produce hypernyms not just for the initial hyponym, but also for its randomly sampled synonym z. This is motivated by the fact that in lexical ontologies the words are grouped into synsets (sets of synonyms) and the subsumptions are established between such synsets. So, both a hyponym and its synonym are supposed to have the same hypernym.
In the case of no synonyms being available, i.e., x = z, this variation gracefully reduces to the baseline setting.
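A minimal sketch of this loss follows. The exact form is an assumption: the baseline term plus β times the distance between the projected synonym zΦ and the hypernym y. The function names are hypothetical and the squared L2 distance is assumed.

```python
import numpy as np

def mean_sq_dist(A, B):
    """Mean squared L2 distance between corresponding rows of A and B."""
    return float(np.mean(np.sum((A - B) ** 2, axis=1)))

def hypernymy_promotion_loss(X, Z, Y, Phi, beta=0.3):
    """dist(X Phi, Y) + beta * dist(Z Phi, Y); rows of Z are sampled synonyms (assumed form)."""
    return mean_sq_dist(X @ Phi, Y) + beta * mean_sq_dist(Z @ Phi, Y)

# Toy data (assumption): random 4-dimensional "embeddings".
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
Y = rng.normal(size=(8, 4))
Phi = rng.normal(scale=0.1, size=(4, 4))

# With Z = X the loss is the baseline loss scaled by (1 + beta).
print(hypernymy_promotion_loss(X, X, Y, Phi), (1 + 0.3) * mean_sq_dist(X @ Phi, Y))
```

Under this assumed form, substituting z with x only rescales the baseline loss by a constant, so the minimizing matrix is the same as in the baseline setting.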

Implementation
Instead of the linear regression used to approach this problem [4], our implementation is based on a single-layer perceptron built with TensorFlow, an open source framework for machine learning [12] that supports both CPU and GPU computation of numerical optimization procedures. In particular, each input hyponym embedding x is extended with an additional bias dimension. Thus, the vector x' = (1, x1, …, x|x|) is used instead of the original vector. Accordingly, the projection matrix is now |x'| × |y|.
For minimizing the loss functions, we use the Adam stochastic optimization method [13].
We provide implementations of the loss functions for both the L2 and cosine distances, but our evaluation focuses only on the former due to the poor preliminary performance of the latter.
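The effect of the bias dimension can be shown with a short NumPy sketch (NumPy is used here instead of TensorFlow only to keep the example small; the helper name `add_bias` is hypothetical). The shapes follow the 500-dimensional embeddings and the batch size of 512 used in our experiments.

```python
import numpy as np

def add_bias(X):
    """Prepend a constant 1 to every row vector: x -> x' = (1, x1, ..., x|x|)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.zeros((512, 500))    # a batch of 512 hyponym embeddings
Phi = np.zeros((501, 500))  # projection matrix with an extra row for the bias
Y_hat = add_bias(X) @ Phi
print(add_bias(X).shape, Y_hat.shape)  # (512, 501) (512, 500)
```

The first row of the matrix acts as a bias vector: x'Φ equals the first row of Φ plus x multiplied by the remaining 500 rows.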

Experiments
In our experiments, we use the following openly available language resources for Russian:
• pre-trained word embeddings in the form of 500-dimensional vectors computed using the skip-gram architecture [2] with a context window of 10 words and a minimum word frequency of 5 (this model was among the best ones in the RUSSE evaluation campaign [14]);
• a set of subsumption pairs obtained automatically using Hearst patterns from a large text corpus [14,15];
• a set of subsumption pairs and synonyms derived from the Russian Wiktionary [16].
As suggested in [5], we split the train and test sets such that each contains a distinct vocabulary to avoid lexical overfitting of the models. As a result, the training set contains 21 997 examples and the test set contains 10 811 examples. The test set contains only examples from Wiktionary, while the training set is composed of the other sources as well. We ran 14 000 training epochs, each of which passes a batch of 512 examples to the optimizer. The dimensions of the projection matrix are 501 × 500. At the initialization stage, we initialize the elements of the projection matrix with N(0, 0.1). In the experiments, we study the performance of the loss functions operating with the L2 distance along with the benefit of the clustering.
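One possible way to obtain train and test sets with distinct vocabularies is sketched below. This is an illustration under our own assumptions, not the procedure used to build the released datasets: the function name `lexical_split`, the `test_share` parameter and the rule of discarding mixed pairs are all hypothetical.

```python
import random

def lexical_split(pairs, test_share=0.3, seed=0):
    """Split (hyponym, hypernym) pairs so that the train and test sets share no words."""
    vocab = sorted({word for pair in pairs for word in pair})
    rng = random.Random(seed)
    rng.shuffle(vocab)
    test_vocab = set(vocab[:int(len(vocab) * test_share)])
    train, test = [], []
    for x, y in pairs:
        if x in test_vocab and y in test_vocab:
            test.append((x, y))
        elif x not in test_vocab and y not in test_vocab:
            train.append((x, y))
        # Pairs that mix the two vocabularies are discarded.
    return train, test

# Hypothetical toy pairs.
pairs = [("cat", "feline"), ("lion", "feline"), ("dog", "canine"),
         ("wolf", "canine"), ("feline", "animal"), ("canine", "animal")]
train, test = lexical_split(pairs, test_share=0.5)
print(train, test)
```

Discarding the mixed pairs shrinks the data, which is the price of guaranteeing that no word seen during training appears at test time.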
Since the specificity of the relations differs in various regions of the embedding space, we employed the same clustering algorithm as described in [4, Section 3.3.2]. Initially, we estimated the number of clusters by maximizing the Silhouette score [17], but this approach led us to a suboptimal number of clusters, k = 2. Instead, we evaluated all the values of 1 ≤ k ≤ 10 to find the optimum on the test set. Each experiment was run five times to make it possible to assess the statistical significance of the results using the one-tailed t-test with a significance level of 0.025.
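The clustering step can be sketched as follows, assuming that the pairs are clustered by their vector offsets y − x with k-means, which is our reading of [4]. The k-means routine below is a plain textbook version with a deterministic farthest-point initialization chosen for reproducibility; it is not the released code, and the function names are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(np.argmax(d))])
    centroids = np.array(centroids, dtype=float)
    labels = None
    for _ in range(n_iter):
        dists = np.sum((points[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        new_labels = np.argmin(dists, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centroids are final
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster is empty
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

def cluster_pairs(X, Y, k):
    """Cluster subsumption pairs by their offsets y - x (assumption based on [4])."""
    return kmeans(Y - X, k)

# Toy data (assumption): two groups of pairs with clearly different offsets.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
offsets = np.vstack([np.full((20, 3), 5.0), np.full((20, 3), -5.0)])
Y = X + offsets + rng.normal(scale=0.1, size=(40, 3))
labels, _ = cluster_pairs(X, Y, k=2)
print(labels)
```

A separate projection matrix is then trained on the pairs of each cluster, and repeating the procedure for every k from 1 to 10 gives the settings compared in our evaluation.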

Quality Evaluation
In order to assess the quality of the model, we employed the following technique. For each subsumption pair (x, y) of a hyponym x and the related hypernym y in the test set, we selected the projection matrix Φk* assigned to the same cluster k as the given pair. Then, we computed the ten nearest neighbours of the projected vector xΦk*. The pair is considered matched if the word representing the gold hypernym y appears in the computed list of nearest neighbours NN10(xΦk*). In order to obtain the integrated quality score, we average the matches across the test set:

$$\mathrm{A@10} = \frac{1}{N} \sum_{(x, y)} \mathbb{1}\left(y \in \mathrm{NN}_{10}(x\Phi_{k}^{*})\right),$$

where N is the number of test examples and 1(·) is the indicator function. Intuitively, the A@10 measure is the probability of providing the correct hypernym among the ten nearest neighbours by projecting its related hyponym, which is previously unknown to the model.
ICBDA 2016
As a reference point, the list of the nearest neighbours of the non-transformed hyponym vector may also contain hypernyms [11]; using it directly yields A@10 = 0.0877 on our test set.
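The A@10 computation can be sketched in NumPy as follows. The function name and argument layout are hypothetical, and ranking the neighbours by cosine similarity is an assumption, since the neighbour search is not tied to a particular measure above.

```python
import numpy as np

def accuracy_at_k(X, gold, Phis, clusters, vocab_vectors, k=10):
    """Share of test pairs whose gold hypernym is among the k nearest neighbours of x Phi.

    X: test hyponym vectors (rows); gold: row indices of the gold hypernyms in
    vocab_vectors; Phis: one projection matrix per cluster; clusters: cluster id
    of each test pair.
    """
    unit = vocab_vectors / np.linalg.norm(vocab_vectors, axis=1, keepdims=True)
    hits = 0
    for x, y, c in zip(X, gold, clusters):
        projected = x @ Phis[c]
        # The norm of the projected vector does not change the ranking,
        # so this orders the vocabulary by cosine similarity.
        scores = unit @ projected
        nearest = np.argsort(-scores)[:k]
        hits += int(y in nearest)
    return hits / len(X)

# Toy vocabulary (assumption): 12 one-hot "embeddings"; Phi maps word i to word i + 1.
vocab_vectors = np.eye(12)
Phi = np.roll(np.eye(12), 1, axis=1)
X = vocab_vectors[:3]
print(accuracy_at_k(X, gold=[1, 2, 3], Phis=[Phi], clusters=[0, 0, 0],
                    vocab_vectors=vocab_vectors, k=1))
```

Passing the identity matrix as the only projection corresponds to the reference setting in which the neighbours of the non-transformed hyponym vector are inspected.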

Performance Study
Since TensorFlow was used for defining and executing the computation graph, we compared the CPU and GPU performance in our task. For our experiments, we used the following computational resources available on a single machine:
• Intel Xeon E5-2620 v2 @ 2.10GHz (32 GB of RAM), denoted as CPU;
• NVIDIA Tesla K20Xm, 2688 cores (6 GB of VRAM), denoted as GPU.

Results and Discussion
According to the evaluation results in Table 1, both of our variations that penalize hyponymy and synonymy outperform the baseline with statistical significance in most settings. Thus, we conclude that such penalization can provide the system with useful lexical information. However, hypernymy promotion, inspired by lexical ontologies, showed worse results than the baseline. Interestingly, no variation performed better than the baseline at k = 6, which we attribute to inconsistent clustering.
Table 1. Quality evaluation according to A@10; the best statistically significant result compared to the baseline in each setting is highlighted.
Fig. 1. Evaluation results according to the A@10 measure; the best result on all the variations is achieved at k = 9.

Increasing the number of clusters seems to be an efficient means of increasing the capacity of the machine learning model. However, we found that the results stopped improving after k = 9 (Fig. 1), which suggests that the training and test sets should be extended. Since clustering reduces the number of training items available per cluster, we had to use a relatively small batch size. To study the performance of the training procedure with respect to the batch size, we ran 1000 training epochs for the batch sizes of 512, 1024, 2048, 4096 and 8192. Table 2 shows the results of the performance study, confirming that under the present settings using a GPU makes the training process slower (Fig. 2), which is also due to the matrix size.
Fig. 2. Performance study of the baseline approach and the synonymy penalization approach involving negative sampling.
We also conducted a series of experiments with the cosine distance instead of the L2 distance, but in virtually all the settings the A@10 measure was one and a half times worse than with the L2 distance, while the training process took ten times longer. In these experiments, we removed the absolute value bars and replaced the negative distance terms in equations (2) and (3) with the positive values of cosine similarity, so that the loss function remains non-negative.

Conclusion
In this study, we developed three models for learning word subsumptions and evaluated them on several resources for the Russian language. We also presented open source software implementing the described approach, which is available under the terms of a libre license: https://github.com/dustalov/projlearn. Our datasets are available for other studies: http://ustalov.imm.uran.ru/pub/projlearn-ruwikt.tar.gz. To the best of our knowledge, this is the first study dedicated to learning subsumptions using word embeddings for the Russian language.
In further studies, we are interested in applying convolutional layers for capturing high-level features of word embeddings, increasing the number of neural network layers, and using the learned matrices to construct semantic hierarchies. We also plan to conduct a crowdsourcing experiment to compare our results with human judgements in order to reduce the dependence on the gold standard.

Table 2. Performance study: the total number of seconds spent per 1K training epochs on various batch sizes.