The Study of Graininess for Tibetan Named Entity Recognition

Tibetan named entity recognition (NER) is a fundamental part of Tibetan natural language processing and an important subtask of information extraction. In this paper, we survey the methods, results and problems of Tibetan NER, and we discuss which kind of token should be taken as the granularity for the Tibetan NER task. We run a comparative experiment on Tibetan person names, location names and organization names using two granularities: syllables and words. The results show that person names are recognized better with syllables than with words. Location names show only a small difference between the two granularities. Organization names, however, are better recognized with words.


Introduction
Named entity recognition (NER), also known as entity identification, entity chunking and entity extraction, is a subtask of information extraction that seeks to locate named entities in text and classify them into pre-defined categories such as the names of persons, locations and organizations, expressions of time, quantities, monetary values and percentages. Tibetan NER is a fundamental and key subtask of Tibetan information extraction and text mining in Tibetan natural language processing. NER has achieved good results in various languages, such as English. State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored 93.39% in F-measure, while human annotators scored 97.60% and 96.95%. Tibetan NER, however, started late. It has yielded a number of positive results but remains a young field with a series of open problems.
This paper summarizes the present situation of Tibetan NER and introduces the existing research methods, the results obtained and the remaining problems. From this summary, we find that two granularities are used for Tibetan NER: syllables and words. Syllables are the basic units of Tibetan, and words are formed from syllables by word segmentation.
The paper then examines the granularity problem in Tibetan NER through a comparative experiment. Apart from the granularity (syllables or words), all other settings are kept the same for Tibetan person names, location names and organization names. By comparing precision, recall and F-score, we determine which granularity is more suitable.
The rest of the paper is organized as follows. In section 2, we briefly describe the background of Tibetan NER, including its current situation, difficulties, methods, problems and directions for improvement. In section 3, we introduce the algorithm for Tibetan named entity recognition based on conditional random fields (CRF), which is the method used in our comparative experiments on granularity. We report our experimental results in section 4 and give our conclusions in section 5.

Tibetan Named Entity Recognition
The basic tasks of Tibetan natural language processing (NLP) include Tibetan word segmentation, Tibetan POS tagging and named entity recognition. There are now many practical methods for Tibetan word segmentation, such as the automatic Tibetan word segmentation scheme based on case-auxiliary words and continuation features [3], the Tibetan language word segmentation system [4], SegT [5], Yangjin [6] and TIP-LAS [7]. For Tibetan NER, however, the research results are still immature. Although it is a fundamental part of Tibetan NLP, many aspects of Tibetan NER need further study and improvement.

Introduction to Tibetan
Tibetan (བོད་ཡིག) refers to the written form of the Tibetan language. Each glyph is built around a core letter; the remaining letters are added before and after it or stacked above and below it, and together they form one complete unit. Text is written from left to right. The script has two styles, "head" and "headless". Tibetan is a phonetic alphabet with 30 consonants and 4 vowels. One Tibetan syllable can have 1 to 7 basic characters; if Sanskrit is considered, there may be more. The seven basic characters comprise a base character and a vowel, with the other characters added above, below, in front of, behind, and behind again relative to the base character [1] [2].
There are few types of punctuation in Tibetan. Syllables are separated by a small dot called the syllable node (་). Apart from the syllable node, the most common punctuation mark is the single vertical line (།), which serves as a full stop, a colon and other marks. A paragraph ends with a double vertical line (།།).
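The punctuation rules above suggest a simple way to obtain syllable-level tokens. The following is a minimal sketch, not the preprocessing used in this paper: the function names are hypothetical, and it assumes the vertical lines are encoded as U+0F0D (or U+0F0E for a precomposed double line) and the syllable node as U+0F0B.

```python
import re

TSHEG = "\u0f0b"  # syllable node
SHAD_RUN = re.compile("[\u0f0d\u0f0e]+")  # one or more vertical lines


def split_sentences(text):
    """Split text into sentence-like units on runs of vertical lines."""
    parts = SHAD_RUN.split(text)
    return [p.strip() for p in parts if p.strip()]


def split_syllables(sentence):
    """Split one unit into syllables on the syllable node."""
    pieces = sentence.split(TSHEG)
    return [p.strip() for p in pieces if p.strip()]


# Usage: the two syllables of the word for "Tibetan script".
word = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42"
print(split_syllables(word))
```

Real text may also contain spaces, digits and foreign characters, which this sketch does not handle.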

Methods of Tibetan NER
The methods of Tibetan NER can be divided into rule-based methods and supervised machine learning methods.

Rule-based methods
In the early days, the study of Tibetan NER relied on rule-based approaches. Yu

Difficulties in Tibetan NER
Tibetan belongs to the Sino-Tibetan language family. In theory, the natural language processing methods used for Chinese can be applied to Tibetan information processing.
In practice, however, the specific problems of Tibetan must be considered. The main difficulties in Tibetan NER are as follows. Tibetan has a complex phonetic writing system. The basic unit of a sentence is the syllable, and syllables are separated by the syllable node. One or more syllables constitute a word, but there is no explicit mark between one word and the next, so the boundaries of named entities are difficult to determine. In addition, the few punctuation types, essentially only the single vertical line (།) and the double vertical line (།།), make the units of analysis very long, which increases the difficulty for recognition algorithms.
There is no morphological difference between named entities and other words in Tibetan. In English, person names, location names and organization names begin with a capital letter and are therefore easy to extract; Tibetan has no such cue. Moreover, unlike Chinese person names, most Tibetan names do not include a family name, and the length of a name can range from a single syllable to twenty-six syllables.
Name dictionaries, labeled corpora and other related resources are insufficient. The main approach to Tibetan NER today is supervised learning, which requires a large-scale labeled corpus, but Tibetan resources are not easy to obtain.

Summary of Tibetan NER
All of these methods have demonstrated their feasibility and accuracy for Tibetan NER. However, because different methods were evaluated on different corpora, we cannot judge which method is better from the reported results alone. The current problems in Tibetan NER are the conflicts between Tibetan names and ordinary words, the misinterpretation of translations, and the difficulty of identifying the boundaries of Tibetan named entities.
In the future, building a large amount of manually annotated training data is an urgent need. The recognition accuracy of Tibetan named entities also needs to be improved by using syntactic information, extending and making full use of boundary information, expanding the translation thesaurus, and testing other possible recognition models such as the Support Vector Machine (SVM).

Tibetan NER based on CRF
Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field when the random variables $Y_v$, conditioned on $X$, obey the Markov property with respect to the graph: $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$.
What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets X and Y, the observed and output variables, respectively; the conditional distribution p(Y|X) is then modeled.
CRF has since become a widely used technique for named entity recognition in low-resource languages [17], such as Hindi, Bengali, Tamil, and Telugu [18].

Tibetan NER based on CRF
Tibetan NER can be defined as a sequence labeling problem: determining which label from a fixed label set each observation belongs to. Suppose a label sequence $y = (y_1, y_2, \ldots, y_n)$ is given, where $n$ is the length of the sequence. The Tibetan NE sequence is represented as $w = (w_1, w_2, \ldots, w_m)$, where $m$ is its length. The CRF model is defined as follows:

$$p(y \mid w) = \frac{1}{Z(w)} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_i, y_{i-1}, w) \right)$$

where $Z(w)$ is the normalization factor, determined by the observation sequence, $\lambda_k$ is the weight of the $k$-th feature function, and $f_k(y_i, y_{i-1}, w)$ is a feature function.
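To make the roles of Z(w), λk and fk concrete, the sketch below computes the probability of a label sequence by brute-force enumeration on a toy example. It is illustrative only: the label set, the two feature functions and their weights are hypothetical, the feature functions take the position i as an extra argument, and a practical CRF would compute Z(w) with dynamic programming rather than enumeration.

```python
import math
from itertools import product

LABELS = ["B-PER", "I-PER", "O"]  # hypothetical toy label set


def f_continue(y_i, y_prev, w, i):
    """Fires when I-PER follows B-PER or I-PER."""
    return 1.0 if y_i == "I-PER" and y_prev in ("B-PER", "I-PER") else 0.0


def f_orphan(y_i, y_prev, w, i):
    """Fires when I-PER appears without a preceding B-PER or I-PER."""
    return 1.0 if y_i == "I-PER" and y_prev not in ("B-PER", "I-PER") else 0.0


FEATURES = [f_continue, f_orphan]
WEIGHTS = [1.5, -2.0]  # hypothetical values of lambda_k


def score(y, w, features, weights):
    """Weighted sum of all feature functions over all positions."""
    total = 0.0
    for i in range(len(y)):
        y_prev = y[i - 1] if i > 0 else "<START>"
        for f, lam in zip(features, weights):
            total += lam * f(y[i], y_prev, w, i)
    return total


def probability(y, w, features, weights, labels=LABELS):
    """exp(score) divided by Z(w), the sum over all label sequences."""
    z = sum(
        math.exp(score(list(cand), w, features, weights))
        for cand in product(labels, repeat=len(w))
    )
    return math.exp(score(y, w, features, weights)) / z


w = ["s1", "s2", "s3"]  # placeholder syllables
print(probability(["B-PER", "I-PER", "O"], w, FEATURES, WEIGHTS))
```

Because Z(w) sums over every candidate label sequence, the probabilities of all sequences for a given w add up to one.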

Design of Feature Template
A CRF model obtains its feature set from a fixed pattern of feature templates; each template extracts features from a context window. The larger the window, the better the relationship between the current word and its context can be observed, and the more long-distance dependencies in the text can be captured. But when the window is too large, training takes a long time, which lowers overall performance.
Considering the correlation between a Tibetan NE and its context, the window length is set to 5, which achieves a balance between training time and recognition effect.
The feature templates used in the experiment are shown in Figure 1.
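As an illustration of what a window of length 5 means, the sketch below extracts features from the two tokens before and after the current position. The feature names, the padding symbols and the choice of bigram features are assumptions for this example and are not the templates of Figure 1.

```python
def window_features(tokens, i, half_width=2):
    """Features for position i from a window of 2 * half_width + 1 tokens."""
    feats = {}
    for offset in range(-half_width, half_width + 1):
        j = i + offset
        if j < 0:
            tok = "<BOS>"  # padding before the sequence starts
        elif j >= len(tokens):
            tok = "<EOS>"  # padding after the sequence ends
        else:
            tok = tokens[j]
        feats[f"tok[{offset}]"] = tok
    if half_width >= 1:
        # Hypothetical bigram features joining adjacent tokens.
        feats["tok[-1]|tok[0]"] = feats["tok[-1]"] + "|" + feats["tok[0]"]
        feats["tok[0]|tok[1]"] = feats["tok[0]"] + "|" + feats["tok[1]"]
    return feats


tokens = ["s1", "s2", "s3", "s4", "s5"]  # syllables or words
print(window_features(tokens, 2))
```

The same function applies to either granularity: only the contents of `tokens` change between the syllable-based and word-based settings.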

Corpus Pretreatment
In the experiment, the corpus is built from text published on the website of the Tibetan version of People's Daily Online. The training data is 5.97M and the test data is about 700K; they include 2546 person names, 6469 location names and 4049 organization names. To examine the effect of different granularities on Tibetan NER, we designed a comparative experiment using CRF on Tibetan person names, location names and organization names, based either on syllables or on words. The syllable-based tags are: B-PER (first syllable of a person name), I-PER (a syllable in a person name other than the first), B-LOC (first syllable of a location name), I-LOC (a syllable in a location name other than the first), B-ORG (first syllable of an organization name), I-ORG (a syllable in an organization name other than the first) and O (all remaining syllables). The word-based tags are: PER (person names), LOC (location names), ORG (organization names) and O (all remaining words).
We use Precision (P), Recall (R) and F1, which are very common in NLP evaluation, to evaluate the performance of each granularity.
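The two tagging schemes can be sketched as follows. This is a minimal illustration with placeholder tokens; the function names and the span representation (start and end indices, end exclusive) are assumptions, not the annotation tooling used for the corpus.

```python
def syllable_tags(n_syllables, entities):
    """Syllable-level tags from (start, end, type) spans, end exclusive."""
    tags = ["O"] * n_syllables
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for k in range(start + 1, end):
            tags[k] = "I-" + etype
    return tags


def word_tags(n_words, entity_types):
    """Word-level tags from a mapping of word index to entity type."""
    return [entity_types.get(i, "O") for i in range(n_words)]


# A three-syllable person name at syllable positions 1 to 3.
print(syllable_tags(5, [(1, 4, "PER")]))
# The same name as a single word at word position 1 after segmentation.
print(word_tags(3, {1: "PER"}))
```

In the word-based scheme a name receives a single tag, so its boundaries depend entirely on the preceding word segmentation step.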
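For reference, the three measures can be computed as in the sketch below. It assumes entity-level exact matching, where an entity counts as correct only if its span and type both agree with the gold annotation; the matching criterion and the sample spans are assumptions for this example.

```python
def precision_recall_f1(gold, predicted):
    """Return (P, R, F1) for two collections of (start, end, type) entities."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


# Hypothetical spans: two of three predictions match four gold entities.
gold = [(0, 2, "PER"), (5, 6, "LOC"), (8, 11, "ORG"), (12, 13, "PER")]
pred = [(0, 2, "PER"), (5, 6, "LOC"), (8, 10, "ORG")]
print(precision_recall_f1(gold, pred))
```

F1 is the harmonic mean of P and R, so a large drop in either one pulls the combined score down sharply.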

Result and Analysis
The results of the experiment are given in Table 1. We label "Words as the graininess" as group A and "Syllables as the graininess" as group B. The following conclusions can be drawn from the experimental results in Table 1.
The precision of group A is higher than that of group B, but its recall falls short of group B, except for ORG. F1, which reflects the overall effect, differs greatly for PER and ORG: for PER, the score of group B is 28.08% higher than that of group A, but for ORG, B is 9.99% lower than A. A likely explanation is that syllables are a finer granularity, so more data can be obtained, which has a great influence on recall. This also suggests that the syllable-based approach, because of its smaller unit size, can partially alleviate data sparseness.
For PER, although the precision of group A is significantly higher than that of group B, its recall drops sharply, resulting in a poor F1, while the results of group B are stable. This means that Tibetan person names should be identified on the basis of syllables, which achieves good results without Tibetan word segmentation.
For LOC, the precision of group A is better than that of group B, but its recall is lower. The F1 scores of A and B differ only slightly, so we conclude that either method can be used to identify Tibetan location names. With the syllable-based method, however, Tibetan word segmentation can be omitted.
For ORG, all evaluation results of group A are higher than those of group B. This indicates that better results can be achieved for organization names with Tibetan word segmentation. Based on our analysis, organization names are complex, often nested, and contain more syllables, which makes boundary identification difficult on the basis of syllables. With the word-based method, the boundaries are already fixed by segmentation.

Conclusions
In this paper, to determine which kind of token should be taken as the granularity for Tibetan NER, we ran a comparative experiment on granularity. We used two granularities, syllables and words, to identify three kinds of named entities: Tibetan person names, location names and organization names. The results show that syllables are the better granularity for person names, that words are more suitable for organization names, and that both fit location names. In other words, person names and location names can be identified with very good results without Tibetan word segmentation. This does not hold for organization names.
et al. used a rule-based model built on case-auxiliary words and a lexicon, and also used a boundary information list compiled from a large corpus to improve recognition. Experiments show that recall and precision are 90.13% and 94.02% respectively on the newspaper corpus, and 85.67% and 88.20% on the website text. Sun et al. used the internal features, contextual features and boundary features of names, and established a dictionary and feature base of Tibetan names. The results show that the algorithm is effective, with an F-score of 0.8391. Dou et al. combined a statistical method based on mutual information with case-auxiliary word rules and a dictionary of person names; the F-value in the test reaches 93.55%.

Supervised machine learning methods
After 2014, supervised machine learning methods were increasingly applied to Tibetan NER. Jia et al. used maximum entropy (ME) and conditional random fields (CRF), and the F-score for name recognition reaches 92.08%. Hua et al. proposed a perceptron model trained on syllable features to identify Tibetan named entities, based on a detailed analysis of NE structure rules and word segmentation ambiguity. The F-score of NE identification is 86.03% on the test set. Kang et al. defined a feature tag set suited to the characteristics of Tibetan names and used CRF as the tagging model for training and testing on corpus data. The highest F-score obtained in the experiment reaches 94.31%. Zhu et al. studied Tibetan name recognition using conditional random fields (CRF), focusing on the internal structure of Tibetan names, contextual features, feature selection and data preprocessing, and evaluated the effectiveness of different features through experiments. The F-score for Tibetan name recognition reaches 80%.