An Accurate Topic Mining Algorithm Based on Business Dictionary

Abstract: Text mining is an important branch of data mining, and many research institutions and teams are actively exploring it and proposing algorithms. Because industries and scenarios differ, it is difficult for a general-purpose log analysis algorithm to mine the latent information accurately. For example, given a topic in one scenario, finding its main related words is not easy. To address this problem, this paper presents an accurate topic mining algorithm based on a business dictionary. In the algorithm, after the valid documents are screened, the document set is segmented with the business dictionary: the documents are split into professional terms and the invalid words are removed. Finally, the qualitative analysis is transformed into a quantitative one: with a relevance index, the relevance degree of every word is computed, and the resulting relevance matrix is returned to the user to analyze how the words relate to the topic. The algorithm has been applied to PMS, and the validation results show that the main related factors can be analyzed accurately.


Introduction
With the expansion of the breadth and depth of data, the potential value of data, which can create wealth, is recognized by more and more enterprises, governments and other research communities [1]. As the key step of knowledge discovery from data [2], data mining [3,4,5,6] is one of the current research focuses, and text mining [7,8] is an important branch of it. Many algorithms have been proposed for text analysis. However, in the electric power domain, the descriptions in operation and maintenance logs involve a great deal of related equipment, inconsistent wording, unfixed topics and many related fields, so it is difficult for existing text mining algorithms to extract their full value. How to achieve accurate mining of such data is therefore becoming a research hotspot. In text mining, the text data is stored in a database or file as semi-structured or unstructured data, and its value is hard to dig out because of the semantic information hidden in it. At present, the algorithms proposed by research institutions and teams fall mainly into two types. The first is text clustering [9][10][11][12][13][14]. Cluster analysis [9] is one of the important methods for realizing text mining. The literature [10,11] provides two fuzzy clustering methods based on weighted features: a feature weight vector, which reflects the internal structure of the data set, is obtained by a supervised or unsupervised learning process, and a feature-weighted distance function is then formed. Another representative method is the automatic feature weighting technique [12,13]; in K-Means or FCM, the feature weight vector indicates the importance of each feature over the whole data set.
In addition, the literature [14] proposes a fuzzy clustering algorithm that integrates the feature weighting metric into the framework of soft subspace learning. In these algorithms, because different topics have different key words and each key word describes its topic with different strength, it is difficult to find the best feature vector and weights.
The second type is topic mining [15][16][17][18][19][20][21][22][23][24]. In the PLSA (Probabilistic Latent Semantic Analysis) model [17,18], the probabilities are tied to specific documents, so the model has defects when dealing with new documents and is prone to over-fitting. The most famous topic model is LDA (Latent Dirichlet Allocation) [19,20,21]. In this model, the document set is the input; by setting appropriate parameters, the final topics and the word distribution within each topic are obtained. On the basis of LDA, extended models (Twitter-LDA [22], Labeled-LDA [23], MB-LDA [24], etc.) have been developed for particular scenarios. With these algorithms, topics can be mined and topic-related words enumerated. However, business experts still need to spend time working out what each topic is and whether the listed words are really related to it. Furthermore, the flexibility of these algorithms is low: the topic-related words cannot be obtained for an arbitrary, user-specified topic.
The above algorithms solve many text mining problems, but they do not satisfy the text mining demand of the electric power domain. In our scenario, the log data is stored in a semi-structured database, and every log entry describes the operation and maintenance of power network equipment and systems (for example, equipment failures, maintenance procedures, etc.). The demand is that, for a given topic, the semantically related words can be found, together with the strength of their semantic correlation. The potential value of the data can then be used for the design and planning of the power network. The above algorithms cannot dig out such an accurate result.
To address these problems, we propose a topic mining algorithm with a business dictionary that precisely finds the value in the logs. Based on topic mining theory and following the document-to-word process, the algorithm uses natural language segmentation to compute the semantic impact factor of each word. Then, with the help of the business dictionary, the accurate semantically related words are returned to the business expert, making it convenient to discover the potential knowledge.
The advantages of this algorithm are:
1. Searching topic-related words with semantic technology. For a given topic, the semantically related words are found accurately by combining semantic theory, conditional probability and the business dictionary.
2. Precisely digging out the words with the business dictionary. The interference of a large number of unrelated words is avoided, saving the time and effort of searching for the business words among all words.
3. Reducing the number of documents with the topic. The irrelevant documents are excluded by the topic, so efficiency improves markedly.
This paper is organized as follows. Section 2 states the topic mining theory and the evaluation criterion. The topic mining algorithm with the business dictionary is proposed and analyzed in Section 3. Section 4 verifies the solution on PMS. The last section draws conclusions.

Topic Mining Theory
For a given topic, the ultimate objective of topic mining is to find the optimal set of semantically related words. Because semantically related words usually occur in the same sentence as the topic word, conditional probability is used to compute the word-topic semantic-related degree.
In a document set, it is rare for different topics to have entirely disjoint words; usually a word belongs to two or more topics, as shown in Figure 1. Because the disjoint case is simple, this paper discusses the case where words belong to several topics. When word A and word B appear together in one sentence, they are said to co-occur. The semantic-related degree set is

O = {O_1, O_2, ..., O_n},

in which O_i indicates the semantic-related degree of word w_i with respect to the topic word t, defined as the conditional probability of co-occurrence:

O_i = P(w_i | t) = C(w_i, t) / C(t),

where C(w_i, t) is the number of sentences in which w_i co-occurs with t, and C(t) is the number of sentences containing t.
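As an illustration, the co-occurrence-based degree can be sketched in Python (the paper's experiments use R; this sketch, its function names and its toy sentence data are hypothetical):

```python
from collections import Counter

def semantic_degree(sentences, topic):
    """Estimate O_i = C(w_i, t) / C(t): the fraction of topic sentences
    in which word w_i co-occurs with the topic word t.
    Each sentence is given as a list of words."""
    topic_sentences = [s for s in sentences if topic in s]
    if not topic_sentences:
        return {}
    co_counts = Counter()
    for s in topic_sentences:
        for w in set(s):              # count each word once per sentence
            if w != topic:
                co_counts[w] += 1
    n = len(topic_sentences)
    return {w: c / n for w, c in co_counts.items()}

# Toy example: three tokenised sentences
sents = [["switch", "trip", "lightning"],
         ["switch", "trip", "fault"],
         ["line", "repair"]]
degrees = semantic_degree(sents, "trip")
# "switch" co-occurs with "trip" in both topic sentences, so O = 1.0
```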

Evaluation Criterion
For a given topic, the above algorithm yields many words, each making a different contribution to the topic; the contributions of all words sum to 100%. Given the user's threshold, the fewer words needed to reach it, the better the effect. A proportionality index, the cumulative contribution share of the selected words, is therefore used to describe the result, and the user evaluates it in combination with the actual business situation.
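The criterion can be sketched as selecting the smallest prefix of the ranked words whose cumulative contribution reaches the user's threshold (a hypothetical Python sketch; the function name and toy contribution values are illustrative):

```python
def top_words_by_threshold(contrib, threshold=0.9):
    """Return the fewest top-ranked words whose cumulative contribution
    share reaches the user's threshold."""
    ranked = sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(contrib.values())
    selected, cum = [], 0.0
    for word, share in ranked:
        selected.append(word)
        cum += share / total
        if cum >= threshold:
            break
    return selected

# Toy contributions summing to 1 (hypothetical values)
contrib = {"coincide": 0.5, "lightning": 0.3, "bird": 0.15, "noise": 0.05}
top = top_words_by_threshold(contrib, 0.9)
# cumulative shares: 0.5, 0.8, 0.95 -> three words reach the 90% threshold
```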

Topic Mining Algorithm with Business Dictionary
The general principle of the algorithm is that the original data must not be changed and interference from man-made factors should be avoided. The topic mining algorithm is designed on this basis.
By using the topic set, the total number of documents is reduced and only topic-related documents are kept. Then, by means of the business dictionary, the accurate words are extracted and saved. Finally, on the basis of the relevance index, the word-topic semantic relevance is obtained and the sorted result is returned to the user. The algorithm has three steps.

The first step: screening of the document set. There are many topics in the document set; when one or more of them are to be analyzed, the related documents must be filtered out first, which reduces the analysis scope and improves efficiency. The filtering method is a fuzzy search over all documents based on the topic (or related-topic) set: if a document contains a topic, the document is related; otherwise it is independent. According to the topic set, the irrelevant documents are discarded. Figure 3 shows the process of the screening algorithm. The documents in the resulting subset are related to the topic; if the subset is empty, no document is related to the topic.
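The screening step can be sketched in Python as a substring filter mimicking a SQL `LIKE '%topic%'` condition (an illustrative sketch; the function name and sample documents are hypothetical):

```python
def screen_documents(documents, topics):
    """Keep only documents that contain at least one topic string,
    mimicking a fuzzy LIKE filter over the document set."""
    return [d for d in documents if any(t in d for t in topics)]

docs = ["switch trip after lightning strike",
        "routine maintenance of tower",
        "phase fault on line 3"]
related = screen_documents(docs, ["switch trip", "phase fault"])
# the maintenance record contains no topic and is discarded
```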
The second step: segmentation with the business dictionary. Different industries use different technical terms, and within one industry different scenarios use specific terms. If these terms, which are not included in a general dictionary, cannot be identified, the valid information hidden in the documents is lost and part of the data value cannot be found. To avoid this loss, the defect is remedied at this stage. The business dictionary is compiled by specialized personnel and contains industry terms, scenario-specific terms and some common terms. The segmentation process with the business dictionary is: split the document set by dictionary-based word segmentation, then filter out function words, figures, and so on. The reserved word set is the effective basis for precise topic mining.
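One simple way to realize dictionary-based segmentation is greedy longest match against the business dictionary, with stopword filtering afterwards (a minimal sketch under that assumption; the paper does not specify the segmenter, and the dictionary and input below are hypothetical):

```python
def segment(text, dictionary, stopwords=frozenset()):
    """Greedy longest-match segmentation against a business dictionary.
    Characters matching no dictionary entry are skipped; stopwords
    (function words, single letters, figures) are filtered out."""
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            cand = text[i:j]
            if cand in dictionary:
                if cand not in stopwords:
                    words.append(cand)
                i = j
                break
        else:
            i += 1        # no dictionary word starts here; skip one char
    return words

business_dict = {"switchtrip", "lightning", "the"}
tokens = segment("thelightningcausedaswitchtrip",
                 business_dict, stopwords={"the"})
# only the business terms survive: ["lightning", "switchtrip"]
```

In practice a production segmenter (e.g. a general tokenizer extended with a user dictionary) would replace this toy matcher, but the principle of privileging business terms is the same.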
The detailed procedure of this step is illustrated in the accompanying figure.

The third step: relevance index computation. Because the words and the topic appear in the same documents, the words describe the topic from different aspects or angles, and the correlation degree between each word and the topic differs: the higher the correlation, the more intimate the word and the topic. In this way an accurate characterization can be obtained from the word set, and the related words reflect the customer's focus to a certain degree. In this paper, the relevance index describes this relationship as the ratio of the word frequency to the topic frequency:

R = WF / TF,   (3)

in which R indicates the relevance, WF indicates the word frequency, and TF indicates the topic frequency.
After computing R for all words, a data matrix is formed that describes the relevance of each word to the topic. The data matrix is returned to the user as the result; on its basis, the data can be displayed as a network diagram, etc.
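Under one plausible reading of formula (3), with WF counted over the topic documents, the relevance matrix can be sketched as follows (an illustrative Python sketch; the function name and toy segmented documents are hypothetical):

```python
from collections import Counter

def relevance_matrix(segmented_docs, topics):
    """For each topic t, compute R = WF / TF for every word w:
    WF = number of topic documents containing w,
    TF = number of documents containing t.
    Returns {topic: {word: R}}."""
    matrix = {}
    for t in topics:
        topic_docs = [d for d in segmented_docs if t in d]
        tf = len(topic_docs)
        if tf == 0:
            matrix[t] = {}
            continue
        wf = Counter(w for d in topic_docs for w in set(d) if w != t)
        matrix[t] = {w: c / tf for w, c in wf.items()}
    return matrix

docs = [["switch trip", "coincide", "lightning"],
        ["switch trip", "coincide"],
        ["phase fault", "lightning"]]
m = relevance_matrix(docs, ["switch trip", "phase fault"])
# "coincide" appears in both "switch trip" documents, so its R is 1.0
```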
Algorithm analysis: by appointing a topic and analyzing it over the document set, the user obtains the relevant words, so the relationship between the topic and its related words is shown accurately, and minor topics are not ignored.

The algorithm analyzes only the appointed topic rather than all topics, so workload is saved and efficiency is improved.

With the business dictionary, the specific business words are retained, which helps grasp the topic accurately; at the same time, high-frequency but unrelated words are removed, so the related words are brought into focus.

Using the relevance index formula, the relevance between topic and word serves as a quantitative analysis index that supports the user in making policy decisions.

Experiment and Analysis
The above solution has been applied to the power production management system (PMS) of State Grid of China.
The PMS is the most advanced power production management system in the world. It is one of the SG186 engineering applications, which are among the most massive and complex applications. The investment in PMS 1.0 is over a billion, and that in PMS 2.0 is much larger. Including the low-voltage data, the total amount of data in PMS 2.0 is over 150 billion. Within these data, the logs are very large and contain much information, such as grid operation and fault information, including a lot of implicit information that has not yet been discovered. In this paper, the fault data is selected to verify the above algorithm.
In the PMS of a certain province, the log contains a lot of fault and defect data on main transformers, lines, towers and other equipment. The log describes the related equipment, phenomenon, process, consequence and failure analysis. From these data, we can analyze the main and secondary factors, and direct or indirect causes may be obtained.
The verification is carried out on the real data in the R language [25] environment. The main analysis process is as follows:

Preprocess Stage
The fault log is stored in a structured database, and three fields are selected for the analysis, as described in Table 1.

Table 1: the description of the three fields

  No   Field name   Description
  1    DSMCE        city name
  2    JSYC         reason of failure
  3    GZQFK        the description information of the failure

In the selected log data, null values have no influence on the analysis result, so all null values are deleted.
To determine the topic set, the categorical field (JSYC) is analyzed and the value "40499" is selected as the analysis object. The data under this value is then analyzed with a word cloud algorithm, and the most prominent words are selected as topics. The topic set is: {switch trip, phase fault, lightning}.
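Reading the biggest words off a word cloud amounts to taking the most frequent words, which can be sketched as follows (an illustrative Python sketch; the function name and toy records are hypothetical):

```python
from collections import Counter

def pick_topics(token_lists, k=3):
    """Pick the k most frequent words as the topic set, a simple
    stand-in for selecting the biggest words in a word cloud."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    return [w for w, _ in counts.most_common(k)]

records = [["switch trip", "lightning"],
           ["switch trip", "phase fault"],
           ["phase fault", "lightning"],
           ["switch trip"]]
topics = pick_topics(records, 3)
# "switch trip" occurs most often and ranks first
```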

Verification Process
First step: screening of the document set. Using the topic set, all documents are filtered with the fuzzy algorithm, whose filter condition is "like": if a document contains a topic, the document is retained. This leaves 4,528 valid documents.

Second step: segmentation with the business dictionary. Based on the State Grid and the fault scenario, the professionals formulate the business dictionary.
The segmentation result contains many invalid words, including function words, single letters, figures, and so on, as shown below.
Because there are a great many words after segmentation, the words with higher frequency are selected for the word frequency charts; the top 150 words appear in each chart. From the charts we can see that, with the business dictionary, the main factor set {switch trip, phase fault, lightning} is easy to identify.
Third step: with formula (3) above, each topic is analyzed in turn and the relevance index of every word is computed. The user's threshold is 90%; using the evaluation criterion formula, we find that the proportionality of the top 10 words exceeds 90%, so the top 10 are selected for analysis. For example, the result for the topic "switch trip" is shown below: from the figure, we can see that the word frequency of "fail to coincide" is closest to the topic "switch trip". The result is returned to the user and, based on it, displayed with the plot function of the igraph package, as shown in the following figure. The business professionals can then analyze the real influencing factors. For example, several lines relate to the topic "switch trip", indicating that faults on these lines often cause switch trips.

Analysis: in the resulting diagram, the words most relevant to the topic are shown with thick lines, so it is convenient for business workers to distinguish the main factors from the secondary factors accurately.
Screening the document set with the topic set reduces the number of documents, so the efficiency of topic analysis on the smaller document set improves evidently.

Segmentation with the business dictionary retains more of the potential information in the documents, helping the professionals analyze the relevant words accurately.

Using the relevance index, the qualitative analysis is transformed into a quantitative analysis, on the basis of which the degree of relevance can be described accurately.

Conclusion
Because it is difficult to accurately find the words related to a topic in log data, this paper presents an accurate topic mining algorithm based on a business dictionary. In the algorithm, the valid document set is first filtered with the topic set using a fuzzy algorithm. After the industry- and scenario-specific business dictionary is created, the documents are segmented into words and the invalid words are removed. Finally, the relevance of every word is computed with the relevance index, transforming the qualitative analysis into a quantitative one. The algorithm has been applied to PMS, and the validation results show that the main factors related to a topic can be analyzed accurately.
Although the algorithm solves the problem of analyzing log data accurately, a business expert is needed to draw up the business dictionary, and the dictionary's quality is constrained by the expert's ability. Furthermore, how to improve the efficiency of the algorithm in a distributed environment is a direction for future research.