Mining Telecommunication Circles via the Call Record and Short Messages

Telecommunication circles are groups of similar customers in telecommunication networks. Mining such circles provides telecommunication operators with great value in developing prospective customers while retaining existing ones. However, most existing community detection algorithms rely mainly on the structure of the complex network and ignore the strength of relationships. This paper improves the classic CPM (Clique Percolation Method) algorithm by taking into account both call records and short messages, and proposes a new algorithm called SR_CPM (Strengthened Relationship CPM). The new algorithm is applied to telecommunication networks and demonstrates superior effectiveness.


Introduction
Telecommunication users constitute a huge but relatively sparse social network. Community discovery has been a very active research topic in data mining for many years. Telecommunication circles are groups of similar customers in telecommunication networks. Finding these circles is of great value to operators, helping them design more attractive pricing packages and obtain prospective customers while retaining existing ones. However, current community discovery algorithms in the literature mainly focus on people's personal behaviors and population attributes. This paper employs information such as call records and short messages, and presents an improved CPM algorithm called SR_CPM (Strengthened Relationship CPM). A group of experiments is designed, and the results show that the new algorithm is effective and efficient.
In this paper, part two introduces the related work, part three introduces the improved CPM algorithm, and part four introduces the application of SR_CPM to a telecommunication data set.

Related work
Complex networks have the characteristics of "small world", "scale-free", and community structure. Many social networks in real life can be abstracted as complex networks, and telecommunication networks are typical examples. Farrahi et al. [2] studied location-driven daily behavior patterns using a classification method. Kim et al. [10] set up a logistic regression model to handle Korea Telecom data. Community discovery technology is especially well suited to finding the community structure in complex networks. Early community discovery algorithms are derived from graph theory, like spectral bisection community discovery [3], or from hierarchical clustering, which introduces a measure Q [4] to reflect the degree of modularity of a community division. Based on this concept, Girvan and Newman proposed the GN algorithm [1], which divides the network into communities by repeatedly deleting the edge with the largest edge betweenness. In addition to the non-overlapping community detection algorithms mentioned above, researchers have also put forward a series of overlapping community discovery algorithms that employ the ideas of clique filtering [5], seed expansion [6], hybrid probability models [7], and edge detection [8], etc.

Existing Problems of Community Discovery Algorithm
Although many community detection algorithms have been proposed, finding the community structure of a complex network is still challenging. Many problems remain to be solved.
1. Most of the current algorithms are based on static networks.
2. The effectiveness and performance of community discovery algorithms remain a problem. It is a great challenge to design highly effective algorithms with low time complexity.
3. Most of the existing community discovery algorithms only consider the connections between nodes while ignoring the strength of the connections and the inherent attributes of the nodes.

Improved Clique Filtering Community Discovery Algorithm
The main contribution of this paper is an improved CPM algorithm, called SR_CPM, which is better suited to community discovery in the telecom user call network.

Algorithm idea
There are mainly two kinds of methods to extend CPM to weighted networks.
a) Set a global threshold $w$, remove the edges with weights less than $w$, and then use the traditional CPM algorithm to partition the network into communities.
b) Farkas et al. [9] proposed a clique intensity function for CPM. For a k-clique C containing k(k-1)/2 edges with weights $w_{ij}$, the clique intensity is defined as the geometric mean of its edge weights:
$$I(C) = \left( \prod_{i<j;\; i,j \in C} w_{ij} \right)^{2/(k(k-1))}$$
Cliques whose intensity is less than some prescribed value are ignored.
The first method is simple, but the choice of the threshold $w$ is difficult and greatly influences the quality of the community partition. The second method is generally more effective but is computationally expensive. Moreover, the intensity function is an absolute index, so its threshold is not easy to set either. This paper introduces a simple and relative measurement.
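The two existing extensions can be illustrated with a minimal sketch. It assumes that edge weights are stored in a dictionary keyed by sorted node pairs and that the clique intensity is the geometric mean of the edge weights; the function names and the toy weights are hypothetical and are not taken from the original implementation.

```python
import math


def filter_edges(weights, w_min):
    """Method a): keep only edges whose weight reaches the global threshold."""
    return {edge: w for edge, w in weights.items() if w >= w_min}


def clique_intensity(clique, weights):
    """Method b): geometric mean of the edge weights inside a clique.

    Assumes every pair of clique nodes has a positive weight in `weights`.
    """
    nodes = sorted(clique)
    edge_weights = [
        weights[(u, v)] for i, u in enumerate(nodes) for v in nodes[i + 1:]
    ]
    # Average the logarithms to avoid overflow when multiplying many weights.
    log_mean = sum(math.log(w) for w in edge_weights) / len(edge_weights)
    return math.exp(log_mean)


# Toy weighted network (illustrative values only).
weights = {("a", "b"): 1.0, ("a", "c"): 4.0, ("b", "c"): 2.0, ("c", "d"): 0.5}
kept = filter_edges(weights, 1.0)
intensity = clique_intensity({"a", "b", "c"}, weights)
```

Both quantities depend on the absolute scale of the weights, which is why their thresholds are hard to choose without knowing the weight distribution of the network.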

Definitions
Definition 1: Coefficient of Variation. The coefficient of variation is the ratio of the standard deviation $\sigma$ to the mean $\mu$:
$$c.v = \frac{\sigma}{\mu}$$
The advantage of the coefficient of variation over other statistical indices is that it is relative and easy to calculate.
Definition 2: Weighted k-clique based on the coefficient of variation (k-clique-w_c.v).
Let G = (V, E, W) be a weighted network, where W represents the edge weights. A complete subgraph G' = (V', E', W') with k nodes has a total of k(k-1)/2 edges. If the coefficient of variation of the edge weights W' is less than some prescribed threshold C.V*, we call the k nodes a weighted k-clique based on the coefficient of variation. Similarly, the weighted k-clique based on the standard deviation is denoted by k-clique-w_σ.
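As an illustration of Definition 2, the following sketch tests whether a set of nodes forms a weighted k-clique under a coefficient-of-variation threshold. It assumes positive edge weights stored in a dictionary keyed by sorted node pairs and uses the population standard deviation; the paper does not state which form of the standard deviation is used, so that choice and all names here are assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Ratio of the (population) standard deviation to the mean."""
    return pstdev(values) / mean(values)


def is_weighted_clique_cv(nodes, weights, cv_threshold):
    """Return True if `nodes` are fully connected and the c.v of their
    edge weights is below `cv_threshold`."""
    edge_weights = []
    for u, v in combinations(sorted(nodes), 2):
        w = weights.get((u, v))
        if w is None:  # a missing edge means this is not a complete subgraph
            return False
        edge_weights.append(w)
    return coefficient_of_variation(edge_weights) < cv_threshold


# Two triangles: one with uniform weights, one with a dominant edge.
even = {("a", "b"): 2.0, ("a", "c"): 2.0, ("b", "c"): 2.0}
uneven = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 10.0}
```

With a threshold of 0.5, the uniform triangle is accepted (c.v = 0) while the uneven one is rejected (c.v is about 1.06), although both are 3-cliques in the unweighted sense.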

Algorithm Procedure
The SR_CPM algorithm has three steps. Firstly, find all the cliques that are not included in other cliques.
a) Calculate the degree of each node in the network, record the largest value as g-1, and go to b).
b) Set C* to be the collection of all nodes in the network and go to c).
c) Take a node $v_i$ from C* at random. For $v_i$, we define two collections A and B. A is the collection of mutually connected nodes, containing $v_i$, that is built up during the execution of the algorithm, and B is the collection of nodes that are connected to every node in A. Then go to d).
d) Recursively find all the cliques that include $v_i$ and whose size is g. Let |X| stand for the number of elements in a collection X.
1) Initialize A = {$v_i$} and B = {the neighbors of $v_i$}.
2) Each time, move a node from set B to set A and adjust set B by deleting the nodes that are no longer connected to all the nodes in set A.
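The move in step 2) can be sketched as follows. The sketch assumes the network is stored as a dictionary of neighbor sets; because every node in B is already connected to all nodes in A, only adjacency to the newly moved node needs to be checked. The function name and the toy network are hypothetical.

```python
def move_to_clique(a, b, node, adjacency):
    """Move `node` from the candidate set B into the partial clique A and
    drop from B every node that is not adjacent to `node`."""
    new_a = a | {node}
    new_b = {u for u in b if u != node and u in adjacency[node]}
    return new_a, new_b


# Toy network: node 1 is adjacent to 2, 3 and 4; 2-3 and 3-4 are also edges.
adjacency = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
a, b = {1}, set(adjacency[1])           # initial A and B for the start node 1
a, b = move_to_clique(a, b, 2, adjacency)
```

After moving node 2, A is {1, 2} and B shrinks to {3}, since node 4 is not connected to node 2.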

Experimental data
The data set is the classical complex network data set Les Miserables. It is a character relationship network constructed from the relationships among the characters in the novel Les Miserables. Each node in the network stands for a character in the novel. If two characters appear in the same chapter, there is an edge between the two corresponding nodes. The weight of the edge stands for the number of chapters in which the two characters appear together. There are 77 nodes and 253 edges in the data set.
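The construction of such a co-appearance network can be sketched as below. The chapter lists are made-up toy data rather than the actual data set, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(chapters):
    """Count, for each pair of characters, the chapters they share."""
    weights = Counter()
    for characters in chapters:
        # Sort so that each undirected pair has a single canonical key.
        for u, v in combinations(sorted(set(characters)), 2):
            weights[(u, v)] += 1
    return weights


# Toy input: one list of character names per chapter (illustrative only).
chapters = [
    ["Valjean", "Javert"],
    ["Valjean", "Cosette", "Javert"],
    ["Cosette", "Marius"],
]
weights = cooccurrence_weights(chapters)
```

Here the edge between Javert and Valjean gets weight 2 because the two characters share two chapters, while every other edge has weight 1.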

Evaluation function
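The evaluation function is EQ, an extension of the modularity Q that evaluates the quality of a partition with overlapping communities:
$$EQ = \frac{1}{2m} \sum_{c} \sum_{i \in C_c,\, j \in C_c} \frac{1}{O_i O_j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right]$$
where m is the number of edges, $A_{ij}$ is the element of the adjacency matrix, $k_i$ and $k_j$ are the degrees of node i and node j, and $O_i$ and $O_j$ represent the number of communities that node i and node j belong to respectively.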
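A minimal sketch of an EQ computation is given below. It assumes the commonly used form of the extended modularity for overlapping communities, EQ = (1/2m) Σ_c Σ_{i,j∈c} [A_ij − k_i k_j/(2m)] / (O_i O_j), on an unweighted, undirected network stored as a dictionary of neighbor sets; whether the paper uses this exact form or a weighted variant is not recoverable from the text, and the function name is hypothetical.

```python
def extended_modularity(adjacency, communities):
    """EQ for a cover of overlapping communities (unweighted sketch)."""
    degree = {n: len(nb) for n, nb in adjacency.items()}
    two_m = sum(degree.values())  # twice the number of edges

    # O_i: number of communities that node i belongs to.
    membership = {n: 0 for n in adjacency}
    for community in communities:
        for n in community:
            membership[n] += 1

    eq = 0.0
    for community in communities:
        for i in community:
            for j in community:
                a_ij = 1.0 if j in adjacency[i] else 0.0
                expected = degree[i] * degree[j] / two_m
                eq += (a_ij - expected) / (membership[i] * membership[j])
    return eq / two_m


# Two disjoint triangles, each taken as one community.
adjacency = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2},
    4: {5, 6}, 5: {4, 6}, 6: {4, 5},
}
eq = extended_modularity(adjacency, [{1, 2, 3}, {4, 5, 6}])
```

When no node belongs to more than one community, every O_i equals 1 and the value reduces to the ordinary modularity Q, which is 0.5 for this toy network.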

Result analysis
SR_CPM_c.v stands for the SR_CPM algorithm using k-clique-w_c.v, and SR_CPM_σ stands for the SR_CPM algorithm using k-clique-w_σ. For k = 3 and k = 4, the results are shown in Figure 1.
Under different values of cv*, the EQ value of the SR_CPM_c.v algorithm for k = 3 is generally larger than the corresponding EQ value for k = 4. This suggests that the network is, to some extent, roughly a 3-clique community structure. As the variation coefficient threshold cv* increases, the EQ value first increases and then decreases, with a small fluctuation in the middle. Different values of cv* impose different requirements on the dispersion of the edge weights in a clique, that is, on how uniform the degree of familiarity between the users must be. The smaller cv* is, the smaller the permitted dispersion of the weights in a clique and the stricter the conditions for forming cliques. The larger cv* is, the looser the requirement on the dispersion of the edge weights in a clique. When cv* is too small, some real community structures may not be considered cliques. Although the connections inside the communities are very close, the edges among communities are not sparse, so the EQ value is still low. As cv* increases, the conditions for forming cliques are gradually relaxed and the EQ value increases gradually as more reasonable nodes join the communities. When cv* is too large, the weight constraint on constructing a clique becomes weaker and weaker. Thus, some complete subgraphs with widely differing edge weights are also considered cliques and the EQ value decreases gradually. The algorithm then approaches the CPM algorithm, which does not consider intensity information.
Figure 1. SR_CPM_c.v: change of EQ value under different parameters
Figure 2. The comparison between SR_CPM_c.v and CPM
The coefficient of variation (C.V) and the standard deviation, both of which measure the degree of dispersion, are introduced into the definition of cliques. The experimental results show that SR_CPM divides the community structure more reasonably than CPM.
The data set used in the experiment comes from the data records of a telecommunications company. There are 3 files, containing the user information, the call log records, and the SMS records. The data set has been transformed to protect the privacy of the users. The original data contains 382779 users, 76907842 call records, and 20947956 SMS records. After preprocessing, the two data sets Call_mess_table and User_info are shown below. The SR_CPM and CPM algorithms are each used to partition the communities in the two data sets.

Experimental Content
Two data sets are extracted in this experiment. Data set 1 is user call network 1, which has 623 nodes and 3391 edges and consists of 27913 call records and 5624 SMS records from the 623 users. Data set 2 is user call network 2, which has 2403 nodes and 14094 edges and consists of 98152 call records and 14094 SMS records from the 2403 users.
Community discovery is an unsupervised learning process. The structure of the network is not known in advance, so the final community results need to be evaluated. At present there is no authoritative evaluation index for community partitions that can be applied to every kind of network. Compared with unweighted networks, there is very little research on quality evaluation functions for community partitions of weighted networks, and evaluating the quality of partitions of weighted complex networks is not easy. Many community evaluation indexes designed for unweighted networks show various defects when applied to weighted networks. Owing to the lack of effective community evaluation indexes and of classification information for weighted complex networks, objectively evaluating the results of community discovery algorithms on weighted networks is extremely difficult. The commonly used evaluation indexes include the clustering coefficient, strong and weak associations, and modularity. This paper again uses the EQ evaluation function.
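One possible way to turn call and SMS records into a weighted user network is sketched below. The text does not state how the edge weight is derived from the two record types, so the rule used here (number of calls plus a scaled number of short messages between two users, ignoring direction) is purely an assumption, as are the record layout, the `sms_factor` parameter, and the function name.

```python
from collections import Counter


def build_contact_network(call_records, sms_records, sms_factor=1.0):
    """Aggregate (sender, receiver) records into undirected edge weights.

    Assumed weighting: each call adds 1 and each SMS adds `sms_factor`.
    """
    weights = Counter()
    for caller, callee in call_records:
        if caller != callee:
            weights[tuple(sorted((caller, callee)))] += 1.0
    for sender, receiver in sms_records:
        if sender != receiver:
            weights[tuple(sorted((sender, receiver)))] += sms_factor
    return weights


# Toy records with anonymized user identifiers (illustrative only).
calls = [("u1", "u2"), ("u2", "u1"), ("u1", "u3")]
sms = [("u2", "u1")]
network = build_contact_network(calls, sms)
```

The number of distinct keys in the result corresponds to the number of edges reported for a data set, and the set of users appearing in the keys corresponds to its nodes.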

Result Analysis
With the parameters k = 4 and m = 2, the SR_CPM algorithm divides the data set into 33 communities (with overlapping community structure), and the CPM algorithm divides the data set into 61 communities (with overlapping community structure). The relationship between community size ranking and community size is shown in Figures 3 and 4. With reasonable parameter settings, the SR_CPM algorithm proposed in this paper performs better than the CPM algorithm and can find a reasonable community structure in the weighted network. However, if the parameters are set so that the conditions for forming cliques are too harsh, the quality of the community partition may be inferior to that of the CPM algorithm with the corresponding parameters. In addition, the communities obtained by the SR_CPM algorithm differ according to the input parameters. Telecom operators can adjust these parameters to meet the needs of practical applications.

3) If |A| < g and B = Φ, or A∪B is a subset of some clique, or |A|+|B| < g, stop the calculation and return to the previous step in the recursion. If |A| = g and B = Φ and the edge weights among the nodes in set A satisfy Definition 2, a new clique is obtained. Record the new clique, return to the previous step in the recursion, and continue to find new cliques. If |A| = g and B = Φ and the edge weights among the nodes in set A do not satisfy Definition 2, return to the previous step in the recursion and continue to find new cliques. If |A| < g and B ≠ Φ, execute 2) recursively. In the end, all the cliques that have size g and start from $v_i$ are obtained.
e) Delete $v_i$ from C*, and delete $v_i$ and all the edges connected to it from the network.
f) If C* ≠ Φ, take the next node from C* and repeat procedures d) to e). If C* = Φ, set g = g-1 and repeat procedures b) to e) until g = 2.
Secondly, construct the clique-clique overlap matrix C from all the weighted cliques found in the previous step.
Thirdly, construct the clique connection matrix and find the k-clique communities according to the input k and the matrix C.
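The second and third steps can be sketched as follows, assuming the standard clique percolation rule: cliques with at least k nodes that share at least k-1 nodes are connected, and each connected group of cliques forms one community. The cliques are given as sets of nodes; the function names and the toy cliques are hypothetical, and the weight filtering of the first step is assumed to have been applied already.

```python
def clique_overlap_matrix(cliques):
    """Entry (i, j) is the number of nodes shared by clique i and clique j;
    the diagonal holds the clique sizes."""
    n = len(cliques)
    return [[len(cliques[i] & cliques[j]) for j in range(n)] for i in range(n)]


def percolate(cliques, k):
    """Merge cliques of size >= k that share at least k-1 nodes."""
    overlap = clique_overlap_matrix(cliques)
    eligible = [i for i in range(len(cliques)) if overlap[i][i] >= k]
    parent = {i: i for i in eligible}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for pos, i in enumerate(eligible):
        for j in eligible[pos + 1:]:
            if overlap[i][j] >= k - 1:
                parent[find(i)] = find(j)

    communities = {}
    for i in eligible:
        communities.setdefault(find(i), set()).update(cliques[i])
    return list(communities.values())


# Toy cliques: the first two share two nodes, the last one is too small.
cliques = [{1, 2, 3}, {2, 3, 4}, {5, 6, 7}, {4, 5}]
overlap = clique_overlap_matrix(cliques)
communities = percolate(cliques, 3)
```

For k = 3 the first two cliques merge into the community {1, 2, 3, 4}, the clique {5, 6, 7} forms its own community, and {4, 5} is ignored because it has fewer than k nodes. A node may appear in several communities, which is how overlapping structure arises.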

Figure 3. The distribution of community size ranking of CPM
Figure 4.