Social Relationship Discovery Via Call Records

Telecom users constitute a huge but relatively sparse social network. Community discovery has long been a research topic in data mining, but traditional algorithms are strongly influenced by outliers. This paper presents a new algorithm based on social triangle theory. Experiments show that the new algorithm is effective.


Introduction
Short messages and call records are an important part of social media networks. Mining the social relations among users and their behavioral habits from call and SMS records makes it possible to develop more suitable pricing or service package policies, and to provide better service that attracts more users.
To achieve these objectives, the community partition of telecommunication users is a promising direction. Even before the birth of the Internet, people had begun to study the community structure of networks, for example in early biomedical research on protein structure [1]. The division of community structure is closely related to image segmentation in computer science and to hierarchical clustering in sociology [2]. This paper draws on social triangle theory to propose a community discovery algorithm based on an improved triangle theory, and designs and conducts experiments to demonstrate its effectiveness.

Hierarchical Cluster Algorithm
This section introduces the splitting approach. The most representative splitting algorithm for community discovery is the GN algorithm [3]. Its basic idea is to repeatedly find the edge with the largest betweenness in the network and delete it. The steps of the GN algorithm are as follows:
Step 1: compute the betweenness of every edge in the network.
Step 2: find the edge with the highest betweenness and remove it.
Step 3: recompute the betweenness and repeat Step 2 until every node has degenerated into a community of its own.
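The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes an undirected, unweighted graph without self-loops, uses shortest-path edge betweenness computed with a Brandes-style breadth-first search, and the function names (`edge_betweenness`, `components`, `girvan_newman`) are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness of every edge (undirected, unweighted)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every path was counted once from each of its two endpoints.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components of the graph as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman(edges):
    """Return the successive divisions produced by removing edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    levels = [components(adj)]
    while any(adj.values()):                 # stop when no edge is left
        bet = edge_betweenness(adj)          # Step 1 (recomputed each round)
        u, v = max(bet, key=bet.get)         # Step 2: highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > len(levels[-1]):     # record each new split
            levels.append(comps)
    return levels


# Two triangles joined by the bridge (3, 4): the bridge is removed first.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
levels = girvan_newman(edges)
print(sorted(sorted(c) for c in levels[1]))  # [[1, 2, 3], [4, 5, 6]]
```

Recomputing the betweenness after every removal is what makes the procedure expensive on large networks, since each round repeats a search from every node.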
The GN algorithm removes the requirement of the Laplace bisection algorithm that the number of nodes in each community be known in advance. To obtain better clustering, Newman et al. proposed a standard for measuring the quality of a community division: modularity. Modularity compares the fraction of edges that fall inside communities with the fraction expected in a random network whose nodes have the same degrees. It is defined as

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - P_{ij} \right) \delta(c_i, c_j)$$

where $m$ is the total number of edges in the network, $A_{ij}$ is the element in the $i$-th row and $j$-th column of the adjacency matrix, and $P_{ij}$ is the corresponding element of the matrix $P$, which stores the expected number of edges between nodes $i$ and $j$ in the pseudo-random network. If the two nodes belong to the same community, i.e. $c_i = c_j$, then $\delta(c_i, c_j) = 1$; otherwise $\delta(c_i, c_j) = 0$.
In the pseudo-random network, nodes $i$ and $j$ are said to be connected when they point to each other. Let the degrees of the two nodes (counting both out-degree and in-degree) be $k_i$ and $k_j$. For a network with $m$ edges, we can define $k_j / 2m$ as the connection probability of $i$ to $j$ and $k_i / 2m$ as the connection probability of $j$ to $i$. Each element of $P$ is then $P_{ij} = k_i k_j / 2m$, and $Q$ can be rewritten as

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$

According to modularity theory, the closer $Q$ is to 1, the more pronounced the community structure is. The $Q$ found for communities in real networks usually lies between 0.3 and 0.7 [4].
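As a worked illustration of the modularity described above, the sketch below evaluates Q for an undirected, unweighted graph using the standard Newman form, in which the expected number of edges between nodes i and j is k_i k_j / 2m. The function name `modularity` and the dictionary-based community labelling are assumptions made for this example only.

```python
def modularity(edges, community):
    """Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) * delta(c_i, c_j)."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Sum of A_ij over ordered same-community pairs = 2 * intra edges.
    intra = sum(1 for u, v in edges if community[u] == community[v])

    # Sum of k_i*k_j over ordered same-community pairs
    # = sum over communities of (total degree of the community)^2.
    degree_total = {}
    for node, k in degree.items():
        c = community[node]
        degree_total[c] = degree_total.get(c, 0) + k
    expected = sum(d * d for d in degree_total.values()) / (2.0 * m)

    return (2.0 * intra - expected) / (2.0 * m)


# Two triangles joined by one bridge edge.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
split = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
whole = {n: "a" for n in range(1, 7)}
print(modularity(edges, split))  # 5/14, about 0.357
print(modularity(edges, whole))  # 0.0
```

Placing every node in a single community gives Q = 0, while the two-triangle split gives 5/14, which falls inside the 0.3 to 0.7 range quoted for real networks.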

Community Discovery Algorithm Based On Improved Triangle Theory

Triangle Theory
In real society, relationships among people are complicated. For example, if A and B are good friends, B and C are good friends, and C and A are good friends, then the relationships among them constitute a triangle. Members of the same social triangle can be classified into one community. Because such social triangles exist throughout the network, the single triangle can be extended to build the structure of a complex real society, as shown in Figure 2.
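The social triangle described above corresponds to three mutually connected nodes in a graph. A minimal sketch of how such triangles could be enumerated from an edge list is given below; the function name `find_triangles` is hypothetical and the graph is assumed to be undirected.

```python
def find_triangles(edges):
    """Return every set of three mutually connected nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    triangles = set()
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        # Any common neighbour of u and v closes a triangle.
        for w in adj[u] & adj[v]:
            triangles.add(frozenset((u, v, w)))
    return triangles


# A, B and C are mutual friends; D only knows C.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(find_triangles(edges) == {frozenset({"A", "B", "C"})})  # True
```

Using sets of nodes means each triangle is reported once, no matter which of its three edges discovers it.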

Determining The Initial Community
We first determine the threshold ε for a close acquaintance, then find all links and compute the similarity between the nodes they connect. Nodes whose pairwise similarities are larger than ε are grouped into a triangular group, which serves as an initial community.
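The thresholding step can be sketched as follows. The text does not say which similarity measure is used, so this example assumes a structural similarity over closed neighbourhoods, |N(u) ∩ N(v)| / sqrt(|N(u)| |N(v)|), where N(u) contains u and its neighbours. That choice, along with the names `structural_similarity` and `initial_triangle_groups`, is an assumption for illustration only.

```python
from itertools import combinations
from math import sqrt


def structural_similarity(adj, u, v):
    """ASSUMED measure: overlap of the closed neighbourhoods of u and v."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))


def initial_triangle_groups(edges, eps):
    """Triangles in which all three pairwise similarities exceed eps."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    groups = set()
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        for w in adj[u] & adj[v]:
            pairs = combinations((u, v, w), 2)
            if all(structural_similarity(adj, a, b) > eps for a, b in pairs):
                groups.add(frozenset((u, v, w)))
    return groups


edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(len(initial_triangle_groups(edges, 0.5)))  # 1
print(len(initial_triangle_groups(edges, 0.9)))  # 0
```

In this toy graph the A-C and B-C similarities are about 0.87, so the triangle survives a threshold of 0.5 but is rejected at 0.9. Raising ε therefore yields fewer, tighter initial communities.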

Experimental Results and Analysis
The algorithm designed in this paper is based on clustering theory, and its computation is in effect a clustering process on a complex network. When searching for the best clustering of data whose cluster structure is known, the accuracy rate, recall rate and F value are commonly used to evaluate the clustering result [6].
In the first step, an initial triangle search is performed over all nodes. In the second step, the triangular groups found above are merged into the initial communities, whose members are then expanded by searching the structural community. Finally, three communities and two outliers are detected. Since one of the three communities contains only three nodes, these three nodes are treated as incorrectly assigned when calculating the accuracy rate. In this test, the similarity threshold is chosen to be 0.1. The final result is shown in Table 2, where pi represents the number of members assigned by the algorithm, pj represents the number of members of the original community, and pi∩pj represents the number of members assigned by the algorithm that also belong to the original community.
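The three evaluation measures can be computed from pi, pj and pi∩pj as sketched below. The sketch assumes the usual definitions: accuracy rate (precision) |pi∩pj| / |pi|, recall rate |pi∩pj| / |pj|, and F value as their harmonic mean. The function name and the example member sets are hypothetical and are not taken from the experiments.

```python
def precision_recall_f(assigned, truth):
    """Accuracy rate (precision), recall rate and F value for one community.

    assigned: members the algorithm placed in the community (pi)
    truth:    members of the original community (pj)
    """
    assigned, truth = set(assigned), set(truth)
    overlap = len(assigned & truth)            # |pi ∩ pj|
    precision = overlap / len(assigned) if assigned else 0.0
    recall = overlap / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Made-up example: 4 members assigned, 5 true members, 3 in common.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(p, r, round(f, 4))  # 0.75 0.6 0.6667
```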
The result of the GN algorithm is shown in Table 3. For the US political book network dataset, the similarity threshold is chosen to be 0.3. In this experiment, the algorithm proposed in this paper divides the whole network into five communities, while the GN algorithm divides it into six; nodes with lower coverage are regarded as outliers. The specific experimental results are shown in Table 4 and Table 5. Comparing these results with those of the GN algorithm shows that the algorithm designed in this paper reaches the level of established, well-performing community division algorithms.

The Application Of Triangle Algorithm In Telecommunication Network Division
In this experiment, about 20,000 call records among 524 users are extracted, and the triangular community division algorithm and the GN algorithm are compared on them. The Dunn index [5] is used to evaluate the clustering quality. It is the ratio of the smallest distance between two different clusters to the largest distance between members of the same cluster:

$$D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)}$$

where $d(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$, and $\operatorname{diam}(C_k)$ is the maximum distance between members of cluster $C_k$. The maximum Dunn index of the triangular algorithm is 0.29, which is larger than the maximum Dunn index of the traditional GN algorithm. This indicates that the triangular community division algorithm performs better than the GN algorithm when the number of communities is held at a given value. As a traditional community division algorithm, the GN algorithm is relatively weak at detecting outliers and at handling telecom networks. The triangular division algorithm is based mainly on the similarity of triangular relationships among nodes. If a user cannot form a triangular relationship with other users, and none of the user's structurally reachable nodes appears in a triangular group, the algorithm treats that user as an outlier. Because a large number of outliers are excluded, the maximum Dunn index of the triangular method is higher than that of the GN algorithm, and its clustering performance is better.
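A minimal sketch of the Dunn index computation is given below. It assumes the classical form: the smallest distance between members of two different clusters divided by the largest distance between two members of the same cluster. The distance between telecom users is not specified in the text, so the function takes a caller-supplied `dist` function; that parameter and the name `dunn_index` are assumptions for illustration.

```python
from itertools import combinations


def dunn_index(clusters, dist):
    """Smallest between-cluster distance over largest within-cluster distance."""
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Numerator: closest pair of members drawn from two different clusters.
    min_between = min(
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )

    # Denominator: largest diameter, i.e. farthest pair inside one cluster.
    max_within = max(
        (dist(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0.0,
    )
    if max_within == 0:
        return float("inf")  # every cluster is a single point
    return min_between / max_within


# Toy one-dimensional data with absolute difference as the distance.
clusters = [[0, 1, 2], [10, 11]]
print(dunn_index(clusters, lambda a, b: abs(a - b)))  # 4.0
```

Removing outliers from the clusters tends to shrink the largest diameter in the denominator, which is consistent with the higher Dunn index reported for the triangular method.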

Summary
In this paper, existing community discovery algorithms are reviewed. A triangle community division algorithm based on social triangle theory is then proposed, implemented, and validated on standard data sets. The new algorithm is then compared with the GN algorithm on a telecommunication call record data set. Experimental results show that the new algorithm is more effective.

Figure 2. Complex social triangle group structure diagram

Figure 1. Comparison of different parameters of the triangle sociology algorithm and the GN algorithm on two data sets
... clusters, indicates the maximum distance between members of the cluster. The larger D is, the larger the cluster is. The results are shown in Figure 7.

Figure 2. Dunn index of the different methods as the number of communities changes


5 Experimental Design and Result Flow

Table 1. Accuracy and recall rate analysis

Table 2. Clustering result of the improved triangle algorithm

Table 3. Clustering result of the GN algorithm

Table 4. Clustering result of the improved triangle algorithm

Table 5. Clustering result of the GN algorithm