Bert-GCN: multi-sensor network prediction

With the application of neural network technologies such as GCN and GRU to sensor networks, the accuracy and robustness of multi-sensor prediction have greatly improved. GCN exploits the spatial characteristics of the sensor network and GRU exploits its temporal characteristics, so the previously proposed T-GCN model achieves excellent results. However, shortcomings remain: i) the prediction targets only a single sensor feature, and multiple features cannot be trained at the same time; ii) only the connections between sensors are considered, while the connections between the multiple features of a sensor are ignored; iii) modeling multiple features deepens the model from two-dimensional to three-dimensional, resulting in slow training and a poor learning effect. To solve these problems, this paper proposes the Bert-GCN model: Bert pre-training is added to the original GCN-GRU model, effectively improving learning over the multiple features of a single sensor.


Introduction
With the transformation of world military strategy and the deepening reform of military operational systems, information-oriented system operations and system confrontation have become the main form of future warfare. As the core of battlefield target perception, multi-sensor prediction faces increasingly severe challenges, and realizing efficient multi-sensor prediction is the key to meeting them.
This work starts from improving multi-sensor prediction performance and addresses three problems in training the multiple features of a single sensor. i) Predicting only a single sensor feature makes simultaneous training of multiple features slow. ii) Only the connections between sensors are considered, while the connections between the multiple features of a sensor are ignored. iii) Modeling multiple features deepens the model from two-dimensional to three-dimensional, resulting in a poor learning effect.
Aiming at the above three problems, this paper proposes a training model based on Bert-GCN. The aim is to improve the resource scheduling capability of future sensor systems to cope with an increasingly severe electronic-interference environment and growing threats from various targets. By constructing a networked, collaborative and information-based sensor resource scheduling system, the sensor configuration is optimized so that the cooperative detection capabilities of the sensors complement one another, realizing all-round, three-dimensional and multi-level collaborative resource scheduling. The sensor network can then continuously track and detect small, low-altitude, high-speed and highly maneuverable targets in a wider and more flexible working mode. Bert is used to preprocess the sensor features, which not only compresses the model to improve training speed, but also effectively captures the relationships between multiple features to reduce information loss.

Related works

Bert
Bert is a language processing model based on neural networks [1,2]. The Bert model focuses on identifying the relationships between words within a sentence or between sentences, adopting semi-supervised learning to build a language representation model. Bert is a bi-directional Transformer model [3]: it attends to context both left-to-right and right-to-left. In the pre-training stage, Bert is trained on unsupervised predictive tasks, including the Masked Language Model (MLM) described below. After pre-training, the Bert model is fine-tuned on downstream tasks, adjusting the model parameters to achieve the best adaptation. Bert's deep bi-directional Transformer embodies this design, as shown in Fig. 1.
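As a rough illustration of the MLM objective, the following sketch masks a fraction of input tokens (the 15% rate follows the original Bert paper; Bert's 80/10/10 replacement refinement is omitted, and all names here are illustrative):

```python
import random

MASK_RATE = 0.15  # fraction of tokens hidden from the model, as in the original Bert paper

def mask_tokens(tokens, mask_token="[MASK]"):
    """Randomly hide tokens; the model must predict the originals at masked positions."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < MASK_RATE:
            masked.append(mask_token)  # the model sees [MASK] ...
            targets.append(tok)        # ... and is trained to recover the original word
        else:
            masked.append(tok)
            targets.append(None)       # no loss is computed at unmasked positions
    return masked, targets

masked, targets = mask_tokens(["the", "sensor", "reports", "a", "target"])
```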
Bidirectional conditioning can create loops in which a word indirectly sees itself, distorting the prediction of the word itself. Bert adopts the MLM objective to avoid this: the MLM randomly masks words of the input sentence (OpenAI GPT takes a similar approach) [4]. The word encoding of the Bert model is not a simple word encoding but the combination of three layers of embeddings. The first layer is the encoding of the word itself: during Bert initialization, an external vocabulary is supplied for encoding, containing all the words of the natural language. The second layer embeds the position information of words: to reflect where a word sits in its sentence, Bert assigns a position to every word of every sentence. The third layer is sentence-level encoding: to express the independence of sentences (what Bert calls segment embedding), Bert constructs this encoding from pairs of concatenated sentences [5]. After the three layers are computed, Bert combines the three embeddings to determine the final word vector.
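Since the combination of the three embeddings is a simple element-wise sum, it can be sketched in a few lines of PyTorch (the layer names and default sizes here are illustrative assumptions, not Bert's exact configuration):

```python
import torch
import torch.nn as nn

class BertEmbedding(nn.Module):
    """Sum of token, position, and segment embeddings, as described above."""
    def __init__(self, vocab_size=30522, max_len=512, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)   # layer 1: the word itself
        self.position = nn.Embedding(max_len, hidden)   # layer 2: position within the sentence
        self.segment = nn.Embedding(2, hidden)          # layer 3: which of the two spliced sentences

    def forward(self, token_ids, segment_ids):          # both: (batch, seq_len)
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.position(pos_ids).unsqueeze(0)
                + self.segment(segment_ids))            # final word vector: (batch, seq_len, hidden)
```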

Graph neural network
A graph is a kind of data structure, and graph neural networks are the models, methods and applications of deep learning on graph-structured data. A common graph structure is composed of nodes and edges: nodes contain entity information and edges contain information about the relationships between entities. Many learning tasks, such as modeling physical systems, learning molecular fingerprints, predicting protein interfaces, and classifying diseases, require models that learn from graph-structured input. GNNs generally fall into the following four categories: i) Graph convolutional networks [6][7][8] and graph attention networks [9,10]. This kind of graph neural network assumes that every node in the graph is continuously influenced by its neighbors and by more distant nodes, changing its state until a final equilibrium, where closer neighbors exert more influence. ii) Spatio-temporal graph networks [11,12]. This model captures complex local spatio-temporal correlations through a well-designed spatio-temporal synchronous modeling mechanism, and several modules covering different time periods are designed to capture the heterogeneity in local spatio-temporal graphs. iii) Graph autoencoders [13,14]. In this model, the known graph is encoded to learn the distribution of node vector representations, node representations are sampled from that distribution, and the graph is then reconstructed by decoding (link prediction). iv) Graph generative networks [15,16]. The model generates new graphs given a set of observed graphs.

Our work
Our work in this paper combines Bert, GCN and GRU. First, the multiple parameters of the multiple sensors are taken as Bert input; the purpose of the Bert layer is to encode the multiple features of a single sensor uniformly, compress the data dimensions, and retain the internal relationships between parameters. Bert's output serves as the input of the GCN; multiple GCNs are built to match the multi-head Transformer, and the purpose of the GCN layer is to model the spatial relationships of the multi-sensor network. The output of the GCN is the input of the GRU, and the purpose of the GRU layer is to capture the temporal correlations of the multi-sensor network. The network structure diagram is shown in Fig. 2.

Bert is a Transformer model using a bidirectional encoder, built by stacking the encoder structures of multiple Transformers. In a Transformer encoder, the data is first given a weighted feature vector $Z$ by the self-attention module. The above is the structure of a one-layer Transformer; Bert is a stack of multiple Transformer layers, and the model structure is shown in Fig. 3.

For a graph $G = (V, E)$, $V$ is the set of nodes and $E$ is the set of edges. Each node $v \in V$ has its own feature $x_v$, so the node features can be represented by a matrix $X \in \mathbb{R}^{N \times D}$, where $N$ is the number of nodes and $D$ is the feature dimension of each node, i.e. the dimension of the feature vector. Any graph convolution layer can be written as a nonlinear function
$$H^{(l+1)} = f(H^{(l)}, A),$$
where $H^{(0)} = X$ is the input of the first layer, $X \in \mathbb{R}^{N \times D}$, and $A$ is the adjacency matrix. Different models are chosen for different problems, and the difference lies in the realization of the function $f$.
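Putting the three stages together, a minimal sketch of the pipeline under the assumptions above (the module sizes, single-head encoder, and one-layer graph convolution are our simplifications; the 2-layer GCN actually used is detailed below):

```python
import torch
import torch.nn as nn

class BertGCN(nn.Module):
    """Sketch: a Transformer encoder stack (Bert) encodes per-sensor features,
    a graph convolution mixes information over the sensor graph, and a GRU
    models the time dimension. Input x has shape (time, sensors, features)."""
    def __init__(self, n_features, hidden, adj):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=n_features, nhead=1)
        self.bert = nn.TransformerEncoder(layer, num_layers=2)  # stacked encoder layers
        self.proj = nn.Linear(n_features, hidden)  # compress the multi-feature encoding
        self.register_buffer("adj", adj)           # normalized adjacency of the sensor graph
        self.gcn_w = nn.Linear(hidden, hidden)     # one graph-convolution weight matrix
        self.gru = nn.GRU(hidden, hidden)          # temporal model over the time axis
        self.out = nn.Linear(hidden, 1)            # one predicted value per sensor

    def forward(self, x):                           # x: (T, N, F)
        h = self.proj(self.bert(x))                 # encode each sensor's features jointly
        h = torch.relu(self.adj @ self.gcn_w(h))    # spatial mixing over neighboring sensors
        h, _ = self.gru(h)                          # temporal correlation across time steps
        return self.out(h)                          # (T, N, 1)
```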
Given the adjacency matrix $A$ and feature matrix $X$, the GCN model constructs a filter in the Fourier domain. The filter acts on the nodes of the graph and captures the spatial features between nodes through their first-order neighborhoods. The GCN model is then constructed by stacking multiple convolutional layers, which can be expressed as
$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right), \qquad \hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}},$$
where $\tilde{A} = A + I_N$ is the adjacency matrix with self-loops and $\tilde{D}$ is its degree matrix. We choose a 2-layer GCN model to capture the spatial dependence, which can be expressed as
$$f(X, A) = \sigma\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right).$$
To prevent overfitting, L2 regularization is added.
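A minimal sketch of this 2-layer GCN (following the standard Kipf-Welling formulation; the class and variable names are ours):

```python
import torch
import torch.nn as nn

def normalize_adj(adj):
    """A_hat = D~^(-1/2) (A + I) D~^(-1/2): add self-loops, then normalize symmetrically."""
    a_tilde = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)

class TwoLayerGCN(nn.Module):
    """f(X, A) = A_hat * ReLU(A_hat * X * W0) * W1, the 2-layer model chosen above."""
    def __init__(self, in_dim, hidden_dim, out_dim, adj):
        super().__init__()
        self.register_buffer("a_hat", normalize_adj(adj))
        self.w0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w1 = nn.Linear(hidden_dim, out_dim, bias=False)

    def forward(self, x):                         # x: (N, in_dim), one row per node
        h = torch.relu(self.a_hat @ self.w0(x))   # first layer: first-order neighborhood mixing
        return self.a_hat @ self.w1(h)            # second layer
```

The L2 regularization mentioned above would typically be applied through the optimizer, e.g. the weight_decay argument of torch.optim.Adam.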

Model training
In this section, we describe the concrete implementation of the Bert-GCN model. A Transformer encoder unit consists of multi-head attention + layer normalization + feed-forward + layer normalization, and each Bert layer consists of one such encoder unit. To evaluate the prediction performance of the Bert-GCN model, we use the following metrics to measure the difference between the real value $Y_t$ and the prediction $\hat{Y}_t$:

Root Mean Squared Error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(Y_t - \hat{Y}_t\right)^2} \tag{2}$$

Mean Absolute Error (MAE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|Y_t - \hat{Y}_t\right| \tag{3}$$

Accuracy:
$$\mathrm{Accuracy} = 1 - \frac{\lVert Y - \hat{Y} \rVert_F}{\lVert Y \rVert_F} \tag{4}$$

Explained Variance Score (Var):
$$\mathrm{Var} = 1 - \frac{\mathrm{Var}\{Y - \hat{Y}\}}{\mathrm{Var}\{Y\}} \tag{5}$$

RMSE and MAE measure the prediction error: the smaller the value, the better the prediction effect. Accuracy measures the prediction precision: the higher the value, the better the prediction. Var shows how well the prediction results match the actual data: the larger the value, the better the prediction effect.
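In NumPy form, these metrics can be sketched as follows (assuming y_true and y_pred are arrays of the same shape):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """RMSE, MAE, Accuracy, and explained variance score, as defined above."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))                          # eq. (2)
    mae = np.mean(np.abs(err))                                 # eq. (3)
    acc = 1.0 - np.linalg.norm(err) / np.linalg.norm(y_true)   # eq. (4): relative norm error
    var = 1.0 - np.var(err) / np.var(y_true)                   # eq. (5): explained variance
    return rmse, mae, acc, var
```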

Experimental results
We took GCN-GRU as the baseline and compared it with the Bert-GCN model proposed in this paper, evaluating the prediction performance of the two models in multi-sensor, multi-feature scenarios. As can be seen from the figure, Bert-GCN outperforms the original GCN-GRU model in both accuracy and robustness. The outcomes are as follows:

Conclusion
In multi-sensor network prediction, this paper builds on GCN-GRU and optimizes for the complex scenario in which a single sensor contains multiple features. On the one hand, the data dimension is reduced and performance is improved; on the other hand, the relationships between the multiple features are effectively preserved, achieving a good effect.