On the performance of overlaid wireless energy harvesting cognitive industrial sensor networks under jamming attacks

Two or more wireless sensor networks coexist in the same space, while low-power mobile devices in a secondary network harvest ambient RF energy from transmissions by nearby active transmitters in the primary network. The channels are licensed to the primary network, while the overlaid secondary network can opportunistically access idle channels to transmit data and operate properly. In this paper, considering implanted jammers, we propose a novel solution in which we execute a deception strategy to exhaust the jammers' energy. As a result, the energy-constrained jammers will find it challenging to mount jamming attacks when the secondary transmitters (STs) transmit information. To tackle the issue, we first formulate the problem; that is, we cast throughput optimization for an ST under jamming attacks as a Markov decision process (MDP). Then, since the focus is mainly on the throughput of the secondary network, a learning algorithm is adopted to maximize it. Through the learning process, the STs can adapt to the dynamics of the primary network while executing proper actions online to benefit the overall throughput. Simulations validate the efficiency and convergence of the proposed algorithm.


Introduction
Wireless cognitive sensor networks with wireless energy harvesting have drawn much attention recently. Powering low-power mobile devices by harvesting energy from ambient sources such as radio-frequency (RF) signals, solar, wind, and kinetic activity makes wireless networks not only environmentally friendly but also self-sustaining. With the progress in designing efficient circuits and devices for RF energy harvesting suitable for low-power applications, a novel network model becomes feasible: two or more wireless sensor networks coexist in the same space, while low-power mobiles in a secondary network, called secondary transmitters (STs), harvest ambient RF energy from transmissions by nearby active transmitters in the primary network, called primary transmitters (PTs). The channels are licensed to the primary network, while the STs are unlicensed users. The STs can be self-sustaining by harvesting energy from the ambient RF signals of the primary network. By opportunistically accessing idle licensed channels, the overlaid secondary network can transmit data and operate properly. While wireless power transfer eases the energy limitations caused by the constrained hardware of mobile ST devices, mobile nodes can be self-maintained. Such networks have become more common with the wide spread of wireless charging techniques, and research has focused mainly on the performance of the secondary users. However, malicious transmitters that execute jamming attacks can easily harm the STs, since the energy for jamming attacks can likewise be drawn from harvested wireless RF energy.
In our network model, the PTs own the licensed spectrum; thus, the STs are not immune from intruders. With the wireless power supply, intruders can also harvest wireless RF energy to attack the STs' information transmission and thereby harm the throughput of the overlaid secondary wireless energy harvesting cognitive sensor network. In this paper, we assume the intruders can use smart sensing to distinguish the PTs' signals from the STs' signals.
Jamming in classical wireless systems is generally viewed as follows: there are many channels, and jammers can only manage to invalidate a few of them at a time. The solution therefore lies in choosing channels unlikely to be attacked, or switching to another channel when the current one is under attack. However, in our setting the STs are designed to transmit opportunistically on a fixed channel, so we need a fundamentally different strategy. Since the jammers rely on the wireless power supply and their energy is limited, they cannot attack a single channel continuously. We therefore propose a novel solution in this work: a deception strategy. The STs perform blank (fake) transmissions to bait the intruders and waste the jammers' energy. As a result, the jammers may find it challenging to execute jamming attacks when the STs transmit real information.
To tackle the issue, we first formulate the problem; that is, we cast throughput optimization for an ST under jamming attacks as an MDP. Then, since our objective is mainly the throughput of the secondary network, a learning algorithm is adopted. The learning algorithm here is derived from simulation-based and policy gradient methods. Through the learning process, the STs can adapt to the dynamics of the primary network while executing proper actions online to benefit the overall throughput.
Later, we run simulations to validate the efficiency and convergence of the proposed algorithm. Through the simulations, we demonstrate that under such multiple-jammer conditions, our learning algorithm effectively balances the ratio of deception actions to data transmissions so as to achieve optimized solutions. Convergence is demonstrated in simulations, where we also compare against other algorithms. It must be emphasized that in our problem formulation, the smart jammers aim only at degrading the throughput of the secondary network and do not target the primary network. That is, the secondary users are their targets.
The rest of the paper is organized as follows. Related works are reviewed in section 2, and our system model is presented in section 3. In section 4, we focus on a single ST and present the learning algorithm to optimize its performance. The simulations are presented in section 5, while section 6 concludes the paper.

Related work
Many optimization problems for secondary networks with energy harvesting have been studied in the cognitive radio literature, focusing on throughput [1], energy consumption [2,3], and channel access [4,5]. This paper is in the realm of cognitive industrial wireless sensor networks, where network operation is more energy-constrained and computation-limited. Here we are concerned with the average packet update delay of the secondary network rather than the average packet delay. Our assumptions are driven more by industrial application requirements than by the communication side. To the best of our knowledge, targeted jamming of a single channel in industrial settings has not been addressed in the existing literature. These are the main contributions of our work.

System model
We consider a primary cognitive wireless sensor network coexisting with energy-harvesting STs and jammers in the same spatial domain. While the STs harvest energy and use it to transmit data, the jammers conduct jamming attacks to disrupt the communication. We assume the PTs utilize the channel on a time-slot basis and denote the idle probability of the channel by p_ch^idle.
The STs are equipped with a battery to store the harvested energy, together with a buffer to store unsent incoming data. The STs sense the status of the channel at each time slot; the channel can be either busy or idle. If it is busy, the ST will stand by; otherwise, the ST will transmit, provided it has harvested or stored enough energy. The ST's spectrum sensing may incur errors: one is detecting the idle channel as busy, and the other is detecting the busy channel as idle. We describe them as type I and type II errors and denote their probabilities by p_ST^I and p_ST^II, respectively. With the assumption that the channel is used on a time-slot basis, we also assume the ST can harvest energy and transmit data simultaneously within a slot. We denote the battery capacity of the ST as E_ST units, while the energy-harvesting circuits bring e_ST^h units of energy in each time slot with probability λ_ST^h. For simplicity, we ignore the energy consumption of other operations on the ST. When powered, the ST can stand by, transmit data while its buffer is not empty, or send a trap packet to exhaust the jammers' energy (deception). When the ST has no energy, it stands by. We also assume the jammer is capable of distinguishing the PTs' transmissions from the STs'. When the jammer detects the ST's signals, it attacks by jamming the whole time slot on the channel. This assumption is realistic, since the PT's signals can be recognized by techniques as in [6].
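To make the slot dynamics above concrete, the following minimal sketch advances an ST by one slot under the stated assumptions (harvesting and transmission can happen in the same slot; the battery is capped at capacity). The names E_ST, e_h, lam_h, and e_tr are our stand-ins for E_ST, e_ST^h, λ_ST^h, and e_ST^tr; the numeric values follow the later simulation settings, not a prescription of the model.

```python
import random

# Illustrative sketch (not the paper's exact model) of one time-slot update of
# the ST's battery and data buffer.
E_ST = 9      # battery capacity in energy units (assumed value)
e_h = 1       # energy gained per successful harvest event (assumed value)
lam_h = 0.5   # harvest probability per slot (assumed value)
e_tr = 3      # energy cost of one actual data transmission (assumed value)

def st_slot_update(battery, buffer_, channel_idle, rng=random):
    """Advance the ST by one slot: transmit if possible, then harvest."""
    transmitted = False
    if channel_idle and buffer_ > 0 and battery >= e_tr:
        battery -= e_tr          # spend energy on the transmission
        buffer_ -= 1             # the packet leaves the buffer
        transmitted = True
    if rng.random() < lam_h:     # harvesting proceeds in the same slot
        battery = min(E_ST, battery + e_h)   # cap at battery capacity
    return battery, buffer_, transmitted
```

Passing a stub `rng` makes the slot deterministic, which is convenient for testing the dynamics in isolation.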
The paper aims to maximize the throughput of the STs under the jammers' attacks; with our deception strategy, there is a trade-off between transmitting actual data and transmitting deception packets. If the ST transmits fake packets to bait the jammer into attacking, the attack consumes much of the jammer's energy, which benefits the STs: when the ST later has actual data packets to transmit, the jammer may not have enough energy to attack. In practice, the power consumption of transmitting actual and fake data differs; otherwise, the deception method would be in vain, since the STs could simply rely on a retransmission mechanism. We denote the energy consumption of an actual packet transmission as e_ST^tr and that of a fake packet transmission as e_ST^de, with e_ST^de < e_ST^tr: a short fake transmission within the slot suffices to deceive the jammer into attacking while remaining energy-efficient for the ST.

Similar to the STs, we assume each jammer can harvest e_ja^h units of energy with probability λ_ja^h in each time slot and is equipped with a battery of capacity E_ja. When powered, a jammer performs a jamming attack once it detects signals sent from the STs; when it has no energy, it stands by. We assume the jammer needs e_ja^at units of energy to execute an attack. It is straightforward that e_ja^at > e_ST^de, since the jammer needs to jam the channel for the whole slot.
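A matching sketch of the jammer's per-slot behaviour under the same hedged assumptions (E_ja, e_ja_h, lam_ja_h, and e_ja_at are our stand-ins for E_ja, e_ja^h, λ_ja^h, and e_ja^at; the values follow the later simulation settings):

```python
import random

# Illustrative sketch of the jammer's per-slot energy dynamics.
E_ja = 9        # jammer battery capacity (assumed value)
e_ja_h = 1      # energy gained per harvest event (assumed value)
lam_ja_h = 0.5  # harvest probability per slot (assumed value)
e_ja_at = 6     # energy needed to jam a whole slot (assumed value)

def jammer_slot(battery, st_signal_detected, rng=random):
    """One slot for a jammer: attack only if an ST signal is seen and the
    stored energy suffices; otherwise stand by. Harvest in the same slot."""
    attacked = False
    if st_signal_detected and battery >= e_ja_at:
        battery -= e_ja_at
        attacked = True
    if rng.random() < lam_ja_h:
        battery = min(E_ja, battery + e_ja_h)
    return battery, attacked
```

The asymmetry between these costs is what makes deception pay off: under the assumed values, a 1-unit fake packet from the ST can drain 6 units from the jammer.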
In the following sections, we model the actions of the STs to maximize throughput while dealing with smart jamming attacks. We only illustrate the case of one ST, since the multi-ST scenarios are analogous. A learning algorithm will be proposed to manage the transmission and deception strategy, together with a highly adaptive MAC access mechanism for an industrial wireless sensor network.

The optimal strategy for a single secondary transmitter
We study the case of an ST under attack from jammers in this section. First, we formulate the optimization problem as an MDP. Then the aforementioned learning algorithm, based on simulation-based and policy gradient methods, is presented to optimize the performance metrics of the ST.

Optimization problem formulation
The problem is formulated in four parts: state space, action space, transition probability, and reward function.

State space
The state of the ST is i = (d, e), where d is the number of data packets in the data buffer; without loss of generality, we only define two values, 0 and 1, for not having or having a packet to transmit. e is the battery status of the ST, represented in discrete levels.

Action space
The ST needs to choose an action after each channel sensing, and the possible actions are stand by, transmit data, and execute deception. The feasible action set depends on the battery level: A(i) = {stand by} if e < e_ST^de; A(i) = {stand by, deceive} if e_ST^de ≤ e < e_ST^tr; and A(i) = {stand by, deceive, transmit} if e ≥ e_ST^tr. The first case means the battery level is too low to act, and the ST can only stand by. The second means the ST has enough energy to execute deception but not enough to transmit actual data, so it can deceive or stand by. The third means the ST has enough energy to perform any of the actions.
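The energy-dependent action sets can be sketched as a small helper (e_de and e_tr are assumed names for e_ST^de and e_ST^tr, with the values from the later simulation settings):

```python
# Sketch of the energy-dependent feasible action sets described above.
STAND_BY, DECEIVE, TRANSMIT = 0, 1, 2
e_de, e_tr = 1, 3   # deception and transmission costs (assumed values)

def feasible_actions(energy):
    """Return the feasible action set A(i) for a given battery level."""
    if energy < e_de:
        return [STAND_BY]                      # too low to do anything
    if energy < e_tr:
        return [STAND_BY, DECEIVE]             # can deceive but not transmit
    return [STAND_BY, DECEIVE, TRANSMIT]       # all actions available
```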

Transition probability
Typically, we need a transition probability matrix for the MDP. But under our assumption that the jammer's objective is to attack the ST's transmissions, the jammer will not disclose its information to the ST. In fact, network information such as the channel idle probability p_ch^idle, the type I error probability p_ST^I, and the type II error probability p_ST^II is all inaccessible. Thus, we cannot directly obtain the transition probability matrix.
The main idea of the learning method [7] we propose is to simulate the actions of the ST, the channel, and the jammers through generated parameters; the ST can then update its parameters together with the resulting optimized decisions. Next, we derive the formulas.
For a control policy κ, the transition probability function (the probability of going from one state to another in one step of the stochastic process) is p_κ(j | i): the next battery level is determined by the energy e^h(t) harvested in slot t and by the energy consumed by the chosen action, while the next buffer state follows the packet arrivals.

Reward function
The reward refers to the throughput of the ST: the immediate reward r(i, a) equals 1 when an actual data packet is transmitted successfully in the slot and 0 otherwise. We have now formulated the optimization problem; next we apply a policy-based method to solve it, as there is no direct traditional way to do so.
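On this reading, the immediate reward can be sketched in one line (a minimal sketch of one plausible reward definition, not necessarily the paper's exact formula):

```python
# One unit of throughput per slot in which an actual data packet is delivered
# successfully, i.e. transmitted and not jammed.
def reward(transmitted, jammed):
    """Immediate reward r(i, a) for one slot."""
    return 1 if transmitted and not jammed else 0
```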

Parameterization modelling
We utilize a randomized parameterized policy as in the literature [8,9] to make decisions for the ST, with a parameter vector Φ ∈ R^n. Under the policy, when the ST is currently at state i, it executes action a with probability μ_Φ(i, a). The transition probabilities and immediate rewards, after being parameterized, are written p_ij(Φ) and r_i(Φ). With Assumption I (for every Φ, the induced Markov chain is aperiodic and has a recurrent state i*, as in [8]), the balance equations π(Φ)P(Φ) = π(Φ), Σ_i π_i(Φ) = 1 have a unique solution π(Φ), with π_i(Φ) being the steady-state probability of state i under a particular vector Φ. As a performance metric we define the average throughput χ(Φ) = Σ_i π_i(Φ) r_i(Φ). Under Assumption I, the average reward is well defined and does not depend on the initial state. We define the differential reward at state i as d_i(Φ) = E[Σ_{t=0}^{T−1} (r_{i_t}(Φ) − χ(Φ)) | i_0 = i], where T is the recurrence time, that is, the first future time at which the state i* is revisited.
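One common concrete choice for the randomized parameterized policy μ_Φ(i, a) — an assumption here for illustration, following the Gibbs/softmax form often used with [8], not necessarily the paper's exact parameterization — is:

```python
import math

# Softmax (Gibbs) parameterized policy: mu_Phi(i, a) proportional to
# exp(Phi[i, a]), which is smooth in Phi as the assumptions require.
# phi is a dict mapping (state, action) pairs to real parameters.
def policy_probs(phi, state, actions):
    """Return the action probabilities mu_Phi(state, .) over `actions`."""
    scores = [math.exp(phi.get((state, a), 0.0)) for a in actions]
    z = sum(scores)                 # normalizing constant
    return [s / z for s in scores]
```

With all parameters at zero the policy is uniform over the feasible actions; raising Φ[(i, a)] concentrates probability on action a at state i.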

The idealized gradient algorithm
Since our goal is to maximize the average reward, it is natural to consider gradient-type methods. The gradient of the throughput with respect to the parameter vector Φ has the form [10] ∇χ(Φ) = Σ_i π_i(Φ) [∇r_i(Φ) + Σ_j ∇p_ij(Φ) d_j(Φ)], and we consider an algorithm of the form Φ_{k+1} = Φ_k + γ_k ∇χ(Φ_k). We then make the following assumptions [8]. Assumption II: both the transition probabilities and the immediate rewards depend smoothly on Φ, with bounded first and second derivatives. Assumption III: the step sizes γ_k are nonnegative and satisfy Σ_k γ_k = ∞ and Σ_k γ_k^2 < ∞, which ensures the convergence of the policy gradient method.
Under Assumptions I–III, it is proved in [8] that χ(Φ_k) converges and ∇χ(Φ_k) → 0. Under Assumption I, the differential throughput d_i(Φ) is the unique solution of the Bellman equation d_i(Φ) = r_i(Φ) − χ(Φ) + Σ_j p_ij(Φ) d_j(Φ) with d_{i*}(Φ) = 0, where i* is the recurrent state and T is the first time i* is revisited. Since the exact value of the differential reward is unavailable online, it is replaced by an approximation accumulated along the trajectory while actions a are drawn from μ_Φ at state i.
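The Bellman equation above can be checked numerically on a toy two-state chain (illustrative only, not the paper's chain): with steady-state distribution π and average reward χ = π·r, the differential rewards satisfy d_i = r_i − χ + Σ_j p_ij d_j, uniquely once one component is pinned to zero.

```python
# Toy 2-state chain (assumed numbers, chosen so pi is easy to verify by hand).
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 0.0]

# Steady-state distribution: pi P = pi gives pi = (5/6, 1/6) for this chain.
pi = [5.0 / 6.0, 1.0 / 6.0]
chi = pi[0] * r[0] + pi[1] * r[1]          # average reward chi = pi . r

# Solve the Bellman equation with the normalization d_1 = 0:
# d_0 = r_0 - chi + P[0][0] d_0 + P[0][1] d_1
d1 = 0.0
d0 = (r[0] - chi) / (1.0 - P[0][0])

# Residual of the state-0 Bellman equation (should be ~0 up to rounding).
residual = d0 - (r[0] - chi + P[0][0] * d0 + P[0][1] * d1)
```

The state-1 equation holds automatically, which is the point of the uniqueness claim: fixing d at the recurrent state determines all differential rewards.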
Next, we propose an algorithm to update the parameter vector Φ at each visit to the state i*.
At each visit to i*, the algorithm updates Φ and the estimated average throughput χ̃. Here C stands for a positive constant, and ∇μ_Φ(i, a) is the gradient of the randomized parameterized policy function defined in (6). Although the algorithm updates the parameter vector Φ only at the next visit to the state i*, the likelihood-ratio terms ∇μ_Φ(i_t, a_t)/μ_Φ(i_t, a_t) accumulated along the way are all stored. All in all, the proposed algorithm can be expressed as follows: at time k, the state is i_k, and Φ_k, h_k, and χ̃_k are available during the iteration and are updated accordingly, where C is a positive constant. Only Φ and χ̃ are required for the ST to store and update at each step.
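A hedged sketch of the per-renewal update step described above (the names phi, z, chi_hat, and gamma are our notation for Φ, the accumulated likelihood-ratio trace, χ̃, and the step size; the exact form in the paper may differ):

```python
# Per-renewal parameter update, in the spirit of simulation-based policy
# gradient methods: between visits to i* we accumulate z (the sum of
# grad log mu_Phi terms) and the cycle's reward total; at the renewal we move
# Phi along z weighted by the reward surplus over the current estimate.
def renewal_update(phi, z, cycle_reward, cycle_len, chi_hat, gamma, C=1.0):
    """Update parameters phi and throughput estimate chi_hat at a renewal."""
    advantage = cycle_reward - chi_hat * cycle_len   # surplus over estimate
    for key, g in z.items():
        phi[key] = phi.get(key, 0.0) + gamma * advantage * g
    # C is the positive constant from the text, scaling the estimate update.
    chi_hat = chi_hat + C * gamma * (cycle_reward - chi_hat * cycle_len)
    return phi, chi_hat
```

Only phi and chi_hat persist across renewals, matching the storage requirement stated above.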
We have now derived the theoretical model for wireless energy harvesting cognitive industrial sensor networks under jamming attacks, whose state space is countable, and we have shown that the algorithm converges while operating efficiently.

Parameters
We perform simulations in Python to evaluate network configurations under different parameter settings. Without loss of generality, we mimic the situation of a single ST under attack from jammers. As mentioned earlier, we set the data queue capacity to 1 and the energy storage capacity to nine units. The packet arrival probability is set to 0.4, and the data buffer is always updated with the arriving packet. The queued data follows a first-in-first-out policy; once a sent packet is detected as not correctly received, it is either discarded or kept, depending on whether a packet arrives: if a new packet arrives, the buffer is updated with it; otherwise the old packet stays for the next transmission. The rationale is that in industrial scenarios we always have sufficient or repetitive data to transmit; we transmit mostly to update the immediate status, so the newer packet is always preferred, and the stale packet is deleted before the arrival enters the data queue. The energy consumption rates are 1 unit for deception, 3 units to transmit data, and 6 units to perform an attack. The energy harvest probability is set to 0.5, the sensing error probabilities of the ST are both set to 0.01, and the successful transmission probability of the ST (without attack) is set to 0.9. We define 3 jammers, operating independently.

ITM Web of Conferences 47, 01006 (2022) CCCAR2022 https://doi.org/10.1051/itmconf/20224701006
We compare the performance metrics of our proposed learning algorithm against the standard 1-persistent policy, under which the ST transmits data whenever the channel is idle and the ST has both data and energy; if a collision happens, the ST retransmits the packet until it is successfully received.
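For reference, the 1-persistent baseline reduces to a one-line decision rule (a sketch; e_tr = 3 as in the simulation settings):

```python
# The 1-persistent baseline: transmit whenever the channel is idle and both
# data and sufficient energy are available; retransmission after a collision
# is handled simply by the packet remaining in the buffer.
def one_persistent_action(channel_idle, buffer_, battery, e_tr=3):
    """Return the baseline policy's action for the current slot."""
    if channel_idle and buffer_ > 0 and battery >= e_tr:
        return "transmit"
    return "stand_by"
```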

Simulation results
Convergence. We first compare the convergence of the learning algorithm and the 1-persistent algorithm. We evaluate the scenario with three jammers, and the simulation results are shown in Fig. 1. The graph shows that the learning algorithm converges after around 9×10^4 iterations; the throughput converges to 0.09 on average. Compared to an ST adopting the 1-persistent algorithm, the average throughput is nearly tripled.
Optimality. Since the primary network is time-variant, the effectiveness of energy harvesting may change. To mimic this impact, we vary the energy harvest probabilities of the ST and jammers; to make the comparison fair, the ST and the jammers are given the same harvest probability. The performance metrics we evaluate are the average throughput and the average update delay of the system. As shown in Fig. 2, the average throughput of the ST increases as the energy harvesting probability rises. Fig. 3 shows that the average update delay of the learning algorithm decreases as the energy harvesting probability rises, while the delay of the 1-persistent algorithm decreases significantly but increases slightly in the end; the system adopting the learning algorithm benefits remarkably from the deception mechanism. The learning algorithm also stands out in throughput: the gap between the learning algorithm and the 1-persistent algorithm widens as the energy harvesting probability rises. When almost every device can harvest wireless RF energy in every slot, the average update delay without a deception strategy increases a bit, since the jammers jam more time slots and the ST has fewer opportunities to transmit, increasing the delay.

Conclusion
In this paper, we have investigated a secondary wireless sensor network operating on a channel licensed to the primary network while harvesting energy from ambient wireless RF signals. The STs recharge wirelessly and use the harvested energy to transmit. A learning algorithm is proposed in this work to ease the tension caused by jamming attacks and improve the performance metrics. The learning method not only helps the ST defend against jamming attacks but also helps it execute proper actions online to benefit the overall throughput. Simulations reveal that the proposed learning algorithm is convergent and helps the ST maximize the overall throughput compared to the traditional 1-persistent policy. Since the network may have an ample state space, directly calculating ∇χ(Φ) is impractical and inefficient; the proposed strategy instead estimates ∇χ(Φ) from simulation and updates Φ accordingly.

Fig. 1 .
Fig. 1. The average throughput of the learning algorithm and the 1-persistent algorithm under attacks with three jammers.

Fig. 2 .
Fig. 2. The average throughput of the ST under various energy harvesting probability settings.

Fig. 3 .
Fig. 3. The average update delay in time slots of the ST under various energy harvesting probability settings.