Minimax Normal Two-Armed Bandit with Indefinite Control Horizon

We consider the two-armed bandit problem as applied to data processing when two alternative processing methods with different a priori unknown efficiencies are available. One should determine the more effective method and ensure its predominant application. The total number of data items, which is interpreted as the control horizon, is assumed to have an a priori known probability distribution. The problem is considered in the minimax (robust) setting. According to the main theorem of the theory of games, the minimax risk and minimax strategy are sought as the Bayesian ones corresponding to the worst-case prior distribution. We describe the properties of the worst-case prior and present a recursive Bellman-type equation for determining both the minimax strategy and the minimax risk. Numerical results illustrating the proposed algorithm are given. The algorithm can be applied to the optimization of parallel data processing when the number of data items to be processed is not known precisely in advance.


Introduction
We consider the two-armed bandit problem (see, e.g., [1], [2]), which is also well known as the problem of expedient behavior in a random environment (see, e.g., [3], [4]) and the problem of adaptive control (see, e.g., [5], [6]), in the following setting. Let ξ_n, n = 1, ..., N, be a controlled random process whose values are interpreted as incomes, depend only on the currently chosen actions y_n (y_n ∈ {1, 2}), and have normal probability distribution densities f(x|m_ℓ) = (2π)^(−1/2) exp(−(x − m_ℓ)²/2) if y_n = ℓ (ℓ = 1, 2). The normal two-armed bandit is thus described by a vector parameter θ = (m_1, m_2). The set of parameters is assumed to be Θ = {θ : |m_1 − m_2| ≤ 2C} with some C < ∞. A control strategy σ at the point of time n assigns a random choice of the action y_n depending on the current history of the process, i.e., the responses x^(n−1) = (x_1, ..., x_(n−1)) to the applied actions y^(n−1) = (y_1, ..., y_(n−1)): σ_ℓ(y^(n−1), x^(n−1)) = Pr(y_n = ℓ | y^(n−1), x^(n−1)), ℓ = 1, 2. The set of strategies is denoted by Σ.
Generally, the goal is to maximize (in some sense) the total expected income. In this article, we consider the minimax approach. If the parameter θ is known, then the optimal strategy should always apply the action corresponding to the larger of m_1, m_2; the total expected income would thus be equal to N(m_1 ∨ m_2). If the parameter is unknown, then the loss function

L_N(σ, θ) = N(m_1 ∨ m_2) − E_{σ,θ}(Σ_{n=1}^N ξ_n)    (1)

describes the expected losses of total income with respect to its maximal possible value due to incomplete information. (E-mail: Alexander.Kolnogorov@novsu.ru.)
Here E_{σ,θ} denotes the mathematical expectation calculated with respect to the measure generated by a strategy σ and a parameter θ. According to the minimax approach, the maximal value of the loss function on the set of parameters Θ should be minimized over the set of strategies Σ. The value

R_N^M(Θ) = inf_{σ∈Σ} sup_{θ∈Θ} L_N(σ, θ)    (2)

is called the minimax risk, and the corresponding optimal strategy σ^M is called the minimax strategy. Note that application of the minimax strategy ensures that the inequality L_N(σ^M, θ) ≤ R_N^M(Θ) holds for all θ ∈ Θ, and this means the robustness of the control. The minimax approach to the problem was proposed by H. Robbins in [7] for the so-called Bernoulli two-armed bandit. It is described by binary incomes {0, 1} and the probability distribution Pr(ξ_n = 1 | y_n = ℓ) = p_ℓ, Pr(ξ_n = 0 | y_n = ℓ) = q_ℓ, p_ℓ + q_ℓ = 1, ℓ = 1, 2. The Bernoulli two-armed bandit is described by a parameter θ = (p_1, p_2) with the set of values Θ = {θ : 0 ≤ p_ℓ ≤ 1; ℓ = 1, 2}. The article [7] attracted significant interest to the considered problem. It was shown in [8] that explicit determination of the minimax strategy and minimax risk is practically impossible already for N > 4. However, the asymptotic minimax theorem was proved by W. Vogel in [9]; it states that the minimax risk has the order N^(1/2) and provides lower and upper estimates for the factor. This theorem holds true for the normal two-armed bandit as well.
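To make the loss function concrete, the following sketch Monte-Carlo-estimates L_N(σ, θ) for a naive "explore then commit" strategy (not the minimax strategy studied in the paper; the function name, the strategy, and all parameters are illustrative assumptions):

```python
import random

def simulate_loss(m1, m2, N, n0=5, trials=2000, seed=0):
    """Monte Carlo estimate of L_N(sigma, theta) = N*(m1 v m2) - E sum xi_n
    for a simple strategy: pull each arm n0 times, then commit to the
    empirically better arm.  Incomes are N(m_l, 1) as in the paper."""
    rng = random.Random(seed)
    best = N * max(m1, m2)          # oracle income N*(m1 v m2)
    total = 0.0
    for _ in range(trials):
        income = 0.0
        s = [0.0, 0.0]              # total income of each arm so far
        # exploration phase: n0 pulls of each arm
        for arm, m in ((0, m1), (1, m2)):
            for _ in range(n0):
                x = rng.gauss(m, 1.0)
                s[arm] += x
                income += x
        # exploitation phase: commit to the empirically better arm
        m_hat = m1 if s[0] >= s[1] else m2
        for _ in range(N - 2 * n0):
            income += rng.gauss(m_hat, 1.0)
        total += income
    return best - total / trials
```

The estimated loss is positive whenever m_1 ≠ m_2, reflecting the price of incomplete information.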
A very popular approach to the problem is the Bayesian one. Let λ(θ) be a prior probability density. The value

R_N^B(λ) = inf_{σ∈Σ} ∫_Θ L_N(σ, θ) λ(θ) dθ    (3)

is called the Bayesian risk, and the corresponding optimal strategy is called the Bayesian strategy. The Bayesian approach makes it possible to find the Bayesian strategy and risk by solving a recursive Bellman-type equation.
According to the main theorem of the theory of games, the minimax risk (2) is equal to the Bayesian risk (3) calculated with respect to the worst-case prior distribution, i.e.,

R_N^M(Θ) = sup_λ R_N^B(λ).    (4)
This approach allows one to obtain the following asymptotic estimate for the normal two-armed bandit (see [10]–[13]):

R_N^M(Θ) ≈ R_0 N^(1/2)    (5)

with R_0 ≈ 0.637. Let's explain the choice of the normal distribution of incomes. We consider the problem as applied to group control of processing a large amount of data. Let T = NM data items be given that can be processed using either of two alternative methods. The result of processing the t-th item is ζ_t = 1 if processing is successful and ζ_t = 0 otherwise; the success probabilities Pr(ζ_t = 1) = p_ℓ (ℓ = 1, 2) depend only on the selected methods (actions). Let's assume that p_1, p_2 are close to some p (0 < p < 1). We partition the data into N packets of M items each and use the same method for processing all data in the same packet. For the control, we use the values of the process ξ_n, n = 1, ..., N, obtained by centering and normalizing the numbers of successes in the packets. According to the central limit theorem, the distributions of ξ_n are close to normal and their variances are close to unity in the considered setting.
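The packet construction can be sketched as follows; normalizing by the nominal success probability p (assumed known here) is an illustrative choice, since the paper only requires the variances to be close to unity:

```python
import random

def packet_process(p, N, M, seed=1):
    """Form xi_n = (S_n - M*p) / sqrt(M*p*(1-p)), where S_n is the number
    of successes among the M Bernoulli(p) outcomes of the n-th packet.
    By the CLT each xi_n is approximately normal with variance close to 1."""
    rng = random.Random(seed)
    scale = (M * p * (1 - p)) ** 0.5
    xs = []
    for _ in range(N):
        successes = sum(1 for _ in range(M) if rng.random() < p)
        xs.append((successes - M * p) / scale)
    return xs
```

For M in the hundreds the empirical variance of the ξ_n is already close to unity, which justifies treating the packet-level problem as a normal two-armed bandit.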
Note that the data in the same packet may be processed in parallel. However, there is a question of losses in control performance as a result of such aggregation. It was shown in [10]–[13] that if N is large enough (e.g., N ≥ 30), then parallel control is close to optimal. Therefore, say, 30000 items of data can be processed in 30 steps by packets of 1000 items with almost the same maximal losses as if the data were processed optimally one by one.
Remark 1. There are some different approaches to robust control in the two-armed bandit problem; see, e.g., [6], [14]–[16]. In these articles, the stochastic approximation method and the mirror descent algorithm are used for the control. Instead of the minimax risk, the authors often consider the equivalent notion called the guaranteed rate of convergence. The order of the minimax risk for these algorithms is N^(1/2) or close to N^(1/2). However, more precise estimates were not presented for these algorithms, and versions for parallel processing were not proposed either.
Remark 2. Parallel control for the two-armed bandit problem was first proposed for the problem of treating a large group of patients with either of two drugs of different unknown efficiencies. A discussion and bibliography of the problem can be found in [17].
The goal of this paper is to investigate the robust control of the normal two-armed bandit with an indefinite horizon. The structure of the paper is the following. In Section 2 we set up the control problem, i.e., define the loss function and the corresponding minimax and Bayesian risks and strategies when there is an a priori known probability distribution on the set of horizons; we also state the main theorem of the theory of games in this case. In Section 3 we analyze the properties of the worst-case prior distribution and present the recursive Bellman-type equation for calculating the Bayesian risk with respect to this worst-case prior, as well as a recursive Bellman-type equation for calculating the expected losses. In Section 4 the results of numerical experiments are presented. Section 5 contains a conclusion.

Setup of the Control Problem with Indefinite Horizon
The disadvantage of the approach considered in Section 1 is that the control horizon N is fixed. However, it is often more natural to consider the problem with an indefinite control horizon. Let {N_i} = {N_i; i = 1, ..., I} be a finite set of control horizons such that 2 ≤ N_1 < ... < N_I = N. Let's define the appropriate loss function as

L(σ, θ, {N_i}) = Σ_{i=1}^I β_i L_{N_i}(σ, θ),    (6)

where L_{N_i}(σ, θ), i = 1, ..., I, are defined in (1). The factors β_i, i = 1, ..., I, may be chosen arbitrarily. However, it is natural to choose them so that all {β_i L_{N_i}(σ, θ)} have approximately equal maximal values; taking into account (5), one can choose β_i ∝ N_i^(−1/2), i = 1, ..., I. If in addition the condition Σ_{i=1}^I β_i = 1 holds, then one can say that the prior distribution Pr(N = N_i) = β_i, i = 1, ..., I, is assigned on the set of control horizons.
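The choice of normalized weights β_i ∝ N_i^(−1/2) can be sketched in a few lines (the function name is illustrative):

```python
def horizon_weights(horizons):
    """Weights beta_i proportional to N_i**(-1/2), normalized so that
    sum(beta_i) = 1, i.e. a prior distribution on the set of horizons."""
    raw = [n ** -0.5 for n in horizons]
    total = sum(raw)
    return [w / total for w in raw]
```

For example, horizons 4, 16, 64 get weights 4/7, 2/7, 1/7: each doubling of the horizon halves the weight, so that the β_i L_{N_i} terms stay comparable in size.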
Given n satisfying the condition N_(i−1) < n ≤ N_i (with N_0 = 0), let's put γ_n = Σ_{j=i}^I β_j, i.e., γ_n is the probability that the control has not yet stopped at step n. Using (1) and (6), one obtains

L(σ, θ, {N_i}) = E_{σ,θ}(Σ_{n=1}^N γ_n ((m_1 ∨ m_2) − ξ_n)).    (7)

In the sequel we'll denote L̃_N(σ, θ) = L(σ, θ, {N_i}). The corresponding minimax and Bayesian risks are defined as follows:

R̃_N^M(Θ) = inf_{σ∈Σ} sup_{θ∈Θ} L̃_N(σ, θ),    (8)
R̃_N^B(λ) = inf_{σ∈Σ} ∫_Θ L̃_N(σ, θ) λ(θ) dθ.    (9)

Taking into account (7) and using reasoning similar to that in [10], one can prove that the main theorem of the theory of games holds in the considered setting just as in the case of a definite horizon. It means that the minimax risk (8) can be determined as the Bayesian risk (9) calculated with respect to the worst-case prior distribution, i.e.,

R̃_N^M(Θ) = sup_λ R̃_N^B(λ),    (10)
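The reduction behind (7) — the paper's remark that the problem reduces to a two-armed bandit with discounted incomes — rests on the identity Σ_i β_i L_{N_i} = Σ_n γ_n l_n, where γ_n = Σ_{i: N_i ≥ n} β_i and l_n is the expected one-step loss at step n. A small numerical check of this identity (the per-step losses below are arbitrary hypothetical numbers):

```python
def discount_weights(horizons, betas):
    """gamma_n = sum of beta_i over horizons N_i >= n, i.e. the probability
    that the control has not yet stopped at step n, for n = 1..max(horizons)."""
    N = max(horizons)
    return [sum(b for h, b in zip(horizons, betas) if h >= n)
            for n in range(1, N + 1)]

# Check: sum_i beta_i * L_{N_i} equals the gamma-discounted sum of one-step losses.
horizons, betas = [3, 5], [0.4, 0.6]
l = [1.0, 0.5, 0.25, 0.125, 0.0625]   # hypothetical per-step expected losses l_n
lhs = sum(b * sum(l[:h]) for h, b in zip(horizons, betas))  # weighted fixed-horizon losses
gam = discount_weights(horizons, betas)                     # here: 1, 1, 1, 0.6, 0.6
rhs = sum(g, x) if False else sum(g * x for g, x in zip(gam, l))
```

Both sides evaluate to the same number, confirming that the indefinite-horizon loss is a discounted sum with discount sequence γ_n.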
and the minimax strategy is equal to the corresponding Bayesian strategy as well. The Bayesian risk can be calculated recursively. Denote by

f_D(x|M) = (2πD)^(−1/2) exp(−(x − M)²/(2D))

the normal distribution density with mathematical expectation M and variance D. Denote by λ(m_1, m_2) the prior distribution density on the set of parameters Θ. Let the history of the control up to the instant of time n be described by (X_1, n_1, X_2, n_2). Here n_1, n_2 are the total numbers of applications of both actions (n_1 + n_2 = n) and X_1, X_2 are the corresponding total incomes. Let X_ℓ = 0 if n_ℓ = 0 (ℓ = 1, 2). The posterior distribution density is thus equal to

λ(m_1, m_2 | X_1, n_1, X_2, n_2) = f_{n_1}(X_1|n_1 m_1) f_{n_2}(X_2|n_2 m_2) λ(m_1, m_2) / ∫∫_Θ f_{n_1}(X_1|n_1 m_1) f_{n_2}(X_2|n_2 m_2) λ(m_1, m_2) dm_1 dm_2.    (11)

If additionally it is assumed that f_n(X|nm) = 1 at n = 0, then this expression holds true if n_1 = 0 and/or n_2 = 0 as well.
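For n_ℓ ≥ 1 the factor f_{n_ℓ}(X_ℓ|n_ℓ m_ℓ) in (11) is, up to normalization, a normal density in m_ℓ with mean X_ℓ/n_ℓ and variance 1/n_ℓ. A minimal sketch of this marginal posterior under a flat (improper) prior — a simplifying assumption for illustration, since the paper combines it with the prior λ:

```python
import math

def posterior_density(m, X, n):
    """Posterior density of an arm's mean m after n unit-variance observations
    with total income X, under a flat prior: the N(X/n, 1/n) density.
    Valid for n >= 1 only (the n = 0 convention of the paper is not needed here)."""
    return math.sqrt(n / (2 * math.pi)) * math.exp(-n * (m - X / n) ** 2 / 2)
```

As n_ℓ grows, this posterior concentrates around the empirical mean X_ℓ/n_ℓ at the usual 1/sqrt(n_ℓ) rate, which is what drives the recursive risk calculation.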

3 Recursive Bellman-type Equation for Calculation of the Bayesian Risk and Expected Losses

Denote by R̃^B_{N−n}(λ; X_1, n_1, X_2, n_2) the Bayesian risk at the latter (N − n) steps, calculated with respect to the posterior distribution density (11). It satisfies the recursive Bellman-type equation

R̃^B_{N−n}(λ; X_1, n_1, X_2, n_2) = min(R̃^(1)_{N−n}(λ; X_1, n_1, X_2, n_2), R̃^(2)_{N−n}(λ; X_1, n_1, X_2, n_2)),    (12)

where R̃^(ℓ)_{N−n}, ℓ = 1, 2, is the Bayesian risk incurred if at the (n+1)-th step action ℓ is applied and the control is then continued optimally; it equals the sum of the discounted expected one-step losses and the averaged Bayesian risk R̃^B_{N−n−1} at the remaining steps (13). The Bayesian strategy prescribes to currently choose the action corresponding to the smaller of R̃^(1), R̃^(2); the choice may be arbitrary if these values are equal.

Consider the following transformations of the prior distribution density:
1. λ^(1)(m_1, m_2) = λ(m_2, m_1) (for all m_1, m_2). This property means that the expected losses do not change if one swaps the arms of the bandit.
2. λ^(2)(m_1, m_2) = λ(m_1 + m, m_2 + m) (for all m_1, m_2 and any fixed m). This property means that the expected losses do not change if one equally shifts both mathematical expectations.

Both transformations leave the Bayesian risk unchanged. So, if λ is the worst-case prior distribution, then αλ + (1 − α)λ̃, where λ̃ is obtained from λ by either of these transformations and 0 ≤ α ≤ 1, is a worst-case prior as well. It means that the worst-case prior distribution does not change if the above transformations are implemented.

In the sequel it is convenient to modify the parameterization. Let the prior distribution density be ν(u, v) = 2λ(u + v, u − v), i.e., u = (m_1 + m_2)/2, v = (m_1 − m_2)/2. Then the following transformations of the prior distribution density, ν^(1)(u, v) = ν(u, −v) and ν^(2)(u, v) = ν(u + m, v) (for any fixed m), do not change the value of the Bayesian risk. These properties allow one to describe the worst-case prior. Namely, asymptotically the worst-case prior distribution density can be chosen as

ν(u, v) = κ_a(u) ρ(v),    (14)

where κ_a(u) is the uniform density on the interval |u| ≤ a, ρ(v) is a symmetric density (i.e., ρ(−v) = ρ(v)) on the interval |v| ≤ C, and a → ∞. This prior does not change under the first transformation and asymptotically (as a → ∞) does not change under the second one. Now let's write the dynamic programming equation for calculating the Bayesian risk with respect to (14). This equation follows from (11)–(13) if the prior distribution density is formally assumed to be constant with respect to u, which gives true expressions for the posterior densities if n_1 ≥ 1, n_2 ≥ 1. At the first two steps the actions should be chosen turn by turn. Note that the equation is simpler for the risks R̃_{n_1,n_2}(X_1, X_2) = R̃^B_{N−n}(X_1, n_1, X_2, n_2) p_{n_1,n_2}(X_1, X_2), where p_{n_1,n_2}(X_1, X_2) is the corresponding density of the history.

Conclusion

The minimax approach to the two-armed bandit problem is considered with an a priori assigned distribution on the control horizon. The problem can be reduced to the classical two-armed bandit problem with discounted incomes. An algorithm for determining the minimax strategy and minimax risk as the Bayesian ones corresponding to the worst-case prior distribution is obtained, and numerical results are presented.