# Asymptotically lower bound for stocastic multi-armed bandit problem

## 1. Notations

(See wikipedia for complete settings)

• $k$ arms
• for a fixed strategy, $I_j$ denotes the arm pulled at time $j$
• for arm $j$:
• unknown reward distribution $P_{\theta_j}$ for $\theta_j \in \Theta$
• reward at time $t$: $X_{j, t} \sim P_{\theta_j}$
• mean reward: $\mu_j = \E{X_{j,1}}$
• mean of the best arm: $\mu^{*} = \max_j \mu_j$
• gap: $\Delta_j = \mu^{*} - \mu_j$
• number of plays until time $t$: $T_j(t) = \sum_{s=1}^t \I{I_t=j}$
• regret with time horizen $n$: $R_n = n\mu^{*} - \sum_{j=1}^k \E{T_j(n)} \Delta_j$

## 2. Lower bound

To derive the lower bound , let's consider two sequences of $\theta$ values, i.e.:

• $\theta = (\theta_1, \theta_2, ..., \theta_k)$ with $\mu_1 > \mu_2 \ge \mu_3 \ge ... \ge \mu_k$
• $\theta' = (\theta_1, \theta_2', ..., \theta_k)$ with $\mu_2' > \mu_1 \ge \mu_3 \ge ... \ge \mu_k$

Later, we will carry out the deriviation by making $\theta_2'$ close to $\theta_1$.

Now, fix a strategy and denote $\mathbb{P}$ the joint distribution over $\{X_{j,s}, I_t\}$ under distribution $P_{\theta}$, while $\mathbb{P}'$ denotes the corresponding joint distribution under $P_{\theta'}$.

For an event $A \subset \{T_2(n) = m \}$ where $m$ is an arbitrary constant, we have \begin{aligned} \P'(A) &= \int_A \prod_{s=1}^{m} \frac{\diff P_{\theta_2}'}{\diff P_{\theta_2}'}(X_{2,s}) \diff \P \\ &= \int_A e^{-L(m)} \diff \P \tag{1} \end{aligned} , where $L(m) = -\sum_{s=1}^{m} \log\frac{\diff P_{\theta_2}'}{\diff P_{\theta_2}}(X_{2,s})$.

From (1), we see that if $A \subset \{T_2(n) = m, L(m) \le c_n\}$ for some constant $c_n$, then $\P'(A) \ge e^{-c_n}\P(A)$, i.e.: $$\P(A) \le e^{c_n}\P'(A)$$

With the above, now let's try to bound the probability that arm 2 is pulled less than $f_n$ times under $\P$, where $f_n$ is another constant: \begin{aligned} & \P(T_2(n) < f_n) \\ \le{}& \P(T_2(n) < f_n, L(T_2(n)) \le c_n) + \P(T_2(n) < f_n, L(T_2(n)) > c_n) \\ \le{}& e^{c_n} \underbrace{\P'(T_2(n) < f_n)}_A + \underbrace{\P(T_2(n) < f_n, L(T_2(n)) > c_n)}_B \end{aligned}

We are going to show that with properly chosen $f_n$ and $c_n$, the above goes to $0$ when $n \to \infty$. Here $A$ means the best arm is not chosen often under $\P'$, and $B$ means the samples from $\P$ is not likely from $\P'$. We will bound $A$ and $B$ separately.

We first bound $A$. Note that arm $2$ is the best arm under $\P'$, thus if a policy has a small regret on every instances, i.e.: $$\mathbb{E}'\{n - T_2(n)\} = o(n^{\alpha}), \quad \forall\; \alpha > 0$$ Assuming $f_n=o(n)$, by Markov's inequality, we have $$\P'(T_2(n) < f_n) \le \frac{\mathbb{E}'\{n - T_2(n)\}}{n - f_n} = o(n^{\alpha - 1})$$

It remains to bound $B$: \begin{aligned} B &= \P(T_2(n) < f_n, L(T_2(n)) > c_n) \\ &\le \P(L(f_n) > c_n) \\ &= \P(\frac{L(f_n)}{f_n} > \frac{c_n}{f_n}) \end{aligned} Because $$\E{\frac{L(f_n)}{f_n}} = \E{-\log\frac{\diff P_{\theta_2}'}{\diff P_{\theta_2}}(X_{2,s})} = \int \log\frac{\diff P_{\theta_2}}{\diff P_{\theta_2}'}(X_{2,s}) \diff \P_{\theta_2} = D_{\text{KL}}(\P_{\theta_2}, \P_{\theta_2'})$$ if we choose $\frac{c_n}{f_n}$ to be a bit larger than $\E{\frac{L(f_n)}{f_n}}$: $$\frac{c_n}{f_n} = (1-\delta)D_{\text{KL}}(\P_{\theta_2}, \P_{\theta_2'})$$ then by the strong law of large number, we have $B \to 0$, hence \begin{aligned} \P(T_2(n) < f_n) \le& e^{c_n} A + B \to e^{c_n} o(n^{\alpha-1}) \end{aligned}

Finally, let $c_n = (1-\delta_1)\log n$ and $f_n = \frac{(1-\delta_2)\log n}{D_{\text{KL}}(\P_{\theta_2}, \P_{\theta_2'})}$ where $\delta_2 > \delta_1 > 0$, we obtain $$\liminf_{n\to\infty} \E{T_2(n)} \ge \frac{\log n}{D_{\text{KL}}(\P_{\theta_2}, \P_{\theta_2'})}$$

## References

1. Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1), 4-22.
Posted by Qianfan Zhang on