On the optimality of the Hedge algorithm in the stochastic regime
Introduction
The standard setting of prediction with expert advice (Littlestone and Warmuth, 1994; Freund and Schapire, 1997; Vovk, 1998; Cesa-Bianchi and Lugosi, 2006) aims to provide sound strategies for sequential prediction that combine the forecasts from different sources. More precisely, in the so-called Hedge problem (Freund and Schapire, 1997), at each round the learner has to output a probability distribution on a finite set of experts {1, . . . , M}; the losses of the experts are then revealed, and the learner incurs the expected loss under its chosen probability distribution. The goal is to control the regret, defined as the difference between the cumulative loss of the learner and that of the best expert (the one with smallest cumulative loss). This online prediction problem is typically considered in the individual sequences framework, where the losses may be arbitrary and in fact set by an adversary that seeks to maximize the regret. This leads to regret bounds that hold under virtually no assumption (Cesa-Bianchi and Lugosi, 2006).

In this setting, arguably the simplest and most standard strategy is the Hedge algorithm (Freund and Schapire, 1997), also called the exponentially weighted average forecaster (Cesa-Bianchi and Lugosi, 2006). This algorithm depends on a time-varying parameter ηt called the learning rate, which quantifies by how much the algorithm departs from its initial probability distribution to put more weight on the currently leading experts. Given a known finite time horizon T, the standard tuning of the learning rate is fixed and given by ηt = η ∝ √(log(M)/T), which guarantees an optimal worst-case regret of order O(√(T log M)). Alternatively, when T is unknown, one can set ηt ∝ √(log(M)/t) at round t, which leads to an anytime O(√(T log M)) regret bound valid for all T ≥ 1.

While worst-case regret bounds are robust and always valid, they turn out to be overly pessimistic in some situations. A recent line of research (Cesa-Bianchi et al., 2007; de Rooij et al., 2014; Gaillard et al., 2014; Koolen et al., 2014; Sani et al., 2014; Koolen and van Erven, 2015; Luo and Schapire, 2015) designs algorithms that combine O(√(T log M)) worst-case regret guarantees with an improved regret on easier instances of the problem. An interesting example of such an easier instance is the stochastic problem, where it is assumed that the losses are stochastic and that at each round the expected loss of a “best” expert is smaller than those of the other experts by some gap ∆. Such algorithms rely either on a more careful, data-dependent tuning of the learning rate ηt (Cesa-Bianchi et al., 2007; de Rooij et al., 2014; Koolen et al., 2014; Gaillard et al., 2014), or on more sophisticated strategies (Koolen and van Erven, 2015; Luo and Schapire, 2015). As shown by Gaillard et al. (2014) (see also Koolen et al., 2016), one particular type of adaptive regret bound (so-called second-order bounds) implies at the same time an O(√(T log M)) worst-case bound and a better constant O(log(M)/∆) bound in the stochastic problem with gap ∆. Arguably starting with the early work on second-order bounds (Cesa-Bianchi et al., 2007), the design of online learning algorithms that combine robust worst-case guarantees with improved performance on easier instances has been an active research goal in recent years (de Rooij et al., 2014; Gaillard et al., 2014; Koolen et al., 2014; Sani et al., 2014).
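To fix ideas, the following sketch (in Python, with the proportionality constants in the learning rates set to 1; it is only an illustration of the update rule just described, not code taken from the paper) spells out the exponential-weights forecaster together with the two learning-rate tunings discussed above.

```python
import numpy as np

def hedge_regret(losses, eta_schedule):
    """Run Hedge on a (T, M) array of expert losses and return the learner's regret.

    At round t the learner plays the exponential-weights distribution based on the
    experts' cumulative losses over the first t - 1 rounds, with learning rate
    eta_schedule(t), and incurs the expected loss under that distribution.
    """
    T, M = losses.shape
    cum = np.zeros(M)      # cumulative losses of the experts
    learner = 0.0          # cumulative expected loss of the learner
    for t in range(1, T + 1):
        # Shift by the minimum for numerical stability; this leaves the weights unchanged.
        w = np.exp(-eta_schedule(t) * (cum - cum.min()))
        p = w / w.sum()
        learner += p @ losses[t - 1]
        cum += losses[t - 1]
    return learner - cum.min()

def decreasing_eta(M):
    """Anytime tuning of Decreasing Hedge: eta_t = sqrt(log(M) / t)."""
    return lambda t: np.sqrt(np.log(M) / t)

def fixed_horizon_eta(M, T):
    """Fixed-horizon tuning: constant eta = sqrt(log(M) / T)."""
    return lambda t: np.sqrt(np.log(M) / T)
```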
However, to the best of our knowledge, existing work on the Hedge problem has focused on developing new adaptive algorithms rather than on analyzing the behavior of “conservative” algorithms in favorable scenarios. Since the standard Hedge algorithm is designed for, and analyzed in, the adversarial setting (Littlestone and Warmuth, 1994; Freund and Schapire, 1997; Cesa-Bianchi and Lugosi, 2006), and since its parameters are not tuned adaptively to obtain better bounds on easier instances, it may be considered overly conservative and ill-suited to stochastic environments.
Our contribution
This work fills a gap in the existing literature by providing an analysis of the standard Hedge algorithm in the stochastic setting. We show that the anytime Hedge algorithm with its default learning rate ηt ∝ √(log(M)/t), which we refer to as Decreasing Hedge, actually adapts to the stochastic setting: it achieves an optimal constant O(log(M)/∆) regret bound without any dedicated tuning for the easier instance, which might be surprising at first sight. This contrasts with previous works, which require the construction of new adaptive (and more involved) algorithms. Remarkably, this property is not shared by the variant of Hedge for a known fixed horizon T with constant learning rate η ∝ √(log(M)/T), which suffers a Θ(√(T log M)) regret even on such easier instances. This exhibits a sharp difference between the performance of the anytime and fixed-horizon variants of the Hedge algorithm.

Given the aforementioned adaptivity of Decreasing Hedge, one may wonder whether there is in fact any benefit in using more sophisticated algorithms in the stochastic regime. We answer this question affirmatively, by considering a more refined measure of the complexity of a stochastic instance than the gap ∆. Specifically, we show that Decreasing Hedge does not admit improved regret under Bernstein conditions, which are standard low-noise conditions from statistical learning (Mammen and Tsybakov, 1999; Tsybakov, 2004; Bartlett and Mendelson, 2006). By contrast, it was shown by Koolen et al. (2016) that algorithms satisfying certain adaptive adversarial regret bounds achieve improved regret under Bernstein conditions. Finally, we characterize the behavior of Decreasing Hedge in the stochastic regime, by showing that its eventual regret on any stochastic instance is governed by the gap ∆.
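As an illustration of such a stochastic instance with a gap, one can draw i.i.d. Bernoulli losses in which a single expert has a smaller mean than all the others. The sketch below (reusing hedge_regret, decreasing_eta and fixed_horizon_eta from the earlier code sketch; the means, gap, horizon and seed are arbitrary illustrative choices, not those of the experiments in Section 4.5) compares the two tunings on one such instance.

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, gap = 10, 20_000, 0.1

# One expert with mean loss 0.4, the M - 1 others with mean 0.4 + gap (illustrative values).
means = np.full(M, 0.4 + gap)
means[0] = 0.4
losses = rng.binomial(1, means, size=(T, M)).astype(float)

# On this instance, Decreasing Hedge should incur a bounded regret of order log(M)/gap,
# whereas the fixed-horizon tuning is expected to pay a regret of order sqrt(T log M).
print("Decreasing Hedge:   ", hedge_regret(losses, decreasing_eta(M)))
print("Fixed-horizon Hedge:", hedge_regret(losses, fixed_horizon_eta(M, T)))
```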
Related work
In the bandit setting, where the feedback only consists of the loss of the selected action, there has also been some interest in “best-of-both-worlds” algorithms that combine an optimal O(√(MT)) worst-case regret in the adversarial regime with an improved O(M log T) regret (up to logarithmic factors) in the stochastic case (Bubeck and Slivkins, 2012; Seldin and Slivkins, 2014; Auer and Chiang, 2016). In particular, Seldin and Slivkins (2014) and Seldin and Lugosi (2017) showed that by augmenting the standard EXP3 algorithm for the adversarial regime (an analogue of Hedge with Θ(1/√t) learning rate) with a special-purpose gap-detection mechanism, one can achieve poly-logarithmic regret in the stochastic case. This result is strengthened in recent follow-up work (Zimmert and Seldin, 2019; Zimmert et al., 2019), which appeared after the completion of the first version of the present work and obtains optimal regret in both the stochastic and adversarial regimes through a variant of the Follow-The-Regularized-Leader (FTRL) algorithm with Θ(1/√t) learning rate and a proper choice of regularizer. This result can be seen as a bandit analogue of our upper bound for Decreasing Hedge. On the other hand, in the bandit setting, the hardness of an instance is essentially characterized by the gap ∆ (Bubeck and Cesa-Bianchi, 2012); in particular, the Bernstein condition, which depends on the correlations between the losses of the experts, cannot be exploited under bandit feedback, where one observes only one arm at each round. Hence, the negative part of our results (on the limitations of Hedge) does not appear to have an analogue in the bandit case.

A similar adaptivity result for FTRL with decreasing Θ(1/√t) learning rate was observed in a different context by Huang et al. (2017). Specifically, they show that, for online linear optimization on a Euclidean ball, FTRL with squared-norm regularizer and learning rate Θ(1/√t) achieves O(log T) regret when the loss vectors are i.i.d. This result is an analogue of our upper bound for Hedge, since Hedge corresponds to FTRL on the simplex with entropic regularizer (Cesa-Bianchi and Lugosi, 2006; Hazan, 2016), a correspondence made explicit below. On the other hand, the simplex lacks the curvature of the Euclidean ball, which is important for achieving small regret; here, the improved regret is ensured by a condition on the distribution, namely the existence of a gap ∆. Our lower bound for Hedge shows that this condition is necessary, thereby characterizing the long-term regret of FTRL on the simplex with entropic regularizer. In the case of the Euclidean ball with squared-norm regularizer, the norm of the expected loss vector appears to play a similar role, as shown by the upper bound of Huang et al. (2017).
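For reference, the correspondence between Hedge and FTRL on the simplex with entropic regularizer can be written out as follows (the notation below is ours): writing $L_{t-1,i}$ for the cumulative loss of expert $i$ after $t-1$ rounds, the exponential-weights distribution played by Hedge at round $t$ solves the entropic-regularized linearized problem,
\[
p_t \;=\; \operatorname*{arg\,min}_{p \,\in\, \mathcal{P}_M} \Big\{ \langle p, L_{t-1} \rangle \;+\; \frac{1}{\eta_t} \sum_{i=1}^{M} p_i \log p_i \Big\},
\qquad
p_{t,i} \;=\; \frac{\exp(-\eta_t L_{t-1,i})}{\sum_{j=1}^{M} \exp(-\eta_t L_{t-1,j})},
\]
where $\mathcal{P}_M$ denotes the set of probability distributions over the $M$ experts.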
Outline. We define the setting of prediction with expert advice and the Hedge algorithm in Section 4.2, and recall therein its standard worst-case regret bound. In Section 4.3, we consider the behavior of the Hedge algorithm on easier instances, namely the stochastic setting with a gap ∆ on the best expert. Under an i.i.d. assumption on the sequence of losses, we provide in Theorem 4.1 an upper bound of order (log M)/∆ on the regret of Decreasing Hedge. In Proposition 4.2, we prove that the rate (log M)/∆ cannot be improved in this setting. In Theorem 4.2 and Corollary 4.1, we extend these regret guarantees to the “adversarial with a gap” setting, where a leading expert linearly outperforms the others. These results hold for any worst-case optimal Hedge algorithm whose learning rate is at least as large as that of Decreasing Hedge, namely of order √(log(M)/t). In Proposition 4.3, we prove the sub-optimality of the fixed-horizon Hedge algorithm and of another variant of Hedge based on the so-called “doubling trick”. In Section 4.4, we discuss the advantages of adaptive Hedge algorithms and explain the limitations of Decreasing Hedge compared to such variants. We include numerical illustrations of our theoretical findings in Section 4.5, conclude in Section 4.6, and provide the proofs in Section 4.7.