Aggregated Mondrian forests for online learning
Introduction
Introduced by Breiman (2001a), Random Forests (RF) is one of the algorithms of choice in many supervised learning applications. The appeal of these methods comes from their remarkable accuracy in a variety of tasks, the small number (or even the absence) of parameters to tune, their reasonable computational cost at training and prediction time, and their suitability in high-dimensional settings. Most commonly used RF algorithms, such as the original random forest procedure (Breiman, 2001a), extra-trees (Geurts et al., 2006), or conditional inference forests (Hothorn et al., 2010), are batch algorithms, which require the whole dataset to be available at once. Several online random forest variants have been proposed to overcome this issue and handle data that arrive sequentially. Utgoff (1989) was the first to extend Quinlan's ID3 batch decision tree algorithm (see Quinlan, 1986) to an online setting. Later on, Domingos and Hulten (2000) introduced Hoeffding Trees, which can be updated easily: since observations arrive sequentially, a cell is split when (i) enough observations have fallen into this cell and (ii) the best split in the cell is statistically relevant (a generic Hoeffding inequality being used to assess the quality of the best split). Since random forests are known to exhibit better empirical performance than individual decision trees, online random forests have been proposed (see, e.g., Saffari et al., 2009; Denil et al., 2013). These procedures aggregate several trees by computing the mean of the tree predictions (regression setting) or the majority vote among trees (classification setting). The tree construction differs from one forest to another but shares similarities with Hoeffding Trees: a cell is split once conditions (i) and (ii) above are satisfied.
One forest of particular interest for this work is the Mondrian Forest (Lakshminarayanan et al., 2014), based on the Mondrian process (Roy and Teh, 2009). Its construction differs from the one described above, since each new observation modifies the tree structure: instead of waiting for enough observations to fall into a cell before splitting it, the properties of the Mondrian process make it possible to update the Mondrian tree partition each time a sample is collected. Once a Mondrian tree is built, its prediction function uses a hierarchical prior over all subtrees, and the average of the predictions over all subtrees is computed with respect to this hierarchical prior using an approximation algorithm. The algorithm we propose, called AMF and illustrated in Figure 3.1 below on a toy binary classification dataset, differs from Mondrian Forest in the smoothing procedure used on each tree. While the hierarchical Bayesian smoothing proposed in Lakshminarayanan et al. (2014) requires approximations, the prior we choose allows for exact computation of the posterior distribution. The choice of this posterior is inspired by Context Tree Weighting (see, e.g., Willems et al., 1995; Willems, 1998; Helmbold and Schapire, 1997; Catoni, 2004), commonly used in lossless compression to aggregate all subtrees of a prespecified tree, which is both computationally efficient and theoretically sound. Since we are able to compute the posterior distribution exactly, our approach differs drastically from Bayesian trees (see, for instance, Chipman et al., 1998; Denison et al., 1998; Taddy et al., 2011) and from BART (Chipman et al., 2010), which implement MCMC methods to approximate posterior distributions on trees.
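As an illustration, the following minimal sketch conveys the split rule (i)-(ii) used by Hoeffding Trees and related online forests discussed above; it is not the rule used by AMF, and the parameter names (n_min, value_range, delta) are our own assumptions rather than those of a particular implementation.

import math

def should_split(n_samples, best_gain, second_gain, n_min=200,
                 value_range=1.0, delta=1e-6):
    """Hoeffding-style split decision: split a cell only when (i) enough
    samples have fallen into it and (ii) the advantage of the best candidate
    split over the runner-up exceeds a Hoeffding bound."""
    if n_samples < n_min:                       # condition (i)
        return False
    # Hoeffding bound on the estimation error of the gain, for gains taking
    # values in a range of length value_range, with confidence 1 - delta.
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n_samples))
    return (best_gain - second_gain) > eps      # condition (ii)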
The Context Tree Weighting algorithm has been applied to regression trees by Blanchard (1999) in the case of a fixed-design tree, in which splits are prespecified. This requires splitting the dataset into two parts (using the first part to select the best splits and the second to compute the posterior distribution) and having access to the whole dataset, since the tree structure needs to be fixed in advance. As noted by Rockova and van der Pas (2017), the theoretical study of Bayesian methods on trees (Chipman et al., 1998; Denison et al., 1998) or sums of trees (Chipman et al., 2010) is less developed. Rockova and van der Pas (2017) analyze a variant of Bayesian regression trees and sums of trees, and obtain near minimax optimal posterior concentration rates. Likewise, Linero and Yang (2018) analyze Bayesian sums of soft decision tree models, and establish minimax rates of posterior concentration for the resulting SBART procedure. Beyond the difference in framework (their results are posterior concentration rates under fixed design, as opposed to our regret and excess risk bounds), their approach differs from ours primarily in the chosen trade-off between computational complexity and adaptivity: these procedures involve approximate posterior sampling over large functional spaces through MCMC methods, and it is unclear whether the considered priors allow for reasonably efficient posterior computations. In particular, the prior used in Rockova and van der Pas (2017) is taken over all subsets of variables, which is exponentially large in the number of features.
The literature focusing on the original RF algorithm and its related variants is more extensive, even if the data-dependent nature of the algorithm and its numerous components (sampling procedure, split selection, aggregation) make the theoretical analysis difficult. The consistency of stylized RF algorithms was first established by Biau et al. (2008), and later obtained for more sophisticated variants in Denil et al. (2013) and Scornet et al. (2015). Note that consistency results do not provide rates of convergence, and hence only offer limited guidance on how to properly tune the parameters of the algorithm. Starting with Biau (2012) and Genuer (2012), some recent works have thus sought to quantify the speed of convergence of stylized variants of RF. Minimax optimal nonparametric rates were first obtained by Arlot and Genuer (2014) in dimension 1 for the Purely Uniformly Random Forests (PURF) algorithm, together with suboptimal rates in arbitrary dimension (when the number of features exceeds 1). Several recent works (Wager and Walther, 2015; Duroux and Scornet, 2018) also established rates of convergence for variants of RF that essentially amount to some form of Median Forests, where each node contains at least a fixed fraction of the observations of its parent. While valid in arbitrary dimension, the established rates are suboptimal. More recently, adaptive minimax optimal rates were obtained by Mourtada et al. (2018) (Chapter 2) in arbitrary dimension for the batch Mondrian Forests algorithm. Our proposed online algorithm, AMF, also achieves minimax rates in an adaptive fashion, namely without knowing the smoothness of the regression function.
Forests of aggregated Mondrian trees
We define in Section 3.2.1 the setting and notation that will be used throughout the chapter, together with the definition of the Mondrian process, introduced by Roy and Teh (2009), which is a key element of our algorithm. In Section 3.2.2, we explicitly describe the prediction function that we want to compute, and prove in Proposition 3.1 that the AMF algorithm described in Section 3.2.3 computes it exactly.
3.2.1 The setting, trees, forests and the Mondrian process
We are interested in an online supervised learning problem, in which we assume that the dataset is not fixed in advance. In this scenario, we are given an i.i.d. sequence $(x_1, y_1), (x_2, y_2), \ldots$ of $[0, 1]^d \times \mathcal{Y}$-valued random variables that arrive sequentially, such that each $(x_t, y_t)$ has the same distribution as a generic pair $(x, y)$.
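To fix ideas, here is a minimal sketch of this online protocol, in which the forest predicts on each incoming point before being updated with it. The predict / partial_fit interface is an assumption made for illustration only, not the actual AMF API.

def online_loop(forest, stream):
    """stream yields pairs (x_t, y_t), with x_t in [0, 1]^d and y_t in Y."""
    predictions = []
    for x_t, y_t in stream:
        predictions.append(forest.predict(x_t))  # predict before seeing y_t
        forest.partial_fit(x_t, y_t)             # then update with the new pair
    return predictions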
Definition 3.1 (Tree partition)
Let $C \subseteq [0, 1]^d$ be a hyper-rectangular box of the form $\prod_{j=1}^d [a_j, b_j]$, $-\infty \leq a_j \leq b_j \leq +\infty$ (the interval being open at an infinite extremity). A tree partition (or kd tree, guillotine partition) of $C$ is a pair $(\mathcal{T}, \Sigma)$, where
• $\mathcal{T}$ is a finite ordered binary tree, which is represented as a finite subset of the set $\{0, 1\}^* = \bigcup_{n \geq 0} \{0, 1\}^n$ of all finite words on the alphabet $\{0, 1\}$. The set $\{0, 1\}^*$ is endowed with a tree structure (and called the complete binary tree): the empty word is the root, and for any $v \in \{0, 1\}^*$, the left (resp. right) child of $v$ is $v0$ (resp. $v1$), obtained by adding a $0$ (resp. $1$) at the end of $v$. We denote by $\mathcal{N}^\circ(\mathcal{T}) = \{v \in \mathcal{T} : v0, v1 \in \mathcal{T}\}$ the set of its interior nodes and by $\mathcal{L}(\mathcal{T}) = \{v \in \mathcal{T} : v0, v1 \notin \mathcal{T}\}$ the set of its leaves; these two sets are disjoint by definition.
• $\Sigma = (\sigma_v)_{v \in \mathcal{N}^\circ(\mathcal{T})}$ is a family of splits at the interior nodes of $\mathcal{T}$, where each split $\sigma_v = (j_v, s_v)$ is characterized by its split dimension $j_v \in \{1, \ldots, d\}$ and its threshold $s_v \in [0, 1]$. In Section 3.2.3, we will actually store in $\sigma_v \in \Sigma$ more information about the nodes $v \in \mathcal{T}$.
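To make Definition 3.1 concrete, the sketch below encodes a tree partition with nodes represented as finite words over $\{0, 1\}$ (the root being the empty word) and splits $\sigma_v = (j_v, s_v)$ stored at interior nodes. The class and method names are our own assumptions, introduced only for exposition.

from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TreePartition:
    # maps an interior node v to its split (split dimension j_v, threshold s_v)
    splits: Dict[str, Tuple[int, float]] = field(default_factory=dict)
    # the tree T as a set of words over {0, 1}; "" is the root
    nodes: set = field(default_factory=lambda: {""})

    def split_leaf(self, v: str, j: int, s: float) -> None:
        """Turn leaf v into an interior node with children v0 and v1."""
        assert v in self.nodes and v + "0" not in self.nodes
        self.splits[v] = (j, s)
        self.nodes.update({v + "0", v + "1"})

    def leaf(self, x) -> str:
        """Return the leaf of the partition containing a point x in [0, 1]^d."""
        v = ""
        while v in self.splits:           # interior nodes are the keys of splits
            j, s = self.splits[v]
            v = v + ("0" if x[j] <= s else "1")   # go to v0 if x_j <= s_v, else v1
        return v

In this encoding, the interior nodes are exactly the keys of splits and the leaves are the nodes without children, matching the sets $\mathcal{N}^\circ(\mathcal{T})$ and $\mathcal{L}(\mathcal{T})$ above.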