The upcoming text discusses the maths behind XGBoost. You are encouraged to write out the equations as you read, to build a better understanding.

**The mathematics behind the XGBoost model:**


Both XGBoost and GBM follow the principle of gradient boosted trees, but XGBoost uses a more **regularised model formulation to control overfitting**, which gives it better performance. This is why it is also known as a '**regularised boosting**' technique.


Note: unlike earlier, where the added model was denoted by hm, here we denote the model added at the tth iteration by ht.

In an ideal machine learning model, the objective function is a sum of a loss function "L" and a regularization term "Ω". The loss function controls the predictive power of the algorithm, and regularization controls its simplicity.

**Objective Function : Training Loss(L) + Regularization(Ω)**

The Gradient Boosting algorithm we have seen has only the training loss as its objective function, while the XGBoost objective function consists of the loss function evaluated over all predictions plus the sum of the regularization terms over all predictors ('T' trees).

$$Obj = \sum_{i=1}^{n} L\left(y_i, F_t(x_i)\right) + \sum_{t=1}^{T} \Omega\left(h_t\right)$$

where ht denotes the prediction coming from the tth tree.

From Gradient Boosting we have understood that the final model will be represented by:

$$F_t\left(x_i\right) = F_0\left(x_i\right) + \sum_{s=1}^{t} h_s\left(x_i\right) = F_{t-1}\left(x_i\right) + h_t\left(x_i\right)$$

Here Ft(xi) is the prediction for the ith instance xi at the tth iteration; to calculate it, we add ht to our previous prediction. So now we will apply this to our objective function:

$$Obj(t) = \sum_{i=1}^{n} L\left(y_i, F_t(x_i)\right) + \sum_{t=1}^{T} \Omega\left(h_t\right) \\ = \sum_{i=1}^{n} L\left(y_i, F_{t-1}(x_i) + h_t(x_i)\right) + \sum_{t=1}^{T} \Omega\left(h_t\right)$$

Here we greedily add the ht that improves our model the most by minimizing our objective function. It acts as a **small update** to our final model.
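The additive update Ft = Ft-1 + ht can be sketched in a few lines of Python. This is a toy illustration only: the one-dimensional data and the "median-split stump" base learner are made up, and real XGBoost fits a full regression tree at each round.

```python
# Toy sketch of the additive model F_t(x) = F_{t-1}(x) + h_t(x).
# The data and the median-split stump below are hypothetical; real
# XGBoost fits a regularised regression tree at each round.

def fit_stump(xs, residuals):
    """Base learner: split at the median x, predict the mean residual per side."""
    thr = sorted(xs)[len(xs) // 2]
    left = [r for x, r in zip(xs, residuals) if x < thr]
    right = [r for x, r in zip(xs, residuals) if x >= thr]
    left_mean = sum(left) / len(left) if left else 0.0
    right_mean = sum(right) / len(right) if right else 0.0
    return lambda x: left_mean if x < thr else right_mean

def boost(xs, ys, n_rounds=3):
    preds = [sum(ys) / len(ys)] * len(ys)   # F_0: constant initial prediction
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        h_t = fit_stump(xs, residuals)      # fit h_t to the current residuals
        preds = [p + h_t(x) for p, x in zip(preds, xs)]  # F_t = F_{t-1} + h_t
    return preds

print(boost([1, 2, 3], [2.0, 4.0, 6.0]))   # predictions move toward the targets
```

Each round adds a small correction fitted to what the current model still gets wrong, which is exactly the "small update" described above.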

In Gradient boosting algorithm we obtained Ft(xi) at each iteration by **fitting a base learner to the negative gradient of the loss function with respect to previous iteration’s value.**

In XGBoost, we explore several base learners/models and pick a model that minimizes the loss.

Our model ht in XGBoost contains both the structure of the tree and the leaf scores, which makes this a fairly complex optimization problem for the 'gradient descent' technique. This is due to the fact that an ensemble model includes "functions" as parameters and **can't be optimized using the conventional method, i.e. the gradient descent technique.**

To solve this problem, the XGBoost algorithm uses a **Taylor series** to approximate the value of the loss function for a base learner.

**Taylor series approximation of the loss**

The above described objective function can be approximated using Taylor series expansion and hence can be solved.

To have a better understanding, let's take a step back and recall the Taylor series from our calculus class.

A Taylor series is a series expansion of a function about a point. Suppose that the function f(x) is infinitely differentiable (smooth) at x = a.

For **f(x)**, **Δx** is the new learner we add in step **t**, and **a** is the prediction at step **(t−1)**.

**Δx = (x − a)** is the new learner that we need to add in step (t) in order to greedily minimize the objective.

By Taylor's expansion, any sufficiently smooth function can be approximated by the linear combination of its first-order gradient, the quadratic function of the second-order gradient, and so on. The Taylor series for f(x) centred at x = a is then

$$f(x) = f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \dots$$

which can also be rephrased as:

$$f(x+h) = f(x) + \frac{f'(x)}{1!}h + \frac{f''(x)}{2!}h^2 + \frac{f'''(x)}{3!}h^3 + \dots$$

obtained by replacing a with x and x − a with h, where the function is continuous and n times differentiable on the interval [x, x + h].

#### Taylor Series

The Taylor series for the function f(x) = x⁴ + x − 2, which is differentiable at a = 1, about the specified a is:

$$f(x) = 5(x-1) + 6(x-1)^2 + 4(x-1)^3 + (x-1)^4$$
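The Taylor expansion of f(x) = x⁴ + x − 2 at a = 1 can be checked numerically with a short Python sketch (the derivative values 0, 5, 12, 24, 24 at a = 1 are computed by hand):

```python
# Compare f(x) = x**4 + x - 2 with its Taylor expansion about a = 1.
# Derivatives at a = 1: f(1) = 0, f'(1) = 5, f''(1) = 12, f'''(1) = 24, f''''(1) = 24.

def f(x):
    return x**4 + x - 2

def taylor(x, a=1.0):
    d = x - a
    # f(a) + f'(a) d + f''(a)/2! d^2 + f'''(a)/3! d^3 + f''''(a)/4! d^4
    return 0.0 + 5.0 * d + (12.0 / 2) * d**2 + (24.0 / 6) * d**3 + (24.0 / 24) * d**4

for x in (0.5, 1.0, 2.0):
    print(x, f(x), taylor(x))
```

Since f is a degree-4 polynomial, the degree-4 expansion reproduces it exactly; for general losses XGBoost truncates after the second-order term instead.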

So, in this case, we take the Taylor expansion of the loss function up to the second order only, approximating the higher-order terms as ≈ 0.

$$Obj(t) = \sum_{i=1}^{n} L\left(y_i, F_{t-1}(x_i) + h_t(x_i)\right) + \sum_{t=1}^{T} \Omega\left(h_t\right)$$

as:

$$Obj(t) \approx \sum_{i=1}^{n}\left[L\left(y_i, F_{t-1}(x_i)\right) + p_i\,h_t(x_i) + \frac{1}{2}\,q_i\,h_t^2(x_i)\right] + \Omega\left(h_t\right)$$

where the pi and qi are defined as

$$p_i = \partial_{F_{t-1}(x_i)}\, L\left(y_i, F_{t-1}(x_i)\right), \qquad q_i = \partial^2_{F_{t-1}(x_i)}\, L\left(y_i, F_{t-1}(x_i)\right)$$

L(yi, Ft−1(xi)) is a constant term irrespective of the function ht, so after removing this constant we obtain the following simplified objective at step 't':

$$Obj(t) = \sum_{i=1}^{n}\left[p_i\,h_t(x_i) + \frac{1}{2}\,q_i\,h_t^2(x_i)\right] + \Omega\left(h_t\right)$$

Hence, this becomes our optimization goal for the new tree. The advantage of this definition is that it depends only on pi and qi, which also lets XGBoost support custom loss functions, including logistic loss, weighted logistic loss, etc.
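Because the objective needs only pi and qi, plugging in a custom loss amounts to supplying its first and second derivatives with respect to the previous prediction. A minimal pure-Python sketch (the data points are made up; the logistic formulas are the standard gradient and hessian of log-loss on raw scores):

```python
import math

# p_i and q_i are the first and second derivatives of the loss with respect
# to the previous prediction F_{t-1}(x_i). Any twice-differentiable loss fits.

def squared_grad_hess(y, f):
    # L = 1/2 (y - f)^2  ->  p = f - y,  q = 1
    return f - y, 1.0

def logistic_grad_hess(y, f):
    # L = -[y log s + (1 - y) log(1 - s)]  with  s = sigmoid(f)
    s = 1.0 / (1.0 + math.exp(-f))
    return s - y, s * (1.0 - s)

# Push the statistics for a couple of made-up points:
for y, f in [(1.0, 0.3), (0.0, -1.2)]:
    print(squared_grad_hess(y, f), logistic_grad_hess(y, f))
```

Swapping the loss changes only these two functions; the rest of the tree-building machinery is untouched.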

**XGBoost Regularisation**

We have introduced the training step; what remains is defining the regularization term. Let's define the complexity of the tree Ω(ht). In order to do so, let us first go through the definition of the tree ht(x):

$$h_t(x) = w_{m(x)}, \quad w \in \mathbb{R}^{J_T}, \quad m: \mathbb{R}^d \rightarrow \{1, 2, \dots, J_T\}$$

Here w is the vector of scores on leaves, m is a function (structure) assigning each data point to the corresponding leaf, and JT is the number of leaves. Now the regularization function becomes:

$$\Omega(h_t) = \gamma J_T + \frac{1}{2}\tau\sum_{j=1}^{J_T} w_j^2$$

#### Further diving deep into XGBoost

Upon defining Ij = {i | m(xi) = j} as the instance set of leaf j and substituting the training and regularization parts into the objective function, we can write it for the t-th tree as:

$$Obj(t) \approx \sum_{i=1}^{n}\left[p_i\,w_{m(x_i)} + \frac{1}{2}\,q_i\,w_{m(x_i)}^{2}\right] + \gamma J_T + \frac{1}{2}\tau\sum_{j=1}^{J_T} w_j^{2} \\ = \sum_{j=1}^{J_T}\left[\left(\sum_{i \in I_j} p_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} q_i + \tau\right) w_j^{2}\right] + \gamma J_T$$

#### Note

*In the second line, the index of the summation has been changed because all the data points on the same leaf get the same score.*

As the wj are independent of each other, and Pj wj + ½(Qj + τ)wj² is a quadratic function in wj (where Pj = Σi∈Ij pi and Qj = Σi∈Ij qi), for a fixed structure m(x) we can compute the optimal weight wj* of leaf j as:

$$w_j^{\ast} = -\frac{P_j}{Q_j + \tau}$$

by differentiating the function Pj wj + ½(Qj + τ)wj² with respect to wj and setting the derivative to zero.

And, calculate the corresponding optimal value of the objective function by substituting the above-found w∗j as

$$Obj^{\ast} = -\frac{1}{2}\sum_{j=1}^{J_T}\frac{P_j^2}{Q_j + \tau} + \gamma J_T$$

#### Note

*The scoring function can be used as a measure of how good a tree structure m(x) is.*

It may sound a bit complicated but basically, for a given tree structure, we push the statistics pi and qi to the leaves they belong to and use the formula to calculate how good the tree is.
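This "push the statistics to the leaves and score the tree" step can be sketched in Python. Here Pj and Qj are the per-leaf sums of pi and qi, τ and γ are the regularization constants from the derivation, and the numbers are made up for illustration:

```python
# Optimal leaf weights w_j* = -P_j / (Q_j + tau) and the tree score
# Obj* = -1/2 * sum_j P_j^2 / (Q_j + tau) + gamma * J_T, for a fixed
# tree structure. The per-leaf (P_j, Q_j) sums below are hypothetical.

def leaf_weight(P, Q, tau):
    return -P / (Q + tau)

def tree_objective(leaf_stats, tau, gamma):
    # leaf_stats: list of (P_j, Q_j) pairs, one per leaf
    JT = len(leaf_stats)
    return -0.5 * sum(P * P / (Q + tau) for P, Q in leaf_stats) + gamma * JT

stats = [(4.0, 2.0), (-6.0, 3.0)]   # made-up (P_j, Q_j) for two leaves
tau, gamma = 1.0, 0.1
print([leaf_weight(P, Q, tau) for P, Q in stats])  # optimal w_j* per leaf
print(tree_objective(stats, tau, gamma))           # Obj* for this structure
```

A lower Obj* means a better structure, so candidate trees (and candidate splits) can be compared with nothing more than these per-leaf sums.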

Based on some split criterion, a node is divided into left and right branches; some instances fall into the left leaf node and the others into the right.

XGBoost greedily builds a tree. The split that results in maximum loss reduction or gain is chosen.

**Gain**= **Loss**(parent) – [**LossL**(left branch)+**LossR**(right branch)]
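Combined with the Obj* formula, the gain of a candidate split can be computed directly from the branch statistics: each branch contributes −½ P²/(Q + τ), and the extra leaf costs γ. A small sketch with made-up numbers:

```python
# Gain of a candidate split, from the Obj* formula in the derivation:
# Gain = 1/2 [P_L^2/(Q_L+tau) + P_R^2/(Q_R+tau) - (P_L+P_R)^2/(Q_L+Q_R+tau)] - gamma
# The branch statistics below are hypothetical.

def split_gain(PL, QL, PR, QR, tau, gamma):
    def score(P, Q):
        return P * P / (Q + tau)
    return 0.5 * (score(PL, QL) + score(PR, QR) - score(PL + PR, QL + QR)) - gamma

# Hypothetical per-branch sums of p_i and q_i:
print(split_gain(PL=4.0, QL=2.0, PR=-6.0, QR=3.0, tau=1.0, gamma=0.1))
```

A split is only worth taking when this gain is positive, which is how the γ penalty prunes splits that do not reduce the loss enough.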

Apart from this, XGBoost uses a pre-sorted algorithm and a histogram-based algorithm for computing the best split. Other new boosting algorithms, like LightGBM and CatBoost, have been developed recently. LightGBM uses Gradient-based One-Side Sampling (GOSS) to find the split value. You can read about them in the additional reading material below.

**Additional References:**

- You can go through this paper to learn more about the Parallel Tree Learning Algorithm used for finding the best split, authored by XGBoost's creator Tianqi Chen.

- To better understand the objective function in the XGBoost scheme of things, you can read from pg. 31 in the following pdf.

- To read more about CatBoost, go through the following paper and the following site.

- To read more about LightGBM, which uses GOSS, refer to the following paper and the original documentation.