\chapter{Variable Selection: A Case Study}
Before moving on to the Monte Carlo study, this chapter outlines the variable selection theory on which the algorithms described above are based, and it shows how the algorithms perform variable selection. The Titanic dataset\footnote{https://github.com/paulhendricks/titanic} has been used to show how variable selection is performed by the three models. The Titanic dataset contains information about the passengers who embarked on the Titanic's tragic first and last voyage. The response variable is \textit{Survival}, indicating whether each passenger survived the accident. The predictors contain information about the economic situation of the passengers, such as the amount paid for the ticket and the travelling class; demographic information, such as age, sex and the presence of family members on board; and other information. Character variables have been dropped to simplify the analysis.

\section{Variable Selection Using XGBoost}
\cite{chen2016xgboost} have implemented their algorithm in multiple programming languages, including R. The R package is named \textit{xgboost}\footnote{https://github.com/dmlc/xgboost}.

Among the various functions available in the package there is \textit{xgb.importance}. After an XGBoost model has been fitted, the function \textit{xgb.importance} shows the importance that each variable had in the prediction of the dependent variable. Table \ref{xgimp} shows the importances that XGBoost gives to the variables in the Titanic dataset.

\begin{table}[ht]
    \centering
    \begin{tabular}{rlrrr}
        \hline
        & Feature & Gain & Cover & Frequency \\
        \hline
        1 & Fare & 0.65824 & 0.70703 & 0.35294 \\
        2 & SibSp & 0.12298 & 0.03313 & 0.05882 \\
        3 & Pclass2 & 0.10568 & 0.10942 & 0.23529 \\
        4 & Pclass3 & 0.09447 & 0.11728 & 0.23529 \\
        5 & Parch & 0.01833 & 0.01566 & 0.05882 \\
        6 & Age & 0.00031 & 0.01747 & 0.05882 \\
        \hline
    \end{tabular}
    \caption{Variable importance of XGBoost built on the Titanic dataset.}
    \label{xgimp}
\end{table}

The function \textit{xgb.importance} returns three columns containing different information: gain, cover and frequency.

The gain column gives information about the relative contribution of a specific variable to the model. To calculate the gain, the contribution of the variable in every tree of the model is taken into account. A high gain value means that the variable has high importance in the prediction of the target.
Cover is a metric that is higher when the splits on a variable involve a large number of observations.
Frequency is the percentage of times a variable occurs in the trees of the model.
As might be expected, the three columns returned by the \textit{xgb.importance} function, being relative measures, each sum to 1.
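
As a minimal sketch of how a table like Table \ref{xgimp} can be obtained, the following R code fits a small XGBoost classifier on the Titanic data (the \textit{titanic} package referenced in the footnote at the beginning of the chapter) and queries \textit{xgb.importance}; the preprocessing and hyperparameters shown here are illustrative and may differ from those used for Table \ref{xgimp}.

\begin{verbatim}
library(xgboost)
library(titanic)   # source of the Titanic data used in this chapter

# Keep the response and a few predictors, dropping rows with missing values
d <- na.omit(titanic_train[, c("Survived", "Pclass", "Sex", "Age",
                               "SibSp", "Parch", "Fare")])
d$Pclass <- factor(d$Pclass)

# Dummy-code the factors into a numeric model matrix (no intercept column)
X <- model.matrix(Survived ~ . - 1, data = d)
y <- d$Survived

# Fit a small XGBoost classifier
fit <- xgboost(data = X, label = y, nrounds = 50,
               objective = "binary:logistic", verbose = 0)

# Gain, Cover and Frequency for every variable that appears in the trees
xgb.importance(feature_names = colnames(X), model = fit)
\end{verbatim}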

\section{Variable Selection Using BART}
In addition to prediction, BART can be used to perform variable selection by selecting those variables that appear most often in the fitted sum-of-trees models.
As already introduced in Section (3.6), BART is implemented in R in two packages: the first, made by the creators of BART itself, is \textit{BayesTree}; the second is \textit{bartMachine}.
Figure \ref{comparisonBART} shows the differences between the two packages.

\begin{figure}[H]
    \includegraphics[width=1\linewidth]{img/3_5.png}
    \caption{Comparison of features offered by BayesTree and bartMachine}
    \label{comparisonBART}
\end{figure}

\subsection{BayesTree}
\cite{chipman2010bart} report that, as the number of trees $m$ grows, the ability to select the most important variables is lost. Indeed, the redundancy offered by many trees tends to mix irrelevant predictors with relevant ones. As $m$ decreases, the redundancy is diminished and BART tends to pick relevant predictors for its fit: when $m$ is small, competition arises between predictors to improve the fit. However, if the number of trees becomes too small, the Gibbs sampler in BART often remains trapped in local modes and the results can become destabilized \citep{chipman1998bayesian}.
The proposed model-free approach to variable selection essentially consists in observing what happens to the percentage of usage of each variable in a sequence of $K$ MCMC samples of the sum-of-trees model, $f_1^*,\dots,f_K^*$, as the number of trees gets smaller. For each sampled sum-of-trees model $f_k^*$, let $z_{ik}$ be the proportion of all splitting rules that use the $i$-th component of $x$, which leads to

\begin{equation}
\label{inclusionBayes}
v_i\equiv\frac{1}{K}\sum_{k=1}^{K}z_{ik}.
\end{equation}
Equation \eqref{inclusionBayes} represents the average use per splitting rule of the $i$-th component of $x$.
As $m$ is set smaller, the sum-of-trees models tend to support the inclusion of those variables that are better at predicting $Y$ and to exclude those variables that are unnecessary and unrelated to the target $Y$. Intuitively, the variables with greater $v_i$ are the ones that provide the most information for predicting $Y$.
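
A minimal sketch of how the $v_i$ in Equation \eqref{inclusionBayes} can be computed with \textit{BayesTree} is given below; it assumes the X and y objects prepared earlier, and the settings are illustrative.

\begin{verbatim}
library(BayesTree)

# A small number of trees (ntree = 10) sharpens the competition among predictors
fit <- bart(x.train = X, y.train = y, ntree = 10)

# fit$varcount has one row per MCMC draw and one column per predictor:
# the number of splitting rules using each variable in that draw.
# Normalising each row gives z_ik; averaging over the draws gives v_i.
z <- fit$varcount / rowSums(fit$varcount)
v <- colMeans(z)
sort(v, decreasing = TRUE)
\end{verbatim}

Re-running the sketch with larger values of \texttt{ntree} reproduces the flattening of the inclusion proportions discussed below.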

\begin{figure}[H]
    \includegraphics[width=1\linewidth]{img/3_1.png}
    \caption{Variable importance output of the model fitted with BayesTree on the Titanic dataset.}
    \label{titanicBayesTree}
\end{figure}

In Figure \ref{titanicBayesTree} it is possible to see the variable importances for different values of the number of trees $m$ on the Titanic dataset.
It can be seen that when more trees $m$ are used to build the model, the inclusion proportions tend to become similar. When 10 trees are used, the decrease of the inclusion proportions is much steeper from $Age$ to $Parch$ than from $Parch$ to the last variable; indeed, the decrease of the inclusion proportions gets smoother after $Parch$. With 50 trees, only the first two variables seem to have a proportion higher than the remaining predictors. The choice of $m$ is crucial: increasing the number of trees can lead to better predictions, but the possibility of understanding which variables contribute the most to the prediction of $Y$ is lost as $m$ grows. Without knowing the mechanism behind $Y$, it is impossible to know the right $m$ to highlight the variables associated with $Y$.

\subsection{bartMachine}
\cite{bleich2014variable} made some changes to the original BART model and implemented them in a new package called \textit{bartMachine}.

Among the changes described in \citep{kapelner2013bartmachine}, there is a difference in the choice of splitting rules for non-terminal nodes. In the original formulation, each variable is chosen with probability $1/p$. In the \textit{bartMachine} implementation the inclusion of a variable can be given a specific probability. Additionally, variables without valid split points are automatically assigned a probability equal to zero.

The prior for $\sigma^2$ is chosen to be an inverse gamma instead of an inverse chi-square. However, as explained in Section (3.5), the choice of $\sigma$ does not concern classification.

There are also changes in the Metropolis-Hastings algorithm. Instead of the original four operations (growing a terminal node, pruning a pair of terminal nodes, changing a non-terminal rule and swapping a rule between parent and child), in the \textit{bartMachine} implementation there are only three possible choices: growing, pruning and changing. The change operation proposes new splits on nodes that are singly internal, i.e.\ nodes whose children are both terminal. After a node with this characteristic is chosen, a new split variable is selected from the set of available predictors and a new split value is selected from the available values.

The BART implementation of \textit{bartMachine} is also able to deal with missing values, both in model creation and in prediction.
The implemented method is known as \textit{"Missing Incorporated in Attributes"}, MIA \citep{twala2008good}. Missing values are handled by augmenting the splitting rules of the nodes so that missing observations are sent left or right, similarly to what happens in XGBoost, and by using missingness itself as a splitting variable. The creation of new dummy variables representing whether a statistical unit has a missing value for the $j$-th variable can increase the number of predictors by up to a factor of two. As seen in Section (3.6), the run time of BART increases only negligibly with the number of covariates.
For more details, consult \cite{kapelner2015prediction}.
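
A minimal sketch of building a \textit{bartMachine} model with MIA enabled is shown below; the data objects (a data frame of predictors \texttt{X\_df} that may contain missing values and a two-level factor response \texttt{y\_fac}) are hypothetical names for data prepared as in the rest of the chapter, and the options are illustrative rather than the exact ones used for the figures.

\begin{verbatim}
library(bartMachine)

bm <- bartMachine(X = X_df, y = y_fac,
                  use_missing_data = TRUE,                   # MIA splitting rules
                  use_missing_data_dummies_as_covars = TRUE) # add missingness dummies
\end{verbatim}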

Although these are the most noticeable changes described in the \textit{bartMachine} documentation, running variable selection with the \textit{bartMachine} package leads to very different results.

\begin{figure}[H]
    \includegraphics[width=1\linewidth]{img/3_2.png}
    \caption{Variable importance output of the model fitted with \textit{bartMachine} on the Titanic dataset.}
    \label{titanicmachine}
\end{figure}

In Figure \ref{titanicmachine} it is clear that changes in $m$ do not lead to noticeable differences in the inclusion proportions of the variables. The reason why this happens is not entirely clear from the papers that have been examined.
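
A plot of the inclusion proportions in the spirit of Figure \ref{titanicmachine} can be obtained for a fitted model with the package's own helper; the call below is a sketch that reuses the \texttt{bm} object from the previous snippet.

\begin{verbatim}
# Average inclusion proportions (with error bars) over several replicate fits
investigate_var_importance(bm, num_replicates_for_avg = 5)
\end{verbatim}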

The package \textit{bartMachine}, moreover, offers many more options to investigate which variables are to be considered better than the others at predicting the dependent variable $y$. In \citep{bleich2014variable} it is made clear that the inclusion proportion $p_k$ of each variable cannot be interpreted as the posterior probability that the corresponding predictor $x_k$ has a \textit{real effect}, where $k$ is the index of the $k$-th variable. The definition of a real effect given by \cite{bleich2014variable} is an impact of a predictor on the dependent variable $y$, where the relationship can be either linear or non-linear. \cite{bleich2014variable} then proceed to ask and answer the question: \textit{how large does the variable inclusion proportion $p_k$ have to be in order to select predictor variable $x_k$ as an important variable?}

To find an appropriate selection threshold, what is proposed is to employ an approach based on permutations to generate null distributions for the variable inclusion proportions, $\mathbf{p}=(p_1,\dots,p_K)$.
$P$ permutations of the $y$ vector are created, $y_1^*,\dots,y_P^*$, and BART is run on every permutation. During the permutations the predictors are left untouched: the only variable in the model matrix that changes is the vector of the dependent variable $y$. This permutation strategy preserves possible dependencies among the predictor variables while removing any dependency between the predictors and the dependent variable. At each run, the variable inclusion proportions are calculated and stored, leading to $P$ values for each predictor, $p_{k,1}^*,\dots,p_{k,P}^*$, which form the null distribution of that predictor. What is left to do is to determine an appropriate threshold for selecting predictors based on these null distributions.
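
The following sketch spells out this permutation scheme by hand, reusing \texttt{X\_df}, \texttt{y\_fac} and \texttt{bm} from the previous snippets; \textit{bartMachine} also ships the whole procedure ready-made, as shown later in this section.

\begin{verbatim}
set.seed(1)
P <- 100                                   # number of permutations

# Observed inclusion proportions p_k on the original response
p_obs <- get_var_props_over_chain(bm)

# Null inclusion proportions: refit on a permuted response, predictors untouched
null_props <- t(replicate(P, {
  bm_perm <- bartMachine(X_df, sample(y_fac), verbose = FALSE)
  get_var_props_over_chain(bm_perm)
}))
# null_props is a P x K matrix: one null draw per permutation for each predictor
\end{verbatim}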

\cite{bleich2014variable} developed three different strategies to create thresholds, which differ in how strict they are about the inclusion of variables.

The \textit{local} threshold is calculated by taking the $1-\alpha$ quantile of the null distribution $p_{k,1}^*,\dots,p_{k,P}^*$. A predictor is then selected only if its inclusion proportion $p_k$ exceeds the $1-\alpha$ quantile of its own null distribution.

The \textit{global max} threshold is obtained from the maxima, across predictors, of the permutation distributions of the variable inclusion proportions. First, every inclusion proportion has to be calculated; then, the largest inclusion proportion across all predictors, $p_{\max,p}^*=\max\{p_{1,p}^*,\dots,p_{K,p}^*\}$, is taken in every permutation $p$. The null distribution is formed by the $P$ maximum inclusion proportions, $p_{\max,1}^*,\dots,p_{\max,P}^*$, and its $1-\alpha$ quantile is used as a threshold. A variable is then selected only if its inclusion proportion $p_k$ exceeds this $1-\alpha$ quantile.

The third strategy, called \textit{global SE threshold}, is also global across predictor variables but less conservative than the \textit{global max} method. This last strategy makes use of the mean and standard deviation of the permutation distribution of each variable inclusion proportion to create a global threshold for all predictors. With $m_k$ and $s_k$ defined as the mean and standard deviation of the null inclusion proportions $p_{k,1}^*,\dots,p_{k,P}^*$ of $x_k$ across all permutations, the threshold multiplier is calculated as

\begin{equation}
C^* = \inf_{C \in \mathbb{R}^+} \left\{ \forall k, \; \frac{1}{P} \sum_{p=1}^{P} \mathbbm{1}(p_{k,p}^* \le m_k+C \cdot s_k) > 1-\alpha \right\},
\end{equation}
where $C^*$ represents the smallest global multiplier that gives simultaneous $1-\alpha$ coverage across the permutation distributions of $p_k$ for all variables. A predictor $x_k$ is then selected only if $p_k > m_k + C^* \cdot s_k$.
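
A rough translation of the three rules into R is sketched below, reusing \texttt{p\_obs} and \texttt{null\_props} from the permutation snippet above; the infimum defining $C^*$ is approximated on a grid.

\begin{verbatim}
alpha <- 0.05

# Local threshold: each predictor against the 1 - alpha quantile of its own null
local_cut <- apply(null_props, 2, quantile, probs = 1 - alpha)
selected_local <- names(p_obs)[p_obs > local_cut]

# Global max threshold: 1 - alpha quantile of the per-permutation maxima
gmax_cut <- quantile(apply(null_props, 1, max), probs = 1 - alpha)
selected_gmax <- names(p_obs)[p_obs > gmax_cut]

# Global SE threshold: smallest C with simultaneous coverage of m_k + C * s_k
m_k <- colMeans(null_props)
s_k <- apply(null_props, 2, sd)
covers <- function(C)
  all(colMeans(sweep(null_props, 2, m_k + C * s_k, "<=")) > 1 - alpha)
C_grid <- seq(0, 20, by = 0.01)
C_star <- C_grid[which(sapply(C_grid, covers))[1]]
selected_gse <- names(p_obs)[p_obs > m_k + C_star * s_k]
\end{verbatim}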

The local procedure is the least conservative in terms of selection, while the global max is the most rigid; the global SE procedure is a compromise between the two. \cite{bleich2014variable} show that the best procedure depends on the underlying sparsity of the problem, which is usually unknown. The package \textit{bartMachine} includes a procedure based on cross-validation that automatically chooses the best threshold for a given problem.
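
The three thresholds and the cross-validated choice among them are also available directly in the package; the calls below are a sketch, and the names of the returned components are taken from the package documentation and may differ across versions.

\begin{verbatim}
# Permutation-based variable selection with all three thresholding rules
vs <- var_selection_by_permute(bm, num_permute_samples = 100,
                               alpha = 0.05, plot = TRUE)
vs$important_vars_local_names       # local threshold
vs$important_vars_global_max_names  # global max threshold
vs$important_vars_global_se_names   # global SE threshold

# Cross-validated choice of the best thresholding rule
vs_cv <- var_selection_by_permute_cv(bm)
vs_cv$best_method
vs_cv$important_vars_cv
\end{verbatim}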

\begin{figure}[H]
    \includegraphics[width=1\linewidth]{img/3_3.png}
    \caption{Results of the three procedures to create thresholds for variable selection with \textit{bartMachine}.}
    \label{titanicthreshold}
\end{figure}

Figure \ref{titanicthreshold} shows an example of the three threshold procedures. The top plot shows the local procedure, which chooses three variables: $Age$, $Sex\_female$ and $Pclass\_3$. The bottom plot displays the two global methods together; the red line represents the threshold of the global max method. Based on the global max procedure, no variable is selected; the global SE threshold procedure, instead, selects only $Pclass\_3$. The predictors chosen by the global SE strategy are represented by stars, and a variable chosen by both global methods is displayed as a black dot.

\section{Variable Selection Using SVM-RFE}
The SVM-RFE algorithm proposed in \citep{guyon2002gene} has been implemented by John Colby; the code can be found both on his website and on his GitHub page\footnote{http://www.colbyimaging.com/wiki/statistics/msvm-rfe, http://github.com/johncolby/SVM-RFE}.

The variable ranking obtained using SVM-RFE is displayed below. To run the example, the tuning parameter $C$ has been kept at one, its default value.

\begin{multicols}{2}
    \begin{enumerate}
        \item Fare
        \item Age
        \item SibSp
        \item Parch
        \item SexMale
        \item Pclass3
        \item EmbarkedC
        \item Pclass1
        \item EmbarkedQ
        \item Pclass2
        \item EmbarkedS
    \end{enumerate}
\end{multicols}
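
A ranking of this kind can be produced with a call along the following lines, assuming Colby's \texttt{msvmRFE.R} script (see the footnote above) has been downloaded and sourced; the preprocessing is illustrative, and the script expects the class labels in the first column of its input.

\begin{verbatim}
library(e1071)            # SVM back-end used by Colby's implementation
source("msvmRFE.R")       # John Colby's SVM-RFE code

# Class labels in the first column, predictors in the remaining columns
input <- data.frame(Survived = factor(d$Survived), X)

# Recursive feature elimination with a linear SVM; C left at its default of 1
ranking <- svmRFE(input, k = 1)
colnames(X)[ranking]      # features from most to least important
\end{verbatim}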

\section{Some Remarks}
The models treated in this work are very different and, as shown in this chapter, so are the variables that they consider best at predicting the dependent variable $Y$. The following example shows how hypotheses can be made about the way the selected predictors explain the dependent variable.

In XGBoost the predictor with the highest relevance in the model is \textit{Fare}. A possible explanation could be that the passengers who spent the most were most probably located in the first class of the ship, closer to the deck and the lifeboats. Another possibility is that rich passengers might have been privileged when it came to choosing who should be saved first, leading to a higher probability of survival. The fact that the dummies for the second and third class, shown in Table \ref{xgimp} as \textit{Pclass2} and \textit{Pclass3}, have been used in the splitting phase with a frequency nearly as high as that of \textit{Fare} might confirm that owning a cabin closer to the deck may have been a fundamental factor in determining a passenger's probability of survival. The variable with the second highest gain value, \textit{SibSp}, refers to the number of siblings and spouses on board, thus families and/or couples. It might be reasonable to think that families had a greater chance of being allowed on the lifeboats.

The two packages that have been used to build BART models, \textit{BayesTree} and \textit{bartMachine}, disagree over which variables are the most important in predicting $Y$. They agree, though, on the fact that \textit{Age} is the variable most needed to predict the target. Young passengers, especially children, might have had a higher priority when the time came to decide who should go on the lifeboats. It is also possible that older people preferred to save young people, sacrificing themselves. \textit{BayesTree} attributes to the sex of a passenger the second highest inclusion proportion. Computing the marginal effect of sex on the probability of survival, it can be seen that a much higher percentage of women were saved, $0.3257$, compared to the percentage of men, $0.223$. Figure \ref{titanicBayesTree} shows that \textit{Fare}, the variable that was the most useful for XGBoost in predicting $Y$, is for BART one of the variables with the lowest inclusion proportion. Information about whether passengers belonged to the third or the first class, however, might provide the same information that \textit{Fare} provides. Both \textit{BayesTree} and \textit{bartMachine} have \textit{SibSp} among the top three variables with the highest inclusion proportions. However, \textit{bartMachine} has \textit{Parch}, the number of parents and/or children on board, as the third variable. It seems intuitive that \textit{Parch} is complementary to the number of spouses/siblings in providing information about whether a passenger belongs to a family. When it comes to the \textit{bartMachine} permutation strategies, the local procedure selects only three variables: \textit{Age}, \textit{Sex} and \textit{Pclass3}, the latter being a dummy variable taking value $1$ for passengers travelling in the third class. Building a model with the three predictors selected by the local procedure would discard the information about whether a passenger belongs to a family, which intuitively seems a clear mistake. The two global procedures are so conservative that only one variable is selected by the global SE threshold procedure and none by the global max strategy.

SVM-RFE partially confirms what has been seen in the other models. The last two variables to be eliminated are \textit{Fare} and \textit{Age}, and the third and fourth are the number of siblings/spouses and the number of children/parents, respectively \textit{SibSp} and \textit{Parch}. \textit{Sex} is the fifth variable in the ranking. Depending on the chosen subset, it is possible to decide which pieces of information are to be included in the model. The variable representing whether a passenger belonged to first class, \textit{Pclass1}, is one of the first predictors to be excluded by the recursive feature elimination. As already pointed out, the information about a passenger belonging to the first class might be explained by the variable \textit{Fare}, the amount of money paid by passengers for their tickets.

To summarize the information from the four variable selections, there seem to be four distinct pieces of relevant information:

\begin{itemize}
    \item whether the passenger belongs to a couple or a family, specified by \textit{SibSp} and \textit{Parch},
    \item the wealth of passengers, with variables indicating it indirectly, like the class in which they travel, and more directly, like the amount they paid for the ticket,
    \item the age of passengers,
    \item the sex of passengers.
\end{itemize}

Although it seems quite clear that the amount paid for the tickets, \textit{Fare}, is strongly correlated with the class in which the passengers travel, the models choose different ways to incorporate the information about the wealth of passengers. In XGBoost the class dummies are used as splits nearly as much as \textit{Fare}; the other three models clearly prefer \textit{Fare}. In the group of variables associated with the number of people a passenger is travelling with, \textit{SibSp} is generally preferred to \textit{Parch}, although some models suggest the inclusion of both.

Without knowing the process that led to the choice of which passengers should be saved first, or whether a process behind that choice even exists, it is not possible to understand how each model decides which variables are to be included and whether the predictors that they retain as important actually deserve a higher importance.

If the interest lies in understanding the predictors that have a direct association with the dependent variable, choosing those predictors capable of providing a good prediction of the target might not be a good solution. Indeed, the fact that a variable is able to improve the prediction of $Y$ does not imply that it has a direct association with the dependent variable. For example, if one were interested in predicting when the rate of drownings in swimming pools is higher, a possible relationship might be found with ice cream sales: the number of drownings is certainly higher during the hot seasons, when the most ice cream is sold. This is a spurious relationship, since a growth in both drownings and ice cream sales might simply be caused by a heat wave.

The aim of this work is to see whether the models described in the previous chapters are capable of understanding which variables have a direct association with the dependent variable $Y$ in situations where the process behind the creation of the target is completely known, as in Monte Carlo studies. However, this work is limited to showing whether or not the described models are capable of picking the predictors that are used to generate the dependent variable $Y$; understanding the reasons behind the choices made by the models is left for future developments.