
Both XGBoost and TensorFlow are very capable machine learning frameworks, but how do you know which one you need? Or perhaps you need both?
In machine learning there are "no free lunches". Matching specific algorithms to specific problems often outperforms the "one-size-fits-all" approach. However, over the years the data science community has gained enough experience to develop rules of thumb for matching certain algorithms to typical tasks.
In this short post I will try to cover some of these rules to help you decide between Gradient Boosting Machines using XGBoost and Neural Networks using TensorFlow.

In 2012 Alex Krizhevsky and his colleagues astonished the world with a computational model that could not only learn to tell which object is present in a given image based on features, but also perform the feature extraction itself, a task that was thought to be complex even for experienced "human" engineers.
Since then, deep neural networks have dominated the research landscape, with more articles suggesting revolutionary concepts for image recognition, translation, playing computer games and even self-driving cars. Nevertheless, on the very popular data science hub Kaggle, the benchmarks for many competitions are based on some implementation of Gradient Boosting Machines.
While Cloud Machine Learning Engine offers both frameworks now, it's not always clear which algorithm should be explored first. After solving many machine learning problems with both methods, I put together some of my thoughts on how to choose between the two approaches.
Case 1: Difficult to Break Down into Features?
If we have learned anything from 2012, it is that neural networks are very efficient at dealing with high-dimensional raw data. Image, video, text and audio are all examples of high-dimensional raw data that is very hard to preprocess and represent as features. In these cases, using a neural network's built-in feature extraction units (CNNs, LSTMs, embedding layers) can obtain phenomenal results in a fraction of the development time of the classic engineering approaches.

Case 2: Handling Missing Values?
If you have ever tried to feed a neural network missing data, you probably ended up with errors. This is because the equations solved during NN training assume a valid value for each input variable.
XGBoost, on the other hand, has its own way of dealing with missing data. During training, it learns for each split a default direction in which to send records whose value is missing, which works like a learned imputation for each feature. In many real-world problems, the fact that a value is missing itself carries valuable information about the target. So for "free" missing-value handling, XGBoost wins.
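To make the idea concrete, here is a minimal sketch of how a single tree split could learn where to send missing values. It is a simplified illustration, not XGBoost's implementation: the function `best_default_direction`, the toy data, and the use of plain squared error (instead of XGBoost's gradient-based gain) are all assumptions made for readability.

```python
import math


def best_default_direction(values, targets, threshold):
    """Pick the side of one candidate split that missing values should go to.

    Simplified: try sending all missing rows left, then right, and keep
    whichever gives the lower total squared error around each side's mean.
    """
    def sse(ys):
        # Sum of squared errors around the mean; an empty side costs nothing.
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    rows = list(zip(values, targets))
    left = [y for x, y in rows if not math.isnan(x) and x < threshold]
    right = [y for x, y in rows if not math.isnan(x) and x >= threshold]
    missing = [y for x, y in rows if math.isnan(x)]

    error_if_left = sse(left + missing) + sse(right)
    error_if_right = sse(left) + sse(right + missing)
    return "left" if error_if_left <= error_if_right else "right"


nan = float("nan")
# Made-up toy data: rows with a missing feature value behave like the
# high-value rows, so the missingness itself is informative.
feature = [20.0, 25.0, nan, 80.0, 90.0, nan]
target = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

direction = best_default_direction(feature, target, threshold=50.0)
print(direction)
```

In this toy case the missing rows are routed to the side whose targets they resemble, without anyone having to choose an imputation value up front.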

Case 3: How Deep Are Your Pockets?
Neural network training is "embarrassingly parallel", which makes it a great fit for parallel and distributed training. That is, if your budget can cover hours of training on expensive machines with TPUs or GPUs.
On the other hand, if you only have up to a few million records, XGBoost can be trained on a less expensive multi-core CPU and converge in less time. So if you have a limited amount of data and want to train a model, XGBoost may be more affordable and achieve similar results.

Case 4: How Much Data Do You Have?
The previous case brings us to the question of how much data you have. Due to its underlying data structure, XGBoost is limited in the ways it can be parallelized, which in turn limits the amount of data it can process. One way of handling massive datasets is splitting the data into shards and stacking models, thus effectively multiplying the number of parameters used to fit the data.
With neural networks, however, it is usually "the more the merrier". When dealing with massive datasets, neural networks can converge to a lower generalization error with the same number of parameters. But for smaller datasets, XGBoost typically converges faster and with a smaller error.

Case 5: How Complex Are the Input/Output Shapes?
XGBoost has more limitations than NNs regarding the shape of the data it can work with. It usually takes 1-D arrays as record inputs and outputs a single number (regression) or a vector of probabilities (classification). For this reason, it is easier to configure an XGBoost pipeline. In XGBoost there is no need to worry about the shape of the data: just provide a pandas DataFrame that looks like a table, set the label column and you are good to go.
Neural networks, on the other hand, are designed to work on tensors, which are multi-dimensional arrays. A NN's input and output shapes can vary between numbers, sequences (vectors), images and even videos. So for classic problems like click-through prediction based on structured data, both can work well in terms of data shape. But when it comes to building more complex data transformations, NNs may be your only valid choice!
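The difference in shapes can be sketched with plain NumPy arrays. The sizes below (1,000 records, 12 features, 64x64 RGB images, sequences of 50 tokens) are arbitrary assumptions chosen for illustration, and no model is trained here.

```python
import numpy as np

# Tabular data: one row per record, one column per feature (2-D).
# This is the table-like layout a gradient boosting model expects.
tabular = np.zeros((1000, 12), dtype=np.float32)

# A batch of RGB images: (batch, height, width, channels) (4-D).
images = np.zeros((32, 64, 64, 3), dtype=np.float32)

# A batch of token-id sequences: (batch, time steps), ordered in time.
sequences = np.zeros((32, 50), dtype=np.int64)

# To feed images to a tabular model, each image has to be flattened into
# a 1-D record, which discards the spatial layout that convolutional
# layers are designed to exploit.
flattened = images.reshape(len(images), -1)

print(tabular.shape, images.shape, sequences.shape, flattened.shape)
```

Flattening turns every image into a record of 64 * 64 * 3 = 12,288 columns, which is technically a table but no longer tells the model which pixels are neighbors.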

Case 6: Can I Have Both Please?
What if you don't want to choose? In many cases, a combination of both models may achieve better results than each individual model. I already mentioned model stacking in this post. The mathematical differences between the models result in different error distributions over the same data. By stacking the models, a lower error rate can be achieved with the same amount of data, at the cost of complicating the engineering of the system.
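Here is a minimal sketch of why combining models with different error patterns can help. It uses simple averaging (often called blending), which is the most basic form of ensembling; full stacking would instead train a second-level model on the base models' predictions. The prediction values are made up for illustration and are not results from real XGBoost or TensorFlow models.

```python
def mean_abs_error(predictions, targets):
    """Average absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


targets = [10.0, 20.0, 30.0, 40.0]

# Hypothetical predictions from two models whose errors point in
# different directions on the same records.
tree_preds = [12.0, 18.0, 33.0, 39.0]
net_preds = [9.0, 23.0, 28.0, 42.0]

# Blend: average the two predictions for each record.
blend_preds = [(a + b) / 2 for a, b in zip(tree_preds, net_preds)]

tree_mae = mean_abs_error(tree_preds, targets)
net_mae = mean_abs_error(net_preds, targets)
blend_mae = mean_abs_error(blend_preds, targets)

print(tree_mae, net_mae, blend_mae)
```

The blend only beats both models here because their errors partly cancel out. If two models make the same mistakes on the same records, averaging them gains nothing.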
I'd like to thank Philip Tannor for his insightful remarks.
Want more stories? Check our blog, or follow Gad on Twitter.


