XGBoost or TensorFlow?

Both XGBoost and TensorFlow are very capable machine learning frameworks, but how do you know which one you need? Or perhaps you need both?

In machine learning there is “no free lunch”. Matching specific algorithms to specific problems often outperforms a “one-size-fits-all” approach. Still, over the years the data science community has gained enough experience to develop rules of thumb for matching certain algorithms to typical tasks.

In this short post I will try to cover some of these rules to help you decide between Gradient Boosting Machines using XGBoost and Neural Networks using TensorFlow.

XGBoost vs TensorFlow Summary

In 2012 Alex Krizhevsky and his colleagues astonished the world with a computational model that could not only learn to tell which object is present in a given image based on features, but also perform the feature extraction itself — a task that was thought to be complex even for experienced “human” engineers.

Since then, deep neural networks have dominated the research landscape, with a growing number of articles proposing revolutionary concepts for image recognition, translation, playing computer games and even self-driving cars. Nevertheless, on the very popular data science hub Kaggle, the benchmarks for many competitions are based on some implementation of Gradient Boosting Machines.

While Cloud Machine Learning Engine offers both frameworks now, it’s not always clear which algorithm should be explored first. After solving many machine learning problems with both methods, I put together some of my thoughts on how to choose between the two approaches.

Case 1: Difficult to Break Down into Features?

If we have learned anything since 2012, it is that neural networks are very efficient at dealing with high-dimensional raw data. Image, video, text and audio are all examples of high-dimensional raw data that is very hard to preprocess and represent as features. In these cases, using an NN’s built-in feature extraction units (CNNs, LSTMs, embedding layers) can obtain phenomenal results in a fraction of the development time of classic feature engineering approaches.

Instead of extracting features, let the network learn the representation.
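
As a rough illustration of letting the network learn the representation, here is a minimal sketch of a Keras text classifier that consumes raw token ids directly. The vocabulary size, sequence length and layer widths are arbitrary placeholders chosen for this example, and the model is only built and called once, not trained.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sizes, chosen only for illustration.
vocab_size = 1000
seq_len = 20

# The Embedding and LSTM layers act as learned feature extractors:
# no hand-crafted text features are computed beforehand.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=16),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# A batch of 4 "sentences", each a sequence of raw integer token ids.
token_ids = tf.constant(np.random.randint(0, vocab_size, size=(4, seq_len)))
probs = model(token_ids)  # one probability per sentence
```

The same pattern applies to images with convolutional layers in place of the embedding and LSTM.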

Case 2: Handling Missing Values?

If you have ever tried to feed a neural network missing data, you probably ended up with errors or NaN outputs. This is because the equations solved during NN training assume a valid value for every input variable.
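
To make this concrete, here is a small NumPy sketch of a single dense layer with random placeholder weights. One missing input is enough to turn every output into NaN. The imputation helper at the end is a common workaround and an assumption of this example, not something prescribed by either framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny dense layer: y = relu(x @ W + b). The weights are random placeholders.
W = rng.normal(size=(3, 2))
b = np.zeros(2)

def dense_relu(x):
    # np.maximum propagates NaN, as does the matrix product.
    return np.maximum(x @ W + b, 0.0)

x_ok = np.array([[0.5, -1.2, 3.0]])
x_missing = np.array([[0.5, np.nan, 3.0]])

out_ok = dense_relu(x_ok)            # finite values
out_missing = dense_relu(x_missing)  # every output is NaN

def impute_with_indicator(x, fill_value=0.0):
    """Hypothetical helper: fill missing values and append a 0/1 missing mask."""
    mask = np.isnan(x)
    filled = np.where(mask, fill_value, x)
    return np.concatenate([filled, mask.astype(x.dtype)], axis=1)

x_ready = impute_with_indicator(x_missing)  # shape (1, 6), no NaN left
```

With a neural network this preprocessing step is your responsibility, and the choice of fill value becomes one more thing to tune.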

XGBoost, on the other hand, has its own way of dealing with missing data. During training, XGBoost learns for each tree split a default direction in which to send records whose value for that feature is missing. Many real-world problems have missing data that in itself carries valuable information about the target. So for “free” missing-value handling, XGBoost wins.

XGBoost, the tree learns how to handle missing values. arXiv:1603.02754v3 [cs.LG] 10 Jun 2016
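
The idea of learning where to send missing values can be sketched in a few lines of NumPy. This is a simplified illustration, not XGBoost's actual implementation: it scores a single fixed split with plain squared error, whereas XGBoost uses gradient statistics and regularization. The function name and the toy data are made up for this example.

```python
import numpy as np

def best_default_direction(x, y, threshold):
    """For the split `x < threshold`, try sending rows with missing x to the
    left and to the right, and return the direction with lower squared error."""
    missing = np.isnan(x)
    goes_left = np.zeros_like(missing)          # boolean array, all False
    goes_left[~missing] = x[~missing] < threshold

    def sse(values):
        # Sum of squared errors around the mean (the leaf prediction).
        return float(((values - values.mean()) ** 2).sum()) if values.size else 0.0

    losses = {}
    for direction in ("left", "right"):
        left = (goes_left | missing) if direction == "left" else goes_left
        losses[direction] = sse(y[left]) + sse(y[~left])
    return min(losses, key=losses.get), losses

# Toy data: rows with a missing feature have targets similar to large-x rows.
x = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
y = np.array([1.0, 1.2, 5.1, 5.0, 5.2, 4.9])

direction, losses = best_default_direction(x, y, threshold=5.0)
```

Here the missing rows are sent to the right, because their targets look like those of the rows with large feature values. In this way the fact that a value is missing becomes a signal the tree can use.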

Case 3: How Deep Are Your Pockets?

Neural network training is highly parallelizable, making it a great fit for parallel and distributed training. That is, if your budget can cover hours of training on expensive machines with TPUs or GPUs.

On the other hand, if you have no more than a few million records, XGBoost can be trained on a less expensive multi-core CPU and converge in less time. So if you have a limited amount of data and want to train a model, XGBoost may be more affordable and achieve similar results.

Cloud TPU v2 Pod — outstanding results in handling massive datasets

Case 4: How Much Data Do You Have?

The previous case brings us to the question of how much data you have. Due to its underlying data structure, XGBoost is limited in the ways it can be parallelized, which in turn limits the amount of data it can process. One way of handling massive datasets is to split the data into shards and stack models, effectively multiplying the number of parameters used to fit the data.

However, with neural networks it is usually “the more the merrier”. When dealing with massive datasets, neural networks can converge to a lower generalization error with the same number of parameters. But for smaller datasets, XGBoost typically converges faster and with a smaller error.

A large amount of data contributes to lower generalization error.

Case 5: How Complex Are the Input/Output Shapes?

XGBoost has more limitations than NNs regarding the shape of the data it can work with. It usually takes 1-D arrays as record inputs and outputs a single number (regression) or a vector of probabilities (classification). For this reason, an XGBoost pipeline is easier to configure. There is no need to worry about the shapes of the data: just provide a pandas DataFrame that looks like a table, set the label column and you are good to go.

Neural networks, on the other hand, are designed to work on tensors, that is, multi-dimensional arrays. An NN’s input and output shapes can vary between numbers, sequences (vectors), images and even videos. So for classic problems like click-through prediction based on structured data, both can work well in terms of data shape. But when it comes to building more complex data transformations, NNs may be your only valid choice!

Case 6: Can I Have Both Please?

What if you don’t want to choose? In many cases, a combination of both models may achieve better results than each individual model. I already mentioned model stacking in this post. The mathematical difference between the models results in a different error distribution over the same data. By stacking the models, a lower error rate can be achieved with the same amount of data, but at the cost of complicating the engineering of the system.
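
The effect of combining models with different error distributions can be simulated without training anything. In the sketch below, two sets of noisy predictions stand in for an XGBoost model and a neural network. The noise levels are arbitrary, the errors are assumed to be fully independent (which is optimistic for real models), and the combination is a plain average, the simplest form of ensembling, rather than full stacking with a meta-learner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth and two simulated models whose errors are independent.
truth = rng.normal(size=10_000)
pred_a = truth + rng.normal(scale=0.5, size=truth.size)  # stand-in for model A
pred_b = truth + rng.normal(scale=0.5, size=truth.size)  # stand-in for model B

def mse(pred):
    return float(np.mean((pred - truth) ** 2))

# Simple average of the two predictions.
blend = 0.5 * (pred_a + pred_b)

mse_a, mse_b, mse_blend = mse(pred_a), mse(pred_b), mse(blend)
```

Because the errors partly cancel out, the blended error is lower than that of either model alone. The more correlated the two models' errors are, the smaller this gain becomes.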

I’d like to thank Philip Tannor for his insightful remarks.

Want more stories? Check our blog, or follow Gad on Twitter.
