# GCP Machine Learning Face-off: BigQuery ML -vs- Custom Estimators in TensorFlow

The Google Cloud Platform (GCP) has been making waves recently with great new tools in Machine Learning (ML) and Analytics. What can we do to test these tools and investigate their capacity? Which is better, Cloud Datalab or BigQuery ML (BQML)? Can we build a custom estimator that is not supported by Google’s Estimator API?

I investigated these two interesting tools provided by GCP for ML to see how they would perform on the sample Google Analytics dataset for BigQuery. This dataset contains Google Analytics 360 data from the Google Merchandise Store.

The Google Merchandise Store data is typical of what you would see for an ecommerce website. As per the sample dataset description, it includes:

Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc.

Content data: information about the behaviour of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc.

Transactional data: information about the transactions that occur on the Google Merchandise Store website.

Market analytics for website interactions tracks things such as whether someone bought something on the site, which country, region, or city they came from, and other interesting facts and figures. To create a model and predict purchases with this data, we first have to consider our possible methods.

**The Easy Way**

Using BQML it is incredibly easy to create, train and predict on GCP. In the Getting Started with BigQuery ML for Data Analysts online guide, we see that it can be as easy as 1, 2, 3:

**1: CREATE MODEL → 2: ML.EVALUATE → 3: ML.PREDICT**

In BQML we can use standard SQL with these special commands to run effectively our entire ML process. Unfortunately, it is a bit of a black box, as we do not get to pick very many parameters of our model. One of the biggest advantages of BQML, however, is that you bring your code to the data, and not the other way around. This means you don’t need to perform any type of ETL’ing: your model executes in the same place as the data. This makes it incredibly scalable, fast and easy to use.

Once your dataset exists you select what model type you would like. Currently there are several options, but for this I chose to do a logistic regression, following the demo example. In this example the model runs 9 iterations on the logistic regression and stops early. It takes about 187 seconds for the model to run.
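The three steps map onto three SQL statements. As a rough sketch, the snippet below holds them as Python strings so they can be handed to whichever BigQuery client you prefer. The model name (`bqml_tutorial.sample_model`), the feature columns and the date ranges are assumptions modelled on the public getting-started guide, not necessarily the exact queries used for this post.

```python
# Sketch of the three BQML steps as SQL strings.
# MODEL_NAME, the feature list and the date ranges are assumptions.
MODEL_NAME = "bqml_tutorial.sample_model"  # hypothetical dataset.model
SOURCE = "`bigquery-public-data.google_analytics_sample.ga_sessions_*`"


def select_features(start, end):
    """SELECT producing the label plus a few session-level features."""
    return f"""
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, '') AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, '') AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM {SOURCE}
WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'"""


# 1: train a logistic regression where the data lives
create_model = (
    f"CREATE OR REPLACE MODEL `{MODEL_NAME}`\n"
    "OPTIONS(model_type='logistic_reg') AS"
    + select_features("20160801", "20170630")
)

# 2: evaluate on a later, held-out date range
evaluate = (
    f"SELECT * FROM ML.EVALUATE(MODEL `{MODEL_NAME}`, ("
    + select_features("20170701", "20170801")
    + "))"
)

# 3: predict a purchase label for each session
predict = (
    f"SELECT * FROM ML.PREDICT(MODEL `{MODEL_NAME}`, ("
    + select_features("20170701", "20170801")
    + "))"
)

for statement in (create_model, evaluate, predict):
    print(statement, end="\n\n")
```

Each string can be pasted into the BigQuery console as-is; nothing leaves BigQuery at any point, which is the "code to the data" idea in practice.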

The model we created also stored a schema:

Following the basic scenario, when I evaluate with ML.EVALUATE inside BigQuery we see that it runs in only 2.4 seconds with the following test results:

As a reminder, the model evaluation in this case is done on a few key metrics from the Google Cloud documentation:

precision: A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.

recall: A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify?

accuracy: Accuracy is the fraction of predictions that a classification model got right.

f1_score: A measure of the accuracy of the model. The f1 score is the harmonic average of the precision and recall. An f1 score’s best value is 1. The worst value is 0.

log_loss: The loss function used in a logistic regression. This is the measure of how far the model’s predictions are from the correct labels.

roc_auc: The area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
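To make these definitions concrete, here is a minimal plain-Python sketch that computes all six metrics from labels and predicted probabilities. The 0.5 threshold and the eight toy sessions are assumptions for illustration only; they are not taken from the BQML run.

```python
import math


def classification_metrics(labels, probs, threshold=0.5):
    """Compute ML.EVALUATE-style metrics for binary labels (0 or 1)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(labels)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Mean negative log-likelihood; eps guards against log(0).
    eps = 1e-15
    log_loss = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, probs)
    ) / len(labels)

    # ROC AUC as the share of (positive, negative) pairs ranked correctly,
    # counting ties as half. Needs at least one example of each class.
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    roc_auc = wins / (len(pos) * len(neg))

    return {
        "precision": precision, "recall": recall, "accuracy": accuracy,
        "f1_score": f1, "log_loss": log_loss, "roc_auc": roc_auc,
    }


# Toy data: two purchases among eight sessions.
labels = [1, 0, 0, 1, 0, 0, 0, 0]
probs = [0.9, 0.4, 0.2, 0.3, 0.6, 0.1, 0.1, 0.2]
metrics = classification_metrics(labels, probs)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Even in this toy case, accuracy (0.75) looks better than precision and recall (both 0.5), because most sessions are easy-to-predict non-purchases.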

With our low precision and recall, high accuracy, low f1 score, low log_loss and high roc_auc, we have a bit of a mixed message from this model. On the one hand our high accuracy is good, but we should not trust a model with such low precision and recall. So far, our BQML model is... OK.

We already knew that this would be a fast, easy way to get a model in BigQuery, but what happens when we want to use a custom estimator instead?

**The Hard Way**

In ML, as with many things, it pays to do a proper exploration of the data at hand when building a model. In the previous method, we ignored the statistics and processing as well as the model parameters (because we could). On GCP we have an option to build our own custom estimators using the TensorFlow open source machine learning framework but we have to pay a bit more attention to the details.

One of the best ways to develop new models on GCP is to take advantage of the Datalab notebook environment. These notebooks are based on Jupyter notebooks and will allow us to use a Pythonic environment along with several special commands.

In particular, I can now look at the revenue generated by different countries in a much more interactive way:

Since we are in Datalab, we can do quick visuals for our country data. In particular, using the special command `%%chart` we can render the image below:

Having a full look through the data we decide to build a TensorFlow custom model to do the same type of logistic regression that we had before. For any custom model we need the same 3 steps that we had above.

**1: CREATE MODEL → 2: EVALUATE MODEL→ 3: PREDICT**

Since in this case we are running a TensorFlow model in Datalab, I will need a more code-based approach, as opposed to using SQL in BQML. To match the datasets used in BQML, I select the same data and save it to a variable that is put into a Python DataFrame in Datalab:

The only special thing that we need to bring this data from BigQuery into Datalab is the `-n train` statement at the top of the notebook cell.

To count the number of transactions predicted by our model, I need to use the same ‘label’, which indicates whether a transaction happened. I also use the same features as in the schema above from the BQML case.

TensorFlow will use a neural net structure to try to learn the expected output for the inputs that we give it in our training dataset. For more info on basic neural networks, check out the TensorFlow Playground:

The process of classification in a custom neural network with TensorFlow has a few extra steps: we have inputs, the construction of our neural net, and outputs.

**Creating a model with a custom estimator in TensorFlow**

TensorFlow custom estimators can be used to create kinds of models that TensorFlow’s pre-made estimators do not cover. In particular, they are useful for defining a specific loss function, calculating a unique metric, or handling any other problem-specific requirement. Our problem is very traditional, so we don’t strictly need to put in so much work, but we will see how it runs for the sake of the comparison. Since I want to match the BQML behaviour as closely as possible, I will use similar parameters in the custom estimator and retrieve the same metrics. With the same metrics returned, we can compare BQML to TensorFlow directly. The TensorFlow setup will have:

Input Layer → Hidden Layers → Output Layer

I will use a custom estimator, but first I have to have an input function for the training data.
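Conceptually, an input function is a zero-argument callable that hands the model batches of column-oriented features together with their labels. The plain-Python sketch below mimics that contract without TensorFlow; `make_input_fn`, the column names and the three toy rows are all hypothetical and only illustrate the shape of the data.

```python
import random


def make_input_fn(rows, label_key="label", batch_size=2, shuffle=True, seed=0):
    """Return a zero-argument function yielding (features, labels) batches."""
    def input_fn():
        data = list(rows)
        if shuffle:
            random.Random(seed).shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Column-oriented features: one list of values per feature name.
            names = [key for key in batch[0] if key != label_key]
            features = {name: [row[name] for row in batch] for name in names}
            labels = [row[label_key] for row in batch]
            yield features, labels
    return input_fn


# Hypothetical rows, shaped like the DataFrame pulled from BigQuery.
rows = [
    {"country": "United States", "pageviews": 12, "label": 1},
    {"country": "India", "pageviews": 3, "label": 0},
    {"country": "Canada", "pageviews": 1, "label": 0},
]

train_input_fn = make_input_fn(rows, shuffle=False)
for features, labels in train_input_fn():
    print(features, labels)
```

Shuffling is left on by default because training generally benefits from it; for evaluation you would typically turn it off so results are reproducible.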

In TensorFlow, when inputs are sparse you have the option to use the `tf.feature_column.embedding_column` to convert from sparse, categorical input into a dense representation to feed a deep neural net. That means I have to put data into tensors before I will have anything that the model will actually be able to handle, but for this case study it is easy to implement. I set all columns as tensors and use column embedding for all of our non-numeric columns. Once `feature_columns` is set, I then define the model:
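As an aside on the embedding step (this is not the model definition itself), the idea can be sketched in plain Python: each category is mapped to an integer id, and the id selects a row from a table of dense, trainable vectors. The `EmbeddingColumn` class, the vocabulary and the dimension of 4 are hypothetical stand-ins, not the TensorFlow API.

```python
import random


class EmbeddingColumn:
    """Toy stand-in for an embedding column: category -> dense vector."""

    def __init__(self, vocabulary, dimension, seed=0):
        rng = random.Random(seed)
        # Index 0 is reserved for out-of-vocabulary values.
        self.index = {value: i + 1 for i, value in enumerate(vocabulary)}
        # One row per category; randomly initialised here, learned in training.
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(dimension)]
            for _ in range(len(vocabulary) + 1)
        ]

    def lookup(self, value):
        """Return the dense vector for a category (row 0 if unseen)."""
        return self.table[self.index.get(value, 0)]


country = EmbeddingColumn(["United States", "India", "Canada"], dimension=4)
pageviews = 12.0  # numeric features pass through unchanged

# Dense input row for the network: numeric value plus embedded category.
dense_row = [pageviews] + country.lookup("India")
print(len(dense_row))  # 5
```

The point of the dense representation is that a column with hundreds of distinct countries becomes a handful of numbers per row, which a deep network can consume directly.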

The model definition is one of the most critical parts: by setting it up as a custom estimator, there are all kinds of interesting things we can adjust. There are many ways to set up custom loss functions; here we use classes_weights to set the class weights and avoid problems with unbalanced data.

What is unbalanced data? One big advantage to using Datalab is that we can quickly hop into Python and see how many of each class (purchases or no purchases) we actually have. In our training set we have 10,478 purchases and 818,807 non-purchases. This large difference in the amount of data in the two categories that we are sorting with our ML algorithm means that the neural net will be biased towards non-purchases. The testing set will likewise be biased. Since all machine learning is only as good as the data we give it, we should not expect a great result with such biased data. We could rebalance our data, but that is a blog post for another time.
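A quick back-of-the-envelope check with the counts above shows why accuracy alone is misleading here. The negative-to-positive ratio at the end is a common starting point for a positive-class weight; that is a general heuristic and not necessarily the value used in this experiment.

```python
purchases = 10_478
non_purchases = 818_807
total = purchases + non_purchases

# A model that always answers "no purchase" never finds a buyer...
baseline_recall = 0 / purchases
# ...yet it is right for every non-purchase session.
baseline_accuracy = non_purchases / total

# Ratio of negatives to positives (heuristic starting point for a
# positive-class weight; an assumption, not the author's setting).
imbalance_ratio = non_purchases / purchases

print(f"positive share:    {purchases / total:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
print(f"neg/pos ratio:     {imbalance_ratio:.1f}")
```

A do-nothing classifier reaches roughly 98.7% accuracy on this training set, so a high accuracy figure says very little about whether purchases are actually being found.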

Fortunately, TensorFlow’s `tf.nn.weighted_cross_entropy_with_logits` lets us set a `pos_weight` that, when greater than 1, decreases false negatives in our classification. This is a sort of in-process weighted rebalancing, if you will. There are a few different options for this, but I chose the above as the weighting made a bit more sense to me. So, we have a model, we have features, we are just missing our parameters:
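Stepping back to the weighting for a moment: the effect of the positive-class weight is easy to see in a plain-Python version of the loss, `pos_weight * label * -log(sigmoid(logit)) + (1 - label) * -log(1 - sigmoid(logit))`. This is a per-example sketch of the documented formula rather than the TensorFlow op itself, and the weight of 10 is an arbitrary illustrative value.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def weighted_cross_entropy(label, logit, pos_weight):
    """Sigmoid cross-entropy where positive examples are scaled by pos_weight."""
    positive_term = pos_weight * label * softplus(-logit)  # -log(sigmoid)
    negative_term = (1 - label) * softplus(logit)          # -log(1 - sigmoid)
    return positive_term + negative_term


# A missed purchase: label 1, but the model leans towards "no purchase".
missed = weighted_cross_entropy(label=1, logit=-2.0, pos_weight=1.0)
missed_weighted = weighted_cross_entropy(label=1, logit=-2.0, pos_weight=10.0)
# A correctly rejected non-purchase is unaffected by pos_weight.
negative = weighted_cross_entropy(label=0, logit=-2.0, pos_weight=10.0)

print(f"missed purchase, weight 1:  {missed:.3f}")
print(f"missed purchase, weight 10: {missed_weighted:.3f}")
print(f"non-purchase, weight 10:    {negative:.3f}")
```

Because only the positive term is scaled, a false negative becomes ten times more expensive while the cost of negatives is unchanged, which pushes the model towards higher recall at the expense of precision.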

By setting up a deep neural net with a narrower middle (as shown above), I force the classification to find the sparse purchases as well as the non-purchases. With this, I just run a `model.train` for TensorFlow and then evaluate.
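For readers who want to picture the "narrower middle" shape, here is a small plain-Python forward pass through such a network. The layer widths in `HIDDEN_UNITS` are hypothetical (the actual parameters are not reproduced here), the weights are random and untrained, and biases are omitted for brevity.

```python
import math
import random

HIDDEN_UNITS = [16, 4, 16]  # hypothetical widths: wide -> narrow -> wide


def init_layers(n_inputs, hidden_units, seed=0):
    """Random weights for each hidden layer plus a single-logit output."""
    rng = random.Random(seed)
    sizes = [n_inputs] + list(hidden_units) + [1]
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def forward(layers, row):
    """Input layer -> ReLU hidden layers -> sigmoid purchase probability."""
    activations = row
    for i, weights in enumerate(layers):
        z = [sum(w * a for w, a in zip(unit, activations)) for unit in weights]
        is_output = i == len(layers) - 1
        # ReLU on hidden layers; the output layer keeps its raw logit.
        activations = z if is_output else [max(0.0, v) for v in z]
    logit = activations[0]
    return 1.0 / (1.0 + math.exp(-logit))


layers = init_layers(n_inputs=5, hidden_units=HIDDEN_UNITS)
probability = forward(layers, [0.2, 1.0, 0.0, 0.5, 0.3])
print([len(layer) for layer in layers], round(probability, 3))
```

The narrow middle layer acts as a bottleneck: everything the network knows about a session has to pass through a handful of units before the final purchase probability is produced.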

**Evaluation**

Evaluation with TensorFlow models, much like in BQML, should be performed on different data than what was used for training. I again follow the BQML example, capture the same evaluation set, and store it in an evaluation DataFrame:

Next, the evaluation input function can be set and I just gather our metrics with a `model.evaluate`:

How did we do? Well, in the custom estimator I had to spend some time adjusting the parameters in the model (which we did not do before) and in the Datalab notebook we can see that my choices above led to a much higher recall as I have optimised for recall in this custom model. Our recall is 0.682, compared to the BQML recall of only 0.089. Is this a victory for TensorFlow?

# Who wins?

Let’s take a closer look at the two sets of results. We did a quick custom estimator and got much improved recall with TensorFlow over BQML. We also know that just improving recall does not necessarily mean an overall better model. Looking at the TensorFlow case, we should consider time as well, since it took much longer to set up. In this experiment we might be better served by a canned model or by making some adjustments in BQML (because our model is so simple). Running the code is comparable in the two methods, but developing a good TensorFlow model can take days or even weeks.

A few important points to consider: this was a very unbalanced dataset, and it needed the love and attention we could give it in Datalab. If we had balanced our dataset in BQ, perhaps BQML would have done a better job. So while BQML is a bit of a black box, TensorFlow custom estimators go to the other extreme in this case, with many settings and transforms that can be very hard to untangle.

**When to use BQML:** when you only have about 10 minutes, or, as in this case, your modelling does not require a custom estimator.

**When to use TensorFlow:** when you have a couple of days and you are sure that you must write the model function.

What is my final takeaway? I might go back to PyTorch and scikit-learn for most of my basic machine learning, but if I need to build a truly fast basic estimation, BQML will probably be my new favourite. That said, as a final extra, make sure to check out Keras if you are interested in doing fast prototyping in TensorFlow.

To see the source code for this experiment, check out my GitHub repo here.