Machine Learning with TensorFlow on Google Cloud Platform
I've been doing Machine Learning (ML) as a hobby for a couple of years now, and I started out using various Macs, which turn out to be not that great for machine learning. They lack the right kind of GPU (at least mine do) and as a result it takes forever to train even simple models. Using a cloud service like PaperSpace or FloydHub helps: they are easy to use, model training is much faster, and they usually have a free tier. As an alternative, I also train models on my Windows desktop, which has an Nvidia GTX 780 GPU. Even though it is an older GPU, training goes much faster (hours instead of days). But when it comes to machine learning, the hardware is only half the story. Setting up (and maintaining) the correct environment can be a hassle: GPUs require the correct CUDA libraries, and you need TensorFlow, Python (sometimes 2.x, sometimes 3.x), Jupyter notebooks, Keras, pandas, NumPy, et cetera, with versions that are all compatible with each other. Although doable in my situation, it made me wonder what a solid, well-functioning ML production environment should look like.
That was one of the reasons I decided to take the Machine Learning with TensorFlow on Google Cloud Platform specialization at Coursera (the other being that I like to learn new things). This specialization consists of five courses, each focusing on a different aspect, but all in the context of how to do machine learning with the Google Cloud Platform (GCP for short). As you would expect from an AI-first company where ML is central to all its products and operations, Google knows how to do it right.
I can highly recommend doing the course yourself if you have the time (it's technical, but not too technical), but in this article I want to share the most important lessons I learned.
Data is the challenge
If you have ever looked at a technical ML book or a piece of ML code, you might think that making the model is the hardest and most time-consuming part of the process. But it usually isn't. The biggest challenge is the data you need for training your model, which has to fulfil at least the following requirements:
- you need to have enough of it,
- it needs to be usable,
- and you need to be able to handle vast amounts of it.
What is enough? The more, the better, as more data will usually result in better models. To give you a general idea of the magnitude: Tencent open-sourced its ML image data set last year, and it contains almost 18 million labelled images (available on GitHub). That doesn't necessarily mean you will need millions of data points, as that will depend on what you are trying to achieve. But you do need data, and you should start thinking about it as soon as possible: what data do I need, is it available already, and if not, how can I get it? And what should this data look like; in what format should it be? Which brings us to the second requirement: usability.
Usable means the data should be of high quality; (too many) missing or invalid values will mess up your training if not fixed (and don't forget to apply the same fixes to your production data after your model has gone live!). It also means having the right kind of data. You may have lots of data available, but you have to make sure you have the data that is relevant for your problem and your model. Machine Learning models are not magic black boxes which you feed all your unprocessed data, whereupon they return the correct predictions and classifications. You need to pre-process: carefully select, clean and combine (or split) your data to get the most out of your model. This will be relatively easy when you do something like image classification, where the input is straightforward (images), but what about a model for predicting loan defaults? What input does a model like that need, and do you have that data available? The preprocessing phase is a necessary step to make sure your model will be usable and performant in a production environment.
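To make this a little more concrete, here is a minimal preprocessing sketch in pandas. The file and column names (a hypothetical loans dataset) are made up for illustration:

```python
import pandas as pd

# Hypothetical loans dataset; file and column names are illustrative.
df = pd.read_csv("loans.csv")

# Rows without a label are useless for training.
df = df.dropna(subset=["defaulted"])

# Fill missing numeric values with the median. Remember to apply the
# exact same fill value to production data once the model is live!
median_income = df["income"].median()
df["income"] = df["income"].fillna(median_income)

# One-hot encode a categorical column so the model can consume it.
df = pd.get_dummies(df, columns=["home_ownership"])
```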
Which brings us to the last requirement: you need to be able to handle all that data. Your technical infrastructure needs to be able to store and manipulate vast quantities of data, requiring lots of storage (pretty sure 18 million images won't fit on your laptop), memory and GPU processing power. And your organisation needs people with enough business knowledge to help you with your data selection, as well as skilled people who can manipulate data on a more technical level: query, transform, clean and enrich it.
GCP offers services like BigQuery to help you with this. BigQuery lets you explore your data in an easy and intuitive way regardless of its size (a fairly complex SQL query on a 70-million-row table took just a few seconds). Manipulating data is just as easy, with all the tools you need to get your data into the shape you want and need.
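As a sketch of what that looks like from Python, this is roughly how you run a query with the BigQuery client library; the table is one of Google's public sample datasets used in the course, so treat the names as illustrative:

```python
from google.cloud import bigquery

# Assumes a GCP project with credentials configured in the environment.
client = bigquery.Client()

sql = """
    SELECT airline, AVG(departure_delay) AS avg_delay
    FROM `bigquery-samples.airline_ontime_data.flights`
    GROUP BY airline
    ORDER BY avg_delay DESC
    LIMIT 10
"""

# BigQuery does the heavy lifting server-side, even on huge tables.
for row in client.query(sql).result():
    print(row.airline, row.avg_delay)
```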
The right Abstraction level
TensorFlow is Google's open-source tool for machine learning. It is well documented and has a lot of learning resources available. But if you look at the lower-level APIs of TensorFlow, you will be intimidated by all the possibilities, settings, functions and parameters. That means that besides actually making the ML model (which is, as we will see, a daunting task in itself), you would also need to familiarise yourself with all the technical details of the tool and how to use it. If you look at the practical applications of machine learning, however, many problems fall into a small set of categories (regression, classification, etc.) and, even better, so do their solutions. That made it possible for Google to add an abstraction layer on top of the low-level APIs which is much easier to use. They succeeded in not only reducing the number of how-to-do-its you need to know for ML, but even reducing the number of what-to-dos by aggregating (sub)tasks into more general tasks. The tf.estimator API (sketched below) is a good example; it allows a simple definition of common models while the technical details are taken care of for you. Thanks to this (relatively) intuitive abstraction level, you can focus on making and training the model instead of on how to get it all to work in the first place.
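To give an impression of that abstraction level, here is a minimal tf.estimator sketch (TF 1.x style, as used in the course); the feature names and values are made up:

```python
import tensorflow as tf

# Two feature columns; names and vocabulary are illustrative.
featcols = [
    tf.feature_column.numeric_column("sq_footage"),
    tf.feature_column.categorical_column_with_vocabulary_list(
        "type", ["house", "apt"]),
]

# A complete linear regression model in one line; the training loop,
# the optimizer and checkpointing are all handled for you.
model = tf.estimator.LinearRegressor(featcols)

def train_input_fn():
    features = {"sq_footage": [1000.0, 2000.0, 850.0],
                "type": ["house", "apt", "house"]}
    labels = [500.0, 1000.0, 400.0]
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    return ds.shuffle(3).repeat().batch(2)

model.train(train_input_fn, steps=100)
```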
Training a ML model
If you have followed the news around machine learning and AI, you probably know that much of the progress that has been made is thanks to a field called Deep Learning. By using neural networks of various types, consisting of multiple layers and neurons, deep learning models can do some impressive stuff (yes, a lot more than recognising cats in pictures). But making a model is as much an art as it is a science, as there are no fixed solutions that are guaranteed to work for your specific situation and data.
The best way to experience this is to try it out yourself in Google's TensorFlow Playground.
Try adding or removing layers or neurons, or change some of the other settings; as you will find out, the effects are not always easy to predict. This trial-and-error training process is probably not something you would expect from a mathematical discipline like machine learning, and yet there it is. There are, of course, general guidelines on which models work best for a given problem, but in the end it is up to you to find the best model for YOUR problem.
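In code, the Playground setup corresponds roughly to something like this sketch: two numeric inputs and two hidden layers, whose sizes you can vary just like dragging neurons around in the browser (names and values are illustrative):

```python
import tensorflow as tf

featcols = [tf.feature_column.numeric_column("x1"),
            tf.feature_column.numeric_column("x2")]

# Two hidden layers with 4 and 2 neurons; change hidden_units or the
# learning rate and observe how differently the model trains.
model = tf.estimator.DNNClassifier(
    hidden_units=[4, 2],
    feature_columns=featcols,
    n_classes=2,
    optimizer=tf.train.AdamOptimizer(learning_rate=0.03),
)
```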
And to see if a model is any good, you need to train it. In training, your model will hopefully learn enough from the data so it can predict or classify new data with an acceptably low error rate. As this isn't a TensorFlow tutorial, we won't go into the technical details of the training itself, and they are not important for my next point: training will take time and effort, and you will have to go through the cycle of train, evaluate, adjust and retrain a number of times.
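With the estimator API, one such cycle looks roughly like this; a sketch that assumes input functions like the earlier one, plus an eval_input_fn over held-out data:

```python
# Train and periodically evaluate; repeat after adjusting the model.
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=5000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

# Inspect the metrics, adjust, retrain.
print(model.evaluate(input_fn=eval_input_fn))
```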
The things you will adjust in these cycles are called the hyperparameters of training. Think of it as a set of knobs you can turn and which will influence the quality of your model. The problem is there are a variety of hyperparameters, and they have no fixed values which will guarantee you the best result. There are general rules of thumb, but the effect of the hyperparameters will depend on the kind of model and data you work with.
So in training, your goal is to find the optimal combination of hyperparameters which will result in the best-performing model. If you think this sounds like an optimisation problem, you are right. And GCP offers a service that does just that for you: Hyperparameter Tuning. You define a set of hyperparameters and a range of values, and it will find the optimal settings for you. It doesn't get much easier than that.
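The way this works in practice is that your training code accepts its hyperparameters as command-line flags; the tuning service then launches trials with different values and keeps the best combination. A sketch, where the flag names are examples you would declare in your tuning configuration:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01)
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--hidden_units", type=str, default="64,32")
args = parser.parse_args()

# Build the model from whatever values this particular trial received.
hidden_units = [int(n) for n in args.hidden_units.split(",")]
```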
From training to production
Once you have trained your model, it is time to start using it in production. But how? Whatever your model does, it will need input; more specifically, the same kind of input in the same format it received during training. This can come in all forms: periodic datasets, streaming inputs, sequential or parallel input and so on. But no matter in what form your production data is delivered, you need the right kind of infrastructure to handle it in a scalable way. Again, your company might have this infrastructure already, or possess the required knowledge on how to implement it. But what if you don't? Depending on your situation, you can choose to invest time and money in implementing or changing your infrastructure, or you can use GCP, which has convenient ways of doing it for you.
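With the estimator API, going to production starts with exporting the trained model in a servable format; a minimal sketch (TF 1.x style, feature names matching the earlier examples):

```python
import tensorflow as tf

# Serving-time inputs must match what the model saw during training.
def serving_input_fn():
    inputs = {"sq_footage": tf.placeholder(tf.float32, [None]),
              "type": tf.placeholder(tf.string, [None])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

# The exported SavedModel can then be deployed, e.g. behind a REST endpoint.
model.export_savedmodel("exported_model", serving_input_fn)
```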
AWS, Azure or GCP?
By now you may think this is some kind of advertorial for Google Cloud Services. Rest assured, it isn't. I don't work for Google, and they don't pay me. But if you need to do a job, you need the right set of tools that will help you do it best. And if your job is to make and implement production-ready ML models, GCP offers the right set of tools. This is not to say AWS and Azure don't offer equivalent services; they do. And although this article is not an in-depth comparison of the three biggest cloud platform providers, we do need to address the question of which one is best when it comes to machine learning.
AWS jumped into the cloud service business early (and successfully) and had a clear head start over the others. As a result, many enterprises of all sizes now use one or more AWS services. If you have already invested deeply in AWS, you might want to look at what they have to offer on ML. If you haven't, or only use a couple of unrelated AWS services, GCP may be a better option.
Microsoft, for its part, has had strong ties with (large) enterprises for decades. It suffered a dip, but after a period of strategic reorientation it is working hard to regain enterprise territory with its Azure offering. Again, if you are already using Azure a lot, it makes sense to see what they can offer on machine learning.
Google, on the other hand, isn't a typical enterprise player. Although everyone uses Google search, its other products are not used on a large scale in enterprises. But it has rebranded itself as an AI-first company for a reason, and if you look at both how successfully Google uses ML in its own products and at the high-quality, ML-oriented services the GCP platform provides, you should take a serious look at it if your focus is on AI/machine learning.
Pricewise, there is not much to say that can help you decide on a platform. For all three, costs depend on multiple factors: contract type, the number of services you use and how you use them (at what scale, during what time). This means that while you can make an educated (but rough) guess about costs, you will only know for sure after the bill has come in. There is no way of deciding which one is cheaper other than trying them all, and given the work that involves, that isn't realistic. Which leaves the available services and their quality as the deciding factor. And when it comes to data science and AI/machine learning, Google offers a lot.
Or you can do it all yourself; nothing prevents you from buying your own hardware. A suitable workstation, some powerful GPUs and enough storage will get you a long way (off-the-shelf deep learning workstations are readily available).
This is probably the best option if you are trying out ML as a hobby or for a small PoC in your company. Even for a startup this may make sense, as the investment in some decent hardware will be lower than a couple of months of intensive use of multiple cloud services. For established enterprises, it depends on their current situation and technology; if you have the hardware and the know-how to set up the infrastructure ML needs, doing it yourself may be a viable option. If not, go for a cloud solution. As I said: training a model is one thing, using it in a production environment is something else. So something you should definitely take into account is what your production environment will look like (will it use datasets or streaming data? How much? Can you scale easily if necessary?). Although GCP services come at a price, they do make your life easier, more predictable and safer when going live. Whether that is worth it is up to you.
As for me, this course gave me an appetite for more: the Advanced Machine Learning with TensorFlow on Google Cloud Platform Specialization (after finishing Andrew Ng's excellent Deep Learning Specialization).