Model serving with TFX

Nov 12th 2021

Vladimir Lyashenko

Marketing Content Creator

One way or another, every Data Scientist faces the problem of deploying a trained model into production. Unfortunately, developers often have to reinvent the wheel and write a custom wrapper around a Machine Learning (ML) library to solve this task, which significantly slows down the whole process. However, if you used TensorFlow (TF) or any other framework that uses TensorFlow as a backend (for example, Keras) as your Deep Learning (DL) framework, you should not face such challenges thanks to the TensorFlow Serving tool.


In this article, we will talk about:

  • What is Model Serving?
  • What is TensorFlow Serving?
  • TensorFlow Serving architecture
  • How to set things up with TensorFlow Serving?
    • Installing Docker
    • Installing TensorFlow Serving
  • Building an image classification model
  • Serving a model with TensorFlow Serving
    • Communication protocols
    • Creating gRPC and REST endpoints
    • Making a request to the model
  • Disadvantages of working with TensorFlow Serving
  • Common errors you might face when working with TensorFlow Serving
  • Best practices, tips, and strategies when working with TensorFlow Serving
  • How to simplify your deployment experience with Layer

Let’s jump in.

What is Model Serving?

To start with, let’s make the terminology clear and talk about Model Serving in general. As you might know, there is an ML lifecycle consisting of:

  1. Training and testing a model
  2. Packaging the model
  3. Validating the model
  4. Repeating steps 1 – 3 until you are satisfied with your model’s performance
  5. Deploying the model
  6. Monitoring and retraining the model

In the ML lifecycle terms, Model Serving is in the “Deploying the model” step. The ultimate goal of Serving is to provide smooth access to the trained model by users or other software.

What is TensorFlow Serving?

Right now, you might be wondering why the article is dedicated to Model Serving with TFX, but you are reading about the TensorFlow Serving (TFS) tool. Let’s make this clear. The official documentation defines TFX as “a Google-production-scale ML platform based on TensorFlow that provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor a machine learning system.”

So, as you might have guessed, TensorFlow Serving is a TFX library that is directly responsible for model deployment. Thus, TFS is “a flexible, high-performance serving system for machine learning models, designed for production environments that makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs.”

The key advantages of TensorFlow Serving compared to other serving tools are:

  • The out-of-the-box integration with TensorFlow models. Moreover, you can extend TFS to serve other types of models and data.
  • TFS does not require writing additional wrapper code.
  • Models deployed with TFS have lower inference latency than ones wrapped in Flask or Django web applications.
  • The TFS architecture allows you to efficiently manage your models (for example, swiftly swap them) and control their versions.
  • Using TFS does not require as much time as other serving tools and techniques.

TensorFlow Serving architecture

The crucial part of the TensorFlow Serving architecture presented in the picture below is the Model Server. The basic concept of the Model Server operation is quite simple. When the Server starts, it receives a path to a model it needs to load and a port it will listen to. And that is what it does, it loads the specified model and listens to the specified port.

[Image: TensorFlow Serving architecture diagram]

If the Model Server receives a request, it will behave in either of the following ways:

  • Execute the model for this request;
  • Combine several requests into a batch and perform computations for the entire batch. Model Server will process batches only if you pass the corresponding flag (`--enable_batching`) at startup;
  • Enqueue a request if the computing resources are busy at the moment.

As mentioned above, TensorFlow Serving can swap models swiftly. This is possible because the Model Server constantly scans the path specified at startup for new models and automatically loads a new version when one is found. Thus, you can deploy new versions of models without stopping the Model Server.
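The version-selection rule can be sketched in a few lines of Python. This is only a rough illustration of the behavior, not TFS's actual implementation: by default, TFS serves the numerically largest version sub-folder it finds under the model base path.

```python
import os
import tempfile

def latest_version(base_path):
    """Pick the numerically largest version sub-folder, as TFS does by default."""
    versions = [d for d in os.listdir(base_path) if d.isdigit()]
    return max(versions, key=int) if versions else None

# Simulate a model folder containing two timestamped versions.
base = tempfile.mkdtemp()
for v in ("1636000000", "1636100000"):
    os.makedirs(os.path.join(base, v))

print(latest_version(base))  # 1636100000
```

Dropping a new timestamped folder into the watched path is therefore enough for TFS to pick up and serve the new version.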

Thus, TensorFlow Serving is an excellent tool to use in production. Sure, you could create a self-written model wrapper, but such an approach is unjustified since TensorFlow Serving offers the same capabilities without the need to write and maintain custom solutions. If you want to learn more about the TFS architecture, refer to the related guide in the official TensorFlow documentation.

How to set things up with TensorFlow Serving?

Unfortunately, TensorFlow Serving is not a built-in package of the TensorFlow framework. It means that you need to install it separately before using it. However, that is where Docker comes to help.

As you might know, Docker is software that allows you to package and run applications in isolated containers. It is more lightweight than conventional virtualization tools such as VirtualBox because isolation happens at the operating system level rather than through full virtual machines. The most significant advantage of Docker is that it lets you bundle an application with its libraries and dependencies into a single container that can be executed in a completely different environment.

Docker is “the most straight-forward and easiest way of installing and using TensorFlow Serving”. Still, the TFS official installation documentation has other approaches you might use to achieve the goal. Nevertheless, if you do not have a strong case against Docker, you should go the easy way.

Installing Docker

So, the TensorFlow Serving installation path consists of two steps. Firstly, you need to install Docker; secondly, the TFS tool itself. As for Docker, there are multiple guides on the Web that will help you install it. Still, you should pay closer attention to the official installation documentation covering various installation paths for multiple platforms, such as Windows, Mac with an Intel chip, Mac with M1 chip, and Linux systems.


You can check whether Docker is running correctly by typing `docker run hello-world` in the Terminal command line. If the result is similar to the picture below, everything works just fine, and you are ready to proceed with the following installations.

[Image: Terminal output of `docker run hello-world`]

Installing TensorFlow Serving

As for TensorFlow Serving itself, you can simply pull its image from Docker Hub using the `docker pull` command. It is as simple as it sounds:

docker pull tensorflow/serving:latest

If the result is similar to the picture below, then everything worked as intended.

[Image: Terminal output of `docker pull tensorflow/serving`]

As the official documentation suggests, such an approach installs a CPU image of TensorFlow Serving. If you want to run a GPU serving image, you need to pull it separately by running a different shell command.

docker pull tensorflow/serving:latest-gpu

Last but not least, please note that you might face errors when installing TensorFlow Serving on a Mac with an M1 chip because, as of this writing, TF does not support M1. Unfortunately, the only stable solution in such a case right now is to use another system. Please check related discussions on the web, as the problem might already be solved.

Building an image classification model

Now, when you have TensorFlow Serving properly installed, you will need a TensorFlow model to serve. In real life, you would have to either train your model or download a pre-trained one using the TensorFlow Deep Learning framework (or any other framework that uses the TensorFlow backend). If you do not have TensorFlow installed, please refer to the official installation documentation or simply use Google Colab.

For this tutorial, I've prepared a Google Colab notebook that trains a simple Convolutional Neural Network (CNN) on the CIFAR-10 dataset. It is worth mentioning that I did not aim to train a perfect model, so the model presented in the notebook might need further tuning.

As you might know, CIFAR-10 image classification is a popular DL task. The dataset consists of 60 thousand images covering ten completely mutually exclusive classes. Fortunately, CIFAR-10 can be easily accessed as it is built into the tensorflow.keras.datasets package.


Image examples from the CIFAR-10 dataset. Source: Google Colab notebook

There is nothing complicated in the notebook, so please feel free to experiment and play around with it, as there is no better way to learn something than to practice. Also, if you plan to work in Google Colab, please do not forget to turn on the GPU accelerator as it will increase the training speed. The notebook covers the following steps:

  1. Downloading and preprocessing the CIFAR-10 dataset.
  2. Building a simple CNN architecture by using two `conv-conv-pool` blocks and two fully connected layers.
  3. Training the model.
  4. Saving the model in a folder with a timestamp name. Please pay attention to this step, as having folders named after specific timestamps is effective for model versioning.

If you have successfully executed the notebook, you will get a trained model saved in a folder named after a specific timestamp.
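The timestamped export layout from that last step can be sketched as follows. The path names here are illustrative, and the `model.save(...)` call from the notebook is only indicated in a comment:

```python
import os
import tempfile
import time

# TFS expects: <model_name>/<version>/saved_model.pb + variables/
base = os.path.join(tempfile.mkdtemp(), "trained_model")
version = str(int(time.time()))
export_path = os.path.join(base, version)
os.makedirs(export_path)

# In the notebook, you would now call model.save(export_path), which
# writes saved_model.pb and the variables/ folder into this directory.
print(export_path)
```

Because the sub-folder name is a plain integer timestamp, TFS can read it as a version number and always serve the most recent export.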


In Google Colab, you can check your saved files in the left “Files” menu. The structure of the folder should be similar to the one presented in the picture below.

[Image: structure of the saved model folder. Source: notebook]

For the complete code, please refer to the notebook.

Serving a model with TensorFlow Serving

Excellent, now you have a TensorFlow model to serve. Let's serve it using TensorFlow Serving and break the whole process down. TFS creates a communication channel between your model and an API endpoint. An API endpoint is a gateway that receives a request to a model and returns the result of the model execution. TFS will create two endpoints: a REST endpoint and a gRPC endpoint.

Communication protocols

REST, an acronym for Representational State Transfer, is a style of software architecture for distributed systems such as the World Wide Web that is typically used to build web services. The term was coined in 2000 by Roy Fielding, one of the authors of the HTTP protocol. In general, REST has a straightforward information management interface where each piece of information can be uniquely identified by a global identifier such as a URL.

On the other hand, gRPC (gRPC Remote Procedure Calls) is a high-performance remote procedure call (RPC) framework developed by Google. It is meant for building distributed systems (microservices) and APIs. Moreover, gRPC is easy to use and flexible, as you can create a client application in any language that supports gRPC.

In our case, both REST and gRPC are communication protocols that define the communication rules between several entities communicating with one another.

Creating gRPC and REST endpoints

Now, let’s create REST and gRPC endpoints using Docker for the model trained in the “Building an image classification model” section. Sure, there are guides on the web featuring other ways of running the TensorFlow Serving Model Server, but in this guide, we will use an approach featured in the TFS official documentation.

So, all you need to do is to run a similar command in the Terminal. Please pay attention that the paths in your command might differ.

docker run -p 8501:8501 --name tfserving_test --mount type=bind,source=/Users/vladimirlyashenko/Desktop/TFS/trained_model/,target=/models/trained_model -e MODEL_NAME=trained_model -t tensorflow/serving

At first, the command might seem a bit complicated. Fortunately, this is not true. Let’s break the command down:

  • You might face a “docker daemon” error when running this command. In such a case, please try to add `sudo` before `docker` (`sudo docker` …). It should help you to avoid the error.
  • `docker run` is used as we use Docker to run the tensorflow/serving image.
  • `-p 8501:8501` publishes the container's 8501 port to the local machine's port 8501. Thus, you specify the REST API endpoint port. So, when making a prediction request you will make it to http://localhost:8501, as it will redirect the request to the container's 8501 port.


  • `--name tfserving_test` sets the name of the Docker container that will run TensorFlow Serving;


  • `--mount type=bind,source=/Users/vladimirlyashenko/Desktop/TFS/trained_model/,target=/models/trained_model` mounts the model folder from the source path to the target path in the container. TFS will load the model from the container's target path. Please make sure that the source folder and the container's folder have the same name (in this case, they are both called `trained_model`) because otherwise you will face an error;


  • `-e MODEL_NAME=trained_model` specifies the model that TFS will upload. Pay attention that the container’s folder with a model is also called “trained_model”;
  • `-t tensorflow/serving` specifies the TFS image that will be used to run the Model Server.

If everything works just fine, the Terminal output will be as follows:

[Image: TensorFlow Serving Model Server startup log]

Skimming through the output, you will see that TFS created both the REST (http://localhost:8501) and the gRPC (0.0.0.0:8500, the container's default gRPC port) endpoints.

Making a request to the model in TensorFlow Serving

Finally, it is time to make a request to the model through the REST endpoint and get its prediction. Official documentation suggests using shell commands to make a REST API call. Still, you can do the same through Python using a simple function.

import json
import numpy as np
import requests

def make_prediction(data, headers, endpoint):
    json_response = requests.post(endpoint, data=data, headers=headers)
    prediction = json.loads(json_response.text)
    return np.argmax(np.array(prediction['predictions']), axis=1)

The key components of this function that you must pass to it to make a successful request are the data, headers, and the URL address of the REST API endpoint:

  • Please keep in mind that the data passed to the function must follow a specific JSON format: `{"instances": values}`.


  • When working with the REST API endpoint, the `content-type` header should be `application/json`.


  • The URL address of the endpoint consists of several components arranged in a specific order (http://HOST:PORT/v1/models/MODEL_NAME:TASK)
    • `HOST` is the IP address of the Model Server. When the server runs locally, it is `localhost`.
    • `PORT` is the port specified at the server's startup. By default, the REST API endpoint serves a model on port 8501. Still, you can use any other available port.
    • `MODEL_NAME` is simply the name of the model you want to use.
    • `TASK` is a keyword that specifies the task the model must perform. When making predictions, set `TASK` to `predict`.
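Put together, the components above compose into the endpoint URL like this (the model name `trained_model` matches the earlier `docker run` command):

```python
def rest_url(host, port, model_name, task):
    # Assemble http://HOST:PORT/v1/models/MODEL_NAME:TASK
    return f"http://{host}:{port}/v1/models/{model_name}:{task}"

print(rest_url("localhost", 8501, "trained_model", "predict"))
# http://localhost:8501/v1/models/trained_model:predict
```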

That is it. Now you simply need to pass all the necessary information into the `make_prediction` function and get the predictions.

endpoint_url = 'http://localhost:8501/v1/models/trained_model:predict'
data = json.dumps({"instances": x_test[0:4].tolist()})
headers = {"content-type": "application/json"}
make_prediction(data, headers, endpoint_url)
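If you prefer to inspect the raw response yourself, the JSON body has the shape `{"predictions": [[p0, ..., p9], ...]}`. Here is a small, dependency-free sketch that maps each probability row to a class name; the class list follows the standard CIFAR-10 label order:

```python
import json

# Standard CIFAR-10 label order.
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

def decode_response(json_text):
    # Each row in "predictions" holds ten class probabilities;
    # pick the index of the largest one and map it to a class name.
    rows = json.loads(json_text)["predictions"]
    return [CIFAR10_CLASSES[max(range(len(r)), key=r.__getitem__)] for r in rows]

# A fabricated response in which class 3 ("cat") scores highest:
fake = json.dumps({"predictions": [[0.01] * 3 + [0.9] + [0.01] * 6]})
print(decode_response(fake))  # ['cat']
```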

If you want to make predictions using the gRPC endpoint, please refer to the gRPC client example code. Thus, you now know the basics of model serving with TensorFlow Serving. For more detailed information, please refer to the official documentation.

Disadvantages of working with TensorFlow Serving

Unfortunately, despite TFS being an easy-to-use and convenient model deployment tool, it has disadvantages that you must take into account:

  • TFS will not work if you have built a model using a DL framework other than TensorFlow or a framework with TensorFlow backend;
  • TFS is an excellent tool for individuals but a poor option for a team as it does not allow you to collaborate with your teammates.
  • It does not have a convenient UI.
  • If you want a custom deployment solution, you will have to use either Flask or Django.

Common errors you might face when working with TensorFlow Serving

As you might have noticed, the general usage of TensorFlow Serving is relatively simple and can be picked up very smoothly. Still, you must be careful about certain things that may cause errors. To tell the truth, if you have ever worked with Docker or other deployment tools before, you might have come across these errors many times.

  • The name of the Docker container you are trying to create might be already in use. If you are not careful when working on Terminal, you might simply forget to shut the previous TFS container down. Please do not forget to terminate containers you do not plan to use anymore.
  • The port you are trying to open for TFS to listen to might also be used by some other program or container. Please manage your port usage carefully.
  • Also, you might face some “docker daemon” errors as well. Please either try to run the `docker` command using `sudo` or simply Google the error.
  • As mentioned above, saving the model using a timestamp is a great practice. Honestly, such an approach helps to avoid another TFS error you might come across. TFS requires your model to be saved in a sub-folder of the main folder specified as the `source` path when creating the endpoints. Using a timestamp as a sub-folder name will inform TFS about the model version and simplify the versioning. But if your model is not in a sub-folder, you will face a corresponding error.
  • Do not forget to use the same name for both the `source` and the container’s folder, as otherwise, you will face an error.
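For the port-collision error in particular, a quick stdlib check can tell you whether a port (for example, TFS's default REST port 8501) is already taken before you start the container. This helper is just a convenience sketch, not part of TFS:

```python
import socket

def port_in_use(port, host="localhost"):
    # connect_ex returns 0 when something is already listening on the port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        return s.connect_ex((host, port)) == 0
```

If `port_in_use(8501)` returns `True`, either stop the container holding the port or publish a different one with `-p`.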

Best practices, tips, and strategies when working with TensorFlow Serving

Throughout this post, I have mentioned some valuable tips and techniques, so let’s summarize them into a list:

  • Use TFS for model deployment if you built your model using either TensorFlow or another DL framework with TensorFlow backend.
  • Use the Docker container to work with TensorFlow Serving as it is the easiest and the most convenient way.
  • Use timestamps to simplify the model versioning.
  • Use Python to send requests to the model.
  • Use a gRPC endpoint when working with a large amount of data.
  • Be careful about the paths specified at the server’s startup. You are likely to face many errors if you make a mistake there.

How to simplify your deployment experience with Layer

Despite TensorFlow Serving being an easy-to-use tool, it is always better to simplify your working experience even more. For example, Layer is a Declarative MLOps platform that helps data teams produce machine learning applications based on code. With Layer, your model can be deployed in a few clicks directly from a convenient UI. Let's check how it works.

Check out Layer’s Quickstart Guide to learn more about getting started with your first Layer project.

When your model is trained and ready to be deployed, click on it in the Layer Model Catalog and then press the + Deploy button in the upper right corner of the page.


After you have confirmed the deployment in the popup menu, the process is handled automatically by Layer. The button you have just clicked turns green, signaling that your model is deployed and can be accessed via a REST API endpoint. To get the deployment URL, click the Copy button 🔗 next to the API Deployed phrase.


For more information, for example, a shell command to make a real-time prediction with a deployed model, please refer to Layer’s official documentation.

Final Thoughts

Hopefully, this tutorial will help you succeed and use TensorFlow Serving as a Model Serving tool in your next Machine Learning project.

To summarize, we started with some theoretical information about Model Serving, TensorFlow Serving, and TFS architecture. We went through a step-by-step guide on how to install and use TensorFlow Serving for model deployment. Also, we mentioned the tool’s disadvantages and some common errors you might come across when working with it. Lastly, we talked about some tips you may find helpful when working with TensorFlow Serving.

If you enjoyed this post, a great next step would be to build your own Machine Learning project with all the relevant tools.

Thanks for reading, and happy training!

