A Guide to TensorFlow Callbacks

Oct 13th 2021

Darshan Deshpande

ML Practitioner

Frameworks like TensorFlow provide tremendous flexibility for custom training operations, efficient pipelining, and layer sub-classing. In addition, easy debugging and options like TFX for scaling and deployment make TensorFlow a preferred choice for many machine learning practitioners. These advantages are accompanied by Keras, a multifaceted high-level API that provides flexibility for almost everything, from simple model creation to 16-bit floating-point mixed-precision training.

Callbacks are among the most prominent features of the Keras API. They are an important tool for monitoring the training process, whether that means managing checkpoints or documenting your experiments.

This blog aims to give readers a sense of the relevance and variety of Keras callbacks. We discuss them in depth in the following sections.

Need for callbacks

The Callback API allows users to hook into their experiments at various points of the training process. In other words, a callback is a piece of code that runs:

  • during training or prediction,
  • before or after every epoch or batch.

A great example is the model checkpoint callback, which saves the weights at regular intervals so that training can be resumed later, or stores the best weight configuration seen so far, before the weights start to diverge.

Callbacks also assist in logging and tracking training runs through handy interfaces such as TensorBoard, or through raw CSV file logging with the existing `CSVLogger` callback.

Active research on mid-training learning rate alterations has popularized learning rate schedules such as the Cyclical Learning Rate. With custom callbacks, researchers can implement their own schedulers that map learning rates according to specific functions. Alternatively, practitioners can use the pre-existing `LearningRateScheduler` and `ReduceLROnPlateau` callbacks to tweak their learning rates as they desire.

Tracking the CPU, GPU, or TPU resources spent during training is critical to keeping costs down, given the growing compute needs of state-of-the-art models. TensorFlow's TensorBoard comes in handy in such scenarios: its TF Profiler uses the data logged by Keras callbacks to track the usage of compute resources and plot informative graphs, which help with pipeline optimization specific to the operating system and its configuration.

Callbacks thus serve to identify bottlenecks and aid in the allocation of workloads.

Let us now have a look at all the callbacks that Keras offers.

About Keras callbacks

Keras callbacks are short, easy-to-integrate pieces of code that run at set intervals. These callbacks can perform Pythonic tasks or Keras-backend-based metric and tensor manipulation, allowing better control over the training process. With TensorFlow's Keras API, you can either use the pre-made callback classes or define your own. In the following subsections we will look at the callbacks offered by Keras, their benefits, and their implementation in detail.

Predefined Callbacks

Keras offers a variety of inbuilt callbacks that can be plugged into its Model or Sequential API through the `fit()` function. It is recommended to use these callbacks in your training process: they can be easily tweaked by passing different arguments, and they are highly optimized for the multi-GPU and TPU configurations that require different training strategies in TensorFlow. Let us have a look at these predefined callbacks:

CSVLogger

A `CSVLogger` is useful for saving metric information to a specific file. The logger only needs a file name to be passed during instantiation. If you wish to save the logs as a tab-separated values (TSV) file, pass `separator="\t"` instead of the standard ','. A simple example of how to use this:

csv_logger = tf.keras.callbacks.CSVLogger(filename, separator=",", append=False)
model.fit(x_train, y_train, epochs=20, callbacks=[csv_logger,])

EarlyStopping

It is relatively common for models to overfit while experimenting with various configurations. A good practice is to stop training as soon as the loss starts to diverge. However, stopping training by killing the process midway can cause further problems, such as leaving allocated memory unusable for future training runs. An excellent way to overcome this is the `EarlyStopping` callback.

(Image: early stopping illustration. Credits: Robert Martin Short)

The callback accepts the following arguments:

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0,
    patience=0,
    verbose=0,
    mode="auto",
    baseline=None,
    restore_best_weights=False,
)
model.fit(x_train, y_train, epochs=20, callbacks=[early_stop,])

The `monitor` parameter decides the quantity that is to be observed during training. This is the metric that determines whether or not to halt the training. This quantity is set to validation loss by default but can be changed to any validation/training loss or accuracy.

In addition, `min_delta` is an important parameter. It defines the minimum change in the monitored quantity that counts as an improvement. Its default value of 0 means that any epoch without a strict improvement counts toward stopping, even if the monitored loss worsens only slightly. If training is expected to have an unstable descent, it is recommended to set this parameter to a higher value, such as 0.001 or 0.01.

The `patience` parameter decides the number of epochs to wait for an improvement before stopping the training. Therefore, if unstable training is expected, you should consider setting `patience` to a value higher than the default of 0.

Depending on your monitored quantity, you might want to maximize or minimize the value of your metric. The `mode` parameter can be set to `max`, `min` or `auto`, denoting maximization, minimization and automatic determination of the metric trend. This callback also offers the ability to track improvements by comparing the monitored metric with the baseline value when the `baseline` parameter is passed.

A noteworthy parameter for `EarlyStopping` is `restore_best_weights`. If this is set to True, the callback restores the best weights obtained throughout the training before training is stopped. This is extremely useful when the model needs to be evaluated further, but it is computationally heavier, since the callback keeps a copy of the best weights in memory and refreshes it whenever the monitored metric improves.
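To tie these parameters together, here is a hedged sketch of one possible configuration (the metric and values are illustrative, not a recommendation):

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",      # maximize validation accuracy
    mode="max",
    min_delta=0.001,             # ignore improvements smaller than 0.001
    patience=3,                  # allow 3 epochs without improvement
    restore_best_weights=True,   # roll back to the best weights seen
)
model.fit(x_train, y_train, epochs=20, callbacks=[early_stop,])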

ModelCheckpoint

There are cases in the training cycle where completely training a model in a single go is not feasible, either due to time constraints or limited access to compute resources. As a remedy, it is better to save the model every specific number of epochs; the checkpoint can then be loaded again and trained further at the practitioner's convenience.

The `ModelCheckpoint` callback tracks the desired training metric to decide whether the model should be saved. Instantiating a `ModelCheckpoint` only requires a `filepath`: the location at which checkpoints will be saved. This `filepath` can contain `str.format`-style placeholders, such as `weights.{epoch:02d}-{val_loss:.2f}.h5`, to include the epoch or the monitored metric in the file name, making it convenient to identify and load the desired checkpoint weights.

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath,
    monitor="val_loss",
    verbose=0,
    save_best_only=False,
    save_weights_only=False,
    mode="auto",
    save_freq="epoch",
    options=None,
)
model.fit(x_train, y_train, epochs=20, callbacks=[checkpoint,])

There are two other important parameters:

  • `save_weights_only`, which saves only the weights of the model and discards the model architecture and optimizer state, and
  • `save_best_only`, a boolean parameter that decides whether to save only those weights that show an improvement over the best result seen so far.

Optionally, the `save_freq` parameter can be set: with the default value `"epoch"`, a checkpoint is saved after every epoch, while an integer value saves a checkpoint after that many batches.
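Putting these options together, a minimal sketch (the filename pattern and values here are illustrative):

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "weights.{epoch:02d}-{val_loss:.2f}.h5",  # metric embedded in the filename
    monitor="val_loss",
    save_best_only=True,     # keep only checkpoints that improve on the best so far
    save_weights_only=True,  # discard architecture and optimizer state
)
model.fit(x_train, y_train, epochs=20, callbacks=[checkpoint,])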

LearningRateScheduler

Keras supports an inbuilt learning rate scheduler that adjusts the learning rate at every epoch according to a custom function. This is extremely helpful for:

  • tweaking the learning rate in the middle of training, or
  • experimenting with different learning rates at different epochs.

scheduler = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=0)
model.fit(x_train, y_train, epochs=20, callbacks=[scheduler,])

It should be noted that the custom scheduler function takes two inputs, a zero-indexed epoch number and the current learning rate, and returns the learning rate to be used for that epoch.

This function can support Pythonic implementations, but it is recommended that the operations inside the scheduler be TensorFlow ops for the sake of serializability.
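For instance, a minimal sketch of a schedule that holds the initial rate for ten epochs and then decays it exponentially (the cut-off and decay constant are arbitrary choices):

import tensorflow as tf

def schedule(epoch, lr):
    # Hold the initial learning rate for the first 10 epochs,
    # then decay it exponentially using a TensorFlow op
    if epoch < 10:
        return lr
    return lr * tf.math.exp(-0.1)

scheduler = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=1)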

The optional `verbose` parameter decides whether the updated learning rate is printed to stdout at the beginning of each epoch.

TerminateOnNaN

While experimenting with custom implementations or configurations, choices like sigmoid or linear activations can sometimes lead to exploding gradients. When training many models at a time, a gradient explosion in some of them wastes compute resources and time if training continues. The `TerminateOnNaN` callback automatically checks for a NaN loss during training and stops the training as soon as one is found.

terminate_on_nan = tf.keras.callbacks.TerminateOnNaN() 
model.fit(x_train, y_train, epochs=20, callbacks=[terminate_on_nan,])

This callback requires no inputs.

ReduceLROnPlateau

A major reason for the failure of machine learning models is overfitting. A general trend seen before the loss starts to increase is that the loss value stays roughly constant for a few epochs. This indicates that the model is having trouble finding the next minimum on the loss landscape, or is stuck in a local minimum.

An appropriate way to treat this is to reduce the learning rate slightly so that the loss does not jump around too much and lead to overfitting. The `ReduceLROnPlateau` callback offered by the Callbacks API handles this task.

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.1,
    patience=10,
    verbose=0,
    mode="auto",
    min_delta=0.0001,
    cooldown=0,
    min_lr=0,
)
model.fit(x_train, y_train, epochs=20, callbacks=[reduce_lr,])

Similar to the `EarlyStopping` Callback, `ReduceLROnPlateau` also needs a `monitor` parameter which decides what metric is to be monitored and a `patience` parameter which defines the number of epochs to wait before making changes to the learning rate.

A unique parameter to the `ReduceLROnPlateau` is `factor`. It defines the factor by which the learning rate is to be reduced.

Calculation of the new learning rate:

new_lr = old_lr * factor

Two other important arguments that are optional but worth noting are `cooldown` and `min_lr`. Reducing the learning rate too frequently can make the weight updates vanishingly small, since each backpropagation update is proportional to the learning rate:

w ← w − lr · ∂L/∂w

The `cooldown` argument mitigates this by waiting for `cooldown` epochs after each reduction before resuming tracking of the monitored metric, letting the network move around the loss landscape for a while before the learning rate is decreased again.

The `min_lr` parameter defines a minimum bound to the learning rate value.

Tensorboard

Sometimes a visual representation of the training process, such as loss plots, embedding visualizations, and gradient tracking, can provide insight when evaluating models. TensorBoard is an excellent tool for monitoring the training process diligently.

(Image: TensorBoard visualization of MNIST labels. Credits: TensorFlow)

TensorBoard allows users to display text, images, and audio data, as well as the ops and layers involved in creating and training a model. In addition, TensorBoard has a special "Fairness Indicators" section that can quickly analyze common metrics like the F1 score and AUC to provide helpful insight into the performance and biases of the model.

Another amazing feature of TensorBoard is the What-If Tool (WIT), which provides immediate visualizations of results from custom inference examples. These examples can be edited manually or through code and re-run to see the changes instantly on the interface.

The What-If Tool only has two requirements:

  • first, the model should be served via TensorFlow Serving, and
  • second, the dataset used for inference should be in the TFRecord format.

Such exploratory analysis of a black-box ML model is a beneficial feature for all practitioners in the field of AI.

Although the TensorBoard interface itself requires a localhost port to run, or the magic command `%load_ext tensorboard` inside Jupyter notebooks, Keras connects to the TensorBoard interface through a simple `TensorBoard` callback, which can be instantiated and used as follows:

tensorboard = tf.keras.callbacks.TensorBoard(
    log_dir="logs",
    histogram_freq=0,
    write_graph=True,
    write_images=False,
    write_steps_per_second=False,
    update_freq="epoch",
    profile_batch=2,
    embeddings_freq=0,
    embeddings_metadata=None,
)
model.fit(x_train, y_train, epochs=20, callbacks=[tensorboard])

The `log_dir` argument decides where the log files are stored. These logging files will contain metrics, gradient information, model architecture, and embedding vector information.

`histogram_freq` decides the frequency at which activations and weight histograms are to be calculated. If set to 0, these won’t be calculated.

`write_graph` decides whether the model graph is to be saved in the log file. This increases the log size but provides a visualization of the model graph.

`write_images` decides whether to write the model's weights to the log file so that they can be visualized as images, and `write_steps_per_second` decides whether to log the number of training steps executed per second; this supports both batch- and epoch-level logging.

`embeddings_freq` decides the frequency (in epochs) at which embedding layers are visualized, and `embeddings_metadata` accepts a dictionary that maps embedding layer names to the files in which their metadata is saved.
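With the logs written to `log_dir`, the dashboard itself can be launched by pointing the TensorBoard CLI at the same directory:

tensorboard --logdir logs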

ProgbarLogger

This is a simple logger which prints metrics to the output window.

progbar = tf.keras.callbacks.ProgbarLogger(count_mode="samples", stateful_metrics=None) 
model.fit(x_train, y_train, epochs=20, callbacks=[progbar,])

This callback only has two arguments:

1. `count_mode`: decides whether the progress bar should count samples or steps.

2. `stateful_metrics`: a list or tuple containing the names of metrics that should not be averaged at the end of each epoch. If not provided, the model's metrics are used.

RemoteMonitor

The `RemoteMonitor` callback enables streaming logs and metrics to a remote server. It uses the `requests` library to send the data either as a JSON-encoded dictionary or inside a serialized form field.

remote_callback = tf.keras.callbacks.RemoteMonitor(
    root="http://localhost:9000",
    path="/publish/epoch/end/",
    field="data",
    headers=None,
    send_as_json=False,
)
model.fit(x_train, y_train, epochs=20, callbacks=[remote_callback,])

The events are streamed to root + ‘/publish/epoch/end/’ by default, but this can be changed by setting the `path` parameter. The root parameter determines the host and port number of the server.

`field` defines the name of the form field under which the data will be sent, and `send_as_json` decides whether the request is sent as a JSON-encoded dictionary or as a serialized form. The optional `headers` field can take a dictionary of HTTP headers.
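To make the data flow concrete, here is a minimal sketch of a server that could receive these events, assuming Flask is installed; the route matches the callback's default `path`, and the form field name matches the default `field` of "data":

from flask import Flask, request

app = Flask(__name__)

@app.route("/publish/epoch/end/", methods=["POST"])
def receive_logs():
    # With send_as_json=False, the metrics arrive as a JSON string
    # inside the form field named by the callback's `field` argument
    print(request.form["data"])
    return "ok"

app.run(port=9000)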

LambdaCallback

Creating custom callbacks can turn out to be a tedious task if one is unaware of the workings of Keras. To simplify the process, Keras has a LambdaCallback which helps create a custom callback using custom functions.

lambda_callback = tf.keras.callbacks.LambdaCallback( 
on_epoch_begin=None, 
on_epoch_end=None, 
on_batch_begin=None, 
on_batch_end=None, 
on_train_begin=None, 
on_train_end=None,) 
model.fit(x_train, y_train, epochs=20, callbacks=[lambda_callback,])

The callback has a few easy-to-understand arguments:

  • `on_epoch_begin`: The function passed to this argument is executed at the beginning of every epoch. This custom function must take in two positional arguments: epoch and logs.
  • `on_epoch_end`: The function passed to this argument is executed at the end of every epoch. This custom function must take in two positional arguments: epoch and logs.
  • `on_batch_begin`: The function passed to this argument is executed at the beginning of every training batch. This custom function must take in two positional arguments: batch and logs.
  • `on_batch_end`: The function passed to this argument is executed at the end of every training batch. This custom function must take in two positional arguments: batch and logs.
  • `on_train_begin`: The function passed to this argument is executed at the beginning of training. This custom function must take one positional argument: logs.
  • `on_train_end`: The function passed to this argument is executed at the end of training. This custom function must take one positional argument: logs.
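As a quick illustration, a minimal sketch that prints the running loss after every training batch (the formatting is arbitrary):

print_loss = tf.keras.callbacks.LambdaCallback(
    on_batch_end=lambda batch, logs: print(f"Batch {batch}: loss = {logs['loss']:.4f}")
)
model.fit(x_train, y_train, epochs=20, callbacks=[print_loss,])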

History

The `History` callback gives users access to the logging `dict`, which is continually updated with metric data during training. This callback is applied automatically: `model.fit()` returns its `History` object once training ends, and the object's `history` attribute holds the logged values.

history = model.fit(x_train, y_train, epochs=10)
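The logged values can then be read from the returned object's `history` attribute, for example:

print(history.history.keys())   # all metrics logged during training
print(history.history["loss"])  # training loss, one value per epoch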

BackupAndRestore

This callback is meant to restore the model's state when training is interrupted by an unexpected error or by the process being killed. It creates a temporary checkpoint file after every epoch and restores the checkpointed weights when the same `model.fit()` call is run again.

Note: This callback is experimental as of TF 2.5 and only compatible with eager execution.

tf.keras.callbacks.experimental.BackupAndRestore(backup_dir)

This callback accepts a single parameter: the relative or absolute path to the backup directory. As a word of caution, the `BackupAndRestore` callback does not bring interrupted jobs back up by itself; this must be done manually by the user. A bonus is that it supports the Mirrored and MultiWorkerMirrored strategies out of the box, which is useful for multi-GPU or TPU configurations that require a special distributed training setup.
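For completeness, a minimal sketch of how it plugs into the usual training call (the `./backup` directory is a hypothetical choice):

backup = tf.keras.callbacks.experimental.BackupAndRestore(backup_dir="./backup")
model.fit(x_train, y_train, epochs=20, callbacks=[backup,])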

CallbackList

This callback is only useful when working with distributed systems where each callback must behave differently on different batches and splits. CallbackList has inbuilt support for Multi-Worker, Mirrored, and MultiWorkerMirrored strategies. A `CallbackList` can be defined as follows:

tf.keras.callbacks.CallbackList(callbacks=None, add_history=False, add_progbar=False, model=None)

  • `callbacks`: a list of callbacks to be used along with training.
  • `add_history`: a boolean argument defining whether to add the History callback automatically.
  • `add_progbar`: a boolean argument defining whether to add a progress bar with the help of a `ProgbarLogger`.
  • `model`: the Keras model with which these callbacks are to be used.
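As a rough sketch of how the container dispatches events in a custom training loop (assuming a compiled Keras `model` and a hypothetical `num_epochs`):

callbacks = tf.keras.callbacks.CallbackList(
    callbacks=[tf.keras.callbacks.TerminateOnNaN()],
    add_history=True,
    model=model,
)

callbacks.on_train_begin()
for epoch in range(num_epochs):
    callbacks.on_epoch_begin(epoch)
    # ... run the training steps for this epoch ...
    callbacks.on_epoch_end(epoch, logs={"loss": 0.0})  # placeholder logs
callbacks.on_train_end()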

Custom callbacks

As helpful as they might be, Lambda callbacks are not very flexible: they cannot be configured differently for testing and prediction, and they cannot define custom class variables, which might be helpful under some circumstances. This section is a walkthrough of how to configure and use custom callbacks.

Let us see the additional functionality that custom callbacks offer over the LambdaCallback:

1. `on_test_begin(self, logs=None)`: This defines the tasks to perform during the beginning of the testing phase.

2. `on_test_end(self, logs=None)`: This defines the tasks to perform after the end of the testing phase.

3. `on_predict_begin(self, logs=None)`: This function defines the tasks to perform before the prediction phase begins.

4. `on_predict_end(self, logs=None)`: This function defines the tasks to perform after the prediction phase ends.

5. `on_train_batch_begin(self, batch, logs=None)`: This defines the tasks to perform when the training batch begins. It is similar to `on_batch_begin` in LambdaCallback.

6. `on_train_batch_end(self, batch, logs=None)`: This defines the tasks to perform when the training batch ends. It is similar to `on_batch_end` in LambdaCallback.

7. `on_test_batch_begin(self, batch, logs=None)`: This defines the tasks to perform when the test batch begins.

8. `on_test_batch_end(self, batch, logs=None)`: This defines the tasks to perform when the test batch ends.

9. `on_predict_batch_begin(self, batch, logs=None)`: This defines the tasks to perform when the prediction batch begins

10. `on_predict_batch_end(self, batch, logs=None)`: This defines the tasks to perform when the prediction batch ends.

To demonstrate how a custom callback can be implemented, we will create a callback that scales the learning rate of a model by a certain factor every five epochs.

from tensorflow import keras
from tensorflow.keras import backend
from tensorflow.keras.callbacks import EarlyStopping

class LRChanger(keras.callbacks.Callback):
    def __init__(self, scale=0.01):
        super(LRChanger, self).__init__()
        self.scale = scale

    def on_epoch_begin(self, epoch, logs=None):
        if epoch % 5 == 0:
            # Read the current learning rate from the model's optimizer
            lr = float(backend.get_value(self.model.optimizer.lr))
            lr = lr * self.scale
            print(f"Learning rate scaled to {lr}")
            backend.set_value(self.model.optimizer.lr, lr)

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Ensure that the updated learning rate is logged after the epoch ends
        logs['lr'] = backend.get_value(self.model.optimizer.lr)

model.fit(x_train, y_train, epochs=20, callbacks=[LRChanger(), EarlyStopping()])

Every custom callback in Keras must inherit from either the base callback class, i.e. `tf.keras.callbacks.Callback`, or any one of the pre-existing callbacks. In the above implementation, we add a scaling parameter as an argument, which is used later to scale the learning rate at every set interval.

Since we wish to update the learning rate every five epochs, we override the `on_epoch_begin` and `on_epoch_end` functions of the `Callback` class. The `on_epoch_begin` function scales the learning rate after checking whether the epoch number is a multiple of 5. This is done by reading the current value from the model (accessible through `self.model` on the callback object), adjusting it, and then setting the new value back on the optimizer.

The `on_epoch_end` function ensures that the updated value is written to the logging `dict` so that other callbacks relying on it are not affected.

After this is done, we simply compile and fit the model. The `fit()` call accepts a list of callbacks; we pass two callbacks here to demonstrate how multiple callbacks can be used at the same time. This also ensures that the logging done in `on_epoch_end` is accessible to the other callbacks.

Keras callbacks with Layer

The Layer API offers complete support for `tf.keras`-based training tasks. With its user-friendly interface, you can effectively track not only your training experiments but also your updates and deployments. The following subsections provide a brief walkthrough of how you can create your own Layer project and use TensorFlow callbacks with it to get the best of both worlds.

Connecting your data

Connecting your data from a pre-existing BigQuery or Snowflake instance is as simple as following the steps on the add-integration tab in Settings. The first step is to select the type of integration (BigQuery or Snowflake); once that is done, you will need to configure the connection.

This process will involve providing details about the database like URL, Schema, Warehouse, Role, User, Password, etc. After this comes the final step, where you will have to provide a name for the integration. This name will be used to reference the dataset once it is created. An optional description can also be added to the integration.

For the Layer demonstration of callbacks, we will use the `catsdogs` dataset, which is available on Layer by default. This dataset has training and testing sets, both stored as base64-encoded bytes.

Create your first Layer project

A new Layer project can be created in just a few simple steps. To get started, the `layer-sdk` has to be installed:

pip install layer-sdk

Once the pip package installation is complete, running `layer --help` can confirm the installation.

Before you can create your Layer project, you need to log in to your Layer account to enable access to the Layer Model and Data Catalog interface.

You can log in using two methods:

  • Using the pop-up browser
  • Using headless mode (run `layer login` in the terminal)

Once you have entered your credentials, you can move ahead and create your first layer project. This can be done by cloning an empty Layer project:

layer clone https://github.com/layerml/empty

Defining your datasets

First, let us have a look at the `dataset/dataset.yml` file:

This file defines the `name` of your dataset along with its `type`. By default, this type is set to `source`, meaning that the dataset is from an external source. Finally, we define `materialization`, which defines the data source integration for this project.

For our Cats and Dogs example, we will use the following configuration.

apiVersion: 1

name: "catsdogs"
type: source

materialization:
    target: layer-public-datasets
    table_name: "catsdogs"

Develop features and model training code

Now that we have defined the dataset that we will be using, next we will have a look at the features/dataset.yaml file:

This file structure is very similar to the dataset.yml discussed above, but the `type` is `featureset`. Additionally, we have to define the features in the `featureset`, which generally includes the name, description, source, and environment required for the features.

We then define the schema of the features where we necessarily have to define a primary key attribute that informs the layer API about the primary key used to join all of the features together.

For our Cats and Dogs example, we will define the target label as a feature with the `name`: category. We then define the source of this feature which we will define in the category folder next. Apart from this, we will declare the primary key as `id`, which will help us join the features in the dataset together.

Next up, we create a subdirectory with the name `category`. This subdirectory will have two files: category.py and a separate requirements.txt. The category.py file builds the features for the dataset.

We need to implement a reserved function called `build_feature` which will be called and executed by Layer to build our feature. This function takes as input a Dataset object and returns the feature data.

def build_feature(sdf: Dataset("catsdogs")) -> Any:
    df = sdf.to_pandas()
    df = df[df['path'] != 'single_prediction']
    filenames = list(df['path'])

    categories = []
    for filename in filenames:
        category = filename.split('/')[1]
        if category == 'dogs':
            categories.append(1)
        else:
            categories.append(0)
    df['category'] = np.array(categories)

    feature_data = df[["category", "id"]]
    return feature_data

We first parse the `sdf` Dataset to a Pandas DataFrame. Subsequently, we load the paths of the dataset images and check if the image belongs to a cat or a dog. If it is a dog, we append 1; otherwise, 0. This `categories` array is then attached to the pandas DataFrame, and the restructured DataFrame is returned.

Before we can start defining our callbacks, we need to create a `model.py` file in the model directory. Once the file is created, we can start creating and using our dataset to train our models. The code for the creation and training of the model must go inside the `train_model` function, which Layer executes automatically when training the model. As the final code below shows, `train_model` accepts the `train` object (type: `layer.Train`) together with the `Dataset` and `Featureset` it consumes.

Let us first import all required libraries.

from typing import Any

from layer import Featureset, Train, Dataset

from PIL import Image
import io
import base64

import numpy as np
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, Flatten, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator

We begin by importing Featureset, Train, and Dataset from the Layer API. After that, we will use PIL, TensorFlow, and NumPy for most of our tasks involving defining models and callbacks.

Apart from this, the `catsdogs` dataset images we will be using are base64 encoded, so the base64 library will help us decode them.

We will start by defining a helper function for loading the dataset images as NumPy arrays as below:

def load_process_images(content):
    image_decoded = base64.b64decode(content)
    image = Image.open(io.BytesIO(image_decoded)).resize([224, 224])
    image = img_to_array(image)
    return image

Here, we first convert the base64-encoded image to bytes, which are read into a PIL Image instance and resized. This PIL Image is then parsed as a NumPy array using `img_to_array` and returned.

Defining Callbacks on a Layer project

Now let us have a look at our final code for the creation and training of the model:

def train_model(train: Train, ds: Dataset("catsdogs"), pf: Featureset("cd_pet_features")) -> Any:
    df = ds.to_pandas().merge(pf.to_pandas(), on='id')

    training_set = df[(df['path'] == 'training_set/dogs') | (df['path'] == 'training_set/cats')]
    testing_set = df[(df['path'] == 'test_set/dogs') | (df['path'] == 'test_set/cats')]

    X_train = np.stack(training_set['content'].map(load_process_images))
    X_test = np.stack(testing_set['content'].map(load_process_images))

    train.register_input(X_train)
    train.register_output(df['category'])

    train_datagen = ImageDataGenerator(rescale=1. / 255,
                                       shear_range=0.2,
                                       zoom_range=0.2,
                                       horizontal_flip=True,
                                       width_shift_range=0.1,
                                       height_shift_range=0.1)
    train_datagen.fit(X_train)
    training_data = train_datagen.flow(X_train, training_set['category'], batch_size=32)

    validation_gen = ImageDataGenerator(rescale=1. / 255)
    testing_data = validation_gen.flow(X_test, testing_set['category'], batch_size=32)

    model = Sequential([
        Conv2D(filters=32, kernel_size=(3, 3), input_shape=(224, 224, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),

        Conv2D(filters=32, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(0.25),

        Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(0.25),

        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.25),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    callbacks = [
        EarlyStopping(patience=5),
        ReduceLROnPlateau(factor=0.01, patience=2, verbose=1)
    ]
    epochs = 50
    model.fit(
        training_data,
        epochs=epochs,
        validation_data=testing_data,
        callbacks=callbacks
    )

    test_loss, test_accuracy = model.evaluate(testing_data)
    train_loss, train_accuracy = model.evaluate(training_data)

    train.log_metric("Testing Accuracy", test_accuracy)
    train.log_metric("Testing Loss", test_loss)

    train.log_metric("Training Accuracy", train_accuracy)
    train.log_metric("Training Loss", train_loss)

    return model

Using the `load_process_images` function defined earlier, we map the pandas series of image contents to NumPy arrays. This dataset is later passed to the `ImageDataGenerator`, which helps with our augmentations. Augmenting images involves applying transformations like rotation and flipping to prevent the model from overfitting.

Coming to the model itself, we define a small five-layer convolutional model for our Cats and Dogs dataset. It is not the most optimal model for this dataset, but it will suffice, since our goal is to see callbacks at work in a Layer project.

We load our image dataset using Keras's `ImageDataGenerator` with extra augmentations, train the model with the `EarlyStopping` and `ReduceLROnPlateau` callbacks, and log the metrics to Layer using the `train.log_metric` function. We can visualize and monitor the training and metrics of this model in the Layer Model Catalog. Now that everything is in place, we will run the project and start the training process.

Pass your source code for features and models

At this point, we are done coding our preprocessing, data loading, model creation, and training. However, before we can run the code, we need to make sure our requirements are properly set. There must be two `requirements.txt` files in the project:

  • one in the `data/features/category` directory and
  • the second in `model/model`.

These requirements are necessary and will help set up the environment before the execution starts.

Once you are ready to execute your experiments, just switch to your terminal inside your main project directory and execute:

layer start

On running the project, Layer automatically defines the pipeline. The pipeline takes your dataset, featuresets, and model architecture into consideration and builds the entities in that specific order. The user does not have to provide specifics about the pipeline layout when using a declarative MLOps platform like Layer: the processing and pipeline generation are handled automatically, which gives you the freedom to focus on developing the best model for the task at hand.

Executing `layer start` will display the current version of Layer you are using and start processing the dataset according to the instructions provided in `category.py`. Once the dataset is processed, the desired featuresets are extracted from it and the training process begins. Training might take a while to complete depending on your model configuration. Until then, you can switch to the Layer interface in your browser and monitor the process.

After training completes, we will have two tabs, one each on the Model and Data Catalog pages, describing the training process, evaluation metrics, and scores for each metric.

Final thoughts

In summary, callbacks are pieces of code that execute a task or perform a specific function at regular intervals in the training, testing, or prediction process. This article discussed what callbacks are, their advantages, their types, and their usage. We explored code implementations of various callbacks and implemented a custom callback from scratch. We also covered in detail how to monitor various kinds of metrics using the inbuilt log dictionary.

Callbacks are an ideal way to handle and visualize model states in Keras. The ease of use and wide range of inbuilt callbacks, together with the flexibility of custom-defined callbacks, make them an all-rounder for progress tracking. Being able to implement custom callbacks for fine-grained access, without having to babysit model training, is a significant advantage for businesses and researchers alike. Callbacks save time and money while providing insights on configurations and convenient logging.

