Calculating the Keras F-Score: Practical Guide with Layer

Oct 13th 2021

Yi Li

Senior Data Scientist at Public Risk Innovation, Solutions, and Management (PRISM)

Keras is one of the most well-known deep neural network APIs executing on Google’s Tensorflow platform. It is designed to enable easy and flexible implementation of deep learning workflows and thus to improve the efficiency of the model building process.

Before jumping right into the technical part of this article, let’s remind ourselves of the concept of deep learning vs. machine learning. One may argue that these two are structurally different because deep learning resembles the biological network of neurons in the human brain, whereas machine learning does not. However, in practical terms, deep learning is a part of machine learning in the sense that they function in the same way when it comes to the modeling pipeline, i.e.,

With the analysis data/features,

  • Create a neural network model and specify the hyperparameters for training the model,
  • Train the model based on its performance metrics,
  • Select the best-tuned model to make predictions against new observations.

One of the core components of this pipeline lies in the model evaluation step, where choosing the appropriate performance measures and correctly implementing them becomes extremely crucial. Working in the machine/deep learning field, data scientists know that the optimal choice of evaluation metrics varies depending on the business problems one needs to tackle. For instance, accuracy and ROC (Receiver Operating Characteristic)/AUC (Area Under Curve) are often used for classifiers, whereas MSE/RMSE (Mean Squared Error/Root Mean Squared Error) and MAE (Mean Absolute Error) are preferred for regression models.

In this article, we will cover the performance metrics for classification models, focusing on the scenario of imbalanced datasets. Specifically, the article is organized as follows,

  • The concept of Loss-Metric Mismatch
  • Evaluation metrics for classifiers with imbalanced datasets and metrics in Keras 2.0
  • Neural network modeling with Layer
    • Prerequisite to get started
    • Layer project structure
  • Training a neural network model with a custom F-score metric vs. a Callback object in Keras

Loss-Metric Mismatch

An objective or loss function is utilized for training and optimizing a machine learning algorithm or a neural network. As the name suggests, an evaluation metric is used to measure and evaluate how well the algorithm performs. Intuitively, loss function and evaluation metric should match because they share the same ultimate goal — optimizing the algorithm(s).

However, for real-world problems, chances are that loss functions are far from being good approximations to evaluation metrics; this concept is referred to as loss-metric mismatch or loss-metric conflict. It creates a dilemma in the algorithm training process, where minimizing the loss function, e.g., cross-entropy, does not necessarily maximize the performance metric, e.g., accuracy. Certainly, this conflict is counterintuitive, and it adds an extra layer of complexity to model optimization.

Now, why can’t evaluation metrics, such as accuracy, double as loss functions? Well, loss functions must be differentiable and ideally also convex to ensure that the local minimum is global. Nonetheless, performance metrics for classification are non-continuous and non-differentiable; thus, they can not be optimized using gradient-based methodologies like stochastic gradient descent.
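To make this concrete, here is a minimal pure-Python sketch (with made-up data and a one-parameter threshold classifier) showing why accuracy gives gradient-based optimizers nothing to work with, while cross-entropy responds smoothly to small parameter changes:

```python
import math

def accuracy(w, xs, ys):
    """0/1 accuracy of a threshold classifier: predict 1 if w*x > 0."""
    preds = [1 if w * x > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def cross_entropy(w, xs, ys):
    """Binary cross-entropy of the sigmoid model p(x) = sigmoid(w*x)."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-w * x))
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(ys)

xs = [1.0, 2.0, -1.0, -2.0]
ys = [1, 1, 0, 0]

# A tiny change in w leaves accuracy flat (zero gradient almost everywhere)...
print(accuracy(1.0, xs, ys), accuracy(1.001, xs, ys))
# ...while cross-entropy changes smoothly, giving gradient descent a signal.
print(cross_entropy(1.0, xs, ys), cross_entropy(1.001, xs, ys))
```

Accuracy is a step function of the model parameters, so its gradient is zero almost everywhere; cross-entropy is smooth, which is exactly what stochastic gradient descent needs.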

Why can’t loss functions, such as cross-entropy, be the model evaluation metrics? In my opinion, this mainly boils down to the practicality of evaluation metrics; they usually are designed to be straightforward and easily interpretable, particularly to our non-technical business partners. On the other hand, loss functions, such as cross-entropy, are often not as intuitive and can be difficult to explain.

Evaluation metrics for classifiers with imbalanced datasets

As one of the most straightforward and well-defined metrics, classification accuracy is frequently used to evaluate and compare the performance of classifiers. By definition, classification accuracy equals the number of correct predictions divided by the total number of predictions.

Putting this in confusion-matrix terms gives:

  • Classification Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

It should be acknowledged that (classification) accuracy is a good performance metric for problems involving balanced datasets, for instance, a binary classifier where the class ratio is close to 50:50. Nevertheless, when it comes to problems with imbalanced class ratios, accuracy can provide “inaccurate” information. I will use the following example to show what I mean by “inaccurate information”.

Let’s say our task is to build a predictive model to detect fraud, where the output variable is discrete (i.e., fraud vs. non-fraud). For training data, we have a total of 1000 records, among which 10 are true fraud, whereas 990 are non-fraud (i.e., fraud:non-fraud ratio = 1:99). Then we build a model with this confusion matrix, where Accuracy = 990 / 1000 = 99%!

                   Predicted: non-Fraud   Predicted: Fraud
True: non-Fraud    990 (TN)               0 (FP)
True: Fraud        10 (FN)                0 (TP)

Achieving an accuracy score of 99% indicates an excellent model, doesn’t it? Well, that conclusion comes too soon.

How many fraudulent records does the model detect? None! And let’s remind ourselves of the goal of this model: to detect fraud. Hence, is the model useful? The answer is no. This model is useless because all it can identify is non-Fraud, which, unfortunately, is NOT the class of interest.

Knowing the pitfall of accuracy in judging imbalanced classifiers, one may ask what the preferred metric(s) are. Instead of accuracy, Precision, Recall (a.k.a. Sensitivity), and the F-score should be used:

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 score = 2 * (Precision * Recall) / (Precision + Recall)

As we can see, all three measures return a 0 in this example, which exposes the model as useless for detecting fraud.
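Plugging the confusion-matrix counts from the fraud example above into these formulas (a quick pure-Python check, reporting 0 when a ratio is undefined) makes the contrast explicit:

```python
# Counts from the fraud example above: TN=990, FP=0, FN=10, TP=0.
TP, TN, FP, FN = 0, 990, 0, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
# TP + FP = 0 here, so precision is undefined (0/0); report 0 instead
precision = TP / (TP + FP) if (TP + FP) else 0.0
recall = TP / (TP + FN) if (TP + FN) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, precision, recall, f1)  # -> 0.99 0.0 0.0 0.0
```

Accuracy looks stellar at 99%, yet precision, recall, and F1 all come out to 0 for the class we actually care about.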

Evaluation metrics in Keras 2.0

As we alluded to, Keras allows users to perform efficient experimentation with deep learning/neural network models. A partial reason for this efficiency is that it offers a list of performance metrics right off the bat for one to choose from. Specifying a metric that is available in Keras is simple; all one needs to do is define the metrics argument in the compile() method:

model.compile(loss='binary_crossentropy', metrics=[metrics.categorical_accuracy])

In the Keras 2.0 release notes, a callout is that several legacy metric functions, such as precision, recall, and fbeta_score/f-score, have been removed. Without this default implementation, practitioners have no choice but to create their own metrics.

To address this issue, the rest of this article will be dedicated to implementing the F1-score in Keras and focusing on the correct and incorrect ways of doing so.

Neural network modeling with Layer

Layer is a Declarative MLOps (DM) platform that helps data teams create and deploy machine learning/deep learning applications based on code. Users pass the Dataset/Feature set, together with machine learning model definitions, and Layer builds the entities seamlessly.

In a nutshell, Layer wraps machine learning components into two catalogs:

  • the Data Catalog containing Datasets and Feature sets, and
  • the Model Catalog containing the modeling code.

This architecture enables easy management of the machine learning life cycle and improves the reusability of entities across different projects within your organization.

With Layer, data science teams no longer need to worry about:

  • intimidating deployment jobs,
  • Training-Serving Skew,
  • how to use a container, or
  • a deployment language (e.g., Java) that differs from the modeling language (e.g., Python or R).

Layer takes all these burdens off data teams’ shoulders so that they can focus on the most impactful task: developing models.

Can’t wait to get your hands dirty with Layer? Different use cases are made available here. In the following sections of this article, I will showcase how we can utilize Layer to set up the F1 metric in Keras and how Layer makes it straightforward to deploy this project.

Getting started: Prerequisites

First and foremost, let’s install the Layer SDK in Python. Please note that the Layer SDK is compatible with Python versions 3.8+.

! pip install layer-sdk

Before running any Layer SDK commands, we need to login:

layer login

You will be directed to a page to sign in.

One tip: if anything strange happens, try upgrading the Layer package using,

! pip install --upgrade layer-sdk 

### OR if you don't have admin rights, use the following 
! pip install --upgrade --user layer-sdk

Layer project structure

Now, we are ready to build out our project! As mentioned above, each Layer project consists of a data folder and a model folder. Additionally, it has a README file and a folder named .layer, which contains a project YAML configuration file.

We will work with the financial fraud detection dataset in Layer’s fraud detection model example for this exercise. More details regarding this dataset will be discussed in the next section.

Let’s take a look at our current project structure, which is written in the README file,


As we can see,

  • in the data folder, our features are created using SQL queries. These queries will extract data from the Layer Demo Snowflake database;
  • in the model folder, we have our Python code for modeling.

Note that Layer treats any directory that includes a “model.yml” file as a machine learning model project.

With everything ready to go, let’s dive into the modeling part by importing all the packages,

## Import Layer 
from layer import Featureset, Train 
from typing import Any 
from collections import Counter 
import numpy as np 
from random import sample, seed 
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, StratifiedKFold 
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, average_precision_score 
from tensorflow.keras import backend as K 
from tensorflow.keras.models import Sequential, Model 
from tensorflow.keras.layers import Input, Dense, Embedding, Concatenate, Activation, Dropout, BatchNormalization 
from tensorflow.keras.callbacks import Callback, EarlyStopping, ModelCheckpoint

Experiment # 1: create a customized F1-score metric

In Keras, one can define a custom metric by creating a callable that takes the true and predicted labels and returns a score tensor, i.e., the my_f1_metric shown below. This function will then be passed to compile() as a metric, which you will see demonstrated in the following sections.

### Define the custom F1 score for the current exercise 
def my_f1_metric(y_true, y_pred):

    ### calculate the recall score 
    def recall(y_true, y_pred): 
        TP = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 
        Positives = K.sum(K.round(K.clip(y_true, 0, 1))) 

        recall = TP / (Positives + K.epsilon()) 
        return recall 

    ### calculate the precision score 
    def precision(y_true, y_pred): 
        TP = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 
        Pred_Positives = K.sum(K.round(K.clip(y_pred, 0, 1))) 

        precision = TP / (Pred_Positives + K.epsilon()) 
        return precision 

    precision, recall = precision(y_true, y_pred), recall(y_true, y_pred) 

    ### return the f1 score 
    return 2 * ((precision * recall) / (precision + recall + K.epsilon()))

One small note:

  • the K.epsilon() added to each denominator avoids division by 0, which could occur during training depending on the specific problem.
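To sanity-check the formula, here is a plain-Python analogue of my_f1_metric (using a fixed EPS constant in place of K.epsilon() and a 0.5 threshold in place of K.round), applied to some made-up labels and probabilities:

```python
EPS = 1e-7  # plays the role of K.epsilon()

def f1_from_probs(y_true, y_prob):
    """Pure-Python analogue of my_f1_metric: threshold probabilities, then F1."""
    y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
    tp = sum(t * p for t, p in zip(y_true, y_pred))       # true positives
    recall = tp / (sum(y_true) + EPS)                     # TP / actual positives
    precision = tp / (sum(y_pred) + EPS)                  # TP / predicted positives
    return 2 * precision * recall / (precision + recall + EPS)

y_true = [1, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.1, 0.7, 0.8]   # thresholds to [1, 0, 0, 1, 1]
print(round(f1_from_probs(y_true, y_prob), 3))  # -> 0.667
```

Here TP = 2, actual positives = 3, and predicted positives = 3, so precision = recall = 2/3 and F1 comes out at roughly 0.667, matching a hand calculation.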

Current dataset: Financial fraud detection

As I briefly touched upon, to show how this custom metric function works, we will use the financial fraud detection dataset as an example. The exploratory data analysis shows an extreme class imbalance with nonFraud (99.2%) and Fraud (0.8%).

For demonstration purposes, we will use all the input features in our neural network model and save 10% of the data as the hold-out testing set:

X = train_df.drop(["transactionId", "is_fraud"], axis=1) 
Y = train_df["is_fraud"] 

trainX, testX, trainY, testY = train_test_split(X, Y, test_size=0.1, random_state=13)

### transformation 
sc = StandardScaler() 
trainX = sc.fit_transform(trainX) 
testX = sc.transform(testX) 
trainY, testY = np.array(trainY), np.array(testY)

Building the multilayer neural network model

With the training dataset in the desired format, let’s create a deep learning model for this exercise. The following code builds a neural net containing three hidden layers with different dropout rates. Since the current task is a binary classification problem, the activation function for the output layer is the sigmoid function.

### Building a neural network model
def NeuralNetModel(x_tr):

    inp = Input(shape=(x_tr.shape[1],)) 

    x = Dense(1024, activation='relu')(inp) 
    x = Dropout(0.3)(x) 
    x = BatchNormalization()(x) 
    x = Dense(512, activation='relu')(x) 
    x = Dropout(0.4)(x) 
    x = Dense(128, activation='relu')(x) 
    x = Dropout(0.4)(x) 
    x = BatchNormalization()(x) 

    out = Dense(1, activation='sigmoid')(x) 
    cur_model = Model(inp, out) 

    return cur_model

Model training using the custom metric my_f1_metric

To train the model with the my_f1_metric defined above, we will set up a stratified k-fold cross validation. The performance metrics — F1 score, precision and recall scores — will be calculated,

f1_cvList, recall_cvList, precision_cvList = [], [], [] 

## Define the training parameters 
my_folds = 5 
my_epochs = 60 
my_batchSize = 16 

kfold = StratifiedKFold(my_folds, random_state=13, shuffle=True) 

## Stratified cross validation 
for fold, (tr_inds, val_inds) in enumerate(kfold.split(X=trainX, y=trainY)): 

    x_tr, y_tr = trainX[tr_inds], trainY[tr_inds] 
    x_val, y_val = trainX[val_inds], trainY[val_inds] 

    ### Model.compile with my_f1_metric defined above 
    model = NeuralNetModel(x_tr) 
    model.compile(loss='binary_crossentropy', optimizer="adam", metrics=[my_f1_metric, 'accuracy']) 

    history = model.fit(x_tr, y_tr, epochs=my_epochs, batch_size=my_batchSize) 

    #### Log the custom F1 metric, where train is from layer.Train 
    for value in history.history['my_f1_metric']: 
        train.log_metric('Custom F1 metric', value) 

    y_val_pred = model.predict(x_val) 
    y_val_pred_cat = (np.asarray(y_val_pred)).round() 

    ### Get performance metrics 
    f1 = f1_score(y_val, y_val_pred_cat) 
    f1_cvList.append(round(f1, 4)) 

    precision = precision_score(y_val, y_val_pred_cat) 
    precision_cvList.append(round(precision, 4)) 

    recall = recall_score(y_val, y_val_pred_cat) 
    recall_cvList.append(round(recall, 4)) 

##### Log performance measures after CV in Layer 
train.log_metric('Average f1 score across all CV sets', np.round(np.mean(f1_cvList), 4))

Please note here,

  • the goal of the current exercise is not to develop a model with top-notch performance but rather to demonstrate how the custom `my_f1_metric` works. Hence, there is no need to get hung up on the actual performance numbers;
  • the pre-defined function `my_f1_metric` is specified in the `model.compile` step;
  • the train object is an instance of Layer’s `Train` class; it represents the current model training run and is passed in by Layer when training starts;
  • we extract the f1 values from the training process and use `log_metric()` function to log and track the custom f1 values with Layer;
  • once the model training finishes, the mean value of all f1 scores across the cross-validation is calculated and tracked in our Layer project.

This will be the code in our model file. Now let’s change our directory to this project and run it from the terminal using:

layer start

Simple and easy, isn’t it? In no time, the message will show that Layer is building everything up,


After the training process is completed, we can go to our Layer project and click on the model version to view the result. Specifically to this exercise, we will see the Custom F1 metric score and the Average F1 score across CV since they are logged and tracked during the model training/validating process.


Looking at the performance chart in Layer, we notice something strange: the F1 scores computed as model training goes (e.g., 0.0701) are notably different from the F1 scores computed against each validation set (e.g., 0.284).


This discrepancy suggests something is wrong under the hood!

Specify Keras metrics with the Callback

If one digs further into this issue, one realizes that when using a custom metric callable/function, Keras performs so-called batch-wise computations. More specifically, the F1 score is calculated after each batch run. Suppose that we define five batches/iterations in one epoch. We would have five different F1 scores by the time each epoch ends. These five scores are then averaged to get a global approximation for that particular epoch.

However, this averaged approximation is more confusing than helpful, because what one aims to monitor is a macro/global training performance score when each epoch ends.

What does a ‘macro or global training score’ mean? It is the performance score calculated at the end of each epoch during training (i.e., epoch-wise computations). If we consider each batch a micro process or the smallest unit for the model weights updating, then each epoch, comprised of one or multiple batches, is considered a macro process; thus, the name “macro”. This macro score is not the same as the averaged batch-wise scores aforementioned because the divisor/denominator in each batch is different.
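A small numeric sketch (with hypothetical per-batch counts) shows why averaging batch-wise F1 scores differs from the epoch-wise F1 computed from the pooled counts:

```python
# Hypothetical epoch of a fraud classifier split into two batches:
#   batch 1: TP=1, FP=0,  FN=9
#   batch 2: TP=9, FP=18, FN=1

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

batchwise = (f1(1, 0, 9) + f1(9, 18, 1)) / 2   # average of per-batch F1 scores
epochwise = f1(1 + 9, 0 + 18, 9 + 1)           # F1 from pooled epoch counts

print(round(batchwise, 3), round(epochwise, 3))  # -> 0.334 0.417
```

The two numbers disagree because F1 is a ratio: averaging per-batch ratios with different denominators is not the same as computing one ratio from the pooled epoch-level counts.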

Nevertheless, the macro F1 score here shouldn’t be mixed up with the F1-macro in the scenario of multi-label/multi-class classification, i.e., the argument average='macro' for f1_score in the Python Sklearn package. In `sklearn.metrics`, the F1-macro refers to the unweighted average of F1 scores calculated for each label/class, whereas here, the macro F1 is an epoch-wise score in training a neural network model.
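For contrast, here is a small pure-Python illustration of the sklearn-style F1-macro on hypothetical multi-class labels, where per-class F1 scores are averaged without weighting:

```python
# sklearn-style F1-macro: one F1 per class, then an unweighted average.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]

def f1_for_class(c):
    """F1 for class c, treating c as the positive class (one-vs-rest)."""
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

macro_f1 = sum(f1_for_class(c) for c in {0, 1, 2}) / 3
print(round(macro_f1, 3))  # -> 0.822
```

The per-class scores here are 2/3, 0.8, and 1.0, so the macro average is about 0.822; this should match sklearn's f1_score(y_true, y_pred, average='macro') on the same labels.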

Back to the inappropriate batch-wise methodology: it is the reason metrics such as F-scores, recall, and precision were removed in the Keras 2.0 release. What, then, is the appropriate way of implementing a macro F1 metric?

We can turn to the Callback functionality:

### Defining the Layer Callback Metrics Object 
class LayerMetrics(Callback): 
    def __init__(self, train, validation, fold): 
        super(LayerMetrics, self).__init__() 
        self.train = train 
        self.validation = validation 
        self.curFold = fold 

    def on_train_begin(self, logs={}): 
        self.validation_f1List = [] 
        self.validation_precisionList = [] 
        self.validation_recallList = [] 

    ### At the end of each epoch 
    def on_epoch_end(self, epoch, logs={}): 
        validation_targ = self.validation[1] 
        validation_pred = (np.asarray(self.model.predict(self.validation[0]))).round() 

        validation_f1 = round(f1_score(validation_targ, validation_pred), 3) 
        validation_precision = round(precision_score(validation_targ, validation_pred), 3) 
        validation_recall = round(recall_score(validation_targ, validation_pred), 3) 

        self.validation_f1List.append(validation_f1) 
        self.validation_precisionList.append(validation_precision) 
        self.validation_recallList.append(validation_recall) 

        ### Send the performance metrics to Layer to track 
        self.train.log_metric('Epoch End F1-score', validation_f1)

With this code, we defined the LayerMetrics, a Callback object to compute and track the performance metrics upon each epoch end:

  • the `__init__` constructor initializes the attributes of this class, i.e., the training set, validation set, and fold;
  • the function `on_train_begin` defines the actions when training starts; here, we created three lists to store the performance metrics F1, precision, and recall, respectively;
  • the function `on_epoch_end` defines the actions after each epoch run; in our example, we ran the model against the validation set, calculated the metrics values, and appended these values to their corresponding lists created in `on_train_begin`. Using `train.log_metric`, we also sent the validation performance to Layer for tracking purposes.

Compiling and fitting the `NeuralNetModel`:

model = NeuralNetModel(x_tr) 
model.compile(loss='binary_crossentropy', optimizer="adam") 

model.fit(x_tr, y_tr, 
          epochs=my_epochs, 
          batch_size=my_batchSize, 
          callbacks=[LayerMetrics(train, validation=(x_val, y_val), fold=fold)]) # LayerMetrics Callbacks 
Re-running the training with cross-validation, Layer will automatically assign a new model version for tracking purposes.

Checking the Layer performance chart generated by the Callback approach as training goes, we can see that our LayerMetrics produces consistent F1 scores (averaging roughly 0.3-0.4) for training and validation, as shown below:


Final Remarks

This article discussed and demonstrated the inappropriate and appropriate approaches to implementing and monitoring F1 scores in neural network models. Additionally, we have walked through the steps to create, track, and deploy the project with Layer.

Readers are encouraged to apply similar procedures to the calculation of other performance metrics, such as recall or precision scores. One can also set up one’s own Layer projects to better understand how Layer works.

Hopefully, this article is informative and helpful, especially if you are in the process of experimenting with various evaluation metrics for imbalanced classifiers. The full script and the current Layer project demonstration can be found in my GitHub repo.
