Data validation guide with TensorFlow

Oct 13th 2021

Dhiraj Kumar

Dhiraj Kumar is a Data Scientist & Machine Learning Evangelist.

Every machine learning model is built on data, and the model’s usefulness and performance are determined by the data used to train, validate, and analyze the model. As you might expect, we can’t construct robust models without solid data. In this article, we first discuss why data validation is important, and then we show you how to use TensorFlow Data Validation (TFDV), a Python tool from the TensorFlow Extended (TFX) ecosystem. We demonstrate how to use the package in your data science projects, walk you through common use cases, and highlight some extremely valuable workflows.

What is Data Validation?

Data is the building block of every machine learning model, and a model’s performance depends on the data used to train and validate it. Machine learning models gather knowledge from massive amounts of data, but if the input data is not correct, the model’s output will suffer. Hence, it is crucial to validate the data before using it as input to the model.

The data validation process ensures that the data entering a system is correct and meets the desired quality standards. It also ensures the delivery of clean data to a machine learning model. The process is implemented by building several checks into a system or report to ensure the logical consistency of input data to match the desired requirements.

Data Validation Process Flow

Data validation components check new data as it is added to the training data. The data validation process can be summarized in five steps (a minimal sketch follows the list):

  • Calculate the statistics from the training data against a set of rules
  • Calculate the statistics of the ingested data that has to be validated
  • Compare the statistics of the validation data with the statistics from the training data
  • Store the validation results and take automated actions, such as removing a row or capping/flooring values
  • Send notifications and alerts for approval
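As a minimal sketch of this flow, using the TensorFlow Data Validation (TFDV) package introduced later in this article (the file names here are assumptions for illustration):

import tensorflow_data_validation as tfdv

# Step 1: compute baseline statistics and a schema (the "set of rules") from the training data
train_stats = tfdv.generate_statistics_from_csv(data_location='training.csv')
schema = tfdv.infer_schema(train_stats)

# Step 2: compute statistics of the newly ingested data
new_stats = tfdv.generate_statistics_from_csv(data_location='ingested.csv')

# Steps 3-5: compare against the baseline; the resulting anomalies drive the automated actions and alerts
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)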

Why do Data Validation?

Sometimes you have to collect data from different sources before feeding it to the model, so there’s a chance that data gets corrupted while being moved and merged. Our goal is to ensure that data collected from different sources and repositories is not corrupted and follows business rules.

Machine learning models learn from patterns generated from input data, indicating that data is the heart of machine learning workflows. The success of machine learning models, therefore, depends on the quality of the data. The machine learning pipeline has multiple steps, and each step determines whether the workflow can proceed to the next step or the entire workflow needs to be abandoned and restarted. Data validation is a critical checkpoint of the ML pipeline. It identifies changes in the data coming into the machine learning pipeline before it reaches the time-consuming preprocessing and training steps.


When we talk about validation, we refer to three different checks on the data:

  • Identify data anomalies.
  • Check whether the data schema has changed.
  • Check whether statistics of new datasets still align with statistics from previous training datasets.

Now, let’s discuss the reason why data validation is necessary.

Model Performance

The performance of a machine learning model does not depend only on developing a robust model; it also depends on the data provided to the model. If bad data is used, the model will never give the desired output.

Reduction of Cost

After a machine learning model is deployed to a production server, it is routinely retrained with new data. The ML model makes predictions on new input data, and that same data is then added to the training set when the model is retrained.

There’s a chance that the new incoming data will introduce errors into the serving data. Over time, the erroneous ingested data becomes part of the model training data and can start degrading the model’s accuracy. Because the new data in each iteration is usually a small fraction of the complete training data, the changes in model performance may be missed, and the errors or deviations keep accumulating.

This is why catching data errors at an early stage is good practice: it reduces the cost of those errors.


Various Data Validation Techniques

At present, there are several data validation techniques that allow you to check that the provided data is valid and complete as per the requirements. These techniques are responsible for carrying data successfully through any needed transformations without loss. In simple words, data validation is a part of data testing that checks whether the entered data is valid according to the provided business conditions.

Let us discuss the important data validation techniques.

Exploratory Data Analysis (EDA)

The initial step of any data science project is Exploratory Data Analysis (EDA): a process of evaluating the data and extracting its main characteristics, using a statistical approach to analyze the input and produce descriptive output with graphical summaries. EDA is one of the critical processes for analyzing input data to discover patterns, find anomalies, and test hypotheses, and it supports data analysts in understanding the data.

The primary goals of EDA are to:

  • Evaluate the data distribution
  • Handle missing values in the given dataset
  • Handle outliers
  • Identify and remove duplicate data
  • Encode categorical variables
  • Normalize and scale the data

Methods of Exploratory Data Analysis (EDA)

Generally, data can be displayed in three formats: univariate, bivariate, and multivariate, plotted graphically on one axis (x), two axes (x, y), and three axes (x, y, z), respectively. EDA is mostly performed using the methods below:

Univariate analysis provides summary statistics for each field in the raw dataset, summarizing one variable at a time. Examples: Probability Density Function (PDF), Cumulative Distribution Function (CDF), box plot, violin plot.

Bivariate analysis finds the relationship between two variables in the dataset, often between a feature and the target variable of interest. Examples: violin plot and box plot.

Multivariate analysis is a way to analyze the interactions between more than two fields in the dataset. Examples: pair plots or 3D scatter plots.

Dimensionality reduction helps identify the parameters that account for the most variation in the data, and it enables faster processing by reducing the volume of data.
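As a brief illustration of univariate and bivariate analysis with pandas (the file and column names are hypothetical, and plotting requires matplotlib):

import pandas as pd

df = pd.read_csv('customers.csv')  # hypothetical dataset

# Univariate: summary statistics and a box plot for a single column
print(df['amount'].describe())
df['amount'].plot(kind='box')

# Bivariate: relationship between two columns
df.plot(kind='scatter', x='age', y='amount')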

  • Schema Validation – The dataset’s schema description can be used to assess future datasets and check whether they are consistent with past training sets. When preparing datasets for training machine learning models, the generated schemas can be used in the next workflow step.
  • Skew Detection – Skew detection examines and detects any significant discrepancies between the statistics of two datasets. Here, skew is the L-infinity norm of the difference between the serving statistics of the two datasets.

Note that data validation may also be needed to check for data drift: a change in the distribution of data over time. For production ML models, this is the difference between real-time production data and a baseline dataset, most likely the training set, that reflects the task the model is designed to do.

Unit-test

In the software industry, developers write unit tests for their code. Similarly, you have the option to write unit tests for incoming data. The idea is to create a system where users can define constraints on data. The system should also allow users to convert these constraints to computable metrics and generate a report indicating which constraints succeeded and failed. The report should also contain the corresponding metric value that triggered a failure. The major steps of unit tests are explained below.

Declare constraints: Users define how their data should look by declaring checks on the input data, composing constraints on various columns. The list of constraints is shown below in Table 1.

Compute metrics: Based on the declared constraints, measurable metrics can be computed, as shown in Table 2. These metrics can be compared between the data in hand and the incremental data.

Analysis report: Anomalies in data can be predicted based on metrics that have been collected over time. The system can issue a warning message if a new metric is more than three standard deviations away from the previous mean. In the analysis report, the failed constraints include the values that made them fail. A minimal sketch of this pattern is shown after Table 1.

Table 1. The list of constraint types (image). Image from: https://lh4.googleusercontent.com
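Data-quality libraries such as Deequ and Great Expectations implement this pattern. As a minimal, library-free sketch in plain Python (all names are hypothetical), constraints are declared as functions, translated into computable metrics, and collected into a pass/fail report:

import pandas as pd

def is_complete(df, col):
    # Constraint: the column has no missing values
    return bool(df[col].notnull().all())

def is_unique(df, col):
    # Constraint: the column has no duplicate values
    return df[col].is_unique

def run_checks(df, constraints):
    # Compute each metric and report which constraints passed or failed
    return {name: check(df) for name, check in constraints.items()}

df = pd.read_csv('customers.csv')  # hypothetical dataset
print(run_checks(df, {
    'customer_id is complete': lambda d: is_complete(d, 'customer_id'),
    'customer_id is unique': lambda d: is_unique(d, 'customer_id'),
}))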

Data Validation with TensorFlow (TFDV)

TensorFlow Data Validation (TFDV) is a Python package for exploring and validating machine learning datasets. It’s an open-source library developed by Google to help ML developers validate their data, and it is designed to work well with TensorFlow and TensorFlow Extended (TFX). It has these major components, as shown in the architecture below:

  • the Data Analyzer,
  • the Data Validator,
  • the Data Visualizer, and
  • the Model Unit Tester.

TFDV architecture (image). Image from: https://d3i71xaburhd42.cloudfront.net

The Data Analyzer computes a predefined set of data statistics in a scalable fashion over large amounts of data.

The Data Validator checks the properties of data as specified through a Schema.

The Data Visualizer visualizes the statistics, schema, and anomalies. It generates a simple table-based view for the schema and the anomalies recorded.

The Model Unit Tester checks for errors in the training using synthetic data generated through the schema.

Supported platforms by TensorFlow Data Validation (TFDV)

TFDV is tested successfully on the following 64-bit operating systems:

  • macOS 10.14.6 (Mojave) and later
  • Ubuntu 16.04 and later
  • Windows 7 and later

Dependencies required by TensorFlow Data Validation (TFDV)

TensorFlow Data Validation requires Apache Beam to support distributed computation. Apache Beam is used to define and process data pipelines. Apache Beam runs in local mode by default, but it can also run in distributed mode using Google Cloud Dataflow.

The second dependency is Apache Arrow. TFDV uses Apache Arrow to represent data internally, allowing vectorized NumPy functions to be used.

Features of TFDV

TFDV has several features; a few of them are listed below:

  • Descriptive statistics can be computed easily through TFDV, giving a quick overview of the data in terms of the features that are present.
  • Computed statistics are visualized using a tool called Facets Overview.
  • The schema can be inspected with the help of the schema viewer.
  • Anomalies, such as missing features, can be identified.
  • The anomalies viewer shows which features the anomalies affect.

Let’s now look at the workflow of TFDV:

  • Generate statistics for the input data
  • Use statistics to generate a schema for each feature
  • Visualize and inspect the schema to identify the problems in data
  • Update the schema if required

Installing TFDV

Installing TFDV is not a big task. The standard installation process is through `pip`. Once you install the TensorFlow Extended package (TFX), TFDV will be installed as a dependency. If you would like to use TFDV as a standalone package, install it with the command below:

pip install tensorflow-data-validation

After the installation, you can integrate data validation into machine learning workflows or visually analyze the data in a Jupyter Notebook.

Let’s discuss a couple of use cases below.

Generating Statistics for Input Data

The first step in the data validation process is to generate some summary statistics for data. As an example, suppose you have a consumer complaints file in CSV format. Let’s look at how to compute the statistics for the CSV file.

The `generate_statistics_from_csv` function is used to generate the statistics for CSV data; for data stored in `TFRecord` files, `generate_statistics_from_tfrecord` is the equivalent. You can generate and visualize the statistics as shown below:

import tensorflow_data_validation as tfdv

# Compute summary statistics for the CSV file
stats = tfdv.generate_statistics_from_csv(data_location='customers.csv')

# Render an interactive overview of the statistics (in a Jupyter Notebook)
tfdv.visualize_statistics(stats)


Schema-based Validation

The schema definition of the dataset can be used to validate future datasets to determine whether they are in line with previous training sets. The schemas TFDV generates can be used in the following workflow step when preprocessing datasets for training machine learning models.

Once summary statistics are generated, the next step is to create a schema for the dataset. A data schema describes the representation of a dataset: it defines the features expected in the dataset, the type of each feature (float, integer, bytes, etc.), and the boundaries of the data.

TFDV can generate the schema from the generated statistics with a single function call, inferring a schema that is meant to reflect the stable characteristics of the data.
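For example, continuing with the `stats` object computed earlier, the schema can be inferred and displayed in a notebook with:

schema = tfdv.infer_schema(stats)
tfdv.display_schema(schema)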

Comparing Datasets

Sometimes comparing datasets is required, and TFDV has features to solve this problem. Suppose you have training and validation datasets and want to determine how representative the validation data is of the training data before training the machine learning model. TFDV can quickly answer these questions:

  • Does the validation data follow the training data schema?
  • Are any feature columns or a significant number of feature values missing?

The code below loads both datasets and computes their statistics. If we execute it in a Jupyter Notebook, we can compare the dataset statistics easily:

# Compute stats for training data
train_stats = tfdv.generate_statistics_from_csv(data_location='customers-training.csv')

# Compute stats for evaluation data
eval_stats = tfdv.generate_statistics_from_csv(data_location='customers-eval.csv')

# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats, lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')


Updating the Schema

Manually adjusting the schema based on domain knowledge of the input data is another use case for TFDV. So far, you have learned how to spot differences against the schema generated automatically from the dataset; you can also change the schema to reflect your own requirements, for example that a feature must be present in training samples. The schema must first be loaded from its serialized location.

schema = tfdv.load_schema_text(schema_location)
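Once loaded, the schema can be edited and written back. As a minimal sketch (the `payment_type` feature and the 90% presence requirement are assumptions for illustration):

# Require the hypothetical 'payment_type' feature in at least 90% of training examples
payment_type = tfdv.get_feature(schema, 'payment_type')
payment_type.presence.min_fraction = 0.9

# Persist the updated schema for the next pipeline run
tfdv.write_schema_text(schema, schema_location)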
Skew Detection

TFDV has a built-in `skew_comparator` used to inspect and find any significant differences between the statistics of two datasets. Under TFDV, skew is the L-infinity norm of the difference between the `serving_statistics` of two given datasets. TFDV covers three skew types: schema skew, feature skew, and distribution skew.

Schema Skew

Schema skew occurs when the training and serving data do not follow the same schema; the same format is expected for both. Any expected differences between the two (for example, the label feature being present only in the training data but not in serving) should be indicated through the schema’s environments field.

Feature Skew

Feature skew arises when the feature values that the model trains on differ from the feature values it sees at serving time. For example, this can happen when:

  • a data source that provides some feature values is changed between training and serving time.
  • the logic that generates features differs between training and serving; for instance, a transformation is applied in only one of the two code paths.

Distribution Skew

Distribution skew arises when the distribution of the training dataset differs significantly from the distribution of the serving dataset. Using different code or data sources to build the training dataset is one of the main causes, as is a defective sampling mechanism that selects a non-representative subset of the serving data to train on.

If the difference between the two datasets surpasses the L-infinity norm threshold for a given feature, TFDV uses the anomaly detection described above to flag it immediately as an anomaly. This means that a threshold value is always applied when reporting an anomaly.
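As a sketch (assuming a `payment_type` feature and precomputed statistics for the training and serving datasets), the skew comparator and its threshold are set on the schema, and `validate_statistics` then reports any skew as an anomaly:

# train_stats and serving_stats are assumed to be precomputed dataset statistics
tfdv.get_feature(schema, 'payment_type').skew_comparator.infinity_norm.threshold = 0.01
skew_anomalies = tfdv.validate_statistics(statistics=train_stats, schema=schema, serving_statistics=serving_stats)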

Drift Detection

TFDV has another valuable component called the `drift_comparator`. It compares the statistics of two datasets of the same type, for example, two training sets collected on two different days or months. If any drift is detected, the data analyst should review the model architecture or decide whether feature engineering needs to be performed again.

The code below defines the `drift_comparator` for the features you would like to watch and compare. Next, you can call the `validate_statistics` function with the two dataset statistics as arguments, one for yesterday’s dataset and one for today’s dataset.

# Statistics for yesterday's training data
train_day1_stats = tfdv.generate_statistics_from_csv(data_location='customers-training.csv')

# Statistics for today's training data (hypothetical file name)
train_day2_stats = tfdv.generate_statistics_from_csv(data_location='customers-training-day2.csv')

tfdv.get_feature(schema, 'payment_type').drift_comparator.infinity_norm.threshold = 0.01

drift_anomalies = tfdv.validate_statistics(statistics=train_day2_stats, schema=schema, previous_statistics=train_day1_stats)

Slicing data in TensorFlow Data Validation

TensorFlow Data Validation has a feature for slicing datasets on chosen features to show whether they are biased. This is similar to calculating ML model performance on slices of the data. For example, missing data is a subtle source of data bias: if data is not missing at random, it may be missing more frequently for one group of people within the dataset than for others, which means that when the final model is trained, its performance will be worse for these groups.

import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import slicing_util

# Slice on the country feature
slice_fn1 = slicing_util.get_feature_value_slicer(features={'country': None})

# Slice on the cross of the country and state features
slice_fn2 = slicing_util.get_feature_value_slicer(features={'country': None, 'state': None})
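To actually compute sliced statistics, the slicer functions are passed through `StatsOptions`. Treat this as a sketch: the option is named `slice_functions` in older TFDV releases and `experimental_slice_functions` in newer ones, so check your installed version:

# Pass the slicers to the statistics computation (older TFDV API shown)
options = tfdv.StatsOptions(slice_functions=[slice_fn1, slice_fn2])
sliced_stats = tfdv.generate_statistics_from_csv(data_location='customers.csv', stats_options=options)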

Distribution Detection

TFDV can detect unbalanced features as well as uniform distributions. Let’s discuss these two in a bit more detail below:

Detect unbalanced features

In distribution detection, an unbalanced feature is one in which a single value predominates. Unbalanced features can occur naturally, but if a feature always has the same value, it may indicate a data issue.

Now the question is how to detect unbalanced features in a Facets Overview. For this, choose “Non-uniformity” from the “Sort by” dropdown. The most unbalanced features get listed at the top of each feature-type list.


Detect uniform distribution

If all possible values appear with close to the same frequency, this is a sign of uniformly distributed data. Like unbalanced data, this distribution can occur naturally, but it can also be produced by data bugs. To detect uniformly distributed data in a Facets Overview, choose “Non-uniformity” from the “Sort by” dropdown list and check the “Reverse order” checkbox.


Large differences in scale between features

It’s possible that there is a large difference in scale between features. If features vary widely in scale, it may cause challenges in training the model; for example, if some features vary from 0 to 1 and others vary from 0 to 1,000,000,000, there is a big difference in scale. To detect widely differing scales, compare the “max” and “min” columns across features. Normalizing feature values reduces these wide variations.
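As a quick illustration of such normalization with pandas (the file name is hypothetical), min-max scaling brings every numeric feature into the [0, 1] range:

import pandas as pd

df = pd.read_csv('customers.csv')  # hypothetical dataset
numeric = df.select_dtypes('number')

# Min-max scale each numeric column into [0, 1]
df[numeric.columns] = (numeric - numeric.min()) / (numeric.max() - numeric.min())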

Data with Invalid labels

TensorFlow’s Estimators restrict the type of data they accept as labels. For example, a binary classifier typically works only with {0, 1} labels. Review the label values in the Facets Overview and make sure they conform to the requirements of Estimators.

Final thoughts

In this article, you learned about validating the input data fed to machine learning pipelines and its importance in ML projects. We discussed how to generate data statistics and schemas, explained how to compare two datasets based on their statistics and schemas, and walked through how TFDV fits into an ML program.
