What is a feature store?
Within the world of machine learning (ML) development, the effort and time dedicated to feature engineering, analysis, model training, and deployment are key to an organization's success. Undoubtedly, the rapid growth of machine learning is posing new challenges to large organizations, and developing machine learning algorithms is just the tip of the iceberg. Without configuring and bringing your ML systems into production seamlessly, all the development effort would be in vain.
In this article, I will first brief you on some of the technical debt in the world of machine learning. From there, I will focus on feature stores, which are considered one of the most essential components of successful ML Operations (MLOps). After all, it is well acknowledged that a machine learning model is only as accurate as the data and features fed into it.
In the paper "Hidden Technical Debt in Machine Learning Systems," a team at Google offers some hard-won advice for machine learning practitioners, developers, and researchers running machine learning systems in production. Similar to the framework of technical debt in the software world, "technical debt" in ML refers to a dangerous emerging pattern: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time gets more difficult and expensive.
One of the major categories of technical debt outlined in the paper relates to ML features. Key problems include:
- No centralized location to store and access features during model serving,
- Lack of feature reproducibility and reusability across ML pipelines, leading to minimal, if any, ML collaboration among data science teams,
- Training-Serving Skew where features in training and production are inconsistent,
- Changes in feature data are likely to cause model weight changes, making the entire system vulnerable and sometimes causing catastrophic failures. Even worse, it can be challenging to pinpoint whether specific features or the whole system needs to be recalculated and reconfigured.
Knowing about this technical debt, does it mean we should avoid machine learning altogether? Or, on the flip side, should we ignore the technical debt for the benefit of moving fast in the short run? The answer to both, of course, is no. As the paper notes, data teams must be aware of potential technical debt and make their best effort to devise mitigation strategies. One such strategy, or best practice, for reducing technical debt is feature stores!
The concept of a feature store was first introduced in the Michelangelo platform created by Uber. The feature store, according to Uber, is a centralized operational data management layer that allows teams to store, share and reuse highly curated features for their machine learning projects.
ML features are sets of measurable properties or characteristics that act as input/predictors in the system. Depending on the specific projects, the source of ML features can be everywhere, from raw tabular data to words or sentences of human languages, to pixels of images, and finally to the aggregated or derived variables coming out of the process of feature engineering.
To build a model, data scientists need to access the raw data, look at the distributions, and generate a list of promising features, often through lots of trial and error. When the model is promoted to production, data science teams face new challenges, such as calculating and serving features and continuously monitoring these features in production. Just after celebrating the implementation of Model A, the next Model B comes along, and the same procedures need to be repeated.
As data scientists, we all find the challenges mentioned above relatable. What makes it even more complicated is that ML projects rarely happen sequentially but rather take place in parallel. On top of that, with more observation samples being added, the model will be trained on an ever-increasing dataset, which poses additional challenges of model scalability and efficiency.
The process depicted below shows an ML infrastructure without a feature store for several projects.
As we can see, there are two kinds of duplication: first, repeated code in the feature engineering procedures, e.g., Feature A in both Training Model A and Training Model B; second, repeated code across implementations, with one version included in training and another in deployment. By producing tremendous redundancy, this non-DRY (Don't Repeat Yourself) code not only complicates our ML process but also renders the models error-prone.
From the perspective of data infrastructure management, there is no easy access to the available features because they are all embedded in specific training jobs. Hence, the reusability of these features is significantly compromised, if not impossible.
Now, let’s look at the same structure, but with a feature store bridging the data source and the ML process in the middle.
It doesn’t take a rocket scientist to point out how well-organized this pipeline looks, does it?
All available features, either raw data elements or derived representations from different sources, are cached in a centralized layer for easy access with the feature store. These features can also be reused across all internal projects within the organization and maintained or governed at the enterprise level.
With the basic understanding of what a feature store is and its functionality connecting the data sources and ML applications, let’s move on to an overview of the key components that typically make up a feature store.
When it comes to the ML features, there are two types: online and offline. Offline features, as the name suggests, are mainly used by offline batch operations. For example, the average number of transactions in the past six months is an offline feature. These operations are known to take a longer time because they are often designed to analyze large volumes of historical data. Common technologies involved to store and process offline features include Hadoop, Spark, cloud storage (AWS S3, Azure), Teradata, IBM Netezza, etc.
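To make the offline case concrete, here is a minimal, pure-Python sketch of the batch feature mentioned above: the average number of transactions per month over the past six months. The data layout and function name are illustrative assumptions; in practice this aggregation would run in a batch engine such as Spark over a data warehouse or data lake.

```python
from datetime import datetime, timedelta

def avg_transactions_past_six_months(transactions, as_of):
    """Offline batch feature: average monthly transaction count
    over the six months before `as_of`.

    `transactions` is a list of (timestamp, amount) tuples; a real
    pipeline would compute this over historical data in batch.
    """
    window_start = as_of - timedelta(days=182)  # roughly six months
    in_window = [t for t, _ in transactions if window_start <= t < as_of]
    return len(in_window) / 6  # average transactions per month

# Example: 12 transactions spread over the trailing six months
now = datetime(2021, 6, 1)
txns = [(now - timedelta(days=15 * i), 100.0) for i in range(1, 13)]
print(avg_transactions_past_six_months(txns, now))  # → 2.0
```

Because the window looks far back in time, this value only needs to be refreshed on a batch schedule, which is exactly what makes it an offline feature.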
As opposed to offline features, online features need to be calculated and served in (near) real-time. Examples of these features would be standardizing a real-time observation score for an individual prediction. Calculating online features requires almost immediate access to the data and even faster computational resources. The resulting features can be stored in a key-value format database, such as MySQL Cluster and Redis.
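As an illustration of the key-value layout, here is a toy in-memory online store. The class and its method names are a simplified sketch, not any real client library; a production online store would back the same put/get access pattern with a low-latency database such as Redis or MySQL Cluster.

```python
class OnlineFeatureStore:
    """Minimal in-memory sketch of an online feature store,
    keyed by entity id with feature name/value pairs."""

    def __init__(self):
        self._data = {}  # entity_id -> {feature_name: value}

    def put(self, entity_id, features):
        # A streaming job writes the freshest values as events arrive
        self._data.setdefault(entity_id, {}).update(features)

    def get(self, entity_id, feature_names):
        # The model server reads values back with low latency
        row = self._data.get(entity_id, {})
        return {name: row.get(name) for name in feature_names}

store = OnlineFeatureStore()
store.put("user_42", {"txn_count_1h": 3, "avg_amount_1h": 57.5})
print(store.get("user_42", ["txn_count_1h", "avg_amount_1h"]))
# → {'txn_count_1h': 3, 'avg_amount_1h': 57.5}
```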
With feature stores, data science teams can access and use both online and offline features together in the same model for inference.
Specifically, feature stores compute or transform feature values regularly when new data samples come in. They can also be configured to verify the consistency of these calculations using proper statistical analytics. All these processes will ensure that the data steps are completed successfully before being served to the model.
If new features or transformations are introduced on the model development side, feature stores can run “backfill” jobs that generate training data against historical values. Take UberEATs as an example; this near-real-time workflow is used for features like a “restaurant average meal preparation time over the last hour”. Some feature stores automatically backfill newly registered values by linking feature groups to jobs and data dependencies.
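A backfill job can be pictured in a few lines: replay the historical event log and compute, for each past timestamp, the value the new feature would have had at that moment. The function and data layout below are illustrative, using the UberEATs-style "prep time over the last hour" feature as the example.

```python
from datetime import datetime, timedelta

def backfill_hourly_prep_time(events, timestamps):
    """Compute 'average meal preparation time over the last hour'
    retroactively for each historical timestamp, producing
    point-in-time-correct training rows for a newly added feature."""
    rows = []
    for ts in timestamps:
        # Only events strictly before ts and within the trailing hour
        window = [p for t, p in events if ts - timedelta(hours=1) <= t < ts]
        rows.append((ts, sum(window) / len(window) if window else None))
    return rows

events = [
    (datetime(2021, 5, 1, 9, 0), 30.0),   # outside the 10:00-11:00 window
    (datetime(2021, 5, 1, 10, 10), 12.0),
    (datetime(2021, 5, 1, 10, 40), 18.0),
]
print(backfill_hourly_prep_time(events, [datetime(2021, 5, 1, 11, 0)]))
# → [(datetime.datetime(2021, 5, 1, 11, 0), 15.0)]
```

The strict `< ts` bound matters: each training row may only see data that existed before its timestamp, otherwise the backfill would leak future information into training.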
In addition to calculations and transformations, feature stores persist features as data from different sources passes through them. Mirroring the feature types, feature storage also comes in online and offline flavors.
As features evolve with the ML lifecycle, offline storage maintains a record of historical versions of various features along with the associated metadata. By contrast, online storage gives data teams access to low-latency feature values to build feature vectors used in model inference.
Specifically, offline features can be stored in regular data warehouses like BigQuery or Redshift, and extending an existing data warehouse is likely to be the preferred approach. On the other hand, online stores typically hold only the most up-to-date values and often do not require strict data consistency.
Now, we have the features available and ready to go. The following step in the workflow would be to retrieve and feed them into our ML applications. The process is known as feature serving. In addition to storage, feature stores are designed to serve features to different models, usually through an API.
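The serving side can be pictured as one feature list with two retrieval paths. Everything below — the class, method names, and feature names — is a simplified assumption for illustration, not any particular vendor's API, which would typically sit behind a REST or gRPC endpoint.

```python
FEATURES = ["txn_count_1h", "avg_amount_6m"]

class StubServingLayer:
    """Toy stand-in for a feature store's serving layer."""

    def __init__(self, history, latest):
        self._history = history  # (entity_id, feature, as_of) -> value
        self._latest = latest    # (entity_id, feature) -> value

    def historical(self, entity_id, features, as_of):
        # Offline path: point-in-time-correct values for training data
        return {f: self._history[(entity_id, f, as_of)] for f in features}

    def online(self, entity_id, features):
        # Online path: freshest values for real-time inference
        return {f: self._latest[(entity_id, f)] for f in features}

store = StubServingLayer(
    history={("u1", "txn_count_1h", "2021-05-01"): 2,
             ("u1", "avg_amount_6m", "2021-05-01"): 40.0},
    latest={("u1", "txn_count_1h"): 3, ("u1", "avg_amount_6m"): 42.5},
)
training_row = store.historical("u1", FEATURES, "2021-05-01")
inference_vec = store.online("u1", FEATURES)
```

The point of the sketch is that both paths share the single `FEATURES` definition, so training and inference cannot silently diverge on which features they use.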
The critical point of this step is to make sure that the features used to train our model match exactly the features provided in production. Otherwise, the process will lead to a catastrophic outcome known as Training-Serving Skew. Simply put, Training-Serving Skew is a difference in model performance between training and serving/production, which is often caused by a discrepancy in data handling between the training and serving pipelines.
Feature stores ensure that the same features are easy to discover, retrieve, and analyze with other existing features in both training and production. This helps to overcome the Training-Serving Skew. Feature stores also provide data science teams a canonical way to use all shared features within the organization.
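The core idea — one transformation definition shared by both pipelines — fits in a few lines. This is a deliberately minimal sketch; the function and the statistics it uses are hypothetical.

```python
def standardize(value, mean, std):
    """Single source of truth for the transformation, registered once
    and applied identically in training and serving."""
    return (value - mean) / std

# Training pipeline (batch): statistics computed over historical data
train_mean, train_std = 50.0, 10.0
training_feature = standardize(65.0, train_mean, train_std)

# Serving pipeline (real time): the SAME function with the SAME
# statistics, so the model sees identically scaled inputs in production
serving_feature = standardize(65.0, train_mean, train_std)

assert training_feature == serving_feature == 1.5
```

Skew typically creeps in when this transformation is reimplemented twice (say, once in SQL for training and once in application code for serving); registering it in one place removes that failure mode.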
Last but certainly not least, once built, our ML applications require consistent monitoring to avoid model drift (i.e., the statistical properties of the features change). Feature stores suit this requirement because all historical data and the latest up-to-date values are kept in a time-consistent manner. With all data information available, comparing feature values and model performance would become as simple as an API call.
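As a concrete example of such a comparison, here is a small, dependency-free sketch of the Population Stability Index (PSI), a common drift statistic that compares a feature's training-time distribution with its recent production distribution. The thresholds and binning are illustrative conventions, not part of any particular feature store's API.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training-time ('expected')
    and a production ('actual') sample of one feature. A common rule of
    thumb: PSI < 0.1 suggests no significant drift, while PSI > 0.25
    suggests the feature has shifted enough to warrant retraining."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample, b):
        left, right = lo + b * width, lo + (b + 1) * width
        count = sum(1 for x in sample
                    if left <= x < right or (b == bins - 1 and x == hi))
        return max(count / len(sample), 1e-6)  # floor avoids log(0)

    return sum((frac(actual, b) - frac(expected, b))
               * math.log(frac(actual, b) / frac(expected, b))
               for b in range(bins))

baseline = [float(x) for x in range(1, 101)]       # training-time sample
drifted = [float(x) + 50 for x in range(1, 101)]   # shifted production sample
print(round(psi(baseline, baseline), 6))  # → 0.0 (identical distributions)
```

Because the feature store retains both historical and current values, running a check like this is just a matter of pulling two samples and comparing them on a schedule.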
Feature stores also evaluate the overall health of an ML system by calculating and monitoring the operational metrics, e.g., metrics for feature serving or model evaluation (latency, error rates). These metrics can be made accessible to existing model monitoring infrastructure, which automatically compiles health alerts into specific views available to users.
As discussed throughout the blog, there are several advantages of leveraging a feature store:
In feature stores, data features are documented and pre-computed to be ready for sharing across the organization. This means that different data science teams don’t have to re-create similar code to engineer similar features. Rather, they can focus their effort on the actual modeling or algorithm development.
From the data engineering perspective, a feature store is a centralized storage for all features, making it much easier to manage than having data everywhere.
As mentioned earlier, feature stores eliminate Training-Serving Skew by applying the same transformations in both training and production pipelines. On top of that, they monitor data quality and model performance in production to ensure consistent predictive capability over time.
With feature stores, instead of working in silos to create and reimplement thousands of pipelines on their own, data science teams can share and collaborate. This will give them a comprehensive overview of the features available and avoid duplicated work due to not being aware of each other’s projects.
Having understood modern feature stores, we can now appreciate that features sit at the core of all ML pipelines. In fact, data science communities expect feature store adoption to grow considerably within the next few years.
Looking to get started with feature stores? Here are several options:
Layer is a Declarative MLOps (DM) platform that helps data teams of all sizes produce ML applications based on code. Users provide Dataset, Feature, and ML model definitions, and Layer builds these entities seamlessly.
The first-class entities in Layer include Datasets, Features, and ML models, with the former two being placed in the Data Catalog and the ML models in the Model Catalog. This structure enables easy monitoring and managing of the ML life cycle and improves the reusability and reproducibility of entities across different projects within your organization.
Image Source: Layer ML pipeline
Specifically, the Layer Data Catalog is a central repository for data teams to build their features and serve them in both online and offline (batch) fashion. The Layer Data Catalog functions as a feature store and makes sure that your dataset meets the quality expectations by executing tests automatically.
Regarding the data sources, Layer allows users to integrate as many data sources as needed in the Layer UI, and thus, utilize them to run complex cross-data source queries. As is shown in the Layer ML pipeline chart, you can use SQL queries to transform the raw datasets and extract analysis features. You can also leverage more advanced scripting, such as Python to gain more transformation capabilities.
In our upcoming blogs, we will demonstrate use cases to build ML applications with Layer to showcase how it makes the model building and deployment more straightforward and robust. So stay tuned!
As an enterprise platform, Hopsworks integrates with popular data processing platforms, such as Apache Spark, TensorFlow, Kafka, and many others, and can be accessed via either a REST API or a user interface.
With Hopsworks 0.8.0, the open-source feature store service has been released to be integrated into the HopsML framework. The Hopsworks feature store allows data science and data engineering teams to manage ML features efficiently.
However, one challenge of using Hopsworks is that it relies heavily on the HopsML infrastructure.
Introduced in December 2020 as a new capability of Amazon SageMaker, SageMaker Feature Store is a fully managed, centralized repository to store, update, retrieve, and share ML features.
All features stored in SageMaker Feature Store are created in groups and tagged with metadata for a quick feature search and discovery. This makes it easy for data science teams to evaluate whether the available features are suitable for their models.
In addition, features can be directly retrieved and used in the entire SageMaker workflows, i.e., from model training to data transformation and ultimately to real-time inference.
Admittedly, feature stores are still at an early stage of development. Still, organizations that actively and consistently promote ML applications to production have realized the importance of building a centralized repository to engineer, transform, and store analysis features for easy availability and reusability. I hope that this blog gave you a big picture of feature stores and inspired you to explore more possibilities with feature stores to make your ML implementation more efficient.
In this blog, we have covered,
- Technical Debt in Machine Learning Systems
- What is a feature store?
- ML infrastructure: without vs. with a feature store
- Concepts and Components of a feature store
- Why ML applications need feature stores
- Where to start with feature stores
Feature stores will not allow the technical debt to tear us apart!