How to Choose the Right Feature Store
As the world moves towards AI solutions and SaaS products with AI add-ons, the ML production pipeline is becoming more standardized. MLOps plays a key role in this evolution by coordinating development and deployment teams to create high-quality models.
MLOps deals with optimizing the end-to-end ML pipeline. A major segment of that pipeline is supported by a feature store that manages features and feature functions – everything that comes between the data source and model training.
A feature store is a unit in the machine learning pipeline that ingests raw data, engineers relevant features and stores them efficiently for various teams to access under controlled environments.
Smart features are the lifeline of ML solutions and a game-changer compared to raw data. With the help of feature stores, ML teams can build models faster and better, and keep their solutions working for longer.
Feature Store Overview | Image by VentureBeat
As emerging technologies become a must-have instead of a good-to-have for staying consistently relevant, organizations need to shift from an ad-hoc approach to a more foundational structure that eases the production of AI and data-based solutions in the future.
However, as the corporate world is still on the verge of exploring AI solutions, most companies still execute the primary functions on an ad-hoc basis. In other words, there is no standardized process or infrastructure for solution development, and for each new solution, the functions have to be recreated and tested end-to-end. A major chunk of these functions falls into the bracket of feature development and management.
It turns out that managing features, in our experience, is one of the biggest bottlenecks in productizing your ML models. — Uber
Inefficient feature management is a huge deficit of the ML pipeline, which is non-standardized for the most part. Some challenges that organizations are facing today due to the lack of a standardized infrastructure for feature management are:
- Feature redundancy
When different teams work on the same data, or even if individuals from the same teams work on different solutions by leveraging common data, a lack of transparency of features can lead to duplicate features. This leads to unnecessary time investment in rebuilding the features and also eats up valuable storage.
- Access to correct input data by ML models
Data is bountiful, and therefore it is a challenge to sort the right from the wrong. As the popular saying goes, “garbage in, garbage out”: even a state-of-the-art machine learning model cannot compensate for poor data quality. Managing access to different data sources can also be challenging, especially when restricted by compliance rules.
- Creating features from raw data
Inventing new features cannot be completely automated, especially when the creation process depends on specific business use cases. However, for similar use cases with simple variations between clients, feature creation becomes clerical rather than strategic, and there is high potential for automation. Some feature creation techniques, such as genetic algorithms, can even be automated end-to-end.
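As a rough illustration of the clerical kind of feature creation that a feature store can automate, the sketch below derives routine aggregate features from raw records. The record layout and feature names (`txn_count`, `txn_mean`, etc.) are hypothetical, not from any specific platform.

```python
from statistics import mean

# Hypothetical raw transaction records for one customer.
transactions = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 1, "amount": 40.0},
]

def build_standard_features(records):
    """Automatically derive routine aggregate features from raw records."""
    amounts = [r["amount"] for r in records]
    return {
        "txn_count": len(amounts),
        "txn_total": sum(amounts),
        "txn_mean": mean(amounts),
        "txn_max": max(amounts),
    }

features = build_standard_features(transactions)
print(features)
# {'txn_count': 3, 'txn_total': 240.0, 'txn_mean': 80.0, 'txn_max': 120.0}
```

Once such a routine is defined once, it can be applied to every new client with the same data shape, which is exactly the repetitive work worth automating.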
- Aggregating features into training and validation data
Leaky features are a common cause of poor performance and, ultimately, poor customer satisfaction. When a model is trained on a set that contains information about the very thing it is trying to predict, it shows high prediction accuracy on paper. However, when the model is deployed in production and processes completely new, unseen data, prediction performance falls significantly – leading to unnecessary iterations and a damaged reputation. Feature leakage is usually the result of minor manual errors that expose test/validation data to the training set. It can be avoided by eliminating manual touchpoints, which is where automation by feature stores plays a vital role in closing these tiny margins of error.
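One simple safeguard against leakage is a strict time-based split, so the training set never contains rows from the validation period. The sketch below assumes toy rows of the form `(event_timestamp, feature_value, label)`:

```python
# Each row is (event_timestamp, feature_value, label). Splitting on a time
# cutoff guarantees the training set never sees rows from the validation
# period, removing one common source of feature leakage.
rows = [
    (1, 0.2, 0), (2, 0.4, 0), (3, 0.6, 1),
    (4, 0.8, 1), (5, 1.0, 1),
]

def time_split(rows, cutoff):
    train = [r for r in rows if r[0] <= cutoff]
    valid = [r for r in rows if r[0] > cutoff]
    return train, valid

train, valid = time_split(rows, cutoff=3)
assert max(t[0] for t in train) < min(v[0] for v in valid)  # no temporal overlap
```

A feature store that builds training sets this way by default closes off the manual mistakes described above.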
- Calculating features in real-time
While training on batch data, it is easy to source and calculate features from pre-existing data. But when the solution runs on the production environment, the incoming data is new and arriving in real-time. The challenge is to create the features in online applications with real-time demands.
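A common pattern for serving such features online is to maintain a small rolling state per entity that is updated as events arrive and read with low latency. This is a minimal sketch of that idea, not a production implementation:

```python
from collections import deque

class RollingMean:
    """Maintain a fixed-size window over a stream and serve its mean on demand."""
    def __init__(self, window):
        self.buf = deque(maxlen=window)  # old events fall off automatically

    def update(self, value):
        self.buf.append(value)

    def value(self):
        return sum(self.buf) / len(self.buf)

feature = RollingMean(window=3)
for event in [10, 20, 30, 40]:   # events arriving one at a time
    feature.update(event)
print(feature.value())  # 30.0 -> mean of the last 3 events (20, 30, 40)
```

The key property is that each update is constant-time, so the feature can keep up with real-time demands.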
- Monitoring features in production
Traditionally, machine learning solutions are monitored to gauge the performance of the model. However, the factors that lead to the numbers are often overlooked. For example, drifting (changing statistical properties) features or target variables can cause poor performance with no fault on the solution. Such cases can only be resolved if the properties of the features are closely monitored.
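Drift monitoring can start as simply as comparing the live distribution of a feature against the distribution seen at training time. The threshold of 2 standard deviations below is an illustrative choice, not a universal rule:

```python
from statistics import mean, stdev

def drift_score(baseline, live):
    """Standardized shift of the live mean relative to the training baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(mean(live) - mu) / sigma

baseline = [10, 11, 9, 10, 12, 10, 9, 11]   # feature values at training time
stable   = [10, 11, 10, 9]                  # live window, similar distribution
shifted  = [20, 21, 19, 22]                 # live window after an upstream change

assert drift_score(baseline, stable) < 2.0   # no alert
assert drift_score(baseline, shifted) > 2.0  # drift alert -> consider retraining
```

A feature store running such checks continuously can flag the drifting feature itself, not just the drop in model accuracy.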
These challenges are why an infrastructure specific to machine learning must be established to address the lack of standardization and poor quality management of features. The feature store contributes a huge chunk of this ML-specific infrastructure. Let’s dive into it further.
Establishing a feature store can help in resolving the above challenges by fulfilling a few objectives. Finding the right feature store is a result of determining the best match for the organization. Given that each organization has a unique personality, the optimal feature store for one organization might not fit another. However, these fundamental objectives are some of the defining marks of an excellent feature store platform:
High-level objectives define the end-impact of implementing feature stores. The high-level objectives are met by fulfilling the low-level objectives.
- Scaling ML solutions
Most of the low-level objectives are meant to standardize the feature management process, and a standardized process is the most basic principle for scaling up. Through automation, version controls, low latency, high reliability, and security, there can be a huge net positive impact on the number of solutions produced.
- High-quality production
By minimizing and supervising manual touchpoints, feature stores eliminate error points and can create high-quality, robust feature sets for model training. This improves the quality of the overall solution and ensures faster recovery in case of poor performance.
Low-level objectives define the deeper technical impact of feature stores.
- Verifying integrity of the raw data
A feature store takes care of the intermediary functions between ML models and the data sources. Therefore, to eliminate reiterations or poor quality results because of faulty data, a feature store must check the validity of the data input points, data freshness, and overall data quality.
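In practice these integrity checks often boil down to freshness, completeness, and range validation run before a batch is ingested. The row layout, thresholds, and field names below are illustrative assumptions:

```python
def validate_batch(rows, max_age_seconds, now):
    """Basic integrity checks a feature store might run before ingesting a batch."""
    problems = []
    for i, row in enumerate(rows):
        if now - row["ts"] > max_age_seconds:
            problems.append(f"row {i}: stale data")          # freshness check
        if row["value"] is None:
            problems.append(f"row {i}: missing value")       # completeness check
        elif not (0 <= row["value"] <= 100):
            problems.append(f"row {i}: value out of range")  # sanity-range check
    return problems

now = 1_000_000
rows = [
    {"ts": now - 60, "value": 42},      # fresh and valid
    {"ts": now - 90_000, "value": 42},  # older than the 1-day freshness limit
    {"ts": now - 60, "value": None},    # missing value
]
print(validate_batch(rows, max_age_seconds=86_400, now=now))
# ['row 1: stale data', 'row 2: missing value']
```

Rejecting or quarantining such rows at the ingestion boundary is what prevents faulty data from reaching model training.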
- Feature engineering
Some data solutions require new and innovative features, but basic feature engineering schemes and formulas are necessary for all solutions. Automating the creation of such features can save a lot of resources and time. Also, when new features are created, they can be stored and reused for building future solutions.
- Feature sharing
Once a feature is created, it should be up on a feature registry as a reference for future projects and avoid double work by encouraging reusability of existing features. This saves time and space, and improves team collaboration.
- Integration with production server
Creating features offline during training is very different from creating the same features on online or production servers. The feature store must ensure that features are seamlessly calculated or sourced when demanded in real-time.
- Leveraging all data assets
There are primarily three types of data assets: batch data, streaming data, and real-time data. All three, if available, should be considered during development and production and must be accounted for by the feature store. Though challenging, more quality data means more insights and better results when handled well.
- Monitoring feature drift
As discussed previously, data drift – the changing statistical properties of features or target variables – must be closely tracked to account for shifts in model performance. This keeps solutions long-lasting through effective, more targeted retraining and re-engineering.
- Tracking lineage
Data lineage is a vast topic, and to simply phrase it: it is tracking the journey of data. Feature lineage is a subset of data lineage and is tracked by effective feature stores. It helps understand the data sources, engineering steps, logical reasoning, and potential impact on the solution’s performance.
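A minimal sketch of feature lineage is a registry where each feature records its upstream sources and the transformation that produced it, so the full journey back to raw data can be reconstructed. All names here are hypothetical:

```python
lineage = {}   # feature name -> its upstream sources and transformation

def register(feature, sources, transformation):
    lineage[feature] = {"sources": sources, "transformation": transformation}

def trace(feature):
    """Walk upstream until raw data sources are reached."""
    if feature not in lineage:
        return [feature]              # not registered -> treat as a raw source
    chain = [feature]
    for src in lineage[feature]["sources"]:
        chain += trace(src)
    return chain

register("txn_mean_7d", ["transactions.amount"], "7-day rolling mean")
register("spend_ratio", ["txn_mean_7d", "income"], "division")
print(trace("spend_ratio"))
# ['spend_ratio', 'txn_mean_7d', 'transactions.amount', 'income']
```

With this metadata in place, questions like "which features break if this source table changes?" become simple graph walks.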
- Versioning data
The purpose of data versioning is to improve the reproducibility of every experimental stage. If a particular experiment with features did not work, one could simply roll back to one of the more successful previous versions and start tweaking again from there.
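One lightweight way to get this rollback behavior is to version each feature-set definition by a hash of its contents. This is a sketch of the idea, not any particular tool's implementation:

```python
import hashlib
import json

versions = {}   # version id -> feature-set definition snapshot

def save_version(feature_set):
    """Version a feature-set definition by the hash of its contents."""
    blob = json.dumps(feature_set, sort_keys=True)
    vid = hashlib.sha256(blob.encode()).hexdigest()[:8]
    versions[vid] = feature_set
    return vid

v1 = save_version({"features": ["age", "income"], "scaler": "standard"})
v2 = save_version({"features": ["age", "income", "txn_mean"], "scaler": "standard"})

# The experiment with v2 failed? Roll back to the earlier snapshot and retry.
restored = versions[v1]
assert restored["features"] == ["age", "income"]
```

Because identical definitions hash to the same id, accidental duplicate versions are also avoided for free.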
- Harnessing self-sufficiency
The feature store infrastructure is meant to eliminate the need for engineering teams in the feature management process. This is because feature stores standardize and automate the process, enabling data scientists to work and move features on their terms.
A feature store plays several roles to manage features in an ML pipeline. Below are the key functions or pillars of the feature store infrastructure:
Components of Feature Store | Image from bmc
Processing features offline or in batches is easier than real-time processing, because real-time processing requires low latency and high throughput, which offline features do not. Models are usually trained on features created from batch data, and when the same models go into production, the features have to be served in real-time from new incoming data.
This is done either by creating the features in real-time or synchronizing the production stack with the analytics stack. If the features are created in real-time, the developers must ensure that the code is extremely light and optimized in terms of space and time complexity to avoid buffering and jamming the processing memory. If the features are created by synchronizing two stacks, the features are calculated on batch data and updated in the production stack at hourly or daily intervals. The periodicity of updating the features must align with the business use case.
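The stack-synchronization approach can be sketched as a scheduled materialization job that copies the latest batch-computed values into a low-latency online store. Store layouts and names below are illustrative, not from any specific platform:

```python
offline_store = {            # batch-computed features, refreshed hourly/daily
    ("user", 1): {"txn_mean": 80.0},
    ("user", 2): {"txn_mean": 55.0},
}
online_store = {}            # what production models read from

def materialize(offline, online):
    """Copy the latest batch-computed feature values into the online store."""
    for key, features in offline.items():
        online[key] = dict(features)

materialize(offline_store, online_store)   # run on a schedule in practice
print(online_store[("user", 1)]["txn_mean"])  # 80.0
```

The schedule on which `materialize` runs is exactly the "periodicity" the business use case has to dictate.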
Storing developed features is crucial since it allows reusability. Features are either stored for use in offline or online applications. For online applications, the stored features should be easily and promptly accessible. Such features are often recently created or sourced. For offline use, like training models, data/features across several months or even years are stored. Such storage does not need low latency access.
Raw data is often transformed to become usable as features or as intermediaries for creating secondary features. Data transformation can be of various types, such as category encoding or statistical transformations. Such transformations make the data relevant to the models, and therefore, transformation is one of the key components of feature stores. Irrespective of solution type, transformations also guarantee a consistent format for the features.
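The two transformation types mentioned above can be sketched in a few lines; category names and scaling parameters here are illustrative:

```python
def one_hot(value, categories):
    """Category encoding: map a categorical value to a fixed-length 0/1 vector."""
    return [1 if value == c else 0 for c in categories]

def standardize(value, mu, sigma):
    """Statistical transformation: center and scale a numeric value."""
    return (value - mu) / sigma

categories = ["bronze", "silver", "gold"]
print(one_hot("silver", categories))          # [0, 1, 0]
print(standardize(12.0, mu=10.0, sigma=2.0))  # 1.0
```

Centralizing such functions in the feature store is what guarantees that training and serving apply exactly the same transformation.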
Monitoring comes much later into the production process, yet it is one of the major functions that help make the solutions long-lasting. Data is an ever-changing asset. Sometimes, the target variable might suffer a shift in statistical features, or the predictor variables might change. Therefore, consistently monitoring the data and the model triggers retraining and helps identify the causes that led to the changes.
In real-world scenarios, data comes from multiple sources and is combined to get more wholesome datasets. Therefore, a feature store must extract data from multiple sources and then join them coherently before feature creation starts.
A feature registry is like a directory of features that can be referred to when needed. For example, for building a secondary feature, one can refer to the registry to check if the intermediary feature already exists or if it has to be created from scratch. Similarly, with the help of a feature registry, duplication and discrepancies can be avoided. It is like a one-stop dictionary that can be accessed by several teams for searching and querying the definition, use, or existence of certain features.
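At its core, such a registry is a searchable catalog that teams consult before building anything from scratch. A minimal sketch, with hypothetical feature and team names:

```python
registry = {}   # feature name -> definition metadata

def register_feature(name, description, owner):
    if name in registry:
        raise ValueError(f"'{name}' already exists - reuse it instead")
    registry[name] = {"description": description, "owner": owner}

def lookup(name):
    """Check the registry before building a feature from scratch."""
    return registry.get(name)

register_feature("txn_mean_7d", "7-day rolling mean of spend", owner="risk-team")

# Another team checks before duplicating work:
existing = lookup("txn_mean_7d")
print(existing is not None)  # True -> reuse the existing feature
```

The duplicate-name guard in `register_feature` is the mechanism that prevents the feature redundancy discussed earlier.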
A feature store must serve two types of features:
- Online features
Online feature stores serve online applications that require data or features. Since these applications are real-time, features must be served with low latency and high throughput.
- Offline features
Offline feature stores serve applications running on batch data. This does not need to be a high-speed process, since batch workloads tolerate far higher latency than real-time serving. Such features are used to develop AI/ML solutions or to create the foundational layers of feature governance.
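The online/offline distinction can be captured in one toy interface: online reads return only the latest value for low-latency serving, while offline reads return full history for training-set construction. This is a conceptual sketch, not any vendor's API:

```python
class FeatureStore:
    """Toy dual-serving store: online reads return the latest value,
    offline reads return full history for building training sets."""
    def __init__(self):
        self.history = {}   # entity -> list of (timestamp, value), in order

    def write(self, entity, ts, value):
        self.history.setdefault(entity, []).append((ts, value))

    def get_online(self, entity):
        return self.history[entity][-1][1]   # latest value only

    def get_offline(self, entity):
        return list(self.history[entity])    # full history for training

store = FeatureStore()
store.write("user:1", 1, 0.2)
store.write("user:1", 2, 0.5)
assert store.get_online("user:1") == 0.5
assert store.get_offline("user:1") == [(1, 0.2), (2, 0.5)]
```

Real systems back the two paths with different storage engines, but the contract is the same: one key, two access patterns.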
A feature store should act like a virtual assistant to the ML team by facilitating a smooth feature pipeline round the clock. While the ML team’s job is to innovate and create working solutions, the feature store should manage the team’s feature assets to enable the following:
- Implementation-agnostic features
Irrespective of the type of framework or model used, the feature store serves features consistently, and the solutions adhere to it. This lifts the burden of customized engineering, saves enough time to redirect resources to more strategic needs, and maintains standardization.
- Better inter and intra team collaboration
Feature stores serve as the one-stop destination for accessing, creating, storing, and using features. This eliminates unnecessary communication and reduces iterations between teams. Intra-team collaboration is also boosted, since data scientists can individually work on different features and then have the feature store manage them with no redundancy.
- Faster production
Eliminating iterations between teams and manual touchpoints, and improving validation, monitoring, storage, and feature creation techniques, accelerates development, since feature work occupies a significant chunk of the process.
- Faster serving
Based on the type of storage, the serving speed differs. When correctly configured, optimal serving speed can be obtained for the high-priority real-time demands, and a relatively lower speed suffices for offline training purposes.
- Leveraging all data sources
The data in use for training should be easily available in the production servers as well. However, combining and sourcing from multiple data sources, even though less intimidating in the development stack, is highly challenging in production, since each input point has to be activated, coordinated, and validated in real-time. Even though challenging, feature stores can help overcome a good percentage of these hurdles through strategic automation.
- Self-sufficiency of development team
Feature stores increase the self-sufficiency of data scientists by reducing their dependence on engineering teams. The standardized infrastructure of feature stores enables data scientists to run operations on development and production environments with minimal support.
- Ensuring governance
Governance of data is crucial in ML or data-based projects. It includes security and compliance checks, data quality validation, process health checks, and much more. Feature management via feature stores ensures data governance for a subset of the end-to-end process.
Here are some tools and platforms that can help you to test the waters and figure out how to standardize your feature management process:
Layer is a declarative MLOps platform that helps to develop, deploy and manage ML applications at scale. Layer’s feature store service is fully flexible and offers complex features with a built-in robust data transformation service. Layer provides a central place to find high-quality data and features for use in projects, which is instantly searchable and filterable. It offers offline and online serving, fast production, and high reusability.
Hopsworks feature store integrates with multiple data storage, data pipelines, and data science platforms. It is available as a managed platform on both AWS and Azure. It manages storage to optimize speed and allows vast scaling opportunities.
Some of the key features of Amazon SageMaker’s feature store are that it allows data ingestion from multiple sources, easy search and discovery of existing features, feature consistency and standardization, and easy implementation with AWS SageMaker Pipelines.
Feast is an open-source feature store and is optimal if your organization already has transformation pipelines. Feast offers great storage and serving layers. It offers consistency across training and serving, integrates with existing infrastructure, and standardizes data workflows across teams.
Molecula’s FeatureBase is a feature extraction and storage technology that enables real-time analytics and AI functions. It makes raw data usable, accessible, and reusable.
It is important to test feature store solutions in a controlled environment. A feature store service can offer all the important functions of an optimal feature store. However, some platform-specific properties should also be considered. Here are a few points to note before selecting a feature store service for your organization:
- Data transparency
Feature stores usually encapsulate the deeper data operations that support feature management. However, the feature store should have a way for the users to analyze the inner workings of how it ingests raw data and serves online and offline features.
- Sync with production stack
It is important to analyze if the new feature store can easily integrate with the organization’s existing production framework. The ultimate goal is to shift the model from the developer’s machine to the end customer. This can be achieved through the interaction of the feature store with the production server.
- Controlled experimentation
During the first few stages of implementation, it would be a mistake to blindly trust the feature store and its capabilities. Instead, a few days or months of experimentation and close monitoring of the feature store service are crucial to verifying the correctness of its operations.
- Ease of use
Is the feature store easy to use, or does it add massive complications? Does the organization need additional resources to figure out the feature store, integrate it with the existing framework, and then document it for future users? Overall, if the service adds more burden instead of easing it, it is a no-go.
Strategic automation is the key to making feature stores and the entire ML pipeline more reliable. By eliminating clerical tasks, the organization can redirect its valuable resources towards more strategic and demanding work. Automation also minimizes errors, saves time, and delivers more consistent accuracy.
Working with features can get extremely messy without an automated pipeline, and therefore, it is highly challenging to scale ML solutions without an end-to-end MLOps workflow. Feature stores are built on the foundational principles of DevOps, and consequently, MLOps, offering ML teams the flexibility and speed that traditional software developers have enjoyed for ages.
Nothing is more reliable than a strong foundation for carrying your organization’s machine learning solutions to the frontline at scale. The feature store is one subset of ML pipeline management, and one subset at a time, the entire MLOps puzzle can be solved.