How Machine Learning Teams Share and Reuse Features

Oct 13th 2021

Samadrita Ghosh

Samadrita is an AI specialist with proven experience as a Data Science Engineer.

Over the last few years, the talk of the data town has been that data scientists spend 80% of their time in stages that process the data to make it fit for model consumption. This adage is gradually ceasing to be true in organizations integrating a data-warehouse-like infrastructure for data features. Just like data warehouses have standardized enterprise data and made the job of business analysts easier and faster, feature stores standardize and centralize feature data to make the development of data solutions easier for data scientists.

A feature store is a feature management system that retrieves, stores, manages, serves, shares, and implements features as and when necessary in machine learning projects.

It is, therefore, an excellent recipe for scaling machine learning projects, especially for organizations that aim to turn their solutions into ML-based offerings at scale. ML-enabled companies implement and manage hundreds of ML solutions, and because their centralized systems reuse and share features across various machine learning use cases, these companies continue to scale rapidly.

Why share features?

It might seem out of the ordinary to reuse features across apparently unique solutions, yet this very assumption is a major bottleneck that organizations suffer from. Here’s why sharing features pays off:

  • Conceptual overlap

The practical reality is that various use cases are often conceptually related, especially if the organization specializes in a niche area such as machine learning solutions for marketing. In such a case, the use case of mining relevant lead contacts will have several overlapping features with the use case for understanding customer behavior.
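
As a toy illustration (the feature names here are invented), the overlap between two such use cases can be made concrete as the intersection of their feature sets:

# Invented feature sets for two conceptually related marketing use cases.
lead_mining_features = {"company_size", "industry", "email_opens", "site_visits"}
customer_behavior_features = {"email_opens", "site_visits", "purchase_frequency"}

# The intersection is exactly what feature sharing would save rebuilding.
shared = lead_mining_features & customer_behavior_features
print(shared)  # {'email_opens', 'site_visits'}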

  • Cost of not sharing features

Consider what happens when features and use cases have one-to-one relationships. If an organization hopes to scale without feature sharing, it has to implement features from scratch for every new use case and provision computation and storage for every set of feature values. Add the extra time needed for processing and implementation, and the result is expensive overhead that limits scaling and processing potential.

  • Enhanced ideation

When features are restricted to single use cases, teams working on other use cases can miss out on good feature ideas that would otherwise have benefited their solutions. While some companies ensure a fair exchange of ideas in monthly or quarterly meetings, that process is nowhere close to a transparent structure with round-the-clock visibility into new and evolving ideas.

  • Compounded results

The principle of compounding holds across many verticals of life and business: banking, stock markets, education, even careers. It is no surprise that it also holds for scaling through feature sharing. Once feature sharing is enabled for one use case, the next use case gets to use its own features along with the features of the first solution.

Similarly, the next solution has access to the features of the two previous solutions. Gradually, as the organization builds an army of ML use cases, the store of unique features also builds up and enables developers to leverage a plethora of existing features and ideas.

Challenges of sharing features

Reusing and sharing features across use cases and ML solutions has not been the norm, since sharing features comes with its own set of challenges. These challenges are exactly what feature stores are designed to address.

Let’s take a look at the top five challenges to get more clarity on how feature stores can be an effective solution:

  • Non-standard format

Most organizations that have tried to share features have done so in a non-standard format, usually through code sharing. The problem is that different teams build their code separately, so when one team refers to a set of features from another solution, the code might be out of sync.

The team might have to reformat the return structure or parameters of the feature functions, and such reformatting is not restricted to the feature functions; it also applies to various dependent functions and classes.
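
As a hypothetical sketch of this mismatch (the function names and data shapes are invented), two teams might compute the same feature behind incompatible interfaces:

from typing import Dict, List, Tuple

# Team A returns a per-user dictionary keyed by user ID.
def avg_basket_value_a(orders: List[dict]) -> Dict[str, float]:
    totals: Dict[str, List[float]] = {}
    for order in orders:
        totals.setdefault(order["user_id"], []).append(order["amount"])
    return {user: sum(vals) / len(vals) for user, vals in totals.items()}

# Team B expects a different input shape and returns rows, so Team A's
# callers cannot reuse it without reformatting both the parameters
# and the return structure.
def avg_basket_value_b(user_ids: List[str], amounts: List[float]) -> List[Tuple[str, float]]:
    totals: Dict[str, List[float]] = {}
    for user, amount in zip(user_ids, amounts):
        totals.setdefault(user, []).append(amount)
    return [(user, sum(vals) / len(vals)) for user, vals in totals.items()]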

  • Unreliable sources

In an ML-based organization, there are multiple use cases and multiple developers who keep updating them. Even when features are shared across teams, relying on them can be challenging, and sometimes outright dangerous, when there is little clarity on the validity of the source and the review process.

  • Lack of communication across teams

Having a shared dump of features is not enough. Before using an existing feature, a developer needs to know why the feature was created, how it impacted the performance of its solution, how demanding it is on the available resources, and so on. This information, the metadata of the features, is traditionally not communicated because there is no proper infrastructure to record and convey it.
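
A minimal sketch of the kind of metadata record such an infrastructure would keep (the field names here are illustrative, not any particular product’s schema):

from dataclasses import dataclass, field
from typing import List

@dataclass
class FeatureMetadata:
    name: str                 # e.g. "clicks_per_visit"
    owner: str                # team or developer responsible
    description: str          # why the feature was created
    source_tables: List[str]  # upstream data lineage
    used_by: List[str] = field(default_factory=list)  # consuming models
    avg_compute_seconds: float = 0.0                  # rough resource demand

meta = FeatureMetadata(
    name="clicks_per_visit",
    owner="growth-team",
    description="Average clicks per session, used as an engagement signal",
    source_tables=["events.page_clicks", "events.sessions"],
)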

  • Poor access management

Sharing features via code with no proper access management protocol can be fatal to existing solutions and can even corrupt the logic of upcoming ones. If a user with access to an existing solution updates the feature logic erroneously, it can incur a considerable cost in resources and reputation: the source of the error needs to be tracked down, corrected, and re-implemented while the solution is on pause. Even when there is no direct impact on the solution, an underlying mistake in feature logic eventually compounds into a significant problem affecting several processes and teams.

Methods of feature sharing

Let’s explore a few ways machine learning teams can reuse and share features while confronting the least number of challenges:

Feature Pipelines

Sharing feature pipelines is the most common technique through which organizations share and reuse features. It is done by sharing the source code of the features through platforms such as GitHub, so different teams can pull the features and modify or update them as necessary in their own use cases.

Moreover, the feature logic becomes extremely clear, since developers can easily see what each feature does and what requirement it fulfills by tracing the lineage of the feature pipeline.

However, this technique exposes the features to rampant updates, which can be harmful to existing and in-progress solutions. It is also a significant drain on the organization’s resources since, for every small change, the whole pipeline from the point of change onwards needs to be forked and updated.

Shared storage location

The second technique is publishing features to a shared storage location, and it is partially similar to the above technique of sharing feature pipelines. The main difference is that, in this case, the feature logic and pipelines are encapsulated, and only the feature values are available for different use cases.

Some metadata on the features can still be referenced from company-owned documents or sites. The storage locations can be data warehouses or databases, and the features can be reused for both online and offline solutions.
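
A minimal sketch of this pattern, using pandas with SQLite as a stand-in for a shared warehouse (the table and column names are invented): the producing team publishes only computed feature values, and a consuming team reads them without ever seeing the feature logic.

import sqlite3
import pandas as pd

conn = sqlite3.connect("shared_features.db")  # stand-in for a shared warehouse

# Producing team: publish computed feature values only; the logic stays private.
features = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "clicks_per_visit": [3.2, 1.1, 5.7],
})
features.to_sql("user_features", conn, if_exists="replace", index=False)

# Consuming team: read the published values for a new use case.
reused = pd.read_sql("SELECT user_id, clicks_per_visit FROM user_features", conn)
print(reused)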

The shared location is often managed by the first team that created an ML solution and published its features there. However, this sort of ownership does not work well when it comes to scaling the process.

Moreover, the metadata on such features is usually not enough to establish their validity. In this respect, sharing code is the better method, since it offers an end-to-end view of the feature logic and its significance. The metadata documentation of old features is also rarely updated and becomes obsolete with scale.

Feature Store

A feature store addresses most of the challenges involved in feature sharing. It is a one-stop method that saves the most cost, covers the most ground (the end-to-end feature cycle), and needs minimal maintenance.

Feature stores have not yet been widely explored by organizations, simply because they are new: they were born to solve the challenges the industry at large is facing today. Organizations with a deep interest in scaling and optimizing their ML foundation are now rapidly investing in feature stores to gain the early-bird edge.

Feature stores offer support to design, transform, serve, store, share, monitor, and automate features such that developers can easily create and reuse features while ensuring that existing use cases and pipelines are unaffected.

How feature stores help in feature sharing

Here are a few ways in which feature stores combat the challenges of feature sharing:

Discoverability

A feature store standardizes metadata and feature definitions and stores them in a standard format, allowing a breezy feature search. The metadata that determines whether a feature can be reused is all documented by the feature store, and this data is unlikely to become obsolete because it is recorded automatically wherever applicable.

Feature validation

A feature store tracks feature ownership to validate features and establish trust. It offers a view into the source code and the feature pipelines, enabling developers to trust the feature logic and its reasons for implementation.

Use case agnostic

Given that the feature store manages the storage and implementation of features in standardized formats for both online and offline serving, developers can steer clear of infrastructure details and code reformatting.

Modularity

A feature store is designed to be incremental, so features can be extended and modified without impacting pre-existing solutions in production. Feature dependencies are also well tracked, so that extensions remain complete and the risk of disrupting other use cases stays minimal.
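
A toy sketch of what such dependency tracking enables (the graph here is invented): before changing a feature, list every downstream consumer that would be affected.

from typing import Dict, List

# Invented dependency graph: feature -> direct downstream consumers.
dependents: Dict[str, List[str]] = {
    "clicks_per_visit": ["engagement_score", "churn_model"],
    "engagement_score": ["lead_scoring_model"],
}

def affected(feature: str) -> List[str]:
    """Return every downstream consumer, transitively."""
    seen: List[str] = []
    stack = list(dependents.get(feature, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(dependents.get(node, []))
    return seen

print(affected("clicks_per_visit"))
# ['churn_model', 'engagement_score', 'lead_scoring_model']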

Feature stores are extensive and offer a high return on investment. However, building one in-house can be challenging and can take years to reach high quality, precisely because of that extensive scope. Consulting and leveraging the feature store solutions of niche experts who have already built and refined this infrastructure over a long period can save significant cost and time.

How to reuse features with a feature store?

Feature stores provide the ideal infrastructure for feature sharing through simple and cost-effective methods. In this section, let’s explore how to share features using Layer’s feature store platform.

What is Layer?

Layer is a Declarative MLOps (DM) platform that helps data teams produce machine learning applications from code. With DM, you define what to accomplish rather than describe how to accomplish it: given Dataset, Feature, and ML model definitions, Layer builds the entities seamlessly. This lets developers focus on what matters more, namely designing, developing, and deploying models, without worrying about the infrastructure.

The Layer framework has two central repositories, Data Catalog and Model Catalog, containing reusable entities across all projects within the organizations. In between the two catalogs sits the Feature Store infrastructure.


Why use Layer’s Feature Store?

Layer’s feature store empowers machine learning teams throughout the feature lifecycle by offering the following functionalities:

  • Familiar techniques: Layer does not require developers to learn and implement new domain-specific languages. Instead, they can code the features in Python and SQL and create short, powerful YAML files to declare what is required. Layer takes a declarative, high-level approach to MLOps: developers do not have to invest time in the low-level details of how, and can focus on the what.
  • Full flexibility: With a robust built-in data-transformation service, Layer can create complex features such as feature aggregations with sliding windows, and even allows custom logic in native Python.
  • Reusability and instant discoverability: Feature engineering is often repeated from scratch in new projects because it is challenging to access and discover existing data and features. Layer offers a central place to find high-quality features that are instantly searchable and filterable. With this centralized, single source of truth shared across the organization, teams can easily reuse tested features in new models, which considerably speeds up projects by saving time on feature creation and testing.
  • Guaranteed consistency: Many data teams suffer from using different data when training and serving features, which often leads to drift and inconsistencies. This, in turn, creates challenging bugs and causes model performance to degrade. Layer uses the declared feature definitions both to generate datasets for training and to serve the same features in production, which prevents pipeline issues and removes the associated bugs.
  • Bookkeeping: Layer keeps track of features diligently, enabling better experimentation, “time travel” through older data, and auditing. Users can explain and debug models through reproducible pipelines.
  • Offline and online serving: Layer provides a common interface to define features that can then be served either offline or online. Layer handles all the intricacies of performance, caching, and consistency with no extra effort or input from the user.

Implementing feature reuse with Layer

Let’s dive into a fundamental overview of how feature sharing can be implemented with Layer’s developer-friendly feature store:

Step 1: Layer setup

To set up Layer, first install and then log in:

  • Install Layer

Layer is a Python-based SDK distributed on PyPI. It can be installed via pip, and using virtual environments is recommended when installing with pip.

 

Installation command: `pip install layer-sdk`

In case of permission problems: `pip install --user layer-sdk`

To verify the installation: `layer --help`

If the Layer help text is displayed, the installation was successful.

  • Log in to Layer

To log in, run: `layer login`

A browser window will open for the sign-in process at https://beta.layer.co. If the user doesn’t have a Layer account already, one can be created.

Step 2: Clone an example project

The Layer SDK can be used to clone the pre-built Titanic example created by Layer.

Commands:

  1. `layer clone https://github.com/layerml/examples`
  2. `cd examples/titanic`

Once the checkout is complete, a set of files that declaratively define Layer’s MLOps pipeline can be found. The Titanic Example project has three primary directories:

  • data/titanic_data

This directory contains a YAML file that connects the project to the Titanic dataset. The source data has been uploaded into Layer’s demo database, which serves as this project’s main Datasource.

  • data/passenger_features

Each SQL file in this directory corresponds to a feature that will be used to create the training data. The `dataset.yaml` file is the descriptor for this Featureset.

  • data/models

The `model.py` file in this directory contains the implementation of the model, and the `model.yaml` file is the descriptor for this ML Model.

Step 3: Run the project with Layer

Layer discovers the Featuresets and ML Models in the project, builds them, and puts them into two catalogs:

  • Data Catalog: This is where to start when looking for data while developing models and projects. It is the central repository for all Featuresets across all projects.
  • Model Catalog: This is the central repository for ML Models and can be used to review the results of experiments or to deploy models for inference with a single click.

To run the Layer project use the following command: `layer start`.

Step 4: Deploy and test the model

Layer’s model is now trained and ready to be deployed from the Layer Model Catalog. Clicking the model reveals many details, including the metrics logged during training, such as accuracy, and the signature of the model.


On clicking the + Deploy button at the top right, the model gets deployed!


Once the model is deployed, the button will turn green. Click the 🔗 icon to copy the deployment URL.

To test the model, a real-time prediction can be made. The model is deployed for real-time inference and expects one input per feature to make a prediction (replace $MODEL_DEPLOYMENT_URL with the copied URL):

curl --header "Content-Type: application/json; format=pandas-records" \
--request POST \
--data '[{"Sex":0, "EmbarkStatus":0, "Title":0, "IsAlone":0, "AgeBand":1, "FareBand":2}]' \
$MODEL_DEPLOYMENT_URL

The model will return [0], predicting that a passenger with these inputs won’t survive.

Step 5: Reusing features

Now that a machine learning model has been trained with Layer, let’s look at how to reuse its features. The generated features are stored in the Layer Data Catalog. In the Titanic example, a Featureset called `passenger_features` was created.


To reuse the features, they can be imported to a Jupyter Notebook. Fetching features from Layer in a Jupyter Notebook is quite straightforward. It can be done in three steps:

# Import layer
import layer
# Authenticate your Layer account
layer.login()
# Fetch the features
passenger_features_df = layer.get_features(["passenger_features"])
# Display the features
passenger_features_df

The `layer.get_features()` function accepts a list of the features you’d like to fetch and returns the features as a DataFrame. Once you obtain the DataFrame, you can proceed with your analysis and model building as you’d like!
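
As a quick follow-up sketch of that last step, here is one way to train a model on the fetched DataFrame. This assumes scikit-learn is installed and that the Featureset exposes a `Survived` label column; adjust the column names to the actual schema:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumes the fetched DataFrame has a "Survived" label column;
# adapt this to the actual schema of your Featureset.
X = passenger_features_df.drop(columns=["Survived"])
y = passenger_features_df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))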


Best practices of feature sharing

Here are some best practices for feature sharing:

  • Define feature groups

Several features used or created across different use cases can relate to a particular entity or group. For instance, the feature “clicks per visit” both describes the behavior of a user and estimates the performance of the website. To optimize references and discoverability, it is best practice to create groups of features that jointly define particular entities, such as the user and the website, as in the sketch below.
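
A minimal sketch of such grouping (the entity and feature names are invented):

# Invented example: features organized by the entity they describe.
feature_groups = {
    "user": ["clicks_per_visit", "avg_session_minutes", "days_since_signup"],
    "website": ["clicks_per_visit", "bounce_rate", "avg_page_load_ms"],
}

# "clicks_per_visit" legitimately belongs to both groups: it describes
# user behavior and website performance alike.
def find_groups(feature: str) -> list:
    return [g for g, feats in feature_groups.items() if feature in feats]

print(find_groups("clicks_per_visit"))  # ['user', 'website']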

  • Balance resources

While serving features to both online and offline consumers, it is important to watch CPU consumption. If the offline batch processes take a toll on CPU resources, online serving can slow down. Therefore, if the online solution is the priority, it is best to optimize CPU allocation by first analyzing the consumption of the online solution and then allocating the remaining resources to offline serving.

  • Use declarative platforms

The declarative approach means that the machine takes care of how to achieve a task. Developers do not have to instruct it explicitly on how to do the task; all they need to do is state what needs to be executed, and the machine figures out the rest.

A feature store that sits on top of a declarative platform has access to a plethora of speedy insights straight from the machine. It can enable partial or full automation of several feature-sharing modules, eventually making the process more reliable and faster.
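
A small illustration of the difference in plain Python (not Layer’s API): the imperative version spells out every step of computing a feature, while the declarative version only states what is wanted and leaves the how to an engine.

import statistics

orders = [{"user": "u1", "amount": 20.0}, {"user": "u1", "amount": 40.0}]

# Imperative: spell out *how* to compute the aggregate, step by step.
total, count = 0.0, 0
for order in orders:
    if order["user"] == "u1":
        total += order["amount"]
        count += 1
imperative_avg = total / count

# Declarative: state *what* is wanted; an engine owns the execution.
spec = {"feature": "avg_order_amount", "entity": "u1", "agg": "mean"}

def run(spec, data):
    values = [row["amount"] for row in data if row["user"] == spec["entity"]]
    return statistics.mean(values)

declarative_avg = run(spec, orders)
assert imperative_avg == declarative_avg == 30.0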

Final thoughts

If this article had to be summarized in a few lines, it would be that feature sharing is the next step towards ML maturity for organizations that aim to become ML-enabled as early as possible. As the adage goes, the simple act of sharing knowledge multiplies it, and the same holds for feature sharing across the various ML solutions an organization runs. Each ML solution is a treasure trove of invaluable features, and missing the opportunity to use this data for want of a proper sharing system is not worth the cost of rebuilding good features from the ground up.

