Best practices for MLOps

Jun 28th 2021

Dhiraj Kumar

Dhiraj Kumar is a Data Scientist & Machine Learning Evangelist.

Are you aware that many Machine Learning projects get delayed due to inefficient collaboration between data scientists and the operations or production team?

Most enterprise-level software development work requires a high degree of collaboration between different teams, and machine learning projects are no exception.

The commercial use of AI and ML is still new relative to traditional software development, which makes a systematic and collaborative approach to machine learning operations all the more necessary.

In this article, we will discuss a collaborative system called MLOps and its best practices in detail. MLOps aims to automate data workflows and, in doing so, produce better data insights.

We’ll start by understanding what MLOps is in more detail.

What is MLOps?

MLOps stands for machine learning operations. In the simplest form, it is coordination between data scientists and the corresponding operations team.


MLOps is an engineering discipline that works to solidify machine learning systems development and operations. The objective is to streamline the continuous development, testing, deployment, and monitoring of high-performing machine learning models in production.

Many organizations have started integrating machine learning models into their products, leading to changes in the existing practices and principles of the software development lifecycle and DevOps. Therefore, a new engineering discipline has emerged, called Machine Learning Operations.

In the next section, we will discuss the typical components of a machine learning operations system.

Components of MLOps Systems

Before discussing MLOps best practices, it is worth first walking through the various components of an MLOps system.


Inspired by this paper.

Feature store

A feature store is a crucial component of MLOps systems. Features are generated through feature engineering, which is the process of creating features from raw data. High-quality features mean better machine learning models.

Generating new features takes a significant amount of effort and time. Once the new features are generated, they are stored in a feature store. This will eventually save a tremendous amount of time and effort in the long run as it will improve the reusability of features.
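The idea can be sketched with a toy in-memory store. A production feature store adds persistence, point-in-time correctness, and online/offline access, but the core write-once, read-many interface looks roughly like this (all names here are illustrative):

```python
import time


class FeatureStore:
    """Minimal in-memory feature store, keyed by entity id and feature name."""

    def __init__(self):
        self._store = {}

    def put(self, entity_id, feature_name, value):
        # Store the value with a timestamp so later reads can audit freshness.
        self._store[(entity_id, feature_name)] = (value, time.time())

    def get(self, entity_id, feature_name):
        value, _ts = self._store[(entity_id, feature_name)]
        return value


# Feature-engineering code writes a feature once; every training and
# serving job can then reuse it instead of recomputing it.
store = FeatureStore()
store.put("user_42", "avg_order_value", 37.5)
print(store.get("user_42", "avg_order_value"))  # 37.5
```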

Data versioning

Data versioning helps automate machine learning model development and is required for reproducible results. Though you can build versioning from scratch, there are many data versioning tools available on the market.
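As a minimal sketch of the idea, a version identifier can be derived from the dataset contents themselves, so that any change to the data yields a new version:

```python
import hashlib


def dataset_version(contents: bytes) -> str:
    """Derive a short, deterministic version id from the dataset contents."""
    return hashlib.sha256(contents).hexdigest()[:12]


v1 = dataset_version(b"id,label\n1,0\n2,1\n")
v2 = dataset_version(b"id,label\n1,0\n2,1\n3,0\n")  # new rows -> new version
print(v1 != v2)  # True
```

Dedicated tools work on the same principle but also store the data snapshots and link them to code revisions.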

Machine learning metadata store

Metadata is a standard way to provide machine learning model descriptions. Machine learning metadata is information about what the model is supposed to do and what its input and output data look like.

The metadata has mainly two parts. The first part is the human-readable part about best practices for using the model, and the second part is the machine-readable part that code generators can use.
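As a rough sketch, a metadata record with those two parts might look like the following (the field names are illustrative, not a standard):

```python
import json

# Hypothetical metadata record: a human-readable section for people,
# and a machine-readable section that tooling can consume.
model_metadata = {
    "human_readable": {
        "description": "Predicts churn probability for active subscribers.",
        "intended_use": "Batch scoring only; retrain monthly.",
    },
    "machine_readable": {
        "input_schema": {"tenure_months": "int", "monthly_spend": "float"},
        "output_schema": {"churn_probability": "float"},
    },
}

# Serialising to JSON makes the record usable by both people and tooling.
record = json.dumps(model_metadata, indent=2)
print(json.loads(record)["machine_readable"]["output_schema"])
# {'churn_probability': 'float'}
```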

Model versioning

Version control is part of machine learning operations used to track the changes to the machine learning model over time. Machine learning model creation is mostly rapid experimentation with lots of iteration. Experimentation requires keeping track of model development history.

By implementing the version control of machine learning models, we can analyze how well we are doing and what parameters we used to achieve that.

Model registry

A Model registry is a centralized tracking system that stores and versions metadata for machine learning models.

The model registry is a service used for the management of model artifacts. It can also be used to keep a record of which model is deployed in production.

For example, MLflow is a Model Registry for ML models.
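A toy sketch of what a registry does under the hood (a real registry such as MLflow adds much more, but the core bookkeeping is versioning artifacts and recording which version is in production):

```python
class ModelRegistry:
    """Toy registry: versions model artifacts and tracks the production one."""

    def __init__(self):
        self._models = {}      # (name, version) -> artifact
        self._latest = {}      # name -> latest version number
        self._production = {}  # name -> version currently deployed

    def register(self, name, artifact):
        version = self._latest.get(name, 0) + 1
        self._models[(name, version)] = artifact
        self._latest[name] = version
        return version

    def promote(self, name, version):
        self._production[name] = version

    def production_model(self, name):
        return self._models[(name, self._production[name])]


registry = ModelRegistry()
v1 = registry.register("churn", {"weights": [0.1, 0.2]})
v2 = registry.register("churn", {"weights": [0.3, 0.1]})
registry.promote("churn", v2)
print(registry.production_model("churn"))  # {'weights': [0.3, 0.1]}
```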

Model deployment

Machine learning engineers and data analysts are experts in training and evaluating models. However, deploying these models to production requires skills that lean more towards general software engineering and MLOps.

If we try to formalize it, deployment simply means integrating a machine learning model into a production environment or production server. The deployment of the model makes it possible to make valuable predictions that will aid in business decision-making.

Technically, when we deploy a model, we create a version of the machine learning model. After that, we link the model version to the model file stored in a production server environment.
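As a minimal illustration of that last step, the snippet below serialises a stand-in model under an explicit version identifier and then loads exactly that version back, the way a production server would (paths and names are hypothetical):

```python
import os
import pickle
import tempfile

# A stand-in "model": any serialisable trained object would do here.
model = {"threshold": 0.5}

# Deploying: write the trained model to an artifact file whose name
# carries an explicit version identifier.
artifact_dir = tempfile.mkdtemp()
version = "v3"
artifact_path = os.path.join(artifact_dir, f"churn-{version}.pkl")
with open(artifact_path, "wb") as f:
    pickle.dump(model, f)

# The production server later loads exactly that version.
with open(artifact_path, "rb") as f:
    served_model = pickle.load(f)
print(served_model == model)  # True
```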

Model serving

After deploying a machine learning model, MLOps services are responsible for keeping it online and working as expected. Also, stakeholders expect to have minimal downtime. The downtime of ML systems is a common issue encountered in MLOps.

The suggested solution is to use machine learning platforms designed for enterprise-grade serving. These systems are built to serve many model requests with high throughput.

Model monitoring

A machine learning project owner is interested in creating a good machine learning model and maintaining its performance for a long period. Therefore, a model monitoring sub-system is required in MLOps systems.

Due to evolving data profiles, the production performance of models may degrade. Machine learning models in production can start showing sub-optimal performance silently, and they tend to degrade more often than conventional software systems. Therefore, we need to plan for this while designing our MLOps system.

Model retraining

A well-performing machine learning model in production is not the end of the story. As we know, machine learning models need both high-quality and plentiful data for training, so as more data arrives in the future, we should keep retraining the model at regular intervals.

Also, keep in mind that training the machine learning model is not a single, one-time process. Many times, just to ensure the model’s predictive performance over a long period, we need to keep re-training the model frequently. We may require a good chunk of new training data for re-training. Therefore, in our MLOps pipeline, we need to consider model retraining as well.

In the next section, we are going to discuss machine learning operational issues.

Machine learning operations issues

An MLOps system runs into many issues if it is not designed and executed correctly. In this section, let us discuss the common problems encountered while implementing MLOps in a company.


Handling large data

MLOps systems generally handle large amounts of data, which creates demand for facilities like distributed data storage and streaming. Suggested solutions include the Hadoop Distributed File System, along with Kafka and Cassandra.

Uncleaned data

In most cases, we receive uncleaned data from our data collection sources. However, for data modeling, we require properly cleaned and formatted data. Therefore, we need to clean and format data as per our project needs.

Generally, the suggested solution in this case is to use tools such as OpenRefine or Trifacta Wrangler.

Model training issues

Almost all MLOps systems have to handle the installation and setup of the required machine learning or deep learning frameworks. You can train your model more systematically and efficiently using standard, well-known frameworks.

In this case, the suggested frameworks are Google’s TensorFlow, Facebook’s PyTorch, and Apache Spark.

Model management issues

Over time, many trained machine learning models and datasets accumulate in the MLOps system, leading to file management issues. All these files and models, together with their corresponding datasets, need to be managed.

The suggested solution is to use Hadoop Distributed File Systems, Google File System, MongoDB, etc. For example, HDFS provides a way to store large files in a distributed manner.

In the next section, we are going to discuss the advantages of using MLOps.

Advantages of MLOps

MLOps is instrumental in bringing machine learning system workflows to production. MLOps leads to less friction among machine learning teams and corresponding operations teams.

Let us discuss a few advantages of MLOps.


Eliminates waste

Eliminating waste is one of the essential aspects of MLOps. This is mainly achieved by performing periodic audits and creating clear and concise documentation. Ultimately, this effort results in richer and more consistent insights.

Automates processes

Automation is helpful for making a workflow efficient, and MLOps is no exception in this regard. MLOps helps reduce the time and cost of getting machine learning models into production, which is achieved partly through automation.

Better insights

MLOps helps improve machine learning pipeline processes. This is primarily done by standardizing the process (for example, the data cleaning process) and using state-of-the-art tools and platforms. With the help of MLOps, the platforms and tools work in sync, producing highly valued data insights.

Highly collaborative

A lack of communication can damage a company beyond imagination. An extensive system may have many efficient components, but we need an arrangement in which these components work in tandem with each other. In other words, we need a high degree of collaboration inside the MLOps system. MLOps is responsible for creating the procedures for handing tasks off between departments. This creates a highly effective workflow and communication pipeline.

In the next section, we are going to discuss best practices related to MLOps.

MLOps Best Practices

To get the best result out of MLOps, we must follow best practices during the design and operations of machine learning systems. In this section, let’s discuss some of the top best practices.

Design to iterate

Iteration is the process of repeating a set of tasks to get the desired output. In the context of machine learning, an iteration is one step in which an algorithm's parameters are updated.

However, when designing an MLOps system, we need to keep in mind that production-ready machine learning models require a lot of iteration. Therefore, we must design the MLOps system by considering iteration as a critical factor.

Let’s understand it with an example. As we create a machine learning model, we need to think about how easy it is to add, remove or combine features of the dataset. Also, we may need to design with the ability to run two or three copies of model training in parallel. It is beneficial to develop the system so that it is possible to create a fresh copy of the pipeline itself.

Implement a data validation pipeline

Data is the most critical part of the MLOps pipeline. Therefore, it is of utmost importance to identify errors in the data. Data validation will eventually help the machine learning model to perform better in the long run. Identifying errors in data is essential, but finding them as early as possible in the life cycle is crucial.

Invalid data or errors in data may create various issues. By invalid data, we mean errors both in the values of individual data points and in the statistical properties of the data as a whole.

If the data used for prediction has different statistical properties than the training data, we would like to mark it invalid. The statistical property of data may change due to various reasons, such as data drift.

To solve this problem, we can validate the data using data validation libraries like Cerberus. Working with Cerberus is very easy: you define a validation schema, create an instance of the validator class, and pass the document to it.
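To illustrate the schema-and-validator pattern that libraries like Cerberus implement, here is a small stdlib-only sketch (the rules supported are deliberately minimal, and the schema is a made-up example):

```python
def validate(document, schema):
    """Return a list of errors; an empty list means the document is valid."""
    errors = []
    for field, rules in schema.items():
        if field not in document:
            errors.append(f"{field}: required field missing")
            continue
        value = document[field]
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
        elif "min" in rules and value < rules["min"]:
            errors.append(f"{field}: below minimum {rules['min']}")
    return errors


# Each incoming record is checked against the schema before training.
schema = {"age": {"type": int, "min": 0}, "income": {"type": float}}
print(validate({"age": 34, "income": 52000.0}, schema))  # []
print(validate({"age": -1, "income": "high"}, schema))
```

A real validation library adds many more rule types (allowed values, regexes, nested schemas), but the calling pattern is the same.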

Create a model validation pipeline

From a statistical perspective, model validation ensures that the model’s outputs are as expected by the product design.

We can also say that machine learning model validation is checking a trained model on the test data. Model validation is a basic check for the generalization capability of a given trained model. It is done by splitting the data into training and test parts.
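The basic check can be sketched as follows, using a deterministic toy dataset and a stand-in model; a real pipeline would compare the resulting metric against a promotion threshold:

```python
# Deterministic toy dataset: x in [0, 1), true label = 1 when x > 0.5.
data = [((i * 37) % 100) / 100 for i in range(100)]
labels = [1 if x > 0.5 else 0 for x in data]

# Split into training and test parts (80/20 holdout).
split = int(0.8 * len(data))
test_x, test_y = data[split:], labels[split:]


def model(x):
    # Stand-in "trained" model with a slightly miscalibrated threshold,
    # so the holdout accuracy is high but not perfect.
    return 1 if x > 0.4 else 0


correct = sum(model(x) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
print(f"holdout accuracy: {accuracy:.2f}")  # holdout accuracy: 0.90
```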

Model serving provides a solution to host machine learning models on a production server to be exposed to the outside world. Nowadays, there are many different solutions to serve machine learning models in production. For example, we can implement a REST endpoint that is updated automatically as part of MLOps.

Automated deployment, enabled by MLOps systems, can deploy a trained machine learning model acting as a prediction service. The models can be automatically trained and deployed as part of the pipeline. This helps in model serving.

Ensure models are reproducible

In machine learning, reproducibility indicates that when an algorithm is run repeatedly using a given dataset, the output produced is the same.

Reproducibility is of enormous help when we are moving projects from development to production. Also, the debugging and fixing of errors can be achieved quickly with the help of reproducibility. Reproducibility not only helps in maintaining data consistency but also helps in making sure that the output produced is correct.

From the client’s perspective, reproducibility helps ensure the credibility and trust of the company developing machine learning projects. The MLOps system is a great place to implement the reproducibility of models being trained. A properly designed and reproducible machine learning product generates confidence in stakeholders as well as engineering teams. Reproducibility can be achieved using timestamps to pinpoint training data or taking a snapshot of training data and saving it.
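Both techniques can be sketched together: fix the seed, fingerprint the data snapshot, and the run becomes repeatable. The training function here is a trivial stand-in, not a real trainer:

```python
import hashlib
import json
import random


def train(seed, data):
    """Toy training run: with the seed fixed, the result is reproducible."""
    rng = random.Random(seed)
    rng.shuffle(data)
    # Stand-in for fitted parameters.
    return sum(data[:5])


snapshot = list(range(100))

# Record everything needed to reproduce the run: the seed plus a
# fingerprint of the exact training data snapshot.
run_config = {
    "seed": 42,
    "data_sha256": hashlib.sha256(json.dumps(snapshot).encode()).hexdigest(),
}

first = train(run_config["seed"], list(snapshot))
second = train(run_config["seed"], list(snapshot))
print(first == second)  # True
```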

Implementing scalability

Scalability in the realm of machine learning means handling large amounts of data and performing a lot of computations. All this should be done in such a way that it saves time and the system remains cost-effective.

Scalability is crucial for machine learning for a few reasons:

  • Models are generally huge and don’t fit into the memory of a single training machine.
  • Buying one big machine is expensive compared to having a few small machines with the same combined processing power.

Implementing scalability has many benefits, such as increased productivity, portability, and reduced cost. Scalability also reduces human involvement in larger machine learning projects.

Continuous integration

Continuous integration is the practice of automatically integrating code changes from various contributors into a single project. It helps developers merge their code changes frequently into a shared repository, after which the code is built and tests are run. This ensures that code from different developers merges seamlessly and that any error is reported back to them.

Continuous integration in machine learning is a primary MLOps best practice that ensures that the machine learning pipeline is rerun whenever there is an update in code or data of the project.

It is worth noting that continuous integration helps validate not just code but also data and models. If the data has changed, the data validation code will be re-run to ensure that only valid data is fed to the model, which in turn ensures that the model is still valid and producing the required results.

Continuous testing

In a software delivery pipeline, continuous testing involves automatically executing test cases. It is implemented to get feedback on the proper functioning of the software version as early as possible.

Continuous testing in machine learning operations is a type of machine learning model testing. It ensures the model testing at every stage of the model development, training, and validation life cycle. The primary aim of continuous testing is to measure the quality of the machine learning model at every step of the pipeline by testing frequently and early.

In the MLOps context, continuous testing is a relatively new step, since it also covers testing the models that are automatically retrained and served.

Continuous deployment

Continuous deployment is the process of automatic software deployment. It uses automated execution of test cases to ensure that changes to a code repository are correct. It also checks if the code is stable for immediate deployment to a production environment.

Continuous deployment in the MLOps pipeline enables machine learning model and data changes of all kinds to be deployed to production environments in a sustainable, quick, and safe way. Continuous deployment aims at making deployments more predictable and enables developers to perform deployments as and when required. This helps the engineering team to innovate and helps in making the processes faster.

It is worth noting that continuous deployment of a machine learning pipeline is not only about automatic deployment of services; it also takes care of rolling back the code, if needed.

Keep ensembles simple

In an MLOps system, we deal with several kinds of machine learning models. The first is the single, non-ensemble model, which is easier to debug and test. In many cases, however, we need to deal with ensemble models.

An ensemble model is a machine learning model that combines the outcomes of more than one machine learning model.

In this case, it becomes very important to keep things simple. Combining machine learning models that have been trained separately may lead to unexpected results, so we should use simple models for ensembles, taking only the outputs of the base models as inputs. We may also need to enforce some rules in the MLOps system. For example, an increase in the predicted probability of a participating classifier model should not decrease the predicted probability of the ensemble. This ensures that changes in the participating models do not confuse the ensemble model.
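A simple averaging ensemble satisfies that rule by construction, since the average is monotonic in each of its inputs:

```python
def ensemble_predict(base_probabilities):
    """Averaging ensemble over the base models' predicted probabilities.

    Averaging is monotonic: raising any base model's probability can only
    raise (never lower) the ensemble's prediction.
    """
    return sum(base_probabilities) / len(base_probabilities)


p1 = ensemble_predict([0.6, 0.7, 0.5])
p2 = ensemble_predict([0.8, 0.7, 0.5])  # one base model more confident
print(p2 > p1)  # True: the ensemble moved in the same direction
```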

Track all experiments

A sound MLOps system continuously tracks the data related to model training across multiple runs. This is called experiment tracking. For example, experiment tracking collects and stores information like data splits, model size, hyperparameters, parameters, and so on.

Given the experimental nature of machine learning systems, the benchmarking of different models is required. This is implemented using experiment tracking.
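A minimal sketch of experiment tracking: log each run as a JSON line, then query the history for the best run. The file layout, field names, and metric values here are all illustrative:

```python
import json
import os
import tempfile
import time


def log_experiment(path, run_id, params, metrics):
    """Append one run's record as a JSON line, building a queryable history."""
    record = {"run_id": run_id, "time": time.time(),
              "params": params, "metrics": metrics}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


log_path = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
log_experiment(log_path, "run-001",
               {"learning_rate": 0.01, "model_size_mb": 48},
               {"val_accuracy": 0.91})
log_experiment(log_path, "run-002",
               {"learning_rate": 0.001, "model_size_mb": 48},
               {"val_accuracy": 0.93})

# Benchmarking different runs: pick the one with the best metric.
with open(log_path) as f:
    runs = [json.loads(line) for line in f]
best = max(runs, key=lambda r: r["metrics"]["val_accuracy"])
print(best["run_id"])  # run-002
```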

Use a cloud-based machine learning platform

As part of the machine learning engineering team, you may have wondered many times about tools and platforms available for bringing the models into production.

If a company has invested in computational resources such as graphics processing units, it can run all training and experimentation on-premise. However, even if you own all the required resources, they may sit idle for much of the time, since you are not training models every day.

It is, therefore, often more cost-effective to use cloud providers to run the entire pipeline. The cloud is designed to be flexible in its use of compute resources. The major cloud providers are AWS, GCP, and Azure.

Set up coding standards

When working on a machine learning project, we would want our code to be readable and maintainable. If we follow the coding standards related to our technology or programming languages, the code written by different ML engineers will be uniform, maintainable, and readable.

In the long run, coding standards also help with code reuse and faster error detection. Coding standards not only improve programming practices but also increase a programmer’s efficiency.

Some examples of the coding standards are:

  • Writing of documentation and proper comments. It is good practice to write documentation and comments so that other developers can understand it clearly in less time.
  • Proper code indentation. Using proper indentation increases code readability, which helps analyze code.
  • Using helper methods. Helper methods make the code more object-oriented. This makes code more modular and ultimately efficient.
  • Avoiding hard-coded values. This makes the code more flexible and generic.

Implement continuous monitoring

The owner of an ML project is interested in creating a good machine learning model and maintaining its performance for an extended period. Therefore, a model monitoring sub-system is required in MLOps systems.

It is known that production machine learning models may degrade over time due to evolving data profiles. Production machine learning models can show a decrease in performance more often than conventional software systems, so we need to plan for this during MLOps system design.
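As a sketch, one simple monitoring check compares the statistical profile of live inputs against the training data and raises an alert when they diverge. The two-standard-deviation threshold and the sample values are arbitrary choices for illustration:

```python
import statistics


def drift_alert(train_values, live_values, threshold=2.0):
    """Flag drift when the live mean strays too far from the training mean,
    measured in training standard deviations."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    shift = abs(statistics.mean(live_values) - mu) / sigma
    return shift > threshold


train_ages = [34, 29, 41, 38, 30, 35, 33, 39]
print(drift_alert(train_ages, [33, 36, 31, 40]))  # False: similar profile
print(drift_alert(train_ages, [61, 58, 64, 66]))  # True: profile has shifted
```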

Set up a pipeline for updating and retraining the model

A well-performing machine learning model in production is not the end of the story. As we know, machine learning models need high data quality and quantity for training.

Also, keep in mind that training the machine learning model is not a single, one-time process. Many times, just to ensure the model’s predictive performance over a long period, we need to keep re-training the model frequently. We may require a good chunk of new training data for re-training. Therefore, in our MLOps pipeline, we need to consider model retraining as well.
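In the simplest form, a retraining trigger can compare live accuracy against the baseline measured at deployment time, and schedule a retraining run when the drop exceeds a tolerance (the numbers below are illustrative):

```python
def needs_retraining(recent_accuracy, baseline_accuracy, tolerance=0.05):
    """Trigger retraining when live accuracy falls below baseline - tolerance."""
    return recent_accuracy < baseline_accuracy - tolerance


baseline = 0.92  # accuracy measured at deployment time

print(needs_retraining(0.90, baseline))  # False: within tolerance
print(needs_retraining(0.84, baseline))  # True: schedule a retraining run
```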

Plan to reduce the cost of MLOps Implementation

Cost is a critical factor when measuring the success of a project. One of the primary goals of MLOps is to reduce the cost of implementing machine learning projects in the long run.

Machine learning operations help stakeholders and businesses deploy solutions that reduce cost, increase profit, and save time. This can be achieved by creating better workflows during the MLOps design process.

The best practices for MLOps are not limited to the points discussed above; there are many more. However, I have covered the most essential and widely adopted ones.

Final thoughts

As discussed, MLOps is responsible for establishing and maintaining communication between all the teams involved in, and accountable for, developing and maintaining machine learning projects and their corresponding models.

The key advantage provided by MLOps is an efficient pipeline for continuous training and validation of machine learning models. Machine learning operations do not end with model deployment: we need to fine-tune the system, especially the model, periodically. This ensures that the model keeps working efficiently with newly added data.

MLOps provides a system for changing, and tracking at all times, the data involved, the code committed, and the models deployed. In short, the pipeline required to go from raw data to a finished model is something MLOps automates.

The advantage of MLOps is that it establishes a system for developing efficient models. Undoubtedly, MLOps brings excellent value to an organization and puts it on par with other organizations in the same industry.

Resources

Rules of Machine Learning: Best Practices for ML Engineering

Engineering best practices for Machine Learning

MLOps: Continuous delivery and automation pipelines in machine learning

Challenges in Deploying Machine Learning
