Challenges in Deploying Machine Learning
Machine learning has evolved from a purely academic field of study into one adopted by an ever-increasing number of applied fields. A 2021 Forbes article stated that 50% of surveyed enterprises planned to spend more on AI and ML that year. However, the survey also found that in 38% of these enterprises, data scientists spend more than 50% of their time on model deployment. This article examines the challenges in deploying machine learning models that lead to such a large proportion of time being spent on deployment.
As the adoption of machine learning models in industry has increased, so has the amount of literature documenting specific project deployments, challenges in various industries, and the ways those challenges were overcome. We will dive into some of this literature to extract the most common challenges faced at each deployment stage.
Although many definitions of the machine learning workflow exist, in an industry setting they generally come down to the four stages defined by Ashmore et al. in their paper on Assuring the Machine Learning Lifecycle, shown in Figure 1. Namely, these are:
- data management (preparing the data needed for the ML model),
- model learning,
- model verification and
- model deployment.
Figure 1: Machine Learning Workflow. Source.
This stage of the workflow focuses on data collection and processing, which can affect the results as much as the choice of model.
Challenges arise in various aspects of data management covered below:
Often the first part of any machine learning initiative is finding and obtaining a high-quality dataset. This data discovery and collection process includes understanding what data is available and determining how it can be collected, organized, and stored. Depending on the environment and topic, this can cause significant challenges when there is little relevant data. Conversely, difficulties also arise when large volumes of data exist.
Determining what data is available can be particularly challenging when dealing with sensitive information, for instance, in the healthcare industry, as discussed by Kruse et al.
Another potential issue, highlighted by Paleyes, Urma, and Lawrence in their survey of case studies, is that locating data sources and understanding their structure can prove to be a huge task, potentially preventing deployment of the application. They, in turn, reference the Lin and Ryaboy paper on scaling infrastructure at Twitter, where this situation can occur because data from one user is processed by multiple services, each responsible for a single operation, that call each other. Although it makes scaling easier, this architecture can make it very difficult to track how the data is stored.
Even when sufficient data is available, its quality can halt deployment. The collected data must be of high quality and representative, so that the resulting model is not biased with respect to gender, age, income group, or race.
Assuming adequate data is available and can be collected, the next hurdle is data integration and storage. During data integration, data from several sources is collated in a single centralized location. The differences in data formats, storage locations, and collection can lead to challenges in collating the datasets. There may also be delays in receiving live data, as discussed in a Microsoft survey of data scientists.
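The collation step described above can be sketched as follows. This is a minimal illustration, assuming two hypothetical sources: one that exports CSV with a `user_id` column and one that exports JSON lines with a `uid` field. The field names, sample values and helper functions are assumptions made for this example and are not taken from any of the cited systems.

```python
import csv
import io
import json

# Hypothetical exports from two services describing the same users.
CSV_EXPORT = "user_id,signup_date\n1,2021-03-01\n2,2021-03-05\n"
JSONL_EXPORT = '{"uid": "2", "country": "DE"}\n{"uid": "3", "country": "FR"}\n'


def read_csv_source(text):
    """Yield records from the CSV source in a common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"user_id": int(row["user_id"]), "signup_date": row["signup_date"]}


def read_jsonl_source(text):
    """Yield records from the JSON-lines source in a common schema."""
    for line in text.splitlines():
        if line.strip():
            raw = json.loads(line)
            yield {"user_id": int(raw["uid"]), "country": raw["country"]}


def integrate(*sources):
    """Merge records from all sources into one dict keyed by user_id."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record["user_id"], {}).update(record)
    return merged


combined = integrate(read_csv_source(CSV_EXPORT), read_jsonl_source(JSONL_EXPORT))
for user_id in sorted(combined):
    print(combined[user_id])
```

Even in this toy case, the two sources disagree on the key name and key type, and some users appear in only one source, so the merged records end up with different sets of fields. Real integrations face the same issues at a much larger scale.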
Machine learning models also typically require large volumes of data that can reach terabytes in size. Finding appropriate storage solutions for this collated data can be expensive and pose further challenges in compliance (where data can only be stored in certain territories). Moving the data can also be a very time-consuming process, with the migration process requiring a high level of expertise.
Most data will be subject to various laws and regulations relating to how it must be collected, stored and who can access the data, for instance, GDPR. This can cause further challenges when sourcing and integrating the data, particularly when collected in various territories with differing laws.
Highly sensitive data may be subject to additional restrictions on third-party access, consequently requiring the use of synthetic data generation methods such as those described by Surendra and Mohan.
Collected data will need to be preprocessed, including cleaning and labeling the data. Difficulties in labeling the data can arise in many scenarios, including:
- limited access to experts,
- difficult-to-label data, and
- high volumes of data.
Various scenarios require highly specialized experts to label the obtained data, such as labeling signs of disease in medical images. Such experts can be difficult and costly to find, which makes labeling vast volumes of data an almost impossible task. Further challenges arise from inconsistencies in labeled data.
As discussed by Sheng et al., in some cases, non-experts can label data through outsourcing services such as Freelancer.com and Amazon’s Mechanical Turk. However, the resulting labels suffer from inconsistencies that are not always corrected through repeat labeling. Sheng et al. produce a mathematical analysis of experimental results on repeat labeling, including a metric for the difficulty of labeling the information. Figure 2 below, taken from the paper, shows a graph of maximum difficulty against individual labeling quality. Relabeling improves the quality only when the difficulty is below the derived curve.
Figure 2: Graph of maximum difficulty against individual labeling quality. The curve represents the boundary below which relabeling improved the quality, derived by Sheng et al. Source.
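To give some intuition for why repeat labeling does not always help, the sketch below uses a deliberately simplified model: binary labels, independent labelers, and a single individual labeling quality `p` shared by all labelers, combined by majority vote. This is an assumption made for illustration and is not the difficulty metric derived in the paper.

```python
from math import comb


def majority_vote_quality(p, n):
    """Probability that the majority of n independent labels is correct,
    when each label is correct with probability p (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of labels to avoid ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Quality of the majority label for 1, 3, 5 and 11 labels per example.
for p in (0.4, 0.5, 0.7):
    print(p, [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)])
```

Under these assumptions, extra labels raise the quality of the combined label only when individual labelers are right more often than not. When `p` is below 0.5, adding labels makes the majority label worse.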
Data profiling is the process of verifying the quality of the data that was collected. This is a crucial step in the data quality lifecycle. It allows one to produce appropriate rules for attributes or pairs of attributes, reducing incorrect data entering the database as highlighted by an Experian data profiling article. That said, the process of determining appropriate rules and alerts can be time-consuming and costly, requiring the use of various statistical methods.
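As a rough sketch of what such rules can look like, the example below checks a handful of records against rules on single attributes and on a pair of attributes. The record fields, the rule names and the thresholds are all hypothetical; in practice, appropriate rules would be derived from the data itself, which is where the cost comes from.

```python
from datetime import date

# Hypothetical rules: each maps a name to a predicate over one record.
RULES = {
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
    "email_has_at": lambda r: "@" in r["email"],
    # A rule over a pair of attributes.
    "checkout_after_checkin": lambda r: r["checkin"] <= r["checkout"],
}


def profile(records, rules=RULES):
    """Return, for each rule, the indices of the records that violate it."""
    violations = {name: [] for name in rules}
    for index, record in enumerate(records):
        for name, predicate in rules.items():
            if not predicate(record):
                violations[name].append(index)
    return violations


records = [
    {"age": 34, "email": "a@example.com",
     "checkin": date(2021, 5, 1), "checkout": date(2021, 5, 3)},
    {"age": 150, "email": "b@example.com",
     "checkin": date(2021, 5, 2), "checkout": date(2021, 5, 4)},
    {"age": 28, "email": "c.example.com",
     "checkin": date(2021, 5, 9), "checkout": date(2021, 5, 7)},
]
report = profile(records)
print(report)
```

Keeping the rules as data rather than hard-coding them makes it easier to review, extend and tune them as the understanding of the dataset improves.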
The model learning stage of the workflow is where a large number of academic efforts are focused. Consequently, many sophisticated models and techniques are developed in academia, providing a wealth of choices for implementation in industry. This variety of models to choose from leads to further challenges in deployment, as discussed in this section.
The sophisticated state-of-the-art models developed in academia can be tempting to adopt in industry. However, these models can be cumbersome and incredibly time-consuming to deploy, as explained in the case of Airbnb by Adrian Colyer. Airbnb abandoned their more complex initial model in favor of a much simpler single hidden layer neural network after several failed deployment attempts.
Although the more advanced models published in literature can offer great results in terms of evaluation metrics in academic settings, these can be more difficult to translate to live industry data. These complex models can be difficult to scale, with error detection becoming challenging and requiring expensive infrastructure that is often not feasible.
One example of this is in Advanced Driver Assistance Systems (ADAS), as discussed by Borrego-Carazo et al. in their review of Resource-Constrained Machine Learning for ADAS. This family of methods predominantly focuses on specific problems such as pedestrian detection and traffic light recognition. The computational resources and power are often constrained, making more complex and computationally expensive models inappropriate. Borrego-Carazo et al. discuss different ways in which these systems adapt to comply with the memory and real-time requirements, favoring simpler models such as support vector machines (SVM).
Model interpretability goes hand in hand with model complexity. In industries where an output of the model must be understandable in business terms, interpretability can outweigh the model’s performance.
Model interpretability is discussed by Paleyes et al. in reference to the use of decision trees in banking when using machine learning to predict customer churn. In this case, understanding the relationship between the input parameters and the predicted customer churn is the ultimate goal, as it can be used to address the loss of customers. This, however, comes at a cost, as a simpler and more interpretable model can be harder to develop and tune to produce sufficiently high prediction accuracy.
Resource constraints come in at all parts of the machine learning workflow and present serious challenges in deployment. As discussed previously, the complexity of the model must comply with resource-constrained environments’ requirements. However, even with great computational resource budgets and capabilities, further challenges come in.
Training models can take a vast amount of resources and time. This is particularly true in the context of NLP, where Or Sharir, Barak Peleg, and Yoav Shoham estimate that training a complex NLP model of 1.5 billion parameters costs around $80,000 per run, reaching $1.6m after hyperparameter tuning. A comparatively more modest model of 340 million parameters would still cost $10,000 per run and $200,000 after tuning. Such enormous costs are often accompanied by long training times, so the training procedure needs to support troubleshooting through, for example, logs and alerts.
Code quality can present a challenge when attempting to deploy machine learning models. These models require experimentation and consequently are often developed using tools such as Jupyter Notebooks. The experimental nature can make version control more difficult and result in poorer-quality documentation, so the code requires further development and modularization before it is ready for deployment.
To be appropriate for machine learning model deployment, the infrastructure must have a certain level of elasticity, because resource requirements are higher during periods when the model is retrained. These periods need to be accounted for, for instance, through cloud computing.
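The arithmetic behind these figures is worth making explicit: in both cases the cost after tuning is a fixed multiple of the cost of a single run. The short sketch below computes that multiple from the numbers quoted above. Reading the multiple as a count of equivalent full-price training runs is an interpretation made here for illustration.

```python
# Per-run and post-tuning cost figures as quoted in the text (USD).
quoted = {
    "1.5B parameters": {"per_run": 80_000, "after_tuning": 1_600_000},
    "340M parameters": {"per_run": 10_000, "after_tuning": 200_000},
}


def implied_runs(per_run, after_tuning):
    """Number of full-price runs that the quoted total would correspond to."""
    return after_tuning / per_run


for model, cost in quoted.items():
    print(model, implied_runs(cost["per_run"], cost["after_tuning"]))
```

The same calculation can be turned around for budgeting: multiplying an estimated per-run cost by the number of tuning runs planned gives a first approximation of the total training budget.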
As defined by Ashmore et al., the model verification stage concerns:
- formal verification,
- test-based verification, and
- requirement encoding.
Challenges arise in determining appropriate metrics, adhering to regulatory standards, and generalisability of the model, as discussed below.
Standard machine learning metrics, be it classification accuracy or mean absolute error (MAE), are defined in an academic setting to measure the model’s fit to test and train data. Although they provide a reasonable mathematical basis for estimating the model performance, these don’t always correlate to the model’s effectiveness in the business setting for which it is being deployed.
A chosen model may have a low MAE but fail to drive business value. One example of this is presented by Colyer in their case study of Booking.com. The case study shows that estimating the value of the model through randomized trials and observing the effect on the business, as opposed to through standard machine learning metrics, proved to be most effective.
They also reason that increased model performance may, in any case, not translate to real business gain due to factors such as business saturation and over-optimizing on proxy metrics. Choosing a metric that does not appropriately measure the business gains can make highly accurate machine learning models futile for the company, halting deployment.
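A randomized trial of this kind can be evaluated with a standard two-proportion z-test, sketched below using only the Python standard library. The conversion counts are hypothetical, and the choice of a z-test is an assumption made for this example; the case study does not state which statistical procedure was used.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (rate_b - rate_a, z statistic, p-value).
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return rate_b - rate_a, z, p_value


# Hypothetical counts: group A keeps the current model, group B gets the new one.
lift, z, p_value = two_proportion_z_test(conv_a=1000, n_a=20000,
                                         conv_b=1100, n_b=20000)
print(f"lift={lift:.4f} z={z:.2f} p={p_value:.4f}")
```

The quantity being tested here is a business outcome (conversions), not a model metric such as MAE, which is the distinction the case study draws.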
You must check models to ensure that they adhere to industry-specific regulatory frameworks before deployment. For instance, this is particularly prominent in banking following the European Central Bank guidance for review of models.
In the Federal Deposit Insurance Corporation’s supervisory guidance of model risk management, the requirements outline that quantitative models should be evaluated on conceptual soundness, including but not limited to factors relating to the model design and construction. The methods and variables used must be supported by empirical evidence and documented.
Likewise, testing must be conducted to understand the model’s shortfalls and the assumptions made. The development material must undergo a careful review before deployment and a comparison to alternative theories.
The complete guidance also addresses all aspects of model deployment, from third-party vendor validation processes to the required involvement of the board of directors. Such extensive guidance is present in many industries, requiring many further calculations to prove the soundness of the model before it is deployed and requiring all aspects of compliance to be thoroughly documented. The associated costs need to be considered from the outset so that sufficient resources are allocated.
A common factor present in the regulatory frameworks is the requirement to have extensive test-based verification of the generalisability of the model to previously unseen data. In an academic setting, this test-based verification would typically be conducted using the validation split of the data. The challenge in deployment comes from the need to perform this validation on the business data in the business setting using business-specific evaluation metrics such as customer bookings in Booking.com.
Such testing can be difficult at full scale and is consequently replaced with either simulated environments or testing on a subsection of the real-world environment (for instance, with Booking.com, as discussed previously). While subsections of the real-world environment have the drawback of not being representative of the whole system, simulations hinge on the assumptions made about the real world, which in turn need to be verified. The verification also extends to continuously testing the quality of the incoming data to ensure that errors don’t propagate throughout the pipeline.
In this last stage, after the model attains the required accuracy in the business setting, it is deployed to make predictions against real and live data (be that in batches or live as the data comes in). At this point in the deployment, challenges can arise from various factors, including:
- model degradation,
- feedback loops,
- propagating errors, and
- ultimate integration into the existing workflows.
In software engineering deployment, one aspect that facilitates easier maintenance is having reusable code, which means only one piece of code needs to be maintained despite being used in multiple workflows. Unfortunately, despite being a common practice in software engineering, it is yet to translate fully to machine learning deployment, which adds to the complexity of model maintenance.
Another critical aspect of deployment that can prove challenging is the continuous monitoring of the model and input data. You can use standard software engineering practices to design baseline checks, such as checking for missing fields or format changes. However, this is insufficient, and more sophisticated checks using tools such as time series analysis are typically required.
As discussed previously, formulating appropriate alerts to monitor the data quality can be challenging due to the experimental nature of this process, particularly when exploring relationships between two or more variables. These alerts must also be actionable and be tuned to have appropriate sensitivity to detect true deviations early enough to avoid the propagation of errors while also allowing for natural variation.
A further challenge is faced when determining how to act on these alerts. The potential considerations are discussed by Polyzotis et al. in their presentation on Data Management Challenges in Production Machine Learning, including the best course of action when multiple alerts are raised simultaneously. The procedures for conducting repairs on multiple related alerts are still an open research question with various proposed approaches, but the focus is predominantly given to alerts for data that are being used actively. Determining which alerts to act on can be a complex procedure in its own right, partly due to the difficulty in estimating improvement in the model without implementing the correction. This is a further open research question. Automated fixes could theoretically resolve some of these concerns but are very challenging to implement.
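The trade-off between sensitivity and natural variation can be illustrated with a simple rolling z-score alert on a monitored statistic. The function name, the window size, the thresholds and the sample series below are all assumptions chosen for this sketch; production monitoring would typically use more sophisticated time series methods.

```python
from statistics import mean, stdev


def deviation_alerts(values, window=7, threshold=3.0):
    """Flag indices whose value deviates from the mean of the preceding
    `window` values by more than `threshold` standard deviations."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            alerts.append(i)
    return alerts


# Hypothetical daily mean of one input feature; index 10 is an abrupt shift.
daily_mean = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.2, 14.0, 10.1]

strict = deviation_alerts(daily_mean, threshold=3.0)
sensitive = deviation_alerts(daily_mean, threshold=1.0)
print(strict, sensitive)
```

With the stricter threshold only the abrupt shift is flagged, while the more sensitive setting also raises an alert on a day that is within ordinary fluctuation. Tuning the threshold is therefore a choice between missing real deviations and flooding the team with alerts that are not actionable.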
The investigation required to set up and carry out appropriate monitoring is very complex. It can be a barrier to model deployment as, without sufficient monitoring, the model can degrade and become unusable.
Apart from ensuring the quality of the data, the state and appropriateness of the model must also be monitored continuously. This is to ensure the model continues to reflect the trends seen in the deployment environment and to account for data drift, that is, changes in the distribution of the data over time. This can be achieved through retraining, leading to the requirements for dynamic resource allocation discussed previously.
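One common way to quantify data drift for a single numeric feature is the Population Stability Index (PSI), which compares the binned distribution of a reference sample with that of a live sample. The sketch below is a minimal standard-library version; the equal-width binning, the number of bins and the sample data are assumptions made for illustration, and PSI is only one of several possible drift measures.

```python
from math import log


def population_stability_index(expected, actual, bins=5, eps=1e-6):
    """PSI between a reference sample and a live sample of one numeric feature.

    Bin edges are equal-width over the range of the reference sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Values outside the reference range are clamped into the edge bins.
            index = int((x - lo) / width) if width > 0 else 0
            counts[min(max(index, 0), bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(sample), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))


reference = [x / 10 for x in range(100)]       # training-time values, 0.0 to 9.9
similar = [x / 10 + 0.05 for x in range(100)]  # live data, nearly the same distribution
shifted = [x / 10 + 4.0 for x in range(100)]   # live data, shifted upwards

psi_similar = population_stability_index(reference, similar)
psi_shifted = population_stability_index(reference, shifted)
print(psi_similar, psi_shifted)
```

A frequently quoted rule of thumb treats values below roughly 0.1 as stable and values above roughly 0.25 as a significant shift, but such cut-offs are conventions rather than guarantees and should be validated against the behavior of the model in question before being used to trigger retraining.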
Last but not least, there are factors external to the workflow that can be crucial to the successful deployment. You must also consider these thoroughly.
In many circumstances, the developed model is met with hesitation from the intended end-users. This may be partly due to the end-users’ need to know how the model makes decisions. Furthermore, it could also be because they are not convinced of the utility of the model. This is particularly evident in medicine.
The medical-related challenges are discussed by Mateen et al. The authors raise concerns about the lack of appreciation of context and of where such tools would fit into the physician’s workflow. Additionally, the authors highlight that insufficient case studies accompany the models to demonstrate the effect on patient care, reiterating the importance of using business-specific evaluation metrics discussed previously.
Researchers, organizations, and end-users are becoming increasingly aware of the importance of ensuring models are not biased towards members of various classes. Ingold and Soper from Bloomberg present the case of Amazon Prime delivery. The model used resulted in predominantly black neighborhoods being significantly more likely not to be eligible for same-day delivery. The case shows the importance of diligently evaluating whether the implemented model is biased.
The above discussion highlights that challenges are faced at all stages of model deployment, from data management to ethical considerations. The challenges faced have varying degrees of solutions available, with many areas such as data monitoring still requiring a great deal of academic research. That said, numerous tools are available to make deploying machine learning models to production easier.
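A first, very coarse check for this kind of bias is to compare how often the model produces a favorable outcome for each group. The sketch below computes per-group selection rates and their ratio. The group labels and decisions are made-up illustrative data, and this single measure is an assumption chosen for simplicity; it does not capture every notion of fairness.

```python
from collections import defaultdict


def selection_rates(groups, decisions):
    """Share of positive decisions (1) per group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, decision in zip(groups, decisions):
        totals[group] += 1
        positives[group] += decision
    return {g: positives[g] / totals[g] for g in totals}


def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest; 1.0 means parity."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group receives positive decisions, so rates are equal
    return min(rates.values()) / highest


# Hypothetical eligibility decisions (1 = eligible) for two neighborhood groups.
groups = ["A"] * 10 + ["B"] * 10
decisions = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0] + [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

rates = selection_rates(groups, decisions)
ratio = disparate_impact_ratio(rates)
print(rates, ratio)
```

A ratio well below 1.0 indicates that one group receives the favorable outcome far less often than another and warrants investigation, even when the protected attribute is not an input to the model. How low a ratio is acceptable is a policy decision outside the scope of this sketch.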
- Bilal A. Mateen, James Liley, Alastair K. Denniston, Chris C. Holmes & Sebastian J. Vollmer, Improving the quality of machine learning in health applications and clinical research, https://www.nature.com/articles/s42256-020-00239-1
- Andrei Paleyes, Raoul-Gabriel Urma and Neil D. Lawrence, Challenges in Deploying Machine Learning: a Survey of Case Studies, https://eprints.whiterose.ac.uk/172330/1/2011.09926v2.pdf
- Jimmy Lin and Dmitriy Ryaboy, Scaling big data mining infrastructure: the Twitter experience, ACM SIGKDD Explorations Newsletter, 14(2):6–19, 2013.
- Clemens Scott Kruse, Rishi Goswamy, Yesha Raval and Sarah Marawi, Challenges and Opportunities of Big Data in Health Care: A Systematic Review, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5138448/
- A Review Of Synthetic Data Generation Methods For Privacy Preserving Data Publishing Surendra .H, Dr. Mohan .H .S https://www.ijstr.org/final-print/mar2017/A-Review-Of-Synthetic-Data-Generation-Methods-For-Privacy-Preserving-Data-Publishing.pdf
- Example Labeling Difficulty within Repeated Labeling, Victor S. Sheng https://www.researchgate.net/publication/267232570_Example_Labeling_Difficulty_within_Repeated_Labeling
- Data Scientists in Software Teams: State of the Art and Challenges, Miryung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/kim-tse-2017.pdf
- Resource-Constrained Machine Learning for ADAS: a Systematic Review, Juan Borrego-Carazo, David Castells-Rufas, Ernesto Biempica and Jordi Carrabina https://www.researchgate.net/publication/339545543_Resource-Constrained_Machine_Learning_for_ADAS_a_Systematic_Review
- The cost of training NLP models, Or Sharir, Barak Peleg and Yoav Shoham, https://arxiv.org/pdf/2004.08900.pdf
- Data Management Challenges in Production Machine Learning, Neoklis Polyzotis, Sudip Roy, Steven Whang, Martin Zinkevich, https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45a9dcf23dbdfa24dbced358f825636c58518afa.pdf
- Federal Deposit Insurance Corporation supervisory guidance of model risk management https://www.fdic.gov/news/financial-institution-letters/2017/fil17022a.pdf
- David Ingold and Spencer Soper, Amazon Doesn’t Consider the Race of Its Customers. Should It?, https://www.bloomberg.com/graphics/2016-amazon-same-day/