Once you have one or more models, created either by retraining the old model or by training a new model from scratch, it is extremely important to test these candidates in the production environment. Only after testing in production is the best model selected and kept in production. In this chapter of the MLOps Tutorial, you will learn how to test Machine Learning models in production.
Note: This testing in production is performed only for promising models, i.e., those that performed well during the evaluation done while training/retraining the model.
There are two methods of testing models in the production environment:
- Shadow Testing
- A/B Testing
Shadow Testing
Shadow Testing, as the name implies, is a testing process in which the models under test operate in the shadow of the currently deployed model. In Shadow Testing, multiple candidate models are deployed to the production environment alongside the model that is already serving traffic. The model currently in production is called the Champion model, while the model(s) being tested are called the Challenger models.
All the input data fed to the deployed model for inference is also passed to the models being tested. Although only the deployed model's prediction is served, the predictions of all models (including the currently deployed one) are stored. These stored predictions are then used to compare the accuracy of the Challenger models against each other and against the Champion model. If a Challenger model outperforms the Champion and the improvement is proved to be statistically significant rather than due to mere chance, that Challenger is deployed to production and becomes the new Champion model.
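To make the flow concrete, here is a minimal sketch of a shadow router in Python. The `champion` and `challengers` objects, their `predict()` method, and the logging format are illustrative assumptions, not part of any specific serving framework:

```python
import logging

logger = logging.getLogger("shadow_testing")

class ShadowRouter:
    """Serves the Champion's prediction while logging every Challenger's output."""

    def __init__(self, champion, challengers):
        self.champion = champion        # model currently deployed (Champion)
        self.challengers = challengers  # dict mapping name -> candidate model

    def predict(self, features):
        # Only the Champion's prediction is returned to the caller.
        champion_pred = self.champion.predict(features)
        logger.info("champion prediction=%s", champion_pred)

        # Each Challenger scores the same input "in the shadow".
        # A Challenger failure must never affect the live response.
        for name, model in self.challengers.items():
            try:
                logger.info("challenger=%s prediction=%s",
                            name, model.predict(features))
            except Exception:
                logger.exception("challenger %s failed", name)

        return champion_pred
```

In a real system the logged predictions would be joined with ground-truth labels once they become available, which is what enables the accuracy comparison described above.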
Advantages
- The same input data is used to compare the performance of the Champion model and the Challenger models, so there is no room for sampling bias.
- It is easier to implement than A/B Testing.
- It is safer, since the Challengers' predictions are not yet served to users.
Disadvantages
- Multiple models run inference on the same data, which increases the consumption of resources such as memory, processing power, and storage, and with it the cost. Hence, Shadow Testing is more expensive.
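Because Shadow Testing scores the Champion and every Challenger on the same inputs, the resulting predictions are paired, and a paired test such as McNemar's exact test is a natural way to check whether an observed accuracy difference is statistically significant. Below is a minimal sketch; the helper name and the sample data are made up for illustration:

```python
from scipy.stats import binomtest

def mcnemar_pvalue(y_true, champion_preds, challenger_preds):
    """Exact McNemar's test on predictions made for the *same* inputs.

    Counts the discordant cases: inputs one model got right and the other
    got wrong. Under the null hypothesis (no real difference), each
    discordant case is equally likely to favour either model.
    """
    champ_only = sum(1 for y, a, b in zip(y_true, champion_preds, challenger_preds)
                     if a == y and b != y)
    chall_only = sum(1 for y, a, b in zip(y_true, champion_preds, challenger_preds)
                     if b == y and a != y)
    n = champ_only + chall_only
    if n == 0:
        return 1.0  # the models never disagree on correctness
    return binomtest(chall_only, n, p=0.5).pvalue

# Illustrative labels and predictions (made up for this example):
y_true     = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
champion   = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]
challenger = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
p = mcnemar_pvalue(y_true, champion, challenger)
print(f"McNemar exact p-value: {p:.3f}")  # promote only if p is below your threshold
```

Note that with only a handful of shadow predictions the test will rarely reach significance; in practice the comparison runs over a large window of logged traffic.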
A/B Testing
A/B Testing, also known as split testing, is a randomized experimentation process used in many fields, including software testing. Using A/B Testing to evaluate model performance is no different from using it elsewhere. In A/B Testing, part of the input data goes through the model deployed in production and the rest goes through the model under test; effectively, two different models make predictions, with the input data split between them. If the model under test performs better than the currently deployed model and the difference is statistically significant, the model under test is deployed in place of the current model and all of the input data goes through it.
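A common way to implement the split is to hash a stable request attribute such as the user ID, which keeps each user on the same model across requests while approximating a random split. The sketch below is an illustration under assumed names; the 10% Challenger share, the function names, and the models' `predict()` method are arbitrary assumptions:

```python
import hashlib

def assign_variant(user_id: str, challenger_share: float = 0.1) -> str:
    """Deterministically assign a request to the Champion ("A") or Challenger ("B").

    Hashing the user ID keeps each user's assignment stable across requests
    and approximates a uniform random split without storing any state.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "B" if bucket < challenger_share else "A"

def predict(user_id, features, champion, challenger):
    # Unlike Shadow Testing, exactly one model scores each request,
    # and its prediction is actually served to the user.
    model = challenger if assign_variant(user_id) == "B" else champion
    return model.predict(features)
```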
Advantages
- The total number of predictions made does not change, so the resource consumption, and therefore the cost, of A/B Testing is lower than that of Shadow Testing.
Disadvantages
- The models make predictions on different data, so care must be taken to ensure that the traffic split does not introduce sampling bias into the comparison.
- It is often more difficult to implement than Shadow Testing.
- Since the Challenger's predictions are actually served to users, there is a real risk if its performance in production turns out to be considerably worse than it was during initial testing.
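As with Shadow Testing, the Challenger should replace the Champion only when the observed difference is statistically significant. Since A/B Testing yields unpaired samples (each request is scored by only one model), a two-proportion z-test on the accuracy observed in each split is one simple check. The sketch below uses made-up counts purely for illustration:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test on the accuracy observed in each traffic split."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative counts (made up): correct predictions out of requests served.
z, p = two_proportion_z(successes_a=8200, n_a=10000, successes_b=4310, n_b=5000)
print(f"z={z:.2f}, p={p:.4f}")  # promote the Challenger only if p < 0.05 and it scored higher
```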
Even though Shadow Testing is more expensive than A/B Testing, it should be used wherever possible because it is easier to implement and leaves no room for sampling bias.