Data is the most important component in training Machine Learning models. Before any model is developed, the raw data must be handled properly. In this chapter of the MLOps tutorial, you will learn about data handling and the steps involved in it.
Data handling involves the following steps:
- Data Ingestion
- Versioning Data
- Analyzing the Data
Data Ingestion
The data is consumed from pipelines that Data Engineers build to pull data from on-premise databases or from cloud service providers. It is first divided into three parts: the training set, the validation set, and the test set. Transformations such as feature engineering, imputing missing values, and one-hot encoding of categorical values are applied to all three sets, but only the training set is used for analyzing and visualizing the data.
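The split-then-transform flow described above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the 70/15/15 split, the `age` column, and the helper names `split_dataset`, `fit_mean_imputer`, and `impute` are all assumptions made for the example. It also assumes a common practice that the text does not spell out: the fill value is learned from the training set only and then applied to all three sets.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

def fit_mean_imputer(train, column):
    """Learn the fill value from the training set only (assumed practice)."""
    observed = [row[column] for row in train if row[column] is not None]
    return sum(observed) / len(observed)

def impute(rows, column, fill_value):
    """Return copies of rows with missing values in `column` replaced."""
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]

# Hypothetical toy data: every fifth "age" value is missing.
rows = [{"id": i, "age": None if i % 5 == 0 else 20 + i} for i in range(20)]

train, val, test = split_dataset(rows)
fill = fit_mean_imputer(train, "age")
# The same transformation is applied to all three sets.
train, val, test = (impute(part, "age", fill) for part in (train, val, test))
```

Fixing the seed keeps the split reproducible, so the same rows land in the same set on every run.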
Versioning Data
In Machine Learning, data is a dynamic component, so it is important to keep track of changes in the data used for training the models. To do this, the data is versioned just like code, and each model is linked to the version of the data it was trained on. This step also helps with regulatory compliance by recording which data was used to train each model. Additionally, tracking changes in the data together with the models trained on it makes results reproducible.
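One simple way to picture data versioning is to derive a version identifier from the dataset contents and store it next to each trained model. The sketch below does this with a content hash. The names `data_version`, `register_model`, and `model_registry` are hypothetical, and the in-memory dictionary stands in for whatever tracking system a real project would use; dedicated tools exist for this purpose and are usually a better choice in practice.

```python
import hashlib
import json

def data_version(records):
    """Derive a version ID from the dataset contents (truncated SHA-256)."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

# Hypothetical in-memory registry: model name -> metadata about its training data.
model_registry = {}

def register_model(name, records):
    """Record which version of the data a model was trained on."""
    model_registry[name] = {"data_version": data_version(records)}
    return model_registry[name]

# Two snapshots of a toy dataset; the second has one extra record.
v1 = [{"id": 1, "label": 0}, {"id": 2, "label": 1}]
v2 = v1 + [{"id": 3, "label": 1}]

register_model("model_a", v1)
register_model("model_b", v2)
```

Because the identifier depends only on the contents, identical data always maps to the same version, while any change to the data produces a different one.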
Analyzing the Data
In this step, the data is analyzed and explored. Individual features are examined using measures such as mean, median, mode, range, variance, and standard deviation for continuous-valued features, and value counts for categorical features. Charts and plots such as histograms and scatter plots are used to visualize the data. Additionally, the correlation between features is checked to gain further insights from the data.
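The measures listed above can be computed with Python's standard library alone. The following is a small sketch on made-up values: the feature names `ages`, `incomes`, and `cities`, and the helpers `describe` and `pearson`, are assumptions for illustration. Plotting is left out, and Pearson correlation is used here as one common choice of correlation measure.

```python
import statistics
from collections import Counter

def describe(values):
    """Summary measures for a continuous-valued feature."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance
        "std": statistics.stdev(values),          # sample standard deviation
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical training-set features.
ages = [22, 25, 25, 31, 40]
incomes = [30, 35, 36, 48, 60]
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Pune"]

summary = describe(ages)        # continuous feature
city_counts = Counter(cities)   # value counts for a categorical feature
r = pearson(ages, incomes)      # correlation between two features
```

On larger datasets, a dataframe library would typically replace these helpers, but the quantities being computed are the same.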