Validation of Machine Learning Models Focusing on Reliability

How faulty data preparation will lower your model’s credibility

Avoidable mistakes that lead to an unreliable performance estimation

Jule M

--

Part 3 of the article series Validation of Machine Learning Models Focusing on Reliability


This article examines the data preparation step for possible mistakes that lead to a misjudgement of machine learning models. In this context, the following questions are answered:

  • Why can random sampling lead to overoptimistic performance estimates?
  • How to properly split the data?
  • How can data normalisation violate the independence of the test set?

One reason to think of a model as excellent although it is not can be the incorrect handling of data. Before the sample data is fed to the machine learning algorithm that trains and validates the model, it might go through some preparation steps that require careful handling. Data preparation is visualised in figure 1.1 as an intermediate step between sampling from the population and the actual training of the model. So even if the sampling step provides us with a dataset that perfectly represents the population, we can still ruin this property before training the model on the data.

Figure 1.1: Data preparation as part of the development process

The goal of data preparation is to bring the raw data into a form suitable for training and validation. It involves multiple tasks such as data cleaning, feature extraction, feature scaling or encoding the categories for a classification task. These steps are often referred to as data preprocessing. [1] Another important preparation step is the splitting of the data into training and test sets. These preparations require a great deal of caution. A mistake, for example a coding error, can have a dramatic impact on the resulting model. This article, however, examines how preparing the data can lead to misjudgements and eventually to an unreliable assessment of performance. These errors can occur during data splitting as well as during preprocessing and are explained in the following sections.

Splitting the data

A model should be trained and validated on two distinct data sets in order to assess whether the model generalises to unseen data. If the data set used for training contains data from the test set, the model might simply remember those samples. This leads to an overoptimistic estimation of the model’s performance due to overfitting. [2, p. 7]
A rather naive approach to evaluating a model is to split the data into two distinct subsets: one for training and one for testing. This is typically done by randomly assigning 70% of the samples to the training set and the remaining 30% to the test set, called the holdout set (a minimal code sketch follows below). However, this is less trivial than it sounds.
The question is: Which mistakes during data splitting can result in an unreliable performance estimation?
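
The split itself is a one-liner with scikit-learn; the following is a minimal sketch, using the Iris data merely as a stand-in for any data set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Example data; replace with your own feature matrix X and labels y.
X, y = load_iris(return_X_y=True)

# Randomly assign 70% of the samples to the training set and
# the remaining 30% to the holdout (test) set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # fixed seed for reproducibility
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```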

Altering the distribution

When a data set is split randomly, the distributions of the resulting sets may diverge, although training and test set should be drawn from the same distribution to achieve a meaningful estimation. The need for independent and identically distributed data samples is dealt with in the previous article of this series: Difficulties capturing reality.
The problems of sampling from a population also apply to sub-sampling. Since a subset serves as a representation of the whole set, its statistics should resemble those of the whole set. In other words, when randomly sampling from a data set, the resulting subsets might not represent the population sufficiently anymore. As a result, the algorithm is trained on a different distribution than it is tested on. In the worst case, certain classes or characteristics might even be missing entirely. [2]
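
To see the problem, it helps to compare the class frequencies before and after a purely random split. A small sketch on a made-up, imbalanced label vector (the numbers are only illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A small, imbalanced toy data set: the third class has only 4 samples.
y = np.array([0] * 20 + [1] * 10 + [2] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

# Depending on the random seed, the rare class can end up under-represented
# in (or even missing from) one of the subsets.
print("full set:", np.bincount(y, minlength=3))
print("test set:", np.bincount(y_test, minlength=3))
```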

Stratified Sampling
Stratified Sampling
Splitting the data in a way that the original distribution remains the same in the resulting subsets is called stratification or stratified sampling. [2, p.7] Stratifying according to the class proportions is straightforward to implement. Stratifying the samples along the feature axes is more problematic. [2, p.14] When multiple characteristics need to be considered, it turns into an optimisation problem. S. Raschka visualises this behaviour using the example of the Iris dataset. He randomly divides the data, assigning 2/3 of the samples to the training set and 1/3 to the test set. He then plots the frequencies of the three iris classes setosa, virginica and versicolor with respect to the sepal width, which is one of the four features of a data record.
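
With scikit-learn, stratifying according to the class labels is a matter of passing them to the stratify parameter; a minimal sketch on the Iris data, assuming the same 2/3–1/3 split as in Raschka's example:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the original class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)

print(Counter(y))        # 50 samples per class in the full set
print(Counter(y_train))  # roughly 33-34 samples per class
print(Counter(y_test))   # roughly 16-17 samples per class
```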

Based on the implementation given by S. Raschka, I visualised the class distribution along the feature sepal length before and after stratified sampling while keeping the original class proportions in the resulting subsets. The corresponding implementation is linked in the footnotes¹.
Figure 2.1 shows the class frequencies within the entire data set before stratified sampling and the distributions in the resulting training and test set. Although the relative frequencies of the three classes are similar in the subsets, the graphic clearly shows that the same does not hold for the feature sepal length.

Figure 2.1: Class proportions depending on sepal length before and after stratified sampling according to classes

Looking at the distribution of the training set in Figure 2.1, this is not obvious, but the distribution of the test set shows gaps in certain areas. For instance, the model would be poorly tested on samples with a sepal length of ≈5.25 and ≈7.25. Afterwards I repeated the experiment, but instead of keeping the class proportions, the distribution of the feature sepal length is maintained. Figure 2.2 visualises the results.

Figure 2.2: Class proportions of training and test set after stratified sampling according to sepal length

In contrast to figure 2.1, the test set now covers the entire range of values for sepal length, and training and test set resemble each other with respect to the proportions of the feature. However, there is no longer a guarantee that the distribution of classes does not differ too much from the original. After all, stratified sampling with respect to multiple characteristics is quite challenging.
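
A common workaround for stratifying along a continuous feature is to discretise it into bins and stratify on the bin membership instead of the class label. A sketch of that idea (the number of bins is an arbitrary choice here):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Discretise sepal length into a few bins and use the bin labels
# as the stratification key instead of the class labels.
sepal_length_bins = pd.cut(X["sepal length (cm)"], bins=5, labels=False)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=sepal_length_bins, random_state=0
)

# The feature distribution is now similar in both subsets ...
print(X_train["sepal length (cm)"].describe())
print(X_test["sepal length (cm)"].describe())
# ... but the class proportions are no longer guaranteed to match exactly.
print(y_train.value_counts(), y_test.value_counts())
```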

When working with very large data sets, randomly splitting the data should usually not result in very differently distributed subsets. For smaller data sets, however, this is more likely to happen. Depending on the data, it may be advantageous to examine the subsets for important characteristics. [2, p.8]

Sensitivity to particular splits

In general, it can happen that the samples from a particular test set are easier for the model to predict correctly than those from another one.
Especially for small data sets, the randomly chosen subset might not be as random as expected. When a model is validated on a single validation set, the estimate of generalisation refers to this particular data fold. If the model has a high variance, such a point estimate might be misleading, since the performance of such models differs greatly depending on the data set.

Repeated holdout validation
Intuitively, the problem that a model is sensitive to a particular split can be dealt with by computing the average over multiple runs on different test sets. Assuming the test results to be normally distributed, the mean and standard deviation can be calculated, carrying more information about the confidence of the estimates and the stability of the model. This procedure is called repeated holdout validation [2, p.15]. The averaged estimate is less dependent on a specific test set. However, this approach is questionable, since some data records might occur in several test sets while others are not included at all. As a result, certain data samples have more impact on the test error than others.
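
A minimal sketch of repeated holdout validation using scikit-learn's ShuffleSplit, which draws independent random 70/30 splits (so samples may indeed appear in several test sets or in none); the logistic regression model is just a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Ten independent random 70/30 splits.
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores = cross_val_score(model, X, y, cv=splitter)

# Mean and standard deviation of the ten test accuracies.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```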

Cross-Validation
Nowadays, k-fold cross-validation is probably the predominantly applied technique for model evaluation in practice. In short, the data set is divided into k disjoint parts. Each part is utilised for testing exactly once after the model has been trained on the remaining k−1 parts. This procedure ensures that the model is tested on all data records equally often. [4, p.11] [2, pp. 24–25]
Figure 2.6 depicts the difference between repeated holdout validation and k-fold cross-validation. The given data set is divided into three pairs of training and test sets. On the left side of the image, the samples are assigned randomly. As a result, data record 7 appears in two test sets and consequently has more impact on the final test result. Data record 1, however, is not contained in any test set, so potential weaknesses in this area will not be reflected in the test result. On the right side, 3-fold cross-validation is applied, ensuring that each sample is used for testing equally often; a short code sketch of this behaviour follows after the figure.

Figure 2.6: Difference between Repeated Holdout Validation and k-fold Cross-Validation
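
The behaviour shown on the right side of figure 2.6 can be reproduced with scikit-learn's KFold; a small sketch with nine dummy records (the record numbers are only illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine dummy data records, numbered 1 to 9.
records = np.arange(1, 10)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(records), start=1):
    # Every record appears in exactly one test fold, so each sample
    # contributes to the test error exactly once.
    print(f"fold {fold}: train={records[train_idx]}, test={records[test_idx]}")
```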

Although cross-validation solves many problems, the representativeness of the data splits is still an issue. In fact, there are some further pitfalls when applying this technique, and even some data science platforms do not implement it properly. [4, p.12] These difficulties are discussed in the following article.

Accidental contamination of the training data

A model must be trained and tested on separate, non-overlapping data sets in order to obtain a realistic performance estimate. Cross-validation offers a good starting point to meet this requirement.
However, an unrealistic estimation is still possible, for instance when preprocessing is done outside the training and validation procedure. As an example, imagine introducing a normalisation step that scales the feature values to the same range. This can be achieved through Min-Max-Normalisation, given by

x_scaled = (x − x_min) / (x_max − x_min)

where x_min and x_max denote the minimum and maximum value of the respective feature.

If you apply this formula to the whole data set and only afterwards split the data according to cross-validation, information about the minimum and maximum values of the test set is exposed to the training data. Such a wrong application of normalisation in combination with training and evaluation is shown in figure 3.8.

Figure 3.8: The wrong way: Information about test data leaks into training data

Such a leak of information makes the training data dependent on the test data, which can eventually lead to an overoptimistic performance estimate. In other words, the model is likely to disappoint when running in a production environment. [4, pp. 15–16] Therefore, such data transformations need to be included in the cross-validation cycle. For each run, training and test data are processed separately before training and evaluation. This is depicted in figure 3.9. After the training data has been normalised independently, the test data is processed using the parameters derived from the normalisation of the training data. This way, the model is not exposed to any knowledge from the test data during the training phase.

Figure 3.9: The right way: Training and test data are normalised separately

Of course, the model won't perform better in production just because normalisation is performed inside cross-validation, since the final model will be trained on the entire data set anyway. Nevertheless, it might save you from a bad surprise caused by an overoptimistic performance estimate. [4, pp. 15–16] [3, p.3]
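
The following sketch contrasts the two figures: fitting the scaler on the whole data set before cross-validation (the wrong way) versus wrapping it in a pipeline so it is re-fit on the training portion of every fold (the right way). The breast cancer data set and the k-nearest-neighbour classifier are arbitrary choices, and the size of the score difference depends on the data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# Wrong: the scaler sees the entire data set, so min/max information
# from the test folds leaks into the training folds.
X_leaky = MinMaxScaler().fit_transform(X)
leaky_scores = cross_val_score(KNeighborsClassifier(), X_leaky, y, cv=5)

# Right: the scaler is part of the pipeline and is fit anew on the
# training portion of every cross-validation fold.
pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
clean_scores = cross_val_score(pipeline, X, y, cv=5)

print(f"leaky estimate:  {leaky_scores.mean():.3f}")
print(f"proper estimate: {clean_scores.mean():.3f}")
```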

Summary

Since data forms the basis of intelligent systems, data preparation is a very significant and time-consuming task. To sum up, these are the main guidelines from this article to follow if we aim at a more reliable assessment of our model.

  • For classification tasks, use stratified sampling to keep the class proportions and avoid a misleading estimate.
  • Also consider the representation of features when splitting the data into subsets. A visualisation of the distributions might be helpful.
  • Use more than just one test set for validation to get a more robust estimation.
  • Use cross-validation to ensure that the test sets do not overlap and every sample is tested exactly once.
  • Do not apply data transformations to the entire data set before splitting the data into training and test sets. Avoid accidental contamination.

When developing machine learning models, perhaps in a team of software engineers and data scientists, it is important to be aware of these issues. However, it is not easy to commit to a consistent and secure approach, especially if there is no uniform development environment that provides predetermined procedures or prevents common mistakes.
In general, it is always a good idea to handle training and test data carefully. Beware that mistakes during data preparation often go unnoticed for a long time, especially if this step gets little attention.

References

[1] Jason Brownlee. What Is Data Preparation in a Machine Learning Project. Machine Learning Mastery, June 16, 2020. URL: https://machinelearningmastery.com/what-is-data-preparation-in-machine-learning/ (Accessed: 07/15/2020).

[2] Sebastian Raschka. “Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning”. In: arXiv:1811.12808 [cs, stat], Dec. 2, 2018. URL: http://arxiv.org/abs/1811.12808 (Accessed: 05/04/2020).

[3] Zoltan Prekopcsak, Tamas Henk, and Csaba Gaspar-Papanek. Cross-validation: the illusion of reliable performance estimation. URL: http://prekopcsak.hu/papers/preko-2010-rcomm.pdf (Accessed: 05/04/2020).

[4] RapidMiner. How to Correctly Validate Machine Learning Models. RapidMiner whitepaper. URL: https://rapidminer.com/resource/correct-model-validation/ (Accessed: 04/16/2020).

Footnotes

  1. https://gist.github.com/JulesMary/0a02edee22d1da0cd78c8a20d89b4de8
