1.1 Modelling from data
Modelling from data is often viewed as an art, mixing an expert’s insight with the information contained in the observations. A typical modelling process cannot be considered as a sequential process but is better represented as a loop with many feedback paths and interactions with the model designer. Various steps are repeated several times aiming to reach, through continuous refinements, a good description of the phenomenon underlying the data.
The process of modelling consists of a preliminary phase which brings the data from their original form to a structured configuration and a learning phase which aims to select the model, or hypothesis, that best approximates the data (Figure 1.2).
The preliminary phase can be decomposed into the following steps:
First, the model designer chooses a particular application domain and a phenomenon to be studied, and hypothesizes the existence of a (stochastic) relation (or dependency) between the measurable variables.
The next step aims to return a dataset which, ideally, should be made of samples that are representative of the phenomenon, in order to maximize the performance of the modelling process.
In the last step, raw data are cleaned to make learning easier. Pre-processing includes a large set of actions on the observed data, such as noise filtering, outlier removal, missing data treatment, feature selection, and so on.
Once the preliminary phase has returned the dataset in a structured input/output form (e.g. a two-column table), called the training set, the learning phase begins. A graphical representation of a training set for a simple learning problem with one input variable $x$ and one output variable $y$ is given in Figure 1.3. This manuscript will focus exclusively on this second phase, assuming that the preliminary steps have already been performed by the model designer.
Suppose that, on the basis of the collected data, we wish to learn the unknown dependency existing between the $x$ variable and the $y$ variable. In practical terms, knowledge of this dependency could shed light on the observed phenomenon and allow us to predict the value of the output $y$ for a given input (e.g. what is the expected weight of a child who is 120 cm tall?). What makes this task difficult is the finiteness and the random nature of the data. For instance, a second set of observations of the same pair of variables could produce a dataset (Figure 1.4) which is not identical to the one in Figure 1.3, though both originate from the same measurable phenomenon. This simple fact suggests that a simple interpolation of the observed data would not produce an accurate model of the data.
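The effect of randomness on interpolation can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the linear height/weight relation, the noise level, and the helper `sample_dataset` are hypothetical and do not reproduce the data of Figures 1.3 and 1.4. It draws two datasets from the same stochastic relation, builds a piecewise-linear interpolator through the first one, and measures its error on both.

```python
import numpy as np


def sample_dataset(n, rng):
    """Draw n noisy observations of a hypothetical height/weight relation."""
    x = rng.uniform(100.0, 140.0, size=n)              # heights in cm (assumed range)
    y = 0.4 * x - 23.0 + rng.normal(0.0, 2.0, size=n)  # weights in kg plus noise (assumed)
    return x, y


rng = np.random.default_rng(0)
x1, y1 = sample_dataset(20, rng)  # first set of observations
x2, y2 = sample_dataset(20, rng)  # second set, same phenomenon

# Piecewise-linear interpolator passing through every point of the first set.
order = np.argsort(x1)


def interpolate(x_new):
    return np.interp(x_new, x1[order], y1[order])


mse_first = float(np.mean((interpolate(x1) - y1) ** 2))   # error on the data it interpolates
mse_second = float(np.mean((interpolate(x2) - y2) ** 2))  # error on the second dataset

print(f"MSE on first dataset:  {mse_first:.4f}")
print(f"MSE on second dataset: {mse_second:.4f}")
```

The interpolator reproduces the first dataset exactly, since it passes through each of its points, but it also reproduces that dataset's noise, so its error on the second dataset is not zero.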
The goal of machine learning is to formalize and optimize the procedure which leads from data to model and consequently from data to predictions. A learning procedure can be concisely defined as a search, in a space of possible model configurations, for the model which best represents the phenomenon underlying the data. As a consequence, a learning procedure requires both a search space, where possible solutions may be found, and an assessment criterion which measures the quality of the solutions in order to select the best one.
The search space is defined by the designer using a set of nested classes of increasing complexity. For our introductory purposes, it is sufficient to consider here a class as a set of input/output models (e.g. the set of polynomial models) with the same model structure (e.g. second-order degree) and the complexity of the class as a measure of the set of input/output mappings which can be approximated by the models belonging to the class.
Figure 1.5 shows the training set of Figure 1.3 together with three parametric models which belong to the class of first-order polynomials. Figure 1.6 shows the same training set with three parametric models which belong to the class of second-order polynomials.
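To make the notion of a class concrete, the following sketch represents each parametric model by its coefficient vector and each class by the length of that vector. The training set and all coefficient values are hypothetical and are not those plotted in Figures 1.5 and 1.6. The last second-order model has a zero quadratic coefficient, which illustrates the nesting: every first-order model is also a member of the second-order class.

```python
import numpy as np

# Hypothetical training set: a noisy quadratic relation (an assumption for this sketch).
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0.0, 0.1, size=x.size)

# Each model is a coefficient vector, highest degree first (np.polyval convention).
first_order = [
    np.array([0.5, 1.0]),    # y = 0.5 x + 1
    np.array([-1.0, 0.0]),   # y = -x
    np.array([2.0, 0.3]),    # y = 2 x + 0.3
]
second_order = [
    np.array([-2.0, 0.5, 1.0]),  # y = -2 x^2 + 0.5 x + 1
    np.array([1.0, 0.0, 0.0]),   # y = x^2
    np.array([0.0, 0.5, 1.0]),   # zero quadratic term: same mapping as first_order[0]
]


def training_mse(coeffs):
    """Mean squared error of one parametric model on the training set."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


for name, models in (("first-order", first_order), ("second-order", second_order)):
    for coeffs in models:
        print(f"{name:12s} {coeffs} -> training MSE {training_mse(coeffs):.3f}")
```

Replacing the visual comparison of the plotted curves with a number such as the training error is a first step towards a quantitative criterion, although the training error alone does not measure generalization.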
The reader could visually decide whether the class of second-order models is more adequate than the first-order class to model the dataset. At the same time, she could guess which of the three plotted models produces the best fit.
In real high-dimensional settings, however, a visual assessment of the quality of a model is not sufficient. Data-driven quantitative criteria are therefore required. We will assume that the goal of learning is to attain a good statistical generalization. This means that the selected model is expected to return an accurate prediction of the dependent (output) variable when values of the independent (input) variables, which are not part of the training set, are presented.
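One simple way to quantify generalization, sketched here under stated assumptions, is to compare the error of a fitted model on the training set with its error on inputs that were not used for fitting. The sinusoidal data-generating process, the noise level, the helper names `sample` and `mse`, and the choice of polynomial degrees are all hypothetical.

```python
import numpy as np


def sample(n, rng):
    """Draw n observations from a hypothetical noisy input/output relation."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared prediction error of a polynomial model on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


rng = np.random.default_rng(2)
x_train, y_train = sample(20, rng)   # training set
x_new, y_new = sample(1000, rng)     # inputs that are not part of the training set

results = {}
for degree in (1, 3, 8):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit on training data only
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_new, y_new))

for degree, (train_err, new_err) in results.items():
    print(f"degree {degree}: training MSE {train_err:.3f}, MSE on unseen inputs {new_err:.3f}")
```

The training error cannot increase as the degree grows, because the polynomial classes are nested. The error on unseen inputs carries no such guarantee, which is why it is the more informative quantity when the goal is generalization.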
Once the classes of models and the assessment criteria are fixed, the goal of a learning algorithm is to search i) for the best class of models and ii) for the best parametric model within such a class. Any supervised learning algorithm is then made of two nested loops denoted as the structural identification loop and the parametric identification loop.
Structural identification is the outer loop which seeks the model structure which is expected to have the best performance. It is composed of a validation phase, which assesses each model structure on the basis of the chosen assessment criterion, and a selection phase which returns the best model structure on the basis of the validation output. Parametric identification is the inner loop which returns the best model for a fixed model structure. We will show that the two procedures are intertwined since the structural identification requires the outcome of the parametric step in order to assess the goodness of a class.
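The two nested loops can be sketched as follows. This is a minimal illustration, not a definitive procedure: it assumes that the model structures are polynomial degrees, that parametric identification is a least-squares fit, and that validation is done by 5-fold cross-validation. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def parametric_identification(x, y, degree):
    """Inner loop: least-squares estimate of the parameters for a fixed structure."""
    return np.polyfit(x, y, degree)


def validate(x, y, degree, n_folds=5):
    """Validation phase: cross-validated mean squared error of one structure."""
    indices = np.arange(x.size)
    errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        # The assessment of a structure relies on the outcome of the parametric step.
        coeffs = parametric_identification(x[train], y[train], degree)
        residuals = np.polyval(coeffs, x[held_out]) - y[held_out]
        errors.append(np.mean(residuals ** 2))
    return float(np.mean(errors))


def structural_identification(x, y, degrees):
    """Outer loop: validate every structure, then select the best one."""
    scores = {d: validate(x, y, d) for d in degrees}
    best_degree = min(scores, key=scores.get)  # selection phase
    best_coeffs = parametric_identification(x, y, best_degree)
    return best_degree, best_coeffs, scores


# Synthetic training set (an assumption of this sketch): a noisy quadratic relation.
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=60)
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0.0, 0.1, size=x.size)

best_degree, best_coeffs, scores = structural_identification(x, y, degrees=range(1, 6))
print("validation scores:", {d: round(s, 4) for d, s in scores.items()})
print("selected degree:", best_degree)
```

Note that `validate` calls `parametric_identification` on every fold, which makes the intertwining explicit: a structure cannot be assessed without first fitting its parameters.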