8 Nonlinear approaches

In the previous chapter we considered input/output regression problems where the relation between input and output is linear, and classification problems where the optimal decision boundaries are linear.

The advantages of linear models are numerous:

  • the least-squares $\hat{\beta }$ estimate can be expressed in an analytical form and can be easily calculated through matrix computation.

  • statistical properties of the estimator can be easily defined.

  • recursive formulations for sequential updating are available (the first and last points are sketched in code after this list).
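To make the first and last points concrete, here is a minimal NumPy sketch (the synthetic data, the dimensions and the initial covariance are illustrative choices, not prescriptions) computing the closed-form least-squares estimate and then recovering it with a recursive least-squares update, one sample at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = X beta + noise
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed-form least-squares estimate: beta_hat solves (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Recursive least squares: refine the estimate one sample at a time
P = 1e6 * np.eye(p)                       # large initial covariance (weak prior)
b = np.zeros(p)
for x_i, y_i in zip(X, y):
    k = P @ x_i / (1.0 + x_i @ P @ x_i)   # gain vector
    b += k * (y_i - x_i @ b)              # correct using the prediction error
    P -= np.outer(k, x_i @ P)             # covariance update

print(beta_hat)  # closed-form solution
print(b)         # sequential estimate, numerically close to beta_hat
```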

Unfortunately, in real problems it is extremely unlikely that the input and output variables are linked by a linear relation. Moreover, the form of the relation is often unknown, and only a limited number of samples is available. Over the years, statisticians and machine learning researchers have proposed a number of nonlinear approaches, with the aim of finding approximators able to combine high generalization with effective learning procedures. The presentation of these techniques could be organized according to several criteria and principles. In this chapter we will focus on the distinction between global and divide-and-conquer approaches.

A family of models traditionally used in supervised learning is that of global models, which describe the relationship between the input and the output values as a single analytical function over the whole input domain (Fig.8.1). In general, this makes sense when it is reasonable to believe that a physical-like law describes the data over the whole set of operating conditions.

Figure 8.1: A global model (solid line) which fits the training set (dotted points) for a learning problem with one input variable (x-axis) and one output variable (y-axis).

Examples of well-known global parametric models in the literature are the linear models discussed in the previous chapter, as well as generalized linear models and neural networks, which will be presented in Section 8.1.1.

A nice property of global modeling is that, even for huge datasets, a parametric model can be stored in a small amount of memory. Moreover, the evaluation of the model requires only a short program that can be executed quickly. These features have undoubtedly contributed to the success of the global approach in years when most computing systems imposed severe limitations on users.

However, for a generic global model, the parametric identification (Section 5.2) amounts to a nonlinear optimization problem (see Equation 5.2.5) which is not analytically tractable, due to its numerous local minima, and for which only a suboptimal solution can be found through a slow iterative procedure. Similarly, the problem of selecting the best model structure in the generic nonlinear case cannot be handled in analytical form and requires time-consuming validation procedures.
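To make the contrast with the linear case concrete, the following sketch fits a small one-hidden-layer tanh network by plain gradient descent with a finite-difference gradient (a deliberately naive setup; the network size, learning rate and iteration count are arbitrary choices). The solution is reached iteratively, and the end point depends on the random initialization, i.e. it may be only a local minimum:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a nonlinear function
x = np.linspace(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

H = 5  # number of hidden units

def predict(theta, x):
    # One-hidden-layer network: theta packs W (H), b (H), v (H), c (1)
    W, b, v, c = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[-1]
    return np.tanh(np.outer(x, W) + b) @ v + c

def mse(theta):
    return np.mean((predict(theta, x) - y) ** 2)

def num_grad(f, theta, eps=1e-6):
    # Finite-difference gradient: slow, but keeps the sketch dependency-free
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta = 0.5 * rng.normal(size=3 * H + 1)  # random start: different seeds
                                          # may end in different local minima
for _ in range(3000):                     # slow iterative descent
    theta -= 0.05 * num_grad(mse, theta)

print(mse(theta))  # training error at a (possibly local) minimum
```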

For these reasons, in recent years, alternatives to global modeling techniques, such as the divide-and-conquer approach, have gained popularity in the modeling community. The divide-and-conquer principle consists of attacking a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem. This principle presents two main advantages. The first is that simpler problems can be solved with simpler estimation techniques: in statistical language, this means adopting linear techniques, which have been well studied and developed over the years. The second is that the learning method can better adjust to the properties of the available dataset. Training data are rarely distributed uniformly in the input space. Whenever the distribution of patterns in the input space is uneven, a proper local adjustment of the learning algorithm can significantly improve the overall performance.

We will focus on two main instances of the divide-and-conquer principle: the modular approach, which originated in the field of system identification, and the local modeling approach, which was first proposed in the statistical nonparametric literature.

Modular architectures are input/output approximators composed of a number of modules which cover different regions of the input space. This is the idea of operating regimes, which proposes partitioning the operating range of the system as a more effective way to solve modeling problems (Section 8.1.2).

Although these architectures are a modular combination of local models, their learning procedure is still performed on the basis of the whole dataset. Hence, learning in modular architectures remains a functional estimation problem, with the advantage that the parametric identification can be made simpler by the adoption of local linear modules. However, in terms of structural identification the problem is still nonlinear and requires the same procedures used for generic global models.
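The following sketch illustrates the operating-regime idea under simplifying assumptions: the partition (Gaussian validity functions on a fixed grid) is chosen by hand rather than identified from data, each module is a linear model fitted by weighted least squares, and the overall prediction blends the local lines according to the validity weights:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy nonlinear data on [0, 10]
x = rng.uniform(0, 10, 300)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

# Hand-chosen operating regimes: Gaussian validity functions on a grid
centers = np.linspace(0, 10, 6)
width = 1.2

def validity(x):
    # Normalized weights: how much each regime "owns" each input point
    W = np.exp(-0.5 * ((x[:, None] - centers) / width) ** 2)
    return W / W.sum(axis=1, keepdims=True)

# Parametric identification: one weighted least-squares fit per regime
X = np.column_stack([np.ones_like(x), x])
V = validity(x)
coef = []
for j in range(len(centers)):
    w = V[:, j]
    WX = X * w[:, None]
    coef.append(np.linalg.solve(X.T @ WX, X.T @ (w * y)))
coef = np.array(coef)                     # shape: (regimes, 2)

def predict(xq):
    Xq = np.column_stack([np.ones_like(xq), xq])
    local = Xq @ coef.T                   # each regime's linear prediction
    return np.sum(validity(xq) * local, axis=1)

print(predict(np.linspace(0, 10, 5)))     # smooth blend of the local lines
```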

A second example of divide-and-conquer methods are local modeling techniques (Section 8.1.10), which turn the problem of function estimation into a problem of value estimation. The goal is not to model the whole statistical phenomenon but to return the best output for a given test input, hereafter called the query. The motivation is simple: why should the problem of estimating the values of an unknown function at given points of interest be solved in two stages? Global modeling techniques first estimate the function (induction) and then estimate the values of the function using the estimated function (deduction). In this two-stage scheme one actually tries to solve a relatively simple problem (estimating the values of a function at given points of interest) by first solving, as an intermediate step, a much more difficult one (estimating the function).

Figure 8.2: Function estimation (model induction + model evaluation) vs. value estimation (direct prediction from data).

Local modeling techniques take an alternative approach, defined as transduction by Vapnik [115] (Fig.8.2). They focus on approximating the function only in the neighborhood of the point to be predicted.

Figure 8.3: Local modeling of the input/output relationship between the input variable $x$ and the output variable $y$, on the basis of a finite set of observations (dots). The value of the variable $y$ for $x=q$ is returned by a linear model (solid line) which fits the samples in a neighborhood of the query point (bigger dots).

This approach requires keeping the dataset in memory for each prediction, instead of discarding it as in the global modeling case. At the same time, local modeling requires only simple approximators, e.g. constant and/or linear, to model the dataset in a neighborhood of the query point. An example of local linear modeling in the case of a single-input single-output mapping is presented in Fig.8.3.
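A minimal sketch of the procedure in Fig.8.3 follows. The fixed neighborhood size k and the uniform weighting of the neighbors are simplifying assumptions; practical local learning methods typically weight the neighbors by their distance to the query:

```python
import numpy as np

rng = np.random.default_rng(3)

# The dataset is kept in memory: local modeling consults it at each query
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

def local_linear_predict(q, k=20):
    # 1. Select the k samples nearest to the query point q
    idx = np.argsort(np.abs(x - q))[:k]
    Xn = np.column_stack([np.ones(k), x[idx]])
    # 2. Fit a linear model to the neighborhood only
    beta, *_ = np.linalg.lstsq(Xn, y[idx], rcond=None)
    # 3. Evaluate the local line at the query
    return beta[0] + beta[1] * q

print(local_linear_predict(1.0), np.sin(1.0))  # prediction vs. true value
```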

Many names have been used in the past to label variations of the local modeling approach: memory-based reasoning [108], case-based reasoning [77], locally weighted regression [27], nearest neighbor [30], just-in-time [32], lazy learning [3], exemplar-based and instance-based learning [2], and so on. These approaches are also called nonparametric in the literature [60, 107], since they relax the assumptions on the form of the regression function and let the data determine a function that describes them well.

In the following we will present in detail some machine learning techniques for nonlinear regression and classification.