# 2.2 Numerical data summaries

Numerical summaries of data sets are widely used to capture some essential features of the data with a few numbers. A summary number calculated from the data is called a statistic.

## Univariate statistics

For a single data set, the most widely used statistics are the average and median.

Suppose $N$ denotes the total number of observations and $x_i$ denotes the $i$th observation. Then the average can be written as

$$ \bar{x} = \frac{1}{N}\sum_{i=1}^N x_{i} = (x_{1} + x_{2} + x_3 + \cdots + x_{N})/N . $$

The average is also called the sample mean.

By way of illustration, consider the carbon footprint from the 20 vehicles listed in Section 1/4. The data listed in order are

4.0 4.4 5.9 5.9 6.1 6.1 6.1 6.3 6.3 6.3
6.6 6.6 6.6 6.6 6.6 6.6 6.6 6.8 6.8 6.8

In this example, $N=20$ and $x_{i}$ denotes the carbon footprint of vehicle $i$. Then the average carbon footprint is

\begin{align} \bar{x} & = \frac{1}{20}\sum_{i=1}^{20} x_{i} \\ &= (x_{1} + x_{2} + x_3 + \dots + x_{20})/20 \\ &= (4.0 + 4.4 + 5.9 + \dots + 6.8 + 6.8 + 6.8)/20 \\ &= 124/20 = 6.2 \text{ tons CO}_{2}. \end{align}
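As a quick check of the arithmetic, the average can also be computed in R. This is only a minimal sketch: the vector name `carbon` is an assumption made for illustration, and the 20 values are typed in directly from the list above.

```r
# Carbon footprints (tons CO2) of the 20 vehicles, typed in from the text.
# The name `carbon` is hypothetical, chosen for this illustration.
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

N <- length(carbon)        # number of observations: 20
xbar <- sum(carbon) / N    # average from the formula: 124 / 20 = 6.2
mean(carbon)               # built-in equivalent, same value as xbar
```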

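The median, on the other hand, is the middle observation when the data are placed in order. When the number of observations is even, there is no single middle observation, and the median is taken to be the average of the two middle values. In this case, there are 20 observations, so the median is the average of the 10th and 11th observations in the ordered list. That is,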

$$ \text{median} = (6.3+6.6)/2 = 6.45. $$
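The same calculation might be sketched in R as follows, both by picking out the two middle ordered values and with the built-in `median()` function. The vector name `carbon` is again a hypothetical choice for illustration.

```r
# Carbon footprints (tons CO2), typed in from the text; `carbon` is a
# hypothetical name used only for this illustration.
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

ordered <- sort(carbon)                       # place the data in order
by_hand <- (ordered[10] + ordered[11]) / 2    # average of 10th and 11th: 6.45
median(carbon)                                # built-in equivalent
```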

Percentiles are useful for describing the distribution of data. For example, 90% of the data are no larger than the 90th percentile. In the carbon footprint example, the 90th percentile is 6.8, because the 18th of the 20 ordered observations (90% of the way through the data) is 6.8. Similarly, the 75th percentile is 6.6 and the 25th percentile is 6.1. The median is the 50th percentile.
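For illustration, these percentiles can be computed with R's `quantile()` function. Note that several slightly different definitions of a sample percentile are in use, and `quantile()` selects among them through its `type` argument. The sketch below assumes the default setting, which reproduces the values quoted above for this data set but need not agree with other definitions on other data. The vector name `carbon` is hypothetical.

```r
# Carbon footprints (tons CO2), typed in from the text; `carbon` is a
# hypothetical name used only for this illustration.
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

# 25th, 50th, 75th and 90th percentiles using the default definition.
pct <- quantile(carbon, probs = c(0.25, 0.50, 0.75, 0.90))
pct   # 6.1, 6.45, 6.6 and 6.8 respectively
```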

A useful measure of how spread out the data are is the interquartile range or IQR. This is simply the difference between the 75th and 25th percentiles, so it is the width of the interval that contains the middle 50% of the data. For the example,

$$ \text{IQR} = (6.6 - 6.1) = 0.5. $$
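A short R sketch of the same calculation is given here, first as a difference of two percentiles and then with the built-in `IQR()` function. It assumes the default percentile definition used by `quantile()`, and the vector name `carbon` is hypothetical.

```r
# Carbon footprints (tons CO2), typed in from the text; `carbon` is a
# hypothetical name used only for this illustration.
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

q <- quantile(carbon, probs = c(0.25, 0.75))   # 25th and 75th percentiles
iqr_by_hand <- unname(q[2] - q[1])             # 6.6 - 6.1 = 0.5
IQR(carbon)                                    # built-in equivalent
```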

An alternative and more common measure of spread is the standard deviation. This is given by the formula

$$ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_{i} - \bar{x})^2}. $$

In the example, the standard deviation is

$$ s = \sqrt{\frac{1}{19} \left[ (4.0-6.2)^2 + (4.4 - 6.2)^2 + \cdots + (6.8-6.2)^2\right]} = 0.74. $$
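The standard deviation can be checked in R in the same way. The sketch below follows the formula term by term and then compares the result with the built-in `sd()` function, which also uses the $N-1$ divisor. The names `carbon` and `s_by_hand` are hypothetical, introduced only for this illustration.

```r
# Carbon footprints (tons CO2), typed in from the text; `carbon` is a
# hypothetical name used only for this illustration.
carbon <- c(4.0, 4.4, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.3, 6.3,
            6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.6, 6.8, 6.8, 6.8)

N <- length(carbon)
xbar <- mean(carbon)

# Sum of squared deviations from the average, divided by N - 1,
# then the square root, exactly as in the formula.
s_by_hand <- sqrt(sum((carbon - xbar)^2) / (N - 1))
round(s_by_hand, 2)   # 0.74

sd(carbon)            # built-in equivalent
```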
