# 1.1.2 What is statistics?

So what is statistics, and where does it come into play? Well, the natural world is inherently variable. This is particularly true for biological systems, where repeated observation of the same process rarely gives exactly the same result! There are many reasons for this variability. Random chance can play a large role, but there are biological reasons as well. Genetic variation gives rise to variation among individuals. Epigenetic controls alter phenotypic response to stress. Individual behavior varies as a result of environmental conditions. Physiological response varies as a result of temperature. The list goes on and on. In fact, there are very few deterministic rules in the realm of biological phenomena.

Simply put, the field of statistics provides us with two important tools for dealing with such variability. First, it provides concepts and measures that the scientist can use to describe data, including their variability. This is descriptive statistics. Second, it provides the concepts and frameworks with which scientists can attempt to distinguish natural (or random) variability from variability that results from the biological process of interest. That is, it facilitates our ability to draw conclusions, or make inferences, in light of natural variability. This is inferential statistics. It is especially important because, as biologists, we are rarely able to completely measure or otherwise observe the processes or biological entities in which we are interested. As a result, we must rely on sampling (Fig. 1.2).

Figure 1.2: The process of sampling. Phenomena of interest exist in the real world. In statistics, we sometimes use terms like “statistical populations” or “processes” to refer to these phenomena. Biologists rarely are able to completely sample the real world. As a result, we examine a subset of the real world. This subset is our “data” or our “sample.” From this subset we hope to draw meaningful conclusions about the real world.
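The distinction between describing a sample and making inferences about the real world can be illustrated with a small simulation. What follows is a minimal sketch, not a recommended analysis: the "population" of body lengths, its mean and standard deviation, the sample size of 30, and all variable names are hypothetical values chosen purely for illustration, and the confidence interval uses a simple normal approximation.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same numbers on every run

# Hypothetical statistical population: body lengths (mm) of 10,000 individuals.
# In a real study we would almost never be able to measure all of them.
population = [random.gauss(50.0, 5.0) for _ in range(10_000)]

# What we actually get to observe: a random sample of 30 individuals.
sample = random.sample(population, 30)

# Descriptive statistics: summarize the sample, including its variability.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Inferential statistics: use the sample to say something about the population.
# Approximate 95% confidence interval for the population mean, using the
# normal quantile 1.96 (a t-based interval would be slightly wider for n = 30).
standard_error = sample_sd / len(sample) ** 0.5
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

# Only possible here because the population is simulated.
population_mean = statistics.mean(population)

print(f"Sample mean: {sample_mean:.2f} mm, sample SD: {sample_sd:.2f} mm")
print(f"Approximate 95% CI for the population mean: {ci_low:.2f} to {ci_high:.2f} mm")
print(f"Population mean (normally unknown): {population_mean:.2f} mm")
```

Changing the seed or the sample size and re-running the sketch shows how much the sample mean, and therefore the interval, moves from one sample to the next. That sample-to-sample variation is the uncertainty that inferential statistics tries to quantify.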

With this in mind, others have defined statistics as “the analysis and interpretation of data with a view toward objective evaluation of the reliability of the conclusions based on data” [33], or, “the scientific study of data describing natural variation” [29]. In essence, statistics is the study of variation and uncertainty; in particular, it is the study of how to make reasonable decisions and predictions in the face of uncertainty.¹

Taken together, these definitions highlight some important themes. First, statistics is a field of study closely aligned with the process of science. In fact, the topic of this text, applied statistics, describes a set of approaches that depend on an underlying theoretical discipline, mathematical statistics. However, an often under-appreciated consequence of this fact is that the way in which you apply statistics within your research assumes a philosophical stance. Null hypothesis testing assumes one stance, model selection assumes another, and so on. As a biologist, you need to understand that “Data analysis is an aid to thinking and not a replacement for it” [28].

Figure 1.3: Biostatistics as a system. Consideration of data analysis starts before data collection, and includes aspects of how the data will be managed and interpreted. In addition, feedbacks (dashed lines) from system components (the boxes) impact the biologist's overall study plan. These feedbacks illustrate the way in which each system component is also influenced by subsequent components (e.g., plans for data analysis impact the way in which data will be managed). In all things, the biological context of the study must be considered.

Second, statistics deals with data. Data come from experimentation and observation, and no amount of statistical maneuvering will overcome poor experimental design or biased observation. As a result, thinking about statistics does not start after the data have been collected! In fact, my contention is that statistics² is really a synthetic view of how you collect, manage, analyze, present, and interpret your data, all within a biological context (Fig. 1.3). Failure to consider this bigger picture eventually leads to frustration.

Finally, because we are talking about an applied field, it is helpful to take a step back from thinking about “right” or “wrong” ways to do things. There are “appropriate” and “inappropriate” applications of concepts, but there is no cookbook formula or series of steps that can be blindly applied! Therefore, it is your responsibility, as a biologist, and as a data analyst, to think critically about the questions being addressed, the biological context of those questions, and the appropriate application of statistical tools to evaluate your data in light of those questions. In most cases, there are a variety of approaches that may lead to the appropriate conclusions, but an ultimate measure of the usefulness and/or appropriateness of any statistical analysis is your ability to defend its use.

1. I must attribute this definition, and much of this discussion, to my colleague, Dr. Richard Strauss.

2. More specifically, applied biostatistics.