2 Data and Research Designs
In the previous chapter (Chapter 1) we provided an overview of the research process in psychology and related disciplines. The goal of this overview was to show that statistics is just one part of the full research endeavour and usually not the end goal. We also highlighted that an answer to our research questions requires not only the results from the statistical analysis, but also the context in which this result was generated. More specifically, we argued that the most important part of the research process is usually the operationalisation of the research question: What are our measures? What is the task participants have to do? What is our study design? The goal of this chapter is to provide us with the necessary conceptual knowledge and terminology to answer these questions for our research.
2.1 Empirical Evidence and Data
In this book we are concerned with empirical research. In other words, we are generally not interested in research questions for which an answer or proof can be found purely through thinking hard, such as in mathematics or philosophy. Instead, we are only interested in research questions for which the evidence comes in the form of observations or experiences: empirical evidence for short. As we have discussed in the previous chapter, that does not mean that our theories cannot include unobservable quantities such as mental states (e.g., fear, enjoyment, attention). However, if our theories include such unobservable quantities, these must be causally responsible for something that is observable (e.g., behaviour, or an electrical wave from the brain). This way, we can still test the theories (e.g., if our theory predicts that fear should lead to aggression, but we can induce fear without it leading to aggression, we learn that our theory must be wrong).
The fact that we are interested in empirical research means that the ultimate arbiter of whether or not we should believe in a theory is empirical evidence. It does not matter how elegant or intuitive a theory is. If the observed behaviour of people disagrees with a theory, it is wrong. It also means that theories that are so vague that there is no possible empirical evidence that would disprove them cannot be considered part of the empirical sciences (i.e., they are not empirical theories). This criterion is also known as falsifiability and was introduced by the philosopher Karl Popper in the 1930s. For example, there is a never-ending discussion of whether Freudian psychoanalysis is in principle falsifiable or not. Whereas Karl Popper was very strong in his belief that it is not (which would render psychoanalysis non-scientific), proponents of Freudian psychoanalysis naturally see this rather differently.11
Empirical evidence comes in at least two different forms, either as anecdotes or as data. Whereas an anecdote typically refers to a single person, data usually contains information about multiple persons. However, anecdotes and data differ on more dimensions than just the number of observations, as summarised in the following aphorism: The plural of anecdote is not data.
Anecdotes are unsystematic observations, typically in the form of stories (e.g., "the friend of a friend"), that somehow address our research questions. The problem with anecdotes is that they are generally difficult to verify and to investigate further. This makes it impossible to rule out possible alternative explanations for the relationship between the anecdote and the research question. And as we have seen in the previous chapter, one of the main criteria for deciding whether an observation provides evidence for a theoretical claim is whether we can rule out plausible alternative explanations. In sum, anecdotes surely matter when coming up with good hypotheses or ideas of what to study, but for mature sciences anecdotes should only play a minor evidentiary role in deciding which claims to believe.
Data are systematic observations that are collected for a specific purpose, such as answering a research question or bookkeeping. Data generally consists of observations on multiple variables. In the previous chapter we have defined variables as dimensions, features, or characteristics on which individuals or situations can differ. A more technical definition is that each variable corresponds to a specific set of possible outcomes (or states of affairs/events), where each possible outcome corresponds to one value of the variable. Furthermore, we can define an observation as the smallest unit of data. More technically, one observation results from collecting at least one value of one variable or values on different variables from one unit of observation (in psychology, the unit of observation is usually the participant).
As an example of data, consider again the study by Walasek and Stewart (2015) discussed in the previous chapter (Chapter 1). The task of participants was to accept or reject 50-50 lotteries (for an example, see Figure 1.2) and each participant had to do this for 64 trials. Table 2.1 below shows six observations each from two different participants from this study. The way the data is shown here is exactly the format that Lukasz Walasek used to analyse the data (i.e., no variables added or removed). Observations are shown in rows and variables are shown as columns. This tabular representation of the data with observations in rows and variables in columns is common and will be used throughout the book.
Table 2.1: Six observations each from two different participants in Walasek and Stewart (2015).

| subno | loss | gain | response | condition | resp |
|-------|------|------|----------|-----------|------|
| 8     | 6    | 6    | accept   | 20.2      | 1    |
| 8     | 6    | 8    | accept   | 20.2      | 1    |
| 8     | 6    | 10   | accept   | 20.2      | 1    |
| 8     | 6    | 12   | accept   | 20.2      | 1    |
| 8     | 6    | 14   | accept   | 20.2      | 1    |
| 8     | 6    | 16   | accept   | 20.2      | 1    |
| [...] |      |      |          |           |      |
| 369   | 6    | 12   | reject   | 40.2      | 0    |
| 369   | 6    | 16   | reject   | 40.2      | 0    |
| 369   | 6    | 20   | reject   | 40.2      | 0    |
| 369   | 6    | 24   | accept   | 40.2      | 1    |
| 369   | 6    | 28   | accept   | 40.2      | 1    |
| 369   | 6    | 32   | accept   | 40.2      | 1    |
In total we can see six different variables in this data set. Let us discuss these in turn. The first variable, `subno` (we generally use a `monospace` font to refer to variable names as they appear in a data set), is the participant identifier or "subject number" (because actual individuals take part in research and are not passive subjects, the term "participant" is now preferred to "subject"). This variable should be part of any data set to uniquely identify to which participant (or more generally, unit of observation) a specific observation belongs. Here, we see that it only takes numbers. It is not uncommon to only use numbers for the participant identifier variable, but it can also be a combination of numbers and letters, or an (ideally anonymous) name.
The second and third variables, `loss` and `gain`, specify the possible outcomes of the lotteries for each trial. For example, the second observation/row shows a lottery in which the potential loss was $6 and the potential gain was $8. Based on these two columns, we can see that the observations are ordered by the combination of loss and gain. This means the order of observations in the data does not reflect the actual order of trials in which participants saw them (as this order was random). All values of the two variables are numbers, with the lowest possible loss/gain being $6 and the largest possible loss/gain being $40, depending on the condition a participant is in.
Column four shows the `response` of the participant to the lottery, either `accept` or `reject`. The sixth variable, `resp`, is a numeric version of the response. Here, an accept decision is represented by a 1 and a reject decision by a 0. So these two variables carry the same information, but in different formats that have different benefits. The `response` variable makes it easy to understand in ordinary language what the participant's response was. The `resp` variable makes it easy to perform calculations on the results because it uses a numerical code to represent the same information. For example, because reject is mapped onto 0 and accept is mapped onto 1, we could take the mean over all observations to get the overall acceptance rate across all gambles (the mean of the whole data set is 0.38, which corresponds to an acceptance rate of 38%).12 However, if we only had the `resp` variable, we would additionally need to know which responses the values 0 and 1 correspond to.
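To make the benefit of the 0/1 coding concrete, here is a minimal sketch in Python. The use of pandas is an assumption for illustration (the book shows no analysis code at this point); the rows are the twelve observations from Table 2.1.

```python
import pandas as pd

# The twelve observations shown in Table 2.1, in long format.
df = pd.DataFrame({
    "subno":     [8] * 6 + [369] * 6,
    "loss":      [6] * 12,
    "gain":      [6, 8, 10, 12, 14, 16, 12, 16, 20, 24, 28, 32],
    "response":  ["accept"] * 6 + ["reject"] * 3 + ["accept"] * 3,
    "condition": [20.2] * 6 + [40.2] * 6,
    "resp":      [1] * 6 + [0] * 3 + [1] * 3,
})

# Because accept = 1 and reject = 0, the mean of `resp` is the
# acceptance rate (0.75 for this subset; 0.38 in the full data set).
acceptance_rate = df["resp"].mean()

# The same number can be recovered from the verbal `response` variable,
# but only after mapping the labels onto 0/1 ourselves.
same_rate = (df["response"] == "accept").mean()
```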
Finally, variable five, `condition`, informs us which condition each participant is in. Remember from the previous chapter that the experiment varied the range of gains and losses across participants, resulting in four conditions in total: a condition with losses and gains ranging up to -$20/+$40, a -$20/+$20 condition, a -$40/+$40 condition, and a -$40/+$20 condition. This information is provided here in the form of a decimal number, with the value before the decimal point referring to the range of gains and the value after the decimal point to the range of losses without the trailing 0 (i.e., in the opposite order to how we have referred to the conditions so far). Thus, the participant with `subno` 8 is in the -$20/+$20 condition and participant 369 in the -$20/+$40 condition.
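The encoding rule for `condition` can be made concrete with a small helper function. This function is hypothetical (it is not part of the original study or its analysis); it merely implements the decoding just described.

```python
def decode_condition(code: float) -> str:
    """Decode a condition code such as 40.2 into a -$loss/+$gain label.

    The digits before the decimal point give the range of gains; the
    digit after the decimal point gives the range of losses with its
    trailing zero dropped (2 -> 20, 4 -> 40).
    """
    gain_range = int(code)                          # e.g. 40.2 -> 40
    loss_range = round((code - gain_range) * 100)   # e.g. 40.2 -> 20
    return f"-${loss_range}/+${gain_range}"

# decode_condition(20.2) -> '-$20/+$20' (participant 8)
# decode_condition(40.2) -> '-$20/+$40' (participant 369)
```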
The example from Walasek and Stewart (2015) hopefully clarifies the abstract definition of a variable that was provided above. For each variable, we have a set of possible outcomes that is defined by our research design. For example, for the participant identifier `subno`, this set encompasses all possible participants who took part in the study. For `loss` and `gain`, this set contains all potential losses and potential gains that occur in the lotteries. For `response` there are the two possible outcomes: to accept or to reject a lottery. We then define values for each possible outcome. In the case of `subno` we assign a different number to every participant from whom we collect data. For `loss` and `gain`, the values correspond to the magnitude of the potential loss and gain in US dollars (the currency that was used in the experiment). For `response`, we do not use a numeric code, but a word to represent the two outcomes.
There are a few things of note in the example data shown in Table 2.1.
Every observation in this data set is complete; for each observation we have values on each variable and no missing data. Whereas this is common in experimental research, it is not always the case for other types of research. Although missing data is not something we will discuss in detail in this book, it is important to be aware that it can happen and to think about what to do in this case (sadly, there is no general solution). For example, in the type of experimental research used as an example here, missing data can happen if the computer used for data collection crashes. As this usually happens rarely and in an unsystematic manner, we generally simply discard such incomplete data and collect data from another participant. A different issue exists if data is missing systematically. This can happen in research on sensitive issues. For example, it is easy to imagine that in a study on sexual health, participants who have a sexually transmitted disease (STD) are unwilling to report this and could therefore simply not answer questions about their STDs. Discarding these cases of missing data is problematic, as doing so will bias the results towards lower rates of STDs than are actually present.
We have multiple observations per participant, 64 to be precise. These 64 observations are given in different rows. We call this data format -- in which the data from each participant potentially spans multiple rows, one row per observation -- the long format. This long format contrasts with a wide format that is also commonly found in the social sciences. In the wide format, the data of one participant only spans a single row. When a participant has multiple observations, these are given in different columns. For the procedures introduced in this book, we generally want the data to be in the long format. Note that if each participant provides only one observation, there is no difference between the long and the wide format.
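The two formats, and the conversion between them, can be illustrated with a toy example. The tool (pandas in Python) and the data below are assumptions for illustration, not the actual study data.

```python
import pandas as pd

# Toy long-format data: two participants, three trials each,
# one row per observation.
long_df = pd.DataFrame({
    "subno": [1, 1, 1, 2, 2, 2],
    "trial": [1, 2, 3, 1, 2, 3],
    "resp":  [1, 0, 1, 0, 0, 1],
})

# Long -> wide: one row per participant, one column per trial.
wide_df = long_df.pivot(index="subno", columns="trial", values="resp")

# Wide -> long: back to one row per observation.
back_to_long = wide_df.reset_index().melt(
    id_vars="subno", var_name="trial", value_name="resp"
)
```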
The variables differ in whether they contain numbers (all variables but `response`) or not (only `response`).
2.2 Data Types
Let us discuss the last point from above in more detail. A common intuition is to think about numbers when thinking about data. As we have seen in the example data, this need not be the case. Data does not have to be numbers. The `response` variable shows that we can use other values, such as words or phrases, to represent values of variables. However, the conception of data primarily as numbers is also not completely false. For example, the statistical analyses introduced in this book need to represent all variables in terms of numbers. Fortunately for us, the tools we will be using will generally convert data not using numbers into numeric data when necessary. This means we will use the type of data representation that makes it easiest to understand what the data stands for.
2.2.1 Numerical Versus Categorical Variables
An important issue that arises when thinking about data as numbers is that numbers can mean different things. One possibility is that numbers represent numerical information; that is, they represent a measurement, magnitude, or count of something. However, we can also use numbers in a broader sense in which they only serve as a label (e.g., numbers on football jerseys or telephone numbers). In this case, numbers only represent categorical information; each observation falls into one of a set of mutually exclusive categories. The meaning of the numbers has important consequences for how we use them. If numbers do not represent numerical information, most mathematical operations do not make sense. For example, it does not make much sense to calculate the average of two telephone numbers. We’ll demonstrate this with some variables from the example data.
Let’s begin with the loss
/gain
variable pair (we can consider them together, because the type of information is the same; the only difference is whether the number refers to a potential loss or a potential gain). For these variables, the meaning of numbers corresponds to the common understanding of numbers as a magnitude of something. Here, it is the magnitude of a potential loss and potential gain. We could understand these variables as measuring the magnitude of the potential loss and potential gain of a lottery. The larger the number, the larger the potential loss and gain. We can treat these variables as numeric variables in a statistical analysis, because (a) the numbers of the variables represent numeric information, and because (b) performing mathematical operations, such as addition or calculating the average of the numbers, is meaningful for this variable. For example, it would be useful to calculate the average potential loss/gain for a participant (i.e., we could interpret the average in a meaningful way, for example by comparing it with the average loss/gain in a different condition).
As a second example, let us consider the `subno` variable. Here, the numbers do not measure the magnitude of anything. Participant number 16 is not twice participant 8. From just looking at the variable, we also do not know what the values mean. As described above, one just needs to assign unique numbers in some way to each participant. For example, one could assign number 1 to the first participant who completed the experiment, number 2 to the second, and so forth. Alternatively, one could also assign number 1 to the first participant invited, number 2 to the second, and so forth. Another possibility is to specify the maximum number of participants, say 500, and then just assign a unique random number from 1 to 500 to each one. Importantly, we do not need to know which of these procedures was used. The only reason we have the `subno` variable is so we know to which participant a particular observation belongs. The numbers in `subno` only serve as a label identifying the participant. Instead of numbers, we could also use non-numeric labels, such as random strings of letters, as the participant variable. Consequently, it does not make much sense to perform any mathematical operation on the `subno` variable. For example, the average participant number does not provide any useful information.
For the purposes of this book, this distinction between these two data types is central: Should we treat a variable as a numerical variable or a categorical variable? The statistical methods introduced in the following chapters can only deal with these two types of variables (and categorical variables can generally also only serve the role of an explanatory variable and not as an outcome variable). So how can we identify whether a variable is numerical or categorical?
Usually, it is easy to identify categorical variables among the variables that have numbers. Whenever the numbers represent a label, that variable is usually a categorical variable. For example, in addition to the `subno` variable, the numbers in the `condition` variable only serve as a label to identify the condition. We cannot interpret the numbers of the `condition` variable shown in Table 2.1 as actually representing a numerical value of either 20.2 or 40.2. Instead, each of the four possible values of the variable, 20.2, 20.4, 40.2, and 40.4, refers to one of the four conditions of the experiment, with losses and gains ranging up to -$20/+$20, -$40/+$20, -$20/+$40, and -$40/+$40, respectively (note again that the value after the decimal point is the range of the potential losses and the value before the decimal point the range of the potential gains). And non-numeric variables in which the values are labels, such as the `response` variable, are also clearly categorical variables.
More difficult is the decision for the `resp` variable. Clearly, the two possible values of the response variable, to accept or reject the lottery, are response categories or labels. However, when transforming it into a variable with numbers 1 and 0, we can perform meaningful mathematical operations on it. As discussed above, the mean of the variable can be interpreted as the average proportion of lotteries that are accepted. More generally, any binary categorical variable (i.e., a categorical variable with two categories) is a special case in which treating it as a numerical variable can, in certain situations, be meaningful. However, whether or not it is meaningful depends on the situation. For example, if we want to calculate the proportion with which different observations fall into one or the other category, then it makes sense to represent a binary categorical variable as a numeric variable. In contrast, if it is one of the variables in our experimental design and we know how often each category appeared, then we might not need to do so. In general, it is best to explicitly treat a variable as categorical unless one is sure treating it as numerical is meaningful.
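A practical way to follow this advice is to declare such variables as categorical in whatever analysis software one uses, so that meaningless operations fail loudly instead of silently "working". A minimal sketch (pandas in Python is an assumption for illustration; the values come from the `condition` variable discussed above):

```python
import pandas as pd

condition = pd.Series([20.2, 20.2, 40.2, 40.2])

# Treated as numeric, nothing stops us from computing a meaningless mean.
meaningless = condition.mean()  # 30.2 -- a number, but not a condition

# Declaring the variable categorical makes the intended use explicit;
# arithmetic on it now raises an error instead of returning nonsense.
condition_cat = condition.astype("category")
# condition_cat.mean()  # would raise a TypeError
```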
To sum up, for the statistical purposes of this book we distinguish numerical variables and categorical variables. Numerical variables hold numerical information such as magnitudes of something or the degree with which something holds. For categorical variables the values of the variable serve as labels designating membership in one of a number of mutually exclusive categories. When categorical variables are part of an experimental design, we will later also call them factors.
2.2.2 Assumptions of Numerical Variables
What the case of the `resp` variable (i.e., the numerical representation of a binary categorical variable) shows is that the decision of whether something is a numerical or categorical variable can depend on the situation. To help with this decision, it is helpful to know what exactly is entailed by treating a variable as numerical. For the statistical methods used here, when we treat a variable as numerical we assume it represents continuous numerical information. What this means is that we assume that:

1. A certain difference or interval has the same meaning anywhere on the scale. For example, a difference of 1 unit of the variable means the same whether we add it to 10 or 20. We can see that this holds for the `loss`/`gain` variable pair, but we will later discuss examples where this is a questionable assumption. A corollary to this assumption is that calculating the mean for our variable must be meaningful in itself. If we cannot interpret the mean, a variable cannot be treated as numeric.

2. Our variable can in principle take on any real-valued (i.e., decimal) number. That is, even though we might have only used discrete values for our variables,13 such as for the `loss`/`gain` variable pair only a subset of the whole numbers between 6 and 40 (see Table 1.1), our statistical method assumes the in-between values are possible and in principle meaningful.
As we can see, for the `loss`/`gain` variable pair, the two assumptions are fully satisfied. However, the `loss`/`gain` variable pair is not actually an outcome that was measured in an experiment. Therefore, in a statistical analysis it will not play the role of a variable for which it is most important that the assumptions are fulfilled. Instead, this variable pair was part of the design of an experiment. Consequently, let us consider a few other example variables to see how well they fulfil the two assumptions for a numerical variable.
For example, the numerical outcome variable in this data set is `resp`. Clearly, `resp` does not fulfil the second assumption, as it only has two discrete outcomes, the values 0 and 1. However, we can calculate and interpret the mean (as the average proportion accepted). We could also assume that a specific difference, say a 0.1 (or 10%) difference, means the same whether it happens at an acceptance rate of 50% or an acceptance rate of 85%.14 So whereas some assumptions are violated, others are fulfilled. What this entails is that whether we can interpret the results from an analysis depends on the exact context and circumstances. For example, if our statistical analysis were to lead to results or predictions beyond the probability range of 0 to 1, this would clearly be problematic, as the results would not be meaningful. In other words, we would have learned very little of meaning about our data from such a statistical analysis.
A very popular kind of variable in psychology and related sciences involves subjective rating scales (also known as "Likert scales"). For example, we have discussed the study of McGraw et al. (2010) where participants in one condition were asked to rate the intensity of their emotional reaction to a potential loss or potential gain on a response scale ranging from 1 = "No Effect" to 5 = "Very Large Effect" (see Figure 1.1, unipolar intensity scale). Does this variable represent a numerical variable? Clearly, a value of 5 represents an emotional reaction that is larger than a value of 1. So the variable does represent a magnitude, but does it fulfil the assumptions spelled out above? We can take the average of the scale and interpret it in a meaningful way. Specifically, the average emotional intensity for participants in the loss condition, 3.6, was larger than the average emotional intensity for participants in the gain condition, 3.1. However, it is questionable whether a difference of 1 means the same everywhere across the scale. More specifically, is the difference between "No Effect" and "Small Effect" (the difference between 1 and 2) the same as the difference between "Moderate Effect" and "Substantial Effect" (the difference between 3 and 4)? Numerically it is, but whether this also holds psychologically is a question that is difficult to answer. Like most researchers, McGraw et al. (2010) treated this variable as a numerical variable and so made this assumption (which is also implicit in the process of calculating the average). The validity of their conclusions rests to some degree on whether or not we believe this to be reasonable.
Let’s try to generalise the conclusion of the previous paragraph in order to answer the following question: Can we only treat a variable as a numeric variable in a statistical model if it perfectly meets the two assumptions stated above? In an ideal world (statistical or otherwise!) the answer would be yes. But the everyday realities of data analysis always differ from the ideal. Many of the variables that we regularly encounter in our research violate one or the other of the assumptions to some degree. Yet, we still need to treat them as numerical variables, simply because treating them as categorical does not help us in answering our research questions. Whenever these assumptions are violated to any degree, this can be interpreted as another instance of an epistemic gap (specifically as an instance of the first epistemic gap, Section 1.3.1). The fact that the assumptions are violated opens up the possibility of an alternative explanation of the results that differs from our hypothesis. In other words, if the assumptions are perfectly met, the evidence provided by our statistical analysis is stronger than when the assumptions are only partially met.
The problem is that once we have numbers and treat them as a numerical variable, the computer treats the numbers the same way, which is to say, as if they satisfy the two assumptions we stated above. In more everyday language, "the numbers don't remember where they came from" (Lord 1953)15. Only we -- the researchers -- know where the numbers came from and need to take this into account when interpreting statistics. We can also interpret this insight in terms of the concepts introduced in the previous chapter. The numbers are part of the operationalisation; we establish a procedure that maps real world entities (above we have called these possible outcomes or states of affairs) onto values of the variables (which are in many cases numbers). The numbers that emerge from this procedure are related to our research question, but they are not identical with our research question. Any inference from the statistical results based on these numbers requires many auxiliary assumptions, one of which is that we assume that numerical variables are continuous. And as we can never be sure if all the auxiliary assumptions are true, we have to be careful and humble with the conclusions we draw from our research.
2.3 Measurement
So far, we have categorised different variables as they appear in a data set and discussed how we can integrate them into a statistical analysis. Here, we take a step back and consider in a principled manner how these variables are created. What we are discussing here is how the way we measure different variables in an experiment influences what we can infer from them.
2.3.1 Measurement Scales
One way to interpret the discussion above, about the meaning of the numbers we collect, relates to the first epistemic gap introduced in the previous chapter, the difference between our research question and its operationalisation (Section 1.3.1). Here, this epistemic gap can be understood as the distinction between the actual magnitude of a latent construct (e.g., the intensity of an emotional reaction or personal risk preference) and the measurement of the magnitude of the latent construct (i.e., the application of a procedure that assigns a number for that attribute to an observation). When speaking about measurement, a latent construct is also called an attribute. Thus, we are here concerned with the distinction between an attribute and its measurement. An important theoretical contribution to this distinction within the context of psychology comes from Stevens (1946). He claimed that we can distinguish four different types of measurement scales that differ with respect to which type of relationship between attributes they reveal: nominal, ordinal, interval, and ratio scales.
Before describing these four scales in detail, let us make clear what Stevens' (1946) goal was. He hoped that by using a particular measurement scale, we could bridge the epistemic gap between the measurement of an attribute and the attribute itself. For example, he claimed that measuring an attribute using an interval scale revealed that the attribute itself existed on an interval scale (we will see what this means in just a bit). This claim (or better: hope) is probably the reason why his idea was and still is extremely popular. However, as we have discussed before, bridging epistemic gaps is never easy, and this case is no exception. We will get back to this issue below after introducing the four scales in detail.
A nominal scale is equivalent to what we have termed a categorical variable, in which the values do not exhibit any quantitative relationship among each other. In addition to the examples discussed above, many demographic variables can be understood to be on a nominal scale such as gender (e.g., male, female, non-binary, or other) or handedness (right-handed, left-handed, or ambidextrous).
If a scale is constructed through rank orderings of attributes, it is called an ordinal scale. As a consequence, we can order attributes along a dimension but cannot make any further quantitative distinctions. A common example of an ordinal scale is the final result of a sports competition with first place, second place, and so on. The important aspect of an ordinal scale is that differences between values on the ordinal scale do not need to correspond to equivalent differences in the attribute that is measured with the ordinal scale. If we stay within the sports competition example, the difference between first and second place in terms of performance does not need to be the same as the difference between second and third place. For example, in the women's 100 m sprint final at the 2020 Olympics, the difference between first place (Elaine Thompson-Herah) and second place (Shelly-Ann Fraser-Pryce) was 0.13 seconds, whereas the difference between second and third place (Shericka Jackson) was only 0.02 seconds. In this case, the same difference on the ordinal rank scale (i.e., one rank difference) does not correspond to the same difference in the underlying attribute (i.e., performance, the time needed to sprint 100 m). Some demographic characteristics can also be understood to be on an ordinal scale, such as education levels (e.g., some primary or secondary school education, compulsory education up to age 16, college, or higher education or professional and vocational equivalents).
An interval scale results when differences, or intervals, between the values of the attributes on the scale have the same meaning across the scale. Thus, an interval scale can also be understood as fulfilling the requirements of a numerical variable.16 The typical example of an interval scale is temperature measured in either degrees Celsius (°C) or Fahrenheit (°F). Within each temperature scale, a 1 degree difference has the same meaning independent of the current temperature. Furthermore, both scales can be transformed into each other. For interval scales it also makes sense to calculate the mean (e.g., the mean temperature), but calculating ratios does not make sense. For example, saying that 40 °C is double the temperature of 20 °C is not really a meaningful statement (e.g., because 20 °C = 68 °F and 40 °C = 104 °F and \(2 \times 68 \neq 104\)).
The final scale type, the ratio scale, results when, in addition to maintaining the meaning of differences across the scale, the scale also contains a true zero point of the attribute. To stay within the example of temperature, whereas the zero points of degrees Celsius and Fahrenheit are arbitrary and these scales therefore represent only interval scales, the zero point of the Kelvin scale, 0 K, is the lowest possible temperature, making the Kelvin scale a ratio scale. Many physical scales are ratio scales, such as length or time. For example, it makes sense to say that a sprinter who took 20 seconds for the 100 m sprint took twice as long as a sprinter who only took 10 seconds, because 0 seconds is the true zero point of time.
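These properties of interval and ratio scales can be checked directly with a few lines of code. A minimal sketch (plain Python; the assertions simply re-express the °C/°F example from above):

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval scale: equal differences in degC map to equal differences
# in degF (a 20-degree C gap is always a 36-degree F gap) ...
assert c_to_f(40) - c_to_f(20) == c_to_f(30) - c_to_f(10)

# ... but ratios are not preserved: 40 degC is not "twice" 20 degC.
assert c_to_f(40) / c_to_f(20) != 2   # 104 / 68 != 2

# Kelvin has a true zero point, so ratios are meaningful:
# 400 K really is twice as hot as 200 K.
kelvin = [200.0, 400.0]
assert kelvin[1] / kelvin[0] == 2
```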
Whereas Stevens' four measurement scales are widely popular in psychology and related disciplines thanks to their prominence in most introductions and textbooks, their actual scientific contribution needs to be considered critically (following Michell 1997, 1999, 2002). What Stevens proposed is an attempt to bridge the epistemic gap between the attribute and its measurement. According to this position, if we have established an interval or ratio scale, we have learned that the underlying attribute exhibits an interval or ratio structure. Unfortunately, because of the problem of underdetermination discussed before, learning about the actual structure of a theoretical construct or attribute is not that easy. Just because the numbers look like a numerical variable does not mean that the underlying attribute behaves like a numerical variable. Therefore, I do not recommend using the four different measurement scales to discuss psychological measures. Instead, in this book we will use the theoretically more neutral terms categorical and numerical variables. As discussed in the previous chapter, this way our description of the research remains on the level at which it is performed, the level of the operationalisation and statistical analysis.17
Even though Stevens' idea of measurement scales is ultimately untenable in psychology, his distinction is nevertheless helpful as it again allows us to understand the limits of what we can learn from data. What we can take away from the present discussion is that the common measurement approach in psychology and related fields is generally only able to establish ordinal relationships. For example, subjective rating scales, but also choices among lotteries, only represent ordinal relationships in terms of the underlying latent attributes. Nevertheless, we generally treat data from such variables as numerical in statistical analyses. This again reinforces the point made before that we cannot interpret the results from statistical analyses as directly answering our research questions. Our statistical analysis generally makes assumptions about the nature of the underlying attribute or construct that we cannot verify.
However, this does not mean that all hope is lost. As Stevens (1946) already mentioned when introducing ordinal scales, treating them as numeric is not always pointless (p. 679): "In numerous instances it leads to fruitful results." We just have to be mindful that measurement in psychology is generally not the same as measurement in physics when interpreting the results. We primarily learn something about our operationalisations and not directly about the theoretical constructs that we use when formulating our research questions.
2.3.2 Reliability and Validity
The main message of the previous section is that measurement of mind and behaviour is not as straightforward as measuring physical attributes such as length. Nevertheless, we should aim to use measures that are of high quality. Two concepts that are important for judging the quality of measurement are reliability and validity.
Reliability refers to the consistency of a measure. One intuitive way to understand reliability is the extent to which we obtain the same value when we measure something under the same conditions, but at different times. Reliability is also inversely related to noise in the measurement process. A measure has a high reliability if repeated measurements under the same conditions lead to very similar outcomes (i.e., the level of noise in the measurement process is low). A measure has low reliability if repeated measurements under the same conditions lead to widely different outcomes (i.e., the level of noise in the measurement process is high). For example, consider an ordinary bathroom scale. We expect such a scale to have a very high reliability. We should get pretty much the exact same results if we step on it several times in a row, as long as we do not change our weight in between (e.g., by drinking a glass of water).
Validity refers to the extent to which a measure measures what it is supposed to measure. In terms of the concepts introduced in this book, validity thus refers to the ability of a measure to bridge the epistemic gap between the operationalisation of a construct and the actual (i.e., "true") value of the construct. Given the difficulties in defining or even establishing the constructs researchers are interested in, establishing whether a measure has high or low validity is generally difficult.
One way to visualise both reliability and validity is given in Figure 2.1 below. Here, each panel represents one operationalisation or measure and each shot on the target represents one measurement with that measure. We can see that reliable measures have low levels of noise around one mean value (i.e., a low level of dispersion of the shots). In this figure, validity is visualised as a bias with respect to the centre of the target. Valid measures are centred on the target whereas invalid measures are centred around an off-target value. What this figure highlights is that reliability and validity are in principle distinct qualities. A measure being reliable does not mean it is valid. And likewise, a valid measure does not have to be reliable.
Figure 2.1: Visualisation of reliability and validity as shots on a target. Figure is taken from "Statistical Thinking for the 21st Century" by Russell A. Poldrack (Figure 2.1).
Whereas the visualisation in Figure 2.1 should provide a good first intuition about reliability and validity, the conceptualisation of validity as a bias is not the only way to think about it. To stay within the metaphor provided by the figure, an invalid measurement could also be one that aims for the floor instead of the target. Or even something completely wrong, such as shooting darts when the goal is to shoot bullets. The problem with validity is that we often don't know how to better define or measure what we are hoping to measure, so assessing validity is typically not easy.
The previous paragraph already points to an important distinction between reliability and validity. Usually there is a way to quantify reliability, but quantifying validity is only possible for specific interpretations of validity.
The most common ways to quantify the reliability of a measure are:
Split-half reliability or internal consistency refers to the reliability estimate that results from splitting a measure into sub-measures and comparing the scores across sub-measures (e.g., by calculating a score for all odd items and comparing it with the score for all even items; see the sketch below). Generally, this kind of reliability can be improved by making a measure longer (i.e., adding additional items). Therefore, longer measures are often more reliable than shorter measures (which also makes intuitive sense -- the shorter a measure, the more likely noise plays an important role).
Test-retest reliability refers to the extent to which you get the same outcome when applying the same measure at different times. When using a measure such as a questionnaire, the difficulty in calculating the test-retest reliability is the possibility of so-called memory consistency effects or temporal instability. Memory consistency effects refer to the observation that participants often prefer to be self-consistent with their previous answers if they remember them, thus potentially artificially increasing the reliability. Temporal instability, on the other hand, can result in artificially lower reliabilities if the measured construct varies naturally over time (such as mood).
Inter-rater reliability only applies to measures in which an outside judge is used to make a measurement. Prominent examples of situations in which one can calculate inter-rater reliabilities are medical diagnoses (i.e., by comparing the diagnosis across multiple doctors) or essay marks (when marked by multiple independent markers).
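Of these, split-half reliability is the easiest to compute directly. Below is a minimal sketch (Python with NumPy; the simulated questionnaire data and all parameter values are made up for illustration) of the odd/even split mentioned above, combined with the standard Spearman-Brown correction, which adjusts the correlation between the two half-length scores to estimate the reliability of the full-length measure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated questionnaire: 50 participants x 10 items. Each participant
# has a "true score" plus item-level noise (entirely made-up data).
true_score = rng.normal(size=(50, 1))
items = true_score + rng.normal(scale=1.0, size=(50, 10))

# Split-half: compare summed scores on the odd vs the even items.
odd = items[:, 0::2].sum(axis=1)
even = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]

# Spearman-Brown correction: estimate the reliability of the
# full-length measure from the two half-length measures.
reliability = 2 * r_half / (1 + r_half)
```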
To quantify validity we need to have an external criterion that also measures the construct of interest. When we have such a criterion, we can compare the value on the criterion with the value on our measure, which gives us an estimate of the criterion validity of the measure. For example, in the previous chapter we discussed that there exist different tasks and questionnaires for measuring an individual's risk preferences. Suppose we also had access to a person's financial history showing to what degree they invest their money in relatively high-risk financial instruments (e.g., stock options), medium-risk ones (e.g., individual stocks), or lower-risk ones (e.g., index funds). A risk preference measure with high criterion validity would be one for which participants who have a high score on the measure invest more of their actual money in high-risk investments. As we have discussed in the previous chapter, it is unclear if such a measure of risk preferences exists.
Other types of validity are non-quantifiable. Construct validity is usually considered the most important type of validity, as it refers to all empirical and theoretical support that provides evidence that a measure measures what it is supposed to measure. As construct validity essentially asks how well a measure bridges the first epistemic gap, it is more an abstract concept in terms of which one talks about measures than an actual criterion used to describe measures. In fact, I am not aware of a psychological measure that has construct validity.
Another interesting type of validity is face validity. This refers to the degree to which a test appears to measure what it is supposed to measure. Whereas face validity is not objectively important for the validity of a test (i.e., what matters is not if a test looks like it measures what it is supposed to measure, but if it does so), it can be important for the commitment of participants. For example, if participants are intrinsically interested in contributing their time and effort to a research question, then a measure with high face validity (i.e., one that matches their interest) should reduce participant drop-out compared to a measure with lower face validity.
Considerations of both reliability and validity should be incorporated into the overall assessment of the results when judging the empirical evidence provided by a given study. For example, if the dependent variable of a study only consists of a single item or response, this generally implies low internal consistency compared to a dependent variable based on more items or responses. In this case, the level of noise should be relatively high, which should influence how strongly we weigh the evidence from such a study.
2.4 Independent and Dependent Variables
In addition to distinguishing the types of information variables can contain, we can also distinguish the different roles variables can play in the research process. Remember, when discussing the operationalisation step of the research process, we noted the need to identify relevant variables that we hope can address the research question, as well as an empirical hypothesis involving at least two variables. As we will see here, we can assign two different roles to these (at a minimum) two different variables.
Usually, there is a single variable which represents the main result in which we are interested. In psychology, this is called the dependent variable. Synonyms for "dependent variable" that are common in the statistical literature are response variable, outcome variable, or criterion. The dependent variable is the main outcome of our study, the variable we are primarily interested in measuring.
The other variable(s) is called independent variable(s) in psychology, because we believe the values of the dependent variable depend on the values of the independent variable(s). Popular synonyms for "independent variable" in the statistical literature are predictor and covariate, as the dependent variable is assumed to covary with the independent variables.18 Above, we have also used the term explanatory variable to describe independent variables. Loosely speaking, we can describe the distinction such that a study is about the effect of the independent variable on the dependent variable.19
Let’s look at this distinction in the loss aversion study of Walasek and Stewart (2015), the data of which is shown in Table 2.1 above. Their hypothesis was that what matters for people's preference for symmetric 50-50 lotteries is not the absolute value of the potential loss and gain, but the relative rank of the lotteries compared to other lotteries. The operationalisation of this research question involved the manipulation of the range of lotteries in one variable, condition
, and measuring participants' responses to symmetric lotteries in the response
variable. Here the distinction between both variable types is relatively straight forward. We are interested in the effect of condition
on response
: What is the impact of different ranges of lotteries as manipulated across conditions, on participants' responses. This makes condition
the independent variable and response
the dependent variable.
In general, the distinction between independent and dependent variables is easy to understand in an experiment, such as the example discussed here. In an experiment, independent variables are manipulated, here `condition`. "Manipulated" means we assign participants to the different conditions, rather than measuring which condition a participant is in.
Not all studies are or can be experiments in which the independent variable is manipulated. For example, a common research question is the effect of a demographic variable on an outcome. However, demographic variables cannot readily be manipulated or assigned to participants. For example, we might be interested in studying the effect of parental wealth on children's educational attainment. Whereas manipulating parental wealth is in principle possible, a more common approach is to measure this variable as well as children's educational attainment. Nevertheless, we can still make the distinction between the independent variable (parental wealth) and dependent variable (educational attainment).
Not all variables in an experiment neatly fall within the distinction of independent and dependent variables. For example, let us go back to the six variables shown in Table 2.1 that make up the full data set collected in the study by Walasek and Stewart (2015). As discussed above, the `response`/`resp` variable pair is the dependent variable and `condition` the independent variable. This leaves us with three further variables that need to be classified. Have a go and try to classify the other three variables before continuing reading.
Let’s begin with the loss
/gain
variable pair that determines the possible outcomes of a lottery shown to participants. Clearly, these are also manipulated. More specifically, the condition
determines exactly which lotteries and therefore which values of the loss
/gain
variable pair a participant works on. Thus, the loss
/gain
variable pair are also independent variables that jointly determine the independent variable of condition
. In other words, if we did not have the condition
variable in this data set we could determine it from the loss
/gain
variable pair. This means that determining the independent variables in this study depends on the perspective one takes. If we only focus on the main research question and symmetric lotteries then condition
is the independent variable. However, if we also look at the lotteries individually, then condition
and the loss
/gain
variable pair are the independent variables.
Table 2.1 contains one more variable, `subno`, the participant identifier. It might seem a bit surprising to think about this variable in terms of independent and dependent variables, but we should be able to classify this variable somehow. Clearly, `subno` is not a dependent variable. We are not interested in the values or results of the `subno` variable. However, it is also not clearly an independent variable. We do not have any specific expectations or ideas of how different participants affect the response variable. To help us with the classification, recall why we have the `subno` variable in the data in the first place -- because we collect data from multiple participants and need to identify to which participant an observation belongs. The follow-up question to this is: Why do we collect data from multiple participants? As discussed in the previous chapter (Section 1.3.2), participants are a source of noise in our experiment, as different participants can do what they do for a multitude of reasons. If we only had data from one participant, we could not distinguish between the idiosyncratic noise of that participant and the signal we are interested in. By collecting data from multiple participants, we try to control for the noise by averaging over it, with the hope that what remains is the signal. The overarching idea is that noise has an unsystematic effect on the results; some participants may be more likely to show a particular behaviour whereas other participants may be less likely to show that behaviour. On average, the noise cancels out. In sum, we collect data from multiple participants to control for noise that is inevitable when dealing with real people. Thus, we could say `subno` is a control variable. We could even be more specific and say it is a control variable and an independent variable, because we use it to control noise at the level of the design. In other situations we might measure a variable for control purposes, in which case we could call it a control variable and a dependent variable.
To sum this section up: Jointly, the dependent and independent variables are the key components of the operationalisation of a research question. They are the central concepts that make up the study design and link the practical reality of the research (i.e., how the research actually takes place and what is measured) with the research question. It is not wrong to say that a study is defined primarily by its dependent and independent variables. When designing one's own study, making it clear what the dependent and independent variables are is maybe the most important decision after having decided on a research question. Likewise, when reading a scientific article describing a study, understanding clearly what the independent and dependent variables are is central to understanding the study. Therefore, whenever thinking and talking about any research, make sure to be clear what the dependent and independent variables are. In experimental research this often boils down to asking: What was the task of the participants? Other variables that are part of a study usually serve some control purpose and can be denoted control variables.
2.5 Experimental versus Observational Variables
As we have seen when discussing the distinction between independent and dependent variables, we can distinguish different types of independent variables or research designs, namely experimental and non-experimental independent variables. Here we will adopt the common terminology and use observational variable to describe non-experimental independent variables. If a study solely consists of experimental variables (we drop the "independent" part from now on because experimental or observational variables are always independent variables), we can call it an experimental study, or experiment for short. If a study solely consists of observational variables, we can call it an observational study. If a study contains both experimental and observational variables, there is no agreed-upon name; depending on which variable is more relevant to the research question, researchers tend to use either experimental or observational study. However, as experimental variables provide a number of evidential benefits that will be discussed here, there is a tendency to call a study an experiment even if it also contains observational variables. Depending on the actual situation and the inferences drawn, this can be seen as a stretch.
An experimental variable is one that is under the control of, and can be manipulated by, the researcher. What this means is that the values of the variable can be assigned to participants by the researcher.
For example, in the study of Walasek and Stewart (2015), the researchers assigned each participant to be in one of the four conditions corresponding to a different range of potential losses and gains. Likewise, in the previous chapter, we briefly introduced the study of Hinze and Wiley (2011) on the generality of the testing effect. After an initial reading of a piece of text, participants were assigned to one of three experimental conditions: a control condition in which they could re-read the materials, a testing condition using open-ended questions, and a testing condition using a fill-in-the-blank text. In both these studies, the researchers decided which condition a participant was part of.
The important part of an experimental variable is not only that participants can be assigned to different conditions, but how they are assigned. More specifically, for an experimental variable, the assignment needs to be performed randomly; we say participants are randomised to the available conditions. One way to understand random assignment is that, before the experiment takes place, the probability of being in any of the experimental conditions needs to be the same for every participant.20
For example, random assignment means that for every participant in the study of Walasek and Stewart (2015), the probability to be in any of the four conditions is 0.25 (i.e., 1/4). For every participant in the study of Hinze and Wiley (2011), the probability to be in any of the three conditions is approximately 0.33 (i.e., 1/3).
You can imagine randomisation as an actual physical process that produces a random outcome, such as the toss of a coin or the throw of a die. For example, for the study of Hinze and Wiley (2011) we could imagine that for every participant who takes part in the experiment, the researcher (or research assistant) throws a regular six-sided die. If the die lands on 1 or 2, the participant is assigned to the re-reading condition; if the die lands on 3 or 4, the participant is assigned to the open-ended question condition; and if the die lands on 5 or 6, the participant is assigned to the fill-in-the-blank condition. Alternatively, if we pre-specify the sample size and want to ensure that every group is of approximately the same size, we could use a different approach. We could prepare as many sheets of paper as the number of participants we want to collect. On each sheet we write one condition so that, among all sheets, each condition appears equally often. Then, we shuffle all sheets in a bowl to randomise their order. When performing the experiment, we take one sheet out of the bowl for every participant (without putting it back) and assign the participant to the condition written on the sheet. Back in the day, this was common practice, but nowadays randomisation is mostly done through a computer using so-called random number generators.
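Both procedures translate directly into code. A minimal sketch (Python; the condition names follow the Hinze and Wiley (2011) example above, everything else, including the sample size, is illustrative):

```python
import random

random.seed(1)  # fixed here only for reproducibility of the example

conditions = ["re-read", "open-ended", "fill-in-the-blank"]

# Simple random assignment (the die-roll procedure): every participant
# has probability 1/3 of ending up in each condition.
simple = [random.choice(conditions) for _ in range(30)]

# Balanced random assignment (the bowl procedure): write each condition
# on an equal number of "sheets", shuffle them, and hand them out.
n_per_condition = 10
sheets = conditions * n_per_condition
random.shuffle(sheets)
balanced = sheets  # sheets[i] is the condition of participant i
```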
Different from experimental variables are observational variables, variables that are not under the control of the researcher and are not randomly assigned to participants. In other words, whenever randomisation is impossible, an independent variable is an observational variable.
Above we talked about demographic characteristics, in particular parental wealth, as an independent variable. As already described, demographic variables generally cannot be randomly assigned but are a mostly immutable part of a person. Consequently, demographic characteristics (e.g., age, gender) are generally observational variables. The same is true for many other psychological characteristics of a person such as personality traits (e.g., extraversion) or abilities (e.g., intelligence quotient). In the vast majority of cases, such variables are observational variables.21
At this point you might wonder why an experiment necessarily entails randomisation of the independent variable. What is the benefit of an experimental over an observational variable? The reason for this is that only randomisation allows drawing causal inferences from a study. Only with an experimental -- that is randomised -- independent variable can we say that the independent variable is the cause of the dependent variable. Without randomisation (i.e., when dealing with an observational independent variable) such an inference is not permitted.
Remember that above, we said one way to think about the distinction between dependent and independent variables is that we want to learn about the effect of the independent variable on the dependent variable. We did not specify what we meant by "effect", but we did say that this way to think about dependent and independent variables only holds loosely. In fact, when we say "effect", we mean a causal relationship; changes in the independent variable cause changes in the dependent variable. In the absence of such a cause-effect relationship, it seems wrong to speak of an "effect of the independent variable". And we can now see the reason why we said this only holds loosely, as it only holds if the independent variable is an experimental variable, not an observational variable.
2.5.1 Epistemic Gap 3: Causal Inference and Confounding Variables
The reason only an experimental variable allows a causal inference is another epistemic gap: the possible influence of confounding variables when dealing with observational data. A causal inference means we learn that the independent variable, and nothing else, is responsible for the effect observed on the dependent variable. A causal inference is only possible if plausible alternative explanations for the observed effect on the dependent variable that do not involve the independent variable can be ruled out. In the context of a causal inference, such an alternative explanation is known as a confounding variable, or confounder for short. If we randomly assign participants to conditions, we can theoretically rule out confounders as an alternative explanation. However, in the case of an observational variable we cannot; there may be other factors, the confounders, that are related to the observational variable and that are actually responsible for the observed effect. Similar to the first epistemic gap, the inference that the independent variable is the cause of the effect on the dependent variable is underdetermined for an observational variable, but not so for an experimental variable.22
Let us illustrate this problem with a new example. Imagine you want to investigate the effectiveness of a novel drug against a control treatment (i.e., an old drug) for treating a viral infection in a hospital setting. The independent variable is the treatment (control treatment versus novel drug) and the dependent variable is viral load (i.e., whether or not the virus can still be detected in the patient). Let us imagine that the results show that the novel drug is more effective than the control treatment. That is, patients show a lower viral load (i.e., are less sick) after the novel treatment than after the control treatment. The question we are trying to answer now is whether it matters for the inference we can draw from this study if the assignment to treatment conditions is random or not.
Let us begin by considering a non-random assignment of the independent variable. For example, one way to implement non-random assignment would be to use the novel drug in one hospital and the control treatment in another. If we were to run such a study, would this allow us to conclude that a difference in the dependent variable is due to the difference in treatment? This inference would only be permitted if the two hospitals were identical in all other relevant respects. If there were any systematic differences between the hospitals -- say, patients in the hospital with the control treatment are on average older than patients in the hospital with the novel treatment (because the patients are drawn from different neighbourhoods) -- then this difference could be responsible for the difference in the dependent variable.
The systematic difference in age between the hospitals plays the role of a confounder. Because the average age of patients differs between the two hospitals, age may also be responsible for the difference in the dependent variable: older patients are less likely to recover and thus have a higher viral load at the end of the study than younger patients. In a situation in which a confounder is present, we cannot infer that the treatment is the cause of the observed effect. We again have the logical structure indicative of underdetermination: The two conditions differ in terms of two characteristics, treatment status and age, so either of them (or both) could be responsible for the difference in the dependent variable.
Let’s now consider a situation in which participants are randomly assigned to the two treatment conditions. Here again, we have data from two hospitals, but participants in each hospital are randomly assigned to the treatment conditions. For example, for each new patient who shows the relevant symptoms, the doctor administering the treatment takes a pre-randomised envelope that contains either the old drug or the novel drug.23 Because the assignment to conditions is random, we would expect confounders such as age to be balanced across the two conditions (which can of course be checked afterwards). Every patient who comes to either of the two hospitals has the same chance of receiving the control treatment or the novel drug, independent of their age or other characteristics. Consequently, as long as randomness did not introduce an accidental confound (see Section 1.3.2), we can attribute the effect on the dependent variable to the independent variable.
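A small simulation can make the balance argument tangible. The sketch below assumes a hypothetical patient population in which the two hospitals differ systematically in age; all names and numbers are ours, chosen only for illustration. Because each patient's treatment is drawn at random, the mean age ends up nearly identical in the two conditions:

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical population: patients in hospital A are on average
# older than patients in hospital B (age is the potential confounder).
patients = (
    [{"hospital": "A", "age": rng.gauss(70, 8)} for _ in range(500)]
    + [{"hospital": "B", "age": rng.gauss(50, 8)} for _ in range(500)]
)

# Pre-randomised envelopes: every patient receives the control
# treatment or the novel drug with probability 0.5, regardless of
# hospital or age.
for patient in patients:
    patient["treatment"] = rng.choice(["control", "novel drug"])

# The balance check that can be done afterwards: mean age should be
# nearly identical across the two treatment conditions.
for treatment in ("control", "novel drug"):
    ages = [p["age"] for p in patients if p["treatment"] == treatment]
    print(treatment, len(ages), round(statistics.mean(ages), 1))
```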
In summary, the difference between an experimental and an observational variable is the degree to which you can rule out possible alternative explanations. For an experimental variable for which participants are randomly assigned to conditions, in theory, all confounding variables should be balanced. As long as the randomisation proceeded as planned, one can be certain that the only alternative explanation for the effect of the independent variable on the dependent variable is random chance. We might always get unlucky, and a confounding variable may just happen to be unbalanced in our data set. However, with larger sample sizes and larger effects, an explanation due to chance becomes increasingly unlikely. So for experimental variables, the only epistemic gap when judging whether the independent variable is genuinely responsible for the observed effect on the dependent variable is the effect of random chance or noise (see Section 1.3.2).
For example, in the main study that led to the approval of the BioNTech Covid-19 vaccine (Polack et al. 2020), over 40,000 participants were randomly assigned to either receive the real vaccine or a placebo (a saline injection without active ingredients). Randomisation was done in equal proportions so that both conditions had more than 20,000 participants. Among the participants who received the vaccine, only 8 developed Covid-19. Among the participants who received the placebo, 162 developed Covid-19. Whereas from these results we cannot definitively rule out that there is some confounding variable that explains the difference in contracting Covid-19, it seems extremely unlikely. Participants for this trial were recruited from six different countries (e.g., USA, Turkey, Brazil) and were diverse in their demographic characteristics (e.g., sex, ethnicity, age, weight), but these characteristics were extremely similar for the participants in both conditions (Polack et al. 2020, Table 1).
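As a back-of-the-envelope check, we can turn these counts into a risk ratio. The sketch below rounds both group sizes to 20,000 (the paper reports the exact denominators; we only know "more than 20,000" from the text here), so the resulting efficacy figure is approximate:

```python
# Approximate group sizes; Polack et al. (2020) report the exact
# denominators, but "more than 20,000 per condition" suffices here.
cases_vaccine, n_vaccine = 8, 20_000
cases_placebo, n_placebo = 162, 20_000

risk_vaccine = cases_vaccine / n_vaccine   # proportion infected, vaccine arm
risk_placebo = cases_placebo / n_placebo   # proportion infected, placebo arm

# Vaccine efficacy is conventionally defined as 1 minus the risk ratio.
efficacy = 1 - risk_vaccine / risk_placebo
print(f"risk ratio: {risk_vaccine / risk_placebo:.3f}")  # about 0.049
print(f"efficacy:   {efficacy:.1%}")                     # about 95%
```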
For an observational variable for which participants are not randomly assigned to conditions, we do not know whether there is a potential confounding variable. One way to address this problem is to measure known confounding variables and show that they are not responsible for the difference in the dependent variable. But even when we are able to control or measure a large number of possible confounding variables, we can never be certain that there is not yet another, unobserved confounding variable that is responsible for the effect. So for observational variables we always have to deal with two epistemic gaps when judging whether the independent variable is responsible for the observed effect on the dependent variable: the problem of possible confounders plus random chance or noise.
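To illustrate what "showing that a measured confounder is not responsible" can look like, here is a minimal sketch of a stratified comparison. The data are simulated under our own assumptions: the treatment has no effect at all, but older patients are both more likely to receive the control treatment and less likely to recover. A naive comparison then favours the novel drug, whereas comparing within age strata removes the spurious advantage:

```python
import random
import statistics

rng = random.Random(3)

# Simulated observational data: assignment depends on age, the outcome
# depends on age, and the treatment itself does nothing.
patients = []
for _ in range(20_000):
    old = rng.random() < 0.5
    novel = rng.random() < (0.3 if old else 0.7)      # non-random assignment
    recovered = rng.random() < (0.4 if old else 0.8)  # age drives recovery
    patients.append((old, novel, recovered))

def recovery_rate(rows):
    return statistics.mean(1.0 if recovered else 0.0 for _, _, recovered in rows)

# Naive comparison: mixes the (non-existent) treatment effect with the
# age imbalance, so the novel drug looks better.
for novel in (True, False):
    rows = [p for p in patients if p[1] == novel]
    print("novel" if novel else "control", round(recovery_rate(rows), 2))

# Stratified comparison: within each age group, the two treatments show
# (nearly) the same recovery rate -- the measured confounder is
# accounted for.
for old in (True, False):
    for novel in (True, False):
        rows = [p for p in patients if p[0] == old and p[1] == novel]
        print("old" if old else "young",
              "novel" if novel else "control",
              round(recovery_rate(rows), 2))
```

Of course, stratification only helps for confounders we have actually measured; an unmeasured confounder would leave the naive and the stratified comparison equally misleading.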
To end this section, let’s come back to the example of our trial testing a new drug in a hospital setting. After the lengthy discussion of observational versus experimental variables, you can hopefully see that only administering the new drug in one hospital and the control treatment in another hospital is a bad idea. Without proper randomisation of participants to treatments, the inference that the drug is responsible for the effect on the viral load is very weak, owing to the possible influence of confounders. You might even go so far as to wonder who would ever run such a study without proper randomisation, or believe the corresponding results.
Sadly, a study that did pretty much exactly what we have sketched above -- administering the novel drug only in one hospital and the control treatment in another hospital, with patients systematically differing between hospitals -- played a very unfortunate role during the Covid-19 pandemic. In particular, the first study to suggest that Hydroxychloroquine was effective against Covid-19, the study by Gautret et al. (2020), had exactly this problem.24 Whereas critics were quick to point out this and other problems with the study (Bik 2020; Rosendaal 2020; Sayare 2020), the damage was done. The then US president Donald Trump praised Hydroxychloroquine as a wonder cure for Covid-19. It required much scientific effort and many follow-up studies, using resources that could potentially have been used more productively elsewhere, to show that it is not (for a full timeline of events see Sattui et al. 2020). The problem in this case was that, whereas medical and statistical experts could immediately see the problems with the study, the general public could not. And once a false claim that appears to be scientific is established in public discourse (i.e., with the media reporting along the lines of "researchers have shown that ..."), it is often difficult to combat. In general it seems that assessing the empirical evidence provided by a particular scientific study is either beyond the expertise available to the mass media or something they are unwilling to invest the time in.
2.5.2 Is Causal Inference from Observational Data Possible at All?
What the previous section argues is that causal inference is generally only possible with experimental independent variables. With observational variables, there can always be a confounder that is responsible for the effect instead of the independent variable. However, many interesting research questions cannot be investigated with experiments, only through observational variables. As we have discussed above, demographic variables and other immutable features of individuals, such as personality traits, are observational variables by definition. Likewise, many variables relating to lifestyle choices, such as dietary or exercise habits, might in principle be amenable to experimental manipulation, but in reality it is difficult to impossible, or completely unethical, to run the corresponding experiments. Does this mean we cannot draw causal inferences for such research questions? I believe the honest and realistic answer is that in the vast majority of cases we cannot. In my eyes, a fair assessment of the situation is that causal inference from observational data is literally the most difficult problem in the empirical sciences.25
Importantly, causal inference from observational data is not primarily a statistical problem. We have introduced the problem that confounders pose as an epistemic gap. And as for the other epistemic gaps, overcoming it requires diverse and conceptually strong evidence. There are statistical methods that can assist in providing such evidence, but on their own they cannot provide the type of compelling evidence that is needed. The problem is that even if the observational data strongly suggest something, there is always the possibility that a confounder was missed or not adequately taken into account.
As an example of the problem, let us consider the case of vitamin supplements, specifically vitamin C and E supplements (Lawlor et al. 2004; Woodside et al. 2005; Mozaffarian, Rosenberg, and Uauy 2018). Early evidence from large observational studies in the 1990s, with tens of thousands of participants, suggested that taking vitamin C and E supplements considerably reduces the chance of getting cancer and cardiovascular diseases. Based on these positive results, large-scale experiments (i.e., also with tens of thousands of participants) followed in which participants were randomly assigned to either take vitamin supplements or a placebo (a sugar pill without vitamins) and were monitored for several years. By and large, these experiments could not replicate the positive effects found in the observational studies. Unless an individual is susceptible to a vitamin deficiency, vitamin supplements do not appear to have a measurable health benefit. The likely reason for the difference between the observational studies and the experiments is an insufficient adjustment for socioeconomic status as a confounder. As is often found, participants with a higher socioeconomic status were healthier (i.e., less likely to develop cancer and cardiovascular diseases), but they were also more likely to take vitamin pills (because they believed them to be helpful). Whereas the observational studies measured and tried to account for differences in socioeconomic status between participants who took vitamin supplements and those who did not, they only did so partially [e.g., they did not account for differences in the socioeconomic status of the parents of the participants, which led to developmental differences that also affected the probability of developing cancer and cardiovascular diseases as well as the probability of taking vitamin pills; Lawlor et al. (2004)].
What this example shows is that even in a situation in which the confounder is in principle known (i.e., socioeconomic status) and the observational data sets were large (> 10,000 participants), causal inference from observational data was not possible. Even after attempting to account for the confounding, the observational data suggested a relationship that turned out to be spurious. The underlying problem was that it was not possible to accurately measure the full influence of the confounder before knowing that the observed relationship was in fact spurious. Only an experiment was able to reveal that there was no effect of vitamin supplements. This example suggests that for many research questions and data sets common in psychology and related disciplines, causal inference from observational data is equally difficult or even impossible. Additional complications are that data sets are often considerably smaller and that less is known about the causal relationships existing in a domain (i.e., about which variables could act as confounders).
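The logic of this episode can be mimicked in a few lines of simulated data. In the sketch below -- our own toy model, not the actual studies -- socioeconomic status raises both the probability of taking supplements and later health, while the supplements themselves do nothing. The observational comparison then shows a healthy-looking "supplement effect" that vanishes once assignment is randomised:

```python
import random
import statistics
from math import exp

rng = random.Random(7)
N = 10_000

def simulate(randomised):
    """One study: SES drives both supplement use and health;
    the supplements themselves have zero effect."""
    rows = []
    for _ in range(N):
        ses = rng.gauss(0, 1)                           # socioeconomic status
        if randomised:
            takes = rng.random() < 0.5                  # coin-flip assignment
        else:
            takes = rng.random() < 1 / (1 + exp(-ses))  # SES-driven choice
        health = 0.5 * ses + rng.gauss(0, 1)            # no supplement term
        rows.append((takes, health))
    mean_health = lambda flag: statistics.mean(h for t, h in rows if t == flag)
    return mean_health(True) - mean_health(False)

print("observational difference:", round(simulate(randomised=False), 2))  # clearly > 0
print("experimental difference: ", round(simulate(randomised=True), 2))   # close to 0
```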
As a consequence of the problem with observational data, the current book primarily focuses on experimental data sets and, where non-experimental variables are considered, their limitations will be discussed. Whereas focussing on experimental data restricts the type of research questions that can be investigated, it at least eliminates one of the three epistemic gaps introduced here. This also means that applying the methods introduced here to observational data sets will require additional care when trying to draw justifiable conclusions and is not recommended. To repeat what we have said above, the question of whether an effect found in observational data reflects a causal relationship is not a statistical question. So the statistical tools introduced here can determine if there is any effect to be explained, but cannot provide an answer to the question of whether a relationship in observational data is causal.
For researchers interested in analysing observational data, good introductory texts that attempt to approach the problem of confounders in a principled manner are Rohrer (2018), Gelman, Hill, and Vehtari (2021), McElreath (2020), Hernán and Robins (2021), and Shadish, Cook, and Campbell (2002). Note that, given the additional epistemic gap that needs to be bridged, these methods are more advanced than the methods introduced here and require more advanced technical and mathematical knowledge than what we use. After all, drawing causal inferences from observational data is literally the most difficult problem in the empirical sciences.
2.5.3 Internal versus External Validity
Above we have talked about validity in the context of measurement. In that context, the question of validity is whether a measure measures what it is supposed to measure (e.g., a risk-attitudes questionnaire has high validity if it really measures risk attitudes). However, the term "validity" is also used in the context of experimental versus observational studies. In this context, the two relevant types of validity are internal validity and external validity. These terms do not refer to a specific measure but are used to describe complete studies or research designs. We provide a brief introduction to these two terms here; for more detail, see Shadish, Cook, and Campbell (2002).26
Internal validity refers to the internal structure of a study and reflects the degree to which the study provides evidence for a causal relationship between the independent and dependent variables. This means that, generally speaking, internal validity is high if a study is an experiment (i.e., the independent variable is randomised) and low if a study is an observational study.27 Within the terminology of the epistemic gaps introduced in this book, internal validity is related to the third epistemic gap. Internal validity is high only when we can be sure there are no possible confounders, and we can only be sure of this when we have an experimentally manipulated independent variable.
External validity refers to the degree to which the results of a study generalise to different settings, such as different situations, people, stimuli, and times. Within the terminology of the epistemic gaps introduced in this book, external validity is related to the first epistemic gap, the underdetermination of theory by data. The more we can be sure that our results really address our research question and are not confined to the specifics of our operationalisation, the more we can be sure that the results generalise to other situations. In other words, if we only learn that the causal link holds in the very specific circumstances tested within our study, but not in the general terms in which our research question is formulated, external validity is low. For example, the study by Hinze and Wiley (2011) introduced in Chapter 1 directly addressed external validity by testing whether the testing effect also holds for a different operationalisation of "testing".
2.6 Summary
In this chapter we have introduced a number of important concepts that allow us to describe studies and research designs. We began by highlighting that, as empirical scientists, the ultimate arbiter of whether or not to believe in a theoretical position or hypothesis is empirical evidence. This evidence should come from systematically collected data sets, not from anecdotes.
Data sets that can be used to address our research questions consist of one or more independent variables and usually one dependent variable. The distinction between the two is that we assume the dependent variable depends on the independent variable. If the independent variable is an experimental variable for which participants are randomised into conditions, we can even infer that the independent variable is causally responsible for the effect on the dependent variable. If the independent variable is only an observational variable, we generally cannot make such a causal judgement.
The reason why observational variables do not allow causal inferences lies in the third epistemic gap introduced here. For an observational variable, there can always be a confounding variable that is responsible for both the values of the independent variable and the effect on the dependent variable.
Together, the three epistemic gaps put clear limits on what we can learn from empirical data in psychology and related disciplines. The first epistemic gap, the underdetermination of theory by data, is the difference between the research question and its operationalisation. Whereas the operationalisation attempts to address the research question, the two are usually not the same. The second epistemic gap, signal versus noise, concerns the relationship between the operationalisation and the statistical analysis. Even if the statistical analysis appears to provide support for the empirical hypothesis, we cannot be 100% sure of that. There is always the chance that the observed outcome occurred just by chance -- that is, noise -- and does not represent a genuine signal in the data. Finally, the third epistemic gap, confounding variables, is always present when dealing with observational independent variables. As just summarised, in the absence of randomisation we can never really be sure whether the independent variable, and not a confounding variable, is the reason for the observed effect. Thus, we can reiterate the message with which we ended the previous chapter: if we interpret statistical results, we need to be careful and humble in the conclusions we draw.
We have also introduced the different data types we can deal with in a statistical analysis. Independent variables can be both numerical and categorical variables. If an independent variable is categorical, we generally call it an experimental factor, or just factor. Dependent variables can generally only be numerical variables, unless the dependent variable is a binary categorical variable that we are treating as numerical.
We have also argued that most genuinely psychological variables we collect, such as responses on rating scales, are only on an ordinal scale and do not satisfy the assumptions of a numerical variable. However, in our analyses we nevertheless treat them as numerical. This violation of a statistical assumption places further limits on the inferences that are permitted from our studies. In line with this, we have argued that measurement in psychology is generally a difficult problem, and simply assuming that our measures provide more information than they actually do is another inferential problem we have to deal with.