tidyverse
: Example, Quiz, and Exercises
To get into the habit of doing it regularly when starting a new task or analysis, now is probably a good time to restart your R
session. As a reminder, do so through the RStudio
menu with Session
- Restart R
. After this, we need to load the tidyverse
again:
3.7 Extended tidyverse
Example: Walasek & Stewart (2015) Exp. 1a & 1b
3.7.1 Reading Both Data Sets
As an extended tidyverse
example, let us look once again at the data of Walasek and Stewart (2015). As described in Section 3.4.1, there are two similar data sets, Experiments 1a and 1b, that we can consider together (and have done so in Section 1.2.2). To follow the extended examples, download them both into folder data
that is in your current working directory: Experiment 1a and Experiment 1b. Then, we can load them both as separate objects using read_csv()
.
ws1a <- read_csv("data/ws2015_exp1a.csv")
#> Rows: 22912 Columns: 6
#> ── Column specification ────────────────────────────────────
#> Delimiter: ","
#> chr (1): response
#> dbl (5): subno, loss, gain, condition, resp
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ws1b <- read_csv("data/ws2015_exp1b.csv")
#> Rows: 27072 Columns: 6
#> ── Column specification ────────────────────────────────────
#> Delimiter: ","
#> chr (1): response
#> dbl (5): subno, loss, gain, condition, resp
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We use new names for the resulting data objects (ws1a
and ws1b
) to help us remember that this is the data from Walasek and Stewart (2015), Experiments 1a and 1b. It's generally a good idea to use descriptive object names as this makes code easier to read later on. Especially when the code is long, one could even use more descriptive names such as walasek_stewart_2015_1a
and walasek_stewart_2015_1b
. However, as this example stays relatively short, the first letter of the last names and no year will suffice.
Note that as in the previous chapter, if you have problems downloading these two files or setting up the correct working directory, you can load them directly from the internet as well. However, this is not recommended and only shown here to make sure you can follow all steps in this example.
ws1a <- read_csv("https://github.com/singmann/stats_for_experiments/raw/master/data/ws2015_exp1a.csv")
ws1b <- read_csv("https://github.com/singmann/stats_for_experiments/raw/master/data/ws2015_exp1b.csv")
We now have two data sets that have the same structure (i.e., with the same columns and data types) as we can see by just printing them both:
ws1a
#> # A tibble: 22,912 × 6
#> subno loss gain response condition resp
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 8 6 6 accept 20.2 1
#> 2 8 6 8 accept 20.2 1
#> 3 8 6 10 accept 20.2 1
#> 4 8 6 12 accept 20.2 1
#> 5 8 6 14 accept 20.2 1
#> 6 8 6 16 accept 20.2 1
#> # ℹ 22,906 more rows
ws1b
#> # A tibble: 27,072 × 6
#> subno loss gain response condition resp
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 6 12 accept 40.2 1
#> 2 1 6 16 accept 40.2 1
#> 3 1 6 20 accept 40.2 1
#> 4 1 6 24 accept 40.2 1
#> 5 1 6 28 accept 40.2 1
#> 6 1 6 32 accept 40.2 1
#> # ℹ 27,066 more rows
What this output shows is that not only do the two data sets have the same columns in the same order, but they also appear to have the same condition names. To check, let's print the sorted unique values for each condition
column:
ws1a$condition %>%
unique() %>%
sort()
#> [1] 20.2 20.4 40.2 40.4
ws1b$condition %>%
unique() %>%
sort()
#> [1] 20.2 20.4 40.2 40.4
This confirms that indeed, the condition names are the same in both data sets.
3.7.2 Combining Both Data Sets
To analyse these two data sets together, as we have done in Section 1.2.2, we now have to combine these two tibble
s into one. Specifically, we need to add the rows of one of the data sets to the other data sets (in contrast to adding additional columns to one of the data sets). The tidyverse
function to combine multiple tibble
s by rows is bind_rows()
. So the code for combining the two data sets is simply bind_rows(ws1a, ws1b)
. This will only work in a case like this in which not only do we have the same columns, but they are also in the same order .
However, before combining the data, we have to consider one further complication. We can see that in both data sets the participant identifier column, subno
, appears to consist of whole numbers. This means that the same number may be used for different participants in the two experiments. If we do not account for this possibility before combining the data, we may end up with a data set in which data from different participants are treated as coming from the same participant -- a clear error. For example, the output above shows that there is a participant with subno
8 in Experiment 1a. If there is also a participant with subno
8 in Experiment 1b, the data from these two different participants (no participant participated in both studies) would be treated as being from the same participant -- a clear error. To see whether this is indeed a potential problem, we print all individual subno
values for both experiments.
ws1a$subno %>%
unique() %>%
sort()
#> [1] 5 8 13 24 35 36 38 40 42 48 52 53 59
#> [14] 60 61 62 63 64 70 73 76 80 82 83 85 86
#> [27] 87 88 89 90 92 93 94 95 96 97 99 100 103
#> [40] 104 106 107 109 110 112 113 114 115 118 119 120 122
#> [53] 123 124 125 128 129 130 131 132 133 134 135 136 137
#> [66] 139 141 142 143 144 145 147 148 149 150 152 154 155
#> [79] 156 157 158 159 160 161 162 163 164 165 166 167 168
#> [92] 169 170 171 172 173 174 175 176 177 178 180 181 183
#> [105] 184 186 187 188 190 191 193 194 195 196 198 200 201
#> [118] 202 204 206 207 208 209 210 211 212 213 214 215 216
#> [131] 217 218 219 220 221 222 223 224 225 226 227 228 229
#> [144] 230 231 232 233 234 235 236 237 238 240 242 243 244
#> [157] 245 246 247 248 249 250 251 256 257 258 259 260 261
#> [170] 262 263 264 265 266 267 268 269 270 271 272 273 274
#> [183] 276 277 278 279 280 281 282 283 284 285 286 287 288
#> [196] 289 290 291 292 293 294 295 296 297 298 299 300 301
#> [209] 303 304 306 307 308 309 310 311 312 314 315 316 317
#> [222] 319 320 322 323 325 326 327 328 330 331 332 333 334
#> [235] 335 336 337 338 339 340 341 342 343 344 345 346 347
#> [248] 348 349 350 351 352 353 354 355 356 357 358 359 360
#> [261] 361 362 363 364 365 366 367 368 369 370 371 372 373
#> [274] 374 375 377 378 379 380 381 382 383 384 385 386 387
#> [287] 388 389 390 391 392 393 394 395 396 397 398 399 400
#> [300] 401 402 403 404 405 406 407 408 409 410 411 412 413
#> [313] 414 415 416 417 418 420 421 422 423 424 425 426 427
#> [326] 429 430 431 432 434 435 437 438 440 441 442 443 444
#> [339] 445 446 447 448 449 450 451 452 453 454 455 456 457
#> [352] 458 459 460 461 462 463 464
ws1b$subno %>%
unique() %>%
sort()
#> [1] 1 2 3 4 5 7 8 9 10 11 12 13 14
#> [14] 15 16 18 19 20 21 22 25 26 27 28 29 30
#> [27] 31 32 33 34 35 36 37 38 39 40 41 42 43
#> [40] 44 45 46 47 48 49 50 51 52 53 54 55 56
#> [53] 57 58 59 60 61 62 63 64 65 66 67 68 69
#> [66] 71 73 74 77 78 79 80 81 82 83 84 85 86
#> [79] 87 88 89 90 91 92 93 94 95 96 98 101 102
#> [92] 103 104 105 106 107 108 109 110 111 112 113 114 115
#> [105] 116 117 118 119 120 121 122 123 124 125 126 127 128
#> [118] 129 131 132 133 134 135 136 137 138 139 140 141 142
#> [131] 143 144 145 146 147 148 149 150 151 152 153 154 155
#> [144] 156 157 158 159 160 161 162 163 164 165 166 167 168
#> [157] 169 170 171 172 173 174 175 176 177 178 179 180 181
#> [170] 182 183 184 185 186 187 188 189 190 191 193 199 200
#> [183] 202 203 208 209 210 211 212 213 214 215 216 217 218
#> [196] 219 220 221 222 223 224 225 226 227 228 229 230 231
#> [209] 232 233 236 237 238 240 243 244 245 246 247 248 249
#> [222] 250 251 252 253 254 255 256 257 258 259 260 261 262
#> [235] 263 264 265 266 267 268 269 270 271 273 274 275 276
#> [248] 277 279 280 281 282 283 284 285 286 287 288 289 290
#> [261] 291 292 293 294 295 296 297 299 301 302 303 304 306
#> [274] 307 309 311 317 318 322 323 325 326 328 329 330 331
#> [287] 332 333 334 336 337 338 340 341 342 343 344 345 346
#> [300] 347 348 350 351 353 354 355 356 357 358 359 360 361
#> [313] 364 368 369 370 371 372 373 374 375 376 377 378 379
#> [326] 380 381 382 383 384 385 386 387 388 389 390 391 392
#> [339] 393 394 396 398 400 401 402 403 404 405 406 407 408
#> [352] 409 410 411 412 415 416 417 418 419 420 421 422 423
#> [365] 426 427 428 429 430 431 432 434 436 437 438 439 440
#> [378] 441 442 443 444 445 446 447 448 449 450 451 452 453
#> [391] 454 455 456 458 459 460 461 462 463 464 465 466 467
#> [404] 469 471 472 473 474 475 476 478 480 481 484 485 486
#> [417] 488 489 490 491 492 493 494
The output confirms our suspicion. subno
values are not unique across experiments, so we need to make them so before combining the data. One straightforward way to do this is by adding a string to the start of each subno
, either e1a_
or e1b_
, that indicates which data set each participant belongs to. For this we use the paste0()
function which pastes strings together into one (paste0()
does so without adding any extra separation character to the combined strings).
In the same call, we will also convert the subno
variable to a factor (as it is a categorical variable). We can then combine both data sets into one called ws1
using the bind_rows()
call discussed above. We then get an overview of the combined data using the glimpse()
function which is the tidyverse
equivalent of str()
(glimpse()
often produces less verbose output for tibbles than str()
-- try it out yourself to see).
ws1a <- ws1a %>%
mutate(subno = factor(paste0("e1a_", subno)))
ws1b <- ws1b %>%
mutate(subno = factor(paste0("e1b_", subno)))
ws1 <- bind_rows(ws1a, ws1b)
glimpse(ws1)
#> Rows: 49,984
#> Columns: 6
#> $ subno <fct> e1a_8, e1a_8, e1a_8, e1a_8, e1a_8, e1a_8…
#> $ loss <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8…
#> $ gain <dbl> 6, 8, 10, 12, 14, 16, 18, 20, 6, 8, 10, …
#> $ response <chr> "accept", "accept", "accept", "accept", …
#> $ condition <dbl> 20.2, 20.2, 20.2, 20.2, 20.2, 20.2, 20.2…
#> $ resp <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
The output shows that the subno
for the first participant in Experiment 1 is now e1a_8
, not 8
. In other words, subno
's are now unique within each experiment and so we can combine the data without making the larger data set misleading.
To verify this logic, we can count the number of unique values of subno
for each data set using the length()
function on the unique()
elements of the subno
vectors. length()
is a function that returns the number of elements in a vector, so can be directly used in calculations. The calculation here shows that, indeed, the combined data has the same number of participants, 781, as Experiments 1a and 1b together.
length(unique(ws1a$subno)) + length(unique(ws1b$subno))
#> [1] 781
length(unique(ws1$subno))
#> [1] 781
Before continuing with the analysis of the combined data, it makes sense to take a step back and consider briefly why we discussed the question of how to combine these two data sets rather extensively. One the one hand, the code above uses a few new functions that we have not yet introduced such as bind_rows()
and length()
. As before, we want to introduce new functions by explicitly discussing their functionality.
On the other hand, the discussion above shows us one way of handling and checking the assumptions in our code. The issue here is that when combining two data sets using bind_rows()
, the code assumes that the participant identifiers are unique. This is not necessarily an obvious assumption we have when writing the code, but it is a consequence of the fact that we simply concatenate both data sets. Because this is not an assumption that is obvious, we may overlook it, possibly leading to problems later on in our code. Specifically, treating data from different participants as coming from the same participant, just because they coincidentally share the same subno
, is probably not appropriate in our statistical analysis. By checking whether the number of participants in the combined data set is equal to the sum of the number of participants in the separate data sets, we ensure that combining the two data sets was done appropriately.
Another reason why we are discussing this issue in detail is that in a paper on which I am a co-author, Skovgaard-Olsen, Singmann, and Klauer (2016), I overlooked this exact problem. The experiment reported in the paper consisted of three conditions that needed to be combined for the analysis. The problem was that the participant identifiers were only unique within conditions and not across conditions and I combined the data sets without considering this fact. Consequently, the statistical analysis incorrectly treated observations from different participants as belonging to the same participant. In fact, all statistical results reported in Skovgaard-Olsen, Singmann, and Klauer (2016) are affected as I only recognised the error after publication of the article. Luckily for my co-authors and me, once I realised what had happened and re-analysed the data after correctly combining it, that initial error had had no substantive effects on the results. Some details changed (the exact value of some numbers), but not the qualitative pattern of results. Consequently, we could fix this error by publishing a straightforward correction (Skovgaard-Olsen, Singmann, and Klauer 2018). The larger point here is that such errors can and will happen. The best way to avoid them is by regularly checking explicitly whether the assumptions we make about our data and our code hold.
3.7.3 Preparing the Data Set and Descriptive Analysis
Now that we have combined the two data sets into one tibble
, we can further prepare it for analysis. For this, we turn all categorical variables into factors, as shown before.
ws1 <- ws1 %>%
mutate(
subno = factor(subno),
response = factor(response, levels = c("reject", "accept")),
condition = factor(
condition,
levels = c(40.2, 20.2, 40.4, 20.4),
labels = c("-$20/+$40", "-$20/+$20", "-$40/+$40", "-$40/+$20")
)
)
ws1
#> # A tibble: 49,984 × 6
#> subno loss gain response condition resp
#> <fct> <dbl> <dbl> <fct> <fct> <dbl>
#> 1 e1a_8 6 6 accept -$20/+$20 1
#> 2 e1a_8 6 8 accept -$20/+$20 1
#> 3 e1a_8 6 10 accept -$20/+$20 1
#> 4 e1a_8 6 12 accept -$20/+$20 1
#> 5 e1a_8 6 14 accept -$20/+$20 1
#> 6 e1a_8 6 16 accept -$20/+$20 1
#> # ℹ 49,978 more rows
Next, we calculate the number of participants per condition and check if every participant has 64 trials. This is one way to check our assumptions about the data and ensure data integrity (we use the same code as described in Section 3.5.3.5).
ws1 %>%
group_by(condition, subno) %>%
summarise(n = n()) %>%
group_by(condition) %>%
summarise(
n_participants = n(),
all_64 = all(n == 64)
)
#> `summarise()` has grouped output by 'condition'. You can
#> override using the `.groups` argument.
#> # A tibble: 4 × 3
#> condition n_participants all_64
#> <fct> <int> <lgl>
#> 1 -$20/+$40 191 TRUE
#> 2 -$20/+$20 202 TRUE
#> 3 -$40/+$40 190 TRUE
#> 4 -$40/+$20 198 TRUE
Finally, it is time to perform a descriptive analysis of the data. The main result we have reported for this data was the mean acceptance rates for the symmetric lotteries that appear in all conditions. The code for this is the same as given in Section 3.5.3.4, but now we use the combined data. This code reproduces the results reported in Section 1.2.2.
3.8 Quiz
Note: The pull-down menu for selecting an answer turns green after selecting the correct answer.
Exercise 3.1 What does NA
stand for in R
?
- "not available", i.e., missing data
- "no answer", i.e,
R
couldn't find an answer to your command - "nothing applicable", i.e., there are no values/variables that match your request
Answer:
Exercise 3.2 Which of the following is NOT true for an R
package?
- Is a collection of functions that extend
R
. - Needs to be installed from CRAN every time you use it .
- Needs to be loaded every time you use it.
Answer:
Exercise 3.3 What are "bugs" in programming languages?
- variables that are not of interest for a specific analysis
- the same code is used multiple times for different datasets
- error in the code that leads to unexpected/wrong results
Answer:
Exercise 3.4 What is a tibble
?
- a variant of a
data.frame
- the
tiydverse
version of a vector - the result you get from data wrangling
Answer:
Exercise 3.5 This operator %>%
means we are using...
- base R
- a pipe
- a data.frame
Answer:
Exercise 3.6 The dplyr
package is NOT useful for:
- picking variables based on their names
- adding new variables
- summarizing multiple values
- changing the ordering of the rows
- reading file in as csv
Answer:
Exercise 3.7 Using the dplyr
package, which function is used for adding/changing variables in the current data?
Answer:
Exercise 3.8 Using the dplyr
package, which function is used for selecting subsets of rows?
Answer:
Exercise 3.9 Using the dplyr
package, which function is used for selecting subsets of columns?
Answer:
Exercise 3.10 When using the pipe operator %>%
, R
executes operations in which order?
- From innermost to outermost function
- That depends on which other packages are loaded.
- From left to right.
Answer:
3.9 Practical Helpers: Cheatsheets
Trying to remember and applying all the tidyverse
tools yourself can be a bit overwhelming at the beginning. There are quite a few functions and one has to remember for each what it does and what its syntax is. Furthermore, so far we have introduced the most important functions, but there are more in dplyr
and other tidyverse
packages that we do not have the space to introduce (for more see "R
for data science").
One way for you to keep an overview of which functions do what is with the help of cheat sheets (or cheatsheets). Cheat sheets are short (usually 1 or 2 pages) documents that provide an overview of a package or tool and are a common resource for many programming languages. Their benefit compared to other learning resources such as books or the documentation is that they provide a concise overview and can point you quickly to the right tool.
In the case of R
, once you know which function to use (e.g., thanks to the cheat sheet) it might then make sense to follow up by looking at the documentation. The documentation can be obtained for each function by entering ?functionname
(e.g., ?summarise
) at the R
prompt. This provides not only an explanation of what the function does plus all of its arguments, but will also usually have helpful examples.
For the tiydverse
packages, RStudio
provides an almost complete list of cheat sheets. The most relevant for us at the moment is the dplyr
cheat sheet. Furthermore, there are also cheat sheets for the RStudio
IDE iteself, reading in data, and even base R
. Before diving into the examples below, it might make sense to download (or even print) the cheat sheets and have them by your side while trying to solve the problems.
3.10 tiydverse
Exercise: Earworms and Sleep
For this exercise we will work with a data set from Scullin, Gao, and Fillmore (2021) investigating a research question about the relationship between listening to music and sleep: "Many people listen to music for hours every day, often near bedtime. We investigated whether music listening affects sleep, focusing on a rarely explored mechanism: involuntary musical imagery (earworms)." I'm sure most of you will have had the experience of an earworm -- hearing some particular piece of music in your head, and not being able to 'stop it playing'. If this seems alien to you, Wikipedia provides a longer introduction.
Here, we are looking at the data from Scullin, Gao, and Fillmore (2021)'s Study 2, which was a sleep-lab experiment. Participants came to the sleep-lab and stayed for one night in a controlled environment in which a number of objective sleep parameters could be measured via polysomnography. Throughout the night several physiological parameters such as electroencephalography (EEG) and breathing were recorded continuously. Participants were assigned to one of two conditions determining the music they heard just before going to bed: either a condition in which original versions of three pop songs, which included the lyrics, or an instrumental version of the same three pop-songs (without the lyrics). The three songs ("Don't Stop Believin'" by Journey, "Call Me Maybe" by Carly Rae Jepsen, and "Shake It Off" by Taylor Swift) were chosen because previous research had shown that they induce earworms. After participants woke up in the morning, they were asked whether they experienced any earworms at different points in the night or in the morning.
The researchers were interested in two research questions:
Does listening to the original version of a pop song (with lyrics) or an instrumental version of it affect whether or not participants develop earworms? This research question was based on previous results and the authors expected the instrumental version to induce more earworms than the lyrical versions.
Does the sleep quality differ between participants in the lyrical music condition and the instrumental music condition (in which participants are expected to have more earworms)?
Before looking at the data, take a guess yourself. Do you believe you sleep better after listening to music that makes it likely that you develop an earworm?
A version of the original data file made available by Scullin, Gao, and Fillmore (2021) can be downloaded from here. I recommend you download it to the data
folder in the working directory and then you can read in the data via the following code. It makes sense to restart R
before doing that, so we also reload the tidyverse
.
library("tidyverse")
earworm <- read_csv("data/earworm_study.csv")
#> Rows: 48 Columns: 19
#> ── Column specification ────────────────────────────────────
#> Delimiter: ","
#> chr (7): group, gender, raceethnicity, nativeenglishspe...
#> dbl (12): id, relaxed, sleepy, earworm_falling_asleep, e...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Alternatively, you can also download directly from the internet for this exercise, but I really recommend getting comfortable with files, folders, and the R
working directory. The code for downloading from the internet is:
earworm <- read_csv("https://github.com/singmann/stats_for_experiments/raw/master/data/earworm_study.csv")
Let's take a first look at the data:
glimpse(earworm)
#> Rows: 48
#> Columns: 19
#> $ id <dbl> 102, 103, 104, 105, 106, 10…
#> $ group <chr> "Lyrics", "Lyrics", "Instru…
#> $ relaxed <dbl> 37, 96, 62, 83, 68, 78, 60,…
#> $ sleepy <dbl> 49, 10, 44, 97, 42, 52, 41,…
#> $ earworm_falling_asleep <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ earworm_night <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ earworm_morning <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, …
#> $ earworm_control <dbl> 1, 1, 0, 0, 0, 0, 1, 0, 0, …
#> $ sleep_efficiency <dbl> 90.0, 95.7, 89.8, 94.1, 98.…
#> $ sleep_time <dbl> 487.6, 519.4, 472.1, 494.8,…
#> $ age <dbl> 19, 21, 19, 20, 27, 18, 22,…
#> $ gender <chr> "Female", "Female", "Female…
#> $ raceethnicity <chr> "Caucasian", "Caucasian", "…
#> $ nativeenglishspeaker <chr> "Yes", "Yes", "Yes", "Yes",…
#> $ bilingual <chr> "No", "No", "No", "No", "No…
#> $ height_inches <dbl> 66, 68, 64, 69, 53, 62, 76,…
#> $ weight_lbs <dbl> 145, 183, 135, 155, 89, 120…
#> $ handedness <chr> "R", "L", "R", "R", "R", "R…
#> $ classrank <chr> "Sophomore", "Junior", "Fre…
In addition to the variables relevant to our research questions, the data contains a number of control variables (the original data file included even more). We can see a total of 19 variables:
-
id
: Participant identifier -
group
: experimental condition with two values: "Lyrics" versus "Instrumental" -
relaxed
: question asking participants how relaxed they felt on a scale from 0 (= not relaxed) to 100 (= very relaxed) after listening to the music. -
sleepy
: question asking participants how sleepy they felt on a scale from 0 (= not sleepy) to 100 (= very sleepy) after listening to the music. The next 5 variables concerned whether or not the participants reported having an earworm at different times: -
earworm_falling_asleep
: Had earworm when trying to fall asleep last night? 0 = No; 1 = Yes -
earworm_night
: Had earworm when woke up during the night? 0 = No; 1 = Yes -
earworm_falling_asleep
: Had earworm when woke up during the night? 0 = No; 1 = Yes -
earworm_morning
: Had earworm when woke up this morning? 0 = No; 1 = Yes -
earworm_control
: Had earworm at the control time point (i.e., after getting ready to leave the lab)? 0 = No; 1 = Yes -
sleep_efficiency
: sleep efficiency percentage score obtained from the polysomnography (0 = very low sleep efficiency and 100 = very high sleep efficiency) -
sleep_time
: total sleep time in minutes
The remaining variables are demographic ones whose meaning should be evident, except, perhaps, for classrank
. This uses US nomenclature for what year of 4 the student participant is in (i.e., freshman = year 1, sophomore = year 2, junior = year 3, and senior = year 4) plus some additional values.
Exercise 3.11 For research questions 1 and 2 described above, which variables are the independent and which are the dependent variables?
The answer to this question does not require coding. Instead you should come to a solution by just thinking about the description of the experiment and the variables.
Also, remember that in an experiment we test the effect of the independent variable on the dependent variable.
The independent variable for both research questions is the experimental condition, variable group
.
For research question 1, we have three dependent variables, all beginning with earworm_
: earworm_falling_asleep
, earworm_night
, and earworm_morning
. Only earworm_control
is not a dependent variable for this research question, but a control variable (i.e., we should not see an effect of group
on this dependent variable). [Note that the categorisation of earworm_control
as a control variable was a decision of the original authors. I would think that in principle there could still be an effect of the manipulation on this variable as it was collected after the experimental manipulation. However, as they declared it like that, we go with their categorisation here.]
For research question 2, we have one main dependent variable, sleep_efficiency
as a measure of sleep quality. We could also consider the time a participant sleeps, sleep_time
, as a secondary measure of sleep quality.
Exercise 3.12 How can you categorise the other variables, such as relaxed
, age
, etc., that are neither independent nor dependent variables? What results pattern would we expect for these variables?
The other variables can all be understood as control variables. The two explicitly psychological measures, relaxed
and sleepy
, are taken right after the experimental manipulation (i.e., listening to either lyrical or instrumental music). We hope that the type of music does not affect this measure. The other control variables, age
and following, are all demographic variables. We can use them to check whether the randomisation led to unbalanced conditions. And as we have discussed before, the participant identifier, id
, is also a control variable.
Exercise 3.13 Before starting with the analysis, we should prepare the data and transform all important categorical variables into factor
s to avoid any problems. Your task: Transform id
and group
into a factor
. For group
make sure that the original version with lyrics is factor level 1.
Exercise 3.14 As a first step in our analysis, we need to get an overview of the data. For this, we first need to learn something about the structure of the data. This requires two steps:
- How many rows does the data from each participant occupy?
- How many participants are in each condition?
Sounds like a job for summarise()
earworm %>%
group_by(id) %>%
summarise(n = n())
#> # A tibble: 48 × 2
#> id n
#> <fct> <int>
#> 1 102 1
#> 2 103 1
#> 3 104 1
#> 4 105 1
#> 5 106 1
#> 6 107 1
#> # ℹ 42 more rows
Looks like every id
has one row. However, let us make sure this is indeed the case.
earworm %>%
group_by(id) %>%
summarise(n = n()) %>%
summarise(all_1 = all(n == 1))
#> # A tibble: 1 × 1
#> all_1
#> <lgl>
#> 1 TRUE
Yes, everyone has only one row. As discussed before, if there is only one observation per participant there is no distinction between wide and long format, so we can use the data in this format.
Because there is only one observation per participant, we can directly count the number of participants per condition by counting the number of rows per condition.
earworm %>%
group_by(group) %>%
summarise(n = n())
#> # A tibble: 2 × 2
#> group n
#> <fct> <int>
#> 1 Lyrics 25
#> 2 Instrumental 23
There are 25 participants in the lyrics condition and 23 in the instrumental condition.
Exercise 3.15 Before looking at the results, it's a good idea to get an overview of the demographics of the data. In the first step, let us provide a basic overview of participants ignoring the condition they are put in:
- Provide some summaries of the age of participants (min, max, mean, sd).
- What is the gender distribution?
- What is the ethnicity distribution of participants?
When calculating the distribution of a categorical variable, not only calculate the frequencies, but also the proportions (i.e., relative frequencies).
For getting a summary on age
, summarise()
seems like a good idea.
For the other questions, we could also use summarise()
conditional on the categorical variables or alternatively count()
. We can then calculate the proportions from the frequencies using the formula proportion = frequency / total sum.
For the age distribution we have to look at the age
variable:
earworm %>%
summarise(mean = mean(age),
min = min(age),
max = max(age),
sd = sd(age))
#> # A tibble: 1 × 4
#> mean min max sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 18 28 2.20
Looks like the participants were all in the student age range.
To get the absolute frequencies for a categorical variable we can just do:
earworm %>%
group_by(gender) %>%
summarise(n = n())
#> # A tibble: 2 × 2
#> gender n
#> <chr> <int>
#> 1 Female 34
#> 2 Male 14
We can then add the proportions to the resulting tibble
using mutate()
:
earworm %>%
group_by(gender) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n))
#> # A tibble: 2 × 3
#> gender n prop
#> <chr> <int> <dbl>
#> 1 Female 34 0.708
#> 2 Male 14 0.292
Around 71% of participants were women and the rest were men.
For the ethnicity item we can use the same logic/code:
earworm %>%
group_by(raceethnicity) %>%
summarise(n = n()) %>%
mutate(prop = n / sum(n))
#> # A tibble: 4 × 3
#> raceethnicity n prop
#> <chr> <int> <dbl>
#> 1 African American 4 0.0833
#> 2 Asian 6 0.125
#> 3 Caucasian 29 0.604
#> 4 Hispanic 9 0.188
Around 60% are Caucasian, 19% Hispanic, and the rest Asian (12%) or African American (8%).
Exercise 3.16 Let us now calculate the conditional demographic distributions; that is, the demographic distribution conditional on the experimental condition group
. The reason for doing so is to check if the randomisation created balanced or unbalanced groups on the control variables we have collected.
We can re-use the code from the previous exercise, but need to add group_by()
to it.
For the age distribution we can essentially copy the code from above and just need to add group_by(group)
:
earworm %>%
group_by(group) %>%
summarise(mean = mean(age),
min = min(age),
max = max(age),
sd = sd(age))
#> # A tibble: 2 × 5
#> group mean min max sd
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 Lyrics 20.4 18 24 1.63
#> 2 Instrumental 21.6 18 28 2.59
Participants in the instrumental condition are on average around 1 year older than participants in the lyrics condition, which seems not too small given that the SD is not much larger. However, this difference could be mostly due to a larger maximum value, which would also explain the difference in SDs. To check this, we can split the data into two tibble
s, one for each condition, and take a look at the sorted ages.
earworm_lyrics <- earworm %>%
filter(group == "Lyrics")
earworm_instrumental <- earworm %>%
filter(group == "Instrumental")
earworm_lyrics$age %>%
sort()
#> [1] 18 18 18 19 19 19 19 19 20 20 20 20 20 20 21 21 21 21
#> [19] 21 22 22 22 23 23 24
earworm_instrumental$age %>%
sort()
#> [1] 18 19 19 19 20 20 20 20 21 21 21 21 21 21 21 22 22 22
#> [19] 22 25 26 27 28
This suggests that it is not only one particularly old participant in the instrumental condition, but several (i.e., the four oldest participants are all in the instrumental condition). So there might be some age imbalance between the two conditions.
To calculate the conditional gender distributions, we first calculate the number of participants by group
and gender
combination. Then, we need to group again by the experimental condition group
to calculate the relative frequencies within each condition. That is, we can also use mutate()
conditional on the grouping factor created by group_by()
(i.e., the sum(n)
part of the mutate()
call is executed separately for each experimental condition after the group_by()
part).
earworm %>%
group_by(group, gender) %>%
summarise(n = n()) %>%
group_by(group) %>%
mutate(prop = n / sum(n))
#> `summarise()` has grouped output by 'group'. You can
#> override using the `.groups` argument.
#> # A tibble: 4 × 4
#> # Groups: group [2]
#> group gender n prop
#> <fct> <chr> <int> <dbl>
#> 1 Lyrics Female 21 0.84
#> 2 Lyrics Male 4 0.16
#> 3 Instrumental Female 13 0.565
#> 4 Instrumental Male 10 0.435
The output shows that the gender ratio is not balanced between groups. Whereas we see a roughly 50-50 gender ratio in the instrumental condition, in the lyrical condition it is closer to 80-20.
Finally, let's use the same logic for the ethnicity question. Note that the resulting table is again too long to be shown fully, so we pipe it to print(n = Inf)
as done before.
earworm %>%
group_by(group, raceethnicity) %>%
summarise(n = n()) %>%
group_by(group) %>%
mutate(prop = n / sum(n)) %>%
print(n = Inf)
#> `summarise()` has grouped output by 'group'. You can
#> override using the `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups: group [2]
#> group raceethnicity n prop
#> <fct> <chr> <int> <dbl>
#> 1 Lyrics African American 3 0.12
#> 2 Lyrics Asian 5 0.2
#> 3 Lyrics Caucasian 13 0.52
#> 4 Lyrics Hispanic 4 0.16
#> 5 Instrumental African American 1 0.0435
#> 6 Instrumental Asian 1 0.0435
#> 7 Instrumental Caucasian 16 0.696
#> 8 Instrumental Hispanic 5 0.217
This does not show any major differences between the two conditions. Only the proportion of Asian participants is somewhat imbalanced.
Overall, the results show that the two groups are not perfectly balanced, even though the groups were randomly created. This can happen in experiments, and in fact would be expected, especially if the group sizes are small as here (i.e., a maximum of 25 participants per condition). The larger the groups, the more likely it will be that they are balanced.
If the groups are unbalanced, is this a problem? Not necessarily. If you remember, the problem randomisation attempts to address are confounders -- third variables that can affect the dependent variable. Imbalance in experiments thus is problematic if we have imbalance in variables that are known to also affect the dependent variable. The question is if age and gender are such variables. I am not much of a sleep researcher, but would assume that the effect of gender on sleep efficiency is not extremely large. However, age surely has an effect on sleep, but I doubt whether that would be true in the age range of the participants here. If the age range were larger (e.g., one group had participants in their 40s or older and the other did not), I would be more concerned. The differences here seem rather unimportant overall. Nevertheless, in their main analysis Scullin, Gao, and Fillmore (2021) do adjust for the difference in gender, a statistical technique we will get back to later.
Exercise 3.17 In addition to the demographic variables, we should also get an overview of the two control variables measuring participants' responses about their own state after listening to the music, variables relaxed
and sleepy
. Check whether they look similar across condition by calculating their mean and SD.
A job for the group_by()
summarise()
combination that is one of the most frequent combination we use from the tidyverse
.
earworm %>%
group_by(group) %>%
summarise(m_relaxed = mean(relaxed),
sd_relaxed = sd(relaxed),
m_sleepy = mean(sleepy),
sd_sleep = sd(sleepy))
#> # A tibble: 2 × 5
#> group m_relaxed sd_relaxed m_sleepy sd_sleep
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 Lyrics 72.4 19.6 68.7 23.2
#> 2 Instrumental 70.7 20.7 66.2 18.3
The differences in the relaxed and sleepiness question after listening to the music seem minor given the variability (i.e., SD) in the data.
Exercise 3.18 Having now carefully checked and explored the data, it's time to take a look at research question 1: Does listening to the original version of a pop song (with lyrics) or an instrumental version of it affect whether or not participants develop earworms?
Let us look at this research question in multiple steps:
- Calculate whether experimental condition (
group
) affects any of the threeearworm
dependent variables as well as theearworm
control variable. For doing so, calculate the conditional means as well as the standard deviations. - Calculate two new dependent variables from the three dependent variables (i.e.,
earworm_...
variables with the exception ofearworm_control
): any earworm at all (which should be 0 if participant did not report any earworm and 1 otherwise) and proportion of earworms (which should be the sum of the three earworm dependent variables divided by 3). Then calculate whethergroup
affects these two new dependent variables.
Does the analysis support the prediction that instrumental music leads to more earworms than lyrical music?
You already know the drill: Calculating conditional means and standard deviations is done through group_by()
and summarise()
. To calculate new variables (analysis 2), you need to use mutate()
.
Analysis 1: Separate analysis for each dependent variable:
earworm %>%
group_by(group) %>%
summarise(m_fall = mean(earworm_falling_asleep),
sd_fall = sd(earworm_falling_asleep),
m_night = mean(earworm_night),
sd_night = sd(earworm_night),
m_morning = mean(earworm_morning),
sd_morning = sd(earworm_morning),
m_cont = mean(earworm_control),
sd_cont = sd(earworm_control)
)
#> # A tibble: 2 × 9
#> group m_fall sd_fall m_night sd_night m_morning sd_morning
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Lyri… 0.12 0.332 0.08 0.277 0.12 0.332
#> 2 Inst… 0.261 0.449 0.0435 0.209 0.391 0.499
#> # ℹ 2 more variables: m_cont <dbl>, sd_cont <dbl>
Whereas this is the correct code, it does not show all variables of the returned data. One easy way to ensure all variables are shown is to convert the tibble
back to a data.frame
.
earworm %>%
group_by(group) %>%
summarise(m_fall = mean(earworm_falling_asleep),
sd_fall = sd(earworm_falling_asleep),
m_night = mean(earworm_night),
sd_night = sd(earworm_night),
m_morning = mean(earworm_morning),
sd_morning = sd(earworm_morning),
m_cont = mean(earworm_control),
sd_cont = sd(earworm_control)
) %>%
as.data.frame()
#> group m_fall sd_fall m_night sd_night
#> 1 Lyrics 0.1200000 0.3316625 0.08000000 0.2768875
#> 2 Instrumental 0.2608696 0.4489778 0.04347826 0.2085144
#> m_morning sd_morning m_cont sd_cont
#> 1 0.1200000 0.3316625 0.4000000 0.5000000
#> 2 0.3913043 0.4990109 0.4782609 0.5107539
We can see that for two of the three dependent variables, for earworms when falling asleep and in the morning, the proportion of earworms is noticeably larger in the instrumental than in the lyrical condition. For the proportion of earworms during the night and at the control time, we see a much smaller difference or even a small difference in the opposite direction.
Analysis 2: Analysis for new dependent variables.
Step 1: Calculate new variables
earworm <- earworm %>%
mutate(
prop_earworm = (earworm_falling_asleep + earworm_night +
earworm_morning) / 3
) %>%
mutate(any_earworm = if_else(prop_earworm == 0, 0, 1))
earworm %>%
select(id, prop_earworm, any_earworm, starts_with("earworm")) %>%
as.data.frame() %>%
head()
#> id prop_earworm any_earworm earworm_falling_asleep
#> 1 102 0 0 0
#> 2 103 0 0 0
#> 3 104 0 0 0
#> 4 105 0 0 0
#> 5 106 0 0 0
#> 6 107 0 0 0
#> earworm_night earworm_morning earworm_control
#> 1 0 0 1
#> 2 0 0 1
#> 3 0 0 0
#> 4 0 0 0
#> 5 0 0 0
#> 6 0 0 0
We first calculate the new variables using mutate()
. To do so, we realise that the any_earworm
variable is a function of the prop_earworm
(= proportion of earworm) variable, so we calculate the latter first. To calculate the proportion variable we just add each of the three variables up and then divide by 3, being mindful that we need to put parentheses around the +
operations.
We can then calculate the any_earworm
variable from the prop_variable
. Note that we do so in a new mutate()
call. The reason is that we need to refer to the prop_variable
variable and it can lead to problems using variables that are created in the same mutate()
call. It might work, but is not guaranteed, so better to avoid it.
To calculate the any_earworm
variable, we need to check when the prop_earworm
variable is 0 and then assign 0 to the any
variable and 1 otherwise. We use the if_else()
function for this which allows us to apply an if-else logic, as we have here, to a vector (i.e., if called with a vector, if_else()
also returns a vector). Here, the if-part checks if prop_earworm == 0
; if it is, we return 0
, and 1
otherwise.
We then save the new variable back in the earworm
data so we can use it again.
Because this is a non-trivial calculation, we end by printing the first few rows of the results to see if the calculated variables show what they should. We do this to make sure we did not make any errors. As printing the tibble
per default would now show all variables we want ot see, we first select the relevant variables, convert the tibble
to a data.frame
, and then use head()
to get the first few rows. This shows it seems to work. However, as we do not see enough rows with 1s, we repeat the code but print the last few rows using tail()
. This output is more telling that our logic worked.
earworm %>%
select(id, prop_earworm, any_earworm, starts_with("earworm")) %>%
as.data.frame() %>%
tail()
#> id prop_earworm any_earworm earworm_falling_asleep
#> 43 145 0.3333333 1 1
#> 44 146 0.3333333 1 1
#> 45 147 0.0000000 0 0
#> 46 148 0.3333333 1 0
#> 47 149 0.0000000 0 0
#> 48 151 0.3333333 1 1
#> earworm_night earworm_morning earworm_control
#> 43 0 0 0
#> 44 0 0 1
#> 45 0 0 0
#> 46 0 1 1
#> 47 0 0 0
#> 48 0 0 1
Step 2: Calculate conditional means and SDs
earworm %>%
group_by(group) %>%
summarise(m_any = mean(any_earworm),
sd_any = sd(any_earworm),
m_prop = mean(prop_earworm),
sd_prop = sd(prop_earworm)
)
#> # A tibble: 2 × 5
#> group m_any sd_any m_prop sd_prop
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 Lyrics 0.28 0.458 0.107 0.186
#> 2 Instrumental 0.522 0.511 0.232 0.274
This shows that for both new dependent variables, we see a difference in the expected direction.
Overall answer: The analysis overall appears to support the prediction that listening to instrumental compared to lyrical music leads to more earworms. However, we still need, at some point, to do a proper statistical analysis to get a better understanding.
Exercise 3.19 Let's now take a look at research question 2: Does sleep quality differ between participants in the lyrical and the instrumental music condition (in which participants are expected to have more earworms)?
Check the effect of the independent variable on the main measure of sleep quality, sleep_efficiency
, and also on the length of sleep, sleep_time
. Don't forget to calculate both means and SD to judge the size of the difference given the level of noise.
In addition to just calculating the conditional means, also calculate the exact difference in sleep efficiency and sleeping time between both conditions.
Does the data overall support the prediction?
Conditional means is the job of the group_by()
and summarise()
combination.
The analysis can be straightforwardly adapted from the previous code. However, because we also want to calculate the difference exactly, we save the result in a new object, and then print it.
summary_sleep <- earworm %>%
group_by(group) %>%
summarise(m_efficiency = mean(sleep_efficiency),
sd_efficiency = sd(sleep_efficiency),
m_time = mean(sleep_time),
sd_time = sd(sleep_time)
)
summary_sleep
#> # A tibble: 2 × 5
#> group m_efficiency sd_efficiency m_time sd_time
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 Lyrics 94.3 2.97 507. 24.5
#> 2 Instrumental 91.9 4.75 498. 26.0
This shows that there is a small difference in sleep efficiency as well as time slept, in the predicted direction (i.e., a higher sleep quality for lyrical compared to instrumental music). However, the SD for sleep efficiency is much smaller than for time slept, which is perhaps not surprising given the very different magnitudes of the two variables (differing by about a factor of 5).
Let us now calculate the observed difference to understand the pattern better. With the tidyverse
this can be done as:
summary_sleep %>%
summarise(diff_efficiency = m_efficiency[1] - m_efficiency[2],
diff_time = m_time[1] - m_time[2])
#> # A tibble: 1 × 2
#> diff_efficiency diff_time
#> <dbl> <dbl>
#> 1 2.43 9.68
This shows that we can use normal vector indexing (i.e., [1]
or [2]
) for a vector inside summarise()
.
The same results could also be obtained with base R
as:
summary_sleep$m_efficiency[1] - summary_sleep$m_efficiency[2]
#> [1] 2.425391
summary_sleep$m_time[1] - summary_sleep$m_time[2]
#> [1] 9.67687
Overall, the results show that there is a difference in sleep quality that is around 50% of the observed SD in both measures. As for research question 1, a statistical analysis is necessary to better understand the pattern.
However, we can already say that the differences are not large in absolute terms. For example, the difference in time slept is around 10 minutes with an average sleep time that corresponds to more than 8 hours:
500 / 60 ## sleep time in minutes / 60 gives time in hours
#> [1] 8.333333
Exercise 3.20 The analysis so far supports the prediction for both research questions: Participants that listen to instrumental compared to lyrical songs report more earworms and sleep worse. However, this analysis does not address the question of whether participants that report earworms sleep worse than participants that do not report earworms.
As the last bit of the analysis, let's take a look at this question: Do participants that report having an earworm sleep better than participants that do not do so, ignoring the experimental condition. For this:
- Create a new
factor
,has_earworm
, that isyes
, whenever participants report at least once that they had an earworm for one of the threeearworm
dependent variables (i.e., within the experimental time frame, ignoring the control time). - Calculate the conditional N (number of participants) as well as conditional means and SD of the two sleep variables,
sleep_efficiency
andsleep_time
, as a function ofhas_earworm
.
What can we learn from this analysis?
To create the new factor, has_earworm
, we can transform any_earworm
into a factor:
earworm <- earworm %>%
mutate(has_earworm = factor(
any_earworm,
levels = c(1, 0),
labels = c("yes", "no")
))
Then, we can use the previous code to calculate the conditional means and SDs.
summary_earworms <- earworm %>%
group_by(has_earworm) %>%
summarise(n = n(),
m_efficiency = mean(sleep_efficiency),
sd_efficiency = sd(sleep_efficiency),
m_time = mean(sleep_time),
sd_time = sd(sleep_time)
)
summary_earworms %>%
mutate(prop = n / sum(n))
#> # A tibble: 2 × 7
#> has_earworm n m_efficiency sd_efficiency m_time
#> <fct> <int> <dbl> <dbl> <dbl>
#> 1 yes 19 91.4 4.17 508.
#> 2 no 29 94.3 3.64 500.
#> # ℹ 2 more variables: sd_time <dbl>, prop <dbl>
## want to see all variables:
summary_earworms %>%
mutate(prop = n / sum(n)) %>%
as.data.frame()
#> has_earworm n m_efficiency sd_efficiency m_time
#> 1 yes 19 91.41579 4.174428 507.5421
#> 2 no 29 94.27931 3.635183 499.6586
#> sd_time prop
#> 1 26.35405 0.3958333
#> 2 24.71534 0.6041667
The analysis shows that around 40% report an earworm during the experimental time frame. And there is also an interesting pattern: Participants that report having an earworm have lower sleep efficiency scores but sleep slightly longer, compared to participants that do not report an earworm.
We can also again calculate the difference. Note that we need to change the order here to get the same interpretation as above (i.e., positive means in line with the prediction).
summary_earworms %>%
summarise(diff_efficiency = m_efficiency[2] - m_efficiency[1],
diff_time = m_time[2] - m_time[1])
#> # A tibble: 1 × 2
#> diff_efficiency diff_time
#> <dbl> <dbl>
#> 1 2.86 -7.88
What we can learn from the analysis is that it looks as if having an earworm leads to reduced sleep efficiency. The pattern for sleep length is more difficult to interpret (but also a lot smaller compared to the variability in the data as measured by the SD).
However, in this analysis we lose the benefit of randomisation. We do not really know if it is the earworm or something related to the earworm that drives the effect. Nevertheless, because we do not really have another way of introducing an earworm other than playing different types of music, it is still an important result.