Prepare data

Setup

In [1]:


library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout


library(coretta2018itapol)

Coretta 2018

The package coretta2018itapol can be installed from GitHub stefanocoretta/coretta2018itapol@devel (the devel branch contains the necessary data which is not yet available in main). The dlc_voff tibble has DLC spline data from the timepoint corresponding to the acoustically determined VC boundary in the CVCV target words.

In [2]:


data("dlc_voff")

We can ggplotly to determine the frame_id of data to be excluded (hovering over a contour shows the frame_id).

In [3]:


p <- dlc_voff |> 
  filter(spline == "DLC_Tongue") |> 
  ggplot(aes(X, Y, group = frame_id, text = frame_id)) +
  geom_path(alpha = 0.2) +
  coord_fixed() +
  facet_wrap(vars(speaker))

ggplotly(p, tooltip = "text")

The following shows the tongue contours of participant PL04.

In [4]:


dlc_voff |> 
  filter(speaker == "pl04", spline == "DLC_Tongue") |> 
  ggplot(aes(X, Y, colour = vowel, group = frame_id)) +
  geom_path(alpha = 0.75) +
  facet_grid(cols = vars(c2_place)) +
  scale_color_brewer(palette = "Dark2")

Let’s filter the data to remove wrongly tracked tongue contours. The filtered data is saved in dlc-voff-f. We also create two new columns as the interaction of existing columns, we calculate within-speaker z-scores for X/Y coordinates, and we convert speaker to a factor (needed for mgcv::gam()).

In [5]:

dlc_voff_f <- dlc_voff |> 
  filter(
    X > -100,
    !(frame_id %in% c(110, 266, 289, 325, 398, 410, 415, 422, 801, 1134)),
    spline == "DLC_Tongue"
  ) |> 
  mutate(
    vow_place = interaction(vowel, c2_place),
    vow_place_lang = interaction(vowel, c2_place, language),
    speaker = as.factor(speaker)
  ) |> 
  group_by(speaker) |> 
  mutate(
    X_z = (X - mean(X)) / sd(X),
    Y_z = (Y - mean(Y)) / sd(Y),
  ) |> 
  ungroup()

This is the filtered data.

In [6]:

dlc_voff_f |>
  filter(spline == "DLC_Tongue") |> 
  ggplot(aes(X_z, Y_z, group = frame_id)) +
  geom_path(alpha = 0.2) +
  coord_fixed() +
  facet_wrap(vars(speaker), ncol = 5)

We also plot X and Y coordinates along knot separately.

In [7]:

dlc_voff_f |>
  filter(spline == "DLC_Tongue") |> 
  ggplot(aes(knot, X_z, group = frame_id)) +
  geom_path(alpha = 0.2) +
  facet_wrap(vars(speaker), ncol = 5)

In [8]:

dlc_voff_f |>
  filter(spline == "DLC_Tongue") |> 
  ggplot(aes(knot, Y_z, group = frame_id)) +
  geom_path(alpha = 0.2) +
  facet_wrap(vars(speaker), ncol = 5)

In [9]:


saveRDS(dlc_voff_f, "data/coretta2018/dlc_voff_f.rds")

Sakr 2025

Data from Sakr 2025 is in data/sakr2025.

In [10]:


dlc_emph <- readRDS("data/sakr2025/dlc_emph.rds") |>
  mutate(
    participant = as.factor(participant),
    vow_emph = interaction(vowel, emph)
  )
table(dlc_emph$knot)


  0   1   2   3   4   5   6   7   8   9  10 
497 575 799 892 892 892 892 892 892 892 892

In [11]:

p <- dlc_emph |> 
  ggplot(aes(X, Y, group = frame_id, colour = emph, text = frame_id)) +
  geom_path(alpha = 0.5) +
  coord_fixed() +
  facet_grid(rows = vars(participant), cols = vars(vowel))

ggplotly(p, tooltip = "text")

Let’s filter tracking errors.

In [12]:


dlc_emph_f <- dlc_emph |> 
  filter(
    !(frame_id %in% c(1309, 1557, 1073, 74, 1306, 1688, 1648, 1651, 233, 195, 192, 215, 357, 223))
  ) |> 
  mutate(
    vow_emph = interaction(vow_emph),
    participant = as.factor(participant)
  ) |> 
  group_by(participant) |> 
  mutate(
    X_z = (X - mean(X)) / sd(X),
    Y_z = (Y - mean(Y)) / sd(Y),
  ) |> 
  ungroup()

In [13]:

p <- dlc_emph_f |> 
  ggplot(aes(X, Y, group = frame_id, colour = emph, text = frame_id)) +
  geom_path(alpha = 0.5) +
  coord_fixed() +
  facet_grid(rows = vars(participant), cols = vars(vowel))

ggplotly(p, tooltip = "text")

Finally save filtered data.

In [14]:


saveRDS(dlc_emph_f, "data/sakr2025/dlc_emph_f.rds")