1 Introduction
This is a Quarto manuscript. The outline to the left can be used to navigate to specific sections in the manuscript and the links below the outline take you to the Article Notebook and the data preparation notebook, Prepare Data. The Article Notebook is the same as the main article in the homepage but all code is shown.
Ultrasound Tongue Imaging (UTI) is a non-invasive technique that allows researchers to image the shape of the tongue during speech at medium temporal resolution (30-100 frames per second, Epstein and Stone 2005; Stone 2005). Typically, the midsagittal contour of the tongue is imaged, although 3D systems exist (Lulich, Berkson, and Jong 2018). Recent developments in machine-learning-assisted image processing has enabled faster tracking of estimated points on the tongue contour (Wrench and Balch-Tomes 2022).
Wrench and Balch-Tomes (2022) have trained a DeepLabCut (DLC, Mathis et al. 2018) model to estimate and track specific flesh points on the tongue contour and anatomical landmarks as captured by UTI. The model estimates 11 “knots” from the vallecula to the tongue tip, plus three muscular-skeletal knots (the hyoid bone, the mandible base and the mental spine where the short tendon attaches). See Figure 1 for a schematic illustration of the position of the tracked knots. An advantage of DLC-tracked data over the traditional fan-line coordinate system is that (in theory) specific (moving) flesh points are tracked rather than simply the intersection of the tongue contour with fixed radii from the fan-line system. This makes DLC-tracked data resemble data obtained with electromagnetic articulography (EMA). The downside is that the tongue contour is represented by 11 freely moving points, which can move in any direction in the midsagittal two-dimensional space captured by UTI, and this requires more advanced statistical approaches than those commonly employed to analyse tongue contour data.
Classical ways to analyse tongue contour data obtained from a fan-line system, like SS-ANOVA (Davidson 2006; Chen and Lin 2011) and Generalised Additive Models using Cartesian coordinates are not appropriate with DLC-tracked data, due to the tongue contour “curling” onto itself along the root and due to the vectors of movement of the knots. The first point is illustrated in Figure 2: the plot shows the DLC-tracked points (in black) of the data from a Polish speaker and the traced tongue contours based on the points (see Section 2.1 for details on the data). The contours clearly curl onto themselves along the root (on the left of the contour). The red smooths represent a LOESS smooth, calculated for Y along X. This approach miscalculates the smooth for the back half of the tongue, simply because there are two Y values for the same X value, and the procedure, in that case, returns something like an average of the two values. Generalised Additive Models using Cartesian coordinates work on the same principle and hence would produce the same type of error.
Coretta (2018b) and Coretta (2019) suggest using polar coordinates with Generalised Additive Models to overcome the shortcomings of Cartesian coordinates. While using polar coordinates with DLC-tracked data would work mathematically, as shown in Gubian (2025), it is less ideal from an articulatory perspective: a polar-based fan-line system with a single origin fails to capture the the vectors of movement of the individual knots estimated by DLC. Figure 3 illustrates this point with DLC-tracked data from a Lebanese speaker. The figure shows the tracked points from the entire data set of that speaker and each of the 11 DLC-tracked knots is shown in a different colour. The knots appear to move along well defined vectors (with the exception of the fifth knot, which can be regarded as the pivot knot), represented in the figure by black segments. The vectors of articulatory movement make sense in light of the muscular anatomy of the tongue (Pettersson and Wood 1987; Hoole 1999; Honda 1996; Wrench 2024). The polar radii in grey radiate from a manually selected origin with coordinates (100, 30). While the radii would correctly capture the position of each knot on the bi-dimensional space, the modelling of displacement is constrained by the direction of the radii. We don’t consider this constraint to be fatal, but Multivariate Generalised Additive Models allow for a more accurate depiction of knot movement by removing such constraints.
In this tutorial, we introduce two alternative methods to analyse DLC-tracked tongue contour data: Multivariate Generalised Additive Models (Section 2) and Multivariate Functional Principal Component Analysis (Section 3). We will present the pros and cons of each method in Section 4, but to summarise we are inclined to recommend Multivariate Functional Principal Component Analysis over Multivariate Generalised Additive Models due to the substantial computational overhead and reduced practical utility of the latter over the former. This tutorial illustrates how to conduct static analyses of tongue contours (i.e. modelling contours at specific time-points), but the techniques discussed here can be extended to dynamic (time-varying) analyses as shown in Gubian (2025).
2 Multivariate Generalised Additive Models
Generalised Additive Models (GAMs) are an extension of generalised models that allow flexible modelling of non-linear effects (Hastie and Tibshirani 1986; Wood 2006). GAMs are built upon smoothing splines functions, the components of which are multiplied by estimated coefficients to reconstruct an arbitrary time-changing curve. For a thorough introduction to GAMs we refer the reader to Sóskuthy (2021b); Sóskuthy (2021a); Pedersen et al. (2019); Wieling (2018). Multivariate Generalised Additive Models (MGAMs) are GAMs with more than one outcome variable.
As mentioned in Section 1, the data tracked by DeepLabCut consists of the position on the horizontal (x) and vertical (y) axes of fourteen knots. In this tutorial, we will focus on modelling the tongue contour based on the 11 knots from the vallecula to the tongue tip. Figure 4 illustrates the reconstructed tongue contour on the basis of the 11 knots: the tongue in the figure is from the offset of a vowel [o] followed by [t], uttered by a Polish speaker (see Section 2.1).
The same data is shown in Figure 5 in a different format. Instead of a Cartesian coordinate system of X and Y values, the plot has knot number on the x-axis and X/Y coordinates on the y-axis. The X/Y coordinates thus form “trajectories” along the knots. These X/Y trajectories can be modelled using MGAMs and Multiple Functional Principal Component Analysis (MFPCA): in both cases, the X/Y trajectories are modelled as two variables changing along knot number. Note that a viable, but theoretically less robust, alternative is to model the data with the coordinates values as the outcome and the type of coordinate (X vs Y) as a predictor, as done for example in Wieling et al. (2016) and Gubian (2025) (this approach models the X and Y coordinates as completely independent, while an MVGAM also estimates residual correlations between the two, which is a desired, although not deal-breaking, property). In this section, we will illustrate GAMs applied to the X/Y trajectories along the knots and how we can reconstruct the tongue contour from the modelled trajectories. We will use data from two case studies of coarticulation: vowel-consonant (VC) coarticulation based on consonant place in Italian and Polish, and consonantal articulation of plain vs emphatic consonants in Lebanese Arabic.
2.1 VC coarticulation
The data of the first case study, Coretta (2018a), comes from Coretta (2020b) and has been discussed in Coretta (2020a) (the analysis in the latter concerned the position of the tongue root during the duration of vowels followed by voiceless or voiced stops; in this paper we focus on tongue contours at the vowel offset). The materials are /pVCV/ words embedded in a frame sentence (Dico X lentamente ‘I say X slowly’ in Italian and Mówię X teraz ‘I say X now’ in Polish). In the /pVCV/ words, C was /t, d, k, ɡ/ and V was /a, o, u/ (in each word, the two vowels were identical, so for example pata, poto, putu). The data analysed here is from 9 speakers of Italian and 6 speakers of Polish (other speakers were not included due to the difficulty in processing their data with DeepLabCut).
Video recordings of tongue ultrasound were obtained in Articulate Assistant Advanced™ (AAA, Articulate Instruments Ldt. 2011). Spline data was extracted using a custom DeepLabCut (DLC) model developed by Wrench and Balch-Tomes (2022). When exporting from AAA™, the data was rotated based on the bite plane, obtained with the imaging of a bite plate (Scobbie et al. 2011), so that the bite plane is horizontal: this allows for a common coordinate system where vertical and horizontal movement are comparable across speakers. Once the DLC data was imported in R, we manually removed tracking errors and we calculated z-scores within each speaker (the difference between the value and the mean, divided by the standard deviation). These steps are documented in the “Prepare data” notebook online, available at https://stefanocoretta.github.io/mv_uti/notebooks/01_prepare_data-preview.htm.
The following code chunk reads the filtered data. A sample of the data is shown in Table 1. Figure 6 shows the tongue contours for each individual speaker. It is possible to notice clusters of different contours, related to each of the vowels /a, o, u/. Figure 7 zooms in on PL04 (Polish): the contours of each vowel are coloured separately, and two panels separate tongue contours taken at the offset of vowels followed by coronal (/t, d/) and velar stops (/k, ɡ/). Crucially, the variation in tongue shape at vowel offset (or closure onset) across vowels contexts is higher in the coronal than in the velar contexts. This is not surprising, giving the greater involvement of the tongue body and dorsum (the relevant articulators of vowel production) in velar than in coronal stops.
<- readRDS("data/coretta2018/dlc_voff_f.rds") dlc_voff_f
speaker | word | X | Y | knot | knot_label |
---|---|---|---|---|---|
it01 | pugu | -55.2105 | -44.1224 | 0 | Vallecula |
it01 | pugu | -60.6994 | -31.3486 | 1 | Root_1 |
it01 | pugu | -65.1434 | -17.7311 | 2 | Root_2 |
it01 | pugu | -63.6757 | -4.2022 | 3 | Body_1 |
it01 | pugu | -57.2505 | 7.8483 | 4 | Body_2 |
it01 | pugu | -44.9086 | 13.3162 | 5 | Dorsum_1 |
We can now run a multivariate GAM to model the tongue contours. A multivariate GAM can be fitted by providing model formulae for each outcome variable (in our case, X_z
and Y_z
) in a list. For example list(y ~ s(x), w ~ s(x))
would instruct mgcv::gam()
to fit a bivariate GAM with the two outcome variables y
and w
. The required family is mvn
for “multivariate normal”: mvn(d = 2)
indicates a bivariate family (a multivariate family with two dimensions, i.e. two outcome variables). In the model below, we are fitting a multivariate GAM to the z-scored X and Y coordinates. For both outcome variables, we include a smooth over knot (s(knot, ...)
) with a by
variable vow_place_lang
: this variable is built from an interaction of vowel, place and language.1 We set k
to 5: this will usually be sufficient for X/Y coordinates of tongue contours, since they are by nature not very “wiggly” (which would require a higher k
). We also include a factor smooth over knot for speaker (the equivalent of a non-linear random effect) with s(knot, speaker, ...)
: since language is a between-speaker variable, we use the interaction of vowel and place, vow_place
, as the by
variable.
library(mgcv)
<- gam(
voff_gam list(
~ vow_place_lang +
X_z s(knot, by = vow_place_lang, k = 5) +
s(knot, speaker, by = vow_place, bs = "fs", m = 1),
~ vow_place_lang +
Y_z s(knot, by = vow_place_lang, k = 5) +
s(knot, speaker, by = vow_place, bs = "fs", m = 1)
),data = dlc_voff_f,
family = mvn(d = 2)
)
The model summary is not particular insightful. What we are normally interested in is the reconstructed tongue contours and in which locations they are similar or different across conditions. To the best of our knowledge, there isn’t a straightforward way to compute sensible measures of comparison, given the multidimensional nature of the model (i.e., only one or the other outcome can be inspected at a time; moreover, difference smooths, like in Sóskuthy (2021b) and Wieling (2018), represent the difference of the sum of the outcome variables, rather than each outcome separately, Michele Gubian pers. comm.) We thus recommend to plot the predicted tongue contours and base any further inference on impressionistic observations on such predicted contours. Alas, there is also no straightforward way to plot predicted tongue contours, but to extract the predictions following a step-by-step procedure, like the one illustrated in the following paragraphs.
First off, one has to create a grid of predictor values to obtain predictions for. We do this with expand_grid()
in the following code chunk. We start with unique values of speaker
, vow_place
and knot
(rather than just using integers for the knots, we predict along increments of 0.1 from 0 to 10 for a more refined tongue contour). We then create the required column vow_place_lang
by appending the language name based on the speaker ID. Note that all variables included as predictors in the model must be included in the prediction grid.
# Create a grid of values to predict for
<- expand_grid(
frame_voff # All the speakers
speaker = unique(dlc_voff_f$speaker),
# All vowel/place combinations
vow_place = unique(dlc_voff_f$vow_place),
# Knots from 0 to 10 by increments of 0.1
# This gives us greater resolution along the tongue contour than just using 10 knots
knot = seq(0, 10, by = 0.1)
|>
) mutate(
vow_place_lang = case_when(
str_detect(speaker, "it") ~ paste0(vow_place, ".Italian"),
str_detect(speaker, "pl") ~ paste0(vow_place, ".Polish")
) )
With the prediction grid frame_voff
we can now extract predictions from the model voff_gam
with predict()
. This function requires the GAM model object (voff_gam
) and the prediction grid (frame_off
). We also obtain the standard error of the prediction which we will use to calculate Confidence Intervals in the next step. Since we have used factor smooths for speaker, we now have to manually exclude these smooths from the prediction to obtain a “population” level prediction. We do this by listing the smooths to be removed in excl
: note that the smooths must be named as they are in the summary of the model, so always check the summary to ensure you list all of the factor smooths. Finally, we rename the columns with the name of the outcome variables.
# List of factor smooths, to be excluded from prediction
<- c(
excl "s(knot,speaker):vow_placea.coronal",
"s(knot,speaker):vow_placeo.coronal",
"s(knot,speaker):vow_placeu.coronal",
"s(knot,speaker):vow_placea.velar",
"s(knot,speaker):vow_placeo.velar",
"s(knot,speaker):vow_placeu.velar",
"s.1(knot,speaker):vow_placea.coronal",
"s.1(knot,speaker):vow_placeo.coronal",
"s.1(knot,speaker):vow_placeu.coronal",
"s.1(knot,speaker):vow_placea.velar",
"s.1(knot,speaker):vow_placeo.velar",
"s.1(knot,speaker):vow_placeu.velar"
)
# Get prediction from model voff_gam
<- predict(voff_gam, frame_voff, se.fit = TRUE, exclude = excl) |>
voff_gam_p as.data.frame() |>
as_tibble()
# Rename columns
colnames(voff_gam_p) <- c("X", "Y", "X_se", "Y_se")
Now we have to join the prediction in voff_gam_p
with the prediction frame, so that we have all the predictor values in the same data frame. We do so here with bind_cols()
from the dplyr package. Note that voff_gam_p
contains predictions for each level of the factor smooths, despite these being excluded from prediction. If you inspect the predictions for different speakers, you will find that they are the same for the same levels of vow_place_lang
: this is because the effects of the factor smooths were removed, so speaker
has no effect on the predicted values. This means that you can pick any Italian and Polish speaker in the predicted data frame. We do so by filtering with filter(speaker %in% c("it01", "pl02"))
, but any other speaker would lead to the same output. We also calculate the lower and upper limits of 95% Confidence intervals (CI) for each coordinate. Note that you should interpret these CI with a grain of salt, because they are not truly multivariate, but rather represent the CI on each coordinate axis independently.
<- bind_cols(frame_voff, voff_gam_p) |>
voff_gam_p # pick any Italian and Polish speaker, random effects have been removed
filter(speaker %in% c("it01", "pl02")) |>
# Calculate 95% CIs of X and Y
mutate(
X_lo = X - (1.96 * X_se),
X_hi = X + (1.96 * X_se),
Y_lo = Y - (1.96 * Y_se),
Y_hi = Y + (1.96 * Y_se)
|>
) # Separate column into individual variables, for plotting later
separate(vow_place_lang, c("vowel", "place", "language"))
Figure 8 and Figure 9 show the predicted tongue contours based on the voff_gam
model, without and with 95% CIs respectively. As mentioned earlier, there isn’t a straightforward way to obtain any statistical measure of the difference between the contours on the multivariate plane, so we must be content with the figure.
Figure 9 strongly suggests greater VC coarticulation on the C closure in coronal that in velar contexts. Moreover, the vowels /a, o/ have a similar coarticulatory effect on the tongue contour. With velar consonants, we observe less coarticulation from the preceding vowel (a possible explanation of this is given at the end of Section 3.1).
2.2 Emphaticness
The second case study is about consonant “emphaticness” in Lebanese Arabic. The data is from Sakr (2025). Lebanese Arabic is a variety of Arabic primarily spoken in Lebanon, where it is in constant contact with a number of Indo-European languages (primarily English and French, as vectors of education and business, see e.g.. Shaaban and Ghaith 1999), as well as the written standard form of Arabic known as Modern Standard Arabic (MSA). The relationship between Lebanese Arabic (LA) and MSA in Lebanon is one of diglossia (see e.g.. Lian 2022), where LA is spoken in most contexts, but not written, and MSA is the written variety, and therefore also primarily used for legal and official purposes.
Emphasis is a phonologically contrastive feature of Semitic languages. In most varieties of Arabic, it is usually reported to be realised as pharyngealisation (Sakr 2023; J. Al-Tamimi 2017; Zeroual, Esling, and Hoole 2011; Watson 2002) with some variation depending on phonological context (Sakr 2025; F. Al-Tamimi and Heselwood 2011) or on sociolinguistic factors (Khattab, Al-Tamimi, and Heselwood 2006). Older sources instead report the secondary place of articulation as being the velum (see e.g.. Obrecht 1968; Nasr 1959) or the uvula (see e.g.. Bin-Muqbil 2006; Zawaydeh 1999; Ghazeli 1977). Whatever the specifics of this secondary place of articulation, the literature additionally suggests the occurrence of a loss of emphasis in Lebanese, or more generally Levantine or Western dialects of Arabic, likely as a result of the contact with the Indo-European languages mentioned above (see among others Sullivan 2017; El-Khaissi 2015; Elhij’a 2012; Alorifi 2008).
It is against this background, and as part of efforts to document the precise place of secondary articulation of emphasis in Lebanese Arabic, as well as to document whether or not emphasis has, indeed, been lost in the variety, that the data used here (from Sakr 2025) was collected. It consists of UTI recordings, by 5 participants, of CVb stimuli. The onset was either an emphatic or an unemphatic (‘plain’), voiced or voiceless, alveolar, plosive or fricative /t, ṭ, d, ḍ, s, ṣ, z, ẓ/; when talking about a plain/emphatic pair, we denote them /T, D, S, Z/. The nucleus was one of five vowel qualities (see Sakr 2019) present in Lebanese, which we will denote with /A, E, I, O, U/ to signal that these are neither to be taken as phonemes nor exact phonetic realisations. The coda was the voiced bilabial plosive /b/. Each recording consisted of four stimuli in randomized order, covering forty syllables, in five repetitions; for a total of 1000 recordings. The subset of the data used here is from 35ms before consonant offset, defined as the burst for the plosives and as the end of the frication noise for the fricatives.
Since the procedure to fit and plot MGAMs is the same as the one presented in Section 2.1, we won’t be showing the code in this section, but readers can find the code in the Article Notebook, available at https://stefanocoretta.github.io/mv_uti/index-preview.html.
Figure 10 shows the predicted tongue contours of emphatic and plain consonants, split by following vowel. First, the following vowel exercises an appreciable amount of coarticulation on the preceding consonant. The vowel-induced coarticulation seems to be modulating how the emphatic vs plain distinction is implemented (or not): in the context of the vowels /A, O, U/, emphatic consonants are produced with a retracted body and root, indicating pharyngealisation. On the other hand, in the context of the front vowels /E, I/, there is visibly less distinction between emphatic and plain consonants, which is virtually absent in /E/. However, when plotting the predictions for the different vocalic contexts and different speakers, the picture becomes more complex.
In Figure 11, predictions have been calculated for individual speakers (see Article Notebook online, linked above, for the code). First, there is a good deal of individual variation: some speakers show a clear differentiation of the tongue shape in emphatic and plain consonants, while in other speakers the difference is less obvious. FAK produced emphatic and plain consonants with virtually the same tongue shape. Just to pick another example, BAR uvularised/velarised rather than pharyngealised the emphatic consonants followed by /I/, while BAY pharyngealised them. Plotting predictions of individual speakers can reveal idiosyncratic patterns which are not visible when plotting overall predictions.
3 Multivariate Functional Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique. For an introduction to PCA we recommend Kassambara (2017). Functional PCA (FPCA) is an extension of PCA: while classical PCA works by finding common variance in a set of variables (and by reducing the variables to Principal Components that explain that common variance), FPCA is a PCA applied to a functional representation of varying numerical variables (Ramsay and Silverman 2005; Gubian, Torreira, and Boves 2015; Gubian, Pastätter, and Pouplier 2019; Gubian 2024): a typical example is time-series data, with a variable changing over time. The trajectory of the time-varying variable is encoded into a function with a set of coefficients and the values of those coefficients are submitted to PCA. When more than one time-varying variable is needed, this is where Multivariate FPCA (MFPCA) comes in (Gubian 2024).
MFPCA is an FPCA applied to two or more response variables. Note that the variable does not have to be time-varying. The variation can be on any linear variable: in the case of DLC-tracked UTI data, the variation happens along the knot number. Look back at Figure 5: the two varying variables are the X and Y coordinates, which are varying along the DLC knots. As with MGAMs, it is these two varying trajectories that are submitted to MFPCA.
3.1 VC coarticulation
We will apply Multivariate Functional Principal Component Analysis (MFPCA) to the data introduced in Section 2.1. The following code has been adapted from Gubian (2024). The packages below are needed to run MFPCA (except landmarkregUtils, they are available on CRAN).
library(fda)
library(funData)
library(MFPCA)
# install.packages("remotes")
# remotes::install_github("uasolo/landmarkregUtils")
library(landmarkregUtils)
The format required to work through MFPCA is a “long” format with one column containing the coordinate labels (x or y coordinate) and another with the coordinate values. We can easily pivot the data with pivot_longer()
. Note that we are using the z-scored coordinate values (X_z
and Y_z
). If you are unsure about the code in this section, it is always useful to inspect intermediate and final output in the pipe chains to familiarise yourself with the format of the input and resulting data.
<- dlc_voff_f |>
dlc_voff_long # Select relevant columns
::select(X_z, Y_z, frame_id, knot, vowel, c2_place, language, speaker) |>
dplyr# Pivot data to longer format. Saves coordinate labels to column `dim`
pivot_longer(c(X_z, Y_z), names_to = "dim")
In the second step, we create a multiFunData
object: this is a special type of list object, with the observations of the two coordinates (X_z
and Y_z
) as two matrices of dimension \(N \times 11\), where \(N\) is the number of tongue contours and \(11\) is for the 11 knots returned by DLC. Three columns in the data are used to create the multiFunData
object: one column with the id of each contour (in our data, frame_id
), a time or series column (knot
) and the column with the coordinate values (value
).
<- lapply(
curves_fun_2d c("X_z", "Y_z"),
function(y) {
long2irregFunData(
|> filter(dim == {{y}}),
dlc_voff_long # Tongue contour ID
id = "frame_id",
# Knot column
time = "knot",
# X/Y coordinate values
value = "value"
|>
) as.funData()
}|>
) multiFunData()
Once we have our multiFunData
object, we can use the MFPCA()
function to compute an MFPCA. In this tutorial we will compute the first two PCs, but you can compute up to the number of DLC knots in the data, i.e. 11.
# Number of PC to compute
<- 2
n_pc # You can compute up to 11 PCs.
# n_pc <- 11
# Compute MFPCA
<- MFPCA(
mfpca
curves_fun_2d,M = n_pc,
uniExpansions = list(list(type = "uFPCA", npc = n_pc), list(type = "uFPCA", npc = n_pc))
)
We can quickly calculate the relative proportion of explained variance of each PC with the following code.
# Proportion of explained variance
$values / sum(mfpca$values) mfpca
[1] 0.7242321 0.2757679
The best way to assess the effect of the PC scores on the shape of the tongue contours is to plot the predicted tongue contours based on a set of representative PC scores. In order to be able to plot the predicted contours, we need to calculate them from the MFPCA object. It is suggested to plot predicted curves at score intervals based on fractions of the scores’ standard deviation. This is what the following code does.
# Get the PC score SD
<- sqrt(mfpca$values)
sd_fun
# PC curves to be plotted
<- expand_grid(
pc_curves PC = 1:n_pc,
dim = 1:2,
# Set the SD fraction, from -1 SD to +1 SD, with increments of 0.25
sd_frac = seq(-1, 1, by = 0.25)
|>
) group_by(PC, dim, sd_frac) |>
# We can now calculate the predicted contour with funData2long1().
# reframe() is needed because the funData2long1() function returns a data frame
# that has more rows than the original.
reframe(
funData2long1(
$meanFunction[[dim]] +
mfpca* sd_fun[PC] * mfpca$functions[[dim]][PC],
sd_frac time = "knot", value = "value"
)|>
) # We relabel the dimensions
mutate(
dim = factor(dim, levels = c(2, 1), labels = c('Y_z', 'X_z'))
)
The created data frame pc_curves
has the predicted values of the X and Y coordinates along the knots. This is the same structure as Figure 5, with the knot number on the x-axis and the coordinates on the y-axis. Of course, what we are after is the X/Y plot of the tongue contours, rather than the knot/coordinate plot as needed to fit an MFPCA. For the sake of clarity, we first plot the predicted curves for X and Y separately. Figure 12 shows these. The plot is composed of four panels: the top two are the predicted curves along knot number for the Y coordinates (based on PC1 in the left panel and PC2 in the right panel). Interpreting the effect of the PCs on the X and Y coordinates separately allows one to observe vertical (Y coordinate) and horizontal (X coordinate) differences in tongue position independently. However, note that the vectors of muscle contractions in the tongue are not simply along a vertical/horizontal axis, as mentioned in Section 1. Looking at a full tongue contour (in an X/Y coordinates plot) will generally prove to be more straightforward.
In order to plot tongue contours in the X/Y coordinate system, we simply need to pivot the data to a wider format.
<- pc_curves |>
pc_curves_wide pivot_wider(names_from = dim)
Figure 13 plots the predicted contours based on the the PC scores (specifically, fractions of the standard deviation of the PC scores). The x and y-axes correspond to the X and Y coordinates of the tongue contour, with the effect of PC1 in the left panel and the effect of PC2 in the right panel. A higher PC1 score (green lines in the left panel) suggests lowering of the tongue body/dorsum and raising of the tongue tip. Since the data contains velar and coronal consonants, we take this to be capturing the velar/coronal place of articulation. A higher PC2 score (green lines in the right panel) corresponds to an overall higher tongue position. Considering that the back/central vowels /a, o, u/ are included in this data set, we take PC2 to be related to the effect of vowel on the tongue shape at closure onset.
Given the patterns in Figure 13, we can expect to see differences in PC2 scores based on the vowel if there is VC coarticulation. We can obtain the PC scores of each observation in the data with the following code.
<- mfpca$scores |>
pc_scores `colnames<-`(paste0("PC", 1:n_pc)) |>
as_tibble() |>
bind_cols(dlc_voff_long |> distinct(frame_id, vowel, c2_place, language))
Figure 14 plots PC scores by language (rows), consonant place (columns) and vowel (colour). Both in Italian and Polish, we can observe a clear coarticulatory effect of /u/ on the production of coronal stops (and perhaps minor differences in /a/ vs /o/). On the other hand, the effect of vowel in velar stops seems to be minimal, again in both languages. This is not entirely surprising, since while coronal stops allow for adjustments of (and coarticulatory effect on) the tongue body, velar stops do not since it is precisely the tongue body/dorsum that is raised to produce the velar closure.
Once one has established which patterns each PC is capturing, PC scores can be submitted to further statistical modelling, like for example regression models in which the PC scores are outcome variables and predictors are included to assess possible differences in PC scores, or in which the PC scores are predictors themselves. We illustrate an example regression model at the end of the following section for the Lebanese Arabic emphaticness data.
3.2 Emphaticness
In this section we will run an MFPCA analysis on the Lebanese Arabic data. Since the procedure is the same as in the previous section, the code will not be shown here, but can be viewed in the Article Notebook, available at https://stefanocoretta.github.io/mv_uti/index-preview.html.
Figure 15 illustrates the reconstructed tongue contours (taken from 35 ms before the CV boundary) in Lebanese Arabic, based on the MFPCA. PC1 captures the low-back/high-front diagonal movement. PC2, on the other hand, seems to be restricted to high/low movement at the back of the oral cavity. Emphatic consonants, if produced with a constricted pharynx (i.e. pharyngealised), should have a lower PC1. If on the other hand they are produced with a raised tongue dorsum (i.e. uvularised/velarised), they should have a lower PC2 (lower PC scores are in purple in Figure 15).
Figure 16 plots the PC scores for each vowel, emphaticness and speaker combination. Points are coloured based on emphaticness: emphatic in green and plain in orange. This figure illustrates well how the PC scores can capture individual variation: some speakers show clear separation of emphatic and plain tokens, while others do not. In most cases, PC1 is doing the heavy lifting of distinguishing emphatic and plain: recall that PC1 captures the front-high/back-low diagonal; a low PC1 indicates tongue dorsum and root backing, in other words pharyngealisation. Indeed, PC1 tends to be lower in emphatic tokens in several speakers, like BAR, BAY, MRO and SAK, especially with the vowels /A, O, U/. On the other hand, BAR’s emphatic and plain tokens for vowels /E, I/ do not show a PC1 difference, but rather a PC2 difference: PC2 captures tongue dorsum/body raising, hence indicating velarisation. It is possible that in BAR’s productions of emphatic consonants followed by /E, I/ the distinction with plain is produced by uvularisation/velarisation, compared to the pharyngealisation of emphatic consonants followed by /A, O, U/. Uvularisation/velarisation, rather than pharyngealisation, in the /E, I/ contexts makes sense given that the tongue root has to be forward for the production of those vowels (see Sakr (2025), Section 6.4 for more details).
Figure 17 and Figure 18 illustrate one way to plot the PC scores individually for PC1 and PC2. We won’t include here a full description of the plots, since they should be self-explanatory, but we flag to the reader that these type of plots can be helpful in illustrating specific patterns in PC1 or PC2.
Furthermore, it can be helpful to reconstruct the predicted tongue contours of specific contexts. For example, we might be interested in showing the average tongue contours for emphatic and plain consonants followed by each of the five vowels in the data. This is shown in Figure 19. In order to obtain the reconstructed contours, we first need to calculate mean PC scores for each vowel. This can be done through the following code. We recommend to inspect the pc_scores_mean
object in R.
<- pc_scores |>
pc_scores_mean # Group by variables based on which we want to obtain mean values.
group_by(vowel, emph) |>
# Sumarise data to obtain means
summarise(
PC1 = mean(PC1),
PC2 = mean(PC2),
.groups = "drop"
|>
) # Add "dimensions", i.e. X and Y coordinates
mutate(
dim = list(c(1, 2))
|>
) # Unnes the dim column
unnest(dim)
The following code calculates the reconstructed tongue contours based on both PC1 and PC2. One could also calculate the reconstructed contours at fractions of the standard deviation of the scores if one wished so, like it was done above.
<- pc_scores_mean |>
pc_curves_2 group_by(dim, vowel, emph) |>
# We can now calculate the predicted contour with funData2long1().
# reframe() is needed because the funData2long1() function returns a data frame
# that has more rows than the original.
reframe(
funData2long1(
$meanFunction[[dim]] +
mfpca# We add PC1
* mfpca$functions[[dim]][1] +
PC1 # and we add PC2 as well
* mfpca$functions[[dim]][2],
PC2 time = "knot", value = "value"
)|>
) # We relabel the dimensions
mutate(
dim = factor(dim, levels = c(2,1), labels = c('Y_z', 'X_z'))
|>
) pivot_wider(names_from = dim, values_from = value)
Finally, Figure 19 plots the reconstructed contours. Based on this figure, we do find pharyngealisation in emphatic consonants followed by /A, O, U/ on average, while pharyngealisation is absent in the context of /E, I/.
To illustrate how the PC scores can be further modelled using regressions, we fitted a Bayesian Gaussian regression model using brms (Bürkner 2017, frequentist methods are of course also possible) with PC1 as the outcome, emphaticness (emphatic vs plain) and vowel (A, E, I, O, U) and their interaction as predictors. We also included by-participant varying intercepts and varying slopes for the emphaticness/vowel interaction. Both emphaticness and vowel were coded using sum contrasts. This is the R formula we used: PC1 ~ emph_fac * vowel_fac + (emph_fac * vowel_fac | participant)
. We used the default priors as set by brms. Since a full introduction to regression modelling and Bayesian inference is beyond the scope of this paper, we refer the readers to the many resources now available. See the online version of the manuscript for the full code of the Bayesian analysis of PC1: https://stefanocoretta.github.io/mv_uti/.
Before fitting the model, we create a factor version of the columns emph
and vowel
so that we can set the contrasts to sum coding. We then fit a Bayesian Gaussian regression with brms.
<- pc_scores |>
pc_scores_bm mutate(
# Make factors
emph_fac = factor(emph, levels = c("emphatic", "plain")),
vowel_fac = factor(vowel, levels = c("A", "E", "I", "O", "U"))
)
# Set contrasts to sum coding
contrasts(pc_scores_bm$emph_fac) <- "contr.sum"
contrasts(pc_scores_bm$vowel_fac) <- "contr.sum"
# Attach brms package
library(brms)
# Fit a Gaussian regression
<- brm(
emph_bm ~ emph_fac * vowel_fac + (emph_fac * vowel_fac | participant),
PC1 family = gaussian,
data = pc_scores_bm,
# Uses 4 cores to run MCMC in parallel
cores = 4,
# Set the seed for reproducibility
seed = 4235,
# Save the model fit to file
file = "data/cache/emph_bm"
)
Here’s the summary of the model (multilevel hyperparameters not shown). Because of the sum coding, the Intercept
indicates the grand mean PC1 across emphaticness and vowel type. emph_fac1
corresponds to the difference of emphatic vs the grand mean. If we double the posterior draws of emph_fac1
, we get the posterior of the estimated difference between emphatic and plain.
fixef(emph_bm)
Estimate Est.Error Q2.5 Q97.5
Intercept -0.10980004 0.25612082 -0.646389236 0.41870576
emph_fac1 -0.24234156 0.11018367 -0.438593109 -0.01984657
vowel_fac1 -1.21889333 0.24380229 -1.725001730 -0.75115650
vowel_fac2 0.80033815 0.39834033 -0.006674493 1.58260788
vowel_fac3 1.75894457 0.30613851 1.137958770 2.36099086
vowel_fac4 -1.07756205 0.18339976 -1.426689868 -0.76570630
emph_fac1:vowel_fac1 -0.04597482 0.05457016 -0.148279319 0.05327650
emph_fac1:vowel_fac2 0.15409292 0.07214620 -0.002771105 0.30761632
emph_fac1:vowel_fac3 0.13471581 0.18375172 -0.241455205 0.49128557
emph_fac1:vowel_fac4 0.05264575 0.04793634 -0.040416527 0.14856153
To proceed, we need to obtain the posterior draws from the model.
<- as_draws_df(emph_bm) |>
emph_bm_draws mutate(
emph_plain = 2 * b_emph_fac1
)
|>
emph_bm_draws ::select(emph_plain) |>
dplyrsummarise_draws(
"mean", "sd",
# 95%
~quantile(.x, probs = c(0.025, 0.975)),
# 80%
~quantile(.x, probs = c(0.1, 0.9)),
# 70%
~quantile(.x, probs = c(0.15, 0.85))
|>
) mutate(across(is.numeric, ~round(.x, 2)))
# A tibble: 1 × 9
variable mean sd `2.5%` `97.5%` `10%` `90%` `15%` `85%`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 emph_plain -0.48 0.22 -0.88 -0.04 -0.7 -0.28 -0.65 -0.32
To plot the expected PC1 based on the model, we can conveniently get the expected predictions for each draw using posterior_epred()
from brms. We also calculate the expected difference between emphatic and plain for each vowel, which we will use to output the summary table below (Table 2).
# Define newdata to get predictions for
<- expand_grid(
newdata emph_fac = c("emphatic", "plain"),
vowel_fac = c("A", "E", "I", "O", "U")
)
# Get posterior draws for the newdata
<- posterior_epred(emph_bm, re_formula = NA, newdata = newdata)
emph_bm_epred
colnames(emph_bm_epred) <- paste(newdata$emph_fac, newdata$vowel_fac, sep = "_")
# Calculate differences between emphatic and plain for each vowel
<- as_tibble(emph_bm_epred) |>
emph_bm_epred mutate(
A_diff = emphatic_A - plain_A,
E_diff = emphatic_E - plain_E,
I_diff = emphatic_I - plain_I,
O_diff = emphatic_O - plain_O,
U_diff = emphatic_U - plain_U
)
Figure 20 shows the posterior distributions of the expected PC1 scores for emphatic and plain consonants followed by each of the vowels, based on the Bayesian regression model. A lower PC1 score suggests pharyngealisation as established above. It can be observed that emphatic consonants have a lower PC1 score compared to plain consonants, across all vowels, albeit with different magnitudes in different vowels. Overall, the model suggests that the PC1 score of emphatic consonants is on average 0.04 to 0.88 units lower than plain consonants, at 95% confidence (mean effect = -0.48, SE = 0.22, 95% CrI [-0.88, -0.04], 80% CrI [-0.7, -0.28], 70% CrI [-0.65, -0.32]).
Table 1 shows the mean, standard deviation and 95-80-65% CrIs of the posterior difference between emphatic and plain consonants for each vowel. The CrIs indicate a robustly lower PC1 for /A, U/. For the other vowels, both a lower or slightly higher PC1 are possible and, if a difference is present at all, it would be of a lower magnitude than that in the /A, U/ context. At 80% confidence, emphatic consonants also show a lower PC1 when followed by /O/. In sum, while we can argue for a lower PC1 in the /A, U/ and possibly /O/ contexts (and hence, pharyngealisation in emphatic consonants followed by these vowels), the results are more uncertain regarding /E, I/.
emph - plain | mean | SD | l-95% | u-95% | l-80% | u-80% | l-70% | u-70% |
---|---|---|---|---|---|---|---|---|
A_diff | -0.58 | 0.25 | -1.03 | -0.08 | -0.83 | -0.33 | -0.77 | -0.39 |
E_diff | -0.18 | 0.25 | -0.68 | 0.34 | -0.44 | 0.09 | -0.38 | 0.03 |
I_diff | -0.22 | 0.42 | -1.06 | 0.59 | -0.65 | 0.24 | -0.55 | 0.14 |
O_diff | -0.38 | 0.23 | -0.81 | 0.09 | -0.62 | -0.14 | -0.57 | -0.20 |
U_diff | -1.08 | 0.49 | -2.06 | -0.07 | -1.60 | -0.52 | -1.49 | -0.64 |
4 Advantages and disadvantages
Both Multivariate GAMs and FPCA are a useful way to model DLC-tracked ultrasound tongue imaging data. However, each possesses advantages and disadvantages.
Multivariate GAMs can model tongue contours in specific contexts and combinations thereof, like different vowels, consonant, prosodic contexts and so on. The rather complex model structure required to fit multivariate GAMs to tongue data comes at a computational cost and an interpretative cost. Computationally, multivariate GAMs can take hours to estimate even the most simple models. Interpretationally, comparing different tongue contours quantitatively based on the output of a multivariate GAM is non-trivial, given that the tongue contour is in fact a curve reconstructed from the smooths of the X and Y coordinates along knot (in other words, the model does not model tongue contours directly). Moreover, there is no straightforward way to use traditional methods to assess (frequentist) statistical significance. From a practical point of view, a multivariate GAM ends up being a mathematically complex way of obtaining a sort of average tongue contour. A further constraint is that mgcv offers only a multivariate Gaussian distribution, which might not always be an appropriate distribution (despite being a safe neutral assumption).
Multivariate FPCA, on the other hand, are computationally efficient. Even with very large data sets, the computation of Principal Components is relatively quick. Moreover, the obtained PCs can be interpreted straightforwardly by plotting the effect of changing the PC score on the reconstructed tongue contour (as we did for example in Figure 13). One possible disadvantage of multivariate FPCA is that it is usually not known what type of variation each obtained PC captures. This is illustrated in the two case studies in Section 3. In the VC coarticulation data, PC1 corresponded to the coronal/velar difference in consonants, while PC2 to the difference in vowel. In the emphaticness data, PC1 captured the low-back/high-front diagonal movement, while PC2 to the high/low movement at the back of the oral cavity. In other words: until one has run the MFPCA, one cannot know which PC will correspond to which axis of differences, and whether the PCs will capture relevant differences at all. It can happen that the variation one is after is so minimal relative to other, more substantial cases of variation, that it will not be captured by the MFPCA at all. It is possible that qualitatively homogeneous data sets might return PCs that have the same or very similar interpretations. This has not been systematically tested, perhaps with the exception of simple PCA run on vowel formant data, which suggests that the first two PCs capture the two diagonals (Faber and Di Paolo 1995; Hoole 1999; Strycharczuk, Ćavar, and Coretta 2021; Strycharczuk et al. 2025), which make sense in light of the tongue position and shape model expounded by Honda (1996). Another advantage of MFPCA is that, provided that the PCs have captured relevant characteristics, the PCs can be submitted to further modelling using regression with the inclusion of relevant predictors (like different categorical variables of interest) as we have illustrated in the previous section.
Based on the advantages and disadvantages of each of multivariate GAMs and FPCA, we suggest to researchers to use MFPCA as the preferred and default approach to analyse DLC-tracked tongue contour data and to resort to multivariate GAMs if MFPCA fails to capture relevant variation.
5 Conclusions
This tutorial demonstrates two methods for analysing tongue contour data derived from DeepLabCut: Multivariate Generalised Additive Models (MGAMs) and Multivariate Functional Principal Component Analysis (MFPCA). We offer detailed, annotated analyses of two example datasets, focusing on vowel-to-consonant coarticulation in Italian and Polish, and emphatic consonants in Lebanese Arabic. All associated materials, including datasets and code, are accessible through the tutorial’s research compendium at https://github.com/stefanocoretta/mv_uti. In closing, we evaluated the strengths and limitations of each method and advise adopting MFPCA as the preferred initial modelling strategy for tongue contour data.
Disclosure statement
The authors report there are no competing interests to declare.
Data availability statement
All the materials (including data and code) are available in the research compendium of the tutorial at https://github.com/stefanocoretta/mv_uti. An online version of the manuscript is available at https://stefanocoretta.github.io/mv_uti/.
6 References
Footnotes
Note that interactions between categorical variables in the classical sense are not possible in GAMs. Instead, one can approximate interactions by creating an “interaction variable”, which is simply a variable where the values of the interacting variables are pasted together.↩︎
Citation
@online{coretta,
author = {Coretta, Stefano and Sakr, Georges},
title = {Multivariate Analyses of Tongue Contours from Ultrasound
Tongue Imaging},
date = {},
url = {https://stefanocoretta.github.io/mv_uti/},
langid = {en},
abstract = {This tutorial paper introduces two approaches to modelling
tongue contour data obtained with DeepLabCut using Multivariate
Generalised Additive Models (MGAMs) and Multivariate Functional
Principal Component Analysis (MFPCA). For each method, we present a
fully commented analysis of two illustrative data sets: VC
coarticulation in Italian and Polish, and consonant emphaticness in
Lebanese Arabic. All the materials (inlcuding data and code) are
available in the research compendium of the tutorial at \textless
https://github.com/stefanocoretta/mv\_uti\textgreater. We conclude
by discussing advantages and disadvantages of the two methods (MGAM
and MFPCA) and we recommend researchers to prefer MFPCA over MGAM as
an initial step for modelling tongue contours.}
}