2 Rationale for the current research

This research project has three broad aims. The first is to obtain acoustic and articulatory data, using modern methods, which can shed light on the production accounts of the voicing effect put forward over the past century. The second is to inform the debate on reported cross-linguistic differences by conducting an analysis which encompasses three related but contrasting languages. The papers in Part II offer evidence in relation to these goals. The third aim is to carefully (re)consider previous and current results in light of the methodological debates related to the Open Science movement (3.3). These three aims correspond to the following questions:

  1. What is the diachronic articulatory source of the voicing effect, and what can synchronic acoustic and articulatory data tell us about the possible pathway to the emergence of the voicing effect?
  2. How can the comparison of three contrasting languages inform the debate on the source of the voicing effect?
  3. How can we effectively apply Open Science practices in phonetic research, and what level of confidence can we assign to the results?

This chapter is an overview of how these three questions have been addressed.

Another fundamental aspect of this research project is that it was developed in two stages: an exploratory (hypothesis-generating) stage, and a confirmatory (hypothesis-testing) stage (on the exploratory/confirmatory dichotomy, see Tukey (1980) and 3.3.3). These stages correspond to Study I and Study II respectively, an overview of which is given in 3.1 and 3.2. The research questions at the exploratory stage (Study I) were formulated while remaining agnostic with regard to specific hypotheses. Rather, the literature reviewed in 1 formed the basis for a set of general questions about articulatory properties of VC sequences. These questions justified the experimental design of Study I (3.1). More specific questions emerged while performing exploratory data analyses at this stage. The exploratory phase generated new hypotheses, which justified a confirmatory study (Study II, 3.2). Finally, questions pertaining to a comparison across languages and replication of previous results spanned the two stages, and they are discussed at different points in the dissertation.

2.1 discusses in detail the research questions and a justification of methods. 2.2 is an overview of the chosen languages and of relevant aspects of their phonological systems. Finally, 2.3 is a preview of the results.

2.1 Research questions

The first research question concerns the source of the voicing effect, or, in other words, the diachronic pathway that led or can lead to the emergence of the voicing effect in any particular language. More specifically, the question asks how a language can develop the voicing effect and which aspects of speech play a role in such development. The long-standing debate about the source of the voicing effect, in light of the different proposals discussed in 1.5, whether articulatory or perceptual, is evidence of the difficulty of selecting a single property of speech that is behind the differential duration of vowels followed by voiceless vs voiced stops. Moreover, the existence of durational phenomena related to phonation types other than voicing, like aspiration and ejection (1.6), calls for an approach to the understanding of the voicing effect that is independent of voicing per se, while still limiting the investigation to the voicing contrast. Such an approach allows us to formulate an account that future research can generalise and apply to other durational phenomena (related to phonation or not). Furthermore, note that the focus of the current research is on how the voicing effect emerges in the first place, and not on whether and how individual languages exploit the effect to enhance cues to phonological contrast.

A window into possible diachronic developments is offered by the investigation of cross-linguistic synchronic data, an approach taken here. This approach is justified by the idea that diachronic change draws upon synchronic variation and that synchronic variation is the outcome of diachronic change (Blevins 2004; Blevins 2006; Cristofaro 2012; Cristofaro 2014; Bermúdez-Otero 2015). The view of synchrony/diachrony entanglement enables the use of synchronic information to infer possible diachronic changes that might have led to the current synchronic state.

In light of the complexity of the durational effects reviewed in 1, I further decided to limit the scope of the investigation to aspects of production, while keeping an open mind about perceptual factors, as discussed in 8. This choice was based in part on the relative paucity of recent articulatory data on the voicing effect in relation to, for example, acoustics and perception, and in part on the greater number of production accounts of the voicing effect relative to that of perception accounts. Furthermore, the production accounts reviewed in 1.5 deal either with oral (tongue) or laryngeal articulations. In order to identify potential properties of these two types of gestures, it seemed a natural choice to use ultrasound tongue imaging and electroglottography in combination with acoustics as three sources of data. In particular, the research sought to obtain data on segment durations, timing of the consonantal gestures, and properties of vocal fold vibration, given the focus on these features in the literature reviewed in 1. Since the voicing effect (and related durational phenomena) has been prevalently if not exclusively defined and dealt with in terms of acoustic segmental durations, the same approach is used here, and acoustic durations will be at the core of the analyses presented in Part II. Since previous work on stop consonants has generated more coherent results than work on other manners of articulation, and since most hypotheses rest on aerodynamic properties of full stop closures, the focus of this dissertation will be limited to stop consonants.

A convenient way to investigate mechanical properties underlying the effect of voicing on vowel duration is to consider languages in which the effect has not been claimed to be phonologised (1.4). Moreover, comparing two languages that differ in the presence or degree of vowel durational differences can uncover variation motivating cross-linguistic differences. Italian and Polish are two good candidates in that they satisfy both of these requirements. Moreover, their phonological systems allow for a somewhat direct comparison. For these reasons, an exploratory study of Italian and Polish (Study I) was carried out to examine the influence of voiceless and voiced stops on vowel duration. 3.1 contains a description of the methods employed in Study I, while 4, 6, and 7 report the results. Note that, with the terms “voiceless” and “voiced,” I refer to the linguistic reading of “voicing,” rather than to the physical implementation of such contrast, as detailed in 1.2. This approach is generally helpful in light of the distinction between aspirating vs true-voicing languages (Beckman, Jessen & Ringen 2013), and in the case of English in particular (Docherty 1992), which is the subject of Study II. Furthermore, I focus here on voicing as a categorical lexical contrast, given that this is the approach followed by most of the relevant literature. Future work is warranted to ascertain the role of a gradient/continuous operationalisation of voicing.

As a follow-up to Study I, Study II set out to investigate in English the patterns observed in Study I in Italian and Polish. English was chosen as a further test language given the abundance of previous work dealing with different aspects of the English voicing effect. Moreover, virtually all the accounts reviewed in 1.5 were originally posited based on English data. A second reason behind this choice is that English allows us to look into differences between word-medial and word-final contexts.⁴ This is warranted based on the reported difference in magnitude of the voicing effect in word-medial and word-final position, as mentioned in 1.3. An overview of the methods of Study II is given in 3.2, while 5 presents and discusses the study and its results.

The second question this dissertation set out to answer is concerned with a cross-linguistic comparison of the voicing effect. Building on the results discussed in the chapters of Part II, 8.1 offers a synthesis of the main topics touched upon in Part II. In turn, this forms the basis of the cross-linguistic comparison of Italian, Polish, and English in 8.2.

Lastly, the third objective is related to research practices and the Open Science movement. In light of the concepts and issues which will be reviewed in 3.3, the research described in this dissertation has been carried out according to principles of openness of data, transparency of analysis, and reproducibility and replicability of results. 3.3.4 in particular discusses how these principles were applied.

The following section gives a description of the main phonological features of Italian, Polish, and English, paving the way to the preview of the results in 2.3 and the discussion of the methods in 3.

2.2 Language sample

This section gives an overview of the phonological systems of Italian, Polish, and English, which will set the stage for the preview of the results in the following section and the discussion of these in the second part of the dissertation. Note that when referring to languages, the languoid model is implicitly assumed (Cysouw & Good 2013). A languoid is the pairing of a glossonym (a name that refers to a languoid or doculect) with a collection of doculects. In turn, a doculect is the pairing of a glossonym with a specific publication (in any form, for example a book with the grammatical description of the doculect, or an article focussing on a specific linguistic aspect). Languoids can be hierarchical, so that a languoid can be composed of other languoids, and so on. The doculects of this dissertation are referred to by the glossonyms Italian, Polish, and Manchester English. The Italian doculect is included in the languoid Italian [glottocode: ital1282], the Polish doculect in the languoid Polish [glottocode: poli1260], and the Manchester English doculect (English for short from now on) in Western Central English [glottocode: west2900].⁵
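
The relation between glossonyms, doculects, and languoids described above can be sketched as a small data structure. This is only an illustrative sketch, not part of the languoid model's original formulation: the class names, the `glottocode` convenience field, the `SOURCE` placeholder, and the grouping languoid `sample` are all assumptions introduced here for the example. The glottocodes are those given in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Doculect:
    """A glossonym paired with one specific publication."""
    glossonym: str
    publication: str


@dataclass
class Languoid:
    """A glossonym paired with doculects and, optionally, nested languoids."""
    glossonym: str
    glottocode: Optional[str] = None  # convenience field (assumption), not part of the model
    doculects: List[Doculect] = field(default_factory=list)
    sublanguoids: List["Languoid"] = field(default_factory=list)

    def all_doculects(self) -> List[Doculect]:
        """Collect the doculects of this languoid and of all nested languoids."""
        found = list(self.doculects)
        for child in self.sublanguoids:
            found.extend(child.all_doculects())
        return found


# Placeholder for the publication each doculect is paired with (assumption).
SOURCE = "this dissertation"

italian = Languoid("Italian", "ital1282", [Doculect("Italian", SOURCE)])
polish = Languoid("Polish", "poli1260", [Doculect("Polish", SOURCE)])
western_central = Languoid(
    "Western Central English", "west2900", [Doculect("Manchester English", SOURCE)]
)

# Hypothetical grouping languoid, used only to show that languoids can nest.
sample = Languoid("language sample (hypothetical)",
                  sublanguoids=[italian, polish, western_central])
sample_glossonyms = [d.glossonym for d in sample.all_doculects()]
```

Walking the hierarchy with `all_doculects()` returns the three doculects of the dissertation regardless of how deeply the languoids containing them are nested.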

Vowel and consonant categories as used here should be interpreted as descriptive categories when language-specific phonemes are discussed, and as comparative concepts when cross-linguistic comparisons are carried out, as discussed for the category of voicing in 1.2. This approach follows from the view that phonemes make sense only within the linguistic system they belong to (Trubetzkoy 1969; Haspelmath 2010). In this sense, they are descriptive categories. So the phoneme /a/ of Italian is different from the phoneme /a/ of Polish, even if they are phonetically similar, because they belong to two different linguistic systems. When effects like that of voicing are compared across languages, a category like /a/ is no longer to be understood as a descriptive category, but rather as a comparative concept.

The following sections introduce, for each language, the vowel and consonantal phonemic systems, with special attention to phonation contrasts in consonants, syllabic structure and stress patterns, and rhythmic class (Pike 1945).

2.2.1 Italian

Table 2.1: Italian consonant phonemes (adapted from Krämer 2009)
labial dental alveolar palatal velar
stop p, b t, d ts, dz tʃ, dʒ k, g
fricative f, v s, z ʃ, (ʒ)
nasal m n ɲ
lateral l ʎ
rhotic r
approximant w j

Table 2.2: Italian vocalic phonemes (adapted from Krämer 2009)
          front   central   back
high      i                 u
mid-high  e                 o
mid-low   ɛ                 ɔ
low               a

Although the exact phonemic inventory of Italian is still debated, especially for consonants (Krämer 2009: 44), a generally agreed upon phonemic set is given in Table 2.1 for consonants and Table 2.2 for vowels.

Italian contrasts consonants along five (phonological) places of articulation: labial (phonetically either bilabial or labiodental), dental, alveolar, palatal (palatal and post-alveolar), and velar. Stops (true stops and affricates) and fricatives contrast for voicing, although note that /z/ has limited functional load (Bertinetto & Loporcaro 2005) and /ʒ/ is relegated to loan words. The Italian voicing contrast is usually described in terms of an opposition between voiceless unaspirated consonants and fully voiced consonants (Vagges et al. 1978; Bortolini et al. 1995; Pape & Jesus 2014; Kirby 2016). Pape & Jesus (2014) show that Italian speakers tend to perceive stops without a burst following the release as voiced consonants, independent of the duration of voicing during closure. While it is not clear which acoustic cue is employed by Italian speakers to discriminate voiceless and voiced consonants, Pape & Jesus (2014) find in their production study that Italian speakers consistently articulate (velar) stops with full voicing during closure.

The vocalic system in Table 2.2 is found in stressed syllables, although the status of the mid-high and mid-low contrast is not straightforward (especially for the back vowels), and the mid vowels show a high degree of geographical and idiosyncratic variation (Renwick & Ladd 2016). In unstressed syllables, there is no contrast between mid-high and mid-low vowels, and these vowels are articulated as either mid-high or mid-low depending on the variety of Italian (Rogers 2004; Renwick & Ladd 2016). Although vowel duration is not contrastive (Rogers 2004; Krämer 2009; Renwick & Ladd 2016), vowels are longer when they appear in a stressed open syllable (/fa.to/ [faːto] ‘fate’) and shorter when the syllable is closed (/fat.to/ [fatto] ‘fact’).

Stress in Italian is contrastive (non-predictable), and main lexical stress is generally placed on one of the last three syllables (d’Imperio & Rosenthall 1999; Krämer 2009). The basic foot is a maximally bimoraic trochee (Krämer 2009). Italian is traditionally ascribed to the syllable-timed class of rhythmic typology (Pike 1945). However, properties of stress-timed languages (like vowel reduction) can also be observed in Italian, depending on the regional variety (White, Payne & Mattys 2009; Giordano & D’Anna 2010; Pamies Bertrán 1999).

2.2.2 Polish

Table 2.3: Polish consonant phonemes (adapted from Jassem 2003)
labial dental alveolar alveopalatal palatal velar
plosive p, b t, d c, ɟ k, g
fricative f, v s, z ʃ, ʒ ɕ, ʑ x
affricate ts, dz tʃ, dʒ tɕ, dʑ
nasal m n ŋ
lateral l
rhotic r
approximant w

Table 2.4: Polish vocalic phonemes (adapted from Jassem 2003)
          front   central   back
high      i                 u
mid-high          ɨ
mid-low   ɛ ɛ̃               ɔ ɔ̃
low               a

Polish consonants contrast six places of articulation (Jassem 2003): labial (bilabial and labiodental), dental, (post-)alveolar, alveopalatal, palatal, and velar. Similarly to Italian, Polish stops, fricatives, and affricates can either be voiceless or voiced. Keating (1984a) argues that the Polish voicing contrast is between fully voiced consonants and voiceless (short-lag VOT) consonants. Waniek-Klimczak (2011), on the other hand, suggests a possible change in progress by which the duration of VOT in Polish voiceless stops before stressed vowels is increasing. Moslin & Keating (1977) also suggest that the VOT values tend to be longer under certain prosodic conditions. In relation to this finding, Schwartz & Arndt (2018) report that the perception of the voicing contrast by Polish speakers is not hindered by the absence of pre-voicing in voiced stops. Finally, the voicing contrast is neutralised in absolute word-final position (Gussmann 2007), but it is maintained syllable-finally word-medially (Strycharczuk 2012). The Polish vocalic system consists of eight vowel phonemes, six oral and two nasalised: /i, ɛ, ɨ, a, ɔ, u/, /ɛ̃, ɔ̃/ (Jassem 2003; Gussmann 2007).

Polish lexical stress is fixed on the penultimate syllable, the exceptions with antepenultimate stress being loan words (Gussmann 2007). The phonological nature of Polish lexical stress is still debated (see review in Łukaszewicz 2018). As for the class of rhythmic typology, Polish exhibits features from both stress-timed and syllable-timed languages (Dauer 1987; Nespor 1990; Grabe & Low 2002; Arvaniti 2009).

2.2.3 English

Table 2.5: English consonant phonemes
labial dental alveolar post-alveolar palatal velar glottal
plosive p, b t, d k, g
fricative f, v θ, ð s, z ʃ, ʒ h
affricate tʃ, dʒ
nasal m n ŋ
lateral l
rhotic r
approximant w j

In order to avoid influences of regional differences in English, especially in the vowel system, Study II (3.2) was restricted to Manchester English.⁶

The consonant system of Manchester English minimally diverges from the general Southern British English system (Table 2.5), which is non-rhotic, with the notable exceptions of the so-called “T-glottaling” (realisation of /t/ in non-foot-initial position as [ʔ]), “TH-fronting” (realisation of /θ, ð/ as [f, v]), “H-dropping,” and “velar nasal plus” (realisation of /ŋ/ as [ŋg]) (Baranowski & Turton 2015; Baranowski et al. 2016; Bermúdez-Otero et al. 2016; Coretta & Canzi 2018; Bailey 2019a; Bailey 2019b). The consonantal phonemes of Manchester English belong to one of seven places of articulation (labial, dental, alveolar, post-alveolar, palatal, velar, glottal) and seven manners of articulation (plosive, fricative, affricate, nasal, lateral, rhotic, approximant).

While voicing in Manchester English has not been systematically investigated, the literature on voicing in English in general is vast (for a review, see Davidson 2016). English obstruents (plosives, fricatives, affricates) contrast for what has been traditionally described as voicing, which is also reflected in the standard use of IPA voiceless and voiced symbols. However, the actual articulatory implementation of the contrast is constituted by a complex set of features and it is affected by other phonological factors, like syllabic structure and stress (Lisker 1986; Docherty 1992). Generally speaking, while the voicing contrast in word-medial position especially after stressed vowels is between a category with voicing during closure (voiced category) and one without it (voiceless category), in pre-stressed position and especially in word-initial position the contrast is between two voiceless categories that differ in voice onset time (short VOT vs long VOT, with no vibration of the vocal folds during closure in the former). Furthermore, Docherty (1992) shows how the binary distinction between voiceless and voiced stops is an oversimplification of the fine-grained timing patterns of vocal fold vibration into, within, and out of the stop closure, and argues that temporal information plays an important role in defining phonological categories and should be regarded as part of the representational make-up of these.

Another relevant dimension is the type of phonation used by speakers to encode the voicing contrast in English. For example, Gordeeva & Scobbie (2007, 2010, 2011) show that preaspiration, glottalisation, and ejection can be used by speakers as cues to the voicing contrast in fricatives and stops in Scottish English. Moreover, no evidence was found for a correlation between the type of phonation employed by each speaker and their general voice quality (Gordeeva & Scobbie 2011). The authors interpret this finding to mean that the speaker’s voice quality settings and the use of one phonation type over another are decoupled, and that preaspiration, glottalisation, and ejection play an important role in the speaker-specific phonologisation of the contrast. The sociolinguistic aspects of voicing investigated in these studies underline the multidimensional nature of the English voicing contrast.

Table 2.6: Northern British English vowel monophthong phonemes (Orton 1962, Wells 1982)
      front   central   back
high  ɪ iː              ʊ uː
mid   ɛ       ə ɜː      ɔː
low   æ                 ɒ ɑː

Manchester English distinguishes short and long vowels (Table 2.6), which differ in duration and quality. The split between /ʊ/ and /ʌ/ (respectively FOOT and STRUT in Wells’s lexical sets, Wells 1982) present in many varieties of English is absent from Manchester English (as from Northern English more generally), so that there is a single vowel category realised as [ʊ] (Baranowski & Turton 2015). Other features of the vocalic system of Manchester English are the fronting of /uː/, and the laxing of the happY vowel (the final vowel in words like happy, city, duty) to [ɛ] in word-final position.

Lexical stress is contrastive in English (Giegerich 1992). English is more or less uncontroversially regarded as a stress-timed language (Classe 1939; Pike 1945; Abercrombie 1967; Grabe & Low 2002).

2.3 Preview of results

This section presents an overview of the results derived from the investigation of acoustic durations and articulatory properties of vowel-consonant sequences of three related but contrasting languages (Italian, Polish, English) in Study I and II. The results suggest a composite production account of the voicing effect which synthesises previous independent and seemingly contrasting proposals. In particular, the proposed account revisits and combines elements from the compensatory temporal adjustment account, the laryngeal adjustment account, and the rate of closure account, which were presented in 1.5. The following paragraphs summarise the contribution of each original publication in Part II, while a full-fledged discussion of the holistic proposal I put forward will be given in 8.

4 and 5 provide evidence for a revised compensatory adjustment account of the voicing effect. 4 deals with Italian and Polish acoustic data, and it shows that vowel-consonant sequences are embedded within a speech interval that is temporally stable across voicing contexts. This paper discusses mechanisms of compensation between vowel and consonant closure duration within such an interval. 5 extends these findings to English, by comparing durational properties of monosyllabic and disyllabic words. More specifically, I discuss how differences in the gestural organisation of mono- vs disyllabic words illuminate the debate on diachronic pathways and perceptual biases behind the voicing effect in these two phonological contexts. In B, I relate the current results to those from previous work, by means of a meta-analytical study of the English voicing effect.

Based on data from disyllabic words of Italian, Polish (4), and English (5), it is demonstrated that the duration of the speech interval between the releases of two stops flanking a stressed vowel is not affected by the voicing status of the post-vocalic consonant. By capitalising on known articulatory properties of vocalic and consonantal sequences (Öhman 1967a; Fowler 1983; O’Dell & Nieminen 2008; Saltzman et al. 2008), the temporal stability of the release-to-release interval is proposed to be a consequence of the isochrony of the vocalic gestures of the word and of the phasing of the consonantal gestures relative to vowels. While experimental testing of vowel-to-vowel isochrony and vowel-consonant phasing is warranted, D provides initial partial evidence. As a side effect of the release-to-release temporal stability, the timing of the VC boundary within such interval determines the respective durations of the vowel and the following consonant closure, the latter of which is known to be longer for voiceless than for voiced stops (Lisker 1957; Summers 1987; Davis & Summers 1989; Jong 1991). As a consequence, shorter vowels are followed by the longer closures of voiceless stops, while longer vowels are followed by the shorter closure of voiced stops.
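
The compensatory logic described above can be illustrated with a small numerical sketch. The durations below are invented purely for illustration and are not values from Study I or Study II. The function name `split_interval` is hypothetical, and the sketch makes the simplifying assumption that the vowel begins exactly at the release of the first stop. Holding the release-to-release interval constant, an earlier VC boundary yields a shorter vowel and a longer closure, and a later one the reverse.

```python
def split_interval(interval_ms, vc_boundary_ms):
    """Split a release-to-release interval at the VC boundary.

    interval_ms: time from the release of the pre-vocalic stop to the
        release of the post-vocalic stop.
    vc_boundary_ms: time of the vowel offset / closure onset, measured
        from the release of the pre-vocalic stop.
    Returns (vowel_duration, closure_duration) in milliseconds.
    """
    if not 0 <= vc_boundary_ms <= interval_ms:
        raise ValueError("the VC boundary must fall within the interval")
    vowel = vc_boundary_ms
    closure = interval_ms - vc_boundary_ms
    return vowel, closure


# Hypothetical values: the same interval in both voicing contexts,
# with an earlier VC boundary before the voiceless stop.
INTERVAL = 200.0
voiceless = split_interval(INTERVAL, vc_boundary_ms=95.0)
voiced = split_interval(INTERVAL, vc_boundary_ms=115.0)

print("voiceless: vowel = %.0f ms, closure = %.0f ms" % voiceless)
print("voiced:    vowel = %.0f ms, closure = %.0f ms" % voiced)
```

In this toy example the difference in vowel duration between the two contexts equals the difference in closure duration with the opposite sign, which is what a temporally stable release-to-release interval entails.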

The results of English monosyllabic words, on the other hand, show that in this context the release-to-release interval is longer when the post-vocalic consonant is voiced (5). The absence of release-to-release temporal stability in monosyllabic words is argued in 5 to be related to the absence of vowel-to-vowel isochrony, which in turn is a consequence of the lack of a second vowel functioning as a temporal anchor. The respective durations of vowel and closure can thus be modified independently, a fact that speakers can exploit to enhance the voicing contrast. Contrast enhancement can be obtained by manipulating the ratio between the duration of the vowel and that of the closure without the constraint of keeping the release-to-release duration stable, as in disyllabic words. The presence of the voicing effect in monosyllabic words is conjectured to have emerged as a consequence of mechanisms affecting the timing of the consonant closure onset, in accordance with the rate of closure and laryngeal adjustment accounts of the voicing effect. 6 and 7 offer insights about these accounts in relation to tongue root advancement and glottal spreading respectively.

In 6, the time of the boundary between a vowel and the following consonant (i.e. the stop closure onset) is shown to be modulated, among other known factors, by the position of the tongue root, as evidenced by tongue imaging data. In particular, I explore the link between vowel duration, closure duration and tongue root advancement, and discuss how the timing of consonant closure affects all three aspects. Tongue root advancement was observed during the closure of voiced stops in some but not all speakers of both Italian and Polish. Moreover, it was found that tongue root advancement is initiated during the production of the vowel preceding the target consonant and that the degree of advancement at stop closure onset is positively correlated with preceding vowel duration, such that longer vowels correspond to greater tongue root advancement. Together with the shorter duration of the closure of voiced stops, this pattern fits with the known role of tongue root advancement in the maintenance of voicing during stop closure (Kent & Moll 1969; Perkell 1969; Westbury 1983).

In 7, the analysis of vocal fold activity during the production of vowels shows that the latter portion of vowels followed by voiceless stops is produced with greater glottal spread in Italian than in Polish. This difference is taken as evidence for a language-specific implementation of the timing of glottal spreading. Increased glottal spread before voiceless stops is understood as the precursor of pre-aspiration, the presence of which has been reported in Italian (Ní Chasaide & Gobl 1993; Stevens & Hajek 2004a; Stevens & Hajek 2004b; Stevens & Hajek 2010; Stevens 2010; Stevens & Reubold 2014). By combining previous work on pre-aspiration (Lisker 1974; Ní Chasaide 1985; Stevens, Keyser & Kawasaki 2014), two alternative pathways of sound change development are proposed: either pre-aspiration is enhanced by shortening the closure of the stop, or it is reduced or prevented altogether by producing an earlier stop closure. The latter solution would mask the acoustic effects of glottal spreading and result in a longer closure duration and shorter vowel duration, other things being equal.

8.1 offers an answer to the first question set out in 2.1 by combining (1) a word-holistic articulatory account of gestural phasing, of which the release-to-release temporal stability is a consequence, and the modulating properties of (2) tongue root advancement and (3) glottal spreading on the timing of the vowel offset/closure onset in vowel-consonant sequences. It is proposed that these three interacting aspects play a role in driving the development of the voicing effect. As for the question of what cross-linguistic differences can be observed in relation to the voicing effect, the data from Italian, Polish, and English suggest that, when different phonological aspects are controlled for, the magnitude of the effect is similar across languages (8.2).

The conceptual contribution of this dissertation, as summarised in the previous paragraphs, is accompanied by an advancement of methodologies in phonetic data analysis and research more generally. 7 and A introduce two methods for the analysis of electroglottographic data and tongue contours using generalised additive modelling. The application of generalised additive models to electroglottographic data allows us to obtain a dynamic and multidimensional view of vibratory properties of the vocal folds (7). This constitutes an improvement over methods that reduce the multidimensionality of vocal fold vibration to a single measure, like the closed quotient. A shows how generalised additive modelling can be used with tongue contour data in polar coordinates to control for a complex combination of effects. Modelling is exemplified by means of a comparison of tongue contours obtained from the time of maximum constriction of voiceless and voiced stops in Italian and Polish, which corroborates the between-speaker differences observed in 6.

This dissertation is also an example of how state-of-the-art research methods can be applied to linguistic research, as part of the third research aim outlined in 2.1. The methods adopted in this dissertation were influenced by the Open Science movement. All research materials (data, code, documentation) are made available on the Open Science Framework (Coretta 2020). In the interpretation of the results, more emphasis was given to the estimation of parameters in statistical models, and to the degree of uncertainty surrounding them. To facilitate this endeavour, Bayesian statistics was applied to address a subset of the research questions. Finally, custom research-management and analysis tools were developed in the form of R packages.

The following chapter includes an overview of the research methods of Study I (3.1) and Study II (3.2). 3.3 introduces the principles of Open Science and discusses how they shaped the current project.


  4. Note that Polish would not be a good candidate because of word-final neutralisation of voicing (Gussmann 2007).

  5. Languoid classification is as controversial as traditional language classification, so classification decisions are taken here without fully committing to them. The classification adopted here does not directly bear on the research results. Future work is warranted for a more thorough classification.

  6. Due to the difficulty of recruiting speakers of Italian and Polish in Manchester and in the field in Italy, such an approach was not possible for these languages.