1 The voicing effect and beyond

There is in all things a pattern that is part of our universe. It has symmetry, elegance, and grace—these qualities you find always in that the true artist captures. You can find it in the turning of the seasons, the way sand trails along a ridge, in the branch clusters of the creosote bush or the pattern of its leaves. We try to copy these patterns in our lives and in our society, seeking the rhythms, the dances, the forms that comfort. Yet, it is possible to see peril in the finding of ultimate perfection. It is clear that the ultimate pattern contains its own fixity. In such perfection, all things move towards death.—from The Collected Sayings of Muad’Dib by the Princess Irulan

— Frank Herbert, Dune (1965)

A careful analysis of the process of observation in atomic physics has shown that the subatomic particles have no meaning as isolated entities, but can only be understood as interconnections between the preparation of an experiment and the subsequent measurement.

—Fritjof Capra, The Tao of Physics (1975)


The sounds of a language form an incredibly complex system of relations and dependencies, both at a physical and a more abstract level. A topic that masterfully exemplifies the complexities of such a system and that has generated great interest over the decades is the somewhat elusive connection between vowel duration and consonant voicing. According to a robust cross-linguistic tendency, vowels are shorter when followed by a voiceless consonant and longer when the following consonant is voiced (Meyer 1903; Heffner 1937; House & Fairbanks 1953; Lisker 1957; Peterson & Lehiste 1960). This so-called “voicing effect” interacts with a variety of linguistic factors and scholars have sought its origins in properties of speech production, from aerodynamic mechanisms to gestural timing, and properties of speech perception (Belasco 1953; Zimmerman & Sapon 1958; Sharf 1962; Lindblom 1967; Halle, Stevens & Oppenheim 1967; Javkin 1976; Kluender, Diehl & Wright 1988). While much progress has been made in understanding this link, after more than a century there is still disagreement as to what contributes to this phenomenon, as evidenced by the numerous accounts put forward.

Given the plurality of views concerning less understood aspects of the voicing effect, this thesis set out to investigate this phenomenon by employing a diverse set of techniques and sources of data. To keep this type of enquiry manageable, I decided to undertake this endeavour from a speech production outlook, an area which has fuelled a great part of the debate within the voicing effect literature. In particular, this thesis poses the question of what aspects of the articulation of vowel-consonant sequences can inform us about the influence of consonant voicing on the duration of vowels. In answering this question, I collected data from a combination of acoustic, ultrasound tongue imaging, and electroglottographic techniques as part of two studies on Italian, Polish, and English. While the voicing effect in English is generally regarded as particularly large in magnitude, the effect is smaller in Italian, and the literature on the effect in Polish is divided between studies that find an effect and studies that do not. These languages thus make up an appropriate set in that they constitute a window into the complex variation of the voicing effect as seen both across and within languages.

The dissertation is organised in three parts: an introduction (Part I), a collection of original manuscripts (Part II), and a conclusion (Part III).

The three chapters of Part I present a review of the literature on the voicing effect and related issues (Chapter 1), a rationale for the current research including a discussion of the questions to be addressed (Chapter 2), and a description of the methodologies employed in the studies that make up this research (Chapter 3). The following sections introduce the phenomenon of the effect of consonant voicing on preceding vowel durations (the “voicing effect”). First, I will discuss how the voicing effect is cross-linguistically common (with the typological caveat that most investigated languages are from the Indo-European family), although alleged exceptions to its universality exist, both in terms of presence of the effect and of its magnitude (Section 1.1). This is followed by a discussion of the use of “voicing” as a comparative concept rather than as a phonetically-motivated descriptive category (Section 1.2), and by a presentation of other phonological and phonetic factors that are known to interact with the voicing effect, such as manner and prosody (Section 1.3), and processes of phonologisation (Section 1.4). The chapter proceeds with a critical review of the explanatory accounts proposed for the voicing effect, both from a production and a perception point of view (Section 1.5). The chapter concludes with a discussion of the effects of aspiration and ejection on vowel duration, and how these can shed light on the voicing effect (Section 1.6).

Chapter 2 provides a rationale for the current research. The research questions addressed in the dissertation are introduced and contextualised in relation to the topics touched upon in Chapter 1. A justification of the choice of data sources and languages used to answer the research questions is given here. This chapter also offers an overview of the phonologies of the chosen languages, namely Italian, Polish, and English. For each language, a brief description of the consonantal and vocalic phonemic systems is given, with a focus on aspects of phonation contrasts, followed by a discussion of stress and rhythm. Section 2.3 offers a prospective overview of the main results, which are presented in full in Part II.

Chapter 3, the last of Part I, collates the methods employed in the studies that make up this research, namely an exploratory study of the voicing effect in Italian and Polish (Study I, Section 3.1) and a confirmatory study of the compensatory aspects of the voicing effect in English (Study II, Section 3.2). Note that each paper in Part II (Chapters 4 to 7) contains a targeted methods section that describes the subset of methods specific to that paper, whereas Chapter 3 provides a general overview of the methods. This chapter also introduces ultrasound tongue imaging and electroglottography, two articulatory techniques that allow us to learn in a non-invasive way about properties of tongue movement and vocal fold vibration. Finally, Section 3.3 discusses issues related to statistical methods, introduces principles and practices of Open Science as a remedy to some of these issues, and shows how Open Science has shaped the current research project.

Part II is a collection of original manuscripts in the form of standalone papers (Chapters 4 to 7), which report and discuss the conceptual and methodological contribution of the present work. The papers are connected in that they investigate related but self-contained aspects of the voicing effect. A “journal format” was chosen over a “book format” given the strongly experimental and methodologically independent nature of the research behind each paper, and because each paper can be read more or less independently of the others. Nonetheless, the four papers are laid out in an order based partly on the chronological sequence of the research and partly on the logical dependency of the hypotheses investigated in them. While the papers in Chapters 4 to 6 have a more conceptual focus, Chapter 7 centres around a novel methodological approach that enables a holistic analysis of vocal fold vibration data as obtained from electroglottography.

Chapter 4 (Paper I) describes an exploratory study of acoustic properties of the voicing effect in Italian and Polish disyllabic words, as investigated in Study I. Durational aspects of the voicing effect as evidenced by acoustic data are surveyed in light of compensatory mechanisms between the duration of vowels and that of consonant closures. This paper also provides a modern description of the voicing effect in Italian and Polish and discusses how the current results match or diverge from previous work.

The findings of Chapter 4 motivated a confirmatory study, Study II, which is described and discussed in Chapter 5 (Paper II). An articulatory account inferred from the acoustic data presented in Chapter 4 is proposed, which generates hypotheses regarding the durational behaviour of disyllabic vs monosyllabic words. These hypotheses, formulated in terms of acoustic durational patterns, are tested against acoustic data from English disyllabic and monosyllabic words. The paper also links differences in magnitude of the voicing effect in di- vs monosyllabic words to the interplay between the articulatory organisation of gestures and perceptual factors.

Two more papers present articulatory aspects of the voicing effect in Italian and Polish from Study I. This part of the study was carried out to explore voicing-driven differences in articulation during the production of vowel/consonant sequences that could favour the emergence of the voicing effect. Chapter 6 (Paper III) discusses ultrasound tongue imaging data and focusses on tongue root advancement, a mechanism known to facilitate voicing during closure. Both the static configuration of tongue root advancement at vowel onset and its dynamic development during the production of vowels followed by voiceless and voiced stops are discussed. Furthermore, the relation between the static and dynamic properties of tongue root advancement, vowel duration, and consonant voicing is studied. Chapter 7 (Paper IV) assesses a new technique for the dynamic analysis of electroglottographic data which combines established statistical methods. The application of this method is illustrated with an electroglottographic analysis of Italian and Polish, which investigates how vocal fold vibration during the production of vowels differs depending on the voicing status of the following consonant. Finally, the findings of this analysis are discussed in light of the voicing effect and how glottal spread, characteristic of voiceless consonants, might play a role in the emergence of the effect.

Part III summarises the results of this investigation by providing an overarching synthesis (Chapter 8) in response to the questions outlined in Chapter 2, and concludes with a discussion of limitations and future avenues of research (Chapter 9).

1.1 The voicing effect

Across a wide variety of languages, vowels tend to be shorter when followed by voiceless consonants, and longer when followed by voiced ones. This phenomenon has been called the “voicing effect” (Mitleb 1982) or “pre-fortis clipping” (Wells 1990). Among the earliest traceable mentions of this phenomenon are Meyer (1903) for English (cited in Lindblom 1967), Meyer (1904) for German, Meyer & Gombocz (1909) for Hungarian, and Gregoire (1911) for French (all cited in Maddieson & Gandour 1976). After these, a great number of studies further confirmed the existence of the effect in these languages and reported it in an ever-increasing list of others. Remarkably, no known language has been claimed to have the opposite effect, namely longer vowel durations before voiceless than before voiced consonants.

English is the language that has received by far the most attention in relation to the voicing effect (Heffner 1937; House & Fairbanks 1953; Lisker 1957; Zimmerman & Sapon 1958; Peterson & Lehiste 1960; House 1961; Sharf 1962; Sharf 1964; Lindblom 1967; Halle & Stevens 1967; Halle, Stevens & Oppenheim 1967; Slis & Cohen 1969a; Slis & Cohen 1969b; Chen 1970; Klatt 1973; Lisker 1974; Raphael 1975; Umeda 1975; Javkin 1976; Port & Dalby 1982; Mack 1982; Luce & Charles-Luce 1985; Summers 1987; Kluender, Diehl & Wright 1988; Jong 1991; Laeufer 1992; Fowler 1992; Jong 2004; Warren & Jacks 2005; Ko 2018; Glewwe 2018; Sanker 2019, among others). The presence of a voicing effect has been further corroborated in French by Belasco (1953), Chen (1970), and Laeufer (1992), in Hungarian by Sóskuthy (2013), and in German (in the context of word-final voicing neutralisation; see Nicenboim, Roettger & Vasishth 2018 and references therein). Other known voicing-effect languages are Arabic (Hussein 1994; but cf. Mitleb 1982), Assamese and Bengali (Maddieson 1976), Dutch (Slis & Cohen 1969a), Georgian (Beguš 2017), Hindi (Sanker 2018), Italian (Magno Caldognetto et al. 1979; Farnetani & Kori 1986; Esposito 2002), Icelandic (Einarsson 1927), Japanese (Port, Dalby & O’Dell 1987), Korean (Chen 1970), Lithuanian (Campos-Astorkiza 2007), Norwegian (Fintoft 1961), Swedish (Elert 1970), Spanish (Navarro Tomás 1916), Telugu (Sanker 2018), and Russian (Chen 1970).

While the voicing effect is cross-linguistically common, it is not universal, and some languages lack voicing-induced durational differences. Czech and Polish are generally reputed to be languages in which the duration of vowels does not significantly differ before voiceless and voiced stops. In fact, the results concerning the effect in these languages are mixed, and support can be found both for and against an effect of voicing on vowel duration. Keating (1984a) examines the duration of vowels in 3 Czech speakers. Vowels are 193.7 ms long when followed by /t/ and 204.2 ms when followed by /d/. This corresponds to a raw difference of 10.5 ms, which the author reports not to be significant (t(30) = -0.37, p > 0.2). Given the low number of speakers and the relatively high standard error of the effect (about 28 ms, calculated as the mean difference divided by the t-value; see Nicenboim, Roettger & Vasishth 2018), it is possible that the null result is due to low statistical power. Machač & Skarnitzl (2007) analyse 638 VCV sequences recorded from 53 speakers of Czech and find partial evidence for an effect of voicing in the language.
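The standard error quoted above follows from the definition of a t-value as the estimated difference divided by its standard error. The snippet below is a minimal sketch of that arithmetic using only the figures reported in the text; the variable names are illustrative, and the sign of the t-value is ignored.

```python
# Figures for Czech as reported from Keating (1984a) in the text above.
mean_before_t = 193.7  # ms, vowel followed by /t/
mean_before_d = 204.2  # ms, vowel followed by /d/
t_value = -0.37        # reported t(30)

# Raw voicing effect: vowel before /d/ minus vowel before /t/.
raw_difference = mean_before_d - mean_before_t

# Since t = difference / SE, it follows that SE = |difference / t|.
standard_error = abs(raw_difference / t_value)

print(f"difference = {raw_difference:.1f} ms, SE = {standard_error:.1f} ms")
```

A standard error almost three times the size of the estimated difference is what suggests that the study could not have detected an effect of this magnitude reliably.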

As for Polish, Slowiaczek & Dinnsen (1985) measure the duration of vowels in word-final syllables from 5 speakers, and find that vowels followed by an underlyingly voiced stop are 10–15 ms longer. Nowak (2006) investigates several properties of vowel duration in 4 speakers (from different parts of Poland), and finds that vowels followed by voiced stops are 4.5 ms longer (a significant difference). Malisz & Klessa (2008) analyse data from 40 speakers of Standard Polish, and while they do not report estimates from the whole dataset, the means from 4 speakers suggest a difference in vowel duration before voiceless vs voiced stops of about 3.5 ms. On the other hand, an equal number of studies argue that voicing does not significantly affect vowel duration in Polish. Jassem & Richter (1989) do not replicate the results in Slowiaczek & Dinnsen (1985). Keating (1984a) reports a non-significant difference of 2 ms in the word pair /rata/ (167.4 ms) and /rada/ (169.5 ms), based on data from 24 speakers living in Wrocław. Finally, Strycharczuk (2012) reports a non-significant effect in 6 Warsaw speakers in pre-sonorant word-final position. To summarise, the evidence concerning the presence or absence of the voicing effect in Czech and Polish is mixed and it is not possible to draw firm conclusions.

A second common stance about the voicing effect is that its magnitude differs across languages, and that the greatest effect is observed in English. The reported effect of voicing in word-final syllables in English varies between 35 and 150 ms (Heffner 1937; House & Fairbanks 1953; Zimmerman & Sapon 1958; Peterson & Lehiste 1960; Sharf 1962; Chen 1970; Klatt 1973; Mack 1982; Luce & Charles-Luce 1985; Laeufer 1992; Ko 2018). However, the effect is smaller in non-final syllables, with values between 18 and 35 ms (Sharf 1962; Klatt 1973; Davis & Summers 1989). Taking Italian for comparison, the mean difference in vowel duration before voiceless vs voiced stops in the first syllable of Italian disyllabic words is 22 ms in Farnetani & Kori (1986) and 24 ms in Esposito (2002). These values are within the range of the reported effect in English non-final syllables. It is thus possible that, once contextual factors are controlled for, the apparent cross-linguistic differences in magnitude are, if not removed, at least reduced. A similar position is taken by Laeufer (1992), who directly compares French and English using carefully designed experimental materials. When the duration of a vowel is similar across the two languages, the effect of consonant voicing is also comparable in degree.

1.2 Voicing as a physical property and as a linguistic category

The term “voicing,” as used in the literature on the voicing effect and related phenomena, can mean different things. A first major distinction can be drawn between voicing as a physical property and voicing as a linguistic (abstract) category of lexical contrast. Within each of these two classes, further distinctions are possible. This section will review the physical and the linguistic sense in turn. After providing a physical definition of voicing as periodicity and vocal fold vibration, I will discuss some of the views on how voicing can be defined linguistically. Finally, I will introduce the notions of “descriptive category” and “comparative concept” (Haspelmath 2010) to show that the linguistic reading of voicing in this dissertation will be in the latter sense.

From a physical point of view, voicing can be defined either acoustically or physio-anatomically. Acoustically, voicing is the presence of periodicity in a speech signal. The physio-anatomical source of acoustic periodicity is the cyclic vibration of the vocal folds. A speech interval that is characterised by vocal fold vibration/periodicity is said to be voiced, while an interval that does not have vocal fold vibration is said to be voiceless. The terms “voiced” and “voiceless” in this sense refer to the physical properties of the speech interval.

The initiation of vocal fold vibration requires that the air pressure of the cavity below the vocal folds (broadly speaking, the lungs) is higher than that of the cavity above them (the oral tract). The positive trans-glottal air pressure differential is also necessary for the vibration to continue after it is initiated (Berg 1958; Rothenberg 1967). This property is formally known as the Aerodynamic Voicing Constraint (Ohala 2011). Vocal fold vibration in which no active articulatory adjustment is used to ensure the pressure differential is called passive voicing. The typical class of sounds characterised by passive voicing are sonorants (vowels, nasals, liquids). Similarly, the absence of voicing with no concurrent adjustments to prevent initiation and maintenance of vocal fold vibration is known as passive devoicing. Passive devoicing can be observed in the voiceless closure of stops, while a sign of passive voicing is the continuing presence of vocal fold vibration from the preceding voiced sound for some time into the closure (also known as voicing bleed, Davidson 2016).

Certain articulatory conditions, like a reduced oral tract volume or the full closure of stop consonants, hinder passive voicing by reducing the trans-glottal pressure differential. When the pressure differential is 0 (i.e. when pressure equalisation is reached), vocal fold vibration can no longer be maintained and it ceases. In such articulatory conditions, several articulatory adjustments can be implemented to counteract pressure equalisation. Active adjustments require muscular activity. Among the solutions that help sustain vocal fold vibration are (1) some that decrease supra-glottal pressure, like tongue root advancement (Kent & Moll 1969; Perkell 1969; Westbury 1983), larynx lowering (Riordan 1980), opening of the velopharyngeal port (Yanagihara & Hyde 1966), or producing a retroflex occlusion (Sprouse, Solé & Ohala 2008), and (2) others that lower the pressure differential threshold required for fold vibration, like slackening of the vocal folds (Halle & Stevens 1967) or producing a shorter stop closure (Lisker 1957). Active voicing is vocal fold vibration with articulatory adjustments that ensure continuing vibration.
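The aerodynamic conditions described in the last two paragraphs can be stated compactly. The notation below is an assumption introduced here for illustration and is not taken from the cited sources: P_sub and P_oral stand for sub-glottal and supra-glottal (oral) pressure, and theta for the pressure differential threshold required for vocal fold vibration.

```latex
\Delta P = P_{\text{sub}} - P_{\text{oral}},
\qquad
\text{vibration is sustained only while } \Delta P > \theta \ge 0 .
```

On this reading, the adjustments in group (1) keep the differential large by lowering the oral pressure, while those in group (2) act on the threshold itself.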

Sounds that are intended not to have vocal fold vibration (i.e. are intended as voiceless) but that are characterised by favourable conditions for it tend to show passive voicing. In these cases, articulatory adjustments can be put in place to counteract the presence of passive voicing. For example, glottal abduction, larynx raising, vocal fold tensing and oral wall tensing can all prevent voicing by either decreasing the supra-glottal volume or raising the pressure differential threshold. The resulting phenomenon is called active devoicing (Jansen 2004).

Rothenberg (1967) makes an important further distinction between purposive and non-purposive active articulatory adjustments. For example, tongue root advancement can be executed by muscular activity with the intent of maintaining voicing, in which case we would call it a purposive (active) gesture. If tongue root advancement is executed by muscular activity with an intent different from maintaining voicing (for example, aiding the creation of a tongue constriction movement), then this would be classified as a non-purposive (active) gesture with regard to voicing. While ascertaining whether an active gesture is purposive or non-purposive can be difficult, an active articulatory adjustment (executed by muscular activity) does not automatically imply the speaker’s intention to achieve all of the benefits deriving from that adjustment. While discussing articulatory adjustments in this dissertation, a classification into passive, active, purposive and non-purposive adjustments will not be attempted, but the potential difference will be discussed when relevant. Finally, the terms movement, adjustment, and gesture will be used interchangeably throughout without any theoretical commitment to differences among these.

Turning now to the linguistic sense of voicing, different approaches have been proposed on how to classify phonological systems of phonation contrasts. This discussion will focus on systems that contrast two categories, and only on those approaches to voicing that are relevant to the topics at hand. In particular, I will first review three main ideas: the distinction between voicing and aspirating languages (Beckman, Jessen & Ringen 2013), the tri-partite system of phonological categories (Keating 1984b), and the Articulatory Phonology definition of voicing (Goldstein & Browman 1986). I will then show that a typological approach that distinguishes between language-specific and comparative entities spares us the need to find a definition of voicing that can simultaneously account for the different phonological systems. This approach forms the basis of the conceptual background of this dissertation.

The voicing-effect languages listed in Section 1.1 have quite different phonation systems. A major distinction can be drawn between so-called “true voicing” languages and “aspirating” languages (Beckman, Jessen & Ringen 2013). In the framework discussed in Beckman, Jessen & Ringen (2013), true voicing languages make use of the distinctive (privative) feature [voice], and segments specified with [voice] are characterised by active voicing (vocal fold vibration). On the other hand, aspirating languages employ the feature [spread glottis]. Segments specified with [spread glottis] generally have long Voice Onset Time (VOT) values, while unspecified segments can show passive voicing (vocal fold vibration). Typical true-voicing languages are Italian, Spanish, and Russian, while Germanic languages like German, English, and Icelandic are aspirating languages. All of these languages are voicing-effect languages, albeit allegedly to different extents. In true-voicing languages, vowels are longer when followed by [voice] segments, while in aspirating languages vowels are longer when followed by underspecified segments (segments without [spread glottis]).

Keating (1984b) is an attempt to define “voicing” in such a way that even systems that are very dissimilar at the physical level can be grouped together. This definition is restricted to languages that contrast only two categories of phonation. Keating (1984b) proposes that three levels of representation are necessary. One level is purely phonological, and abstracts away from real physical properties of the contrast. This phonological level of representation corresponds to the traditional [(±)voice] feature (either binary or privative). The second level of representation pertains to what Keating (1984b) calls “modified systematic phonetics.” She proposes three phonetic categories based on VOT: {voiced}, {voiceless unaspirated}, and {voiceless aspirated}. The last level, “pseudo-physical,” assigns a range of VOT values to such categories depending on the language and the phonological context. The [(±)voice] feature can then be interpreted as “more” or “less voiced,” rather than as presence vs absence of vocal fold vibration.

Goldstein & Browman (1986), within their framework of Articulatory Phonology, take a different stance and simply ascribe voicing to the presence or absence of a glottal opening-and-closing gesture. Voiceless segments are then characterised by the presence of such a gesture, and voiced segments by its absence. This definition abstracts away from the presence/absence of vocal fold vibration, and allows us to group together, for example, an aspirating language like English and a true-voicing one like Italian.

While the three approaches just reviewed try to posit categories that can both describe and categorise phonation systems, the typological approach adopted here keeps these two aims separate. Haspelmath (2010) introduces a helpful distinction between comparative concepts and descriptive categories. Following from what Haspelmath calls “categorial particularism,” it is advocated that individual languages should be described in terms of language-specific categories. In this light, “voicing” in Italian is different from “voicing” in Spanish precisely because Italian and Spanish are two different linguistic systems. Typological comparison should not be based on (language-specific) “descriptive categories,” but rather on “comparative concepts.” Comparative concepts are created by the linguist who performs cross-linguistic analyses, and are not components of particular languages. I assume here that traditional phonological categories like “voicing,” “vowel height,” and “place of articulation” can be thought of either as (language-particular) descriptive categories or as comparative concepts, depending on the scientific enterprise. In relation to the voicing effect, “voicing” as used in this work is intended as a comparative concept and not as a descriptive category. In other words, “voicing” is used here as a convenient cross-linguistic term for phonological oppositions that are similar in behaviour and that are treated similarly in the voicing effect literature, but no claim of identity of the categories across languages is made (nor is it necessary for the account proposed here).

Note that the adoption of the distinction between descriptive categories and comparative concepts is a working assumption and a fully fledged argument in support of the distinction will not be pursued here. Rather, reference to comparative concepts allows us to compare the voicing effect across languages where “voicing” behaves very differently, and allows us to make cross-linguistic generalisations that transcend language-specific descriptive categories. The holistic account expounded in Section 8.1 rests on this working assumption, and should be interpreted as applicable independently of language-specific voicing categories. I will not specify the sense of the term “voicing” when used (physical, descriptive, comparative), except in cases where its interpretation is ambiguous.

1.3 The voicing effect and other phonological and phonetic factors

In Section 1.1, we saw that the voicing effect can differ depending on the language. In addition to language, this phenomenon is also modulated by other phonological and phonetic factors. For example, Umeda (1975) reports that the difference in vowel duration before voiceless vs voiced consonants is greater when the test word is pre-pausal. The voicing effect also seems to be more robust in stressed than in unstressed vowels (Davis & Summers 1989). There is also some indication that the effect is modulated by the position of the syllable in the word in English, so that word-final syllables show a greater effect than word-medial syllables (Abdelli-Beruh 2004). Port (1981) further argues that the effect in word-initial stressed vowels decreases along the hierarchy monosyllabic > disyllabic > trisyllabic words, which mirrors the hierarchy of decreasing average vowel durations. Laeufer (1992) discusses the voicing effect as a function of vowel height, and shows that the effect is greater in low (intrinsically longer) vowels than in high (intrinsically shorter) vowels. Moreover, Sharf (1964) shows that the effect persists even in whispered (unvoiced) speech.

Manner of articulation of the consonant is a further relevant parameter. While most work seems to focus on stops, voicing of other types of consonants also affects preceding vowel duration. For example, House & Fairbanks (1953) report that vowels are longer when followed by a voiced fricative than a voiceless one in English. They also argue that the durational difference is greater before fricatives than before stops. On average, vowels in House & Fairbanks (1953) are 84 ms longer when followed by a voiced stop (vs a voiceless stop) and 93 ms longer when followed by a voiced fricative (vs a voiceless fricative). Laeufer (1992) finds similar patterns in both English and French: vowels followed by voiced fricatives are longer than when followed by voiceless fricatives (the average difference is 93 ms in English, 47 ms in French) and the effect of voicing with fricatives is greater than with stops (the average difference is 60 ms in English stops, 35 ms in French stops). Zimmerman & Sapon (1958) report vowel durations before voiceless and voiced stops and fricatives in English, and the difference is greater before fricatives (122 ms) than before stops (95 ms). On the other hand, in a survey of spontaneous speech from different varieties of English, Tanner et al. (2019) find that the effect of voicing is greater with stops than with fricatives by a mean factor of 1.3. To sum up, it is possible that the degree of the voicing effect is greater in fricatives than in stops, but it is difficult to make generalisations based on such a small pool of studies.
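To put the figures above on a common scale, the effect before fricatives can be expressed as a ratio of the effect before stops. The snippet below is a minimal sketch using only the differences quoted in this paragraph; the dictionary layout and function name are illustrative choices.

```python
# Voicing effect (voiced minus voiceless, in ms) as quoted in the text above.
# Each entry: (effect before stops, effect before fricatives).
reported_effects = {
    "House & Fairbanks (1953), English": (84, 93),
    "Laeufer (1992), English": (60, 93),
    "Laeufer (1992), French": (35, 47),
    "Zimmerman & Sapon (1958), English": (95, 122),
}

def fricative_to_stop_ratio(stop_effect, fricative_effect):
    """Return how many times larger the effect is before fricatives."""
    return fricative_effect / stop_effect

ratios = {
    study: fricative_to_stop_ratio(*effects)
    for study, effects in reported_effects.items()
}

for study, ratio in ratios.items():
    print(f"{study}: {ratio:.2f}")
```

All four ratios are above 1, whereas the factor of 1.3 in Tanner et al. (2019), if read as the stop effect over the fricative effect, points in the opposite direction.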

Studies comparing the voicing effect in obstruents with the durational effects of sonorant consonants also report mixed results. While only a few systematic investigations on the effect of sonorant voicing on vowel duration have been carried out, it was found that (1) nasals exercise an effect intermediate between that of voiceless and voiced stops but closer to that of the latter (House & Fairbanks 1953; Zimmerman & Sapon 1958); (2) nasals are preceded by vowels that are longer than those followed by voiced stops (Peterson & Lehiste 1960); or (3) the duration of vowels followed by nasals is indistinguishable from that of vowels followed by voiced stops (Lisker 1974). In House & Fairbanks (1953), vowels are on average 245 ms long when followed by a voiced stop, 232 ms long when followed by a nasal, and 161 ms when followed by a voiceless stop. Zimmerman & Sapon (1958) report English vowel durations of 218 ms before voiced stops and 200 ms before nasals, while vowels are 123 ms long when followed by voiceless stops. On the other hand, the duration of vowels in Peterson & Lehiste (1960) is 273 ms when followed by nasals, 265 ms when followed by voiced stops, and 171 ms when followed by voiceless stops. Lisker (1974) argues that the durations of vowels followed by voiced stops and by nasals are virtually the same, but measurements are not provided. In sum, nasals seem to behave more like voiced stops than voiceless stops, but it is less clear whether vowels preceding them are longer or shorter than those followed by voiced stops.
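One way to compare the three sets of measurements is to place the pre-nasal duration on a scale running from the pre-voiceless-stop duration (0) to the pre-voiced-stop duration (1). This normalisation is an illustrative choice made here, not a measure used in the cited studies; the durations are those quoted above.

```python
# Mean vowel durations (ms) as quoted in the text above.
durations = {
    "House & Fairbanks (1953)": {"voiceless": 161, "nasal": 232, "voiced": 245},
    "Zimmerman & Sapon (1958)": {"voiceless": 123, "nasal": 200, "voiced": 218},
    "Peterson & Lehiste (1960)": {"voiceless": 171, "nasal": 273, "voiced": 265},
}

def nasal_position(d):
    """Place the pre-nasal duration on a scale where 0 is the
    pre-voiceless-stop duration and 1 is the pre-voiced-stop duration."""
    return (d["nasal"] - d["voiceless"]) / (d["voiced"] - d["voiceless"])

positions = {study: nasal_position(d) for study, d in durations.items()}

for study, position in positions.items():
    print(f"{study}: {position:.2f}")
```

Values between 0.5 and 1 correspond to pattern (1), intermediate but closer to voiced stops, while a value above 1 corresponds to pattern (2), where vowels before nasals are longer than before voiced stops.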

1.4 On the phonologisation of the voicing effect

The voicing effect can take on a linguistic function resulting in the phonologisation of the durational differences, as argued for English (Jong 1991; Jong 2004; Solé, Beddor & Ohala 2007; Sanker 2019). Some clarification is due here as to what is meant by phonologisation. The classical or structuralist definition of phonologisation states that this occurs when a contextual allophone becomes contrastive, or in other words it becomes a phoneme (Kiparsky 2015), generally after the disappearance or replacement of the conditioning context. Sanskrit velar palatalisation is a classic example of phonologisation (Hock 1991: 149). At some point in the history of Sanskrit, the velar stops /k/ and /g/ were palatalised when followed by /i/ and /e/, creating an allophonic distinction between velars proper and palatal consonants of some sort. The subsequent change of /e/ to /a/ removed the context conditioning palatalisation, thus creating minimal pairs opposing /ka, ga/ and /tʃa, dʒa/. At this stage, the palatal allophones were phonologised. This conceptualisation of phonologisation amounts to saying that phonetic features that were previously computed procedurally (during phonological/phonetic derivation) from an underlying lexical representation are now instead already part of the lexical representation (which is, in structural terms, a string of phonemes).

Phonologisation assumes a different meaning within the framework of Lexical Phonology (Kiparsky 1988). Lexical Phonology argues that there exist two types of phonological processes: processes that apply at the lexical (stem and prosodic word) level, and processes that are post-lexical and apply across the board. According to the view of Lexical Phonology, a process is phonologised when it goes from being post-lexical to being lexical. To carry on with the Sanskrit example, velar palatalisation was initially post-lexical, in other words it was applied across the board during derivation after all lexical processes had been applied to the stem and word. During the course of sound change, the same process of velar palatalisation started being applied also at the lexical level (with the original copy of the process possibly still being applied post-lexically). At this point, velar palatalisation has been phonologised, creating so-called “quasi-phonemes” (categorical, distinctive units, not yet able to create lexical contrast, Janda 1999).

Kiparsky (2000) carries the definition of phonologisation over from Lexical Phonology to Stratal Optimality Theory (Kiparsky 2000; Bermúdez-Otero 2017). Stratal OT assumes that the phonological module of grammar is stratified into three levels (called strata, or domains) as in Lexical Phonology: the stem, the word, and the phrasal level. OT constraints are independently ordered in each level, so that within each level different orders allow for different outputs to be selected. Stratal OT also stipulates that phonological constraints apply iteratively (cyclically) from the narrowest domain, namely the stem, through the word domain, to the phrasal domain. Under cyclicity, the output of one domain is passed on as the input to the next, and so on. For Kiparsky (2000), phonologisation occurs when the constraint ordering of the phrasal domain (the post-lexical level of Lexical Phonology) is carried over to the word and stem domains (the lexical level of Lexical Phonology).
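The stratal architecture just described can be sketched as a toy evaluator. The Python code below is a deliberately simplified illustration and not an implementation of any published analysis: the two constraints, the candidate generator, and the ASCII forms (with "c" standing in for a palatal) are all hypothetical, loosely modelled on the Sanskrit example above. Each stratum has its own constraint ordering, and the form selected at one stratum serves as the input to the next. Phonologisation in the sense of Kiparsky (2000) is mimicked by copying the phrase-level ordering to the word level.

```python
def star_ki(inp, cand):
    """Markedness (hypothetical): one violation per 'k' before a front vowel."""
    return sum(1 for a, b in zip(cand, cand[1:]) if a == "k" and b in "ie")


def ident(inp, cand):
    """Faithfulness (hypothetical): one violation per segment changed."""
    return sum(1 for a, b in zip(inp, cand) if a != b)


def palatalise(form):
    """Replace every 'k' that precedes a front vowel with 'c'."""
    out = list(form)
    for i in range(len(form) - 1):
        if form[i] == "k" and form[i + 1] in "ie":
            out[i] = "c"
    return "".join(out)


def gen(inp):
    """Toy candidate set: the faithful form and its palatalised counterpart."""
    return sorted({inp, palatalise(inp)})


def evaluate(inp, ranking):
    """Pick the candidate with the lexicographically smallest violation profile."""
    return min(gen(inp), key=lambda cand: tuple(c(inp, cand) for c in ranking))


def derive(inp, grammar):
    """Pass a form through stem, word and phrase strata, recording each output."""
    outputs, form = {}, inp
    for stratum in ("stem", "word", "phrase"):
        form = evaluate(form, grammar[stratum])
        outputs[stratum] = form
    return outputs


# Before phonologisation: palatalisation is enforced at the phrase level only.
grammar_before = {
    "stem": [ident, star_ki],
    "word": [ident, star_ki],
    "phrase": [star_ki, ident],
}
# After phonologisation: the phrase-level ordering is carried over to the word level.
grammar_after = dict(grammar_before, word=grammar_before["phrase"])

print(derive("ki", grammar_before))
print(derive("ki", grammar_after))
```

In this sketch the surface (phrase-level) form is the same under both grammars. What changes is the stratum at which the palatalised form is first selected, which is the sense in which the process has moved from the post-lexical to the lexical level. The sketch passes a single form through the three strata once and does not model morphological structure.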

An extension of Stratal OT, the life cycle of phonological processes (Bermúdez-Otero 2007; Bermúdez-Otero 2015), offers yet another definition of phonologisation and a more fine-grained terminological set. Bermúdez-Otero (2015) reserves the term “phonologisation” for when a physico-physiological (mechanical) phenomenon comes under the control of the speaker/hearer and in fact becomes part of her grammar (more specifically, part of the phonetic module of the grammar). The process, once it has entered the grammar, can continue its “ascent” through increasingly deeper grammatical modules. A (gradient) phonologised process is said to be “stabilised” (and thus categorical) once it is generated by a categorical phonological rule, which applies at the phrase level. At this stage, a stabilised process has entered the phonological module of the speaker/hearer. A stabilised process further undergoes “domain narrowing” when it starts being applied at the word level and then at the stem level. In the final step in the ascent of a sound pattern through the grammar, a phonological process comes under morphological and lexical control, until “it may die altogether, leaving behind no more than inert traces in underlying representations” (Bermúdez-Otero 2015: 12).

A further definition of phonologisation stems from exemplar theories of speech perception and production (Johnson 1997; Pierrehumbert 2001; Sóskuthy et al. 2018; Ambridge 2018; Todd, Pierrehumbert & Hay 2019). A core tenet of these models is that speech tokens are stored in memory as so-called exemplars after having been experienced. Depending on the specifics of the particular model, exemplars are stored at varying degrees of granularity and richness of detail. Each exemplar consists of a (more or less) faithful representation of the actual token of experience that generated it, and it thus contains information from multiple levels and factors (phonetic, lexical, syntactic, sociolinguistic, contextual, and so on). Lexical and other linguistic units are represented as sets of exemplars, or exemplar clouds. The representational space of exemplar clouds is multi-dimensional and can be operationalised as a multivariate distribution. In modular approaches to grammar as briefly expounded above, sound alternations can be encoded (in terms of derivational rules and/or constraints) either at the phonological level or at the phonetic level of representation. On the other hand, as Sóskuthy (2013: 183) illustrates, a consequence of the exemplar mode of representation is that all sound alternations are directly encoded by exemplars within the exemplar cloud, at one single level of representation. As soon as an exemplar with new phonetic characteristics is experienced and stored, the lexical representation of that lexical item already contains information on the sound alternation. In this sense, every type of variation is “phonologised” (represented) from the outset as soon as it is experienced by the speaker/hearer and stored in memory.
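The idea that a stored token immediately becomes part of the lexical representation can be sketched with a toy exemplar store. The Python code below is a minimal illustration under strong simplifying assumptions: exemplars are reduced to two hypothetical dimensions (vowel duration and closure duration, in ms), the class and method names are my own, and the cloud is summarised only by its centroid rather than by a full multivariate distribution.

```python
from collections import defaultdict
from statistics import mean


class ExemplarLexicon:
    """Toy exemplar store: each word maps to a cloud of remembered tokens."""

    def __init__(self):
        # word -> list of (vowel_ms, closure_ms) tuples, one per stored token
        self.clouds = defaultdict(list)

    def store(self, word, vowel_ms, closure_ms):
        """Add one experienced token to the word's exemplar cloud."""
        self.clouds[word].append((vowel_ms, closure_ms))

    def centroid(self, word):
        """Mean vowel and closure duration over all stored tokens of the word."""
        cloud = self.clouds[word]
        return (mean(v for v, _ in cloud), mean(c for _, c in cloud))


lexicon = ExemplarLexicon()
# Hypothetical tokens of a single word; the values are invented for illustration.
lexicon.store("bad", 100, 80)
lexicon.store("bad", 110, 70)
before = lexicon.centroid("bad")

# A single new token with a longer vowel shifts the representation at once.
lexicon.store("bad", 150, 60)
after = lexicon.centroid("bad")
print(before, after)
```

No separate rule or constraint has to be added for the longer vowel to count as part of the word's representation: storing the token is enough, which is the sense in which variation is “phonologised” from the outset in such models.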

When the term phonologisation is employed in the phonetic literature on the voicing effect, it is generally not attributed to any specific phonological framework. This makes it less straightforward to interpret the term as the original author might have intended, but, as far as I can tell, most authors would take it to mean at least that the effect is not just mechanical and/or low-level, but that it has assumed higher-level functions of some sort, whatever the specific function might be. Since the main focus of this work is on the source of the voicing effect rather than on what functions the voicing effect can assume in different languages, the topic of the phonologisation of the voicing effect will only briefly be touched upon in the rest of the dissertation. Note, however, that the account proposed in 8 is envisioned to be informed by some form of exemplar model of speech perception and production, where everything can be considered “phonologised” as soon as it is part of the lexical representation. In this sense, the effect is part of the mental word representations in all languages that have it, independent of its magnitude or function. A discussion of arguments for or against such a position is, however, beyond the scope of this dissertation.

Going back to the phonologisation (in the general sense) of the voicing effect in English, Jong (2004) shows that the effect is greater in stressed syllables and under focus in English but not in Arabic (Jong & Zawaydeh 2002), and argues that vowel duration is used contrastively as a cue to voicing in the former language. A further argument for the phonologisation of the durational difference in English is the stability of the effect across speaking tempos. Port & Dalby (1982) suggest that the ratio of the consonant and vowel durations is stable at faster and slower speaking rates, and that the CV ratio proves to be the primary acoustic correlate of voicing in word-final position. Luce & Charles-Luce (1985), however, claim that vowel duration is a more robust cue across tempos than the CV ratio and the duration of the stop closure. Finally, Ko (2018) compares CV ratio values in three speaking styles (normal, faster, and slower) and finds that the ratio changes as a function of speaking style and that the effect of style interacts with consonant voicing. In sum, there is contrasting evidence as to whether the relative magnitude of the effect is stable across speaking tempos or not, and as to whether this can be taken as evidence for or against the phonologisation of the effect in English.

The mechanisms behind the emergence of the voicing effect are in principle independent from those driving the subsequent phonologisation of the effect. In light of this, the next section reviews different proposals of what the source of the voicing effect might be, while leaving aside the further question of how the effect can be exploited phonologically once in place.

1.5 One phenomenon, many explanations

Over a century of research on the voicing effect has without doubt brought progress in our understanding of this complex phenomenon. While several proposals were put forward in the period between the 1950s and the 1970s, subsequent years focussed on testing or extending previous hypotheses and no final consensus has been reached. A broad distinction can be drawn between accounts that ascribe the voicing effect to articulatory or aerodynamic properties of speech production, and accounts that instead draw on biases of the perceptual system. It remains unsettled which of the two sides best accounts for all aspects of the voicing effect; rather, both views contribute in some respect to the overall picture. The following paragraphs review the most notable perception and production accounts, paving the way for a discussion of phonation effects related to that of voicing in the following section.

A perception-based explanation advocated by Javkin (1976) argues that the voicing effect emerges as a consequence of the difficulty in the perceptual identification of the vowel-consonant boundary in the context of voiced stops, and of the misinterpretation of voicing during closure. According to this account, listeners misperceive the periodic vibration of the vocal folds (voicing) during the closure of a voiced stop as being part of the preceding vowel. In the absence of contextual correction, this misperception can lead to the creation of a new production norm where the vowel is lengthened (Ohala 1989). Subsequent productions of vowels followed by voiced stops would thus be longer than vowels followed by voiceless stops. Although Javkin (1976) does not directly test the hypothesis that closure voicing is reinterpreted as being part of the preceding vowel, his study indicates that listeners perceive vowels to be longer when followed by voiced than when followed by voiceless stops, other things being equal. On the other hand, Sanker (2019) finds that vowels followed by voiced stops rather elicit fewer “long” responses, while more “long” responses are elicited in stimuli where the following consonant was spliced out. However, listeners perceived vowels with falling F0 to be longer than vowels with flat or rising F0, in partial accord with previous work (Lehiste 1976; Yu 2010; Cumming 2011).

To provide a rationale for the language-specificity of the voicing effect, Kluender, Diehl & Wright (1988) propose that different languages can exploit the perceptual biases behind the effect to different degrees. As discussed in 1.4, the ratio between the duration of the closure and that of the vowel has been identified as one of the perceptual cues to voicing (Port & Dalby 1982; Lisker 1986). Listeners associate smaller values of the CV ratio with voiced stops, and, vice versa, greater values with voiceless stops. Kluender, Diehl & Wright (1988) argue that speakers can actively manipulate vowel durations to proportionally increase the difference in ratio between the two voicing categories, so that the ratio would be even smaller in the voiced context and even greater in the voiceless one. As a consequence, the perceptual distance between the voicing categories would be enhanced, thus facilitating discrimination (Stevens & Keyser 1989; Kingston & Diehl 1994). According to this view, listeners’ discrimination of vowel duration should show a “contrast effect,” by which longer closure durations elicit more “short vowel” responses and shorter closures more “long vowel” responses. However, Fowler (1992) shows that listeners judge vowels to be longer when the stop closure duration is increased, and that, similarly, stop closure is perceived to be longer when vowel duration is increased. These results indicate a mechanism of perceptual assimilation of the respective durations of vowels and stop closures and do not support a contrast effect.
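The enhancement argument rests on simple arithmetic, which the following Python sketch makes explicit. All durations are hypothetical values chosen for illustration, and the names are my own. The CV ratio is taken here as closure duration divided by vowel duration, in line with the description above that smaller values are associated with voiced stops.

```python
def cv_ratio(closure_ms, vowel_ms):
    """Closure duration divided by vowel duration."""
    return closure_ms / vowel_ms


def ratio_separation(durations):
    """Voiceless CV ratio minus voiced CV ratio for one set of durations."""
    voiced = cv_ratio(durations["voiced"]["closure"], durations["voiced"]["vowel"])
    voiceless = cv_ratio(durations["voiceless"]["closure"],
                         durations["voiceless"]["vowel"])
    return voiceless - voiced


# Hypothetical baseline: closures differ by voicing, vowels do not.
baseline = {
    "voiced": {"closure": 60, "vowel": 150},
    "voiceless": {"closure": 90, "vowel": 150},
}
# Hypothetical enhanced pattern: same closures, but the vowel is lengthened
# before the voiced stop and shortened before the voiceless stop.
enhanced = {
    "voiced": {"closure": 60, "vowel": 180},
    "voiceless": {"closure": 90, "vowel": 120},
}

print(f"baseline separation: {ratio_separation(baseline):.2f}")
print(f"enhanced separation: {ratio_separation(enhanced):.2f}")
```

With these invented values, manipulating vowel duration alone roughly doubles the separation between the two ratios, which is the kind of increase in perceptual distance that the enhancement proposal appeals to. The sketch says nothing about whether listeners actually use the ratio in this way, which is the point challenged by the results of Fowler (1992).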

While perceptual biases could be driving some aspects of the voicing effect and be responsible for its enhancement in some languages, production mechanisms are likely to provide the necessary variation that would be exploited by the perceptual system (Beguš 2017; Sanker 2019). Although individual production accounts differ in the details, two broad categories can be identified. Some accounts ascribe the source of the voicing effect to mechanisms of compensation within a certain property of speech (either duration or articulatory force), while others relate the emergence of the effect to timing aspects of articulatory gestures (either laryngeal or oral).

The compensatory temporal adjustment account (Lehiste 1970b) states that the relative durations of vowel and consonant in a VC sequence are correlated. A well-known fact about stop closure is that it is longer in voiceless stops and shorter in voiced stops (Lisker 1957; Summers 1987; Davis & Summers 1989; Jong 1991). As a consequence, vowels are shorter when followed by the longer closure of voiceless stops, and they are longer when followed by the shorter closure of voiced stops. This compensatory pattern would be the consequence of keeping the duration of a particular speech interval fixed, while the duration of the closure changes depending on the voicing status of the stop. Proponents of this account have argued that compensation is implemented either at the level of the syllable (or of the VC sequence, Lindblom 1967; Farnetani & Kori 1986), or at the level of the word (Slis & Cohen 1969a; Slis & Cohen 1969b; Lehiste 1970a; Lehiste 1970b). This formulation of the account, however, faces empirical and logical challenges. The duration of both the syllable and the word is affected by stop voicing (Chen 1970; Jacewicz, Fox & Lyle 2009), and it is not clear why compensation within a word should necessarily target the pre-consonantal vowel and not other segments (these issues are discussed in more detail in 4).
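The logic of the account, and of the empirical challenge to it, can be spelled out in a short Python sketch. It assumes, for the sake of illustration only, that the fixed interval is exhausted by the vowel and the stop closure; the durations and names are hypothetical.

```python
def predicted_vowel_ms(interval_ms, closure_ms):
    """Vowel duration predicted when the vowel + closure interval is held fixed."""
    if not 0 <= closure_ms <= interval_ms:
        raise ValueError("closure duration must fit inside the interval")
    return interval_ms - closure_ms


def implied_interval_ms(vowel_ms, closure_ms):
    """Interval duration implied by a measured vowel and closure duration."""
    return vowel_ms + closure_ms


# Hypothetical values, in ms.
INTERVAL_MS = 250
CLOSURE_MS = {"voiced": 60, "voiceless": 90}

predicted = {voicing: predicted_vowel_ms(INTERVAL_MS, closure)
             for voicing, closure in CLOSURE_MS.items()}
print(predicted)

# Under a fixed interval, the vowel difference mirrors the closure difference.
vowel_gain = predicted["voiced"] - predicted["voiceless"]
closure_gain = CLOSURE_MS["voiceless"] - CLOSURE_MS["voiced"]
print(vowel_gain == closure_gain)
```

If measured vowel and closure durations yield different values of `implied_interval_ms` in the two voicing contexts, the chosen interval is not in fact constant, which is the empirical objection raised above for the syllable and the word.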

Another production proposal attributes the voicing-driven duration differences of vowels to articulatory energy expenditure, rather than temporal aspects. Meyer (1903) and similarly Belasco (1953) propose that the articulatory force required to produce a syllable is constant, and thus it is distributed across segments according to their energy requirements. According to this hypothesis, voiceless stops are produced with more force than voiced stops, and hence some force is subtracted from the production of the preceding vowel to maintain the overall force constant. However, the concept of “articulatory force” lacks an empirically solid definition, and experimental results mentioned in Zimmerman & Sapon (1958) rather point to the absence of a relation between energy expenditure and vowel duration.

While the compensatory temporal adjustment and the energy expenditure accounts rely on compensatory mechanisms of duration or articulatory force, two other proposals concern aspects of gestural timing of the larynx and the consonant closing gesture. The laryngeal adjustment account (Halle & Stevens 1967; Halle, Stevens & Oppenheim 1967; Chomsky & Halle 1968) is based on the idea that voicing during stop closure requires precise adjustments of the glottis in order to comply with aerodynamic constraints (Ohala 2011). Such articulatory precision takes more time to implement than the production of closure voicelessness. Because of these properties of laryngeal articulation, full closure can be achieved relatively faster in the context of voiceless stops (which require less precise control), while a delay of closure onset in voiced stops ensures enough time to produce the suitable glottal configuration. The preliminary electromyographic study of glottal muscular activity discussed in Chen (1970), however, does not suggest the presence of early laryngeal activity during the production of vowels followed by voiced stops compared to vowels followed by voiceless stops. Further articulatory evidence shows that there is rather a general increase of activity of certain laryngeal muscles (namely, the posterior cricoarytenoid and the cricothyroid) during the production of voiceless sounds (Hirose & Gay 1972; Kagaya & Hirose 1975; Hirose 1977; Löfqvist et al. 1989). No conclusive evidence can thus be adduced in support of the laryngeal adjustment hypothesis, although other laryngeal mechanisms cannot be ultimately excluded (Beguš 2017).

Another production account is based on the rate of stop closure transition (Öhman 1967a; Chen 1970). Voiceless stops are articulated with greater glottal opening relative to voiced stops (and vowels), so that a greater volume of air is admitted into the oral cavity. Öhman (1967a) argues that the production of the closure of voiceless stops would then require more muscular effort to counteract the increased intra-oral pressure generated by the greater airflow. As a consequence, the rate of the closing gesture of voiceless stops is higher than that of voiced stops. In other words, full closure will be achieved earlier relative to the onset of the closing gesture when the stop is voiceless than when it is voiced. Hence, vowels will be shorter when followed by voiceless stops than when followed by voiced stops. Chen (1970) observes that the difference in labial closure rate accounted for 20% of the difference in vowel duration. Subsequent work by Warren & Jacks (2005) further shows that the percentage of the difference which is accounted for rises to 80% when considering the movements of both the lips and the jaw.

In sum, four main production accounts (or variations thereof) can be found in the literature on the voicing effect. More specifically, two of these accounts posit a mechanism of compensation either between segmental durations (the compensatory temporal adjustment account) or articulatory force (the articulatory energy expenditure account), while two relate durational differences to the timing of laryngeal gestures (the laryngeal adjustment account) or oral gestures (the rate of stop closure transition account). In the following section I review the effects of two other phonation types (aspiration and ejection) on vowel duration, and how these shed light on the aforementioned accounts of the voicing effect.

1.6 Beyond voicing

Two phonation modes other than voicing are known to affect preceding vowel duration: aspiration and ejection. While this project focusses on the voicing effect, the closely related aspiration and ejection effects have consequences of theoretical importance. The results concerning the aspiration effect are mixed. Maddieson & Gandour (1976), Durvasula & Luo (2012), and Lampp & Reklis (2004) report longer vowels before aspirated than before unaspirated stops in Hindi, and Maddieson (1976) finds a similar trend in Assamese, Bengali, and Marathi. Ohala & Ohala (1992), on the other hand, show that vowels have the same duration before unaspirated and aspirated stops in their sample of Hindi speakers. Sanker (2018) observes an effect of aspiration in Hindi long vowels but not in short vowels, while the effect is reversed in Telugu long vowels (vowels are shorter before aspirated than unaspirated stops), with no appreciable difference in short vowels. Note that these studies do not easily lend themselves to comparison, since the material and contexts used differ (for instance, vowel type, vowel phonological length, number of syllables, and context following the test word).

The trend of vowels being longer when followed by aspirated stops challenges some of the accounts presented in 1.5, as noted in Maddieson & Gandour (1976). The articulatory force expenditure hypothesis predicts vowels to be shorter before aspirated than before unaspirated stops since it is likely that aspirated stops require greater force than unaspirated ones. According to the laryngeal adjustment account, the duration of the vowels should not differ in voiceless unaspirated and aspirated stops since they both require glottal opening, rather than the precise adjustments characteristic of voiced stops. Since closure rate is determined by airflow, the rate of closure account expects vowels followed by aspirated stops to be shorter than or equal to vowels followed by unaspirated stops, since the former should be characterised by greater airflow and higher closure rates due to glottal spreading. While Maddieson & Gandour (1976) argue against a compensatory effect between vowel and consonant duration, the data in Durvasula & Luo (2012) are instead compatible with it (see 4). The results in Sanker (2018) on Hindi, but not Telugu, are also compatible with a compensatory mechanism. Closure duration is longer in unaspirated stops and shorter in aspirated stops in both languages, but the effect of aspiration on vowel duration has opposite directions depending on language, as discussed above.

An investigation of the effect of ejection in Georgian (Beguš 2017) shows that vowels are shortest when followed by aspirated stops, longer when followed by ejectives, and longest when followed by voiced stops. Georgian contrasts aspirated voiceless, ejective, and voiced unaspirated stops. The negative correlation between closure and vowel duration has greater magnitude in the context of voiced compared to that of aspirated and ejective stops. Moreover, the author shows that the closure effect on vowel duration coexists with a “Laryngeal Features” effect (both closure duration and phonation, when entered in a single regression model, lead to significant p-values). In other words, the variance in vowel duration is accounted for in part by the duration of the stop closure and in part by the voicing category of the post-vocalic stop. As discussed in Beguš (2017), these patterns are compatible with accounts of compensatory temporal adjustments, laryngeal adjustments, and rate of closure.
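To illustrate what it means for closure duration and laryngeal category to contribute jointly to vowel duration, the following Python sketch fits a linear model with both predictors to synthetic data. It is not the model reported in Beguš (2017): the data are randomly generated, the laryngeal contrast is reduced to a binary voiced/voiceless indicator, the coefficients used to generate the data are arbitrary, and no significance testing is performed. The least-squares fit is obtained from the normal equations with a small hand-written solver so that the example needs only the standard library.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data: vowel duration depends on closure duration AND on voicing.
random.seed(1)
rows, y = [], []
for _ in range(200):
    voiced = random.choice([0, 1])
    closure = random.uniform(50, 80) if voiced else random.uniform(70, 110)
    vowel = 200 - 0.5 * closure + 30 * voiced + random.gauss(0, 5)
    rows.append([1.0, closure, float(voiced)])  # intercept, closure, voicing
    y.append(vowel)

intercept, closure_slope, voiced_effect = ols(rows, y)
print(f"closure slope: {closure_slope:.2f} ms/ms, voiced effect: {voiced_effect:.1f} ms")
```

In data generated this way, the fitted closure slope is negative and the voicing coefficient is positive even though closure duration is already in the model, which mirrors the logic of a closure effect coexisting with a separate laryngeal effect. Whether real data behave in this way is an empirical matter that the sketch does not address.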

In conclusion, our partial understanding of the relation between vowel duration and consonant phonation is based on contrasting or complementary empirical evidence. This state of affairs can be taken as an indication that, while most research (save a few recent exceptions) focussed on finding a unique and unified mechanism behind the voicing effect, we might rather seek multiple mechanisms that cooperate to produce the observed patterns. In light of this, this dissertation sets out to study the interrelations between different sources of evidence, and their interpretation. 2 presents a more detailed discussion of the rationale behind the research of this dissertation, and a summary of the main results.


  1. This does not exclude that there might be or have been a language that shows this pattern.

  2. From a typological perspective, this sample is strongly biased towards the Indo-European language family. Moreover, the five non-Indo-European languages in the list are all thriving and well-studied. This notwithstanding, the voicing effect is generally regarded as a very common and widespread phenomenon.

  3. Haspelmath (2010) proposes to use capitalised names for descriptive categories (for example, “Italian Voicing”), but since this use is not common among phonologists/phoneticians, I will not adopt it here.