Tokenise IPA strings
phonetise()
tokenises strings of IPA symbols (like phonetic transcriptions
of words) into individual "phones". The output is a list.
Usage
phonetise(
strings,
multi = NULL,
regex = NULL,
split = TRUE,
sep = " ",
sanitise = TRUE,
ignore_stress = TRUE,
ignore_tone = TRUE,
diacritics = FALSE,
affricates = FALSE,
v_sequences = FALSE,
prenasalised = FALSE,
all_multi = FALSE,
sanitize = sanitise
)
phonetize(
strings,
multi = NULL,
regex = NULL,
split = TRUE,
sep = " ",
sanitise = TRUE,
ignore_stress = TRUE,
ignore_tone = TRUE,
diacritics = FALSE,
affricates = FALSE,
v_sequences = FALSE,
prenasalised = FALSE,
all_multi = FALSE,
sanitize = sanitise
)
Arguments
- strings
A character vector of words transcribed in IPA.
- multi
A character vector of one or more multi-character phones as strings.
- regex
A string with a regular expression to match several multi-character phones.
- split
If set to TRUE (the default), the tokenised strings are split into phones (i.e. each string is returned as a vector with one element per phone). If set to FALSE, the strings are not split and the phones are separated with the character defined in sep.
- sep
A character to be used as the separator of the phones if split = FALSE (default is " ").
- sanitise
Whether to remove all non-IPA characters (TRUE by default).
- ignore_stress
If TRUE (the default), stress marks are not parsed.
- ignore_tone
If TRUE (the default), tone marks and letters are not parsed.
- diacritics
If set to TRUE, parses all valid diacritics as part of the previous character (FALSE by default).
- affricates
If set to TRUE, parses homorganic stop + fricative as affricates (FALSE by default).
- v_sequences
If set to TRUE, collapses vowel sequences (FALSE by default).
- prenasalised
If set to TRUE, parses prenasalised consonants as such (FALSE by default).
- all_multi
If set to TRUE, diacritics, affricates, v_sequences and prenasalised are all set to TRUE (FALSE by default).
- sanitize
Alias of sanitise.
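The logical arguments above can stand in for an explicit multi vector. The following is a minimal sketch of how they might be combined. It assumes that phonetise() is exported by a package named phonetisr (the package name is not stated on this page), and it prints the results rather than asserting any particular tokenisation.

```r
# Assumption: phonetise() lives in a package called "phonetisr".
# The guard keeps the sketch runnable when that package is not installed.
if (requireNamespace("phonetisr", quietly = TRUE)) {
  ipa <- c("pʰãkʰ", "tʰum̥", "ɛkʰɯ")

  # Attach diacritics to the preceding character instead of listing
  # every multi-character phone by hand.
  by_diacritics <- phonetisr::phonetise(ipa, diacritics = TRUE)
  str(by_diacritics)

  # Switch on every multi-character option at once and return
  # unsplit strings with "-" between phones.
  joined <- phonetisr::phonetise(ipa, all_multi = TRUE, split = FALSE, sep = "-")
  print(joined)
}
```

Inspecting the output with str() is a quick way to check whether a given option groups characters the way a transcription convention expects.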
Examples
ipa <- c("pʰãkʰ", "tʰum̥", "ɛkʰɯ")
ph <- c("pʰ", "tʰ", "kʰ", "ã", "m̥")
phonetise(ipa, multi = ph)
#> [[1]]
#> [1] "pʰ" "ã" "kʰ"
#>
#> [[2]]
#> [1] "tʰ" "u" "m̥"
#>
#> [[3]]
#> [1] "ɛ" "kʰ" "ɯ"
#>
ph_2 <- ph[4:5]
# Match any character followed by <ʰ> with ".ʰ".
phonetise(ipa, multi = ph_2, regex = ".ʰ")
#> [[1]]
#> [1] "pʰ" "ã" "kʰ"
#>
#> [[2]]
#> [1] "tʰ" "u" "m̥"
#>
#> [[3]]
#> [1] "ɛ" "kʰ" "ɯ"
#>
# Same result.
phonetise(ipa, regex = ".(\u0303|\u0325|\u02B0)")
#> [[1]]
#> [1] "pʰ" "ã" "kʰ"
#>
#> [[2]]
#> [1] "tʰ" "u" "m̥"
#>
#> [[3]]
#> [1] "ɛ" "kʰ" "ɯ"
#>
# Don't split strings and use "." as separator
phonetise(ipa, multi = ph, split = FALSE, sep = ".")
#> [1] "pʰ.ã.kʰ" "tʰ.u.m̥" "ɛ.kʰ.ɯ"
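Before passing a pattern to the regex argument, it can help to see which substrings it matches. The sketch below uses only base R (gregexpr() and regmatches()), not phonetise() itself, and writes the IPA strings with Unicode escapes so that the combining marks are explicit. It shows only what the pattern matches, not how phonetise() handles the remaining characters.

```r
# Base R only: list the substrings a candidate `regex` value matches.
# The strings are the same three words used in the examples above,
# written with combining tilde (U+0303) and combining ring below (U+0325).
ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F")

# Any single character followed by a combining tilde, a combining
# ring below, or a superscript h (U+02B0).
pattern <- ".(\u0303|\u0325|\u02B0)"

# gregexpr() finds every match; regmatches() extracts the matched text.
matches <- regmatches(ipa, gregexpr(pattern, ipa, perl = TRUE))
print(matches)
```

Characters not covered by the pattern (such as the plain vowels) do not appear in matches, which makes gaps in a pattern easy to spot.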