Tokenise IPA strings — phonetise • phonetisr

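phonetise() tokenises strings of IPA symbols (such as phonetic transcriptions of words) into individual "phones". By default the output is a list with one character vector of phones per input string; with split = FALSE it is a character vector in which the phones of each string are joined by sep.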

Usage

phonetise(
  strings,
  multi = NULL,
  regex = NULL,
  split = TRUE,
  sep = " ",
  sanitise = TRUE,
  ignore_stress = TRUE,
  ignore_tone = TRUE,
  diacritics = FALSE,
  affricates = FALSE,
  v_sequences = FALSE,
  prenasalised = FALSE,
  all_multi = FALSE,
  sanitize = sanitise
)

phonetize(
  strings,
  multi = NULL,
  regex = NULL,
  split = TRUE,
  sep = " ",
  sanitise = TRUE,
  ignore_stress = TRUE,
  ignore_tone = TRUE,
  diacritics = FALSE,
  affricates = FALSE,
  v_sequences = FALSE,
  prenasalised = FALSE,
  all_multi = FALSE,
  sanitize = sanitise
)

Arguments

strings: A character vector of words transcribed in IPA.
multi: A character vector of one or more multi-character phones as strings.
regex: A string with a regular expression to match several multi-character phones.
split: If set to TRUE (the default), the tokenised strings are split into phones (i.e. each string becomes a character vector with one element per phone). If set to FALSE, the strings are not split and the phones are separated with the character defined in sep.
sep: A character to be used as the separator of the phones if split = FALSE (default is " ", a space).
sanitise: Whether to remove all non-IPA characters (TRUE by default).
ignore_stress: If TRUE (the default), stress marks are not parsed.
ignore_tone: If TRUE (the default), tone marks and letters are not parsed.
diacritics: If set to TRUE, parses all valid diacritics as part of the previous character (FALSE by default).
affricates: If set to TRUE, parses homorganic stop + fricative as affricates (FALSE by default).
v_sequences: If set to TRUE, collapses vowel sequences (FALSE by default).
prenasalised: If set to TRUE, parses prenasalised consonants as such (FALSE by default).
all_multi: If set to TRUE, diacritics, affricates, v_sequences and prenasalised are all set to TRUE (FALSE by default).
sanitize: Alias of sanitise.
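
As a rough illustration of the switches above, the sketch below tokenises two transcriptions with the automatic options instead of an explicit multi vector. The input strings are hypothetical and not taken from the package. The exact tokenisation depends on which symbols the package treats as diacritics or affricates, so no output is shown.

```r
library(phonetisr)

# Hypothetical transcriptions (Unicode escapes keep the code ASCII-only):
# "t\u0283a" is t + esh + a, a candidate stop + fricative sequence;
# "p\u02B0a\u0303" is p + aspiration mark + a + combining tilde.
words <- c("t\u0283a", "p\u02B0a\u0303")

# Attach diacritics to the preceding character.
with_diacritics <- phonetise(words, diacritics = TRUE)

# Parse homorganic stop + fricative sequences as single affricates.
with_affricates <- phonetise(words, affricates = TRUE)

# Switch on diacritics, affricates, v_sequences and prenasalised at once.
with_all <- phonetise(words, all_multi = TRUE)

# Each call returns one list element per input string.
length(with_all) == length(words)
```

Comparing the three results for the same input is a quick way to see which option is responsible for a given multi-character phone.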

Value

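A list with one character vector of phones per input string or, if split = FALSE, a character vector with the phones of each string joined by sep.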
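
The two return shapes can be checked with the vectors used in the Examples section. This is a small sketch that only inspects the structure of the result, not a complete test of the function.

```r
library(phonetisr)

ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F")
ph <- c("p\u02B0", "t\u02B0", "k\u02B0", "a\u0303", "m\u0325")

# split = TRUE (the default): a list with one character vector per input string.
phones <- phonetise(ipa, multi = ph)
is.list(phones)
lengths(phones)  # number of phones in each word

# split = FALSE: a character vector, with the phones of each word joined by sep.
joined <- phonetise(ipa, multi = ph, split = FALSE, sep = ".")
is.character(joined)
```

The list form is convenient for per-phone processing (for example with lapply()), while the joined form fits into a data frame column.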

Examples

library(phonetisr)

# Unicode escapes keep the example ASCII-only, as CRAN policy requires.
ipa <- c("p\u02B0a\u0303k\u02B0", "t\u02B0um\u0325", "\u025Bk\u02B0\u026F")
ph <- c("p\u02B0", "t\u02B0", "k\u02B0", "a\u0303", "m\u0325")

phonetise(ipa, multi = ph)
#> [[1]]
#> [1] "pʰ" "ã"  "kʰ"
#> 
#> [[2]]
#> [1] "tʰ" "u"  "m̥" 
#> 
#> [[3]]
#> [1] "ɛ"  "kʰ" "ɯ" 
#> 

ph_2 <- ph[4:5]

# Match any character followed by <\u02B0> with ".\u02B0".
phonetise(ipa, multi = ph_2, regex = ".\u02B0")
#> [[1]]
#> [1] "pʰ" "ã"  "kʰ"
#> 
#> [[2]]
#> [1] "tʰ" "u"  "m̥" 
#> 
#> [[3]]
#> [1] "ɛ"  "kʰ" "ɯ" 
#> 

# Same result.
phonetise(ipa, regex = ".(\u0303|\u0325|\u02B0)")
#> [[1]]
#> [1] "pʰ" "ã"  "kʰ"
#> 
#> [[2]]
#> [1] "tʰ" "u"  "m̥" 
#> 
#> [[3]]
#> [1] "ɛ"  "kʰ" "ɯ" 
#> 

# Don't split strings and use "." as separator
phonetise(ipa, multi = ph, split = FALSE, sep = ".")
#> [1] "pʰ.ã.kʰ" "tʰ.u.m̥"  "ɛ.kʰ.ɯ"