Count syllables in a text — nsyllable • quanteda

Returns a count of the number of syllables in texts. For English words, the syllable count is exact and is looked up from the default syllable dictionary data_int_syllables, which is based on the CMU pronunciation dictionary. For any word not in the dictionary, the syllable count is estimated by counting vowel clusters.
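The lookup-then-estimate behaviour described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `toy_dict`, `count_vowel_clusters()` and `sketch_nsyllable()` are hypothetical names, the dictionary is a two-word stand-in for data_int_syllables, and treating "y" as a vowel is an assumption.

```r
# Hypothetical toy dictionary; the real data_int_syllables is far larger.
toy_dict <- c(cat = 1L, syllable = 3L)

# Fallback estimate: each maximal run of vowels counts as one syllable.
# Including "y" among the vowels is an assumption made for this sketch.
count_vowel_clusters <- function(word) {
  matches <- gregexpr("[aeiouy]+", word)[[1]]
  if (matches[1] == -1L) NA_integer_ else length(matches)
}

sketch_nsyllable <- function(words, dict = toy_dict) {
  words <- tolower(words)            # dictionary names are lower case
  counts <- unname(dict[words])      # NA where a word is not in the dictionary
  missing <- is.na(counts)
  counts[missing] <- vapply(words[missing], count_vowel_clusters,
                            integer(1), USE.NAMES = FALSE)
  counts
}

sketch_nsyllable(c("cat", "Brexit", "."))
```

In this sketch "cat" comes from the dictionary, "Brexit" falls back to its two vowel clusters, and "." has no vowels and stays NA.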

data_int_syllables is a quanteda-supplied data object consisting of a named numeric vector of syllable counts for the words used as names. This is the default object used to count English syllables. This object can be accessed directly, but we strongly encourage you to access it only through the nsyllable() wrapper function.

nsyllable(
  x,
  syllable_dictionary = quanteda::data_int_syllables,
  use.names = FALSE
)

Arguments

x	character vector or `tokens` object whose syllables will be counted. Elements of a character vector are counted as given, without being split into tokens first, so it is recommended that x be individual terms.
syllable_dictionary	optional named integer vector of syllable counts where the names are lower case tokens. By default, the function uses the quanteda data object `data_int_syllables`, an English syllable dictionary based on the CMU pronunciation dictionary.
use.names	logical; if `TRUE`, assign the tokens as the names of the syllable count vector
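As a rough illustration of the `syllable_dictionary` argument, a custom dictionary is just a named integer vector with lower case names. The entries below are hypothetical, and the nsyllable() call is guarded because it assumes an installed quanteda version that exports the function with the signature shown above; no output is shown because it depends on that version.

```r
# Hypothetical custom entries, written in mixed case on purpose.
custom_syllables <- c(Quanteda = 3L, Brexit = 2L)

# Names should be lower case so they match the lowercased input tokens.
names(custom_syllables) <- tolower(names(custom_syllables))

# Only call nsyllable() if the installed quanteda exports it (an assumption;
# other versions of the package may not).
if (requireNamespace("quanteda", quietly = TRUE) &&
    "nsyllable" %in% getNamespaceExports("quanteda")) {
  print(quanteda::nsyllable(c("Quanteda", "Brexit"),
                            syllable_dictionary = custom_syllables,
                            use.names = TRUE))
}
```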

Value

If x is a character vector, a numeric vector of the syllable counts for each element, named by element if use.names = TRUE. If x is a tokens object, a list of syllable counts where each list element corresponds to the tokens in a document.

Note

All tokens are automatically converted to lowercase to perform the matching with the syllable dictionary, so there is no need to perform this step prior to calling nsyllable().

nsyllable() only works reliably for English, as the only syllable count dictionary we could find is the freely available CMU pronunciation dictionary at http://www.speech.cs.cmu.edu/cgi-bin/cmudict. If you have a dictionary for another language, please email the package maintainer as we would love to include it.

Examples

# character
nsyllable(c("cat", "syllable", "supercalifragilisticexpialidocious", 
            "Brexit", "Administration"), use.names = TRUE)
#>                                cat                           syllable 
#>                                  1                                  3 
#> supercalifragilisticexpialidocious                             Brexit 
#>                                 13                                  2 
#>                     Administration 
#>                                  5 

# tokens
txt <- c(doc1 = "This is an example sentence.",
         doc2 = "Another of two sample sentences.")
nsyllable(tokens(txt, remove_punct = TRUE))
#> $doc1
#> [1] 1 1 1 3 2
#> 
#> $doc2
#> [1] 3 1 1 2 3
#> 
# punctuation is not counted: punctuation tokens return NA
nsyllable(tokens(txt), use.names = TRUE)
#> $doc1
#>     This       is       an  example sentence        . 
#>        1        1        1        3        2       NA 
#> 
#> $doc2
#>   Another        of       two    sample sentences         . 
#>         3         1         1         2         3        NA 
#>