Removes sentences from a corpus or character vector that are shorter than a minimum length, longer than a maximum length, or that match an exclusion pattern.

corpus_trimsentences(x, min_length = 1, max_length = 10000,
  exclude_pattern = NULL, return_tokens = FALSE)

char_trimsentences(x, min_length = 1, max_length = 10000,
  exclude_pattern = NULL)

Arguments

x

corpus or character object whose sentences will be selected.

min_length, max_length

minimum and maximum lengths in word tokens (excluding punctuation)

exclude_pattern

a stringi regular expression whose match (at the sentence level) will be used to exclude sentences

return_tokens

if TRUE, return a tokens object of the sentences remaining after trimming; otherwise return an object of the input type with the trimmed sentences removed.

Value

a corpus or character vector equal in length to the input, or a tokenized set of sentences if return_tokens = TRUE. If the input was a corpus, then all docvars and metadata are preserved. For documents whose sentences have been removed entirely, a null string ("") will be returned.

Note

This function has been superseded by corpus_trim; use that function instead.

Examples

txt <- c("PAGE 1. This is a single sentence. Short sentence. Three word sentence.",
         "PAGE 2. Very short! Shorter.",
         "Very long sentence, with multiple parts, separated by commas. PAGE 3.")
mycorp <- corpus(txt, docvars = data.frame(serial = 1:3))
texts(mycorp)
#> text1
#> "PAGE 1. This is a single sentence. Short sentence. Three word sentence."
#> text2
#> "PAGE 2. Very short! Shorter."
#> text3
#> "Very long sentence, with multiple parts, separated by commas. PAGE 3."

# exclude sentences shorter than 3 tokens
texts(corpus_trimsentences(mycorp, min_length = 3))
#> text1
#> "This is a single sentence. Three word sentence."
#> text3
#> "Very long sentence, with multiple parts, separated by commas."

# exclude sentences that start with "PAGE <digit(s)>"
texts(corpus_trimsentences(mycorp, exclude_pattern = "^PAGE \\d+"))
#> text1
#> "This is a single sentence. Short sentence. Three word sentence."
#> text2
#> "Very short! Shorter."
#> text3
#> "Very long sentence, with multiple parts, separated by commas."

# on a character
char_trimsentences(txt, min_length = 3)
#> text1
#> "This is a single sentence. Three word sentence."
#> text3
#> "Very long sentence, with multiple parts, separated by commas."

char_trimsentences(txt, exclude_pattern = "sentence\\.")
#> text1
#> "PAGE 1."
#> text2
#> "PAGE 2. Very short! Shorter."
#> text3
#> "Very long sentence, with multiple parts, separated by commas. PAGE 3."
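
The effect of min_length, max_length and exclude_pattern described under Arguments can be approximated in base R. The sketch below is an illustration only and is not the package's implementation: trim_sentences_sketch is a hypothetical helper, the sentence splitter is a naive regular expression, word tokens are assumed to be runs of letters or digits, and the pattern is matched with base R's PCRE engine rather than stringi.

```r
# Hypothetical helper illustrating the trimming idea; NOT the quanteda code.
trim_sentences_sketch <- function(x, min_length = 1, max_length = 10000,
                                  exclude_pattern = NULL) {
  vapply(x, function(doc) {
    # Naive sentence split (assumption): break on whitespace that follows . ! or ?
    sents <- strsplit(doc, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
    # Count word tokens, ignoring punctuation: runs of letters or digits
    n_words <- lengths(regmatches(sents, gregexpr("[[:alnum:]]+", sents)))
    # Keep sentences whose word count lies within [min_length, max_length]
    keep <- n_words >= min_length & n_words <= max_length
    # Drop sentences matching the exclusion pattern, if one was supplied
    if (!is.null(exclude_pattern)) {
      keep <- keep & !grepl(exclude_pattern, sents, perl = TRUE)
    }
    # Documents with no surviving sentences collapse to ""
    paste(sents[keep], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

txt <- c("PAGE 1. This is a single sentence. Short sentence. Three word sentence.",
         "PAGE 2. Very short! Shorter.",
         "Very long sentence, with multiple parts, separated by commas. PAGE 3.")

short_removed <- trim_sentences_sketch(txt, min_length = 3)
pages_removed <- trim_sentences_sketch(txt, exclude_pattern = "^PAGE \\d+")
```

Unlike the printed output above, this sketch always returns one element per input document, using "" where every sentence was removed, as described under Value.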