R/corpus_trim.R
corpus_trimsentences.Rd
Removes sentences shorter than a specified length from a corpus or a character vector. Sentences can also be dropped for exceeding a maximum length or for matching a regular expression.
    corpus_trimsentences(
      x,
      min_length = 1,
      max_length = 10000,
      exclude_pattern = NULL,
      return_tokens = FALSE
    )

    char_trimsentences(x, min_length = 1, max_length = 10000, exclude_pattern = NULL)
Argument | Description |
---|---|
x | corpus or character object whose sentences will be selected |
min_length, max_length | minimum and maximum sentence lengths, in word tokens (excluding punctuation) |
exclude_pattern | a stringi regular expression; any sentence that it matches (matching is done at the sentence level) is excluded |
return_tokens | if TRUE, return a tokenized set of sentences rather than an object of the input type |
A corpus or character vector equal in length to the input, or a tokenized set of sentences if return_tokens = TRUE. If the input was a corpus, then all docvars and metadata are preserved. For documents whose sentences have been removed entirely, an empty string ("") will be returned.
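The filtering described above can be approximated in base R. This is only a rough sketch, not the package's implementation: the helper name trim_sentences_sketch is hypothetical, the sentence splitter is a naive regular expression rather than a real sentence tokenizer, and word tokens are counted as runs of alphanumeric characters. It covers character input only and shows how a document whose sentences are all removed comes back as an empty string.

```r
# Hypothetical helper: a base-R approximation of sentence trimming,
# NOT the implementation used by corpus_trimsentences().
trim_sentences_sketch <- function(x, min_length = 1, max_length = 10000,
                                  exclude_pattern = NULL) {
  vapply(x, function(doc) {
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sents <- strsplit(doc, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
    # Count word tokens per sentence, ignoring punctuation
    n_words <- lengths(regmatches(sents, gregexpr("[[:alnum:]]+", sents)))
    keep <- n_words >= min_length & n_words <= max_length
    if (!is.null(exclude_pattern)) {
      # Drop any sentence matched by the pattern
      keep <- keep & !grepl(exclude_pattern, sents, perl = TRUE)
    }
    # Reassemble the surviving sentences; yields "" if none survive
    paste(sents[keep], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

txt <- c("PAGE 1. A single sentence. Short sentence. Three word sentence.",
         "PAGE 2. Very short! Shorter.",
         "Very long sentence, with three parts, separated by commas. PAGE 3.")

# Keep only sentences with at least three word tokens
trimmed <- trim_sentences_sketch(txt, min_length = 3)
length(trimmed) == length(txt)  # TRUE: one element per input document
trimmed[2]                      # "": every sentence in document 2 was removed
```

Returning one element per input document, even when it is empty, keeps the result aligned with the input so that document-level variables can still be matched by position.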
This function has been superseded by corpus_trim(); use that function instead.
    txt <- c("PAGE 1. A single sentence. Short sentence. Three word sentence.",
             "PAGE 2. Very short! Shorter.",
             "Very long sentence, with three parts, separated by commas. PAGE 3.")
    corp <- corpus(txt, docvars = data.frame(serial = 1:3))
    texts(corp)
    #> text1
    #> "PAGE 1. A single sentence. Short sentence. Three word sentence."
    #> text2
    #> "PAGE 2. Very short! Shorter."
    #> text3
    #> "Very long sentence, with three parts, separated by commas. PAGE 3."

    # exclude sentences shorter than 3 word tokens
    texts(corpus_trimsentences(corp, min_length = 3))
    #> text1
    #> "A single sentence. Three word sentence."
    #> text3
    #> "Very long sentence, with three parts, separated by commas."

    # exclude sentences that start with "PAGE <digit(s)>"
    texts(corpus_trimsentences(corp, exclude_pattern = "^PAGE \\d+"))
    #> text1
    #> "A single sentence. Short sentence. Three word sentence."
    #> text2
    #> "Very short! Shorter."
    #> text3
    #> "Very long sentence, with three parts, separated by commas."

    # on a character
    char_trimsentences(txt, min_length = 3)
    #> text1
    #> "A single sentence. Three word sentence."
    #> text3
    #> "Very long sentence, with three parts, separated by commas."

    char_trimsentences(txt, exclude_pattern = "sentence\\.")
    #> text1
    #> "PAGE 1."
    #> text2
    #> "PAGE 2. Very short! Shorter."
    #> text3
    #> "Very long sentence, with three parts, separated by commas. PAGE 3."