Remove sentences based on their token lengths or a pattern match

Removes sentences from a corpus or a character vector that are shorter than a minimum length, longer than a maximum length, or that match an exclusion pattern.

corpus_trimsentences(
  x,
  min_length = 1,
  max_length = 10000,
  exclude_pattern = NULL,
  return_tokens = FALSE
)

char_trimsentences(
  x,
  min_length = 1,
  max_length = 10000,
  exclude_pattern = NULL
)

Arguments

x	corpus or character object whose sentences will be selected.
min_length, max_length	minimum and maximum lengths in word tokens (excluding punctuation)
exclude_pattern	a stringi regular expression whose match (at the sentence level) will be used to exclude sentences
return_tokens	if `TRUE`, return a tokens object of the sentences remaining after trimming; otherwise return an object of the same type as the input, with the trimmed sentences removed.
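
Because exclude_pattern is matched against each sentence separately, anchors such as ^ refer to the start of a sentence, not the start of a document. A minimal sketch of that behaviour, calling stringi directly on sentences that have been split by hand (the hand-written split and the variable names are assumptions for illustration, not quanteda's own segmentation):

```r
library(stringi)

# sentences of the first example document, split by hand for illustration
sentences <- c("PAGE 1.", "A single sentence.", "Short sentence.",
               "Three word sentence.")

# TRUE marks a sentence that the pattern would exclude
starts_with_page <- stri_detect_regex(sentences, "^PAGE \\d+")
starts_with_page
#> [1]  TRUE FALSE FALSE FALSE

ends_with_sentence <- stri_detect_regex(sentences, "sentence\\.")
ends_with_sentence
#> [1] FALSE  TRUE  TRUE  TRUE
```

Checking a pattern this way before passing it as exclude_pattern makes it easier to see which sentences it would remove.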

Value

a corpus or character vector equal in length to the input, or a tokenized set of sentences if `return_tokens = TRUE`. If the input was a corpus, then all docvars and metadata are preserved. For documents whose sentences have been removed entirely, a null string ("") will be returned.

Note

This function has been superseded by corpus_trim(); use that function instead.
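
As a rough guide to migrating, the calls below sketch how the same trimming might be written with corpus_trim() and its character counterpart char_trim(). This assumes a quanteda version that provides both functions and that their arguments are named what, min_ntoken and exclude_pattern; check the installed version's help pages before relying on it.

```r
library(quanteda)

txt <- c("PAGE 1. A single sentence.  Short sentence. Three word sentence.",
         "PAGE 2. Very short! Shorter.",
         "Very long sentence, with three parts, separated by commas.  PAGE 3.")
corp <- corpus(txt, docvars = data.frame(serial = 1:3))

# drop sentences shorter than 3 tokens
corp_min <- corpus_trim(corp, what = "sentences", min_ntoken = 3)

# drop sentences that start with "PAGE <digit(s)>"
corp_page <- corpus_trim(corp, what = "sentences",
                         exclude_pattern = "^PAGE \\d+")

# the same length filter applied to a plain character vector
chr_min <- char_trim(txt, what = "sentences", min_ntoken = 3)
```

The min_length argument of corpus_trimsentences() corresponds to min_ntoken here; return_tokens is not used in this sketch.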

Examples

txt <- c("PAGE 1. A single sentence.  Short sentence. Three word sentence.",
         "PAGE 2. Very short! Shorter.",
         "Very long sentence, with three parts, separated by commas.  PAGE 3.")
corp <- corpus(txt, docvars = data.frame(serial = 1:3))
texts(corp)
#>                                                                 text1 
#>    "PAGE 1. A single sentence.  Short sentence. Three word sentence." 
#>                                                                 text2 
#>                                        "PAGE 2. Very short! Shorter." 
#>                                                                 text3 
#> "Very long sentence, with three parts, separated by commas.  PAGE 3." 

# exclude sentences shorter than 3 tokens
texts(corpus_trimsentences(corp, min_length = 3))
#>                                                        text1 
#>                   "A single sentence.  Three word sentence." 
#>                                                        text3 
#> "Very long sentence, with three parts, separated by commas." 
# exclude sentences that start with "PAGE <digit(s)>"
texts(corpus_trimsentences(corp, exclude_pattern = "^PAGE \\d+"))
#>                                                        text1 
#>  "A single sentence.  Short sentence.  Three word sentence." 
#>                                                        text2 
#>                                      "Very short!  Shorter." 
#>                                                        text3 
#> "Very long sentence, with three parts, separated by commas." 

# on a character vector
char_trimsentences(txt, min_length = 3)
#>                                                        text1 
#>                   "A single sentence.  Three word sentence." 
#>                                                        text3 
#> "Very long sentence, with three parts, separated by commas." 
char_trimsentences(txt, exclude_pattern = "sentence\\.")
#>                                                                 text1 
#>                                                             "PAGE 1." 
#>                                                                 text2 
#>                                      "PAGE 2.  Very short!  Shorter." 
#>                                                                 text3 
#> "Very long sentence, with three parts, separated by commas.  PAGE 3."