Convert regex and glob patterns to type IDs or fixed patterns

regex2fixed converts regex and glob patterns to fixed patterns. regex2id converts regex or glob to type IDs to allow C++ function to perform fast searches in tokens object. C++ functions use a list of type IDs to construct a hash table, against which sub-vectors of tokens object are matched. This function constructs index of glob patterns for faster matching.

This is an internal function for regex2id() that select types using an index of types by regular expressions.

An internal function for regex2id that constructs an index of regex patterns (e.g. ^xxxx, xxxx$ and ^xxxx$) to avoid expensive sequential search by stri_detect_regex.

Internal function for select_types() to search the index using fastmatch.

Simpler and faster version of expand.grid() in base package

Internal function for select_types() to check if a glob pattern is indexed by index_types().

regex2fixed(pattern, types = NULL, valuetype = NULL,
  case_insensitive = NULL, index = NULL)

regex2id(pattern, types = NULL, valuetype = NULL, case_insensitive = NULL,
  index = NULL)

search_glob(patterns, types_search, index)

search_regex(patterns, types_search, case_insensitive)

search_fixed(patterns, types_search, index)

index_types(types, valuetype, case_insensitive, max_len = NULL)

search_index(pattern, index)

expand(elem)

is_indexed(pattern)

Arguments

pattern	a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details.
types	unique types of tokens obtained by `quanteda:::types()`
valuetype	the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching. See valuetype for details.
case_insensitive	ignore case when matching, if `TRUE`
index	If TRUE, index is constructed automatically. It also accept index constructed by index_types().
patterns	a list of regular expressions
types_search	lowercased types when `case_insensitive=TRUE`, but not used in glob and fixed matching as types are in the index.
max_len	maximum length of types to be indexed
elem	list of elements to be combined
case_insensitive	ignore case when matching, if `TRUE`, but not used in glob and fixed matching as types are lowercased in the index.
index	index object created by `index_types()`
types	types of tokens to index
case_insensitive	ignore case when matching, if `TRUE`
regex	a glob expression to search
index	an index object created by `index_types()`
x	a glob pattern to be tested

Value

regex2fixed returns a list of character vectors containing types regex2id returns a list of integer vectors containing type IDs

a list of integer vectors containing type IDs with index keys as an attribute

Examples

pattern <- list(c('^a$', '^b'), c('c'), c('d'))
types <- c('A', 'AA', 'B', 'BB', 'BBB', 'C', 'CC')
quanteda:::regex2fixed(pattern, types, 'regex', case_insensitive = TRUE)
#> [[1]]
#> [1] "A" "B"
#> 
#> [[2]]
#> [1] "A"  "BB"
#> 
#> [[3]]
#> [1] "A"   "BBB"
#> 
#> [[4]]
#> [1] "C"
#> 
#> [[5]]
#> [1] "CC"
#> 
index <- quanteda:::index_types(types, 'regex', case_insensitive = TRUE)
quanteda:::regex2fixed(pattern, index = index)
#> [[1]]
#> [1] "A" "B"
#> 
#> [[2]]
#> [1] "A"  "BB"
#> 
#> [[3]]
#> [1] "A"   "BBB"
#> 
#> [[4]]
#> [1] "C"
#> 
#> [[5]]
#> [1] "CC"
#> 
types <- c('A', 'AA', 'B', 'BB', 'BBB', 'C', 'CC')

pats_regex <- list(c('^a$', '^b'), c('c'), c('d'))
quanteda:::regex2id(pats_regex, types, 'regex', case_insensitive = TRUE)
#> [[1]]
#> [1] 1 3
#> 
#> [[2]]
#> [1] 1 4
#> 
#> [[3]]
#> [1] 1 5
#> 
#> [[4]]
#> [1] 6
#> 
#> [[5]]
#> [1] 7
#> 

pats_glob <- list(c('a*', 'b*'), c('c'), c('d'))
quanteda:::regex2id(pats_glob, types, 'glob', case_insensitive = TRUE)
#> [[1]]
#> [1] 1 3
#> 
#> [[2]]
#> [1] 2 3
#> 
#> [[3]]
#> [1] 1 4
#> 
#> [[4]]
#> [1] 2 4
#> 
#> [[5]]
#> [1] 1 5
#> 
#> [[6]]
#> [1] 2 5
#> 
#> [[7]]
#> [1] 6
#> 

index <- quanteda:::index_types(c('xxx', 'yyyy', 'ZZZ'), 'glob', FALSE, 3)
quanteda:::search_glob('yy*', attr(index, 'type_search'), index)
#> [[1]]
#> [1] 2
#> 
quanteda:::expand(list(c('a', 'b', 'c'), c('x', 'y')))
#> [[1]]
#> [1] "a" "x"
#> 
#> [[2]]
#> [1] "b" "x"
#> 
#> [[3]]
#> [1] "c" "x"
#> 
#> [[4]]
#> [1] "a" "y"
#> 
#> [[5]]
#> [1] "b" "y"
#> 
#> [[6]]
#> [1] "c" "y"
#>

Convert regex and glob patterns to type IDs or fixed patterns

Arguments

Value

See also

Examples

Contents