Match the feature set of a dfm to given feature names — dfm_match • quanteda

Match the feature set of a dfm to a specified vector of feature names. Features in x that exactly match an element of features are kept. Any features in x that are not in features are discarded, and any feature names specified in features but not found in x are added with all-zero counts.

dfm_match(x, features, verbose = quanteda_options("verbose"))

Arguments

x: a dfm
features: character; the feature names to be matched in the output dfm
verbose: if TRUE, print the number of tokens and documents before and after the function is applied. The number of tokens does not include paddings.

Value

A dfm whose features are identical to those specified in features.

Details

Selecting on another dfm's featnames() is useful when you have trained a model on one dfm and need to apply that model to a test set, whose features must be identical to those of the training dfm. It is also used in bootstrap_dfm().
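The train/test use case described here might look roughly like the sketch below. The texts and the object names (txt_train, txt_test, dfmat_train, dfmat_test, dfmat_test_matched) are invented for illustration, and the model-fitting step is left out; only the feature alignment is shown.

```r
library(quanteda)

# Hypothetical training and test texts (illustrative only)
txt_train <- c(d1 = "good great film", d2 = "bad awful film")
txt_test  <- c(t1 = "great acting but awful plot")

dfmat_train <- dfm(tokens(txt_train))
dfmat_test  <- dfm(tokens(txt_test))

# Conform the test dfm to the training feature set:
# features unseen in training are dropped, and training features
# missing from the test set are added as all-zero columns
dfmat_test_matched <- dfm_match(dfmat_test, features = featnames(dfmat_train))

# The two dfms now share the same features in the same order
identical(featnames(dfmat_test_matched), featnames(dfmat_train))
```

A model trained on dfmat_train could then be applied to dfmat_test_matched, since the columns line up one to one.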

Note

Unlike dfm_select(), this function will add feature names not already present in x. It also provides only fixed, case-sensitive matches. For more flexible feature selection, see dfm_select().
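The contrast with dfm_select() can be illustrated with a small sketch. The input text and the feature name "absent" are made up for the example; dfm_select() is called with valuetype = "fixed" so that both functions use literal matching.

```r
library(quanteda)

# dfm() lowercases features by default, so the features are this, is, text, one
dfmat <- dfm(tokens("This is text one"))

# dfm_select() keeps only features already present in the dfm;
# the unmatched pattern "absent" is not added
dfmat_sel <- dfm_select(dfmat, pattern = c("text", "absent"), valuetype = "fixed")
featnames(dfmat_sel)

# dfm_match() returns exactly the requested features,
# adding "absent" as an all-zero column
dfmat_mat <- dfm_match(dfmat, features = c("text", "absent"))
featnames(dfmat_mat)

# Matching is case-sensitive: "Text" does not match "text",
# so the resulting column contains only zeros
dfmat_case <- dfm_match(dfmat, features = "Text")
as.matrix(dfmat_case)
```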

See also

dfm_select()

Examples

# matching a dfm to a feature vector
dfm_match(dfm(tokens("")), letters[1:5])
#> Document-feature matrix of: 1 document, 5 features (100.00% sparse) and 0 docvars.
#>        features
#> docs    a b c d e
#>   text1 0 0 0 0 0
dfm_match(data_dfm_lbgexample, c("A", "B", "Z"))
#> Document-feature matrix of: 6 documents, 3 features (72.22% sparse) and 0 docvars.
#>     features
#> docs A B   Z
#>   R1 2 3   0
#>   R2 0 0   0
#>   R3 0 0   3
#>   R4 0 0 115
#>   R5 0 0  78
#>   V1 0 0   0
dfm_match(data_dfm_lbgexample, c("B", "newfeat1", "A", "newfeat2"))
#> Document-feature matrix of: 6 documents, 4 features (91.67% sparse) and 0 docvars.
#>     features
#> docs B newfeat1 A newfeat2
#>   R1 3        0 2        0
#>   R2 0        0 0        0
#>   R3 0        0 0        0
#>   R4 0        0 0        0
#>   R5 0        0 0        0
#>   V1 0        0 0        0

# matching one dfm to another
txt <- c("This is text one", "The second text", "This is text three")
(dfmat1 <- dfm(tokens(txt[1:2])))
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#>        features
#> docs    this is text one the second
#>   text1    1  1    1   1   0      0
#>   text2    0  0    1   0   1      1
(dfmat2 <- dfm(tokens(txt[2:3])))
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#>        features
#> docs    the second text this is three
#>   text1   1      1    1    0  0     0
#>   text2   0      0    1    1  1     1
(dfmat3 <- dfm_match(dfmat1, featnames(dfmat2)))
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    the second text this is three
#>   text1   0      0    1    1  1     0
#>   text2   1      1    1    0  0     0
setequal(featnames(dfmat2), featnames(dfmat3))
#> [1] TRUE