Match the feature set of a dfm to a specified vector of feature names.
Features in x that exactly match an element of features are kept. Features in x that are not in features are discarded, and any feature names specified in features but not found in x are added with all-zero counts.
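The keep, discard, and zero-fill rules above can be sketched on an ordinary dense matrix in base R. This is only an illustration of the semantics, not quanteda's implementation; match_features() and the matrix m are hypothetical names introduced here.

```r
# Hypothetical helper illustrating the matching rules on a dense matrix.
# It is NOT part of quanteda and does not handle sparse dfm objects.
match_features <- function(m, features) {
  # Start from an all-zero matrix with the requested columns, in order
  out <- matrix(0, nrow = nrow(m), ncol = length(features),
                dimnames = list(rownames(m), features))
  # Exact, case-sensitive matches between requested and existing names
  shared <- intersect(features, colnames(m))
  # Copy counts for matched features; unmatched columns stay zero
  out[, shared] <- m[, shared]
  out
}

# Toy counts: two documents, features A, B, and C
m <- matrix(c(2, 0, 3, 1, 5, 4), nrow = 2,
            dimnames = list(c("d1", "d2"), c("A", "B", "C")))

# "C" is dropped, "Z" is added as zeros, and column order follows `features`
res <- match_features(m, c("B", "Z", "A"))
print(res)
```

The result has exactly the requested columns in the requested order, which mirrors what the examples further down show for dfm_match().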
dfm_match(x, features)
x: a dfm
features: character; the feature names to be matched in the output dfm
A dfm whose features are identical to those specified in features.
Selecting on another dfm's featnames() is useful when you have trained a model on one dfm and need to apply it to a test set whose features must be identical to those of the training dfm. It is also used in bootstrap_dfm().
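A minimal sketch of the train/test case described above, assuming quanteda is installed. The texts and the object names (txt_train, txt_test, dfmat_train, dfmat_test, dfmat_test_matched) are made up for illustration, and no model is actually fitted here.

```r
library(quanteda)

# Hypothetical training and test texts, for illustration only
txt_train <- c(d1 = "good great film", d2 = "bad awful film")
txt_test  <- c(d3 = "great acting but awful plot")

dfmat_train <- dfm(tokens(txt_train))
dfmat_test  <- dfm(tokens(txt_test))

# Conform the test dfm to the training feature set: test-only features
# are dropped, and training features absent from the test text are
# added as all-zero columns
dfmat_test_matched <- dfm_match(dfmat_test, featnames(dfmat_train))

# Check that feature names and their order now agree with the training dfm
identical(featnames(dfmat_test_matched), featnames(dfmat_train))
```

A model fitted on dfmat_train could then be applied to dfmat_test_matched, since both share the same columns in the same order.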
Unlike dfm_select(), this function will add feature names not already present in x. It also performs only fixed, case-sensitive matching. For more flexible feature selection, see dfm_select().
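The contrast with dfm_select() can be seen on a small dfm built with tolower = FALSE. This is a sketch assuming quanteda is installed and dfm_select() is called with its default pattern matching (glob, case-insensitive); the object names are illustrative.

```r
library(quanteda)

# Keep the original casing so that case sensitivity is visible
dfmat <- dfm(tokens("This is Text"), tolower = FALSE)

# dfm_match(): fixed, case-sensitive matching. "text" does not match
# "Text", so both requested names come back as all-zero columns.
dfmat_matched <- dfm_match(dfmat, c("text", "zzz"))
featnames(dfmat_matched)

# dfm_select(): with default settings the pattern "text" matches "Text"
# regardless of case, and the absent "zzz" is not added.
dfmat_selected <- dfm_select(dfmat, pattern = c("text", "zzz"))
featnames(dfmat_selected)
```

Use dfm_match() when the output columns must equal a given vector exactly, and dfm_select() when you want pattern-based filtering of the features that already exist.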
# matching a dfm to a feature vector
dfm_match(dfm(tokens("")), letters[1:5])
#> Document-feature matrix of: 1 document, 5 features (100.00% sparse) and 0 docvars.
#>        features
#> docs    a b c d e
#>   text1 0 0 0 0 0
dfm_match(data_dfm_lbgexample, c("A", "B", "Z"))
#> Document-feature matrix of: 6 documents, 3 features (72.22% sparse) and 0 docvars.
#>     features
#> docs A B   Z
#>   R1 2 3   0
#>   R2 0 0   0
#>   R3 0 0   3
#>   R4 0 0 115
#>   R5 0 0  78
#>   V1 0 0   0
dfm_match(data_dfm_lbgexample, c("B", "newfeat1", "A", "newfeat2"))
#> Document-feature matrix of: 6 documents, 4 features (91.67% sparse) and 0 docvars.
#>     features
#> docs B newfeat1 A newfeat2
#>   R1 3        0 2        0
#>   R2 0        0 0        0
#>   R3 0        0 0        0
#>   R4 0        0 0        0
#>   R5 0        0 0        0
#>   V1 0        0 0        0
# matching one dfm to another
txt <- c("This is text one", "The second text", "This is text three")
(dfmat1 <- dfm(tokens(txt[1:2])))
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#>        features
#> docs    this is text one the second
#>   text1    1  1    1   1   0      0
#>   text2    0  0    1   0   1      1
(dfmat2 <- dfm(tokens(txt[2:3])))
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#>        features
#> docs    the second text this is three
#>   text1   1      1    1    0  0     0
#>   text2   0      0    1    1  1     1
(dfmat3 <- dfm_match(dfmat1, featnames(dfmat2)))
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    the second text this is three
#>   text1   0      0    1    1  1     0
#>   text2   1      1    1    0  0     0
setequal(featnames(dfmat2), featnames(dfmat3))
#> [1] TRUE