Match the feature set of a dfm to a specified vector of feature names. Features in x that exactly match an element of features are kept. Any features in x that are not in features are discarded, and any feature names specified in features but not found in x are added with all zero counts.
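The matching rule can be illustrated without quanteda at all. The following is a minimal base-R sketch on a plain numeric matrix whose column names stand in for feature names; match_columns() is a hypothetical helper written for this illustration, not the package's implementation.

```r
# Hypothetical helper: mimic the matching rule on a plain numeric matrix
# whose column names play the role of feature names.
match_columns <- function(m, features) {
  # Start from an all-zero matrix with the requested columns, in order
  out <- matrix(0, nrow = nrow(m), ncol = length(features),
                dimnames = list(rownames(m), features))
  # Exact, case-sensitive lookup of each requested name in the input
  idx <- match(features, colnames(m))
  found <- !is.na(idx)
  # Copy counts for names that exist; the remaining columns stay zero
  out[, found] <- m[, idx[found], drop = FALSE]
  out
}

# Two "documents" and three "features": a, b, c
m <- matrix(c(2, 0, 3, 1, 5, 4), nrow = 2,
            dimnames = list(c("d1", "d2"), c("a", "b", "c")))

# Keep c and a, drop b, and add the unseen name x as a zero column
res <- match_columns(m, c("c", "x", "a"))
print(res)
```

The result has exactly the requested columns in the requested order, which is the property the description above states for the output dfm.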
dfm_match(x, features, verbose = quanteda_options("verbose"))
x: a dfm
features: character; the feature names to be matched in the output dfm
verbose: if TRUE, print the number of tokens and documents before and after the function is applied. The number of tokens does not include paddings.
A dfm whose features are identical to those specified in features.
Selecting on another dfm's featnames() is useful when you have trained a model on one dfm and need to project this onto a test set whose features must be identical. It is also used in bootstrap_dfm().
Unlike dfm_select(), this function will add feature names not already present in x. It also provides only fixed, case-sensitive matches. For more flexible feature selection, see dfm_select().
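The contrast with dfm_select() can be seen side by side. This is a small sketch assuming quanteda is installed; the objects dfmat, feats, matched and selected are illustrative names, and the dfm_select() call is restricted to fixed, case-sensitive matching so that the two functions are compared on equal terms.

```r
library(quanteda)

# dfm() lowercases tokens by default, so the features are a, b, c, d
dfmat <- dfm(tokens(c(d1 = "a b b c", d2 = "b c c d")))
feats <- c("b", "B", "z")

# dfm_match(): the output has exactly these features, in this order.
# "B" (wrong case) and "z" are not in dfmat, so they become zero columns.
matched <- dfm_match(dfmat, feats)
featnames(matched)

# dfm_select(): keeps only features already present and never adds columns
selected <- dfm_select(dfmat, pattern = feats, valuetype = "fixed",
                       case_insensitive = FALSE)
featnames(selected)
```

Here matched should carry all three requested names, while selected should keep only the feature that was already present in dfmat.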
# matching a dfm to a feature vector
dfm_match(dfm(tokens("")), letters[1:5])
#> Document-feature matrix of: 1 document, 5 features (100.00% sparse) and 0 docvars.
#> features
#> docs a b c d e
#> text1 0 0 0 0 0
dfm_match(data_dfm_lbgexample, c("A", "B", "Z"))
#> Document-feature matrix of: 6 documents, 3 features (72.22% sparse) and 0 docvars.
#> features
#> docs A B Z
#> R1 2 3 0
#> R2 0 0 0
#> R3 0 0 3
#> R4 0 0 115
#> R5 0 0 78
#> V1 0 0 0
dfm_match(data_dfm_lbgexample, c("B", "newfeat1", "A", "newfeat2"))
#> Document-feature matrix of: 6 documents, 4 features (91.67% sparse) and 0 docvars.
#> features
#> docs B newfeat1 A newfeat2
#> R1 3 0 2 0
#> R2 0 0 0 0
#> R3 0 0 0 0
#> R4 0 0 0 0
#> R5 0 0 0 0
#> V1 0 0 0 0
# matching one dfm to another
txt <- c("This is text one", "The second text", "This is text three")
(dfmat1 <- dfm(tokens(txt[1:2])))
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#> features
#> docs this is text one the second
#> text1 1 1 1 1 0 0
#> text2 0 0 1 0 1 1
(dfmat2 <- dfm(tokens(txt[2:3])))
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#> features
#> docs the second text this is three
#> text1 1 1 1 0 0 0
#> text2 0 0 1 1 1 1
(dfmat3 <- dfm_match(dfmat1, featnames(dfmat2)))
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#> features
#> docs the second text this is three
#> text1 0 0 1 1 1 0
#> text2 1 1 1 0 0 0
setequal(featnames(dfmat2), featnames(dfmat3))
#> [1] TRUE