In this vignette, we show how to perform Latent Semantic Analysis using the quanteda package based on Grossman and Frieder’s Information Retrieval, Algorithms and Heuristics.

LSA decomposes document-feature matrix into a reduced vector space that is assumed to reflect semantic structure.

New documents or queries can be ‘folded-in’ to this constructed latent semantic space for downstream tasks.

Create a document-feature matrix

txt <- c(d1 = "Shipment of gold damaged in a fire",
         d2 = "Delivery of silver arrived in a silver truck",
         d3 = "Shipment of gold arrived in a truck" )

dfmat <- txt |> 
    tokens() |> 
    dfm()
dfmat
## Document-feature matrix of: 3 documents, 11 features (36.36% sparse) and 0 docvars.
##     features
## docs shipment of gold damaged in a fire delivery silver arrived
##   d1        1  1    1       1  1 1    1        0      0       0
##   d2        0  1    0       0  1 1    0        1      2       1
##   d3        1  1    1       0  1 1    0        0      0       1
## [ reached max_nfeat ... 1 more feature ]

Construct the LSA model

## Warning: package 'quanteda.textmodels' was built under R version 4.4.1
tmod_lsa <- textmodel_lsa(dfmat)
## Warning in fun(A, k, nu, nv, opts, mattype = "dgCMatrix"): all singular values
## are requested, svd() is used instead

The new document vector coordinates in the reduced 2-dimensional space is:

tmod_lsa$docs[, 1:2]
##          [,1]       [,2]
## d1 -0.4944666  0.6491758
## d2 -0.6458224 -0.7194469
## d3 -0.5817355  0.2469149

Apply the constructed LSA model to new data

Now the new unseen document can be represented in the reduced 2-dimensional space. The unseen query document:

dfmat_test <- tokens("gold silver truck") |> 
    dfm() |> 
    dfm_match(features = featnames(dfmat))
dfmat_test
## Document-feature matrix of: 1 document, 11 features (72.73% sparse) and 0 docvars.
##        features
## docs    shipment of gold damaged in a fire delivery silver arrived
##   text1        0  0    1       0  0 0    0        0      1       0
## [ reached max_nfeat ... 1 more feature ]
pred_lsa <- predict(tmod_lsa, newdata = dfmat_test)
pred_lsa$docs_newspace[, 1:2]
## [1] -0.2140026 -0.1820571