textmodel_lsa.Rd
Fit the Latent Semantic Analysis scaling model to a dfm, which may be weighted (for instance using dfm_tfidf).
textmodel_lsa(x, nd = 10, margin = c("both", "documents", "features"))
Argument | Description |
---|---|
x | the dfm on which the model will be fit |
nd | the number of dimensions to be included in output |
margin | margin to be smoothed by the SVD |
The svds function in the RSpectra package is used to compute the SVD quickly. When all singular values are requested, svd() is used instead.
The number of dimensions nd to retain in LSA is an empirical question. Reducing the number of dimensions \(k\) (set here by nd) can remove much of the noise, but keeping too few dimensions or factors may lose important information.
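The trade-off can be illustrated with a minimal base R sketch. It does not call textmodel_lsa; it assumes a small made-up count matrix and uses svd() directly, and the helper low_rank() is hypothetical. For each number of retained dimensions, it reports how far the low-rank reconstruction is from the original matrix.

```r
# Toy document-feature counts (hypothetical values, not package data)
x <- matrix(c(2, 0, 1, 0,
              1, 1, 0, 0,
              0, 2, 0, 1,
              0, 0, 3, 1,
              1, 0, 0, 2),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5),
                            c("tax", "budget", "bank", "jobs")))

dec <- svd(x)  # full decomposition: x = U diag(d) V'

# Rebuild x from the first k singular triplets (hypothetical helper)
low_rank <- function(dec, k) {
  dec$u[, 1:k, drop = FALSE] %*%
    diag(dec$d[1:k], nrow = k) %*%
    t(dec$v[, 1:k, drop = FALSE])
}

# Frobenius-norm reconstruction error for k = 1, ..., length(dec$d)
err <- sapply(seq_along(dec$d),
              function(k) norm(x - low_rank(dec, k), type = "F"))
print(round(err, 3))
```

The error never increases as more dimensions are kept and reaches zero (up to rounding) when all are kept, so the error alone cannot pick a value. In practice nd is usually chosen by how well the reduced space serves the downstream task.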
Rosario, B. (2000). Latent Semantic Indexing: An Overview. Technical report INFOSYS 240 Spring Paper, University of California, Berkeley.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6): 391.
dfmat <- dfm(data_corpus_irishbudget2010)
# create an LSA space and return its truncated representation in the low-rank space
tmod <- textmodel_lsa(dfmat[1:10, ])
#> Warning: all singular values are requested, svd() is used instead
#>                            [,1]       [,2]       [,3]        [,4]        [,5]
#> Lenihan, Brian (FF)  -0.5132082  0.6611990  0.5010158  0.03718041 -0.18932417
#> Bruton, Richard (FG) -0.2774006 -0.3444475  0.1538104  0.84969109  0.13605925
#> Burton, Joan (LAB)   -0.3840362 -0.3455358 -0.1080534 -0.22254097 -0.62996056
#> Morgan, Arthur (SF)  -0.4381501 -0.2675310  0.1958565 -0.42928912  0.65830177
#> Cowen, Brian (FF)    -0.3932116  0.3587097 -0.7698150  0.14403049  0.19068539
#> Kenny, Enda (FG)     -0.2611641 -0.1547760 -0.1003581 -0.12282063  0.05878167
#>                              [,6]        [,7]        [,8]        [,9]
#> Lenihan, Brian (FF)   0.024642794 -0.04354314  0.03511621 -0.02558590
#> Bruton, Richard (FG) -0.009346201  0.11169768  0.12502463 -0.10974219
#> Burton, Joan (LAB)    0.022839615  0.51620557  0.04871506 -0.02433495
#> Morgan, Arthur (SF)  -0.206942503  0.15992742  0.10149400  0.01181985
#> Cowen, Brian (FF)    -0.097840896  0.08922500 -0.19256676  0.01576936
#> Kenny, Enda (FG)      0.813209501 -0.37318871  0.08277396 -0.23320209
#>                             [,10]
#> Lenihan, Brian (FF)   0.082457683
#> Bruton, Richard (FG)  0.004679789
#> Burton, Joan (LAB)   -0.071523773
#> Morgan, Arthur (SF)   0.039985771
#> Cowen, Brian (FF)    -0.110120661
#> Kenny, Enda (FG)     -0.131952742
# matrix in low_rank LSA space
tmod$matrix_low_rank[,1:5]
#>                        when  i     presented the supplementary
#> Lenihan, Brian (FF)       5 73  1.000000e+00 539  7.000000e+00
#> Bruton, Richard (FG)      2  6  1.725749e-14 305  1.214406e-13
#> Burton, Joan (LAB)       11 40  1.110657e-14 428  1.812092e-13
#> Morgan, Arthur (SF)      21 26 -2.171103e-12 501  1.000000e+00
#> Cowen, Brian (FF)         4 17 -1.101752e-12 394  7.704948e-14
#> Kenny, Enda (FG)         12 25  1.000000e+00 304  1.000000e+00
#> ODonnell, Kieran (FG)     5 11 -2.284291e-12 193  5.337258e-13
#> Gilmore, Eamon (LAB)      6 10  2.470182e-12 270 -8.729094e-13
#> Higgins, Michael (LAB)    3  7 -1.068381e-13  78 -4.569123e-13
#> Quinn, Ruairi (LAB)       5 19  1.203898e-13  80  3.574918e-14
# fold queries into the space generated by dfmat[1:10,]
# and return its truncated versions of its representation in the new low-rank space
pred <- predict(tmod, newdata = dfmat[11:14, ])
pred$docs_newspace
#> 4 x 10 Matrix of class "dgeMatrix"
#>                                  [,1]        [,2]        [,3]         [,4]
#> Gormley, John (Green)     -0.06232233  0.02556855  0.01586808  0.002090294
#> Ryan, Eamon (Green)       -0.09764584 -0.05532927 -0.03798847  0.290792321
#> Cuffe, Ciaran (Green)     -0.07289841 -0.01397222 -0.08691196  0.108245813
#> OCaolain, Caoimhghin (SF) -0.24271908 -0.05221856  0.14035456 -0.140740721
#>                                   [,5]         [,6]        [,7]        [,8]
#> Gormley, John (Green)      0.008423089 -0.062365633 -0.01828161 -0.06628157
#> Ryan, Eamon (Green)       -0.059380796 -0.222737473 -0.05317940 -0.01139819
#> Cuffe, Ciaran (Green)      0.031632546 -0.002166229 -0.01630824  0.04101057
#> OCaolain, Caoimhghin (SF)  0.095472404  0.004089615 -0.01793895  0.06060947
#>                                  [,9]       [,10]
#> Gormley, John (Green)      0.01334491 -0.04928801
#> Ryan, Eamon (Green)        0.28550581 -0.19176318
#> Cuffe, Ciaran (Green)      0.07250855 -0.18028126
#> OCaolain, Caoimhghin (SF) -0.07710551  0.23586845
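
The idea of folding new documents into an existing LSA space can be sketched in base R. This is the standard fold-in projection from the LSI literature, written under stated assumptions. Whether predict() performs exactly these operations internally is an assumption and is not confirmed here. The toy matrices and the helper fold_in() are hypothetical.

```r
# Hypothetical toy counts: rows are documents, columns are features
train <- matrix(c(2, 0, 1, 0,
                  1, 1, 0, 0,
                  0, 2, 0, 1,
                  0, 0, 3, 1),
                nrow = 4, byrow = TRUE)
new_docs <- matrix(c(1, 1, 0, 0,
                     0, 0, 2, 1),
                   nrow = 2, byrow = TRUE)

k <- 2                             # number of retained dimensions
dec <- svd(train, nu = k, nv = k)  # keep only the first k singular vectors
sk <- dec$d[1:k]                   # the k largest singular values

# Fold-in: project documents onto the retained right singular vectors,
# then rescale each dimension by the inverse of its singular value
fold_in <- function(newdata, v, sk) {
  newdata %*% v %*% diag(1 / sk, nrow = length(sk))
}

new_coords <- fold_in(new_docs, dec$v, sk)
print(new_coords)

# Sanity check: folding in the training documents recovers their own
# coordinates (the left singular vectors)
train_coords <- fold_in(train, dec$v, sk)
stopifnot(isTRUE(all.equal(train_coords, dec$u)))
```

New documents are placed in the space without refitting the decomposition, so features that appear only in the new documents do not contribute to their coordinates.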