tokens_xptr in quanteda 4
In quanteda version 4.0, we introduced a new class
of tokens object, tokens_xptr. This object enhances the
efficiency of modifying tokens objects built from large text corpora. A
sub-class of the regular tokens class, it is built on R’s
external-pointer object.
External pointers allow quanteda’s tokens operations to
pass references to the object to be modified by the highly optimized C++
routines that perform these operations, without the overhead of passing
the objects by value back and forth between C++ and R code.
Prior to version 4, quanteda’s only version of a
tokens object was an R list of integer vectors, with
each integer mapped to a unique type. With each call of a
tokens_*() function, this object is converted to a C++
object, modified in C++, passed by value back to R, and then
recreated as an R tokens object. Such conversion between
R and C++ consumes a large part of the execution time of each
tokens_*() function call, and this time consumption grows
as the size of the tokens object increases.
With the introduction of the tokens_xptr object, this
conversion is fundamentally changed: instead of the object itself, only
a reference (the external pointer) is passed, so that the tokens object
can be modified in place, eliminating the transaction costs of passing
the classic tokens object by value. The benefits of this are especially
visible when large tokens objects are processed, say those with more
than one million tokens, while still benefitting from the sequential
steps used in the typical quanteda tokens formatting
pipeline, such as creating the tokens, removing some based on a pattern,
lower-casing them, forming n-grams, etc. Formerly, each of these steps
passed the large object by value. With tokens_xptr objects,
successive token processing steps pass only the external pointer,
resulting in significant efficiency gains without changing the clear
syntax and workflow used in a typical quanteda tokens
processing pipeline.
These efficiency gains are only evident for large tokens objects, for
instance those with more than one million tokens. The default
tokens object format remains the non-external pointer type,
but below, we show how to create and work with the
tokens_xptr variant. All of the
functions work the same for both object classes, so that the user-facing
differences are almost unnoticeable between the two object types.
A detailed comparison of performance gains and execution times for
tokens_xptr objects is available
in a separate article.
There are two easy methods of creating a
tokens_xptr object. First, you can convert an existing
tokens object using the as.tokens_xptr() function:
toks <- tokens(data_corpus_inaugural)
print(toks, 0, 0)
## Tokens consisting of 59 documents and 4 docvars.
xtoks <- as.tokens_xptr(toks)
print(xtoks, 0, 0)
## Tokens consisting of 59 documents and 4 docvars (pointer to 0x12ad4e208).
Second, this can be done in a single step by setting the
xptr = TRUE argument in tokens():
xtoks2 <- tokens(data_corpus_inaugural, xptr = TRUE)
print(xtoks2, 0, 0)
## Tokens consisting of 59 documents and 4 docvars (pointer to 0x12b4d25c0).
When a tokens_xptr object such as xtoks is printed, the address of its
underlying data is shown as (pointer to xxxx). Because two
different objects were created above, the addresses displayed for
xtoks and xtoks2 are different. We expand on
this below, in the discussion of shallow versus deep copies, when
binding new object names to existing tokens_xptr objects.
When we create a new object by assigning an existing
tokens_xptr object to it, this creates what is known as a “shallow
copy”. A shallow copy is just a copy of the pointer to the
memory address where the data of the tokens_xptr
object exists. This contrasts with a “deep copy”, in which the new
object takes its data through a physical copy of the data from the old
memory location to a new one. A deep copy is how assignment of a regular
tokens object (and indeed most R objects) works.
We can see this in operation here:
toks <- data_corpus_inaugural |>
  corpus_reshape(to = "sentences") |>
  tokens()
xtoks <- as.tokens_xptr(toks)
print(xtoks, 0, 0)
## Tokens consisting of 5,234 documents and 4 docvars (pointer to 0x109733170).

# shallow copy
xtoks_copy <- xtoks
print(xtoks_copy, 0, 0)
## Tokens consisting of 5,234 documents and 4 docvars (pointer to 0x109733170).
Because xtoks_copy is an external pointer object, its underlying data
is the same as that of xtoks. This means that operations
on the copy using tokens_*() functions also affect the
original xtoks (and vice versa), even after the (shallow) copy.
print(xtoks, max_ndoc = 1)
## Tokens consisting of 5,234 documents and 4 docvars (pointer to 0x109733170).
## 1789-Washington.1 :
##  "Fellow-Citizens" "of"              "the" "Senate"
##  "and"             "of"              "the" "House"
##  "of"              "Representatives" ":"   "Among"
## [ ... and 37 more ]
##
## [ reached max_ndoc ... 5,233 more documents ]

xtoks_copy <- xtoks_copy |>
  tokens_remove(stopwords("en")) |>
  tokens_tolower()

print(xtoks, max_ndoc = 1)
## Tokens consisting of 5,234 documents and 4 docvars (pointer to 0x109733170).
## 1789-Washington.1 :
##  "fellow-citizens" "senate"       "house"    "representatives"
##  ":"               "among"        "vicissitudes" "incident"
##  "life"            "event"        "filled"   "greater"
## [ ... and 11 more ]
##
## [ reached max_ndoc ... 5,233 more documents ]
This is not the normal R behaviour for modifying objects, but it does
exist in other packages that use external pointers, for instance
data.table, which has an entire vignette on reference semantics.
If you want to copy the underlying data of a tokens_xptr
object (a deep copy), this can be done explicitly using
copy() or as.tokens_xptr():
# deep copies
print(copy(xtoks), 0, 0)
print(as.tokens_xptr(xtoks), 0, 0)
Each of these has a different memory address from that of the
xtoks object above.
The functions that modify
tokens_xptr objects work in
exactly the same way as they do for
tokens objects, with
the exception that they modify those objects by reference rather than
passing copies by value back and forth from R to the underlying C++ code
that performs the work.
Not all of the tokens_*() functions are implemented this
way, because the efficiency gains for some would be negligible.
tokens_replace(), for example, returns a shallow copy of the input
object for efficiency. However, functions such as
tokens_subset() and tokens_chunk() return a
deep copy because they can change the number or the order of the
documents. If in any doubt, pay attention to the address shown as
(pointer to xxxx) when printing the object:
print(xtoks, 0, 0)
## Tokens consisting of 5,234 documents and 4 docvars (pointer to 0x109733170).

# shallow copy
xtoks2 <- tokens_remove(xtoks, stopwords("en"))
print(xtoks2, 0, 0)
## Tokens consisting of 5,234 documents and 4 docvars (pointer to 0x109733170).

# deep copy
xtoks3 <- tokens_subset(xtoks, lengths(toks) > 150)
print(xtoks3, 0, 0)
## Tokens consisting of 11 documents and 4 docvars (pointer to 0x12a11b540).
Because it points to a memory location rather than containing the
data itself, a tokens_xptr object first needs to be
converted to a tokens object before its data can be saved
to disk. This is simple, and performed by applying the
as.tokens() method to the tokens_xptr object.
After conversion, it can be saved normally, for instance using
saveRDS().
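A minimal sketch of this round trip (the file name toks.rds is purely
illustrative):

```r
# convert the external-pointer object to a regular tokens object
toks_regular <- as.tokens(xtoks)

# save and reload it like any ordinary R object
saveRDS(toks_regular, "toks.rds")
toks_reloaded <- readRDS("toks.rds")

# convert back to a tokens_xptr object for further fast processing
xtoks_reloaded <- as.tokens_xptr(toks_reloaded)
```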
Passing a tokens_xptr object directly to
dfm() to compile a document-feature matrix avoids a deep
copy of that object, and for larger objects can significantly reduce
execution times for dfm creation.
We demonstrate this below, comparing a classic tokens
object to a tokens_xptr object passed through the same mock
pipeline to dfm(). The execution time
for the latter is significantly shorter than for the former, thanks to
the cost saved by passing a reference rather than performing the
expensive conversion.
library("ggplot2")
library("microbenchmark")

data_corpus_guardian <- quanteda.corpora::download("data_corpus_guardian")
toks <- data_corpus_guardian |>
  corpus_reshape(to = "sentences") |>
  tokens()

microbenchmark::microbenchmark(
  tokens = toks |>
    tokens_remove(stopwords("en"), padding = TRUE) |>
    tokens_remove(c("http:*", "https:*", "*.com", "*.org", "*.net"), padding = TRUE) |>
    tokens_compound(newsmap::data_dictionary_newsmap_en) |>
    dfm(remove_padding = TRUE),
  tokens_xptr = as.tokens_xptr(toks) |>
    tokens_remove(stopwords("en"), padding = TRUE) |>
    tokens_remove(c("http:*", "https:*", "*.com", "*.org", "*.net"), padding = TRUE) |>
    tokens_compound(newsmap::data_dictionary_newsmap_en) |>
    dfm(remove_padding = TRUE),
  times = 10
) |>
  autoplot(log = FALSE)
## Warning in microbenchmark::microbenchmark(tokens =
## dfm(tokens_compound(tokens_remove(tokens_remove(toks, : less accurate
## nanosecond times to avoid potential integer overflows
We explore further benchmarks and comparisons for
tokens_xptr usage in a separate article.