Tokenize the texts from a character vector or from a corpus.

tokens(x, what = c("word", "sentence", "character", "fastestword",
  "fasterword"), remove_numbers = FALSE, remove_punct = FALSE,
  remove_symbols = FALSE, remove_separators = TRUE,
  remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
  ngrams = 1L, skip = 0L, concatenator = "_", hash = TRUE,
  verbose = quanteda_options("verbose"), include_docvars = TRUE, ...)

Arguments

x

a character, corpus, or tokens object to be tokenized

what

the unit for splitting the text, available alternatives are:

"word"

(recommended default) smartest, but slowest, word tokenization method; see stringi-search-boundaries for details.

"fasterword"

dumber, but faster, word tokenization method, uses stri_split_charclass(x, "\\p{WHITE_SPACE}")

"fastestword"

dumbest, but fastest, word tokenization method, calls stri_split_fixed(x, " ")

"character"

tokenization into individual characters

"sentence"

sentence segmenter, smart enough to handle some exceptions in English such as "Prof. Plum killed Mrs. Peacock." (but far from perfect).

remove_numbers

remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day

remove_punct

if TRUE, remove all characters in the Unicode "Punctuation" [P] class

remove_symbols

if TRUE, remove all characters in the Unicode "Symbol" [S] class

remove_separators

remove separators and separator characters (spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode "Separator" [Z] category) when remove_punct = FALSE. Only applicable for what = "character" (when you probably want it to be FALSE) and for what = "word" (when you probably want it to be TRUE). Note that if what = "word" and remove_punct = TRUE, then remove_separators has no effect. Use carefully.

remove_twitter

remove Twitter characters @ and #; set to TRUE if you wish to eliminate these. Note that this will always be set to FALSE if remove_punct = FALSE.

remove_hyphens

if TRUE, split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. "self-storage" becomes c("self", "storage"). Default is FALSE to preserve such words as is, with the hyphens. Only applies if what = "word".

remove_url

if TRUE, find and eliminate URLs beginning with http(s) -- see section "Dealing with URLs".

ngrams

integer vector of the n for n-grams, defaulting to 1 (unigrams). For bigrams, for instance, use 2; for bigrams and unigrams, use 1:2. You can even include irregular sequences such as 2:3 for bigrams and trigrams only. See tokens_ngrams, and the short sketch following this argument list.

skip

integer vector specifying the skips for skip-grams, default is 0 for only immediately neighbouring words. Only applies if ngrams is different from the default of 1. See tokens_skipgrams.

concatenator

character to use in concatenating n-grams, default is "_", which is recommended since this is included in the regular expression and Unicode definitions of "word" characters

hash

if TRUE (default), return a hashed tokens object; otherwise, return a classic tokenizedTexts object. (The tokenizedTexts format will be phased out in coming versions.)

verbose

if TRUE, print timing messages to the console; off by default

include_docvars

if TRUE, pass docvars and metadoc fields through to the tokens object. Only applies when tokenizing corpus objects.

...

additional arguments not used
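
As a quick illustration of how the ngrams, skip, and concatenator arguments combine, here is a minimal sketch (made-up sentence; output omitted):

# bigrams and trigrams, at skip distances of 0 and 1, joined with "_"
tokens("insurgents killed in ongoing fighting",
       ngrams = 2:3, skip = 0:1, concatenator = "_")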

Value

A quanteda tokens class object; by default, a hashed list of integer vectors that index into a vector of types.
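
For a concrete sense of the hashed form, the sketch below assumes the types() accessor exported by quanteda and the list-of-integer-vectors storage described above:

toks <- tokens("one two one")
types(toks)         # the type vector: "one" "two"
unclass(toks)[[1]]  # integer codes indexing into the types, i.e. 1 2 1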

Details

The tokenizer is designed to be fast and flexible as well as to handle Unicode correctly. Most of the time, users will construct dfm objects from texts or a corpus without calling tokens() as an intermediate step. Since tokens() is most likely to be used by more technical users, its options default to minimal intervention. This means that punctuation is tokenized as well, and that nothing is removed by default from the text being tokenized except inter-word spacing and equivalent characters. Note that the tokens constructor also works on tokens objects, which allows setting additional options that will modify the original object. It is not possible, however, to change a setting to "un-remove" something that was removed from the input tokens object. For instance, tokens(tokens("Ha!", remove_punct = TRUE), remove_punct = FALSE) will not restore the "!" token. No warning is currently issued about this, so use tokens.tokens() with caution.
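
To make the caveat concrete (output omitted):

toks <- tokens("Ha!", remove_punct = TRUE)  # "!" is removed here
tokens(toks, remove_punct = FALSE)          # the "!" token is NOT restored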

Dealing with URLs

URLs are tricky to tokenize, because they contain a number of symbols and punctuation characters. If you wish to remove these, as most people do, and your text contains URLs, then you should set what = "fasterword" and remove_url = TRUE. If you wish to keep the URLs, but do not want them mangled, then your options are more limited, since removing punctuation and symbols will also remove them from URLs. We are working on improving this behaviour. See the examples below.
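
For instance, a minimal sketch of the recommended combination (the URL here is made up; output omitted):

tokens("See http://example.com/page for details.",
       what = "fasterword", remove_url = TRUE)
# the URL token is dropped; the remaining words are split on whitespace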

See also

tokens_ngrams, tokens_skipgrams, as.list.tokens

Examples

txt <- c(doc1 = "This is a sample: of tokens.",
         doc2 = "Another sentence, to demonstrate how tokens works.")
tokens(txt)
#> tokens from 2 documents.
#> doc1 :
#> [1] "This"   "is"     "a"      "sample" ":"      "of"     "tokens" "."
#> 
#> doc2 :
#> [1] "Another"     "sentence"    ","           "to"          "demonstrate"
#> [6] "how"         "tokens"      "works"       "."
#> 
# removing punctuation marks and lowercasing texts
tokens(char_tolower(txt), remove_punct = TRUE)
#> tokens from 2 documents.
#> doc1 :
#> [1] "this"     "is"       "a"        "sample"   "of"       "tokens"
#> 
#> doc2 :
#> [1] "another"     "sentence"    "to"          "demonstrate" "how"
#> [6] "tokens"      "works"
#> 
# keeping versus removing hyphens
tokens("quanteda data objects are auto-loading.", remove_punct = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "quanteda"     "data"         "objects"      "are"          "auto-loading"
#> 
tokens("quanteda data objects are auto-loading.", remove_punct = TRUE, remove_hyphens = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "quanteda" "data"     "objects"  "are"      "auto"     "loading"
#> 
# keeping versus removing symbols
tokens("<tags> and other + symbols.", remove_symbols = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "<"       "tags"    ">"       "and"     "other"   "+"       "symbols"
#> [8] "."
#> 
tokens("<tags> and other + symbols.", remove_symbols = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "tags"    "and"     "other"   "symbols" "."
#> 
tokens("<tags> and other + symbols.", remove_symbols = FALSE, what = "fasterword")
#> tokens from 1 document.
#> text1 :
#> [1] "<tags>"   "and"      "other"    "+"        "symbols."
#> 
tokens("<tags> and other + symbols.", remove_symbols = TRUE, what = "fasterword")
#> tokens from 1 document.
#> text1 :
#> [1] "<tags>"   "and"      "other"    "symbols."
#> 
## examples with URLs - hardly perfect!
txt <- "Repo https://githib.com/kbenoit/quanteda, and www.stackoverflow.com."
tokens(txt, remove_url = TRUE, remove_punct = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Repo"                  "and"                   "www.stackoverflow.com"
#> 
tokens(txt, remove_url = FALSE, remove_punct = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "Repo"                  "https"                 "githib.com"
#> [4] "kbenoit"               "quanteda"              "and"
#> [7] "www.stackoverflow.com"
#> 
tokens(txt, remove_url = FALSE, remove_punct = TRUE, what = "fasterword")
#> tokens from 1 document.
#> text1 :
#> [1] "Repo"
#> [2] "https://githib.com/kbenoit/quanteda,"
#> [3] "and"
#> [4] "www.stackoverflow.com."
#> 
tokens(txt, remove_url = FALSE, remove_punct = FALSE, what = "fasterword")
#> tokens from 1 document.
#> text1 :
#> [1] "Repo"
#> [2] "https://githib.com/kbenoit/quanteda,"
#> [3] "and"
#> [4] "www.stackoverflow.com."
#> 
## MORE COMPARISONS
txt <- "#textanalysis is MY <3 4U @myhandle gr8 #stuff :-)"
tokens(txt, remove_punct = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "#textanalysis" "is"            "MY"            "3"
#> [5] "4U"            "@myhandle"     "gr8"           "#stuff"
#> 
tokens(txt, remove_punct = TRUE, remove_twitter = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "textanalysis" "is"           "MY"           "3"            "4U"
#> [6] "myhandle"     "gr8"          "stuff"
#> 
#tokens("great website http://textasdata.com", remove_url = FALSE) #tokens("great website http://textasdata.com", remove_url = TRUE) txt <- c(text1="This is $10 in 999 different ways,\n up and down; left and right!", text2="@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.") tokens(txt, verbose = TRUE)
#> Starting tokenization...
#> ...tokenizing 1 of 1 blocks
#> ...preserving hyphens
#> ...preserving Twitter characters (#, @)
#> ...serializing tokens
#> 34 unique types
#> ...total elapsed: 0.0169999999999391 seconds.
#> Finished tokenizing and cleaning 2 texts.
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "$"         "10"        "in"        "999"
#>  [7] "different" "ways"      ","         "up"        "and"       "down"
#> [13] ";"         "left"      "and"       "right"     "!"
#> 
#> text2 :
#>  [1] "@kenbenoit"     "working"        ":"              "on"
#>  [5] "#quanteda"      "2day"           "4ever"          ","
#>  [9] "http"           ":"              "/"              "/"
#> [13] "textasdata.com" "?"              "page"           "="
#> [17] "123"            "."
#> 
tokens(txt, remove_numbers = TRUE, remove_punct = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This"      "is"        "in"        "different" "ways"      "up"
#> [7] "and"       "down"      "left"      "and"       "right"
#> 
#> text2 :
#> [1] "@kenbenoit"     "working"        "on"             "#quanteda"
#> [5] "2day"           "4ever"          "http"           "textasdata.com"
#> [9] "page"
#> 
tokens(txt, remove_numbers = FALSE, remove_punct = TRUE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "10"        "in"        "999"       "different"
#>  [7] "ways"      "up"        "and"       "down"      "left"      "and"
#> [13] "right"
#> 
#> text2 :
#>  [1] "@kenbenoit"     "working"        "on"             "#quanteda"
#>  [5] "2day"           "4ever"          "http"           "textasdata.com"
#>  [9] "page"           "123"
#> 
tokens(txt, remove_numbers = TRUE, remove_punct = FALSE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "$"         "in"        "different" "ways"
#>  [7] ","         "up"        "and"       "down"      ";"         "left"
#> [13] "and"       "right"     "!"
#> 
#> text2 :
#>  [1] "@kenbenoit"     "working"        ":"              "on"
#>  [5] "#quanteda"      "2day"           "4ever"          ","
#>  [9] "http"           ":"              "/"              "/"
#> [13] "textasdata.com" "?"              "page"           "="
#> [17] "."
#> 
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      "is"        "$"         "10"        "in"        "999"
#>  [7] "different" "ways"      ","         "up"        "and"       "down"
#> [13] ";"         "left"      "and"       "right"     "!"
#> 
#> text2 :
#>  [1] "@kenbenoit"     "working"        ":"              "on"
#>  [5] "#quanteda"      "2day"           "4ever"          ","
#>  [9] "http"           ":"              "/"              "/"
#> [13] "textasdata.com" "?"              "page"           "="
#> [17] "123"            "."
#> 
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
#> tokens from 2 documents.
#> text1 :
#>  [1] "This"      " "         "is"        " "         "$"         "10"
#>  [7] " "         "in"        " "         "999"       " "         "different"
#> [13] " "         "ways"      ","         "\n"        " "         "up"
#> [19] " "         "and"       " "         "down"      ";"         " "
#> [25] "left"      " "         "and"       " "         "right"     "!"
#> 
#> text2 :
#>  [1] "@kenbenoit"     " "              "working"        ":"
#>  [5] " "              "on"             " "              "#quanteda"
#>  [9] " "              "2day"           "\t"             "4ever"
#> [13] ","              " "              "http"           ":"
#> [17] "/"              "/"              "textasdata.com" "?"
#> [21] "page"           "="              "123"            "."
#> 
tokens(txt, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This"      "is"        "in"        "different" "ways"      "up"
#> [7] "and"       "down"      "left"      "and"       "right"
#> 
#> text2 :
#> [1] "@kenbenoit" "working"    "on"         "#quanteda"  "2day"
#> [6] "4ever"
#> 
# character level
tokens("Great website: http://textasdata.com?page=123.", what = "character")
#> tokens from 1 document.
#> text1 :
#>  [1] "G" "r" "e" "a" "t" "w" "e" "b" "s" "i" "t" "e" ":" "h" "t" "t" "p" ":" "/"
#> [20] "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g"
#> [39] "e" "=" "1" "2" "3" "."
#> 
tokens("Great website: http://textasdata.com?page=123.", what = "character", remove_separators = FALSE)
#> tokens from 1 document.
#> text1 :
#>  [1] "G" "r" "e" "a" "t" " " "w" "e" "b" "s" "i" "t" "e" ":" " " "h" "t" "t" "p"
#> [20] ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p"
#> [39] "a" "g" "e" "=" "1" "2" "3" "."
#> 
# sentence level
tokens(c("Kurt Vonnegut said; only assholes use semi-colons.",
         "Today is Thursday in Canberra: It is yesterday in London.",
         "Today is Thursday in Canberra: \nIt is yesterday in London.",
         "To be? Or\nnot to be?"),
       what = "sentence")
#> tokens from 4 documents.
#> text1 :
#> [1] "Kurt Vonnegut said; only assholes use semi-colons."
#> 
#> text2 :
#> [1] "Today is Thursday in Canberra: It is yesterday in London."
#> 
#> text3 :
#> [1] "Today is Thursday in Canberra: It is yesterday in London."
#> 
#> text4 :
#> [1] "To be?"        "Or not to be?"
#> 
tokens(data_corpus_inaugural[c(2,40)], what = "sentence")
#> tokens from 2 documents.
#> 1793-Washington :
#> [1] "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate."
#> [2] "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America."
#> [3] "Previous to the execution of any official act of the President the Constitution requires an oath of office."
#> [4] "This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony."
#> 
#> 1945-Roosevelt :
#>  [1] "Chief Justice, Mr. Vice President, my friends, you will understand and, I believe, agree with my wish that the form of this inauguration be simple and its words brief."
#>  [2] "We Americans of today, together with our allies, are passing through a period of supreme test."
#>  [3] "It is a test of our courage -- of our resolve -- of our wisdom -- our essential democracy."
#>  [4] "If we meet that test -- successfully and honorably -- we shall perform a service of historic importance which men and women and children will honor throughout all time."
#>  [5] "As I stand here today, having taken the solemn oath of office in the presence of my fellow countrymen -- in the presence of our God -- I know that it is America's purpose that we shall not fail."
#>  [6] "In the days and in the years that are to come we shall work for a just and honorable peace, a durable peace, as today we work and fight for total victory in war."
#>  [7] "We can and we will achieve such a peace."
#>  [8] "We shall strive for perfection."
#>  [9] "We shall not achieve it immediately -- but we still shall strive."
#> [10] "We may make mistakes -- but they must never be mistakes which result from faintness of heart or abandonment of moral principle."
#> [11] "I remember that my old schoolmaster, Dr. Peabody, said, in days that seemed to us then to be secure and untroubled: \"Things in life will not always run smoothly."
#> [12] "Sometimes we will be rising toward the heights -- then all will seem to reverse itself and start downward."
#> [13] "The great fact to remember is that the trend of civilization itself is forever upward; that a line drawn through the middle of the peaks and the valleys of the centuries always has an upward trend.\""
#> [14] "Our Constitution of 1787 was not a perfect instrument; it is not perfect yet."
#> [15] "But it provided a firm base upon which all manner of men, of all races and colors and creeds, could build our solid structure of democracy."
#> [16] "And so today, in this year of war, 1945, we have learned lessons -- at a fearful cost -- and we shall profit by them."
#> [17] "We have learned that we cannot live alone, at peace; that our own well-being is dependent on the well-being of other nations far away."
#> [18] "We have learned that we must live as men, not as ostriches, nor as dogs in the manger."
#> [19] "We have learned to be citizens of the world, members of the human community."
#> [20] "We have learned the simple truth, as Emerson said, that \"The only way to have a friend is to be one.\""
#> [21] "We can gain no lasting peace if we approach it with suspicion and mistrust or with fear."
#> [22] "We can gain it only if we proceed with the understanding, the confidence, and the courage which flow from conviction."
#> [23] "The Almighty God has blessed our land in many ways."
#> [24] "He has given our people stout hearts and strong arms with which to strike mighty blows for freedom and truth."
#> [25] "He has given to our country a faith which has become the hope of all peoples in an anguished world."
#> [26] "So we pray to Him now for the vision to see our way clearly -- to see the way that leads to a better life for ourselves and for all our fellow men -- to the achievement of His will to peace on earth."
#> 
# removing features (stopwords) from tokenized texts
txt <- char_tolower(c(mytext1 = "This is a short test sentence.",
                      mytext2 = "Short.",
                      mytext3 = "Short, shorter, and shortest."))
tokens(txt, remove_punct = TRUE)
#> tokens from 3 documents.
#> mytext1 :
#> [1] "this"     "is"       "a"        "short"    "test"     "sentence"
#> 
#> mytext2 :
#> [1] "short"
#> 
#> mytext3 :
#> [1] "short"    "shorter"  "and"      "shortest"
#> 
### removeFeatures(tokens(txt, remove_punct = TRUE), stopwords("english"))
# ngram tokenization
### tokens(txt, remove_punct = TRUE, ngrams = 2)
### tokens(txt, remove_punct = TRUE, ngrams = 2, skip = 1, concatenator = " ")
### tokens(txt, remove_punct = TRUE, ngrams = 1:2)
# removing features from ngram tokens
### removeFeatures(tokens(txt, remove_punct = TRUE, ngrams = 1:2), stopwords("english"))
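
Note that removeFeatures() is a deprecated name; in current quanteda releases the equivalent operation is tokens_remove(). A minimal sketch of the modern form (output omitted):

# stopword removal with the current tokens_remove() API
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))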