Title: Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
Description: Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in packages like 'dplyr', 'broom', 'tidyr', and 'ggplot2'. In this package, we provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages.
Authors: Gabriela De Queiroz [ctb], Colin Fay [ctb], Emil Hvitfeldt [ctb], Os Keyes [ctb], Kanishka Misra [ctb], Tim Mastny [ctb], Jeff Erickson [ctb], David Robinson [aut], Julia Silge [aut, cre]
Maintainer: Julia Silge <[email protected]>
License: MIT + file LICENSE
Version: 0.4.2.9000
Built: 2024-11-06 04:58:31 UTC
Source: https://github.com/juliasilge/tidytext
Calculate and bind the term frequency and inverse document frequency of a tidy text dataset, along with the product, tf-idf, to the dataset. Each of these values is added as a column. This function supports non-standard evaluation through the tidyeval framework.
bind_tf_idf(tbl, term, document, n)
tbl: A tidy text dataset with one row per term per document
term: Column containing terms, as a string or symbol
document: Column containing document IDs, as a string or symbol
n: Column containing document-term counts, as a string or symbol
The arguments term, document, and n are passed by expression and support quasiquotation; you can unquote strings and symbols. If the dataset is grouped, the grouping is ignored for the calculation but retained in the output. The dataset must have exactly one row per document-term combination for this to work.
library(dplyr)
library(janeaustenr)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

book_words

# find the words most distinctive to each document
book_words %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))
Create a sparse matrix from the row names, column names, and values in a table. This function supports non-standard evaluation through the tidyeval framework.
cast_sparse(data, row, column, value, ...)
data: A tbl
row: Column name to use as row names in the sparse matrix, as a string or symbol
column: Column name to use as column names in the sparse matrix, as a string or symbol
value: Column name to use as sparse matrix values (default 1), as a string or symbol
...: Extra arguments to pass on to Matrix::sparseMatrix()
Note that cast_sparse ignores groups in a grouped tbl_df. The arguments row, column, and value are passed by expression and support quasiquotation; you can unquote strings and symbols.
A sparse Matrix object, with one row for each unique value in the row column, one column for each unique value in the column column, and as many non-zero values as there are rows in data.
dat <- data.frame(
  a = c("row1", "row1", "row2", "row2", "row2"),
  b = c("col1", "col2", "col1", "col3", "col4"),
  val = 1:5
)

cast_sparse(dat, a, b)

cast_sparse(dat, a, b, val)
This turns a "tidy" one-term-per-document-per-row data frame into a DocumentTermMatrix or TermDocumentMatrix from the tm package, or a dfm from the quanteda package. These functions support non-standard evaluation through the tidyeval framework. Groups are ignored.
cast_tdm(data, term, document, value, weighting = tm::weightTf, ...)

cast_dtm(data, document, term, value, weighting = tm::weightTf, ...)

cast_dfm(data, document, term, value, ...)
data: Table with one term per document per row
term: Column containing terms, as a string or symbol
document: Column containing document IDs, as a string or symbol
value: Column containing values, as a string or symbol
weighting: The weighting function for the DTM/TDM (default is term frequency, effectively unweighted)
...: Extra arguments passed on to the underlying sparse matrix construction
The arguments term, document, and value are passed by expression and support quasiquotation; you can unquote strings and symbols.
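For example (a brief sketch, assuming the dplyr, janeaustenr, tm, and quanteda packages are installed), a tidy word-count table can be cast into each of the three matrix classes:

library(dplyr)
library(janeaustenr)

# one row per book/word pair, with a count column n
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word)

book_words %>% cast_dtm(book, word, n)   # a tm DocumentTermMatrix
book_words %>% cast_tdm(word, book, n)   # a tm TermDocumentMatrix
book_words %>% cast_dfm(book, word, n)   # a quanteda dfm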
Tidy a corpus object from the quanteda package. tidy returns a tbl_df with one row per document, with a text column containing the document's text and one column for each document-level metadata field. glance returns a one-row tbl_df with corpus-level metadata, such as source and created. For Corpus objects from the tm package, see tidy.Corpus().
## S3 method for class 'corpus'
tidy(x, ...)

## S3 method for class 'corpus'
glance(x, ...)
x: A corpus object from the quanteda package
...: Extra arguments, not used
For the most part, the tidy output is equivalent to the "documents" data frame in the corpus object, except that it is converted to a tbl_df and the texts column is renamed to text to be consistent with other uses in tidytext. Similarly, the glance output is simply the "metadata" object, with NULL fields removed and turned into a one-row tbl_df.
if (requireNamespace("quanteda", quietly = TRUE)) { data("data_corpus_inaugural", package = "quanteda") data_corpus_inaugural tidy(data_corpus_inaugural) }
if (requireNamespace("quanteda", quietly = TRUE)) { data("data_corpus_inaugural", package = "quanteda") data_corpus_inaugural tidy(data_corpus_inaugural) }
Tidy dictionary objects from the quanteda package
## S3 method for class 'dictionary2'
tidy(x, regex = FALSE, ...)
x: A dictionary object
regex: Whether to turn dictionary items from a glob to a regex
...: Extra arguments, not used
A data frame with two columns: category and word.
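For example (a brief sketch; the small dictionary below is made up for illustration and requires the quanteda package):

if (requireNamespace("quanteda", quietly = TRUE)) {
  # a tiny, hypothetical two-category dictionary
  dict <- quanteda::dictionary(list(
    positive = c("good", "great", "happy"),
    negative = c("bad", "awful", "sad")
  ))

  # one row per category/word pair
  tidy(dict)

  # convert glob-style entries to regular expressions
  tidy(dict, regex = TRUE)
}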
Get specific sentiment lexicons in a tidy format, with one row per word, in a form that can be joined with a one-word-per-row dataset. The "bing" option comes from the included sentiments() data frame, and the others call the relevant function in the textdata package.
get_sentiments(lexicon = c("bing", "afinn", "loughran", "nrc"))
lexicon: The sentiment lexicon to retrieve; either "afinn", "bing", "nrc", or "loughran"
A tbl_df with a word column, and either a sentiment column (if lexicon is not "afinn") or a numeric value column (if lexicon is "afinn").
library(dplyr)

get_sentiments("bing")

## Not run:
get_sentiments("afinn")
get_sentiments("nrc")
## End(Not run)
Get a specific stop word lexicon via the stopwords package's stopwords function, in a tidy format with one word per row.
get_stopwords(language = "en", source = "snowball")
language: The language of the stop word lexicon, specified as a two-letter ISO code such as "en" (the default)
source: The source of the stop word lexicon. Default is "snowball"
A tibble with two columns, word and lexicon. The lexicon column records the source of the stop word list.
library(dplyr)

get_stopwords()
get_stopwords(source = "smart")
get_stopwords("es", "snowball")
get_stopwords("ru", "snowball")
Tidy the results of a Latent Dirichlet Allocation or Correlated Topic Model.
## S3 method for class 'LDA'
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)

## S3 method for class 'CTM'
tidy(x, matrix = c("beta", "gamma"), log = FALSE, ...)

## S3 method for class 'LDA'
augment(x, data, ...)

## S3 method for class 'CTM'
augment(x, data, ...)

## S3 method for class 'LDA'
glance(x, ...)

## S3 method for class 'CTM'
glance(x, ...)
x: An LDA or CTM (or LDA_VEM/CTM_VEM) object from the topicmodels package
matrix: Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix
log: Whether beta/gamma should be on a log scale; default FALSE
...: Extra arguments, not used
data: For augment, the original document-term data the model was fit on, with one row per document-term pair and columns document and term (such as is returned by the tdm tidiers)
tidy returns a tidied version of either the beta or gamma matrix.

If matrix == "beta" (the default), it returns a table with one row per topic and term, with columns:

topic: Topic, as an integer
term: Term
beta: Probability of a term being generated from a topic according to the multinomial model

If matrix == "gamma", it returns a table with one row per topic and document, with columns:

topic: Topic, as an integer
document: Document name or ID
gamma: Probability of the topic given the document
augment returns a table with one row per original document-term pair, such as is returned by the tdm tidiers, with columns:

document: Name of the document (if present), or index
term: Term
.topic: Topic assignment

If the data argument is provided, any columns in the original data are included, combined based on the document and term columns.
glance always returns a one-row table, with columns for the number of iterations used, the number of terms in the model, and, if an LDA_VEM, the parameter of the Dirichlet distribution for topics over documents.
if (requireNamespace("topicmodels", quietly = TRUE)) { set.seed(2016) library(dplyr) library(topicmodels) data("AssociatedPress", package = "topicmodels") ap <- AssociatedPress[1:100, ] lda <- LDA(ap, control = list(alpha = 0.1), k = 4) # get term distribution within each topic td_lda <- tidy(lda) td_lda library(ggplot2) # visualize the top terms within each topic td_lda_filtered <- td_lda %>% filter(beta > .004) %>% mutate(term = reorder(term, beta)) ggplot(td_lda_filtered, aes(term, beta)) + geom_bar(stat = "identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, size = 15)) # get classification of each document td_lda_docs <- tidy(lda, matrix = "gamma") td_lda_docs doc_classes <- td_lda_docs %>% group_by(document) %>% top_n(1) %>% ungroup() doc_classes # which were we most uncertain about? doc_classes %>% arrange(gamma) }
if (requireNamespace("topicmodels", quietly = TRUE)) { set.seed(2016) library(dplyr) library(topicmodels) data("AssociatedPress", package = "topicmodels") ap <- AssociatedPress[1:100, ] lda <- LDA(ap, control = list(alpha = 0.1), k = 4) # get term distribution within each topic td_lda <- tidy(lda) td_lda library(ggplot2) # visualize the top terms within each topic td_lda_filtered <- td_lda %>% filter(beta > .004) %>% mutate(term = reorder(term, beta)) ggplot(td_lda_filtered, aes(term, beta)) + geom_bar(stat = "identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, size = 15)) # get classification of each document td_lda_docs <- tidy(lda, matrix = "gamma") td_lda_docs doc_classes <- td_lda_docs %>% group_by(document) %>% top_n(1) %>% ungroup() doc_classes # which were we most uncertain about? doc_classes %>% arrange(gamma) }
Tidy LDA models fit by the mallet package, which wraps the Mallet topic modeling package in Java. The arguments and return values are similar to lda_tidiers().
## S3 method for class 'jobjRef'
tidy(
  x,
  matrix = c("beta", "gamma"),
  log = FALSE,
  normalized = TRUE,
  smoothed = TRUE,
  ...
)

## S3 method for class 'jobjRef'
augment(x, data, ...)
x: A jobjRef object of type RTopicModel, such as created by mallet::MalletLDA()
matrix: Whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix
log: Whether beta/gamma should be on a log scale; default FALSE
normalized: If true (default), normalize so that each document or word sums to one across the topics. If false, values will be integers representing the actual number of word-topic or document-topic assignments.
smoothed: If true (default), add the smoothing parameter to each value to avoid any values being zero; this smoothing parameter is set when the model is created with mallet::MalletLDA()
...: Extra arguments, not used
data: For augment, the original document-term data, with one row per document-term pair and columns document and term (such as is returned by the tdm tidiers)
Note that the LDA models from mallet::MalletLDA() are technically a special case of S4 objects with class jobjRef. These are thus implemented as jobjRef tidiers, with a check for whether the toString output is as expected.
augment must be provided a data argument containing one row per original document-term pair, such as is returned by the tdm tidiers, containing columns document and term. It returns that same data with an additional column .topic with the topic assignment for that document-term combination.
See also lda_tidiers(), mallet::mallet.doc.topics(), and mallet::mallet.topic.words().
## Not run:
library(mallet)
library(dplyr)
library(ggplot2)

data("AssociatedPress", package = "topicmodels")
td <- tidy(AssociatedPress)

# mallet needs a file with stop words
tmp <- tempfile()
writeLines(stop_words$word, tmp)

# two vectors: one with document IDs, one with text
docs <- td %>%
  group_by(document = as.character(document)) %>%
  summarize(text = paste(rep(term, count), collapse = " "))

docs <- mallet.import(docs$document, docs$text, tmp)

# create and run a topic model
topic_model <- MalletLDA(num.topics = 4)
topic_model$loadDocuments(docs)
topic_model$train(20)

# tidy the word-topic combinations
td_beta <- tidy(topic_model)
td_beta

# Examine the four topics
td_beta %>%
  group_by(topic) %>%
  top_n(8, beta) %>%
  ungroup() %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta)) +
  geom_col() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

# find the assignments of each word in each document
assignments <- augment(topic_model, td)
assignments
## End(Not run)
English negators, modals, and adverbs, as a data frame. A few of these entries are two-word phrases instead of single words.
nma_words
A data frame with 44 rows and 2 variables:

word: An English word or bigram
modifier: The modifier type for the word, either "negator", "modal", or "adverb"

http://saifmohammad.com/WebPages/SCL.html#NMA
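For example (a quick look at the dataset, in the style of the parts_of_speech example below; it assumes the modifier column name described above):

library(dplyr)

nma_words

nma_words %>%
  count(modifier)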
Parts of speech for English words from the Moby Project by Grady Ward. Words with non-ASCII characters and items with a space have been removed.
parts_of_speech
A data frame with 205,985 rows and 2 variables:

word: An English word
pos: The part of speech of the word. One of 13 options, such as "Noun", "Adverb", "Adjective"

Another dataset of English parts of speech, available only for non-commercial use, is available as part of SUBTLEXus at https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/.

https://archive.org/details/mobypartofspeech03203gut
library(dplyr)

parts_of_speech

parts_of_speech %>%
  count(pos, sort = TRUE)
Reorder a column before plotting with faceting, such that the values are ordered within each facet. This requires two steps: apply reorder_within to the column, then add either scale_x_reordered or scale_y_reordered to the plot. This is implemented as a bit of a hack: it appends ___ and then the facet value to the end of each string.
reorder_within(x, by, within, fun = mean, sep = "___", ...)

scale_x_reordered(..., labels = reorder_func, sep = deprecated())

scale_y_reordered(..., labels = reorder_func, sep = deprecated())

reorder_func(x, sep = "___")
x: Vector to reorder.
by: Vector of the same length, to use for reordering.
within: Vector or list of vectors of the same length that will later be used for faceting. A list of vectors will be used to facet within multiple variables.
fun: Function to perform within each subset to determine the resulting ordering. By default, mean.
sep: Separator used to distinguish the value from the facet it is ordered within (default "___").
...: In reorder_within, extra arguments passed on to reorder(); in the scale functions, extra arguments passed on to the underlying ggplot2 discrete scale.
labels: Function used to transform the axis labels; by default reorder_func, which strips the separator and facet suffix.
"Ordering categories within ggplot2 Facets" by Tyler Rinker: https://trinkerrstuff.wordpress.com/2016/12/23/ordering-categories-within-ggplot2-facets/
library(tidyr)
library(ggplot2)

iris_gathered <- gather(iris, metric, value, -Species)

# reordering doesn't work within each facet (see Sepal.Width):
ggplot(iris_gathered, aes(reorder(Species, value), value)) +
  geom_boxplot() +
  facet_wrap(~ metric)

# reorder_within and scale_x_reordered work.
# (Note that you need to set scales = "free_x" in the facet)
ggplot(iris_gathered, aes(reorder_within(Species, value, metric), value)) +
  geom_boxplot() +
  scale_x_reordered() +
  facet_wrap(~ metric, scales = "free_x")

# to reorder within multiple variables, set within to the list of
# facet variables.
ggplot(mtcars, aes(reorder_within(carb, mpg, list(vs, am)), mpg)) +
  geom_boxplot() +
  scale_x_reordered() +
  facet_wrap(vs ~ am, scales = "free_x")
Lexicon for opinion and sentiment analysis in a tidy data frame. This dataset is included in this package with permission of the creators, and may be used in research, commercial, and other contexts with attribution, citing either the paper or the URL below.
sentiments
A data frame with 6,786 rows and 2 variables:

word: An English word
sentiment: A sentiment for that word, either positive or negative
This lexicon was first published in:
Minqing Hu and Bing Liu, “Mining and summarizing customer reviews.”, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), Seattle, Washington, USA, Aug 22-25, 2004.
Words with non-ASCII characters were removed.
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
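For example (a brief sketch for inspecting the lexicon, in the style of the other dataset examples):

library(dplyr)

sentiments

sentiments %>%
  count(sentiment)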
Tidy topic models fit by the stm package. The arguments and return values are similar to lda_tidiers().
## S3 method for class 'STM'
tidy(
  x,
  matrix = c("beta", "gamma", "theta", "frex", "lift"),
  log = FALSE,
  document_names = NULL,
  ...
)

## S3 method for class 'estimateEffect'
tidy(x, ...)

## S3 method for class 'estimateEffect'
glance(x, ...)

## S3 method for class 'STM'
augment(x, data, ...)

## S3 method for class 'STM'
glance(x, ...)
x: An STM fitted model object, from either stm::stm() or stm::estimateEffect()
matrix: Which matrix to tidy: "beta" (per-term-per-topic, the default), "gamma" or "theta" (per-document-per-topic), "frex" (FREX term scores), or "lift" (lift term scores)
log: Whether beta/gamma/theta should be on a log scale; default FALSE
document_names: Optional vector of document names for use with per-document-per-topic tidying
...: Extra arguments for tidying, such as arguments passed on to stm::calcfrex() or stm::calclift() for the FREX and lift matrices
data: For augment, the original document-term data: either a dfm from quanteda or a table with one row per document-term pair, containing columns document and term
tidy returns a tidied version of either the beta, gamma, FREX, or lift matrix if called on an object from stm::stm(), or a tidied version of the estimated regressions if called on an object from stm::estimateEffect().

glance returns a tibble with exactly one row of model summaries.

augment must be provided a data argument, either a dfm from quanteda or a table containing one row per original document-term pair, such as is returned by the tdm tidiers, containing columns document and term. It returns that same data with an additional column .topic with the topic assignment for that document-term combination.
See also lda_tidiers(), stm::calcfrex(), and stm::calclift().
library(dplyr)
library(ggplot2)
library(stm)
library(janeaustenr)

austen_sparse <- austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(book, word) %>%
  cast_sparse(book, word, n)

topic_model <- stm(austen_sparse, K = 12, verbose = FALSE)

# tidy the word-topic combinations
td_beta <- tidy(topic_model)
td_beta

# Examine the topics
td_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  ggplot(aes(beta, term)) +
  geom_col() +
  facet_wrap(~ topic, scales = "free")

# high FREX words per topic
tidy(topic_model, matrix = "frex")

# high lift words per topic
tidy(topic_model, matrix = "lift")

# tidy the document-topic combinations, with optional document names
td_gamma <- tidy(topic_model, matrix = "gamma",
                 document_names = rownames(austen_sparse))
td_gamma

# using stm's gadarianFit, we can tidy the result of a model
# estimated with covariates
effects <- estimateEffect(1:3 ~ treatment, gadarianFit, gadarian)
glance(effects)
td_estimate <- tidy(effects)
td_estimate
English stop words from three lexicons, as a data frame. The snowball and SMART sets are pulled from the tm package. Note that words with non-ASCII characters have been removed.
stop_words
A data frame with 1,149 rows and 2 variables:

word: An English word
lexicon: The source of the stop word. Either "onix", "SMART", or "snowball"
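For example (a brief sketch for inspecting the three lexicons, in the style of the other dataset examples):

library(dplyr)

stop_words

stop_words %>%
  count(lexicon)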
Tidy a DocumentTermMatrix or TermDocumentMatrix into a three-column data frame, with one row per term per document: a document column, a term column, and a column of values (zeros are dropped).
## S3 method for class 'DocumentTermMatrix'
tidy(x, ...)

## S3 method for class 'TermDocumentMatrix'
tidy(x, ...)

## S3 method for class 'dfm'
tidy(x, ...)

## S3 method for class 'dfmSparse'
tidy(x, ...)

## S3 method for class 'simple_triplet_matrix'
tidy(x, row_names = NULL, col_names = NULL, ...)
x: A DocumentTermMatrix or TermDocumentMatrix object
...: Extra arguments, not used
row_names: Specify row names
col_names: Specify column names
if (requireNamespace("topicmodels", quietly = TRUE)) { data("AssociatedPress", package = "topicmodels") AssociatedPress tidy(AssociatedPress) }
if (requireNamespace("topicmodels", quietly = TRUE)) { data("AssociatedPress", package = "topicmodels") AssociatedPress tidy(AssociatedPress) }
Utility function to tidy a simple triplet matrix
tidy_triplet(x, triplets, row_names = NULL, col_names = NULL)
x: Object with rownames and colnames
triplets: A data frame or list of i, j, x
row_names: Row names, if not taken from rownames(x)
col_names: Column names, if not taken from colnames(x)
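A minimal sketch of a direct call is shown below, assuming tidy_triplet() is available for this kind of use; the mat and trip objects are hypothetical stand-ins for a matrix-like object and its triplet representation.

# a 2 x 3 matrix-like object that carries dimnames
mat <- matrix(0, nrow = 2, ncol = 3,
              dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))

# its non-zero entries in triplet form: row index i, column index j, value x
trip <- list(i = c(1, 1, 2), j = c(1, 2, 3), x = c(1, 2, 5))

# one row per non-zero entry, labeled using the dimnames of mat
tidy_triplet(mat, trip)

# row and column names can also be supplied explicitly
tidy_triplet(mat, trip,
             row_names = c("d1", "d2"),
             col_names = c("x", "y", "z"))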
Tidy a Corpus object from the tm package. Returns a data frame with one row per document, with a text column containing the document's text and one column for each local (per-document) metadata tag. For corpus objects from the quanteda package, see tidy.corpus().
## S3 method for class 'Corpus'
tidy(x, collapse = "\n", ...)
x: A Corpus object, such as a VCorpus or PCorpus
collapse: A string that should be used to collapse text within each document (if a document has multiple lines). Give NULL to not collapse strings, in which case the text will end up as a list column if there are multi-line documents.
...: Extra arguments, not used
library(dplyr) # displaying tbl_dfs

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # tm package examples
  txt <- system.file("texts", "txt", package = "tm")
  ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                  readerControl = list(language = "lat"))

  ovid
  tidy(ovid)

  # choose different options for collapsing text within each
  # document
  tidy(ovid, collapse = "")$text
  tidy(ovid, collapse = NULL)$text

  # another example from Reuters articles
  reut21578 <- system.file("texts", "crude", package = "tm")
  reuters <- VCorpus(DirSource(reut21578),
                     readerControl = list(reader = readReut21578XMLasPlain))

  reuters
  tidy(reuters)
}
These functions are wrappers around unnest_tokens(token = "characters") and unnest_tokens(token = "character_shingles").
unnest_characters(
  tbl,
  output,
  input,
  strip_non_alphanum = TRUE,
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)

unnest_character_shingles(
  tbl,
  output,
  input,
  n = 3L,
  n_min = n,
  strip_non_alphanum = TRUE,
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)
tbl: A data frame
output: Output column to be created, as a string or symbol.
input: Input column that gets split, as a string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols.
strip_non_alphanum: Should punctuation and white space be stripped?
format: Either "text", "man", "latex", "html", or "xml". When the format is "text", this function uses the tokenizers package. If not "text", this uses the hunspell tokenizer, and can tokenize only by "word".
to_lower: Whether to convert tokens to lowercase.
drop: Whether the original input column should get dropped. Ignored if the original input and new output column have the same name.
collapse: A character vector of variables to collapse text across, or NULL. For tokens like n-grams or sentences, text can be collapsed across rows within variables specified by collapse before tokenization. Grouping the data specifies variables to collapse across in the same way as the collapse argument.
...: Extra arguments passed on to the tokenizers package.
n: The number of characters in each shingle. This must be an integer greater than or equal to 1.
n_min: The minimum number of characters in each shingle. This must be an integer greater than or equal to 1, and less than or equal to n.
library(dplyr)
library(janeaustenr)

d <- tibble(txt = prideprejudice)

d %>% unnest_characters(word, txt)
d %>% unnest_character_shingles(word, txt, n = 3)
These functions are wrappers around unnest_tokens(token = "ngrams") and unnest_tokens(token = "skip_ngrams").
unnest_ngrams(
  tbl,
  output,
  input,
  n = 3L,
  n_min = n,
  ngram_delim = " ",
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)

unnest_skip_ngrams(
  tbl,
  output,
  input,
  n_min = 1,
  n = 3,
  k = 1,
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)
tbl: A data frame
output: Output column to be created, as a string or symbol.
input: Input column that gets split, as a string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols.
n: The number of words in the n-gram. This must be an integer greater than or equal to 1.
n_min: The minimum number of words in the n-gram. This must be an integer greater than or equal to 1, and less than or equal to n.
ngram_delim: The separator between words in an n-gram.
format: Either "text", "man", "latex", "html", or "xml". When the format is "text", this function uses the tokenizers package. If not "text", this uses the hunspell tokenizer, and can tokenize only by "word".
to_lower: Whether to convert tokens to lowercase.
drop: Whether the original input column should get dropped. Ignored if the original input and new output column have the same name.
collapse: A character vector of variables to collapse text across, or NULL. For tokens like n-grams or sentences, text can be collapsed across rows within variables specified by collapse before tokenization. Grouping the data specifies variables to collapse across in the same way as the collapse argument.
...: Extra arguments passed on to the tokenizers package.
k: For the skip n-gram tokenizer, the maximum skip distance between words. The function will compute all skip n-grams between 0 and k.
library(dplyr)
library(janeaustenr)

d <- tibble(txt = prideprejudice)

d %>% unnest_ngrams(word, txt, n = 2)
d %>% unnest_skip_ngrams(word, txt, n = 3, k = 1)
This function is a wrapper around unnest_tokens(token = "ptb").
unnest_ptb(
  tbl,
  output,
  input,
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)
tbl: A data frame
output: Output column to be created, as a string or symbol.
input: Input column that gets split, as a string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols.
format: Either "text", "man", "latex", "html", or "xml". When the format is "text", this function uses the tokenizers package. If not "text", this uses the hunspell tokenizer, and can tokenize only by "word".
to_lower: Whether to convert tokens to lowercase.
drop: Whether the original input column should get dropped. Ignored if the original input and new output column have the same name.
collapse: A character vector of variables to collapse text across, or NULL. For tokens like n-grams or sentences, text can be collapsed across rows within variables specified by collapse before tokenization. Grouping the data specifies variables to collapse across in the same way as the collapse argument.
...: Extra arguments passed on to the tokenizers package.
library(dplyr)
library(janeaustenr)

d <- tibble(txt = prideprejudice)

d %>% unnest_ptb(word, txt)
This function is a wrapper around unnest_tokens(token = "regex").
unnest_regex(
  tbl,
  output,
  input,
  pattern = "\\s+",
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)
tbl: A data frame
output: Output column to be created, as a string or symbol.
input: Input column that gets split, as a string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols.
pattern: A regular expression that defines the split.
format: Either "text", "man", "latex", "html", or "xml". When the format is "text", this function uses the tokenizers package. If not "text", this uses the hunspell tokenizer, and can tokenize only by "word".
to_lower: Whether to convert tokens to lowercase.
drop: Whether the original input column should get dropped. Ignored if the original input and new output column have the same name.
collapse: A character vector of variables to collapse text across, or NULL. For tokens like n-grams or sentences, text can be collapsed across rows within variables specified by collapse before tokenization. Grouping the data specifies variables to collapse across in the same way as the collapse argument.
...: Extra arguments passed on to the tokenizers package.
library(dplyr)
library(janeaustenr)

d <- tibble(txt = prideprejudice)

d %>% unnest_regex(word, txt, pattern = "Chapter [\\d]")
These functions are wrappers around unnest_tokens(token = "sentences"), unnest_tokens(token = "lines"), and unnest_tokens(token = "paragraphs").
unnest_sentences(
  tbl,
  output,
  input,
  strip_punct = FALSE,
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)

unnest_lines(
  tbl,
  output,
  input,
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)

unnest_paragraphs(
  tbl,
  output,
  input,
  paragraph_break = "\n\n",
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)
tbl: A data frame
output: Output column to be created, as a string or symbol.
input: Input column that gets split, as a string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols.
strip_punct: Should punctuation be stripped?
format: Either "text", "man", "latex", "html", or "xml". When the format is "text", this function uses the tokenizers package. If not "text", this uses the hunspell tokenizer, and can tokenize only by "word".
to_lower: Whether to convert tokens to lowercase.
drop: Whether the original input column should get dropped. Ignored if the original input and new output column have the same name.
collapse: A character vector of variables to collapse text across, or NULL. For tokens like n-grams or sentences, text can be collapsed across rows within variables specified by collapse before tokenization. Grouping the data specifies variables to collapse across in the same way as the collapse argument.
...: Extra arguments passed on to the tokenizers package.
paragraph_break: A string identifying the boundary between two paragraphs.
library(dplyr)
library(janeaustenr)

d <- tibble(txt = prideprejudice)

d %>% unnest_sentences(word, txt)
Split a column into tokens, flattening the table into one-token-per-row. This function supports non-standard evaluation through the tidyeval framework.
unnest_tokens(
  tbl,
  output,
  input,
  token = "words",
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)
tbl: A data frame
output: Output column to be created, as a string or symbol.
input: Input column that gets split, as a string or symbol. The output/input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols.
token: Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", and "ptb" (Penn Treebank). If a function, it should take a character vector and return a list of character vectors of the same length.
format: Either "text", "man", "latex", "html", or "xml". When the format is "text", this function uses the tokenizers package. If not "text", this uses the hunspell tokenizer, and can tokenize only by "word".
to_lower: Whether to convert tokens to lowercase.
drop: Whether the original input column should get dropped. Ignored if the original input and new output column have the same name.
collapse: A character vector of variables to collapse text across, or NULL. For tokens like n-grams or sentences, text can be collapsed across rows within variables specified by collapse before tokenization. Grouping the data specifies variables to collapse across in the same way as the collapse argument.
...: Extra arguments passed on to the tokenizers package, such as strip_punct for "words".
If format is anything other than "text", this uses the hunspell::hunspell_parse() tokenizer instead of the tokenizers package. This does not yet have support for tokenizing by any unit other than words.

Support for token = "tweets" was removed in tidytext 0.4.0 because of changes in upstream dependencies.
library(dplyr)
library(janeaustenr)

d <- tibble(txt = prideprejudice)
d

d %>% unnest_tokens(output = word, input = txt)

d %>% unnest_tokens(output = sentence, input = txt, token = "sentences")

d %>% unnest_tokens(output = ngram, input = txt, token = "ngrams", n = 2)

d %>% unnest_tokens(chapter, txt, token = "regex", pattern = "Chapter [\\d]")

d %>% unnest_tokens(shingle, txt, token = "character_shingles", n = 4)

# custom function
d %>% unnest_tokens(word, txt, token = stringr::str_split, pattern = " ")

# tokenize HTML
h <- tibble(row = 1:2,
            text = c("<h1>Text <b>is</b>", "<a href='example.com'>here</a>"))

h %>% unnest_tokens(word, text, format = "html")