--- title: "Converting to and from Document-Term Matrix and Corpus objects" author: "Julia Silge and David Robinson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Converting to and from Document-Term Matrix and Corpus objects} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r} #| echo = FALSE knitr::opts_chunk$set( message = FALSE, warning = FALSE, eval = requireNamespace("tm", quietly = TRUE) && requireNamespace("quanteda", quietly = TRUE) && requireNamespace("topicmodels", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE) ) ``` ```{r} #| echo = FALSE library(ggplot2) theme_set(theme_bw()) ``` ### Tidying document-term matrices Many existing text mining datasets are in the form of a `DocumentTermMatrix` class (from the tm package). For example, consider the corpus of 2246 Associated Press articles from the topicmodels package: ```{r} library(tm) data("AssociatedPress", package = "topicmodels") AssociatedPress ``` If we want to analyze this with tidy tools, we need to turn it into a one-term-per-document-per-row data frame first. The `tidy` function does this. (For more on the tidy verb, [see the broom package](https://github.com/dgrtwo/broom)). ```{r} library(dplyr) library(tidytext) ap_td <- tidy(AssociatedPress) ``` Just as shown in [this vignette](tidytext.html), having the text in this format is convenient for analysis with the tidytext package. For example, you can perform sentiment analysis on these newspaper articles. ```{r} ap_sentiments <- ap_td %>% inner_join(get_sentiments("bing"), join_by(term == word)) ap_sentiments ``` We can find the most negative documents: ```{r} library(tidyr) ap_sentiments %>% count(document, sentiment, wt = count) %>% pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% mutate(sentiment = positive - negative) %>% arrange(sentiment) ``` Or visualize which words contributed to positive and negative sentiment: ```{r} #| fig.width = 7, #| fig.height = 4, #| fig.alt = 'Bar charts for the contribution of words to sentiment scores. The words "like" and "work" contribute the most to positive sentiment, and the words "killed" and "death" contribute the most to negative sentiment' library(ggplot2) ap_sentiments %>% count(sentiment, term, wt = count) %>% group_by(sentiment) %>% slice_max(n, n = 10) %>% mutate(term = reorder(term, n)) %>% ggplot(aes(n, term, fill = sentiment)) + geom_col(show.legend = FALSE) + facet_wrap(vars(sentiment), scales = "free_y") + labs(x = "Contribution to sentiment", y = NULL) ``` Note that a tidier is also available for the `dfm` class from the quanteda package: ```{r} library(methods) data("data_corpus_inaugural", package = "quanteda") d <- quanteda::tokens(data_corpus_inaugural) %>% quanteda::dfm() d tidy(d) ``` ### Casting tidy text data into a DocumentTermMatrix Some existing text mining tools or algorithms work only on sparse document-term matrices. Therefore, tidytext provides `cast_` verbs for converting from a tidy form to these matrices. 
```{r}
ap_td

# cast into a Document-Term Matrix
ap_td %>%
  cast_dtm(document, term, count)

# cast into a Term-Document Matrix
ap_td %>%
  cast_tdm(term, document, count)

# cast into quanteda's dfm
ap_td %>%
  cast_dfm(document, term, count)

# cast into a Matrix object
m <- ap_td %>%
  cast_sparse(document, term, count)

class(m)
dim(m)
```

This allows for easy reading, filtering, and processing to be done using dplyr and other tidy tools, after which the data can be converted into a document-term matrix for machine learning applications (a short sketch of one such workflow appears at the end of this vignette).

### Tidying corpus data

You can also tidy Corpus objects from the tm package. For example, consider a Corpus containing 20 documents, one for each of the Reuters articles about crude oil that ship as example data with the tm package:

```{r}
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- VCorpus(DirSource(reut21578),
                   readerControl = list(reader = readReut21578XMLasPlain))

reuters
```

The `tidy` verb creates a table with one row per document:

```{r}
reuters_td <- tidy(reuters)
reuters_td
```

Similarly, you can `tidy` a `corpus` object from the quanteda package:

```{r}
library(quanteda)

data("data_corpus_inaugural")

data_corpus_inaugural

inaug_td <- tidy(data_corpus_inaugural)
inaug_td
```

This lets us work with tidy tools like `unnest_tokens` to analyze the text alongside the metadata.

```{r}
inaug_words <- inaug_td %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

inaug_words
```

We could then, for example, see how the appearance of a word changes over time:

```{r}
inaug_freq <- inaug_words %>%
  count(Year, word) %>%
  complete(Year, word, fill = list(n = 0)) %>%
  group_by(Year) %>%
  mutate(year_total = sum(n),
         percent = n / year_total) %>%
  ungroup()

inaug_freq
```

We can then use the broom package to perform logistic regression on each word, modeling its frequency as a function of year.

```{r}
library(broom)

models <- inaug_freq %>%
  group_by(word) %>%
  filter(sum(n) > 50) %>%
  group_modify(
    ~ tidy(glm(cbind(n, year_total - n) ~ Year, ., family = "binomial"))
  ) %>%
  ungroup() %>%
  filter(term == "Year")

models

models %>%
  arrange(desc(abs(estimate)))
```

You can show these models as a volcano plot, which compares the effect size against its statistical significance:

```{r}
#| fig.width = 7,
#| fig.height = 5,
#| fig.alt = 'Volcano plot showing that words like "america" and "world" have increased over time with small p-values, while words like "public" and "institution" have decreased'
library(ggplot2)

models %>%
  mutate(adjusted.p.value = p.adjust(p.value)) %>%
  ggplot(aes(estimate, adjusted.p.value)) +
  geom_point() +
  scale_y_log10() +
  geom_text(aes(label = word), vjust = 1, hjust = 1,
            check_overlap = TRUE) +
  labs(x = "Estimated change over time",
       y = "Adjusted p-value")
```

We can also use the ggplot2 package to display the six terms whose frequency has changed the most over time.

```{r}
#| fig.width = 7,
#| fig.height = 6,
#| fig.alt = 'Scatterplot with LOESS smoothing lines showing that the words "america", "americans", "century", "children", "democracy", and "god" have increased over time'
library(scales)

models %>%
  slice_max(abs(estimate), n = 6) %>%
  inner_join(inaug_freq) %>%
  ggplot(aes(Year, percent)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(vars(word)) +
  scale_y_continuous(labels = percent_format()) +
  labs(y = "Frequency of word in speech")
```
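
### Casting back for machine learning

Earlier we noted that a casted document-term matrix can feed machine learning applications. Below is a minimal sketch of that idea: it filters the tidy AP data with dplyr, casts it back into a `DocumentTermMatrix`, and fits a small topic model with the topicmodels package (already listed in the setup chunk). The count threshold and `k = 2` are arbitrary illustrative choices, so the chunk is left unevaluated.

```{r}
#| eval = FALSE
library(topicmodels)

# Filter with dplyr, then cast the remaining counts back into a sparse DocumentTermMatrix
ap_dtm_small <- ap_td %>%
  group_by(term) %>%
  filter(sum(count) > 50) %>%   # keep only reasonably common terms (arbitrary cutoff)
  ungroup() %>%
  cast_dtm(document, term, count)

# Fit a two-topic LDA model; k = 2 is chosen purely for illustration
ap_lda <- LDA(ap_dtm_small, k = 2, control = list(seed = 1234))

# The fitted model can itself be tidied into per-topic word probabilities
tidy(ap_lda, matrix = "beta")
```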