Skip to contents

Pictures paint thousands of words, so figures are necessary for communicating complex ideas about texts. While tmtyro provides polished figures with visualize(), there’s always room for refinement or reimagining.

Customizing visualize()

The standard tmtyro functions like add_frequencies() prepare data that can be plugged directly into visualize(). Figures prepared this way are clean and ready for use in reports and presentations. Tweaking colors, themes, and labels can further tailor a visualization to an specific project, preference, or purpose, either using built-in features or exploring ggplot2 for greater flexibility.

Colors

tmtyro’s change_colors() function combines six ggplot2 functions into a single function with a single common interface.1

library(dplyr)
library(ggplot2)
library(tmtyro)

# Start by loading a corpus
corpus_dubliners <- get_gutenberg_corpus(2814) |> 
  load_texts() |> 
  identify_by(part) |> 
  standardize_titles()

# Choose a smaller subset of stories
select_titles <- unique(corpus_dubliners$doc_id)[c(1, 3, 12, 15)]

my_corpus <- corpus_dubliners |> 
  filter(doc_id %in% select_titles) |> 
  select(doc_id, word)
# standard output on left
word_freqs <- my_corpus |> 
  add_frequency() |> 
  visualize()

word_freqs

# custom colors on right
word_freqs |> 
  change_colors(palette = "Set1")

Themes

Themes built into ggplot2 or offered by an extension package provide a quick option for making figures more your own. Just remember to use the plus operator (+) when chaining functions from ggplot2.

word_freqs +
  theme_classic()

word_freqs |> 
  change_colors(palette = "Pastel1") +
  theme_dark()

Labels

Some figures make more sense with a title or different axis labels. The labs() function from ggplot2 makes it easy to adjust these.

my_corpus |> 
  add_frequency() |> 
  # Setting color_y is necessary when defining colors by Y values
  visualize(color_y = TRUE, reorder_y = TRUE) |> 
  change_colors(c(
    "said" = "goldenrod",
    "navy")) +
  labs(
    title = "Top Ten Words in Four Joyce Stories",
    subtitle = "The use of “said” in “Ivy Day” suggests more dialogue.",
    x = "prevalence",
    y = "word")

Recreating visualize() figures

Vectorized functions beginning get_...() and is...() are not directly compatible with visualize(), which is made for the standard workflow. But the visualizations they support can be recreated with a bit of effort, and the techniques learned in the process are transferable to other custom visualizations. For further guidance and advanced techniques, the ggplot2 documentation offers much more than can be shown here.

Corpus details

By default, a corpus prepared with load_texts() will visualize() into a bar chart showing word counts for each document. Preparing something manually is pretty simple, even if it doesn’t compare well to the default output:

# default output on left
visualize(corpus_dubliners)

# manual output prepared with dplyr's count()
corpus_dubliners |> 
  count(doc_id) |> 
  ggplot(aes(
    x = n, 
    y = doc_id)) +
  geom_col()

Among other things, visualize() preserves the order of documents from top to bottom, adjusts labeling, and adds some settings for theme and color. Alone, each is a simple change. But everything adds up when polishing publication-ready graphs, from gridlines to label spacing and number formatting:

corpus_dubliners |> 
  count(doc_id) |> 
  # reverse doc_id order
  mutate(doc_id = forcats::fct_rev(doc_id)) |> 
  ggplot(aes(
    x = n, 
    y = doc_id, 
    # add color
    fill = doc_id)) +
  geom_col(show.legend = FALSE) +
  # adjust number format and shift y-axis labels
  scale_x_continuous(
    labels = scales::label_comma(),
    expand = c(0, 0)) +
  # change the theme background
  theme_minimal() +
  # adjust labels
  labs(
    x = "length (words)",
    y = NULL) +
  # adjust grid lines
  theme(
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank())

Word frequencies

When used after add_frequency(), visualize() will prepare a faceted graph of some of the top word frequencies for each document. To create something similar manually, using mutate() with a vectorized function like get_frequency() or get_tf_by(), it’s necessary to prepare a table with distinct() and slice_max() before piping it to ggplot():

corpus_dubliners |>
  mutate(
    n = get_tf_by(word, doc_id)) |> 
  # dplyr's `distinct()` drops repeated rows
  distinct() |> 
  # dplyr's `slice_max()` takes the max n from each group
  slice_max(
    order_by = n, 
    by = doc_id, 
    n = 3) |> # show 3 words each
  ggplot(aes(
    x = n, 
    # `reorder_within()` reorders words for each facet
    y = tidytext::reorder_within(
      word, 
      by = n,
      within = doc_id))) +
  geom_col() +
  facet_wrap(vars(doc_id), scales = "free") +
  # `scale_y_reordered()` cleans up after `reorder_within()`
  tidytext::scale_y_reordered()

The resulting graph can be further customized with ggplot2’s standard functions. After a function like reorder_within(), the Y axis label especially needs some love.

Vocabulary richness

Default visualizations from add_vocabulary() are highly customized. It isn’t hard to make a simple version after a vectorized function like get_cumulative_vocabulary(), but this version can lack readability:

corpus_dubliners |> 
  group_by(doc_id) |> 
  mutate(
    vocab = get_cumulative_vocabulary(word), 
    progress = row_number()) |> 
  ungroup() |> 
  ggplot(aes(
    x = progress, 
    y = vocab, 
    color = doc_id)) +
  geom_line()

Adding direct labels is often worth the effort:

dubliners_vocab <- corpus_dubliners |> 
  group_by(doc_id) |> 
  mutate(
    vocab = get_cumulative_vocabulary(word), 
    progress = row_number()) |> 
  ungroup()

# Create a table of labels and locations
document_labels <- dubliners_vocab |> 
  group_by(doc_id) |> 
  summarize(
    vocab = last(vocab),
    progress = last(progress)) |> 
  ungroup()

dubliners_vocab |> 
  ggplot(aes(
    x = progress, 
    y = vocab, 
    color = doc_id)) +
  geom_line() +
  geom_point(
    data = document_labels) +
  # avoid overlapping labels
  ggrepel::geom_text_repel(
    data = document_labels,
    aes(label = doc_id)) +
  theme(legend.position = "none")