
Text Analytics in R

Introduction to tokenizing text from books written by Mark Twain.

1 August, 2022


“Classic — a book which people praise and don’t read.”
Mark Twain
Hopefully, you have proved Mark Twain wrong and have indulged in one of his classics. If not, perhaps this is a great introduction to learning more about his books through text analytics.

What is text analytics?

Text analytics is the process of examining unstructured data in the form of text to gather insights about patterns and topics of interest.

Why is it important?

There are many reasons why text analytics is important, a key one being to understand the sentiment and emotions expressed in the applications and services we use every day. Using text analytics, we can extract meaningful information from tweets, emails, text messages, advertisements, maps, and much more.
Here is an excellent book if you are looking to dive deeper into the subject — Text Mining with R.
For this tutorial, I'm going to show you how to get started with essential text analytics capabilities in R.

Getting Started

First, we need to install the gutenbergr package, which gives us access to its library of available books and publications.
The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. This includes both tools for downloading books (and stripping header/footer information), and a complete dataset of Project Gutenberg metadata that can be used to find words of interest.
Let's install and load the library in RStudio.
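A minimal sketch of the setup (installation is a one-time step):

# Install once from CRAN, then load the package
install.packages("gutenbergr")
library(gutenbergr)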

Mark Twain Books

Now, we are going to pull several books written by Mark Twain from the gutenbergr library.

Adventures of Huckleberry Finn - Project Gutenberg ID: 76

The Adventures of Tom Sawyer - Project Gutenberg ID: 74

The Innocents Abroad - Project Gutenberg ID: 3176

Life on the Mississippi - Project Gutenberg ID: 245
In the Project Gutenberg collection, each book is tagged with an ID number, which we will need in order to locate and download it.
We pull the books using the gutenberg_download() function and save the result to a mark_twain data frame.
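Based on the IDs listed above, the download step looks like this:

# Download all four books in one call using their Project Gutenberg IDs
mark_twain <- gutenberg_download(c(76, 74, 3176, 245))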
[Image: Snapshot of the mark_twain data frame]

In R, each line of text is pulled into its own row and tagged with a column corresponding to the book's ID number. It is clear that the data is messy and doesn't provide much use for analysis at the moment.

Identifying Stop Words

When you analyze any text, there will always be common, low-information words that can skew the results depending on what patterns or trends you are trying to identify. These are called stop words.
Whether to remove stop words is up to you, but for this example, we will go ahead and remove them.
First, we will need to load the tidytext library.
Next, we will view the stop_words dataset that ships with tidytext (these are standard English stop words, not words taken from Mark Twain's books).
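A quick sketch of loading the package and inspecting the dataset:

library(tidytext)

# stop_words is a data frame bundled with tidytext; it combines
# several standard stop-word lexicons (onix, SMART, snowball)
stop_words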
[Image: View of the first few rows of the stop_words dataset]

Note we have 1,139 rows.

Tokenizing and Removing Stop Words

Now we will use the pipe operator from the dplyr library to remove stop words and tokenize our text.
Tokenization is the task of chopping text up into pieces, called tokens.
Think of tokenizing as breaking down a sentence word by word. This gives text analytic software the ability to provide more structure for analysis.
[Image: Tokenizing example. Source: Stanford NLP]

We will pipe together several steps to remove the stop words while tokenizing the mark_twain data frame.
Steps:

1. Use unnest_tokens(), passing in the inputs to specify what we want to tokenize and where to access the text.

2. Use anti_join() to exclude all words that are found in the stop_words dataset.

3. Save the result in a new variable, tidy_mark_twain.

4. Print tidy_mark_twain to inspect the result, as shown in the sketch below.
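Putting those steps together, a minimal sketch of the pipeline (dplyr supplies the pipe and anti_join()):

library(dplyr)
library(tidytext)

tidy_mark_twain <- mark_twain %>%
  unnest_tokens(word, text) %>%  # step 1: split each line of text into one word per row
  anti_join(stop_words)          # step 2: drop every word found in stop_words

# steps 3 and 4: the result is saved in tidy_mark_twain; print it to inspect
tidy_mark_twain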
[Image: The words are now individually separated, with the book ID tagged to each]

Did you notice that since the text was tokenized, we now have 182,706 rows, compared to the 1,139 rows we saw earlier? This is because each word now has its own row.

Frequency Distribution of Words

Our goal is to find patterns in the data, so an excellent way to start is to sort the words and find the most frequently used ones.
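A minimal sketch using dplyr's count(), which tallies and sorts the words in one step:

tidy_mark_twain %>%
  count(word, sort = TRUE)  # one row per word, most frequent first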
[Image: Sorted word counts. The most popular word is time.]

If you have read any of Mark Twain's books, it is no surprise that Tom is the second most popular word, but it is interesting to see time as the most frequently used word, appearing 1,226 times!

Visualizing the Data

Using the ggplot2 library, we can add some visual context to see which words are most frequently used.
Some key steps are done to provide a clear graph:

1. filter() — so we do not plot the counts for every single word, since that would be too much. Here we keep only words that appear more than 400 times.

2. mutate() — to reorder the words by frequency so the bars appear in a sensible order.

3. coord_flip() — to rotate the graph and make it more presentable.

If the rest of the code looks unfamiliar, check out my ggplot tutorial to learn more. A sketch of the full plotting code follows below.
[Image: Bar chart of the most frequently used words in Mark Twain's books]

Overview

Text is everywhere, which means there are endless opportunities to analyze and make sense of the unstructured data. Now you have a foundation that should help you begin your Text Analytics journey.

Stay tuned — I will be sharing more tutorials about using Twitter’s API to extract and scrape tweets.
