
Introduction to ggplot2 in R

A quick guide to getting up and running with data visualization techniques in the ggplot2 package.

19 July, 2022


What is data visualization?

Data visualization is the practice of representing data in graphs, icons, presentations and more. It is most commonly used to translate complex data into digestible insights for a non-technical audience.
If you are new to data visualization, this book is a great starting point - Storytelling with Data
If you're interested in learning more about data visualization with Python, check out my other tutorial - Matplotlib in Python

What is R?

R is a programming language mainly used for statistical analysis. It is a common analysis tool in the finance and healthcare industries.
R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
Great resources for getting started with R,

What is ggplot2?

ggplot2 is an R package from the tidyverse. Its popularity is down to the simplicity of customizing graphs and removing or altering components in a plot at a high level of abstraction.
Check out this book if you're interested in learning more - Data Visualization in R With ggplot2
The syntax for plotting in ggplot follows a simple layering approach for building graphs.
1. data
2. aesthetics - variables
3. geometric style - this is where you define the style of graph
4. additional layers for customization - title, labels, axis, etc.
The structure looks similar to this.
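As a minimal sketch of that layered structure (not the author's original snippet), the block below first shows a generic template in comments and then one concrete instance. The instance uses R's built-in mtcars data frame and a scatter plot purely for illustration; neither appears elsewhere in this tutorial.

```r
library(ggplot2)

# Generic template; names in angle brackets are placeholders, not real objects:
# ggplot(data = <data>, aes(x = <x variable>, y = <y variable>)) +
#   geom_<style>() +
#   labs(title = "<title>")

# Concrete instance using the built-in mtcars data frame (illustration only).
template_plot <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + # 1. data, 2. aesthetics
  geom_point() +                                               # 3. geometric style
  labs(                                                        # 4. additional layers
    title = "Fuel economy by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

print(template_plot)
```

Each `+` adds one layer, so a plot can be built up or trimmed down one line at a time.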
For this tutorial, I will assume you have a basic working proficiency with R concepts.
Let's get started!

Getting our environment ready

To begin, we will need to install our tidyverse and ggplot2 packages.
Next, we will need to load the ggplot2 library.
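A minimal sketch of those two steps is shown here. The `requireNamespace()` guard is an addition of this sketch so that the install only runs when ggplot2 is missing; installing the full tidyverse is left as a commented-out line because it also brings in ggplot2.

```r
# Install once per machine. The tidyverse bundle includes ggplot2,
# so either line is enough on its own.
# install.packages("tidyverse")
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current session.
library(ggplot2)
```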

Bar Graphs

For data, we will be working with a dataset called reviews. The file has already been read into our environment. The reviews dataset is a collection of movie reviews from four main review sites: Fandango, Rotten Tomatoes, IMDB, and Metacritic.
The inputs we are interested in are:
1. data = reviews
2. aesthetics = (x-axis = Rating Site, y-axis = average rating)
3. geometric style = bar chart
To create the bar chart that shows the average ratings per website, we can do the following.
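The sketch below shows one way this could look. Since the original reviews file is not available here, it builds a small made-up stand-in: the `rating` column name and all of the values are assumptions for illustration, and only `rating_site` is a name taken from this tutorial. The averaging step with `aggregate()` is also an assumption about how the per-site means were obtained.

```r
library(ggplot2)

# Made-up stand-in for the reviews data (values are illustrative only).
reviews <- data.frame(
  rating_site = rep(c("Fandango", "Rotten Tomatoes", "IMDB", "Metacritic"), each = 3),
  rating = c(4.5, 4.0, 3.5, 3.0, 1.5, 4.5, 3.5, 3.0, 4.0, 3.0, 2.5, 3.5)
)

# One row per site holding its mean rating.
avg_reviews <- aggregate(rating ~ rating_site, data = reviews, FUN = mean)

bar_plot <- ggplot(data = avg_reviews, aes(x = rating_site, y = rating)) + # data + aesthetics
  geom_col() # bar chart whose heights come straight from the y values

print(bar_plot)
```

`geom_col()` is used because the averages are already computed; `geom_bar(stat = "identity")` is an equivalent spelling.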

Histograms

Histograms show us how frequently a value occurs. Below is a histogram showing the frequency distribution of the ratings in our reviews dataset. Notice that some additional layers have been added.
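A hedged sketch of such a histogram follows, again on a made-up stand-in for reviews (the `rating` column and its values are assumptions). The color, bin width, and title are arbitrary choices of this sketch. `scale_fill_identity()` is also an addition here: without it, a color name placed inside `aes()` is treated as a data value rather than a literal color. A common alternative is to pass `fill` directly to `geom_histogram()`.

```r
library(ggplot2)

# Made-up stand-in for the reviews data (values are illustrative only).
reviews <- data.frame(
  rating_site = rep(c("Fandango", "Rotten Tomatoes", "IMDB", "Metacritic"), each = 3),
  rating = c(4.5, 4.0, 3.5, 3.0, 1.5, 4.5, 3.5, 3.0, 4.0, 3.0, 2.5, 3.5)
)

hist_plot <- ggplot(data = reviews, aes(x = rating, fill = "steelblue")) + # fill set in the aesthetic layer
  geom_histogram(binwidth = 0.5) +           # histogram with half-point bins
  scale_fill_identity() +                    # use the fill value as a literal color
  labs(title = "Distribution of ratings")    # label layer adds the title

print(hist_plot)
```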
Additional steps:
1. fill - we used this in the aesthetic layer to specify the color we wanted.
2. geom_histogram() - here we specify that we want a histogram.
3. labs - to add a title, we used a new layer for labels.
Here we can see that we changed or added three layers. ggplot makes it very easy to customize graphs to our personal preferences.

Boxplots

Boxplots are another excellent tool for visualizing descriptive statistics. If you want to learn more about boxplots check out this article from fellow Towards Data Science writer - Michael Galarnyk
Below is a boxplot that shows the spread of ratings for each rating site.
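A minimal sketch of a boxplot along these lines is given here, using a made-up stand-in for reviews (the `rating` column and its values are assumptions, so the shapes of the boxes will not match the author's figure). It combines a color mapping on `rating_site` with two `theme()` settings: a white panel background and a hidden legend.

```r
library(ggplot2)

# Made-up stand-in for the reviews data (values are illustrative only).
reviews <- data.frame(
  rating_site = rep(c("Fandango", "Rotten Tomatoes", "IMDB", "Metacritic"), each = 3),
  rating = c(4.5, 4.0, 3.5, 3.0, 1.5, 4.5, 3.5, 3.0, 4.0, 3.0, 2.5, 3.5)
)

box_plot <- ggplot(data = reviews, aes(x = rating_site, y = rating, color = rating_site)) +
  geom_boxplot() +                                      # one box per rating site
  theme(
    panel.background = element_rect(fill = "white"),    # replace the grey panel with white
    legend.position = "none"                            # hide the redundant legend
  )

print(box_plot)
```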
Looking at this boxplot, we have changed or added some layers.
1. color - the color allows us to customize the border line of the element. Here we choose to pass in the variable rating_site, which gives each box a different color.
2. geom_boxplot() - states the style of graph.
3. panel.background - this allows us to remove the grey background and fill it with white. My personal preference is always to have a white background, but depending on what you are trying to convey, sometimes different color backgrounds can be more useful.
4. legend.position - here I remove the legend. Why? If I left the legend visible, it would merely state which rating_site the color of each boxplot refers to. This is repetitive, since the x-axis labels already show us the rating_site.
Overall, we can see that the box representing Fandango ratings is higher up on the y-axis than those for the other sites. In comparison, the Rotten Tomatoes box is longer, meaning its ratings are more spread out.

Overview

ggplot is one of the most powerful tools for visualization in R. Once you dive deeper into this subject, you will see how much you can customize when creating colorful, detailed, and vibrant graphs.
There are a lot more graphs available in the ggplot library as well as other popular libraries available in R. It is worth exploring all the different options and finding which library suits your style of coding and analysis.
Stay tuned - I will be sharing more tutorials about creating other graphs in ggplot.

Jason