Introduction to ggplot2 in R
A quick guide to getting up and running with data visualization techniques in the ggplot2 package.
19 July, 2022
What is data visualization?
Data visualization is the practice of presenting data in graphs, icons, presentations and more. It is most commonly used to translate complex data into digestible insights for a non-technical audience.
This is a great book to use as a starting point if you are new to data visualization - Storytelling with Data
If you're interested in learning more about data visualization with Python, check my other tutorial - Matplotlib in Python
What is R?
R is a programming language mainly used for statistical analysis. It is a common tool for analysis in the finance and healthcare industries.
R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
Great resources for getting started with R:
1. Codecademy
2. guru99
What is ggplot2?
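ggplot2 is an R package for data visualization. It is part of the tidyverse and is built around the grammar of graphics. Check out this book if you're interested in learning more - Data Visualization in R With ggplot2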
The syntax for plotting in ggplot follows a simple layering approach for building graphs.
1. data
2. aesthetics - variables
3. geometric style - this is where you define the style of graph
4. additional layers for customization - title, labels, axis, etc.
The structure looks similar to this.
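As a minimal sketch of that layering, here is one way the four parts can fit together. The original example is not reproduced here, so this uses the mpg dataset that ships with ggplot2 as a stand-in, and the title and axis labels are my own choices.

```r
library(ggplot2)

# General pattern:
#   ggplot(data = <data>, aes(<mappings>)) + <geom function>() + <extra layers>

# mpg is a sample dataset bundled with ggplot2 (used here only as a stand-in).
p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +  # 1. data, 2. aesthetics
  geom_point() +                                     # 3. geometric style
  labs(                                              # 4. additional layers
    title = "Engine size vs. highway mileage",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )

# Printing the object draws the plot.
p
```

Each `+` adds one more layer on top of the previous ones, which is why swapping `geom_point()` for a different geom changes the graph type without touching the data or aesthetics.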
For this tutorial, I will assume you have a basic working proficiency with R concepts.
Let's get started!
Getting our environment ready
To begin, we will need to install the tidyverse and ggplot2 packages. (ggplot2 is included in the tidyverse, so installing both is harmless but not strictly required.)
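A sketch of the install step, assuming a CRAN mirror is already configured for your R session. The check for missing packages is my own addition so that re-running the script does not reinstall anything; a plain `install.packages()` call works just as well.

```r
# Packages used in this tutorial.
pkgs <- c("tidyverse", "ggplot2")

# Keep only the packages that are not installed yet.
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install whatever is missing (does nothing if everything is present).
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```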
Next, we will need to load the ggplot2 library.
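Loading the package is a single call. The version check on the second line is optional and is only there as a quick way to confirm the package loaded.

```r
# Attach ggplot2 so its functions (ggplot, aes, geom_*, ...) are available.
library(ggplot2)

# Optional: show which version is installed.
packageVersion("ggplot2")
```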
Bar Graphs
For data, we will be working with a dataset called reviews. The file has already been read into our environment. The reviews dataset is a collection of movie reviews from four main review sites: Fandango, Rotten Tomatoes, IMDB, and Metacritic.
The inputs we are interested in are:
1. data = reviews
2. aesthetics = (x-axis = Rating Site, y-axis = average rating)
3. geometric style = bar chart
To create the bar chart that shows the average ratings per website, we can do the following.
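The reviews file itself is not included here, so the sketch below builds a small stand-in data frame with random placeholder values (not real ratings). The column name rating is an assumption of mine; rating_site is the name used later in this tutorial. The averaging step with `aggregate()` is also my own choice for getting one bar per site.

```r
library(ggplot2)

# Stand-in for the reviews dataset: random placeholder values, NOT real ratings.
# The column name `rating` is hypothetical.
set.seed(42)
sites <- c("Fandango", "Rotten Tomatoes", "IMDB", "Metacritic")
reviews <- data.frame(
  rating_site = rep(sites, each = 25),
  rating = round(runif(100, min = 1, max = 5), 1)
)

# Collapse to one row per site holding the mean rating.
site_means <- aggregate(rating ~ rating_site, data = reviews, FUN = mean)

# data = site_means, aesthetics = site on x and average on y, geom = bars.
bar_plot <- ggplot(data = site_means, aes(x = rating_site, y = rating)) +
  geom_col() +
  labs(x = "Rating site", y = "Average rating")

bar_plot
```

`geom_col()` draws bars whose heights are taken directly from the y values, which is what we want once the averages have been computed.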
Histograms
Histograms show us how frequently a value occurs. Below is a histogram showing the frequency distribution of the ratings in our reviews dataset. Notice there are some additional layers added.
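A sketch of such a histogram, again using a stand-in data frame with random placeholder values in place of the real reviews data. The column name rating, the number of bins, and the title text are assumptions. Here fill is mapped to rating_site inside `aes()`, so each site gets its own color; to use one fixed color instead, pass it to `geom_histogram(fill = ...)` outside `aes()`.

```r
library(ggplot2)

# Stand-in for the reviews dataset: random placeholder values, NOT real ratings.
# The column name `rating` is hypothetical.
set.seed(42)
sites <- c("Fandango", "Rotten Tomatoes", "IMDB", "Metacritic")
reviews <- data.frame(
  rating_site = rep(sites, each = 25),
  rating = round(runif(100, min = 1, max = 5), 1)
)

hist_plot <- ggplot(data = reviews, aes(x = rating, fill = rating_site)) +
  geom_histogram(bins = 20) +               # histogram with 20 bins
  labs(
    title = "Distribution of ratings",      # title added through a labels layer
    x = "Rating",
    y = "Count"
  )

hist_plot
```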
Additional steps:
1. fill - we used this in the aesthetic layer to control the fill color of the bars.
2. geom_histogram() - here we specify that we want a histogram.
3. labs - to add a title, we used a new layer for labels.
Here we changed or added three layers. ggplot makes it very easy to customize graphs to our personal preferences.
Boxplots
Boxplots are another excellent tool for visualizing descriptive statistics. If you want to learn more about boxplots check out this article from fellow Towards Data Science writer - Michael Galarnyk
Below is a boxplot that shows the spread of ratings for each rating site.
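A sketch of the boxplot, using the same kind of stand-in data frame with random placeholder values, so the boxes it draws will not match the real reviews data. The column name rating is an assumption; rating_site, panel.background, and legend.position are the names used in this section.

```r
library(ggplot2)

# Stand-in for the reviews dataset: random placeholder values, NOT real ratings.
# The column name `rating` is hypothetical.
set.seed(42)
sites <- c("Fandango", "Rotten Tomatoes", "IMDB", "Metacritic")
reviews <- data.frame(
  rating_site = rep(sites, each = 25),
  rating = round(runif(100, min = 1, max = 5), 1)
)

box_plot <- ggplot(
  data = reviews,
  aes(x = rating_site, y = rating, color = rating_site)  # outline color per site
) +
  geom_boxplot() +
  theme(
    panel.background = element_rect(fill = "white"),  # replace the grey panel
    legend.position = "none"                          # hide the redundant legend
  )

box_plot
```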
Looking at this boxplot, we have changed or added some new layers.
1. color - the color allows us to customize the line border of the element. Here we choose to pass in the variable rating_site, which gives each box its own color.
2. geom_boxplot() - state the style of graph.
3. panel.background - this allows us to remove the grey background and fill it with white. My personal preference is always to have a white background, but depending on what you are trying to convey, sometimes different color backgrounds can be more useful.
4. legend.position - here I remove the legend. Why? If I left the legend visible, it would merely state which rating_site the color of each boxplot refers to. This is repetitive, since the x-axis labels already show us the rating_site.
Overall, we can see that the box representing Fandango ratings is higher up on the y-axis than those for the other sites. In comparison, the Rotten Tomatoes box is longer, meaning its ratings are more spread out.
Overview
ggplot is one of the most powerful tools for visualization in R. Once you dive deeper into this subject, you can see how much customizability you can have creating colorful, detailed, and vibrant graphs.
There are a lot more graphs available in the ggplot library as well as other popular libraries available in R. It is worth exploring all the different options and finding which library suits your style of coding and analysis.
Stay tuned - I will be sharing more tutorials about creating other graphs in ggplot.