Author
Rony Rodriguez-Ramirez
Published
February 3, 2024
Visualizing data is crucial for understanding underlying patterns and communicating results effectively. This tutorial will guide you through creating various types of visualizations using ggplot2 in R, focusing on a (fake) dataset of student scores.
Loading the Dataset
First, we’ll load the tidyverse package, and then read the dataset from a CSV file.
# Load the package
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the dataset
scores_data <- read_csv("../files/data/fake_scores.csv")

Rows: 96 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): course, student, concentration
dbl (2): studentid, score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
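The column specification message printed by read_csv() can be avoided by declaring the column types up front. The following is a minimal sketch, assuming a tiny made-up CSV written to a temporary file so that it runs on its own; the file contents and the names `tmp_csv` and `toy_data` are hypothetical and only mirror the column names reported above.

```r
library(readr)

# Write a tiny hypothetical CSV to a temporary file (illustration only)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(
  c(
    "course,student,studentid,score,concentration",
    "EDU 001,Student A,1,10,EPPE",
    "EDU 001,Student B,2,87,CIS"
  ),
  tmp_csv
)

# Declaring the column types explicitly documents what we expect to read
# and also silences the column specification message
toy_data <- read_csv(
  tmp_csv,
  col_types = cols(
    course = col_character(),
    student = col_character(),
    studentid = col_double(),
    score = col_double(),
    concentration = col_character()
  )
)
```

If you only want to quiet the message without declaring types, passing `show_col_types = FALSE` to read_csv() should be enough, as the message itself suggests.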
This dataset contains (fake) student information. You can use the glimpse function to look at the variables in the dataset. Table 1 shows a representation of the data we loaded as a table.
course | student | studentid | score | concentration |
---|---|---|---|---|
EDU 001 | Jostin | 1 | 10 | EPPE |
EDU 001 | Rony | 2 | 6 | EPPE |
EDU 001 | Jacob | 3 | 87 | CIS |
EDU 001 | Hwa | 4 | 75 | CIS |
EDU 001 | Emma | 5 | 19 | CIS |
EDU 001 | Ben | 6 | 9 | CIS |
glimpse(scores_data)
Rows: 96
Columns: 5
$ course        <chr> "EDU 001", "EDU 001", "EDU 001", "EDU 001", "EDU 001", "…
$ student       <chr> "Jostin", "Rony", "Jacob", "Hwa", "Emma", "Ben", "Maddie…
$ studentid     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ score         <dbl> 10, 6, 87, 75, 19, 9, 40, 72, 97, 45, 33, 24, 24, 74, 69…
$ concentration <chr> "EPPE", "EPPE", "CIS", "CIS", "CIS", "CIS", "HDLT", "CIS…
Understanding ggplot2
ggplot2 is a part of the tidyverse that allows for creating complex and beautiful visualizations using a consistent and intuitive syntax. The name ggplot2 is derived from the grammar of graphics, a system for describing and building a wide range of graphics. Following that grammar, you build a plot by defining the data, the aesthetics, and the geometries.
Basics of ggplot2
A ggplot2 graph is built up from a few basic elements:
- Data: The dataset you want to visualize.
- Aesthetics (aes): Defines how variables in the data are mapped to visual properties (aesthetics) of the graph such as x and y axes, color, size, etc.
- Geometries (geom_ functions): The geometric objects (shapes) that represent the data points. For example, points (geom_point() for scatter plots), lines (geom_line()), and bars (geom_bar() for bar charts).
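As a minimal sketch of how these three elements fit together, the snippet below builds a scatter plot from a tiny made-up data frame. The `toy_scores` table and its columns `hours` and `score` are assumptions for illustration only and are not part of the tutorial's dataset.

```r
library(ggplot2)

# Hypothetical data, invented for this illustration only
toy_scores <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 60, 71, 78, 90)
)

p <- ggplot(
  data = toy_scores,                    # Data: the table to visualize
  mapping = aes(x = hours, y = score)   # Aesthetics: columns mapped to the axes
) +
  geom_point()                          # Geometry: one point per row
```

Printing `p` (or calling `print(p)`) draws the plot. Swapping geom_point() for geom_line() would keep the same data and aesthetics but change the shape used to represent them.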
Histogram
Histograms are great for visualizing the distribution of a single numeric variable, such as scores. Let’s visualize the distribution of all scores in the dataset.
scores_data %>%
  ggplot(
    aes(x = score)
  ) +
  geom_histogram(
    fill = "grey",
    color = "black"
  ) +
  labs(
    title = "Distribution of All Scores",
    x = "All Scores",
    y = "Count"
  ) +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
- scores_data: The dataset being used, assumed to contain a column named score which holds the numeric values that we want to visualize.
- %>%: The pipe operator, used here to pass scores_data as the first argument to the following ggplot() function.
- ggplot(aes(x = score)): Initializes a ggplot object specifying the aesthetic mappings. Here, aes(x = score) indicates that the score column from scores_data should be used as the x-axis values in the histogram.
- geom_histogram(): This adds a histogram layer to the plot.
- fill = "grey": Sets the fill color of the bars in the histogram to grey.
- color = "black": Sets the color of the border of the bars to black.
- labs(): Used to modify the labels on the plot, including the title of the plot and the x and y axes. Here, it sets the title of the plot to “Distribution of All Scores”, labels the x-axis as “All Scores”, and the y-axis as “Count”, which represents the number of observations within each bin of scores.
- theme_minimal(): Applies a minimalistic theme to the plot, which reduces the background clutter and focuses attention on the data itself.
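The message `stat_bin()` using `bins = 30` appears because no bin size was given. One way to address it is to set `binwidth` explicitly. The sketch below assumes scores on a 0 to 100 scale and uses a small invented data frame (`toy_scores`); the choice of 10-point bins is an illustrative assumption, not a recommendation from the tutorial.

```r
library(ggplot2)

# Hypothetical scores on a 0-100 scale, invented for this illustration
toy_scores <- data.frame(
  score = c(12, 18, 25, 33, 41, 47, 55, 58, 63, 72, 88, 95)
)

p_hist <- ggplot(toy_scores, aes(x = score)) +
  geom_histogram(
    binwidth = 10,   # each bar covers 10 score points
    boundary = 0,    # bin edges fall on 0, 10, 20, ...
    fill = "grey",
    color = "black"
  ) +
  theme_minimal()
```

With an explicit `binwidth`, the message is no longer shown, and the bars line up with round score values, which tends to make the plot easier to read.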
Warning
Important: ggplot2 combines plot components with the + operator rather than the pipe operator (%>%). This choice is rooted in the design and philosophy of the ggplot2 package itself, which is based on the Grammar of Graphics.
- Layered Approach: ggplot2 is built on the concept of layering components of a plot on top of each other. The + operator is used to add or layer these components, such as axes, plot types (geoms), scales, and themes, to build up a plot step by step. This approach is akin to constructing a sentence in a language, where each layer adds more context or detail, aligning with the Grammar of Graphics philosophy.
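To make the layering idea concrete, here is a small sketch that builds a plot one component at a time and stores the intermediate result in an object. The data frame `toy_scores` and the names `p_layers`, `n_before`, and `n_after` are hypothetical and used only for illustration.

```r
library(ggplot2)

# Hypothetical data, invented for this illustration only
toy_scores <- data.frame(score = c(35, 48, 52, 61, 67, 73, 80, 94))

# Start with the data and the aesthetic mapping: no layers yet
p_layers <- ggplot(toy_scores, aes(x = score))
n_before <- length(p_layers$layers)

# Each `+` adds one component to the existing plot object
p_layers <- p_layers + geom_histogram(binwidth = 10)
p_layers <- p_layers + labs(title = "Toy scores", x = "Score", y = "Count")
p_layers <- p_layers + theme_minimal()

# Only the geom counts as a layer; labs() and theme_minimal()
# change labels and styling rather than adding a layer
n_after <- length(p_layers$layers)
```

The pipe, by contrast, passes a value into the first argument of the next function, which is why it is used to hand the data to ggplot() but not to add components to a plot.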
Scatter Plot
Next, let’s build a scatter plot comparing scores across two courses, EDU 001 and EDU 302. However, note that the dataset is currently in long format, with one row per student and course, so we first need to change it to wide format. Look at the following code chunk:
scores_data_wide <- scores_data %>%
  filter(
    course %in% c("EDU 001", "EDU 302")
  ) %>%
  pivot_wider(
    names_from = course,
    values_from = score
  )

scores_data_wide

# A tibble: 24 × 5
   student studentid concentration `EDU 001` `EDU 302`
   <chr>       <dbl> <chr>             <dbl>     <dbl>
 1 Jostin          1 EPPE                 10         0
 2 Rony            2 EPPE                  6        75
 3 Jacob           3 CIS                  87         6
 4 Hwa             4 CIS                  75        75
 5 Emma            5 CIS                  19        48
 6 Ben             6 CIS                   9         1
 7 Maddie          7 HDLT                 40        89
 8 Alex            8 CIS                  72        52
 9 Krista          9 CIS                  97        61
10 Max            10 HDLT                 45        15
# ℹ 14 more rows
This gives a dataset with 24 observations and the two courses (EDU 001 and EDU 302) as columns. The pivot_wider function creates a new column for each course, with scores filled in accordingly.
- filter(course %in% c("EDU 001", "EDU 302")): Narrows down the dataset to only include scores from the specified courses.
- pivot_wider(names_from = course, values_from = score): Transforms the dataset so each course becomes its own column, populated with corresponding scores.
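The long-to-wide reshaping is easier to follow on a very small table. The sketch below uses an invented two-student data frame (`toy_long`, with hypothetical students "A" and "B") so the result can be checked by eye. It loads dplyr and tidyr directly rather than the whole tidyverse, which is only a choice to keep the example light.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per student and course
toy_long <- tibble(
  student = c("A", "A", "B", "B"),
  course  = c("EDU 001", "EDU 302", "EDU 001", "EDU 302"),
  score   = c(10, 0, 6, 75)
)

# One row per student, one column per course
toy_wide <- toy_long %>%
  pivot_wider(
    names_from = course,   # new column names come from the course values
    values_from = score    # cells are filled with the matching scores
  )
```

Here `toy_wide` has two rows (one per student) and the columns student, `EDU 001`, and `EDU 302`. Because the new column names contain a space, they need backticks when referenced in later code.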
Now, let’s plot the scores in the two courses.
scores_data_wide %>%
  ggplot(
    aes(x = `EDU 001`, y = `EDU 302`)
  ) +
  geom_point() +
  labs(
    title = "EDU 001 vs. EDU 302",
    x = "EDU 001 Scores",
    y = "EDU 302 Scores"
  ) +
  theme_minimal()
Grouped Visualizations
Visualizing data based on groups or categories is often insightful.
Boxplot by Subject
Using our dataset in long format, where each row represents a score in a specific course, we can draw one boxplot per course:
scores_data %>%
  ggplot(
    aes(
      x = course,
      y = score,
      fill = course
    )
  ) +
  geom_boxplot() +
  labs(
    title = "Scores by course",
    x = "course",
    y = "Scores"
  ) +
  theme_minimal()
Bar Plot for Average Scores
A bar plot is a simple way to visualize the average score per course. We first compute the averages with group_by() and summarise(), and then plot them.
scores_data %>%
  group_by(course) %>%
  summarise(
    avg_score = mean(score)
  ) %>%
  ggplot(
    aes(
      x = course,
      y = avg_score,
      fill = course
    )
  ) +
  geom_col(color = "black") +
  labs(
    title = "Average Scores by Course",
    x = "Course",
    y = "Average Score"
  ) +
  theme_minimal()
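It can help to look at the summarised table on its own before plotting it. The sketch below runs the same group_by() and summarise() steps on a small invented data frame (`toy_scores`, hypothetical values) and stores the result in `toy_avg`.

```r
library(dplyr)

# Hypothetical long-format data, invented for this illustration only
toy_scores <- tibble(
  course = c("EDU 001", "EDU 001", "EDU 302", "EDU 302"),
  score  = c(10, 6, 0, 75)
)

# One row per course, holding the mean of that course's scores
toy_avg <- toy_scores %>%
  group_by(course) %>%
  summarise(avg_score = mean(score))
```

The result has one row per course with its mean score (8 for EDU 001 and 37.5 for EDU 302 in this toy data). This is why the bar plot uses geom_col(), which draws bar heights from values already present in the data, whereas geom_bar() counts rows by default.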
Conclusion
This tutorial introduced basic to intermediate data visualization techniques using ggplot2 in R. By leveraging ggplot2’s comprehensive features, you can create informative and appealing visual representations of your data to aid in analysis and communication.