R for Education - Data Visualization in R using ggplot2 (2024)

Author

Rony Rodriguez-Ramirez

Published

February 3, 2024

Visualizing data is crucial for understanding underlying patterns and communicating results effectively. This tutorial will guide you through creating various types of visualizations using ggplot2 in R, focusing on a dataset of student scores (the data are fake).

Loading the Dataset

First, we’ll load the tidyverse package, and then read the dataset from a CSV file.

# Load the package
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the dataset
scores_data <- read_csv("../files/data/fake_scores.csv")

Rows: 96 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): course, student, concentration
dbl (2): studentid, score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This dataset contains (fake) student information. You can use the glimpse() function to look at the variables in the dataset. Table 1 shows the first few rows of the data we loaded.

Table 1: Data from the fake_scores.csv file as a table.
course	student	studentid	score	concentration
EDU 001	Jostin	1	10	EPPE
EDU 001	Rony	2	6	EPPE
EDU 001	Jacob	3	87	CIS
EDU 001	Hwa	4	75	CIS
EDU 001	Emma	5	19	CIS
EDU 001	Ben	6	9	CIS
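
The glimpse() call mentioned above might look like the following sketch. It rebuilds the rows of Table 1 by hand so that it runs without the CSV file; the object name scores_sample is an assumption made for illustration, and in the tutorial itself you would call glimpse(scores_data) instead.

```r
# Minimal sketch: inspect the variables of a tibble with glimpse().
# scores_sample is a hypothetical object rebuilt from the rows in Table 1;
# the tutorial itself works with the full scores_data read from the CSV.
library(dplyr)  # part of the tidyverse; provides tibble() and glimpse()

scores_sample <- tibble(
  course        = rep("EDU 001", 6),
  student       = c("Jostin", "Rony", "Jacob", "Hwa", "Emma", "Ben"),
  studentid     = c(1, 2, 3, 4, 5, 6),
  score         = c(10, 6, 87, 75, 19, 9),
  concentration = c("EPPE", "EPPE", "CIS", "CIS", "CIS", "CIS")
)

# Prints one line per column: its name, its type, and the first values
glimpse(scores_sample)
```

Compared with printing the tibble, glimpse() turns the data on its side, which makes it easier to see every column when a dataset is wide.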

Understanding `ggplot2`

ggplot2 is a part of the tidyverse that allows for creating complex and beautiful visualizations using a consistent and intuitive syntax. Its name comes from the grammar of graphics, a system for describing and building a wide range of graphics. In practice, this means you define the data, the aesthetics, and the geometries, and ggplot2 assembles the plot from them.

Basics of `ggplot2`

A ggplot2 graph is built up from a few basic elements:

Data: The dataset you want to visualize.
Aesthetics (aes): Defines how variables in the data are mapped to visual properties (aesthetics) of the graph such as x and y axes, color, size, etc.
Geometries (geom_ functions): The geometric objects (shapes) that represent the data points. For example, points (geom_point() for scatter plots), lines (geom_line()), and bars (geom_bar() for bar charts).
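
To make the three elements concrete, here is a minimal sketch that labels each one in a single plot. It uses a small hand-built data frame (scores_sample, an assumed name) with the students and scores from Table 1 rather than the full dataset.

```r
# Minimal sketch of the three basic elements of a ggplot2 graph.
# scores_sample is a hypothetical data frame copied from Table 1.
library(ggplot2)

scores_sample <- data.frame(
  student = c("Jostin", "Rony", "Jacob", "Hwa", "Emma", "Ben"),
  score   = c(10, 6, 87, 75, 19, 9)
)

p_basic <- ggplot(
  data    = scores_sample,               # Data: the dataset to visualize
  mapping = aes(x = student, y = score)  # Aesthetics: columns mapped to x and y
) +
  geom_point()                           # Geometry: draw each row as a point

p_basic
```

Swapping geom_point() for another geom_ function changes the shape used to draw the data, while the data and aesthetic mappings stay the same.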

Histogram

Histograms are great for visualizing the distribution of a single numeric variable, such as scores. Let’s visualize the distribution of all scores in the dataset.

scores_data %>%
  ggplot(
    aes(x = score)
  ) +
  geom_histogram(
    fill = "grey",
    color = "black"
  ) +
  labs(
    title = "Distribution of All Scores",
    x = "All Scores",
    y = "Count"
  ) +
  theme_minimal()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

scores_data: The dataset being used, assumed to contain a column named score which holds the numeric values that we want to visualize.
%>%: The pipe operator, used here to pass scores_data as the first argument to the following ggplot() function.
ggplot(aes(x = score)): Initializes a ggplot object specifying the aesthetic mappings. Here, aes(x = score) indicates that the score column from scores_data should be used as the x-axis values in the histogram.
geom_histogram(): This adds a histogram layer to the plot.
fill = "grey": Sets the fill color of the bars in the histogram to grey.
color = "black": Sets the color of the border of the bars to black.
labs(): Used to modify the labels on the plot, including the title of the plot and the x and y axes. Here, it sets the title of the plot to “Distribution of All Scores”, labels the x-axis as “All Scores”, and the y-axis as “Count”, which represents the number of observations within each bin of scores.
theme_minimal(): Applies a minimalistic theme to the plot, which reduces the background clutter and focuses attention on the data itself.
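
The message printed after the histogram code notes that stat_bin() fell back to 30 bins and suggests choosing a binwidth. A minimal sketch of doing so follows. The width of 10 points and the boundary at 0 are arbitrary choices made for illustration, and scores_sample is a hypothetical data frame holding the scores from Table 1.

```r
# Minimal sketch: set the bin width explicitly instead of using 30 bins.
# binwidth = 10 and boundary = 0 are illustrative assumptions, not values
# taken from the tutorial; scores_sample is copied from Table 1.
library(ggplot2)

scores_sample <- data.frame(score = c(10, 6, 87, 75, 19, 9))

p_hist <- ggplot(scores_sample, aes(x = score)) +
  geom_histogram(
    binwidth = 10,   # each bar covers 10 score points
    boundary = 0,    # bin edges fall on multiples of 10
    fill = "grey",
    color = "black"
  ) +
  labs(
    title = "Distribution of Scores (bin width 10)",
    x = "Score",
    y = "Count"
  ) +
  theme_minimal()

p_hist
```

Setting binwidth also silences the stat_bin() message, because ggplot2 no longer has to pick a default.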

Warning

Important: ggplot2 combines plot components with the + operator, not the pipe operator (%>%). The pipe passes the data into ggplot(); everything after that call is added with +. This choice is rooted in the design and philosophy of the ggplot2 package itself, which is based on the Grammar of Graphics.

Layered Approach: ggplot2 is built on the concept of layering components of a plot on top of each other. The + operator is used to add or layer these components, such as axes, plot types (geoms), scales, and themes, to build up a plot step by step. This approach is akin to constructing a sentence in a language, where each layer adds more context or detail, aligning with the Grammar of Graphics philosophy.
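
The layering idea can be sketched by building a plot one step at a time and storing each intermediate result. The data frame and its column names (edu_001, edu_302) are assumptions made for illustration; the values are the first four rows of scores in the two courses used later in this tutorial.

```r
# Minimal sketch: build a plot layer by layer with `+`.
# wide_sample and its column names are hypothetical; the values come from
# the first rows of the wide-format data shown later in the tutorial.
library(ggplot2)

wide_sample <- data.frame(
  edu_001 = c(10, 6, 87, 75),
  edu_302 = c(0, 75, 6, 75)
)

# Step 1: data and aesthetic mappings only (an empty canvas with axes)
base_plot <- ggplot(wide_sample, aes(x = edu_001, y = edu_302))

# Step 2: add a geometry layer on top of the canvas
with_points <- base_plot + geom_point()

# Step 3: add labels and a theme to the existing plot
finished <- with_points +
  labs(title = "Built in three steps") +
  theme_minimal()

finished
```

Each + returns a new plot object, so any intermediate step (for example with_points) can be printed or reused as the starting point for a different variation.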

Scatter Plot

Next, let’s create a scatter plot comparing scores across two courses, EDU 001 and EDU 302. A scatter plot needs one column for each axis, but our dataset is currently in long format, with one row per student per course. So, we need to change it to wide format. Look at the following code chunk:

scores_data_wide <- scores_data %>%
  filter(
    course %in% c("EDU 001", "EDU 302")
  ) %>%
  pivot_wider(
    names_from = course,
    values_from = score
  )

scores_data_wide

# A tibble: 24 × 5
   student studentid concentration `EDU 001` `EDU 302`
   <chr>       <dbl> <chr>             <dbl>     <dbl>
 1 Jostin          1 EPPE                 10         0
 2 Rony            2 EPPE                  6        75
 3 Jacob           3 CIS                  87         6
 4 Hwa             4 CIS                  75        75
 5 Emma            5 CIS                  19        48
 6 Ben             6 CIS                   9         1
 7 Maddie          7 HDLT                 40        89
 8 Alex            8 CIS                  72        52
 9 Krista          9 CIS                  97        61
10 Max            10 HDLT                 45        15
# ℹ 14 more rows

This gives a dataset with 24 observations, one per student, and the two courses (EDU 001 and EDU 302) as columns. The pivot_wider() function creates a new column for each course, with scores filled in accordingly.

filter(course %in% c("EDU 001", "EDU 302")): Narrows down the dataset to only include scores from the specified courses.
pivot_wider(names_from = course, values_from = score): Transforms the dataset so each course becomes its own column, populated with corresponding scores.
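
To see the reshaping on a scale small enough to check by eye, here is a minimal sketch with two students. The scores for EDU 001 and EDU 302 are taken from the output above; the course "EDU 400" and its scores are made up for illustration so that filter() has something to drop.

```r
# Minimal sketch: long-to-wide reshaping with filter() and pivot_wider().
# "EDU 400" and its scores are hypothetical; the other values match the
# first two students in the tibble printed above.
library(dplyr)
library(tidyr)

scores_long <- tibble(
  course  = c("EDU 001", "EDU 001", "EDU 302", "EDU 302", "EDU 400", "EDU 400"),
  student = c("Jostin", "Rony", "Jostin", "Rony", "Jostin", "Rony"),
  score   = c(10, 6, 0, 75, 50, 60)
)

scores_wide <- scores_long %>%
  filter(course %in% c("EDU 001", "EDU 302")) %>%  # keep the two courses
  pivot_wider(
    names_from  = course,  # each remaining course becomes a column
    values_from = score    # filled with that course's scores
  )

# One row per student, with columns student, `EDU 001`, and `EDU 302`
scores_wide
```

Because the new column names contain a space, they have to be wrapped in backticks whenever they are referenced in code, as in the plotting call that follows.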

Now, let’s plot the scores in the two courses.

scores_data_wide %>%
  ggplot(
    aes(x = `EDU 001`, y = `EDU 302`)
  ) +
  geom_point() +
  labs(
    title = "EDU 001 vs. EDU 302",
    x = "EDU 001 Scores",
    y = "EDU 302 Scores"
  ) +
  theme_minimal()

Grouped Visualizations

Visualizing data by group or category, such as course, lets you compare distributions and summaries side by side.

Boxplot by Subject

Using our dataset in long format, where each row represents a score in a specific course, we can draw one boxplot per course:

scores_data %>%
  ggplot(
    aes(
      x = course,
      y = score,
      fill = course
    )
  ) +
  geom_boxplot() +
  labs(
    title = "Scores by course",
    x = "course",
    y = "Scores"
  ) +
  theme_minimal()

Bar Plot for Average Scores

A bar plot can show the average score per course. We first compute the mean score for each course with group_by() and summarise(), and then plot the result with geom_col():

scores_data %>%
  group_by(course) %>%
  summarise(
    avg_score = mean(score)
  ) %>%
  ggplot(
    aes(
      x = course,
      y = avg_score,
      fill = course
    )
  ) +
  geom_col(color = "black") +
  labs(
    title = "Average Scores by Course",
    x = "Course",
    y = "Average Score"
  ) +
  theme_minimal()
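
It can help to look at the summarised table on its own before plotting it. The sketch below does this for two students in two courses (values taken from the wide-format output earlier; the object names are assumptions). It also shows why geom_col() is used here: geom_col() draws bars whose heights are the values in the data, whereas geom_bar() by default counts the rows in each group.

```r
# Minimal sketch: compute per-course averages, then plot them with geom_col().
# scores_long and avg_scores are hypothetical names; the scores are those of
# the first two students shown earlier in the tutorial.
library(dplyr)
library(ggplot2)

scores_long <- tibble(
  course = c("EDU 001", "EDU 001", "EDU 302", "EDU 302"),
  score  = c(10, 6, 0, 75)
)

# One row per course with its mean score
avg_scores <- scores_long %>%
  group_by(course) %>%
  summarise(avg_score = mean(score))

avg_scores

# geom_col() uses avg_score directly as the bar height
p_avg <- ggplot(avg_scores, aes(x = course, y = avg_score, fill = course)) +
  geom_col(color = "black") +
  theme_minimal()

p_avg
```

Storing the summary in its own object also makes it easy to check the numbers behind each bar.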

Conclusion

This tutorial introduced basic to intermediate data visualization techniques using ggplot2 in R. By leveraging ggplot2’s comprehensive features, you can create informative and appealing visual representations of your data to aid in analysis and communication.
