R for Education - Data Visualization in R using ggplot2 (2024)

Author

Rony Rodriguez-Ramirez

Published

February 3, 2024

Visualizing data is crucial for understanding underlying patterns and communicating results effectively. This tutorial guides you through creating several types of visualizations with ggplot2 in R, using a fake dataset of student scores.

Loading the Dataset

First, we’ll load the tidyverse package, and then read the dataset from a CSV file.

# Load the package
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the dataset
scores_data <- read_csv("../files/data/fake_scores.csv")
Rows: 96 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): course, student, concentration
dbl (2): studentid, score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
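
The column specification message printed by read_csv names two ways to quiet it. The sketch below tries both on a small made-up CSV written to a temporary file. The toy rows and the temporary path are assumptions for illustration only and are not the tutorial's fake_scores.csv.

```r
library(readr)

# Hypothetical stand-in for fake_scores.csv: two made-up rows in a temp file
toy_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    course = c("EDU 001", "EDU 001"),
    student = c("A", "B"),
    studentid = c(1, 2),
    score = c(10, 6),
    concentration = c("EPPE", "CIS")
  ),
  toy_path
)

# Option 1: keep the automatic type guessing, but silence the message
toy_quiet <- read_csv(toy_path, show_col_types = FALSE)

# Option 2: state the column types explicitly (c = character, d = double)
toy_typed <- read_csv(toy_path, col_types = "ccddc")
```

Stating the types explicitly is a bit more typing, but it also protects against a column being guessed as the wrong type when the file changes.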

This dataset contains (fake) student information. You can use the glimpse function to look at the variables in the dataset. Table 1 shows the first few rows of the loaded data as a table.

Table 1: Data from the fake_scores.csv file as a table.

course   student  studentid  score  concentration
EDU 001  Jostin           1     10  EPPE
EDU 001  Rony             2      6  EPPE
EDU 001  Jacob            3     87  CIS
EDU 001  Hwa              4     75  CIS
EDU 001  Emma             5     19  CIS
EDU 001  Ben              6      9  CIS

glimpse(scores_data)
Rows: 96
Columns: 5
$ course        <chr> "EDU 001", "EDU 001", "EDU 001", "EDU 001", "EDU 001", "…
$ student       <chr> "Jostin", "Rony", "Jacob", "Hwa", "Emma", "Ben", "Maddie…
$ studentid     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ score         <dbl> 10, 6, 87, 75, 19, 9, 40, 72, 97, 45, 33, 24, 24, 74, 69…
$ concentration <chr> "EPPE", "EPPE", "CIS", "CIS", "CIS", "CIS", "HDLT", "CIS…

Understanding ggplot2

ggplot2 is part of the tidyverse and lets you create complex and beautiful visualizations with a consistent and intuitive syntax. Its name comes from the grammar of graphics, a system for describing and building a wide range of graphics: you define the data, the aesthetics, and the geometries, and ggplot2 combines them into a plot.

Basics of ggplot2

A ggplot2 graph is built up from a few basic elements:

  • Data: The dataset you want to visualize.
  • Aesthetics (aes): Defines how variables in the data are mapped to visual properties (aesthetics) of the graph such as x and y axes, color, size, etc.
  • Geometries (geom_ functions): The geometric objects (shapes) that represent the data points. For example, points (geom_point() for scatter plots), lines (geom_line()), and bars (geom_bar() for bar charts).
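
The three elements above can be sketched in a few lines. This is a minimal illustration on a made-up data frame (toy_scores and p_basic are hypothetical names, not part of the tutorial's dataset), with each element marked in a comment.

```r
library(ggplot2)

# Made-up scores, used only for illustration
toy_scores <- data.frame(
  studentid = 1:5,
  score = c(10, 6, 87, 75, 19)
)

p_basic <- ggplot(
  data = toy_scores,                       # 1. data
  mapping = aes(x = studentid, y = score)  # 2. aesthetics
) +
  geom_point()                             # 3. geometry
```

Typing p_basic at the console (or calling print(p_basic)) draws the plot. Swapping geom_point() for geom_line() changes the shape used to draw the same data and mappings.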

Histogram

Histograms are great for visualizing the distribution of a single numeric variable. Let’s visualize the distribution of all scores in the dataset.

scores_data %>%
  ggplot(
    aes(x = score)
  ) +
  geom_histogram(
    fill = "grey",
    color = "black"
  ) +
  labs(
    title = "Distribution of All Scores",
    x = "All Scores",
    y = "Count"
  ) +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • scores_data: The dataset being used, assumed to contain a column named score which holds the numeric values that we want to visualize.
  • %>%: The pipe operator, used here to pass scores_data as the first argument to the following ggplot() function.
  • ggplot(aes(x = score)): Initializes a ggplot object specifying the aesthetic mappings. Here, aes(x = score) indicates that the score column from scores_data should be used as the x-axis values in the histogram.
  • geom_histogram(): This adds a histogram layer to the plot.
  • fill = "grey": Sets the fill color of the bars in the histogram to grey.
  • color = "black": Sets the color of the border of the bars to black.
  • labs(): Used to modify the labels on the plot, including the title of the plot and the x and y axes. Here, it sets the title of the plot to “Distribution of All Scores”, labels the x-axis as “All Scores”, and the y-axis as “Count”, which represents the number of observations within each bin of scores.
  • theme_minimal(): Applies a minimalistic theme to the plot, which reduces the background clutter and focuses attention on the data itself.
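
The histogram code printed a message from stat_bin() recommending an explicit binwidth. One way to follow that advice is sketched below on ten scores taken from the glimpse output. The choice of 10 points per bin and the boundary = 0 alignment are assumptions made for illustration, not settings from the tutorial.

```r
library(ggplot2)

# First ten scores shown in the glimpse output above
toy_scores <- data.frame(score = c(10, 6, 87, 75, 19, 9, 40, 72, 97, 45))

p_hist <- ggplot(toy_scores, aes(x = score)) +
  geom_histogram(
    binwidth = 10,   # each bar spans 10 score points
    boundary = 0,    # align bin edges on multiples of 10
    fill = "grey",
    color = "black"
  ) +
  theme_minimal()

# Inspect the bins that the histogram stat computed
hist_bins <- layer_data(p_hist)
```

With binwidth set, the message about bins = 30 no longer appears. Trying a few widths is usually worthwhile, since a histogram's shape can change noticeably with the bin size.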

Warning

Important: ggplot2 combines the parts of a plot with the + operator, not the pipe operator (%>%). The pipe passes a value into the next function, whereas + adds a component to an existing plot. This choice is rooted in the design of ggplot2, which is based on the Grammar of Graphics.

  • Layered approach: ggplot2 is built on the concept of layering components of a plot on top of each other. The + operator adds these components, such as axes, plot types (geoms), scales, and themes, to build up a plot step by step. This is akin to constructing a sentence in a language, where each layer adds more context or detail.
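
The layered approach can be made visible by building a plot one + at a time and counting its layers along the way. This is only a sketch on made-up data, and the object names (toy_scores, p_layers, n_before, n_after) are hypothetical.

```r
library(ggplot2)

# Made-up scores, used only for illustration
toy_scores <- data.frame(
  studentid = 1:5,
  score = c(10, 6, 87, 75, 19)
)

# Start with data and aesthetic mappings only: no layers yet
p_layers <- ggplot(toy_scores, aes(x = studentid, y = score))
n_before <- length(p_layers$layers)

# Each `+` returns a new plot object with one more component
p_layers <- p_layers + geom_point()
p_layers <- p_layers + geom_line()
n_after <- length(p_layers$layers)

# Labels and themes are added the same way, but they are not layers
p_layers <- p_layers + labs(title = "Toy scores") + theme_minimal()
```

Here n_before is 0 and n_after is 2, one layer per geom. Because every step returns a complete plot object, intermediate versions can be stored and reused with different geoms or themes.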

Scatter Plot

Next, let’s create a scatter plot comparing scores across two courses, EDU 001 and EDU 302. Our dataset is currently in long format, so we first need to reshape it to wide format. Look at the following code chunk:

scores_data_wide <- scores_data %>%
  filter(
    course %in% c("EDU 001", "EDU 302")
  ) %>%
  pivot_wider(
    names_from = course,
    values_from = score
  )

scores_data_wide
# A tibble: 24 × 5
   student studentid concentration `EDU 001` `EDU 302`
   <chr>       <dbl> <chr>             <dbl>     <dbl>
 1 Jostin          1 EPPE                 10         0
 2 Rony            2 EPPE                  6        75
 3 Jacob           3 CIS                  87         6
 4 Hwa             4 CIS                  75        75
 5 Emma            5 CIS                  19        48
 6 Ben             6 CIS                   9         1
 7 Maddie          7 HDLT                 40        89
 8 Alex            8 CIS                  72        52
 9 Krista          9 CIS                  97        61
10 Max            10 HDLT                 45        15
# ℹ 14 more rows

This gives a dataset with 24 observations and the two courses (EDU 001 and EDU 302) as columns. The pivot_wider function creates a new column for each course, filled with the corresponding scores.

  • filter(course %in% c("EDU 001", "EDU 302")): Narrows down the dataset to only include scores from the specified courses.
  • pivot_wider(names_from = course, values_from = score): Transforms the dataset so each course becomes its own column, populated with corresponding scores.
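
To see what pivot_wider does to the shape of the data, it can help to run it on a tiny example. The sketch below uses three made-up students (toy_long and toy_wide are hypothetical names), with one row per student per course before reshaping.

```r
library(dplyr)
library(tidyr)

# Made-up long-format data: one row per student per course
toy_long <- tibble(
  student = c("A", "A", "B", "B", "C", "C"),
  course  = c("EDU 001", "EDU 302", "EDU 001", "EDU 302", "EDU 001", "EDU 302"),
  score   = c(10, 0, 6, 75, 87, 6)
)

toy_wide <- toy_long %>%
  pivot_wider(
    names_from = course,   # new column names come from `course`
    values_from = score    # cell values come from `score`
  )

dim(toy_long)  # 6 rows, 3 columns
dim(toy_wide)  # 3 rows, 3 columns: student, `EDU 001`, `EDU 302`
```

Because the new column names contain a space, they have to be wrapped in backticks when used in code, for example toy_wide$`EDU 001`.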

Now, let’s plot the scores in the two courses.

scores_data_wide %>%
  ggplot(
    aes(x = `EDU 001`, y = `EDU 302`)
  ) +
  geom_point() +
  labs(
    title = "EDU 001 vs. EDU 302",
    x = "EDU 001 Scores",
    y = "EDU 302 Scores"
  ) +
  theme_minimal()

Grouped Visualizations

Visualizing data based on groups or categories is often insightful.

Boxplot by Subject

Using our dataset in long format, where each row represents a score in a specific course, we can draw one boxplot per course:

scores_data %>%
  ggplot(
    aes(
      x = course,
      y = score,
      fill = course
    )
  ) +
  geom_boxplot() +
  labs(
    title = "Scores by course",
    x = "course",
    y = "Scores"
  ) +
  theme_minimal()

Bar Plot for Average Scores

A bar plot can show the average score per course. We first compute the averages with group_by and summarise, and then plot them.

scores_data %>%
  group_by(course) %>%
  summarise(
    avg_score = mean(score)
  ) %>%
  ggplot(
    aes(
      x = course,
      y = avg_score,
      fill = course
    )
  ) +
  geom_col(color = "black") +
  labs(
    title = "Average Scores by Course",
    x = "Course",
    y = "Average Score"
  ) +
  theme_minimal()
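
It can be useful to look at the summary table on its own before plotting it. The sketch below runs the same group_by and summarise steps on four made-up rows (toy_long and toy_avg are hypothetical names). The na.rm = TRUE argument is an addition that is not in the tutorial's code. It only matters if some scores are missing.

```r
library(dplyr)

# Made-up long-format scores, two per course
toy_long <- tibble(
  course = c("EDU 001", "EDU 001", "EDU 302", "EDU 302"),
  score  = c(10, 6, 0, 75)
)

toy_avg <- toy_long %>%
  group_by(course) %>%
  summarise(avg_score = mean(score, na.rm = TRUE))  # na.rm is an addition here

toy_avg  # one row per course: averages of 8 and 37.5
```

geom_col() draws bar heights directly from a column such as avg_score, which is why the averages are computed first. geom_bar(), by default, counts rows instead.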

Conclusion

This tutorial introduced basic to intermediate data visualization techniques using ggplot2 in R. By leveraging ggplot2’s comprehensive features, you can create informative and appealing visual representations of your data to aid in analysis and communication.
