R and its ggplot2 package are wonderful tools for visualising data. In this post, we will explore some of the basics of plotting with ggplot2 by creating bar charts using the famous Palmer Archipelago dataset.

Artwork by @allison_horst.
The dataset being used can be downloaded directly (in csv format) from Kaggle or imported directly into R with the palmerpenguins
package.
Importing the Data
Move the dataset to the location of the R script you will be plotting in, or use a relative path. Remember that for R to find your file you may need to set your current working directory, you can do this in RStudio by clicking Session <- Set Working Directory <- To Source File Location
in the banner, or by running the setwd("/your_path_here")
command.
You can import the penguins dataset using the read.csv()
function, built into R.
# Import the penguins dataset using the read.csv() function
penguins <- read.csv("penguins_size.csv")
# View the first few entries of the dataframe
print(penguins[1:5,])
Output:
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18.0 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
body_mass_g sex
1 3750 MALE
2 3800 FEMALE
3 3250 FEMALE
4 NA <NA>
5 3450 FEMALE
Cleaning Up
Viewing the dataset by printing the first few rows has revealed our first issue, this data has several NA values. A great way to visualise the amount of data missing in a given dataframe in R is using the vis_miss
function from the naniar
library. You may need to install this by running install.packages('naniar')
.
# install.packages('naniar')
library(naniar)
vis_miss(penguins)

Reassuringly the dataset has very few missing values. The easiest way to deal with these will be to exclude them using na.omit()
which simply removes each row in a dataframe that has any NA values in it.
penguins <- na.omit(penguins)
vis_miss(penguins)

Another good idea when working with a new dataset is to make sure that any categorical variables are treated as factors in R. This can be done with as.factor(col)
and makes sure that plots of categorical variables work correctly.
penguins$sex <- as.factor(penguins$sex)
penguins$island <- as.factor(penguins$island)
penguins$species <- as.factor(penguins$species)
We can also change the names of any columns. Below I have changed the names of the variables we will be plotting to be capitalised so that they will look a little nicer in legends.
names(penguins)[names(penguins) == 'island'] <- 'Island'
names(penguins)[names(penguins) == 'species'] <- 'Species'
names(penguins)[names(penguins) == 'sex'] <- 'Sex'
There is one more issue with the dataset in its current form. The sex for one observation is missing, instead containing just a full stop. A helpful side effect of converting our categorical variables to factors is that we can see this easily by printing the levels of each factor variable.
print(levels(penguins$Sex))
Output:
[1] "." "FEMALE" "MALE"
print(levels(penguins$Island))
Output:
[1] "Biscoe" "Dream" "Torgersen"
print(levels(penguins$Species))
Output:
[1] "Adelie" "Chinstrap" "Gentoo"
To handle this we can use the filter()
function from the dplyr
library. The !
before (Sex == ".")
means that rather than returning the dataset with only rows where the sex of the penguin is “.” the function will do the opposite and select all rows where the sex does not equal “.”.
library(dplyr)
penguins <- filter(penguins, !(Sex == "."))
We are now ready to start plotting. For this first look at ggplot2 we will focus on bar plots.
Creating Plots
To create any plot with ggplot2 we first need to create the plot area with the ggplot()
function. For all plots we will need to specify the data being used and any aesthetics we wish to pass through to the graphs we will be plotting. For this first tutorial we will focus exclusively on the number of penguins for specific categories in the dataset rather than any other dependent variable.
library(ggplot2) # Load the ggplot2 library at the start of the script
ggplot(data = penguins, aes(x=Species)) +
geom_bar()

Intuitively, we add new elements to a plot with +. For this tutorial we use geom_bar()
for a bar plot but other plots available include geom_point()
for a scatter plot, geom_col()
for a column plot or geom_line()
for a line plot. We could even add multiple plots to the same axes.
In aesthetics (aes
), x = Species
means that the x-axis of our bar plot, the category, is the species of penguin. In the below plot x = Island
is used to instead have the island the penguin was found on in the x-axis. For this plot fill = Species
indicates that we want our bars to be coloured by the species of the penguin.
# We can use a pipe rather than specifying the dataset directly (think of it as 'then')
penguins |>
ggplot(aes(x=Island, fill=Species)) +
geom_bar()

We can enhance our plots by adding some labels using labs()
to add a title, x-axis and y-axis. To change the title of a legend we can use the argument fill = title
for bar plots.
penguins |>
ggplot(aes(x=Species, fill=Species)) +
geom_bar()+
labs(title="Penguins in the Palmer Archipelago",
x = "Species",
y="Penguin Count")

penguins |>
ggplot(aes(x=Island, fill=Species)) +
geom_bar(position = "dodge2")+
labs(title="Penguins in the Palmer Archipelago",
x = "Island",
y="Penguin Count")

Themes allow us to customise our plots further. There are many built into ggplot2 however my favourite, easy to implement, themes are those in the ggthemes
package. The below graphs use the themes theme_hc
, theme_economist
and theme_calc()
but there are far more available. Each theme also comes with a colour palette that can be used. A custom colour palette could also have been used with scale_color_manual(c(color1, color2, color3))
.
library(ggthemes) # Load the ggthemes library
penguins |>
ggplot(aes(x=Species, fill=Species)) +
geom_bar()+
labs(title="Penguins in the Palmer Archipelago",
x="Species",
y="Penguin Count") +
geom_rangeframe() + # Highlights the range of the variables
theme_hc() + # Use the hc theme
scale_fill_hc()+ # Use the hc palette
theme(legend.position = "none")

penguins |>
ggplot(aes(x=Island, fill=Species)) +
geom_bar(position = "dodge2")+ # position = dodge2 puts the bars side by side
labs(title="Penguins in the Palmer Archipelago",
x = "Island",
y="Penguin Count",
fill="Species") +
geom_rangeframe() +
scale_fill_economist()+ # Use the economist palette
theme_economist() # Use the economist theme

penguins |>
ggplot(aes(x=Sex, fill=Species)) +
geom_bar(position = "dodge2")+
labs(title="Penguins in the Palmer Archipelago",
x = "Sex",
y="Penguin Count",
fill="Species") +
geom_rangeframe() +
scale_fill_few()+
theme_calc() # We can mix and match themes and palettes

Combining Plots
We can use a facet grid to combine all of the information from our plots so far into a single, easy to read plot. To do this we will need to reshape the penguins
dataframe using the melt
function and the the MASS
, reshape
and reshape2
packages.
library(MASS)
library(reshape2)
library(reshape)
penguin_2 <- melt(penguins, id = c('culmen_length_mm', 'culmen_depth_mm',
'flipper_length_mm', 'body_mass_g',
'Species','Sex'))
print(head(penguin_2)) # See the first few entries of our reshaped dataframe
# Create a vector so that we can later show the sex of a penguin as "Male"
# or "Female" rather than the all caps version
sex.labs <- c("Male", "Female")
names(sex.labs) <- c("MALE", "FEMALE")
ggplot(penguin_2, aes(x=value, fill = Species))+
geom_bar(position = "dodge2")+
facet_grid(Sex~variable, # facet_grid showing sex and each variable (Island)
scales="free",
space="free_x",
labeller = labeller(Sex=sex.labs))+ # Renames the sexes
labs(x="",
y="Penguin Count",
title="Penguins in the Palmer Archipelago")+
theme_hc()+
scale_fill_manual(values=c("#FF8100", "#C25ECA", "#067476")) # Set custom colours

Conclusion
This final plot shows us the distribution of penguins across each island, for each species and for both sexes.
In the next post we will begin looking at the other variables in the dataset such as body mass and flipper length and look at if these vary based on sex, island or species.