Creating Waffle Plots in R with Waffle

Introduction

In this post we will be creating waffle plots in R, using the waffle() function from the package of the same name. I would recommend loading three packages: dplyr (for the handy filter() and select() functions), tidyr (for the pivot_longer() function) and, of course, waffle (to create the waffle plots).

library(dplyr)
library(tidyr)
library(waffle)   

Importing The Data

Waffle plots are best used to visualise count data, often as an alternative to something like a pie chart or bar chart. The data we will be using is a count of medals won at the Paris 2024 Olympic Games by country. This dataset is from Kaggle (which is a generally useful place to find datasets to practice data visualisation or machine learning) and can be found here.

Processing The Data

Out of the box the data looks as shown below.

# Import the data from a csv file
olympic_data <- read.csv('olympics2024.csv')

print(head(olympic_data))

Output:

Rank Country Country.Code Gold Silver Bronze Total
<int> <chr> <chr> <int> <int> <int> <int>
1	1	United States	US	40	44	42	126
2	2	China	CHN	40	27	24	91
3	3	Japan	JPN	20	12	13	45
4	4	Australia	AUS	18	19	16	53
5	5	France	FRA	16	26	22	64
6	6	Netherlands	NED	15	7	12	34

This is great, as there is very little cleaning to do. It is worth getting rid of the Country.Code and Total columns, however, as we will not be needing them. We can do this with select(). We can see that Rank, Gold, Silver and Bronze are already integers, but Country contains character values. It won’t make a difference here, but it’s generally better to convert this column to a factor, as this is how we will be using it.

# Remove unwanted columns
olympic_data <- olympic_data |>
  select(-c(Total, Country.Code))

# Convert country to factor
olympic_data$Country <- as.factor(olympic_data$Country)

# See the result
print(head(olympic_data))

Output:

Rank Country Gold Silver Bronze
<int> <fctr> <int> <int> <int>
1	1	United States	40	44	42
2	2	China	40	27	24
3	3	Japan	20	12	13
4	4	Australia	18	19	16
5	5	France	16	26	22
6	6	Netherlands	15	7	12

This is great; however, to plot a waffle plot we need our data in a slightly different (long) form. We do this with pivot_longer().

When using pivot_longer(), remember to specify the columns in the order you would like them to appear in the plot.

long_data <- olympic_data |>
  pivot_longer(
    cols = c(Gold, Silver, Bronze),
    names_to = "Medal",
    values_to = "Count"
  )

Here is our new dataframe. Each country now has three rows, one each for the number of gold, silver and bronze medals won.

print(head(long_data))

Output:

Rank Country Medal Count
<int> <fctr> <chr> <int>
1	United States	Gold	40	
1	United States	Silver	44	
1	United States	Bronze	42	
2	China	Gold	40	
2	China	Silver	27	
2	China	Bronze	24	
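
To see why the column order matters, here is a minimal sketch on a two-row toy dataframe. The names toy, long_toy and reversed are hypothetical, and the medal counts are copied from the table printed earlier. It shows that the rows come out in the order the columns are listed in cols.

```r
library(tidyr)

# Hypothetical two-row example using counts from the table printed earlier
toy <- data.frame(
  Country = c("United States", "China"),
  Gold    = c(40, 40),
  Silver  = c(44, 27),
  Bronze  = c(42, 24)
)

# Columns listed as Gold, Silver, Bronze
long_toy <- toy |>
  pivot_longer(cols = c(Gold, Silver, Bronze),
               names_to = "Medal", values_to = "Count")

# Same data, columns listed in the opposite order
reversed <- toy |>
  pivot_longer(cols = c(Bronze, Silver, Gold),
               names_to = "Medal", values_to = "Count")

print(long_toy$Medal[1:3])
print(reversed$Medal[1:3])
```

The first print should list Gold, Silver, Bronze and the second Bronze, Silver, Gold, so the order we type the columns in is the order the medals will be drawn in.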

Creating Waffle Plots

As mentioned earlier, the way we will be creating a waffle plot today is with the waffle() function. This function could not be simpler: we input a dataframe with a column of descriptions and a column of values, and out comes a waffle plot!

To start let’s just consider Great Britain (using filter).

# Use filter to get just the data about GB
GB_data <- long_data |> filter(Country == 'Great Britain')

# Create the waffle plot
waffle(data.frame(GB_data$Medal, GB_data$Count))

We could do with tidying this plot up a little and it would be nice to have the colours match “gold”, “silver” and “bronze”.

In general, when making waffle plots it is also nice to have as few “left over” squares as possible, which means we ideally want the total number of squares to be a multiple of the number of rows. We can find a suitable number of rows by summing our Count column and looking at its prime factors, which we can get concisely with the primeFactors() function from the numbers library.

library(numbers)

total <- sum(GB_data$Count)

print(primeFactors(total))

Output:

[1]  5 13

We can see that 5 is a prime factor of our total, meaning if we choose 5 rows we won’t be left with any “hanging squares”.
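
Prime factors are not the only row counts that work, as any divisor of the total will do. As a rough sketch in base R, the helper below lists every row count up to a chosen maximum that divides the total exactly. Note that candidate_rows() is a hypothetical helper of my own, not a function from waffle or numbers.

```r
# Return every row count up to max_rows that divides the total exactly
# (candidate_rows is a hypothetical helper, not part of waffle or numbers)
candidate_rows <- function(total, max_rows = 10) {
  rows <- seq_len(max_rows)
  rows[total %% rows == 0]
}

total <- 5 * 13  # the total implied by the prime factors printed above
print(candidate_rows(total))

# The United States total from the first table, 126, has more options
print(candidate_rows(126))
```

For a total of 65 the only choices up to 10 are 1 and 5, whereas 126 could use 2, 3, 6, 7 or 9 rows without leaving any hanging squares.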

# Add a title, set the number of rows and change the legend position
waffle(data.frame(GB_data$Medal, GB_data$Count), rows = 5,
       title = "Team GB Medals: Paris 2024",
       legend_pos = "bottom")

We can change the colours with the colors argument.

waffle(data.frame(GB_data$Medal, GB_data$Count), rows = 5,
       title = "Team GB Medals: Paris 2024",
       legend_pos = "bottom",
       colors = c('#d4af37', '#c0c0c0', '#cd7f32'))

There we have it, our first waffle plot! Each square represents a medal won.

We can also turn this plot into a pictogram. To do this you will need fontawesome installed on your computer. By default the medal glyph is a bit too large for this plot, so we set its size manually to 8 with the glyph_size argument.

library(extrafont)
library(fontawesome)
loadfonts(device = 'all')
waffle(data.frame(GB_data$Medal, GB_data$Count), rows = 5,
       title = "Team GB Medals: Paris 2024",
       legend_pos = "bottom",
       colors = c('#d4af37', '#c0c0c0', '#cd7f32'),
       use_glyph = 'medal', glyph_size = 8)

There’s nothing stopping us creating plots for different countries too. In fact we could show several on one plot using the iron function from the waffle package or using a function from another package such as plot_grid() from the cowplot package.

To simplify our code when doing this I’m also going to create a function to make us a waffle plot.

This function has one mandatory input, country, the country you wish to make the plot for. It also has three optional arguments: data (I made this customisable in case we wished to plot a different Olympic Games, for example), legend_pos, which we can use to make sure there’s only one legend for the combined plot, and size, in case we want to adjust this.

To create a title that gives each country and its rank we use paste() which is a way to concatenate strings in R.

country_waffle <- function(country, data = long_data,
                           legend_pos = 'none', size = 0.25) {
  # Keep only the rows for the requested country
  data <- data |> filter(Country == country)
  plot <- waffle(data.frame(data$Medal, data$Count),
                 size = size, legend_pos = legend_pos,
                 keep = FALSE, colors = c('#d4af37', '#c0c0c0', '#cd7f32'),
                 # The title gives the country followed by its rank
                 title = paste(country, ':', data$Rank[1]))
  return(plot)
}
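
As a quick aside on the title, here is a small sketch of what paste() produces, alongside paste0(), which is an alternative I am suggesting rather than something used above. The rank of 4 for Australia is taken from the table at the start of the post.

```r
country <- "Australia"
rank <- 4  # Australia's rank in the table shown earlier

# paste() joins its arguments with a space by default
spaced <- paste(country, ':', rank)

# paste0() joins with no separator, so we control the spacing ourselves
tight <- paste0(country, ": ", rank)

print(spaced)  # "Australia : 4"
print(tight)   # "Australia: 4"
```

Either works for a title. paste0() is handy if you would rather not have a space before the colon.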

Let’s test that this works for Australia. We do need to adjust the size parameter to make this look great.

country_waffle("Australia", size = 1, legend_pos='bottom')

Perfect! Time to combine our plots. We will plot the top five countries with cowplot first.

library(cowplot)

# Filter long_data to find the top 5 countries,
# use unique() to remove duplicates as each country has 3 rows.
top_countries <- long_data |>
  filter(Rank %in% 1:5) |>
  pull(Country) |>
  unique()

# Titles with cowplot are a bit awkward,
# to get around this we use ggdraw and create a plot that is just a title.

title <- ggdraw() + 
  draw_label(
    "Paris 2024 Olympic Games: Medals for Top 5 Countries",
    fontface = 'bold',
    hjust = 0
  )

# To get the layout we want we need to leave two empty slots
# in the first row; plot_grid() treats NULL as an empty space

plot_grid(title, NULL, NULL,
          country_waffle(top_countries[1]),
          country_waffle(top_countries[2]),
          country_waffle(top_countries[3]),
          country_waffle(top_countries[4]),
          country_waffle(top_countries[5],
                         legend_pos = 'right'), ncol=3)

Alternatively we could do this with the built in iron() function in waffle. I wanted to show both ways as something like plot_grid() is more customisable but it can be awkward to work with.

iron(country_waffle(top_countries[1]),
     country_waffle(top_countries[2]),
     country_waffle(top_countries[3],
                    legend_pos = 'bottom'))

The iron() function is better for a smaller number of plots as it can squish things quite a bit.

This isn’t the only way of creating waffle plots, however. In a future post I will be looking at creating waffle plots with geom_waffle() instead. waffle() is great for simple one-off plots, but geom_waffle() is definitely the way to go when trying to do something a bit more complicated, such as this example from r-graph-gallery.com.

Creating Scatter Plots with ggplot2

This post follows directly on from my last, “How Many Penguins? A First Look at Visualising Data With R and ggplot2”, so if you are new to ggplot2 check that one out first!

Today we are going to be continuing looking at the Palmer Archipelago dataset, this time for creating scatter plots.

As before the dataset being used can be downloaded directly (in csv format) from Kaggle or imported directly into R with the palmerpenguins package.

Artwork by @allison_horst.

Importing the Data

As previously we will be needing the ggthemes and ggplot2 packages. To show off a few themes we’ll use a different one for each plot. This time we will also be using the dplyr package for data manipulation and the ggExtra package which we will be using to plot distributions on plots.

Firstly we import the data with read.csv.

library(dplyr)
library(ggplot2)
library(ggthemes)
library(ggExtra)

# Import the penguins dataset using the read.csv() function, built into R
penguins <- read.csv("penguins_size.csv")

Next we convert the ‘species’, ‘island’ and ‘sex’ variables to factors. This will be a requirement for some of our plots and is generally good practice for categorical variables. This is done with lapply() (click here for more on this).


# Convert species, island and sex to factor variables
penguins[c('species', 'island', 'sex')] <- 
lapply(penguins[c('species', 'island', 'sex')], as.factor)
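
If the lapply() line looks a little cryptic, the sketch below does the same thing on a three-row toy dataframe. The toy data is hypothetical and only stands in for the penguins data.

```r
# Hypothetical three-row dataframe standing in for the penguins data
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie"),
  island  = c("Torgersen", "Biscoe", "Dream"),
  body_mass_g = c(3750, 5000, 3800)
)

cols <- c("species", "island")

# toy[cols] is a list of columns; lapply() applies as.factor to each one
# and the result is assigned back to the same columns
toy[cols] <- lapply(toy[cols], as.factor)

# Check the class of every column
print(sapply(toy, class))
```

Only the columns named in cols are converted, so numeric columns such as body_mass_g are left alone.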

Our final statement is quite a large one and shows off some of the power of pipes in R. In R |> is a pipe, which can be read as “then”. So the statement penguins <- penguins |> na.omit() |> rename('Island' = 'island', 'Species' = 'species', 'Sex' = 'sex') |> filter(!(Sex == ".")) is saying to take the penguins dataframe, then remove any NA values, then rename ‘island’, ‘species’ and ‘sex’ columns to their capitalised variants then filter the dataframe to remove any rows where the sex of the penguin is recorded as ‘.’ and finally to assign the resulting dataframe to the penguins variable. The use of pipes allows us to perform all of these operations in a single line of code, rather than several.


# Omit NA values from the dataset, rename 'island', 'species' and 'sex'
# to be capitalised and remove any penguins with a sex of '.'.
# Each |> must end a line: a line that starts with |> is a syntax error.

penguins <- penguins |>
  na.omit() |>
  rename('Island' = 'island', 'Species' = 'species', 'Sex' = 'sex') |>
  filter(!(Sex == "."))
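
To make the “then” reading of the pipe concrete, here is a tiny base R sketch comparing a nested call with its piped equivalent. The values vector is made up for illustration, and the native |> pipe needs R 4.1 or later.

```r
values <- c(3, NA, 5)

# Nested calls are read from the inside out
nested <- sum(na.omit(values))

# The piped version reads left to right:
# take values, then drop the NAs, then sum what is left
piped <- values |>
  na.omit() |>
  sum()

print(nested)
print(piped)
```

Both give the same answer. The pipe changes how the code reads, not what it does.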

Simple Scatter

To create a basic scatter plot we can use geom_point(), as shown below. As previously we pass the penguins dataframe into the ggplot() function with a pipe, set the aes values to be culmen length and culmen depth of the penguins and finally add the plot.

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm))+
  geom_point()

Making Improvements

We can improve this plot by adding labels, a little colour and using a theme. My usual go to theme is theme_hc() from ggthemes so we’ll start with this one.

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm))+
  geom_point(shape = 16, colour = "#FF4F00")+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)", 
      title = "Culmen Length vs. Depth in \n Penguins in the Palmer Archipelago") +
  theme_hc() +
  geom_rangeframe()

Looking at our plot, there appear to be at least two distinct clusters. We have three species of penguins in the dataset, three different islands and two sexes, which could all be potential reasons for this clustering. Later we will graphically look at what these clusters may be.

Before that, however, it is important to look at how to add a line of best fit to the scatter plot. This is incredibly easy with R and we even get built in confidence intervals for free! The line of best fit is created with geom_smooth(method="lm") where “lm” stands for linear model.

Looking at the plot we can see that a linear model is not a great fit for our data at the moment.

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm))+
  geom_point(shape = 16, colour = "#FF4F00")+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
      title = "Culmen Length vs. Depth in \n Penguins in the Palmer Archipelago") +
  geom_smooth(method="lm")+
  theme_economist_white()+
  geom_rangeframe()
  

print(cor(penguins$culmen_depth_mm, penguins$culmen_length_mm))

Output:

-0.2286256
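
The line of best fit and the correlation are closely linked. As a small sketch with made-up numbers (x and y below are hypothetical, not penguin data), we can fit the same kind of linear model ourselves with lm() and compare it with cor().

```r
# Hypothetical measurements, invented purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(9.8, 8.1, 6.3, 3.9, 2.2)

fit <- lm(y ~ x)            # a straight-line fit of y against x
slope <- coef(fit)[["x"]]   # gradient of the fitted line
r <- cor(x, y)              # correlation coefficient

print(slope)
print(r)

# For a straight-line fit with one predictor, R squared equals r squared
print(all.equal(summary(fit)$r.squared, r^2))
```

The slope and the correlation always share a sign, so a negative correlation like ours goes with a downward-sloping line.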

Marginal Plots

We can further improve our plot by adding marginal plots, with ggMarginal from the ggExtra package. These plots help us see the distribution of the data. We could make these plots separately but it can be nice to have this information in the same place as our main plot.

We do this as below. Unlike before, we now need to assign our plot to a variable and then call ggMarginal() on it. This also reduces code repetition when showing off each type of marginal plot available.

plot <- penguins |>
        ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm))+
        geom_point(shape = 16, colour = "#FF4F00")+
        labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
            title = "Culmen Length vs. Depth in \n Penguins in the Palmer Archipelago") +
        geom_smooth(method="lm")+
        theme_calc()


ggMarginal(plot, type="histogram", fill = "#FF4F00", size=5, bins = 12)
ggMarginal(plot, type="boxplot", fill = "#FF4F00", size=15)
ggMarginal(plot, type="density", fill = "#FF4F00", size=10)

For this data in particular I think the boxplot and density plots show the most information. One advantage of adding box plots in the margins like this is that it removes a usual downside of box plots, the loss of information about the actual distribution of the data, because the raw points are still visible in the main plot.

Investigating Clustering

Our previous plots were fine but they do little to explain the clusters of data in the graph. To investigate this we can add the parameter colour = Species to aes.

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm, colour = Species))+
  geom_point(shape = 16)+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
      title = "Culmen Length vs. Depth \n in Penguins in the Palmer Archipelago by Species") +
  geom_smooth(method="lm")+
  scale_colour_few()+ # We mapped colour (not fill), so use the colour scale
  theme_few()

We see three distinct clusters based on species of penguin, with very little overlap. Graphically we can see that three linear models for this data appear to be doing a better job than one. We can put some numbers to this by looking at the correlation between culmen length and depth for each species of penguin.

# Investigate the correlation for each species
correlation <- penguins |>
  group_by(Species) |>
    summarise(correlation = cor(culmen_length_mm, culmen_depth_mm))

print(correlation)

Output:

Species correlation
<fctr> <dbl>
Adelie	0.3858132			
Chinstrap	0.6535362			
Gentoo	0.6540233	

The closer the magnitude of the correlation is to 1.0, the stronger the correlation. For Adelie penguins the correlation is weak and positive, at about 0.39. For Chinstrap and Gentoo penguins the correlation is stronger, at about 0.65 for both. This would suggest that a linear model such as the one plotted would work better for Chinstrap and Gentoo penguins than for Adelie penguins, which matches our intuition when looking at the distribution of data for each species.
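
It may seem odd that the overall correlation we printed earlier was negative (-0.23) while every species has a positive one. The sketch below reproduces the effect with invented data: the toy dataframe and its two groups are hypothetical, chosen so that each group trends upwards while the groups themselves sit in opposite corners of the plot.

```r
# Hypothetical data: within each group y rises with x,
# but group B sits at larger x and smaller y than group A
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x = c(1, 2, 3, 4, 5, 11, 12, 13, 14, 15),
  y = c(11, 12.5, 12.8, 14.1, 15, 1, 2.4, 3.1, 3.9, 5)
)

# Correlation ignoring the groups
overall <- cor(toy$x, toy$y)

# split() breaks the dataframe up by group; sapply() runs cor() on each piece
by_group <- sapply(split(toy, toy$group), function(d) cor(d$x, d$y))

print(overall)
print(by_group)
```

The overall correlation is negative while both group correlations are positive, which is why it pays to check for groups before trusting a single line of best fit.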

Rather than species we might also want to consider Island!

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm, colour = Island))+
  geom_point(shape = 16)+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)", title = "Culmen Length vs. Depth by Island \n in Penguins in the Palmer Archipelago") +
  geom_smooth(method="lm")+
  scale_colour_few()+ # We mapped colour (not fill), so use the colour scale
  theme_foundation()

correlation <- penguins |>
  group_by(Island) |>
    summarise(correlation = cor(culmen_length_mm, culmen_depth_mm))

print(correlation)

Output:

Island correlation
<fctr> <dbl>
Biscoe	-0.4446658			
Dream	0.3654451			
Torgersen	0.2160882	

This seems to have been less successful than splitting the data by species. There is clearly some separation but this is far less distinct. Correlation is also generally weaker.

There’s nothing stopping us considering multiple factors on the same graph! One way to do this is to use the shape attribute i.e. change the shape of the point for each species. R will even create new lines of best fit for us for each combination! Sex is an obvious choice to do this with as we would reasonably expect differences between penguin sexes but not necessarily between the same species in different locations (although factors such as nutrition could make this the case).

Looking at sex alone, there is a clear difference in our lines of best fit. Correlation appears to be similar between the sexes; however, both culmen length and depth seem to trend larger in male penguins.

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm, colour = Sex))+
  geom_point()+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
      title = "Culmen Length vs. Depth by Sex \n in Penguins in the Palmer Archipelago") +
  geom_smooth(method="lm")+
  scale_colour_few()+ # We mapped colour (not fill), so use the colour scale
  theme_solarized()

correlation <- penguins |>
  group_by(Sex) |>
    summarise(correlation = cor(culmen_length_mm, culmen_depth_mm))

print(correlation)

Output:

Sex correlation
<fctr> <dbl>
FEMALE	-0.4263804			
MALE	-0.3952939	

Now let’s consider sex and species!

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm, colour = Sex, shape=Species))+
  geom_point()+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
      title = "Culmen Length vs. Depth by Sex and Species \n in Penguins in the Palmer Archipelago") +
  geom_smooth(method="lm")+
  scale_colour_few()+ # We mapped colour (not fill), so use the colour scale
  theme_excel_new()

A similar pattern seems to emerge for each species by sex; however, there is some slight strangeness. The linear model for male chinstrap penguins makes it appear that their culmen length increases at a greater rate with culmen depth than for female chinstrap penguins. This may be the case, but it may not. I think this highlights a key issue with splitting up data like this: as the size of the dataset you are working with decreases, any attempt to make predictions with it becomes inherently less reliable.
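
To put a rough number on “less reliable”, the sketch below compares the standard error of a fitted slope for a small and a large sample. Everything here is hypothetical: slope_se() is a helper of my own, and the data is a straight line with a fixed up and down wobble rather than real penguin measurements, so the result is deterministic.

```r
# Standard error of the fitted slope for n evenly spaced points.
# The data are hypothetical: a true slope of 2 plus a fixed +1/-1 wobble,
# so no random numbers are involved.
slope_se <- function(n) {
  x <- seq(0, 10, length.out = n)
  y <- 2 * x + rep(c(1, -1), length.out = n)
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}

se_small <- slope_se(10)   # a small subgroup
se_large <- slope_se(100)  # a larger sample over the same range

print(c(small = se_small, large = se_large))
```

The slope from the small sample has the larger standard error, so we should be more cautious about reading much into the lines fitted to our smallest subgroups.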

Facet Plots

Whilst graphs like this can be useful, they are a little visually busy. We will finish by looking at a different way to see multiple plots in one place: a facet.

For this particular set of data we need only add the facet_grid function as shown.

ggplot(penguins, aes(x=culmen_length_mm, y = culmen_depth_mm, colour = Species))+
  geom_point()+
  geom_smooth(method="lm")+
  # facet_grid with Island as the rows and Species as the columns
  facet_grid(Island~Species,
             scales="free",
             space="free_x") +
  labs(x="Culmen Length (mm)",
       y="Culmen Depth (mm)",
       title="Culmen Length vs Depth by Species and Island \n for Penguins in the Palmer Archipelago")+
  theme_base()

We can also do this for sex and species.

ggplot(penguins, aes(x=culmen_length_mm, y = culmen_depth_mm, colour = Sex))+
  geom_point()+
  geom_smooth(method="lm")+
  # facet_grid with Sex as the rows and Species as the columns
  facet_grid(Sex~Species,
             scales="free",
             space="free_x") +
  labs(x="Culmen Length (mm)",
       y="Culmen Depth (mm)",
       title="Culmen Length vs Depth by Species \n and Sex for Penguins in the Palmer Archipelago")+
  theme_stata()

Or even just species if you would rather the data not be all on one plot.

ggplot(penguins, aes(x=culmen_length_mm, y = culmen_depth_mm, colour = Sex))+
  geom_point()+
  geom_smooth(method="lm")+
  # facet_grid with one column per Species (no row variable)
  facet_grid(~Species,
             scales="free",
             space="free_x") +
  labs(x="Culmen Length (mm)",
       y="Culmen Depth (mm)",
       title="Culmen Length vs Depth by Species \n for Penguins in the Palmer Archipelago")+
  theme_bw()

We need to be very careful here, however: if you were not paying attention it might not be obvious that, because of scales="free", each of the three panels in this plot has its own x axis.

We could make this same plot the other way around too. Notice that this time we need to write Species ~ . with a dot on the right-hand side of the formula.

ggplot(penguins, aes(x=culmen_length_mm, y = culmen_depth_mm, colour = Sex))+
  geom_point()+
  geom_smooth(method="lm")+
  # facet_grid with one row per Species (the dot means no column variable)
  facet_grid(Species~.,
             scales="free",
             space="free_x") +
  labs(x="Culmen Length (mm)",
       y="Culmen Depth (mm)",
       title="Culmen Length vs Depth by Species \n for Penguins in the Palmer Archipelago")+
  theme_dark()

Conclusion

I hope that this post has been a useful introduction to scatter plots with ggplot2. Why not try investigating the relationships between some of the other numeric variables for this data such as body mass or flipper_length? You could even try investigating three numerical variables at once using bubble plots!

A First Look at Visualising Data With R and ggplot2

R and its ggplot2 package are wonderful tools for visualising data. In this post, we will explore some of the basics of plotting with ggplot2 by creating bar charts using the famous Palmer Archipelago dataset.

Artwork by @allison_horst.

The dataset being used can be downloaded directly (in csv format) from Kaggle or imported directly into R with the palmerpenguins package.

Importing the Data

Move the dataset to the location of the R script you will be plotting in, or use a relative path. Remember that for R to find your file you may need to set your current working directory. You can do this in RStudio by clicking Session > Set Working Directory > To Source File Location in the menu bar, or by running the setwd("/your_path_here") command.

You can import the penguins dataset using the read.csv() function, built into R.


# Import the penguins dataset using the read.csv() function
penguins <- read.csv("penguins_size.csv")

# View the first few entries of the dataframe
print(penguins[1:5,])

Output:

  species    island culmen_length_mm culmen_depth_mm flipper_length_mm
1  Adelie Torgersen             39.1            18.7               181
2  Adelie Torgersen             39.5            17.4               186
3  Adelie Torgersen             40.3            18.0               195
4  Adelie Torgersen               NA              NA                NA
5  Adelie Torgersen             36.7            19.3               193
  body_mass_g    sex
1        3750   MALE
2        3800 FEMALE
3        3250 FEMALE
4          NA   <NA>
5        3450 FEMALE

Cleaning Up

Viewing the dataset by printing the first few rows has revealed our first issue: this data has several NA values. A great way to visualise the amount of data missing in a given dataframe in R is using the vis_miss() function from the naniar library. You may need to install this by running install.packages('naniar').

# install.packages('naniar')
library(naniar)
vis_miss(penguins)

Reassuringly the dataset has very few missing values. The easiest way to deal with these will be to exclude them using na.omit() which simply removes each row in a dataframe that has any NA values in it.

penguins <- na.omit(penguins)

vis_miss(penguins)
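
If you would rather see numbers than a plot, the base R sketch below counts the missing values per column and shows what na.omit() does. The toy dataframe is a cut-down copy of rows 3 to 5 from the printed output above, keeping only three of the columns.

```r
# Three rows copied from the printed output above (rows 3 to 5)
toy <- data.frame(
  culmen_length_mm = c(40.3, NA, 36.7),
  culmen_depth_mm  = c(18.0, NA, 19.3),
  sex              = c("FEMALE", NA, "FEMALE")
)

# Count the missing values in each column
print(colSums(is.na(toy)))

# na.omit() drops any row containing at least one NA
clean <- na.omit(toy)
print(nrow(clean))
```

Only the middle row contains NA values, so two of the three rows survive.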

Another good idea when working with a new dataset is to make sure that any categorical variables are treated as factors in R. This can be done with as.factor(col) and makes sure that plots of categorical variables work correctly.

penguins$sex <- as.factor(penguins$sex)
penguins$island <- as.factor(penguins$island)
penguins$species <- as.factor(penguins$species)

We can also change the names of any columns. Below I have changed the names of the variables we will be plotting to be capitalised so that they will look a little nicer in legends.

names(penguins)[names(penguins) == 'island'] <- 'Island'
names(penguins)[names(penguins) == 'species'] <- 'Species'
names(penguins)[names(penguins) == 'sex'] <- 'Sex'

There is one more issue with the dataset in its current form. The sex for one observation is missing, instead containing just a full stop. A helpful side effect of converting our categorical variables to factors is that we can see this easily by printing the levels of each factor variable.

print(levels(penguins$Sex))

Output:

[1] "."      "FEMALE" "MALE"

print(levels(penguins$Island))

Output:

[1] "Biscoe"    "Dream"     "Torgersen"

print(levels(penguins$Species))

Output:

[1] "Adelie"    "Chinstrap" "Gentoo"   

To handle this we can use the filter() function from the dplyr library. The ! before (Sex == ".") means that rather than returning the dataset with only rows where the sex of the penguin is “.” the function will do the opposite and select all rows where the sex does not equal “.”.

library(dplyr)
penguins <- filter(penguins, !(Sex == "."))
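
One thing worth knowing is that removing the rows does not remove the “.” level from the factor itself. The sketch below shows this on a small made-up factor (sex here is hypothetical, mimicking our Sex column) and uses base R's droplevels(), which is an optional extra step rather than something the plots in this post need.

```r
# Hypothetical factor mimicking the Sex column
sex <- factor(c("MALE", ".", "FEMALE", "MALE"))

# !(sex == ".") and sex != "." select exactly the same elements
keep <- !(sex == ".")
same <- identical(keep, sex != ".")
print(same)

filtered <- sex[keep]

# The "." level is still listed even though no values use it...
print(levels(filtered))

# ...until we drop unused levels explicitly
print(levels(droplevels(filtered)))
```

ggplot2 generally leaves unused levels out of legends by default, so this is mostly a matter of tidiness.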

We are now ready to start plotting. For this first look at ggplot2 we will focus on bar plots.

Creating Plots

To create any plot with ggplot2 we first need to create the plot area with the ggplot() function. For all plots we will need to specify the data being used and any aesthetics we wish to pass through to the graphs we will be plotting. For this first tutorial we will focus exclusively on the number of penguins for specific categories in the dataset rather than any other dependent variable.

library(ggplot2) # Load the ggplot2 library at the start of the script

ggplot(data = penguins, aes(x=Species)) +
  geom_bar()
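
Notice that we never told ggplot2 how tall each bar should be. By default geom_bar() counts the rows in each category for us, much like table() does in base R. A small sketch with made-up labels (the species vector below is hypothetical):

```r
# Hypothetical species labels for six penguins
species <- c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo")

# table() counts how many times each value appears,
# which is the height geom_bar() gives each bar by default
counts <- table(species)
print(counts)
```

Here the Adelie bar would be 3 tall, Chinstrap 1 and Gentoo 2.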

Intuitively, we add new elements to a plot with +. For this tutorial we use geom_bar() for a bar plot but other plots available include geom_point() for a scatter plot, geom_col() for a column plot or geom_line() for a line plot. We could even add multiple plots to the same axes.

In aesthetics (aes), x = Species means that the x-axis of our bar plot, the category, is the species of penguin. In the below plot x = Island is used to instead have the island the penguin was found on in the x-axis. For this plot fill = Species indicates that we want our bars to be coloured by the species of the penguin.

# We can use a pipe rather than specifying the dataset directly (think of it as 'then')
penguins |>
  ggplot(aes(x=Island, fill=Species)) +
  geom_bar()

We can enhance our plots by adding some labels using labs() to add a title, x-axis and y-axis. To change the title of a legend we can use the argument fill = title for bar plots.

penguins |>
  ggplot(aes(x=Species, fill=Species)) +
  geom_bar()+
  labs(title="Penguins in the Palmer Archipelago",
       x = "Species",
       y="Penguin Count")

penguins |>
  ggplot(aes(x=Island, fill=Species)) +
  geom_bar(position = "dodge2")+
  labs(title="Penguins in the Palmer Archipelago",
       x = "Island",
       y="Penguin Count")

Themes allow us to customise our plots further. There are many built into ggplot2; however, my favourite easy-to-implement themes are those in the ggthemes package. The below graphs use the themes theme_hc(), theme_economist() and theme_calc(), but there are far more available. Each theme also comes with a colour palette that can be used. A custom colour palette could also have been used with scale_fill_manual(values = c(color1, color2, color3)).

library(ggthemes) # Load the ggthemes library

penguins |>
  ggplot(aes(x=Species, fill=Species)) +
  geom_bar()+
  labs(title="Penguins in the Palmer Archipelago",
       x="Species",
       y="Penguin Count") +
  geom_rangeframe() + # Highlights the range of the variables
  theme_hc() + # Use the hc theme
  scale_fill_hc()+ # Use the hc palette
  theme(legend.position = "none")

penguins |>
  ggplot(aes(x=Island, fill=Species)) +
  geom_bar(position = "dodge2")+ # position = dodge2 puts the bars side by side
  labs(title="Penguins in the Palmer Archipelago",
       x = "Island",
       y="Penguin Count",
       fill="Species") +
  geom_rangeframe() +
  scale_fill_economist()+ # Use the economist palette
  theme_economist() # Use the economist theme

penguins |>
  ggplot(aes(x=Sex, fill=Species)) +
  geom_bar(position = "dodge2")+
  labs(title="Penguins in the Palmer Archipelago",
       x = "Sex",
       y="Penguin Count",
       fill="Species") +
  geom_rangeframe() +
  scale_fill_few()+ 
  theme_calc() # We can mix and match themes and palettes

Combining Plots

We can use a facet grid to combine all of the information from our plots so far into a single, easy to read plot. To do this we will need to reshape the penguins dataframe using the melt() function from the reshape2 package.

library(reshape2) # Provides melt() for reshaping to long format


penguin_2 <- melt(penguins, id = c('culmen_length_mm', 'culmen_depth_mm',
                                   'flipper_length_mm', 'body_mass_g',
                                   'Species','Sex'))

print(head(penguin_2)) # See the first few entries of our reshaped dataframe

# Create a vector so that we can later show the sex of a penguin as "Male"
# or "Female" rather than the all caps version

sex.labs <- c("Male", "Female")
names(sex.labs) <- c("MALE", "FEMALE")


ggplot(penguin_2, aes(x=value, fill = Species))+
  geom_bar(position = "dodge2")+
  facet_grid(Sex~variable, # facet_grid showing sex and each variable (Island) 
             scales="free",
             space="free_x", 
             labeller = labeller(Sex=sex.labs))+ # Renames the sexes
  labs(x="",
       y="Penguin Count",
       title="Penguins in the Palmer Archipelago")+
  theme_hc()+
  scale_fill_manual(values=c("#FF8100", "#C25ECA", "#067476")) # Set custom colours
  

Conclusion

This final plot shows us the distribution of penguins across each island, for each species and for both sexes.

In the next post we will begin looking at the other variables in the dataset such as body mass and flipper length and look at if these vary based on sex, island or species.