Data wrangling & visualization I — ggplot2

Week 2

R/RStudio books

Data Visualization

  • this week: we will learn the basic structure of a ggplot2 plot

Data Transformation

  • next week: we will learn the key verbs (R commands) to:
    • select variables
    • filter observations (keep only the rows you want)
    • create new variables
    • compute summaries

Why make plots?

  • numeric summaries of data are easy to generate
  • mean, sd, correlation, list of t-tests, etc…
  • but numerical summaries are just summaries
  • they can obscure patterns in the underlying data

ALWAYS PLOT YOUR DATA

Anscombe’s Quartet
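
A quick way to see the point of the quartet is to compute the usual summaries for each of its four x/y pairs. The sketch below assumes the anscombe data frame that ships with base R (columns x1 to x4 and y1 to y4); the helper name summarise_pair is made up for illustration.

```r
# anscombe is built into R (datasets package): columns x1..x4 and y1..y4
summarise_pair <- function(i) {   # hypothetical helper, not from the lecture
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    cor_xy = cor(x, y))
}

# one column per dataset in the quartet
summaries <- sapply(1:4, summarise_pair)
colnames(summaries) <- paste0("set", 1:4)
round(summaries, 2)
```

The four columns should agree to about two decimal places, even though the four scatterplots look nothing alike.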

Datasaurus Dozen

Why make plots using code?

  • repeatable
  • extensible
  • sharable
  • durable

Prerequisites

  • only once:
install.packages("tidyverse")
  • every time you start RStudio:
library(tidyverse)

Sample dataset: mpg

  • the ggplot2 package (which comes with tidyverse) includes:
    • a tibble called mpg
    • type ?mpg in the RStudio console to get a help page on the mpg dataset
    • type mpg to see the first few rows
mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

Sample dataset: mpg

  • the ggplot2 package (which comes with tidyverse) includes:
    • a tibble called mpg
    • type nrow(mpg) to count the number of rows
    • type ncol(mpg) to count the number of columns
nrow(mpg)
[1] 234
ncol(mpg)
[1] 11

Sample dataset: mpg

  • you can also get a compact overview of a data frame (one line per column) using glimpse():
glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Sample dataset: mpg

  • type View(mpg) to bring up a spreadsheet-like view of the data frame
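
For a quick look without leaving the console, a few other functions can help. This sketch only assumes that ggplot2 is installed (it is loaded by library(tidyverse)); dim(), names(), head() and summary() are base R.

```r
library(ggplot2)   # provides the mpg dataset; also loaded by library(tidyverse)

dim(mpg)           # number of rows, then number of columns
names(mpg)         # column names
head(mpg, n = 3)   # first three rows
summary(mpg$hwy)   # min, quartiles, mean and max of highway mpg
```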

Q: Do big engines use more fuel?

  • displ: engine size, in litres
  • hwy: fuel efficiency (highway), in miles per gallon (mpg)
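
Before plotting, a single number can hint at the answer. This is a minimal sketch using base R's cor(); it is exactly the kind of summary that can hide structure, so it complements the plot rather than replacing it.

```r
library(ggplot2)   # for the mpg dataset

# Pearson correlation between engine size and highway fuel efficiency
r <- cor(mpg$displ, mpg$hwy)
r   # a negative value means bigger engines tend to get fewer miles per gallon
```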

Creating a ggplot

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
  • ggplot(data = mpg)
    • creates a coordinate system you can add layers to
  • geom_point()
    • adds a layer of points to your plot
    • mapping = aes(x = displ, y = hwy)
      • tells geom_point() to map displ values onto the x-axis and hwy values onto the y-axis
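
Because a ggplot is built by adding layers, it can help to store the plot in a variable and add to it step by step. A minimal sketch; the variable name p and the axis labels are illustrative choices, not part of the lecture code.

```r
library(ggplot2)

# start with the data and a coordinate system, no layers yet
p <- ggplot(data = mpg)

# add a layer of points
p <- p + geom_point(mapping = aes(x = displ, y = hwy))

# add axis labels (labs() is part of ggplot2)
p <- p + labs(x = "Engine displacement (litres)",
              y = "Highway fuel efficiency (mpg)")

p   # printing the object draws the plot
```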

Colour-code by another variable

  • can we explain why the cars shown in red don’t follow the trend?

Colour-code by another variable

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Bigger marker size

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class),
             size = 3)
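
The size = 3 above sits outside aes(), so it sets one fixed value for every point rather than mapping a variable. The same rule applies to colour. A sketch of the contrast (the choice of "blue" is arbitrary):

```r
library(ggplot2)

# mapping: colour depends on a variable, so it goes inside aes()
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# setting: one fixed colour and size for all points, so they go outside aes()
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             color = "blue", size = 3)
```

Writing aes(color = "blue") by mistake treats "blue" as a variable with a single value: the points are drawn in the default palette colour and a legend entry labelled "blue" appears.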

Size-code by another variable

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class, size = class))

Alpha transparency

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class, size = class),
             alpha = 0.5)

Facets—wrap

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Facets—grid

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

Facets: wrap vs grid
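
As a rough guide: facet_wrap() takes one variable, makes one panel per level, and wraps the panels into rows and columns; facet_grid() crosses two variables into a fixed matrix of rows and columns, including combinations that have no data. A sketch of the two side by side, reusing the variables from the previous slides (the object names are illustrative):

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# one variable: a panel per car class, wrapped into 2 rows
p_wrap <- base + facet_wrap(~ class, nrow = 2)

# two variables: rows are drv, columns are cyl
p_grid <- base + facet_grid(drv ~ cyl)

# facet_grid() with a single variable: use . as a placeholder
p_cols <- base + facet_grid(. ~ cyl)
```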

geoms—geom_point()

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

geoms—geom_smooth()

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

geoms–both

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

Pass mappings to ggplot()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
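
Mappings given to ggplot() are global: every layer inherits them. A layer can still add or override mappings of its own. A minimal sketch, in which only the points are coloured by class while the smoother uses all the data:

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +   # local mapping, points only
  geom_smooth()                                # inherits x and y only
```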

geoms—colors

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()

geom_smooth() options

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm")

geom_smooth() options

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
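
With the "lm" method, geom_smooth() draws a straight-line fit (y ~ x by default) for each group, here each value of drv. To see the numbers behind those lines, the fits can be reproduced with base R's lm(). This is a sketch for inspection only, not a description of ggplot2's internals; fit_line is a hypothetical helper.

```r
library(ggplot2)   # for the mpg dataset

# fit hwy ~ displ separately for each drive type, as the coloured lines do
fit_line <- function(d) coef(lm(hwy ~ displ, data = d))   # hypothetical helper
fits <- sapply(split(mpg, mpg$drv), fit_line)

fits   # one column per drv value; rows are the intercept and the slope
```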

Statistical Transformations

  • some graphs (e.g. scatterplots) plot raw values in your dataset
  • others (bar charts, histograms, boxplots, smoothers) calculate new values to plot
    • the algorithm that calculates them is called a stat, short for statistical transformation
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
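
The bar heights here are not a column in diamonds; geom_bar() counts the rows for each cut using its default stat, stat_count(). A sketch that computes the same counts by hand and shows the equivalent stat call:

```r
library(ggplot2)   # for the diamonds dataset

# the values geom_bar() computes behind the scenes
table(diamonds$cut)

# geoms and stats come in default pairs, so this draws the same chart
p <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
```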

Barplot with raw values, not a stat

# A tibble: 5 × 3
  cut           n  mean
  <ord>     <int> <dbl>
1 Fair       1610 4359.
2 Good       4906 3929.
3 Very Good 12082 3982.
4 Premium   13791 4584.
5 Ideal     21551 3458.
ggplot(data = meanprice) +
  geom_bar(mapping = aes(x = cut, y = mean),
           stat = "identity")
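
The slide shows the meanprice table but not how it was made. One way to build it, assuming it holds the number of diamonds and the mean price for each cut (a guess that matches the printed values, not the lecture's own code), uses dplyr verbs that are covered next week. geom_col() is ggplot2's shorthand for geom_bar() with the identity stat.

```r
library(ggplot2)
library(dplyr)     # both are loaded by library(tidyverse)

# assumed reconstruction of meanprice: count and mean price per cut
meanprice <- diamonds %>%
  group_by(cut) %>%
  summarise(n = n(), mean = mean(price))

# same chart as above: bar heights come straight from the mean column
p <- ggplot(data = meanprice) +
  geom_col(mapping = aes(x = cut, y = mean))
```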

Barplot with summary (e.g. mean)

ggplot(diamonds) +
  stat_summary_bin(aes(x = cut, y = price),
                   fun = "mean", geom = "bar")

So many options!

  • do the readings
  • play with the code
  • the homework will teach you a little more
  • build on simple examples

The layered grammar of graphics

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
  • you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme

  • with these 7 parameters you can make any plot
  • you rarely need to supply all seven parameters to make a graph
  • ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function
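
To make the defaults concrete, here is the first scatterplot from this lecture with every slot of the template filled in explicitly. The values shown are the defaults ggplot2 uses for geom_point(), so this sketch should draw the same plot as the two-line version.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",        # plot the raw values
    position = "identity"     # no position adjustment
  ) +
  coord_cartesian() +         # default coordinate system
  facet_null()                # default: a single panel
```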

R for Data Science


ggplot—play around!

  • ggplot2 is simple to get started with
  • it can be as complex as you want it to be
  • designed as a “grammar” of graphics—systematic
  • many ways of doing the same thing


ggplot—Homework 2

  • some answers are directly in the lecture slides
  • some answers are directly in the readings
  • some answers require you to apply what you have learned
  • TAs are there to help guide you
    • (but not to literally give you the code)

ALWAYS PLOT YOUR DATA