Who are we?
2021-10-05
Attention to visualization principles while digging into the R universe
Also… possibly “great frustration and much suckiness…” - Hadley Wickham
Cairo (from the introduction): A good visualization is…
Wilke (from Chapter 1): Data visualization is part art and part science. [Avoid being…]
Data visualization is essential for data exploration, communication, and understanding. Imagine we have a small dataset with the following summary characteristics:
A set of summary statistics is at best a partial picture until we see what the data look like.
```
## # A tibble: 1 × 6
##   dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
##     <dbl>  <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1       1   54.3   47.8      16.8      26.9  -0.0641
```
Imagine another data set with the following summary characteristics… and scatterplot…
```
## # A tibble: 1 × 6
##   dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
##     <dbl>  <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1       2   54.3   47.8      16.8      26.9  -0.0686
```
But wait! There’s more!
```
## # A tibble: 1 × 6
##   dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
##     <dbl>  <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1       3   54.3   47.8      16.8      26.9  -0.0645
```
For any new project in R, create an R project. Projects allow RStudio to leave notes for itself (e.g., history), will always start a new R session when opened, and will always set the working directory to the Project directory. If you never have to set the working directory at the top of the script, that’s a good thing![^2]
And create a system for organizing the objects in this project!
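A minimal sketch of doing this from the console, assuming the usethis package; the project and folder names here are illustrative, not part of the workshop materials:

```r
# install.packages("usethis")                   # once, if not already installed
usethis::create_project("albemarle-property")   # hypothetical project name

# inside the new project, create folders to keep files organized
dir.create("data")     # raw and processed data
dir.create("scripts")  # R scripts
dir.create("figures")  # saved plots
```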
Functions are the “verbs” that allow us to manipulate data. Packages contain functions, and all functions belong to packages.
R comes with about 30 packages (“base R”). There are over 10,000 user-contributed packages; you can discover these packages online on the Comprehensive R Archive Network (CRAN), with more in active development on GitHub.

To use a package, install it (you only need to do this once): in RStudio, type tidyverse (or a different package name) into the Install dialog, then click on Install. In each new R session, you’ll have to load the package if you want access to its functions: e.g., type library(tidyverse).
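Equivalently, from the R console:

```r
install.packages("tidyverse")  # install once per machine
library(tidyverse)             # load in each new R session
```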
Let’s start working in R!
Part of the tidyverse, dplyr is a package for data manipulation. The package implements a grammar for transforming data, based on verbs/functions that define a set of common tasks. dplyr functions are for data frames: they take a data frame as input, and the result of a dplyr function is always a data frame.

\(\color{blue}{\text{select()}}\) - extract \(\color{blue}{\text{variables}}\)
\(\color{green}{\text{filter()}}\) - extract \(\color{green}{\text{rows}}\)
\(\color{green}{\text{arrange()}}\) - reorder \(\color{green}{\text{rows}}\)
Extract columns by name.
select(property, yearbuilt)
select() helpers include… (a few are sketched below).
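As a sketch, two of the standard tidyselect helpers, starts_with() and contains(), select columns by name pattern; the column choices below are assumptions based on the property data used in this workshop:

```r
select(property, starts_with("year"))           # columns whose names start with "year"
select(property, yearbuilt, contains("value"))  # yearbuilt plus any column containing "value"
```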
Extract rows that meet logical conditions.
filter(property, cardtype == "R" & yearbuilt > 2020)
| Logical tests | Boolean operators for multiple conditions |
|---|---|
| x < y: less than | a & b: and |
| x >= y: greater than or equal to | a \| b: or |
| x == y: equal to | xor(a, b): exclusive or |
| x != y: not equal to | !a: not |
| x %in% y: is a member of | |
| is.na(x): is NA | |
| !is.na(x): is not NA | |
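A short sketch combining several of these tests in one filter() call; the year cutoff is arbitrary, chosen for illustration:

```r
# residential properties built after 2015 with a recorded finished square footage
filter(property,
       cardtype == "R" & yearbuilt > 2015 & !is.na(finsqft))
```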
Order rows from smallest to largest values (or vice versa) of the designated column(s).
arrange(property, yearbuilt)
Reverse the order (largest to smallest) with desc()
arrange(property, desc(yearbuilt))
\(\color{green}{\text{slice()}}\) - extract \(\color{green}{\text{rows}}\) using index(es)
\(\color{green}{\text{distinct()}}\) - filter for unique \(\color{green}{\text{rows}}\)
\(\color{green}{\text{sample_n()/sample_frac()}}\) - randomly sample \(\color{green}{\text{rows}}\)
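A quick sketch of these three on the property data (sample_n() still works in current dplyr, though slice_sample() is the newer equivalent):

```r
slice(property, 1:5)          # rows 1 through 5, by position
distinct(property, cardtype)  # unique values of cardtype
sample_n(property, 10)        # 10 randomly sampled rows
```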
The pipe (%>%) allows you to chain together functions by passing (piping) the result on the left into the first argument of the function on the right.
To get the totalvalue and finsqft for properties built in 2020, arranged in descending order of totalvalue, without the pipe we could nest the functions:
```r
arrange(
  select(
    filter(property, yearbuilt == "2020" & cardtype == "R"),
    totalvalue, finsqft),
  desc(totalvalue))
```
Or run each function separately and save the intermediate results:
```r
tmp <- filter(property, yearbuilt == "2020" & cardtype == "R")
tmp <- select(tmp, totalvalue, finsqft)
arrange(tmp, desc(totalvalue))
```
With the pipe, we call each function in sequence (read the pipe as “and then…”)
```r
property %>% 
  filter(yearbuilt == "2020" & cardtype == "R") %>% 
  select(totalvalue, finsqft) %>% 
  arrange(desc(totalvalue))
```
Keyboard shortcut! In RStudio, Ctrl + Shift + M (Cmd + Shift + M on a Mac) inserts the pipe.
\(\color{blue}{\text{summarize()}}\) - summarize \(\color{blue}{\text{variables}}\)
\(\color{green}{\text{group_by()}}\) - group \(\color{green}{\text{rows}}\)
\(\color{blue}{\text{mutate()}}\) - create new \(\color{blue}{\text{variables}}\)
Compute summaries. Summary functions include those listed in the table below.
```r
property %>% 
  filter(yearbuilt == "2020" & cardtype == "R") %>% 
  summarize(smallest = min(finsqft), 
            biggest = max(finsqft), 
            total = n())
```
| Summary Functions | |
|---|---|
| first(): first value | last(): last value |
| min(): minimum value | max(): maximum value |
| mean(): mean value | median(): median value |
| var(): variance | sd(): standard deviation |
| nth(.x, n): nth value | quantile(.x, probs = .25): quantile (here, the 25th percentile) |
| n_distinct(): number of distinct values | n(): number of values |
Groups cases by common values of one or more columns.
```r
property %>% 
  filter(yearbuilt == "2020" & cardtype == "R") %>% 
  group_by(esdistrict) %>% 
  summarize(smallest = min(finsqft), 
            biggest = max(finsqft), 
            avg_value = mean(totalvalue, na.rm = TRUE), 
            number = n()) %>% 
  arrange(desc(avg_value))
```
Create new columns or alter existing columns.
```r
property %>% 
  filter(yearbuilt == "2020" & cardtype == "R") %>% 
  mutate(finsqft = as.numeric(finsqft), 
         value_sqft = totalvalue/finsqft) %>% 
  group_by(esdistrict) %>% 
  summarize(smallest = min(finsqft), 
            biggest = max(finsqft), 
            avg_value = mean(totalvalue, na.rm = TRUE), 
            number = n()) %>% 
  arrange(desc(avg_value))
```
Helpful functions for creating new variables inside mutate() include if_else() and case_when().
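A minimal sketch of both; the cutoffs and labels are made up for illustration, and the code assumes yearbuilt is numeric:

```r
property %>% 
  mutate(
    new_build = if_else(yearbuilt >= 2000, "newer", "older"),  # two outcomes
    era = case_when(                                           # several outcomes
      yearbuilt >= 2000 ~ "2000s",
      yearbuilt >= 1900 ~ "1900s",
      TRUE              ~ "pre-1900"
    )) %>% 
  count(era)
```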
\(\color{blue}{\text{tally()}}\) - shorthand for summarize(n())
\(\color{blue}{\text{count()}}\) - shorthand for group_by() + tally()
\(\color{blue}{\text{summarize(across())}}\) - apply summary function to select \(\color{blue}{\text{variables}}\)
\(\color{blue}{\text{mutate(across())}}\) - apply mutate function to select \(\color{blue}{\text{variables}}\)
\(\color{blue}{\text{summarize(across(where()))}}\) - apply summary function to \(\color{blue}{\text{variables}}\) by conditions
\(\color{blue}{\text{rename()}}\) - rename \(\color{blue}{\text{variables}}\)
\(\color{blue}{\text{recode()}}\) - modify values of \(\color{blue}{\text{variables}}\)
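A few of these in one sketch; the column choices are again assumptions based on the property examples above:

```r
property %>% count(cardtype)   # equivalent to group_by(cardtype) %>% tally()

property %>% 
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))  # mean of every numeric column

property %>% rename(year_built = yearbuilt)   # rename(new_name = old_name)
```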
Factors are variables which take on a limited number of values, aka categorical variables. In R, factors are stored as a vector of integer values with the corresponding set of character values you’ll see when displayed (colloquially, labels; in R, levels).
```r
property %>% count(condition)   # currently a character

property %>% 
  mutate(condition = factor(condition)) %>%   # make a factor
  count(condition)

# assert the ordering of the factor levels
cond_levels <- c("Excellent", "Good", "Average", "Fair", 
                 "Poor", "Very Poor", "Unknown")

property %>% 
  mutate(condition = factor(condition, levels = cond_levels)) %>% 
  count(condition)
```
The forcats package, part of the tidyverse, provides helper functions for working with factors, including those sketched below.
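Which helpers the original slide listed isn't recoverable here, but a sketch with a few commonly used forcats functions on the condition variable looks like this (forcats loads with the tidyverse):

```r
property %>% 
  mutate(condition = factor(condition)) %>% 
  mutate(condition_freq = fct_infreq(condition),           # order levels by frequency
         condition_rev  = fct_rev(condition),              # reverse the level order
         condition_top  = fct_lump_n(condition, n = 3)) %>%  # keep the 3 most common levels, lump the rest
  count(condition_top)
```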
The Grammar of Graphics: All data visualizations map data to aesthetic attributes (location, shape, color) of geometric objects (lines, points, bars).
Scales control the mapping from data to aesthetics and provide tools to read the plot (axes, legends). Geometric objects are drawn in a specific coordinate system.
A plot can contain statistical transformations of the data (counts, means, medians), and faceting can be used to generate the same plot for different subsets of the data.
```r
ggplot(data, aes(x = var1, y = var2)) + 
  geom_point(aes(color = var3)) + 
  geom_smooth(color = "red") + 
  labs(title = "Helpful Title", x = "x-axis label")

# geom_histogram(), geom_boxplot(), geom_bar(), etc.
```
head(ncdn_long)
```
## # A tibble: 6 × 10
##   station     date  name  month day   avg_tmp max_tmp min_tmp location dates     
##   <chr>       <chr> <chr> <chr> <chr>   <dbl>   <dbl>   <dbl> <fct>    <date>    
## 1 USW00003759 01-01 CHAR… 01    01       36      43.5    28.5 Charlot… 0000-01-01
## 2 USW00003759 01-02 CHAR… 01    02       35.9    43.4    28.4 Charlot… 0000-01-02
## 3 USW00003759 01-03 CHAR… 01    03       35.8    43.3    28.3 Charlot… 0000-01-03
## 4 USW00003759 01-04 CHAR… 01    04       35.7    43.2    28.2 Charlot… 0000-01-04
## 5 USW00003759 01-05 CHAR… 01    05       35.6    43.1    28.1 Charlot… 0000-01-05
## 6 USW00003759 01-06 CHAR… 01    06       35.5    43      28   Charlot… 0000-01-06
```
```r
ggplot(ncdn_long, aes(x = dates, y = avg_tmp, color = location)) + 
  geom_line(size = 1) + 
  scale_x_date(name = "month", date_labels = "%b") + 
  scale_y_continuous(limits = c(15, 95), 
                     breaks = seq(15, 95, by = 20), 
                     name = "temperature (°F)") + 
  labs(title = "Average Daily Normal Temperatures")
```
head(mean_ncdn)
```
## # A tibble: 6 × 3
##   location month  mean
##   <fct>    <fct> <dbl>
## 1 Houston  Jan    53.8
## 2 Houston  Feb    57.8
## 3 Houston  Mar    63.8
## 4 Houston  Apr    69.9
## 5 Houston  May    77.3
## 6 Houston  Jun    83.0
```
```r
ggplot(mean_ncdn, aes(x = month, y = location, fill = mean)) + 
  geom_tile(width = .95, height = 0.95) + 
  scale_fill_viridis_c(option = "B", begin = 0.15, end = 0.98, 
                       name = "temp (°F)") + 
  scale_y_discrete(name = NULL) + 
  coord_fixed(expand = FALSE) + 
  theme(axis.line = element_blank(), 
        axis.ticks = element_blank()) + 
  labs(title = "Average Monthly Normal Temperatures")
```
One more example
All of the code for Wilke's book is on GitHub: https://github.com/clauswilke/dataviz