Code
getwd() # check your working directory
[1] "/Users/mmarraff/Documents/Classes/161.250/Data_Analysis_Course/workshops"
Code
# setwd() # if you know the directory you want R to look in
R
and RStudioR
basicsR
and RStudioIf you have not installed R and RStudio, refer to the study guide for some instructions.
To start this workshop download the source from the class website. You can now save this to your computer and make notes in this document.
R
basicsWorking directory
In RStudio, set the working directory under the Session menu. It is a good idea to start your analysis as a new project in the File menu so that the entire work and data files can be saved and re-opened easily later on.
[1] "/Users/mmarraff/Documents/Classes/161.250/Data_Analysis_Course/workshops"
R/RStudio as a calculator
In RStudio, use the File >> New File >> R Script menu to type or copy and paste the commands and execute them
Type 1+1
to see 2
on the console (or ->Run
the code in RStudio).
Type a=1;b=2;a/b
to see 0.5
.
Note that semicolon separates various commands. It is optional to use them as long as you type the commands one by one as follows:
There are many built-in functions. Try the following.
27^3
sqrt(10)
round(sqrt(10),2)
abs(-4)
log(10)
exp(10)
rnorm(100)
mean(rnorm(100))
sd(rnorm(100))
You may wonder what was the base used for log(10)
. A help on this can be obtained by placing a question mark (?
) before log as ?log
or by help(log)
There are a few exceptions. The command ?if
wont work but ?"if"
will. In other words, ?"log"
or help("log")
are safer ways of getting help on “built-in” functions.
In RStudio, use the R Editor (menu File > New Script) to type the commands and submit them (shortcut: CNTRL+R).`
Why does this code not work?
Make a variable with a sequence of numbers between 1 and 10. Look up the function seq()
in the help menu for tips.
Default examples
The command example()
will produce the available HELP examples, and will work for most functions. For example, try example(boxplot)
. You will see many boxplot examples such as the following:
There are also demos available, explore using the command demo()
. The basic R system produces somewhat old style graphs.
So we will be largely using the newer plotting system ggplot which is part of the tidyverse
suite of packages; see https://www.tidyverse.org/.
Let’s load that package now:
A huge number of other dedicated packages are available to improve the power of R. Many R packages are hosted at a repository called CRAN (Comprehensive R Archive Network). The package install option within RStudio can download and install these optional packages under the menu Packages >> Install
. You can also do this using the command install.packages
. For example
install.packages(c("tidyverse", "car"), dependencies = TRUE)
This command installs two packages tidyverse
and car
in one go.
Contributed R packages are grouped in various headings at https://cran.r-project.org/web/views/. They can be installed in bulk using the ctv
package command install.views()
.
You might have to install quite a few packages as you work through this course.
If the data file is stored locally, you should put the data into the same directory as your Quarto or R markdown script. That way, you can (usually) load it easily without having to type the full pathway (e.g., mydata.csv
rather than C:/Users/anhsmith/Work/Project1/data/mydata.csv
). Better yet, Projects make this much easier.
You can also load data from the web using a URL. For example,
Rows: 33 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): id, loc, time, w.e, cl, wind, temp, river, people, vehicle
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 33 × 10
id loc time w.e cl wind temp river people vehicle
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2 1 1 2 2 1 37 15
2 2 1 1 1 1 2 1 2 23 6
3 3 1 2 1 1 2 2 3 87 31
4 4 2 2 1 1 2 1 1 86 27
5 5 2 1 1 1 2 2 2 19 2
6 6 2 2 1 2 1 3 3 136 23
7 7 1 2 2 2 2 2 3 14 8
8 8 1 2 1 2 2 2 3 67 26
9 9 1 1 2 1 3 1 2 4 3
10 10 2 2 1 2 2 2 3 127 45
# ℹ 23 more rows
We’d usually want to store the data as an object though, like so:
Rows: 33 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (10): id, loc, time, w.e, cl, wind, temp, river, people, vehicle
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now the data are available in R as an object.
Rows: 33
Columns: 10
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ loc <dbl> 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1,…
$ time <dbl> 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2,…
$ w.e <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2,…
$ cl <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2,…
$ wind <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 3, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 3, 2,…
$ temp <dbl> 2, 1, 2, 1, 2, 3, 2, 2, 1, 2, 2, 3, 1, 2, 2, 2, 3, 2, 2, 2, 2,…
$ river <dbl> 1, 2, 3, 1, 2, 3, 3, 3, 2, 3, 1, 3, 1, 3, 1, 1, 2, 1, 2, 2, 3,…
$ people <dbl> 37, 23, 87, 86, 19, 136, 14, 67, 4, 127, 43, 190, 50, 47, 32, …
$ vehicle <dbl> 15, 6, 31, 27, 2, 23, 8, 26, 3, 45, 7, 53, 22, 18, 10, 3, 11, …
Try importing the Telomeres data file (in Excel format) available at
https://rs.figshare.com/ndownloader/files/22850096
Note that Excel files usually contain blanks for missing or unreported data or allocate many rows for variable description, which can cause issues while importing them.
Consider the study guide dataset rangitikei.txt
(Recreational Use of the Rangitikei river) again. The first 10 rows of this dataset are shown below:
id loc time w.e cl wind temp river people vehicle
1 1 1 2 1 1 2 2 1 37 15
2 2 1 1 1 1 2 1 2 23 6
3 3 1 2 1 1 2 2 3 87 31
4 4 2 2 1 1 2 1 1 86 27
5 5 2 1 1 1 2 2 2 19 2
6 6 2 2 1 2 1 3 3 136 23
7 7 1 2 2 2 2 2 3 14 8
8 8 1 2 1 2 2 2 3 67 26
9 9 1 1 2 1 3 1 2 4 3
10 10 2 2 1 2 2 2 3 127 45
The description of the variables is given below:
loc
- two locations were surveyed, coded 1, 2
time
- time of day, 1 for morning, 2 for afternoon
w.e
- coded 1 for weekend, 2 for weekday
cl
- cloud cover, 1 for >50%, 2 for <50%
wind
- coded 1 through 4 for increasing wind speed
temp
- temperature, 1, 2 or 3 increasing temp
river
- murkiness of river in 3 increasing categories
people
- number of people at that location and time
vehicle
- number of vehicles at that location at that time
How many rows are in my.data
? How many columns? How can you find that in R?
Find the mean of vehicle
and the median of people
using built in R functions.
If have not installed ggplot2
or tidyverse
, install it with the following commands.
We can now load the ggplot2
library with the commands:
In order to work with ggplot2
, we must have a data frame
or a tibble
containing our data. We need to specify the aesthetics
or how the columns of our data frame can be translated into positions, colours, sizes, and shapes of graphical elements.
The aes
part defines the “aesthetics”, which is how columns of the dataframe map to graphical attributes such as x and y position, colour, size, etc. An aesthetic can be either numeric or categorical and an appropriate scale will be used. After this, we add layers of graphics. geom_point
layer is employed to map x and y and we need not specify all the options for geom_point
.
Why does the following give an error and how would you fix it?
We can add a title using labs() or ggtitle() functions. Try-
or
Note that labs() allows captions and subtitles.
Remake the above graph, adjusting the x axis to say Vehicles
using the labs()
function.
geom_smooth
is additionally used to show trends.
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Similar to geom_smooth
, a variety of geoms
are available.
Each geom
accepts a particular set of mappings;for example geom_text()
accepts a labels
mapping. Try-
Will these two graphs look different? Why/why not?
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The faceting option allows a collection of small plots with the same scales. Try-
Faceting is the ggplot2
option to create separate graphs for subsets of data. ggplot2
offers two functions for creating small multiples:
facet_wrap()
: define subsets as the levels of a single grouping variablefacet_grid()
: define subsets as the crossing of two grouping variablesThe following arguments are common to most scales in ggplot2
:
Specific scale functions may have additional arguments. Some of the available Scales are:
Scale | Examples |
---|---|
scale_color_ |
scale_color_discrete |
scale_fill_ |
scale_fill_continuous |
scale_size_ |
scale_size_manual |
scale_size_discrete |
|
scale_shape_ |
scale_shape_discrete |
scale_shape_manual |
|
scale_linetype_ |
scale_linetype_discrete |
scale_x_ |
scale_x_continuous |
scale_x_log |
|
scale_x_date |
|
scale_y_ |
scale_y_reverse |
scale_y_discrete |
|
scale_y_datetime |
In RStudio, we can type scale_
followed by TAB to get the whole list of available scales.
Try-
The other coding option is shown below:
Note that a desired graph can be obtained in more than one way.
The ggplot2
theme system handles plot elements (not data based) such as
Built-in themes include:
theme_gray()
(default)theme_bw()
theme_minimal()
theme_classic()
Note that the graph is assigned an object name p1
and nothing will be printed unless we then print the object p1
.
Try-
Specific theme elements can be overridden using theme()
. For example:
All theme options can be seen with ?theme
.
To specify a theme for a whole document, use
Minimal graphing can be done using the qplot
option that will produce a few standard formatted graphs quickly.
Warning: `qplot()` was deprecated in ggplot2 3.4.0.
Try-
qplot(people, data = my.data)
qplot(people, fill=factor(river), data=my.data)
qplot(people, data = my.data, geom = "dotplot")
qplot(factor(river), people, data = my.data, geom = "boxplot")
A cheat sheet for ggplot2
is available at https://www.rstudio.com/resources/cheatsheets/ (optional to download). There are many other packages which incorporate ggplot2
based graphs or dependent on it.
The library patchwork allows complex composition arbitrary plots, which are not produced using the faceting option. Try
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.
Demonstrate how to open a new Quarto document
---
title: "Getting Started With R"
---
# Installing `R` and RStudio
If you have not installed R and RStudio, refer to the study guide for some instructions.
To start this workshop download the source from the class website. You can now save this to your computer and make notes in this document.
# Some `R` basics {.unnumbered}
*Working directory*
In RStudio, set the working directory under the *Session* menu. It is a good idea to start your analysis as a new project in the File menu so that the entire work and data files can be saved and re-opened easily later on.
```{r}
getwd() # check your working directory
# setwd() # if you know the directory you want R to look in
```
*R/RStudio as a calculator*
*In RStudio, use the File \>\> New File \>\> R Script menu to type or copy and paste the commands and execute them*
Type `1+1` to see `2` on the console (or `->Run` the code in RStudio).
```{r}
1+1
```
Type `a=1;b=2;a/b` to see `0.5`.
```{r}
a=1;b=2;a/b
```
Note that semicolon separates various commands. It is optional to use them as long as you type the commands one by one as follows:
```{r}
a=1
b=2
a/b
```
# Exercise 1.1 {-}
There are many built-in functions. Try the following.
`27^3` `sqrt(10)` `round(sqrt(10),2)` `abs(-4)` `log(10)` `exp(10)`
`rnorm(100)` `mean(rnorm(100))` `sd(rnorm(100))`
```{r}
# your code goes here
# make yourself notes about what each of these functions does
```
You may wonder what was the base used for `log(10)`. A help on this can be obtained by placing a question mark (`?`) before log as `?log` or by `help(log)`
There are a few exceptions. The command `?if` wont work but `?"if"` will. In other words, `?"log"` or `help("log")` are safer ways of getting help on "built-in" functions.
In RStudio, use the R Editor (menu *File > New Script*) to type the commands and submit them (shortcut: *CNTRL+R*).`
# Exercise 1.2 {-}
Why does this code not work?
```{r}
#| eval: false
my_variable <- 10
my_varıable
```
# Exercise 1.3 {-}
Make a variable with a sequence of numbers between 1 and 10. Look up the function `seq()` in the help menu for tips.
```{r}
# your code goes here
```
*Default examples*
The command `example()` will produce the available HELP examples, and will work for most functions. For example, try `example(boxplot)`. You will see many boxplot examples such as the following:
```{r}
boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
```
There are also demos available, explore using the command `demo()`. The basic R system produces somewhat old style graphs.
So we will be largely using the newer plotting system **ggplot** which is part of the `tidyverse` suite of packages; see <https://www.tidyverse.org/>.
Let's load that package now:
```{r}
#| message: false
# install.packages('tidyverse') # if you do not have tidyverse you will need to install it first
# you only have to install packages once per machine (computer)
library(tidyverse) # you need to load a library every R session
```
A huge number of other dedicated packages are available to improve the power of R. Many R packages are hosted at a repository called CRAN (*C*omprehensive *R* *A*rchive *N*etwork). The package install option within RStudio can download and install these optional packages under the menu `Packages >> Install`. You can also do this using the command
`install.packages`. For example
`install.packages(c("tidyverse", "car"), dependencies = TRUE)`
This command installs two packages `tidyverse` and `car` in one go.
Contributed R packages are grouped in various headings at
<https://cran.r-project.org/web/views/>. They can be installed in bulk using the `ctv` package command `install.views()`.
You might have to install quite a few packages as you work through this course.
# Loading/Importing Data {.unnumbered}
If the data file is stored locally, you should put the data into the same directory as your Quarto or R markdown script. That way, you can (usually) load it easily without having to type the full pathway (e.g., `mydata.csv` rather than `C:/Users/anhsmith/Work/Project1/data/mydata.csv`). Better yet, [Projects](https://r4ds.hadley.nz/workflow-scripts#projects) make this much easier.
You can also load data from the web using a URL. For example,
```{r}
#| code-fold: show
read_csv("https://www.massey.ac.nz/~anhsmith/data/rangitikei.csv")
```
We'd usually want to store the data as an object though, like so:
```{r}
#| code-fold: show
rangitikei <- read_csv("https://www.massey.ac.nz/~anhsmith/data/rangitikei.csv")
```
Now the data are available in R as an object.
```{r}
#| code-fold: show
glimpse(rangitikei)
```
# Exercise 1.4 {-}
Try importing the Telomeres data file (in Excel format)
available at
<https://rs.figshare.com/ndownloader/files/22850096>
```{r}
# your code goes here
```
Note that Excel files usually contain blanks for missing or unreported data or allocate many rows for variable description, which can cause issues while importing them.
Consider the study guide dataset `rangitikei.txt` (Recreational Use of
the Rangitikei river) again. The first 10 rows of this dataset are shown
below:
```{r, echo=FALSE}
my.data <- read.csv(
"https://www.massey.ac.nz/~anhsmith/data/rangitikei.csv",
header=TRUE
)
head(my.data, 10) # shows the first ten rows
```
The description of the variables is given below:
`loc` - two locations were surveyed, coded 1, 2\
`time` - time of day, 1 for morning, 2 for afternoon\
`w.e` - coded 1 for weekend, 2 for weekday\
`cl`- cloud cover, 1 for \>50%, 2 for \<50%\
`wind`- coded 1 through 4 for increasing wind speed\
`temp` - temperature, 1, 2 or 3 increasing temp\
`river`- murkiness of river in 3 increasing categories\
`people` - number of people at that location and time\
`vehicle`- number of vehicles at that location at that time\
# Exercise 1.5 {-}
How many rows are in `my.data`? How many columns? How can you find that in R?
```{r}
```
# Exercise 1.6 {-}
Find the mean of `vehicle` and the median of `people` using built in R functions.
```{r}
```
# Graphing with ggplot2
If have not installed `ggplot2` or `tidyverse`, install it with the
following commands.
```{r, eval=FALSE}
install.packages("ggplot2")
install.packages("tidyverse")
```
We can now load the `ggplot2` library with the commands:
```{r, warninng=FALSE}
library(ggplot2)
```
In order to work with `ggplot2`, we must have a `data frame` or a `tibble`
containing our data. We need to specify the `aesthetics` or how the
columns of our data frame can be translated into positions, colours,
sizes, and shapes of graphical elements.
```{r}
ggplot(my.data) +
aes(x = vehicle, y = people) +
geom_point()
```
The `aes` part defines the "aesthetics", which is how columns of the dataframe map to graphical attributes such as x and y position, colour, size, etc. An aesthetic can be either numeric or categorical and an appropriate scale will be used. After this, we add layers of graphics. `geom_point` layer is employed to map x and y and we need not specify all the options for `geom_point`.
# Exercise 1.6 {-}
Why does the following give an error and how would you fix it?
```{r}
#| eval: false
ggplot(data = my.data) +
geom_point()
```
We can add a title using *labs()* or *ggtitle()* functions. Try-
```{r, fig.show="hide"}
ggplot(my.data) +
aes(x = vehicle, y = people, color = river) +
geom_point() +
ggtitle("No. of people vs No. of vehicles")
```
or
```{r, fig.show="hide"}
ggplot(my.data)+
aes(x = vehicle, y = people) +
geom_point() +
labs(title = "No. of people vs No. of vehicles")
```
Note that *labs()* allows captions and subtitles.
# Exercise 1.7 {-}
Remake the above graph, adjusting the x axis to say `Vehicles` using the `labs()` function.
```{r}
```
`geom_smooth` is additionally used to show trends.
```{r}
ggplot(my.data) +
aes(x = vehicle, y = people) +
geom_point() +
geom_smooth()
```
# Exercise 1.8 {-}
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
```{r}
ggplot(my.data) +
aes(x = vehicle, y = people) +
geom_point()+
geom_smooth(se = FALSE)
```
Similar to `geom_smooth`, a variety of `geoms` are available.
```{r}
ggplot(my.data) +
aes(x = factor(wind), y = people) +
geom_boxplot()
```
Each `geom` accepts a particular set of mappings;for example
`geom_text()` accepts a `labels` mapping. Try-
```{r, fig.show="hide"}
ggplot(my.data) +
aes(x = vehicle, y = people) +
geom_point() +
geom_text(aes(label = w.e),
size = 5)
```
# Exercise 1.9 {-}
Will these two graphs look different? Why/why not?
```{r}
ggplot(
data = my.data,
mapping = aes(x = vehicle, y = people)
) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(
data = my.data,
mapping = aes(x = vehicle, y = people)
) +
geom_smooth(
data = my.data,
mapping = aes(x = vehicle, y = people)
)
```
The faceting option allows a collection of small plots with the same
scales. Try-
```{r}
ggplot(my.data) +
aes(x=vehicle, y=people) +
geom_point() +
facet_wrap(~ river)
```
Faceting is the `ggplot2` option to create separate graphs for subsets
of data. `ggplot2` offers two functions for creating small multiples:
1. `facet_wrap()`: define subsets as the levels of a single grouping variable
2. `facet_grid()`: define subsets as the crossing of two grouping variables
The following arguments are common to most scales in `ggplot2`:
- **name:** the first argument gives the axis or legend title
- **limits:** the minimum and maximum of the scale
- **breaks:** the points along the scale where labels should appear
- **labels:** the labels that appear at each break
Specific scale functions may have additional arguments. Some of the
available Scales are:
| **Scale** | **Examples** |
|-------------------|---------------------------|
| `scale_color_` | `scale_color_discrete` |
| `scale_fill_` | `scale_fill_continuous` |
| `scale_size_` | `scale_size_manual` |
| | `scale_size_discrete` |
| | |
| `scale_shape_` | `scale_shape_discrete` |
| | `scale_shape_manual` |
| `scale_linetype_` | `scale_linetype_discrete` |
| | |
| `scale_x_` | `scale_x_continuous` |
| | `scale_x_log` |
| | `scale_x_date` |
| `scale_y_` | `scale_y_reverse` |
| | `scale_y_discrete` |
| | `scale_y_datetime` |
In RStudio, we can type `scale_` followed by TAB to get the whole list
of available scales.
Try-
```{r}
ggplot(my.data) +
aes(x = vehicle, y = people, color = factor(temp)) +
geom_point() +
scale_x_continuous(name = "No. of Vehicles") +
scale_y_continuous(name = "No. of people") +
scale_color_discrete(name = "Temperature")
```
The other coding option is shown below:
```{r, fig.show = 'hide'}
ggplot(my.data) +
aes(x = vehicle, y = people, color = factor(temp)) +
geom_point() +
xlab("No. of Vehicles") +
ylab("No. of people") +
labs(colour="Temperature")
```
Note that a desired graph can be obtained in more than one way.
The `ggplot2` theme system handles plot elements (not data based) such
as
- Axis labels
- Plot background
- Facet label background
- Legend appearance
Built-in themes include:
- `theme_gray()` (default)
- `theme_bw()`
- `theme_minimal()`
- `theme_classic()`
```{r}
p1 <- ggplot(my.data) +
aes(x = vehicle, y = people, color = factor(temp)) +
geom_point()
```
Note that the graph is assigned an object name `p1` and nothing will be
printed unless we then print the object `p1`.
```{r}
p1 <- ggplot(my.data) +
aes(x = vehicle, y = people, color = factor(temp)) +
geom_point()
p1
```
Try-
```{r}
p1 + theme_light()
```
```{r}
p1 + theme_bw()
```
Specific theme elements can be overridden using `theme()`. For example:
```{r}
p1 + theme_minimal() +
theme(text = element_text(color = "red"))
```
All theme options can be seen with `?theme`.
To specify a theme for a whole document, use
```{r}
theme_set(theme_minimal())
```
*Minimal graphing* can be done using the `qplot` option that will
produce a few standard formatted graphs quickly.
```{r}
qplot(people, vehicle, data = my.data, colour = river)
```
Try-
`qplot(people, data = my.data)`
`qplot(people, fill=factor(river), data=my.data)`
`qplot(people, data = my.data, geom = "dotplot")`
`qplot(factor(river), people, data = my.data, geom = "boxplot")`
A cheat sheet for `ggplot2` is available at
<https://www.rstudio.com/resources/cheatsheets/> (optional to download).
There are many other packages which incorporate `ggplot2` based graphs or
dependent on it.
The library *patchwork* allows complex composition arbitrary plots,
which are not produced using the faceting option. Try
```{r}
library(patchwork)
p1 <- qplot(people, data = my.data, geom = "dotplot")
p2 <- qplot(people, data = my.data, geom = "boxplot")
p3 <- ggplot(my.data, aes(x = vehicle, y = people)) + geom_point()
(p1 + p2) / p3 +
plot_annotation("My title", caption = "My caption")
```
## Quarto
Demonstrate how to open a new Quarto document