Won’t you be my neighboR?

# Won’t you be my neighboR?
## An introduction to R for data science in education
### Joshua Rosenberg and Emily Bovee
### 2018-05-16

---

# #whoami1

.pull-left[
* Dr. Joshua Rosenberg
* Assistant Professor, STEM Education, University of Tennessee, Knoxville
* Dad (one-year-old toddler!)
* Primary areas of interest
  * Data science in education
  * Data science education
]

---

# #whoami2

.pull-left[
* Dr. Emily Bovee
* Director of Educational Development and Assessment: Marquette University School of Dentistry
* Cat mom
* Primary areas of interest
  * Understanding student motivation in higher education and professional education
  * Data science education 
]

---

# Agenda

---

# Agenda

1. Why learn R?
1. Getting started with R Studio
1. Computer science foundations
1. Processing data
1. Loading data
1. Visualizing data
1. Modeling data 
1. Learning and doing more with R
1. Questions

---

# Why learn R?

* It is capable of carrying out basic and complex statistical analyses
* It is able to work with data small (*n* = 30!) and large (*n* = 100,000+) efficiently
* It is a programming language and so is quite flexible
* There is a great, inclusive community of users and developers (and teachers)
* It is cross-platform, open-source, and freely-available

---

# Getting started with R Studio

---

# Installations

To download R:
- Visit this page to download R: https://cran.r-project.org/
- Find your operating system (Mac, Windows, or Linux)
- Download the 'latest release' on the page for your operating system and download and install the application

To download R Studio:
- Visit this page to download R studio: https://www.rstudio.com/products/rstudio/download/
- Find your operating system (Mac, Windows, or Linux)
- Download the 'latest release' on the page for your operating system and download and install the application

## Check that it worked

Open R Studio. Find the console window and type in `2 + 2`. If what you can guess is returned (hint: it's what you expect!), then R Studio *and* R both work.

---

# Creating Projects

Before proceeding, we're going to take a few steps to set ourselves to make the analysis easier; namely, through the use of Projects, an R Studio-specific organizational tool.

To create a project, in R Studio, navigate to "File" and then "New Directory". Then, click "New Project".

Names should be short and easy-to-remember but also informative.

---

# Packages

"Packages" are shareable collections of R code that provide functions (i.e., a command to perform a specific task), data, and documentation. Packages increase the functionality of R by improving and expanding on base R (basic R functions).

### Installing and Loading Packages

To download a package, you must call `install.packages()`:

```r
install.packages("dplyr")
```

You can also navigate to the Packages pane, and then click "Install", which will work the same as the line of code above. This is a way to install a package using code or part of the R Studio interface.

Usually, writing code is a bit quicker, but using the interface can be very useful and complimentary to use of code.

---

# Using packages

*After* the package is installed, it must be loaded into your R Studio session using `library()`:

```r
library(dplyr)
```

We only have to install a package once, but to use it, we have to load it each time we start a new R session.

> A package is a like a book, a library is like a library; you use library() to check a package out of the library
> - Hadley Wickham, Chief Scientist, R Studio

---

# Vignettes

Vignettes are long-form documentation (and tutorials) that can provide a helpful introduction to a package.

Run `vignette()` in order to view *all* of the vignettes available for a package:

```r
vignette(package = "dplyr")
```

Then, you can load a specific vignette.

```r
vignette("dplyr", package = "dplyr")
```

These are also available through CRAN (i.e., https://cran.r-project.org/web/packages/dplyr/index.html)
---

# Using a specific function

If you know the specific function that you want to look up, you can run this in the R Studio console:

```r
?dplyr::filter
```

Once you know what you want to do with the function, you can run it in your code:

```r
dat <- # example data frame
  tibble(letter = c("A", "A", "A", "B", "B"),
         number = c(1L, 2L, 3L, 4L, 5L))

filter(dat, letter == "A")
```

```
## # A tibble: 3 x 2
##   letter number
##   <chr>   <int>
## 1 A           1
## 2 A           2
## 3 A           3
```

---

# Configuring R Studio

By default, R Studio saves all of the objects in your environment. In general, this is not ideal, because it means that you may have taken steps interactively that are not documented your code.

---

# The tidyverse

The tidyverse is a set of packages for data manipulation, exploration, and visualization using the design philosophy of 'tidy' data. Tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

The packages contained in the Tidyverse provide useful functions that augment base R functionality.

You can installing and load the complete tidyverse with:

```r
install.packages("tidyverse")
```

```r
library(tidyverse)
```

---

# Computer science foundations
---

# Data structures

- Data Frame

```r
data.frame(var1 = c("a", "b", "c"),
           var2 = c(4, 2, 5))
```

```
##   var1 var2
## 1    a    4
## 2    b    2
## 3    c    5
```

- Tibble

```r
library(tidyverse)
tibble(var1 = c("a", "b", "c"),
       var2 = c(4, 2, 5))
```

```
## # A tibble: 3 x 2
##   var1   var2
##   <chr> <dbl>
## 1 a         4
## 2 b         2
## 3 c         5
```

---

# Data structures

```r
vec1 = c("a", "b", "c")
vec1 
```

```
## [1] "a" "b" "c"
```

```r
vec2 = c(4, 2, 5)
vec2
```

```
## [1] 4 2 5
```

```r
tibble(var1 = vec1,
       var2 = vec2)
```

```
## # A tibble: 3 x 2
##   var1   var2
##   <chr> <dbl>
## 1 a         4
## 2 b         2
## 3 c         5
```

---

# Conditional logic

- `ifelse()`

```r
vec1 <- runif(10) # generates random number between 0 and 1
d1 <- tibble(var1 = vec1)

d1 %>% 
  mutate(var2 = ifelse(var1 >= .5, "greater", "lesser"))
```

```
## # A tibble: 10 x 2
##     var1 var2   
##    <dbl> <chr>  
##  1 0.885 greater
##  2 0.514 greater
##  3 0.926 greater
##  4 0.639 greater
##  5 0.861 greater
##  6 0.892 greater
##  7 0.165 lesser 
##  8 0.171 lesser 
##  9 0.877 greater
## 10 0.776 greater
```

---

# Conditional logic

- `dplyr::case_when()`

```r
d1 %>% 
  mutate(var2 = case_when(
    var1 >= .75 ~ "a lot",
    var1 >= .5 & var1 < .75 ~ "greater",
    var1 >= .25 & var1 < .5  ~ "lesser",
    var1 < .25 ~ "very little" 
  ))
```

```
## # A tibble: 10 x 2
##     var1 var2       
##    <dbl> <chr>      
##  1 0.885 a lot      
##  2 0.514 greater    
##  3 0.926 a lot      
##  4 0.639 greater    
##  5 0.861 a lot      
##  6 0.892 a lot      
##  7 0.165 very little
##  8 0.171 very little
##  9 0.877 a lot      
## 10 0.776 a lot
```

---
# Objects, functions, directories, and data structures

Here's a short URL: `https://goo.gl/bUeMhV`

This is what it resolves to (it's a CSV file): `https://raw.githubusercontent.com/data-edu/data-science-in-education/master/data/pisaUSA15/stu-quest.csv`

This chunk of code will save that file to a data directory in our working environment.

```r
student_responses_url <-
  "https://goo.gl/bUeMhV"

student_responses_file_name <-
  paste0(getwd(), "/data/student-responses-data.csv")

download.file(
  url = student_responses_url,
  destfile = student_responses_file_name)
```

---

# Step 1

1. The *character string* `"https://goo.gl/wPmujv"` is being saved to an *object* called `student_responses_url`.

```r
student_responses_url <-
  "https://goo.gl/bUeMhV"
```

---

# Step 2

2. We concatenate the working directory file path to the desired file name for the CSV using a *function* called `str_c`. This is stored in another *object* called `student_reponses_file_name`. This creates a file name with a *file path* in your working directory and it saves the file in the folder that you are working in.

```r
student_responses_file_name <-
  str_c(getwd(), "./data/student-responses-data.csv")
```

---

# Step 3

3. The `student_responses_url` *object* is passed to the `url` argument of the *function* called `download.file()` along with `student_responses_file_name`, which is passed to the `destfile` argument.

In short, the `download.file()` function needs to know
- where the file is coming from (which you tell it through the `url`) argument and
- where the file will be saved (which you tell it through the `destfile` argument).

```r
download.file(
  url = student_responses_url,
  destfile = student_responses_file_name)
```

---
class: inverse, center, middle
# Processing Data
---

# The pipe: %>%

First, let's load the data we downloaded in the last step.

```r
student_responses <- read_csv("data/student-responses-data.csv")
```

We're also going to introduce a powerful, unusual *operator* in R, the pipe. The pipe is this symbol: `%>%`. It lets you *compose* functions. It does this by passing the output of one function to the next.

Select reduces the number of *columns* of a dataset.

```r
student_mot_vars <- # save object student_mot_vars by...
  student_responses %>% # using dataframe student_responses
  select(SCIEEFF, JOYSCIE, INTBRSCI, EPIST, INSTSCIE) # and selecting only these five variables

student_mot_vars
```

Also, check out the helper functions: `contains()`, `starts_with()`, and `ends_with()`, i.e.:

```r
student_responses %>% 
  select(contains("sci"))
```

---

# Saving the results

Note that we saved the output from the `select()` function to `student_mot_vars` but we could also save it back to `student_responses`, which would simply overwrite the original data frame (the following code is not run here):

```r
student_responses <- # save object student_responses by...
  student_responses %>% # using dataframe student_responses
  select(student_responses, SCIEEFF, JOYSCIE, INTBRSCI, EPIST, INSTSCIE) # and selecting only these five variables
```

---

# Renaming

We can also rename the variables at the same time we select them. Let's save the results to `student_mot_vars`.

```r
student_mot_vars <- student_responses %>% 
  select(student_efficacy = SCIEEFF,
         student_joy = JOYSCIE, 
         student_broad_interest = INTBRSCI,
         student_epistemic_beliefs = EPIST,
         student_instrumental_motivation = INSTSCIE,
  )

student_mot_vars %>% 
  head(3)
```

```
## # A tibble: 3 x 5
##   student_efficacy student_joy student_broad_i… student_epistem…
##              <dbl>       <dbl>            <dbl>            <dbl>
## 1            3.28       2.16              1.73             2.16 
## 2           -0.668      0.509             0.491           -0.501
## 3           -1.62       0.0992           -0.638           -0.193
## # … with 1 more variable: student_instrumental_motivation <dbl>
```

---

# Creating a new variable

`mutate()` takes instructions to create a new variable.

What goes *before* the equals sign is the name of the new variable.

What goes after are the instructions.

```r
student_responses %>% 
  select(student_efficacy = SCIEEFF, 
         student_joy = JOYSCIE, 
         student_broad_interest = INTBRSCI, 
         student_epistemic_beliefs = EPIST,
         student_instrumental_motivation = INSTSCIE) %>% 
  mutate(student_joy_interest = (student_joy + student_broad_interest) / 2) %>% 
  head(3)
```

```
## # A tibble: 3 x 6
##   student_efficacy student_joy student_broad_i… student_epistem…
##              <dbl>       <dbl>            <dbl>            <dbl>
## 1            3.28       2.16              1.73             2.16 
## 2           -0.668      0.509             0.491           -0.501
## 3           -1.62       0.0992           -0.638           -0.193
## # … with 2 more variables: student_instrumental_motivation <dbl>,
## #   student_joy_interest <dbl>
```

---

# Filtering the data set

Filter takes *logical statements* (statements that can evalute to true or false) to select a number of *rows* from a dataset.

```
## # A tibble: 3 x 6
##   student_efficacy student_joy student_broad_i… student_epistem…
##              <dbl>       <dbl>            <dbl>            <dbl>
## 1           3.28         2.16             1.73             2.16 
## 2          -0.668        0.509            0.491           -0.501
## 3          -0.0378       1.67             0.332            2.16 
## # … with 2 more variables: student_instrumental_motivation <dbl>,
## #   student_joy_interest <dbl>
```

---

# Grouping and summarizing

It is often useful to aggregate a data set.

The `group_by()` and `summarize()` functions are especially helpful for this purpose.

Let's find the mean and standard deviation (and counts) of student broad interest by teacher.

```r
student_responses %>% 
  select(student_broad_interest = INTBRSCI, teacher_id = SCHID) %>% 
  group_by(teacher_id) %>% 
  summarize(student_broad_interest_mean = mean(student_broad_interest, na.rm = TRUE),
            student_broad_interest_sd = sd(student_broad_interest, na.rm = TRUE),
            n = n()) %>% 
  head(3)
```

```
## # A tibble: 3 x 4
##   teacher_id student_broad_interest_mean student_broad_interest_sd     n
##   <chr>                            <dbl>                     <dbl> <int>
## 1 00001                          -0.157                      0.846    36
## 2 00003                          -0.136                      1.08     31
## 3 00004                          -0.0430                     1.26     39
```

---

# Arranging

Finally, let's arrange the table by the number of students in each class.

`arrange()` sorts in order from the smallest to largest values.

`desc()` tells `arrange()` to sort in descending order.

```
## # A tibble: 3 x 4
##   teacher_id student_broad_interest_mean student_broad_interest_sd     n
##   <chr>                            <dbl>                     <dbl> <int>
## 1 00107                           0.204                      0.786    42
## 2 00117                           0.204                      0.893    42
## 3 00131                           0.0349                     0.870    41
```

---

# Counting

Sometimes, we simply want to count the number of observations associated with each group.

```r
student_responses %>% 
  select(teacher_id = SCHID) %>% 
  count(teacher_id)
```

We could then use this as the basis of *other* summary statistics.

```r
student_responses %>% 
  select(teacher_id = SCHID) %>% 
  count(teacher_id) %>% 
  summarize(n_mean = mean(n),
            n_sd = sd(n))
```

```
## # A tibble: 1 x 2
##   n_mean  n_sd
##    <dbl> <dbl>
## 1   32.3  8.61
```

On average, there are around 32 (*SD* = 8.61) students in each teachers' class.

---

*Note. Check out [R Studio Cloud](https://rstudio.cloud/) if you're having any installation trouble!*

*The doc is [here](https://github.com/jrosen48/MSU-workshop-2019/blob/master/demo-doc.Rmd) in the repository*

---

# Loading data

---

# Loading a CSV File

Okay, we're ready to go. The easiest way to read a CSV file is with the function `read_csv()` from the package `readr`, which is contained within the Tidyverse.

Again, we'll load (or have loaded, already) the tidyverse library.

```r
library(tidyverse) # so tidyverse packages can be used for analysis
```

Just as a recap of a line that we ran earlier.

```r
student_responses <- read_csv("data/student-responses-data.csv")
```

Since we loaded the data, we now want to look at it. We can type its name in the function `glimpse()` to print some information on the dataset (this code is not run here).

```r
glimpse(student_responses)
```

You can also print the name of the object.

---

# Loading Excel and SAV files

## Loading Excel files

```r
install.packages("readxl")
```

```r
library(readxl)
```

```r
my_data <-
  read_excel("path/to/file.xlsx")
```

## Loading SAV files

```r
install.packages("haven")
```

```r
library(haven)
my_data <-
  read_sav("path/to/file.xlsx")
```

---

# Saving (Writing) Files

Using our data frame `student_responses`, we can save it as a CSV (for example) with the following function. The first argument, `student_reponses`, is the name of the object that you want to save. The second argument, `student-responses.csv`, what you want to call the saved dataset.

```r
write_csv(student_responses, "student-responses.csv")
```

That will save a CSV file entitled `student-responses.csv` in the working directory. If you want to save it to another directory, simply add the file path to the file, i.e. `path/to/student-responses.csv`. To save a file for SPSS, load the haven package and use `write_sav()`. There is not a function to save an Excel file, but you can save as a CSV and directly load it in Excel.

---

# Visualizing data

---

# The grammar of graphics

*ggplot2* is a tidyverse package for creating visualizations (or figures).

Let's work with a smaller data set, for now.

```r
student_mot_vars <- # save object student_mot_vars by...
  student_responses %>% # using dataframe student_responses
  select(student_efficacy = SCIEEFF, # selecting variable SCIEEFF and renaming to student_efficiency
         student_joy = JOYSCIE, # selecting variable JOYSCIE and renaming to student_joy
         student_broad_interest = INTBRSCI, # selecting variable INTBRSCI and renaming to student_broad_interest
         student_epistemic_beliefs = EPIST, # selecting variable EPIST and renaming to student_epistemic_beliefs
         student_instrumental_motivation = INSTSCIE,  # selecting variable INSTSCIE and renaming to student_instrumental_motivation
         teacher_id = SCHID
  )
```

---

# Scatter plot

Notice the three parts of all `ggplot2` graphs: a) the data (`student_mot_vars`), b) an `aes()`thetic mapping, and c) a `geom`:

---

# Scatter plot with a regression line

```r
ggplot(student_mot_vars,
       aes(x = student_efficacy, y = student_broad_interest)) +
  geom_point() +
  geom_smooth(method = "lm")
```

<img src="index_files/figure-html/unnamed-chunk-45-1.png" width="350px" style="display: block; margin: auto;" />
---
# Cleaning up

```r
ggplot(student_mot_vars,
       aes(x = student_efficacy, y = student_broad_interest)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  xlab("Student Efficacy") +
  ylab("Student Broad Interest")
```

<img src="index_files/figure-html/unnamed-chunk-46-1.png" width="350px" style="display: block; margin: auto;" />
---

*The doc is [here](https://github.com/jrosen48/MSU-workshop-2019/blob/master/demo-doc.Rmd) in the repository*

---

# Modeling data

---

# Linear models

The `lm()` function is a very helpful general purpose function for linear models. We'll use the `student_mot_vars` data frame we created earlier.

```r
lm(student_epistemic_beliefs ~ student_broad_interest, data = student_mot_vars)
```

```
## 
## Call:
## lm(formula = student_epistemic_beliefs ~ student_broad_interest, 
##     data = student_mot_vars)
## 
## Coefficients:
##            (Intercept)  student_broad_interest  
##                 0.2407                  0.2401
```

---

# Linear models

Let's save the results back to an object, `m1`.

```r
m1 <- lm(student_epistemic_beliefs ~ student_broad_interest, data = student_mot_vars)
```

---

# Linear models

We can then run a summary on the output.

```r
summary(m1)
```

```
## 
## Call:
## lm(formula = student_epistemic_beliefs ~ student_broad_interest, 
##     data = student_mot_vars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6197 -0.5614 -0.1755  0.5917  2.5144 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.24066    0.01390   17.31   <2e-16 ***
## student_broad_interest  0.24015    0.01423   16.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.013 on 5323 degrees of freedom
##   (387 observations deleted due to missingness)
## Multiple R-squared:  0.0508,	Adjusted R-squared:  0.05062 
## F-statistic: 284.9 on 1 and 5323 DF,  p-value: < 2.2e-16
```

---

# Linear models

You can then build up a more complex model, i.e.:

```r
m2 <- lm(student_epistemic_beliefs ~ student_broad_interest + student_joy + student_broad_interest:student_joy, data = student_mot_vars)
```

You can then run `summary()` to view the results.

---

# Multi-level models

```r
library(lme4)
```

```
## Loading required package: Matrix
```

```
## 
## Attaching package: 'Matrix'
```

```
## The following object is masked from 'package:tidyr':
## 
##     expand
```

```r
m2_mlm <- lmer(student_epistemic_beliefs ~ student_broad_interest + student_joy + student_broad_interest:student_joy + (1|teacher_id), data = student_mot_vars)
m2_mlm
```

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## student_epistemic_beliefs ~ student_broad_interest + student_joy +  
##     student_broad_interest:student_joy + (1 | teacher_id)
##    Data: student_mot_vars
## REML criterion at convergence: 14853.02
## Random effects:
##  Groups     Name        Std.Dev.
##  teacher_id (Intercept) 0.1839  
##  Residual               0.9648  
## Number of obs: 5315, groups:  teacher_id, 176
## Fixed Effects:
##                        (Intercept)              student_broad_interest  
##                            0.16608                             0.07491  
##                        student_joy  student_broad_interest:student_joy  
##                            0.27409                             0.01991
```
---

*The doc is [here](https://github.com/jrosen48/MSU-workshop-2019/blob/master/demo-doc.Rmd) in the repository*
---

---

# Resources

- [Doing data science with R](https://r4ds.had.co.nz/) by Wickham and
    Grolemund (2017)
  - [Big magic with R: Creating learning beyond
    fear](https://speakerdeck.com/apreshill/big-magic-with-r-creative-learning-beyond-fear)
    by Hill (2017)
  - [Data science in education](https://github.com/data-edu/data-science-in-education)
  - [R Studio Community](https://community.rstudio.com/)
  
---

# Courses

- [\#r4ds](https://medium.com/@kierisi/r4ds-the-next-iteration-d51e0a1b0b82)
    (see a talk at rstudio::conf()
    [here](https://resources.rstudio.com/rstudio-conf-2019/r4ds-online-learning-community-improvements-to-self-taught-data-science-and-the-critical-need-for-diversity-equity-and-inclusion-in-data-science-education)
    by Maegan (2019))
  - [Data science for social scientists](http://datascience.tntlab.org/) by
    Landers (2019)
  - [University of Oregon Data Science Specialization for the College of
    Education](https://github.com/uo-datasci-specialization) by Anderson (2019)

---

# Useful packages

With one exception, can install via `install.packages(<name-of-package)`, i.e., `install.packages("tidylog").

- tidyverse: tools for data manipulation and processing
- tidylog: companion to the tidyverse; gives you feedback
- haven: import and export files for other statistical software
- apaTables: create APA-style tables in MS Word format
- devtools: to download packages only available on GitHub
- psych: general psychological science-related analysis tools
- skimr: quickly calculate summary statistics for a data frame
- lavaan: structural equation modeling
- tipyLPA: latent profile analysis
- lme4: multi-level modeling (HLM)
- sjstats: convenient statistical functions
- poLCA: latent class analysis
- rmarkdown: reports (PDF, Word, HTML, etc.)
- papaja: APA-style paper generation (must install from [GitHub](https://github.com/crsh/papaja))
- blogdown create a website/blog
- xaringan: create presentations
- MplusAutomation: Run Mplus via R

---

# People (and a few hashtags) to follow (on Twitter)

- [Jesse Mostipak](https://twitter.com/kierisi)
- [Alison Hill](https://twitter.com/apreshill)
- [Hadley Wickham](https://twitter.com/hadleywickham)
- [Daniel Anderson](https://twitter.com/datalorax_)
- [Mara Averick](https://twitter.com/dataandme)
- [Thomas Mock](https://twitter.com/thomas_mock)
- [Angela Li](https://twitter.com/CivicAngela)
- [R4DS community](https://twitter.com/R4DScommunity)
- [#tidytuesday](https://twitter.com/hashtag/TidyTuesday?src=hash)
- [#rstats](https://twitter.com/hashtag/rstats?src=hash)
- [David Robinson](https://twitter.com/drob)
- [Julia Silge](https://twitter.com/juliasilge)
- [Gabriela de Quieroz](https://twitter.com/gdequeiroz)
- [Joshua Rosenberg](https://twitter.com/jrosenberg6432) and [Emily Bovee](https://twitter.com/ebovee09) (of course! :)

---

The repository for this workshop is [here](https://github.com/jrosen48/MSU-workshop-2019).

---

# Contact

- [Joshua Rosenberg](mailto:jmrosenberg@utk.edu)
- [Emily Bovee](mailto:emily.a.bovee@gmail.com)