class: center, middle, inverse, title-slide

# Won’t you be my neighboR?
## An introduction to R for data science in education
### Joshua Rosenberg and Emily Bovee
### 2018-05-16

---

# #whoami1

.pull-left[
* Dr. Joshua Rosenberg
* Assistant Professor, STEM Education, University of Tennessee, Knoxville
* Dad (one-year-old toddler!)
* Primary areas of interest
    * Data science in education
    * Data science education
]

<img src="img/kr-joro.jpg" width="350px" style="display: block; margin: auto;" />

---

# #whoami2

.pull-left[
* Dr. Emily Bovee
* Director of Educational Development and Assessment: Marquette University School of Dentistry
* Cat mom
* Primary areas of interest
    * Understanding student motivation in higher education and professional education
    * Data science education
]

<img src="img/ebv.jpg" width="350px" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Agenda

---

# Agenda

1. Why learn R?
1. Getting started with R Studio
1. Computer science foundations
1. Processing data
1. Loading data
1. Visualizing data
1. Modeling data
1. Learning and doing more with R
1. Questions

---

# Why learn R?

* It is capable of carrying out basic and complex statistical analyses
* It is able to work with data small (*n* = 30!)
and large (*n* = 100,000+) efficiently
* It is a programming language and so is quite flexible
* There is a great, inclusive community of users and developers (and teachers)
* It is cross-platform, open-source, and freely available

---
class: inverse, center, middle

# Getting started with R Studio

---

# Installations

To download R:

- Visit this page to download R: https://cran.r-project.org/
- Find your operating system (Mac, Windows, or Linux)
- Download the 'latest release' on the page for your operating system, then install the application

To download R Studio:

- Visit this page to download R Studio: https://www.rstudio.com/products/rstudio/download/
- Find your operating system (Mac, Windows, or Linux)
- Download the 'latest release' on the page for your operating system, then install the application

## Check that it worked

Open R Studio. Find the console window and type in `2 + 2`. If the answer you expect (`4`) is returned, then R Studio *and* R both work.

---

# Creating Projects

Before proceeding, we're going to take a few steps to set ourselves up to make the analysis easier; namely, through the use of Projects, an R Studio-specific organizational tool.

To create a project, in R Studio, navigate to "File" and then "New Project". Then, click "New Directory" and "New Project". Names should be short and easy to remember but also informative.

---

# Packages

"Packages" are shareable collections of R code that provide functions (i.e., a command to perform a specific task), data, and documentation. Packages increase the functionality of R by improving and expanding on base R (basic R functions).

### Installing and Loading Packages

To download a package, you must call `install.packages()`:

```r
install.packages("dplyr")
```

You can also navigate to the Packages pane, and then click "Install", which will work the same as the line of code above. This is a way to install a package using code or part of the R Studio interface.
Usually, writing code is a bit quicker, but using the interface can be very useful and complementary to the use of code.

---

# Using packages

*After* the package is installed, it must be loaded into your R Studio session using `library()`:

```r
library(dplyr)
```

We only have to install a package once, but to use it, we have to load it each time we start a new R session.

> A package is like a book, a library is like a library; you use library() to check a package out of the library
> - Hadley Wickham, Chief Scientist, R Studio

---

# Vignettes

Vignettes are long-form documentation (and tutorials) that can provide a helpful introduction to a package. Run `vignette()` in order to view *all* of the vignettes available for a package:

```r
vignette(package = "dplyr")
```

Then, you can load a specific vignette.

```r
vignette("dplyr", package = "dplyr")
```

These are also available through CRAN (i.e., https://cran.r-project.org/web/packages/dplyr/index.html)

---

# Using a specific function

If you know the specific function that you want to look up, you can run this in the R Studio console:

```r
?dplyr::filter
```

Once you know what you want to do with the function, you can run it in your code:

```r
dat <- # example data frame
    tibble(letter = c("A", "A", "A", "B", "B"),
           number = c(1L, 2L, 3L, 4L, 5L))

filter(dat, letter == "A")
```

```
## # A tibble: 3 x 2
##   letter number
##   <chr>   <int>
## 1 A           1
## 2 A           2
## 3 A           3
```

---

# Configuring R Studio

By default, R Studio saves all of the objects in your environment. In general, this is not ideal, because it means that you may have taken steps interactively that are not documented in your code.

<img src="img/save-workspace-reminder.jpg" width="425px" style="display: block; margin: auto;" />

---

# The tidyverse

The tidyverse is a set of packages for data manipulation, exploration, and visualization using the design philosophy of 'tidy' data.
Tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.

The packages contained in the tidyverse provide useful functions that augment base R functionality.

You can install and load the complete tidyverse with:

```r
install.packages("tidyverse")
```

```r
library(tidyverse)
```

---
class: inverse, center, middle

# Computer science foundations

---

# Data structures

- Data Frame

```r
data.frame(var1 = c("a", "b", "c"),
           var2 = c(4, 2, 5))
```

```
##   var1 var2
## 1    a    4
## 2    b    2
## 3    c    5
```

- Tibble

```r
library(tidyverse)
tibble(var1 = c("a", "b", "c"),
       var2 = c(4, 2, 5))
```

```
## # A tibble: 3 x 2
##   var1   var2
##   <chr> <dbl>
## 1 a         4
## 2 b         2
## 3 c         5
```

---

# Data structures

```r
vec1 <- c("a", "b", "c")
vec1
```

```
## [1] "a" "b" "c"
```

```r
vec2 <- c(4, 2, 5)
vec2
```

```
## [1] 4 2 5
```

```r
tibble(var1 = vec1,
       var2 = vec2)
```

```
## # A tibble: 3 x 2
##   var1   var2
##   <chr> <dbl>
## 1 a         4
## 2 b         2
## 3 c         5
```

---

# Conditional logic

- `ifelse()`

```r
vec1 <- runif(10) # generates 10 random numbers between 0 and 1
d1 <- tibble(var1 = vec1)

d1 %>%
    mutate(var2 = ifelse(var1 >= .5, "greater", "lesser"))
```

```
## # A tibble: 10 x 2
##     var1 var2
##    <dbl> <chr>
##  1 0.885 greater
##  2 0.514 greater
##  3 0.926 greater
##  4 0.639 greater
##  5 0.861 greater
##  6 0.892 greater
##  7 0.165 lesser
##  8 0.171 lesser
##  9 0.877 greater
## 10 0.776 greater
```

---

# Conditional logic

- `dplyr::case_when()`

```r
d1 %>%
    mutate(var2 = case_when(
        var1 >= .75 ~ "a lot",
        var1 >= .5 & var1 < .75 ~ "greater",
        var1 >= .25 & var1 < .5 ~ "lesser",
        var1 < .25 ~ "very little"
    ))
```

```
## # A tibble: 10 x 2
##     var1 var2
##    <dbl> <chr>
##  1 0.885 a lot
##  2 0.514 greater
##  3 0.926 a lot
##  4 0.639 greater
##  5 0.861 a lot
##  6 0.892 a lot
##  7 0.165 very little
##  8 0.171 very little
##  9 0.877 a lot
## 10 0.776 a lot
```

---

# Objects, functions, directories, and data structures

Here's a short URL:
`https://goo.gl/bUeMhV`

This is what it resolves to (it's a CSV file):

`https://raw.githubusercontent.com/data-edu/data-science-in-education/master/data/pisaUSA15/stu-quest.csv`

This chunk of code will save that file to a data directory in our working environment.

```r
student_responses_url <- "https://goo.gl/bUeMhV"

student_responses_file_name <- paste0(getwd(), "/data/student-responses-data.csv")

# download.file() does not create missing folders, so make sure data/ exists first
dir.create("data", showWarnings = FALSE)

download.file(
    url = student_responses_url,
    destfile = student_responses_file_name)
```

---

# Step 1

1. The *character string* `"https://goo.gl/bUeMhV"` is being saved to an *object* called `student_responses_url`.

```r
student_responses_url <- "https://goo.gl/bUeMhV"
```

---

# Step 2

2. We concatenate the working directory file path to the desired file name for the CSV using a *function* called `paste0()`. This is stored in another *object* called `student_responses_file_name`. This creates a file name with a *file path* in your working directory, so the file is saved in a `data` folder inside the folder that you are working in.

```r
student_responses_file_name <- paste0(getwd(), "/data/student-responses-data.csv")
```

---

# Step 3

3. The `student_responses_url` *object* is passed to the `url` argument of the *function* called `download.file()` along with `student_responses_file_name`, which is passed to the `destfile` argument.

In short, the `download.file()` function needs to know

- where the file is coming from (which you tell it through the `url` argument) and
- where the file will be saved (which you tell it through the `destfile` argument).

```r
download.file(
    url = student_responses_url,
    destfile = student_responses_file_name)
```

---
class: inverse, center, middle

# Processing Data

---

# The pipe: %>%

First, let's load the data we downloaded in the last step.

```r
student_responses <- read_csv("data/student-responses-data.csv")
```

We're also going to introduce a powerful, unusual *operator* in R, the pipe. The pipe is this symbol: `%>%`. It lets you *compose* functions.
It does this by passing the output of one function to the next.

`select()` reduces the number of *columns* of a dataset.

```r
student_mot_vars <- # save object student_mot_vars by...
    student_responses %>% # using dataframe student_responses
    select(SCIEEFF, JOYSCIE, INTBRSCI, EPIST, INSTSCIE) # and selecting only these five variables

student_mot_vars
```

Also, check out the helper functions: `contains()`, `starts_with()`, and `ends_with()`, e.g.:

```r
student_responses %>%
    select(contains("sci"))
```

---

# Saving the results

Note that we saved the output from the `select()` function to `student_mot_vars` but we could also save it back to `student_responses`, which would simply overwrite the original data frame (the following code is not run here):

```r
student_responses <- # save object student_responses by...
    student_responses %>% # using dataframe student_responses
    select(SCIEEFF, JOYSCIE, INTBRSCI, EPIST, INSTSCIE) # and selecting only these five variables
```

---

# Renaming

We can also rename the variables at the same time we select them. Let's save the results to `student_mot_vars`.

```r
student_mot_vars <-
    student_responses %>%
    select(student_efficacy = SCIEEFF,
           student_joy = JOYSCIE,
           student_broad_interest = INTBRSCI,
           student_epistemic_beliefs = EPIST,
           student_instrumental_motivation = INSTSCIE)

student_mot_vars %>%
    head(3)
```

```
## # A tibble: 3 x 5
##   student_efficacy student_joy student_broad_i… student_epistem…
##              <dbl>       <dbl>            <dbl>            <dbl>
## 1            3.28       2.16              1.73             2.16
## 2           -0.668      0.509             0.491           -0.501
## 3           -1.62       0.0992           -0.638           -0.193
## # … with 1 more variable: student_instrumental_motivation <dbl>
```

---

# Creating a new variable

`mutate()` takes instructions to create a new variable. What goes *before* the equals sign is the name of the new variable. What goes after are the instructions.
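
As a minimal sketch of that naming rule, here is `mutate()` on a small made-up tibble. The object `toy` and its columns `joy` and `interest` are hypothetical and are not part of the PISA data.

```r
library(dplyr)

# Hypothetical toy data, not taken from the PISA file
toy <- tibble(joy = c(1, 2, 3),
              interest = c(3, 2, 5))

# joy_interest (before the equals sign) is the name of the new variable;
# the expression after the equals sign says how to compute it
toy_new <- toy %>%
    mutate(joy_interest = (joy + interest) / 2)

toy_new
```

The same pattern is applied to the student data next.
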
```r
student_responses %>%
    select(student_efficacy = SCIEEFF,
           student_joy = JOYSCIE,
           student_broad_interest = INTBRSCI,
           student_epistemic_beliefs = EPIST,
           student_instrumental_motivation = INSTSCIE) %>%
    mutate(student_joy_interest = (student_joy + student_broad_interest) / 2) %>%
    head(3)
```

```
## # A tibble: 3 x 6
##   student_efficacy student_joy student_broad_i… student_epistem…
##              <dbl>       <dbl>            <dbl>            <dbl>
## 1            3.28       2.16              1.73             2.16
## 2           -0.668      0.509             0.491           -0.501
## 3           -1.62       0.0992           -0.638           -0.193
## # … with 2 more variables: student_instrumental_motivation <dbl>,
## #   student_joy_interest <dbl>
```

---

# Filtering the data set

`filter()` takes *logical statements* (statements that can evaluate to true or false) to select a number of *rows* from a dataset.

```r
student_responses %>%
    select(student_efficacy = SCIEEFF,
           student_joy = JOYSCIE,
           student_broad_interest = INTBRSCI,
           student_epistemic_beliefs = EPIST,
           student_instrumental_motivation = INSTSCIE) %>%
    mutate(student_joy_interest = (student_joy + student_broad_interest) / 2) %>%
    filter(student_joy_interest > mean(student_joy_interest, na.rm = TRUE)) %>%
    head(3)
```

```
## # A tibble: 3 x 6
##   student_efficacy student_joy student_broad_i… student_epistem…
##              <dbl>       <dbl>            <dbl>            <dbl>
## 1           3.28         2.16             1.73             2.16
## 2          -0.668        0.509            0.491           -0.501
## 3          -0.0378       1.67             0.332            2.16
## # … with 2 more variables: student_instrumental_motivation <dbl>,
## #   student_joy_interest <dbl>
```

---

# Grouping and summarizing

It is often useful to aggregate a data set. The `group_by()` and `summarize()` functions are especially helpful for this purpose. Let's find the mean and standard deviation (and counts) of student broad interest by teacher.
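
To see the pattern in isolation first, here is a sketch on a small made-up tibble (the object `toy` and its values are hypothetical); the student data follow the same steps.

```r
library(dplyr)

# Hypothetical toy data: two teachers, five students, one missing value
toy <- tibble(teacher_id = c("a", "a", "b", "b", "b"),
              interest = c(1, 3, 2, 4, NA))

toy_summary <- toy %>%
    group_by(teacher_id) %>% # one group per teacher
    summarize(interest_mean = mean(interest, na.rm = TRUE), # missing values are dropped
              interest_sd = sd(interest, na.rm = TRUE),
              n = n()) # number of rows per group, including rows with NA

toy_summary
```
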
```r
student_responses %>%
    select(student_broad_interest = INTBRSCI,
           teacher_id = SCHID) %>%
    group_by(teacher_id) %>%
    summarize(student_broad_interest_mean = mean(student_broad_interest, na.rm = TRUE),
              student_broad_interest_sd = sd(student_broad_interest, na.rm = TRUE),
              n = n()) %>%
    head(3)
```

```
## # A tibble: 3 x 4
##   teacher_id student_broad_interest_mean student_broad_interest_sd     n
##   <chr>                            <dbl>                     <dbl> <int>
## 1 00001                          -0.157                      0.846    36
## 2 00003                          -0.136                      1.08     31
## 3 00004                          -0.0430                     1.26     39
```

---

# Arranging

Finally, let's arrange the table by the number of students in each class. `arrange()` sorts in order from the smallest to largest values. `desc()` tells `arrange()` to sort in descending order.

```r
student_responses %>%
    select(student_broad_interest = INTBRSCI,
           teacher_id = SCHID) %>%
    group_by(teacher_id) %>%
    summarize(student_broad_interest_mean = mean(student_broad_interest, na.rm = TRUE),
              student_broad_interest_sd = sd(student_broad_interest, na.rm = TRUE),
              n = n()) %>%
    arrange(desc(n)) %>%
    head(3)
```

```
## # A tibble: 3 x 4
##   teacher_id student_broad_interest_mean student_broad_interest_sd     n
##   <chr>                            <dbl>                     <dbl> <int>
## 1 00107                           0.204                      0.786    42
## 2 00117                           0.204                      0.893    42
## 3 00131                           0.0349                     0.870    41
```

---

# Counting

Sometimes, we simply want to count the number of observations associated with each group.

```r
student_responses %>%
    select(teacher_id = SCHID) %>%
    count(teacher_id)
```

We could then use this as the basis of *other* summary statistics.

```r
student_responses %>%
    select(teacher_id = SCHID) %>%
    count(teacher_id) %>%
    summarize(n_mean = mean(n),
              n_sd = sd(n))
```

```
## # A tibble: 1 x 2
##   n_mean  n_sd
##    <dbl> <dbl>
## 1   32.3  8.61
```

On average, there are around 32 (*SD* = 8.61) students in each teacher's class.

---
class: inverse, center, middle

# Demo doc (part 1): Try it out!

*Note.
Check out [R Studio Cloud](https://rstudio.cloud/) if you're having any installation trouble!*

*The doc is [here](https://github.com/jrosen48/MSU-workshop-2019/blob/master/demo-doc.Rmd) in the repository*

---
class: inverse, center, middle

# Loading data

---

# Loading a CSV File

Okay, we're ready to go. The easiest way to read a CSV file is with the function `read_csv()` from the package `readr`, which is contained within the tidyverse.

Again, we'll load (or have loaded, already) the tidyverse library.

```r
library(tidyverse) # so tidyverse packages can be used for analysis
```

Just as a recap, here is a line that we ran earlier.

```r
student_responses <- read_csv("data/student-responses-data.csv")
```

Since we loaded the data, we now want to look at it. We can pass it to the function `glimpse()` to print some information on the dataset (this code is not run here).

```r
glimpse(student_responses)
```

You can also print the object by typing its name.

---

# Loading Excel and SAV files

## Loading Excel files

```r
install.packages("readxl")
```

```r
library(readxl)
```

```r
my_data <- read_excel("path/to/file.xlsx")
```

## Loading SAV files

```r
install.packages("haven")
```

```r
library(haven)
my_data <- read_sav("path/to/file.sav")
```

---

# Saving (Writing) Files

Using our data frame `student_responses`, we can save it as a CSV (for example) with the following function. The first argument, `student_responses`, is the name of the object that you want to save. The second argument, `"student-responses.csv"`, is what you want to call the saved dataset.

```r
write_csv(student_responses, "student-responses.csv")
```

That will save a CSV file entitled `student-responses.csv` in the working directory. If you want to save it to another directory, simply add the file path to the file, i.e. `path/to/student-responses.csv`.

To save a file for SPSS, load the haven package and use `write_sav()`.
There is not a function in the tidyverse to save an Excel file (other packages, such as writexl, provide one), but you can save as a CSV and directly load it in Excel.

---
class: inverse, center, middle

# Visualizing data

---

# The grammar of graphics

*ggplot2* is a tidyverse package for creating visualizations (or figures).

Let's work with a smaller data set, for now.

```r
student_mot_vars <- # save object student_mot_vars by...
    student_responses %>% # using dataframe student_responses
    select(student_efficacy = SCIEEFF, # selecting variable SCIEEFF and renaming to student_efficacy
           student_joy = JOYSCIE, # selecting variable JOYSCIE and renaming to student_joy
           student_broad_interest = INTBRSCI, # selecting variable INTBRSCI and renaming to student_broad_interest
           student_epistemic_beliefs = EPIST, # selecting variable EPIST and renaming to student_epistemic_beliefs
           student_instrumental_motivation = INSTSCIE, # selecting variable INSTSCIE and renaming to student_instrumental_motivation
           teacher_id = SCHID # selecting variable SCHID and renaming to teacher_id
    )
```

---

# Scatter plot

Notice the three parts of all `ggplot2` graphs: a) the data (`student_mot_vars`), b) an `aes()`thetic mapping, and c) a `geom`:

<img src="index_files/figure-html/unnamed-chunk-44-1.png" width="450px" style="display: block; margin: auto;" />

---

# Scatter plot with a regression line

```r
ggplot(student_mot_vars, aes(x = student_efficacy, y = student_broad_interest)) +
    geom_point() +
    geom_smooth(method = "lm")
```

<img src="index_files/figure-html/unnamed-chunk-45-1.png" width="350px" style="display: block; margin: auto;" />

---

# Cleaning up

```r
ggplot(student_mot_vars, aes(x = student_efficacy, y = student_broad_interest)) +
    geom_point() +
    geom_smooth(method = "lm") +
    theme_bw() +
    xlab("Student Efficacy") +
    ylab("Student Broad Interest")
```

<img src="index_files/figure-html/unnamed-chunk-46-1.png" width="350px" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Demo doc (part 2)

*The doc is [here](https://github.com/jrosen48/MSU-workshop-2019/blob/master/demo-doc.Rmd) in the
repository*

---
class: inverse, center, middle

# Modeling data

---

# Linear models

The `lm()` function is a very helpful general purpose function for linear models. We'll use the `student_mot_vars` data frame we created earlier.

```r
lm(student_epistemic_beliefs ~ student_broad_interest, data = student_mot_vars)
```

```
## 
## Call:
## lm(formula = student_epistemic_beliefs ~ student_broad_interest, 
##     data = student_mot_vars)
## 
## Coefficients:
##            (Intercept)  student_broad_interest  
##                 0.2407                  0.2401
```

---

# Linear models

Let's save the results back to an object, `m1`.

```r
m1 <- lm(student_epistemic_beliefs ~ student_broad_interest, data = student_mot_vars)
```

---

# Linear models

We can then run a summary on the output.

```r
summary(m1)
```

```
## 
## Call:
## lm(formula = student_epistemic_beliefs ~ student_broad_interest, 
##     data = student_mot_vars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6197 -0.5614 -0.1755  0.5917  2.5144 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.24066    0.01390   17.31   <2e-16 ***
## student_broad_interest  0.24015    0.01423   16.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.013 on 5323 degrees of freedom
##   (387 observations deleted due to missingness)
## Multiple R-squared:  0.0508, Adjusted R-squared:  0.05062 
## F-statistic: 284.9 on 1 and 5323 DF,  p-value: < 2.2e-16
```

---

# Linear models

You can then build up a more complex model, e.g.:

```r
m2 <- lm(student_epistemic_beliefs ~
             student_broad_interest + student_joy + student_broad_interest:student_joy,
         data = student_mot_vars)
```

You can then run `summary()` to view the results.
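
As a sketch of what that looks like, the following fits a model with an interaction term to simulated data and pulls out the coefficient table. The data frame `sim` and its columns `x1`, `x2`, and `y` are made up for illustration and only stand in for `student_mot_vars`.

```r
# Simulated (hypothetical) data standing in for student_mot_vars
set.seed(123)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 0.2 + 0.3 * sim$x1 + 0.5 * sim$x2 + 0.1 * sim$x1 * sim$x2 + rnorm(200)

# x1:x2 adds the interaction term, in the same way as in m2
m_sim <- lm(y ~ x1 + x2 + x1:x2, data = sim)

# Coefficient table from summary(): estimate, standard error, t value, p value
coef_table <- coef(summary(m_sim))
coef_table
```

In R's formula syntax, `x1 * x2` is shorthand for `x1 + x2 + x1:x2`, so either form fits the same model.
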

---

# Multi-level models

```r
library(lme4)
```

```
## Loading required package: Matrix
```

```
## 
## Attaching package: 'Matrix'
```

```
## The following object is masked from 'package:tidyr':
## 
##     expand
```

```r
m2_mlm <- lmer(student_epistemic_beliefs ~
                   student_broad_interest + student_joy + student_broad_interest:student_joy +
                   (1|teacher_id),
               data = student_mot_vars)

m2_mlm
```

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## student_epistemic_beliefs ~ student_broad_interest + student_joy +  
##     student_broad_interest:student_joy + (1 | teacher_id)
##    Data: student_mot_vars
## REML criterion at convergence: 14853.02
## Random effects:
##  Groups     Name        Std.Dev.
##  teacher_id (Intercept) 0.1839
##  Residual               0.9648
## Number of obs: 5315, groups:  teacher_id, 176
## Fixed Effects:
##                        (Intercept)              student_broad_interest  
##                            0.16608                             0.07491  
##                        student_joy  student_broad_interest:student_joy  
##                            0.27409                             0.01991
```

---
class: inverse, center, middle

# Demo doc (part 3)

*The doc is [here](https://github.com/jrosen48/MSU-workshop-2019/blob/master/demo-doc.Rmd) in the repository*

---
class: inverse, center, middle

# Learning and doing more with R

---

# Resources

- [R for Data Science](https://r4ds.had.co.nz/) by Wickham and Grolemund (2017)
- [Big magic with R: Creative learning beyond fear](https://speakerdeck.com/apreshill/big-magic-with-r-creative-learning-beyond-fear) by Hill (2017)
- [Data science in education](https://github.com/data-edu/data-science-in-education)
- [R Studio Community](https://community.rstudio.com/)

---

# Courses

- [\#r4ds](https://medium.com/@kierisi/r4ds-the-next-iteration-d51e0a1b0b82) (see a talk at rstudio::conf() [here](https://resources.rstudio.com/rstudio-conf-2019/r4ds-online-learning-community-improvements-to-self-taught-data-science-and-the-critical-need-for-diversity-equity-and-inclusion-in-data-science-education) by Maegan (2019))
- [Data science for social scientists](http://datascience.tntlab.org/) by Landers
(2019)
- [University of Oregon Data Science Specialization for the College of Education](https://github.com/uo-datasci-specialization) by Anderson (2019)

---

# Useful packages

With one exception, you can install these via `install.packages("<name-of-package>")`, e.g., `install.packages("tidylog")`.

- tidyverse: tools for data manipulation and processing
- tidylog: companion to the tidyverse; gives you feedback
- haven: import and export files for other statistical software
- apaTables: create APA-style tables in MS Word format
- devtools: to download packages only available on GitHub
- psych: general psychological science-related analysis tools
- skimr: quickly calculate summary statistics for a data frame
- lavaan: structural equation modeling
- tidyLPA: latent profile analysis
- lme4: multi-level modeling (HLM)
- sjstats: convenient statistical functions
- poLCA: latent class analysis
- rmarkdown: reports (PDF, Word, HTML, etc.)
- papaja: APA-style paper generation (must install from [GitHub](https://github.com/crsh/papaja))
- blogdown: create a website/blog
- xaringan: create presentations
- MplusAutomation: run Mplus via R

---

# People (and a few hashtags) to follow (on Twitter)

- [Jesse Mostipak](https://twitter.com/kierisi)
- [Alison Hill](https://twitter.com/apreshill)
- [Hadley Wickham](https://twitter.com/hadleywickham)
- [Daniel Anderson](https://twitter.com/datalorax_)
- [Mara Averick](https://twitter.com/dataandme)
- [Thomas Mock](https://twitter.com/thomas_mock)
- [Angela Li](https://twitter.com/CivicAngela)
- [R4DS community](https://twitter.com/R4DScommunity)
- [#tidytuesday](https://twitter.com/hashtag/TidyTuesday?src=hash)
- [#rstats](https://twitter.com/hashtag/rstats?src=hash)
- [David Robinson](https://twitter.com/drob)
- [Julia Silge](https://twitter.com/juliasilge)
- [Gabriela de Queiroz](https://twitter.com/gdequeiroz)
- [Joshua Rosenberg](https://twitter.com/jrosenberg6432) and [Emily Bovee](https://twitter.com/ebovee09) (of course!
:)

---
class: inverse, center, middle

# Questions?

The repository for this workshop is [here](https://github.com/jrosen48/MSU-workshop-2019).

---

# Contact

- [Joshua Rosenberg](mailto:jmrosenberg@utk.edu)
- [Emily Bovee](mailto:emily.a.bovee@gmail.com)