An Introduction to Natural Language Processing and Machine Learning

# An Introduction to Natural Language Processing and Machine Learning
### Joshua M. Rosenberg, Ph.D.
### University of Tennessee, Knoxville
### 2021/07/19 (updated: 2021-07-18)

---

# 1. Introductions

---

## Goals of this workshop

### Over-arching goal:

Get started with applying natural language processing methods in relatively short order through the use of R.

### Also:

This is an opportunity to get to know and learn from one another and to build capacity in science education research to use NLP and do ML.

---

## Agenda

1. Introductions (20 min.)
2. Brief *R*efresher (20 min.)
3. Overview of core natural language processing (NLP) ideas (10 min.)

10 min. break!

4. Trying out NLP analyses, part A (25 min.)
5. Trying out NLP analyses, part B (25 min.)

10 min. break!

6. Trying out NLP analyses, part C (30 min.)
7. Learning and doing more (10 min.)
8. Wrap-up (10 min.)

---

## About me

.pull-left[
* Joshua (Josh) Rosenberg, Ph.D.
* Assistant Professor, STEM Education, [University of Tennessee, Knoxville](https://utk.edu/)
* Husband to Katie, a school librarian, and Dad to a toddler, cyclist in training
* Research areas:
  * Science education
  * Data science in education
* R user since 2014; developer of many R packages
* Presently PI or Co-PI for three National Science Foundation grants
]

]

---

## Break-out rooms!

In randomly assigned break-out rooms of ~5-6 people, *starting with whomever has a 
birthday farthest from today*:

- **Introduce yourself** and your position and affiliation
- **Describe the best thing** that you have done this summer
- **What are some of your goals** for using natural language processing in science education?

---

# 2. Brief *R*efresher (20 min.)

---

## Let's reason about some code

```r
library(tidyverse)

sci_data <- read_csv(here("data", "data-to-model.csv"))

sci_data %>% 
  select(student_id, final_grade) %>% 
  arrange(desc(final_grade))
```

```r
sci_data %>% 
  select(student_id, final_grade) %>% 
  arrange(desc(final_grade))
```

```
## # A tibble: 546 x 2
##    student_id final_grade
##         <dbl>       <dbl>
##  1      85650       100  
##  2      91067        99.8
##  3      66740        99.3
##  4      86792        99.1
##  5      78153        99.0
##  6      66689        98.6
##  7      88261        98.6
##  8      92740        98.6
##  9      92726        98.2
## 10      92741        98.2
## # … with 536 more rows
```

What is the _data_?
What are the _functions_?
What are the _arguments_ to the functions?

And . . .

What do you think will happen?

---

```r
sci_data %>% 
  select(student_id, final_grade) %>% 
  arrange(desc(final_grade))
```

---

```r
sci_data %>% 
  group_by(subject) %>% 
  summarize(mean_final_grade = mean(final_grade, na.rm = TRUE))
```

What is the _data_?
What are the _functions_?
What are the _arguments_ to the functions?

And . . .

What do you think will happen?

---

```r
sci_data %>% 
  group_by(subject) %>% 
  summarize(mean_final_grade = mean(final_grade, na.rm = TRUE))
```

```
## # A tibble: 5 x 2
##   subject mean_final_grade
##   <chr>              <dbl>
## 1 AnPhA               76.1
## 2 BioA                68.4
## 3 FrScA               80.6
## 4 OcnA                73.3
## 5 PhysA               83.5
```

---

***Code Snippet A***

```r
sci_data %>% 
  mutate(passing_grade = if_else(final_grade >= .70, 1, 0))
```

***Code Snippet B***

```r
sci_data %>% 
  mutate(passing_grade = if_else(final_grade >= .70, 1, 0))
```

What is different about the two code snippets?

---

Getting Started task: https://rstudio.cloud/spaces/156192/project/2711839

---

# 3. Overview of core NLP ideas (10 min.)

---

## Text is everywhere

But it is often much _less-structured_ than the kind of data stored in tables we are used to working with

<img src="img/rosenberg-krist-prompt.png" width="75%" style="display: block; margin: auto;" />
 
---

## Text as data

Text can be treated as data, but this often involves:

- Transcribing the text
- Processing the text

---

## Input

To consider text as data, it is common to structure data in a _corpus_:
  - A _corpus_: text with meta-data, often but not always in a table/spreadsheet format
    
---

## Key structure

**Tokenized text**

> A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens. ([Tidy Text Mining](https://www.tidytextmining.com/tidytext.html))

| student_id|word    |
|----------:|:-------|
|     101404|so      |
|     101404|the     |
|     101404|model   |
|     101404|could   |
|     101404|explain |

---

## The creation of a document-term matrix involves processing

- creating tokens
- removing stopwords, weighting, and stemming

---

## Defining Machine Learning (ML)

- *Artificial Intelligence (AI)*
.footnote[
[1]  I feel super uncomfortable bringing AI into this, but perhaps it's useful just to be clear about terminology
]
: Simulating human intelligence through the use of computers
- *Machine learning (ML)*: A subset of AI focused on how computers acquire new information/knowledge

This definition leaves a lot of space for a range of approaches to ML

---

## A helpful frame: Supervised & unsupervised

### Supervised ML

- Requires coded data or data with a known outcome
- Uses coded/outcome data to train an algorithm
- Uses that algorithm to **predict the codes/outcomes for new data** (data not used during the training)
- Can take the form of a *classification* (predicting a dichotomous or categorical outcome) or a *regression* (predicting a continuous outcome)
- Algorithms include:
  - [Linear regression (really!)](https://web.stanford.edu/~hastie/ElemStatLearn/)
  - Logistic regression
  - Decision tree
  - Support Vector Machine

---

## What kind of coded data?

> Want to detect spam? Get samples of spam messages. Want to forecast stocks? Find the price history. Want to find out user preferences? Parse their activities on Facebook (no, Mark, stop collecting it, enough!) (from [ML for Everyone](https://vas3k.com/blog/machine_learning/))

In science education:

- Assessment data (e.g., [1](https://link.springer.com/article/10.1007/s10956-020-09895-9))
- Data from log files ("trace data") (e.g., [1](https://www.tandfonline.com/doi/full/10.1080/10508406.2013.837391?casa_token=-8Fm2KCFJ30AAAAA%3Altbc8Y8ci_z-uLJx4se9tgvru9mzm3yqCTFi12ndJ5uM6RDl5YJGG6_4KpUgIK5BYa_Ildeh2qogoQ))
- Open-ended written (text) responses (e.g., [1](https://link.springer.com/article/10.1007/s10956-020-09889-7), [2](https://doi.org/10.1007/s11423-020-09761-w))
- Achievement data (i.e., end-of-course grades) (e.g., [1](https://link.springer.com/article/10.1007/s10956-020-09888-8), [2](https://search.proquest.com/docview/2343734516?pq-origsite=gscholar&fromopenview=true))

What else?
- Drawings/diagrammatic models
- Data collected for formative purposes (i.e., exit tickets)
- ???

---

## How is this different from regression?

The _aim_ is different, the algorithms and methods of estimation are not (or, are differences in degree, rather than in kind).

In a linear regression, our aim is to estimate parameters, such as `\(\beta_0\)` (intercept) and `\(\beta_1\)` (slope), and to make inferences about them that are not biased by our particular sample.

In an ML approach, we can use the same linear regression model, but with a goal other than making unbiased inferences about the `\(\beta\)` parameters:

<h4><center>In supervised ML, our goal is to minimize the difference between a known `\(y\)` and our predictions, `\(\hat{y}\)`</center></h3>

---

## So, how is this really different?

This _predictive goal_ means that we can do things differently:

- Multicollinearity is not an issue because we do not care to make inferences about parameters
- Because interpreting specific parameters is less of an interest, we can use a great deal more predictors
- We focus not on `\(R^2\)` as a metric, but, instead, how accurately a _trained_ model can predict the values in _test_ data
- We can make our models very complex (but may wish to "regularize" coefficients that are small so that their values are near-zero or are zero):
  - Ridge models (can set parameters near to zero)
  - Lasso models (can set parameters to zero)
  - Elastic net models (can balance between ridge and lasso models)

---

## Okay, _really_ complex

- Okay, _really_ complex:
  - Neural networks
  - Deep networks
- And, some models can take a different form than familiar regressions:
  - *k*-nearest neighbors
  - Decision trees (and their extensions of bagged and random forests)
- Last, the modeling process can look different:
  - Ensemble models that combine or improve on ("boosting") the predictions of individual models

---

## A helpful frame: Supervised & unsupervised

### Unsupervised ML

- Does not require coded data; one way to think about unsupervised ML is that its purpose is to discover codes/labels
- Is used to discover groups among observations/cases or to summarize across variables
- Can be used in an _exploratory mode_ (see [Nelson, 2020](https://journals.sagepub.com/doi/full/10.1177/0049124118769114?casa_token=EV5XH31qbyAAAAAA%3AFg09JQ1XHOOzlxYT2SSJ06vZv0jG-s4Qfz8oDIQwh2jrZ-jrHNr7xZYL2FwnZtZiokhPalvV1RL2Bw)) 
- **Warning**: The results of unsupervised ML _cannot_ directly be used to provide codes/outcomes for supervised ML techniques 
- Can work with both continuous and dichotomous or categorical variables
- Algorithms include:
  - Cluster analysis
  - [Principle Components Analysis (really!)](https://web.stanford.edu/~hastie/ElemStatLearn/)
  - Latent Dirichlet Allocation (topic modeling)

---

## What technique should I choose?

Do you have coded data or data with a known outcome -- let's say about science students -- and, do you want to:

- _Predict_ how other students with similar data (but without a known outcome) perform?
- _Scale_ coding that you have done for a sample of data to a larger sample?
- _Provide timely or instantaneous feedback_, like in many learning analytics systems?

<h4><center>Supervised methods may be your best bet</center></h4>

Do you not yet have codes/outcomes -- and do you want to?

- _Achieve a starting point_ for qualitative coding, perhaps in a ["computational grounded theory"](https://journals.sagepub.com/doi/full/10.1177/0049124117729703) mode?
- _Discover groups or patterns in your data_ that may be of interest?
- _Reduce the number of variables in your dataset_ to a smaller, but perhaps nearly as explanatory/predictive - set of variables?

<h4><center>Unsupervised methods may be helpful</center></h4>

---

### 10 min. break!

---

# 4. Trying out NLP analyses, part A (30 min.)

---

## Structure

- Discussing a few examples
- Running code within breakout rooms
- Sharing what we found together
]

The _three_ segments will be organized around a recent, key paper: [Nelson et al. (2021)](https://www.jstor.org/stable/pdf/24572662.pdf?casa_token=_2BwFZHwOl0AAAAA:IQto4pK-kR-YEkU3zGahMgVI3dFQcG_TgV2TlyoTVkw4kG7xvkJPMIdLJF2T8m6bmsX9R_QFBQ53kcINy8jnIKf2keWi7PqzB5drP2Lj1JCubQLfNUoC):

1. Dictionary-based approaches
2. Topic modeling
3. Supervised machine learning

*These may be especially relevant to science education research*

<img src="img/nelson-2021.png" width="75%" />
]

---

## What we'll do in part A

Dictionary-based analyses

]

We'll do this in R through the use of the [{tidytext}](https://www.tidytextmining.com/tidytext.html) R package

Core ideas:

- tokenizing the text
- **counting up** words!
- joining another set of data with words and pre-determined codes (a _dictionary_)
- **counting up** the codes that are joined!

]
---

## But first

---

## Coding frame

```r
coding_frame <- read_csv(here("data", "coding-frame.csv"))
coding_frame %>% 
  knitr::kable()
```

---

## Outline

Sought to use machine learning (ML) to develop a construct map for _students’ consideration of generality_, a key epistemic understanding that undergirds meaningful participation in _scientific practices_.

**Example of an open-ended prompt for students**

*We will be using students' responses to these prompts in this workshop*

---
## In break-out rooms

Using the file `try-it-out-part-a.R`, work to answer one or more of the following questions:

- What do the most frequent words tell us?
- What does the dictionary-based analysis tell us?

Let's head over to the [Learning Lab in RStudio Cloud](https://rstudio.cloud/spaces/156192/projects) (and find `try-it-out-part-a.R`)!

---

# 5. Trying out NLP analyses, part B (30 min.)

---

## What well do in part B

We'll focus on topic modeling

]

We'll do this in R through the use of the [{quanteda}](https://quanteda.io/) and the [{topicmodels}](https://cran.r-project.org/web/packages/topicmodels/index.html) R packages

Core ideas include:

- Determining the value of _k_, the number of topics
- Interpreting the topics (consider doing so in light of the coding frame on an earlier slide)

]

---

## But first

The creation of a _document-term matrix_ involves processing: tokenizing and optionally removing stopwords and stemming (reducing words to root forms)

---
## In break-out rooms

Using the file `try-it-out-part-b.R`, work to answer one or more of the following questions:

- What is the best number of _k_?
- What do the topics tell us about students' responses?

Let's head over to the [Learning Lab in RStudio Cloud](https://rstudio.cloud/spaces/156192/projects) (and find `try-it-out-part-b.R`)!

---

### 10 min. break!

---

# 6. Trying out NLP analyses, part C (30 min.)

---

## What well do in part C

- Supervised machine learning

]

We'll do this in R again through the use of the [{quanteda}](https://quanteda.io/) R package

Core ideas include:

- Once we have coded data, we can use this data to _train_ a model that can help us to scale up our analysis

]

---

## But first

Generality classifier: https://faast.shinyapps.io/generality-shiny/

---

## In break-out rooms

Using the file `try-it-out-part-c.R`, work to answer one or more of the following questions:

- How well can a computer learn to apply already-existing codes/?

Let's head over to the [Learning Lab in RStudio Cloud](https://rstudio.cloud/spaces/156192/projects) (and find `try-it-out-part-c.R`)!

---

## Discussion of the three try it outs

- What worked well?
- For what purpose?

---
class: inverse, center, middle

# 7. Learning and doing more

---

## Learning more

- [tidymodels](https://www.tidymodels.org/)
- [Hands-on Machine Learning With R](https://bradleyboehmke.github.io/HOML/)
- [Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/)
- [Learning to Teach Machines to Learn](https://alison.rbind.io/post/2019-12-23-learning-to-teach-machines-to-learn/)
- [Julia Silge's blog](https://juliasilge.com/blog/)

---

## Recommendations

- Start with a problem you are facing
- Use a simple algorithm that you can understand and troubleshoot
- Be mindful of your R code; small issues (with names!) can cause difficulties
- Share and receive feedback
- Explore the variety of ways you can use machine learning; we are deciding as a field how machine learning will (or will not) make a difference in our work
- It _will be really hard at first_; you can do this!

---

# 8. Wrap-up

---

## Thank you and contact information

What questions do you have? What is next for you?
]

Joshua Rosenberg  
jmrosenberg@utk.edu  
http://joshuamrosenberg.com/  
[@jrosenberg6432](https://twitter.com/jrosenberg6432)
]