Flagging toxic comments with Tidytext and Keras

by Gordon Shotwell

I recently took a look at a Kaggle competition about automatically labeling toxic comments on Wikipedia. I thought it would be a good opportunity to showcase how well the tidytext and keras R packages work together. The data includes a lot of really troubling language, including racial slurs, misogynistic language, and homophobic content, so I decided not to include most of the exploratory data analysis because this is a family blog.

The task here is to determine how likely a string is to carry a particular set of labels. We can take a look at the data:

library(tidyverse)
library(tidytext)
train <- read_csv("train.csv")
train$id <- as.factor(train$id)
train$comment_text[1:5]
## [1] "Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"                                                                                                                                                                                                                                                                                                                                                                               
## [2] "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [3] "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info."                                                                                                                                                                                                                                                                                                                                                                                                               
## [4] "\"\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of \"\"types of accidents\"\"  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport  \""
## [5] "You, sir, are my hero. Any chance you remember what page that's on?"

This is a multi-label classification problem. Each string can have up to six labels, and there is some likelihood that the labels are correlated with one another. For instance, if a string is labeled severe_toxic, we would expect it to also be labeled toxic. The only information we have about the comments is the text itself, and it looks like the text includes some uninformative content like time-stamps and formatting characters.

labels <- train %>% select(id, toxic:identity_hate) 
labels  %>%
    gather(label, value, -id) %>% 
    group_by(label) %>% 
    summarize(n_labeled = sum(value)) %>% 
    mutate(percent_labeled = n_labeled / nrow(train)) %>% 
    arrange(desc(percent_labeled))
## # A tibble: 6 x 3
##   label         n_labeled percent_labeled
##   <chr>             <int>           <dbl>
## 1 toxic             15294         0.0958 
## 2 obscene            8449         0.0529 
## 3 insult             7877         0.0494 
## 4 severe_toxic       1595         0.01000
## 5 identity_hate      1405         0.00880
## 6 threat              478         0.00300
labels %>% select(-id) %>%  cor(use = "complete.obs")
##                   toxic severe_toxic   obscene    threat    insult
## toxic         1.0000000    0.3086191 0.6765145 0.1570584 0.6475181
## severe_toxic  0.3086191    1.0000000 0.4030145 0.1236013 0.3758072
## obscene       0.6765145    0.4030145 1.0000000 0.1411790 0.7412724
## threat        0.1570584    0.1236013 0.1411790 1.0000000 0.1500224
## insult        0.6475181    0.3758072 0.7412724 0.1500224 1.0000000
## identity_hate 0.2660094    0.2016002 0.2868669 0.1151283 0.3377363
##               identity_hate
## toxic             0.2660094
## severe_toxic      0.2016002
## obscene           0.2868669
## threat            0.1151283
## insult            0.3377363
## identity_hate     1.0000000

A couple of things stand out here. First, most of the labels are fairly uncommon, and some, like the threat label, are extremely rare. This is going to make our modelling efforts more difficult because we might not have enough data to apply some kinds of classifiers. Second, the labels are well correlated with one another, which suggests that we probably want to model them together in some way.

Text preparation

The first step for looking at the actual text is to split up the strings into words and then remove stop words. Stop words like “the” and “a” are unlikely to provide much information about the toxicity of the comment, so we can take them out.

txt <- train %>% 
    select(id, comment_text) %>% 
    unnest_tokens(word, comment_text) %>% 
    anti_join(stop_words, by = "word") 

txt <- labels %>% 
    left_join(txt, by = "id")

# strip punctuation and digits from the tokens
txt$word <- str_replace_all(txt$word, "[[:punct:]]", "") %>% 
    str_replace_all("[[:digit:]]", "")
txt_freq <- txt %>% 
    group_by(word) %>% 
    tally(sort = TRUE) 
head(txt_freq)
## # A tibble: 6 x 2
##   word           n
##   <chr>      <int>
## 1 ""        142747
## 2 article    55934
## 3 page       46207
## 4 wikipedia  36666
## 5 talk       32578
## 6 edit       18242

Somewhat unsurprisingly, the most common words are related to Wikipedia editing, such as “article”, “page”, and “talk”. When you take a look at the individual labels, you can see that their top words are quite a bit more offensive.

The last thing I want to do in terms of data preparation is to stem the words, that is, replace the various forms of a word with a single token (for example, “win”, “winning”, and “winner” are all replaced with the single stem “win”). The assumption behind this process is that all of those forms have more or less the same meaning, so we can collapse them into one token without losing much information. This serves as a kind of dimensionality reduction for text classification problems because you end up with one variable per word stem instead of one per inflected form.

pre_stem <- length(unique(txt$word))
txt$word <- SnowballC::wordStem(txt$word) 
stem_diff <- pre_stem - length(unique(txt$word))

By stemming the text we’ve reduced the number of dimensions by about 4000.
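
To make the stemming step concrete, here is a small illustrative call (the word list is just an example, not drawn from the data):

SnowballC::wordStem(c("edit", "editing", "edited", "edits"))
# all four forms should collapse to the single stem "edit"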

Classifying the text

We have two basic options for classifying multi-label data: we can either use a classifier like a neural network, which can naturally classify records into more than one category, or we can turn the multi-label classification problem into a discrete classification problem. For instance, we could create a new factor variable with one category for each possible combination of labels:

label_cat <- train %>% 
    select(id, toxic:identity_hate) %>% 
    gather(label, value, -id) %>%
    filter(value == 1) %>% 
    group_by(id) %>% 
    summarize(label_cat = paste(label, collapse = "|")) 

label_cat %>% 
    group_by(label_cat) %>% 
    tally() %>% 
    arrange(n)
## # A tibble: 40 x 2
##    label_cat                                    n
##    <chr>                                    <int>
##  1 toxic|severe_toxic|threat|identity_hate      1
##  2 toxic|severe_toxic|threat|insult             1
##  3 obscene|threat                               2
##  4 obscene|threat|insult                        2
##  5 obscene|identity_hate                        3
##  6 threat|insult                                3
##  7 toxic|severe_toxic|identity_hate             3
##  8 toxic|threat|insult|identity_hate            3
##  9 toxic|severe_toxic|obscene|threat            4
## 10 toxic|severe_toxic|obscene|identity_hate     6
## # ... with 30 more rows

This technique might work for this kind of problem because the number of labels is relatively small, but since the number of possible label combinations grows exponentially with the number of labels (2^n combinations for n binary labels), it can quickly become unmanageable as the number of labels increases. The bigger issue for this approach is that the data we’re working with doesn’t have that many labeled entries, so some of the combined categories have only one or two observations. For instance, any algorithm trying to learn to identify the obscene|threat label would be doing so based on just 2 observations. That’s just not enough information to make a reliable prediction.
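
To put a number on that, a quick check using the label_cat table built above:

# six binary labels allow 2^6 - 1 = 63 distinct non-empty label patterns
2^6 - 1
## [1] 63
# but only 40 of those patterns actually occur in this data
n_distinct(label_cat$label_cat)
## [1] 40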

Another approach is to fit a distinct classifier to each label and then generate the predicted labels independently of one another. This approach is appealing because it’s fairly simple and will probably cope better with the sparse labels; the problem is that it assumes the labels are independent of one another, when in fact we know they’re probably highly correlated.

To train these classifiers we need to create a bag-of-words data frame that has the distinct words in the columns and the number of occurrences of each stemmed word in the cells. We can also reserve 30% of the data at this point as a test set. Some experimentation revealed that my laptop isn’t quite up to the full problem, so I’m going to filter the data down to the most common words before sending it through the classifier. One important detail is that I’m selecting the 250 most common words for each label, to make sure we get some identifying words for every category.

# the label columns are already attached to txt from the earlier join, so no further join is needed

common_words <- txt %>% 
    ungroup() %>% 
    gather(label, value, -id, -word) %>% 
    filter(value == 1) %>% 
    filter(word != "id") %>% 
    filter(nchar(word) > 1 & nchar(word) < 40) %>% 
    group_by(label, word) %>% 
    summarize(n = n()) %>% 
    arrange(desc(n)) %>% 
    top_n(250) %>% 
    pull(word) %>% 
    unique()
## Selecting by n
tf_df <- txt %>%  
    ungroup() %>% 
    select(id, word) %>% 
    filter(word %in% common_words) %>% 
    group_by(id, word) %>% 
    tally() %>% 
    spread(word, n, fill = 0) 

response <- labels %>% 
    filter(id %in% tf_df$id)

train_idx <- sample(c(TRUE, FALSE), nrow(tf_df), replace = TRUE, prob = c(0.7, 0.3))
x_train <- tf_df[train_idx, common_words]
x_test  <- tf_df[!train_idx, common_words]

y_train <- response[train_idx, ]
y_test  <- response[!train_idx, ]

Building the classifier

I’m going to use a logistic regression model with a Lasso penalty mainly because I have a suspicion that the true model is sparse – that most of these words are probably not relevant to whether the text should be labeled one way or another. If that’s the case, we want to have a model that does some feature selection to identify which words are important. The Lasso penalty tends to push some of the coefficients to zero, which eliminates irrelevant words from the model.

library(glmnet)
library(doParallel)
registerDoParallel()  # cv.glmnet only runs in parallel when a backend is registered

# glmnet wants a (sparse) matrix rather than a data frame
x_train <- Matrix(as.matrix(x_train), sparse = TRUE)

# fit one cross-validated Lasso logistic regression per label
models <- y_train %>% 
    ungroup() %>% 
    select(-id) %>% 
    map(~cv.glmnet(x_train, ., 
        nfolds = 5, 
        family = "binomial", 
        type.measure = "auc", 
        parallel = TRUE,
        maxit = 1e4))

We now have a list of models, one for each of the response variables. We can look at a couple of the AUC plots to see how the model fits.

walk(c("toxic", "insult", "threat"), ~{
    plot(models[[.]])
    title(.)})

These plots aren’t exactly what we were hoping for. The model does a fairly good job identifying the toxic and insult labels, but fails to converge for the threat label. This is especially disappointing because fitting all of these models took forever, and we ended up with something that isn’t at all useful for making predictions.
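
For the labels that did fit, we can at least check the sparsity story by pulling out the words whose coefficients survived the Lasso penalty. This is just a sketch, assuming the toxic fit completed cleanly:

# words with nonzero coefficients for the "toxic" label at the CV-chosen lambda
toxic_coefs <- as.matrix(coef(models$toxic, s = "lambda.min"))
kept_words <- setdiff(rownames(toxic_coefs)[toxic_coefs[, 1] != 0], "(Intercept)")
length(kept_words)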

I have four main ideas for what to do at this point:

  1. Try to solve an easier problem. The model is having trouble identifying the rare categories, and we could probably do a better job if we rolled those categories up into larger groups, for instance by combining identity_hate, threat, and severe_toxic. Maybe if we started from that easier problem, a solution to the harder one would become apparent.

  2. Fiddle with the existing model. I made a lot of concessions because of the time I had to work on this and my available computing power, but we could probably get a better model by including more words, increasing the number of glmnet iterations, or using something other than raw word counts as predictor variables, for instance by weighting the counts with term frequency (see the sketch after this list).

  3. Investigate contextual models. The model we’re using here is based purely on word frequency, which says nothing about how words are connected to one another. With all language, and especially hate speech, context is important information that we should probably take into account. For instance, the word “cow” could be innocuous in one context but highly offensive in another. To include this information we could do something as simple as counting n-grams instead of words (also sketched after this list), but some kind of neural network is probably the best way to capture contextual information.

  4. Try a classifier chain. A classifier chain is just like the approach I took here, except that the output of one model is fed in as a predictor for the next. This is a good option when outcome variables are correlated: if a string is likely to be toxic, it is also more likely to be severely toxic. Unfortunately, the labels I’m having trouble with aren’t that well correlated with the other labels, so a classifier chain probably wouldn’t help much with those categories.
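
For reference, here are rough sketches of ideas 2 and 3 using tidytext. Neither is run here, and the object names are just placeholders:

# idea 2: weight words by tf-idf instead of raw counts
word_weights <- txt %>% 
    count(id, word) %>% 
    bind_tf_idf(word, id, n)

# idea 3: tokenize into bigrams instead of single words to capture a little context
bigrams <- train %>% 
    select(id, comment_text) %>% 
    unnest_tokens(bigram, comment_text, token = "ngrams", n = 2)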

Neural Networks to the rescue

I’ve always had quite a lot of imposter syndrome about statistical methods. I studied philosophy and law in school, which didn’t include things like vector calculus or probability, and I feel especially nervous around neural networks because they are surrounded by equal proportions of hype, integrals, and unfriendly mathematicians. Despite this, I’m going to try option #3 and see how deep learning does on this problem. There are two main reasons I think this approach is the most promising.

First, neural networks can naturally output multi-label classifiers. I had to do a lot of gymnastics to try to get a set of binary classifiers to output overlapping labels, but all you have to do to get multi-label classification out of a neural network is to add nodes to the output layer. This is also great because neural networks learn all the labels at once rather than piecemeal.

Second, neural networks are really good at learning from context. For instance, if you want to create a binary classifier which considers the interaction of several words, you typically have to create a new variable to include in the model. Neural networks can build up these interactions on their own without requiring the analyst to recode the data ahead of time. The way I think about this is that you need to explicitly test your suspicions in a traditional classifier, but neural networks can generate their own suspicions.

To start with, I’m going to consider a few more words: the top 500 from each label instead of the top 250.

common_words <- txt %>% 
    ungroup() %>% 
    gather(label, value, -id, -word) %>% 
    filter(value == 1) %>% 
    filter(word != "id") %>% 
    filter(nchar(word) > 1 & nchar(word) < 40) %>% 
    group_by(label, word) %>% 
    summarize(n = n()) %>% 
    arrange(desc(n)) %>% 
    top_n(500) %>% 
    pull(word) %>% 
    unique()
## Selecting by n

Next we need to encode this data in a way that the network can understand. I’m going to use a one-hot matrix where each row is an observation, and each column is a word. The entries of the matrix are 1 if the observed string has that word, and 0 otherwise. Keras has some great tools for generating these matrices in a more efficient manner, but I’m going to use tidyr because it’s a bit more transparent.

x <- txt %>% 
    select(id, word) %>% 
    filter(word %in% common_words) %>% 
    group_by(id, word) %>% 
    summarize(one_hot = 1) %>%
    spread(word, one_hot, fill = 0)
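
For reference, the more efficient keras route mentioned above would look roughly like this. This is only a sketch and isn’t used below, and num_words = 2000 is an arbitrary cap on the vocabulary size:

library(keras)
# build a binary (one-hot) document-term matrix straight from the raw comments
tokenizer <- text_tokenizer(num_words = 2000) %>% 
    fit_text_tokenizer(train$comment_text)
x_keras <- texts_to_matrix(tokenizer, train$comment_text, mode = "binary")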

We can now create training and test sets for both the one-hot matrix and the matrix of labels. The output matrix has the same number of rows as our observations and one column per possible label.

# TRUE marks rows that go into the training set
train_idx <- sample(c(TRUE, FALSE), nrow(x), replace = TRUE, prob = c(.7, .3))

x_train <- x[train_idx, ] %>% 
    ungroup() %>% 
    select(-id) %>% 
    as.matrix()
# this assumes labels and x are sorted in the same id order, so that rows line up
y_train <- labels %>% 
    filter(id %in% x$id) %>% 
    select(-id) %>% 
    .[train_idx, ] %>% 
    as.matrix()

x_test <- x[!train_idx, ] %>% 
    ungroup() %>% 
    select(-id) %>% 
    as.matrix()
y_test <- labels %>% 
    filter(id %in% x$id) %>% 
    select(-id) %>% 
    .[!train_idx, ] %>% 
    as.matrix()

Now we’re ready to construct the network. I’m using the wonderful new Keras package from RStudio, and mostly default values for the training parameters.

library(keras)
tox_mod <- keras_model_sequential() %>% 
    layer_dense(units = 16, activation = "relu", input_shape = ncol(x_train)) %>% 
    layer_dropout(rate = 0.3) %>% 
    layer_dense(units = 16, activation = "relu") %>%
    layer_dropout(rate = 0.2) %>% 
    layer_dense(units = 16, activation = "relu") %>% 
    # one sigmoid node per label, so each label gets its own predicted probability
    layer_dense(units = ncol(y_train), activation = "sigmoid")

val_indices <- sample(1:nrow(x_train), 20000)

x_val <- x_train[val_indices,]
partial_x_train <- x_train[-val_indices,]

y_val <- y_train[val_indices,]
partial_y_train <- y_train[-val_indices,]

tox_mod %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history <- tox_mod %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)
plot(history)

This plot shows that the model improves with each epoch and doesn’t appear to be over-fitting. It also more or less plateaus after about the fifth epoch, so we could cut off training there if we were worried about computational resources. Importantly, I’m actually getting something that can make predictions, which is a big step up from my first attempt.
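
If we wanted to stop training automatically once the validation loss plateaus, keras provides an early-stopping callback. A sketch (the patience value of two epochs is an arbitrary choice):

history_es <- tox_mod %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val),
  # stop once val_loss has failed to improve for two consecutive epochs
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 2))
)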

Finally, we can generate predictions to check the model against the held-out test set. The Kaggle competition asked for a mean log-loss score, which we can compute easily using purrr and the MLmetrics package.

preds <- tox_mod %>% 
    predict(x_test) %>% 
    as.data.frame()
y_test <- as.data.frame(y_test)
purrr::map2_dbl(preds, y_test, ~MLmetrics::LogLoss(.x,.y)) %>% 
    mean()
## [1] 0.06922315

The leaderboard for the Kaggle competition tells me that the best mean log-loss score is 0.052, and while that’s quite a bit lower than my score, the fact that it’s in the same ballpark is encouraging. The network I used here is very simple, with just three hidden layers of 16 units each, but it performs quite well. The next step for this problem would be to add some complexity to the model by adding layers or nodes (one such variant is sketched below).
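
As a starting point, here is one untested variant with wider layers; the unit counts and dropout rates are arbitrary guesses rather than tuned values:

tox_mod_big <- keras_model_sequential() %>% 
    layer_dense(units = 64, activation = "relu", input_shape = ncol(x_train)) %>% 
    layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 32, activation = "relu") %>% 
    layer_dropout(rate = 0.3) %>% 
    layer_dense(units = 32, activation = "relu") %>% 
    layer_dense(units = ncol(y_train), activation = "sigmoid")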

Conclusion

The big thing I learned through this process is not to be afraid of state-of-the-art techniques. A lot of the time these techniques were developed because traditional approaches were hard to apply to certain problems; in other words, they were created to make those problems simpler rather than more complicated. In this case, the approach I was comfortable with (Lasso regression) was tricky to apply to the multi-label classification problem, while a neural network both performed better and was actually easier to understand.
