Natural Language Processing

[Part 2] Ticket Tagging with BERT — NLP Text Classification

Julián Gutiérrez Ostrovsky
6 min read · Dec 14, 2020

Trying a contextual model for a text classification problem

Motivation

In the previous post, we took a dataset with all the tickets logged with our IT Support Team, and our goal was to create two context-free linear models to predict two different targets. One target was binary, the other had multiple classes. We got some decent results, but taking a deeper look into the data we also spotted some problems. So…

What if we try a contextual, more powerful model, built especially for NLP? Will we get better results, or just overfit our model? Will this powerful model be able to deal with the problems we found in our data, or is that just not possible without special custom-made cleaning?

What if we try BERT?

A little bit about BERT

BERT was open-sourced in 2018 as a new deep learning technique for NLP. It has two key ideas:

  • First, it’s bidirectional (or actually, non-directional). Before BERT, a language model would look at a text sequence during training either left-to-right or as a combination of left-to-right and right-to-left. This is a one-directional approach and it does perform well: it predicts the next word in a sentence given the previous words. For example, in the sentence “I want to play guitar with my musician friends”, if we wanted to predict the word guitar, we could only consider “I want to play”. BERT, on the other hand, uses the full sentence for the prediction. The target word is replaced with a [MASK] token, and that’s why it’s non-directional (there’s a small code sketch of this right after the list).

“I want to play [MASK] with my musician friends”.

  • Second, it’s based on Transformers. A Transformer works by performing a small, constant number of steps. In each step, it applies an attention mechanism to understand relationships between all words in a sentence, regardless of their respective position. For example, given the sentence, “I arrived at the bank after crossing the river”, to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately pay attention to the word “river” and make this decision in just one step.
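Coming back to the [MASK] example from the first point, here’s a tiny sketch (not from the original notebook) that asks a public BERT checkpoint to fill in the blank using Hugging Face’s fill-mask pipeline; bert-base-uncased is just an example model, not the one we use later.

# A minimal sketch of BERT's masked-word prediction using the
# Hugging Face pipeline API; bert-base-uncased is an assumed example model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT looks at the words on both sides of [MASK] when scoring candidates.
for prediction in fill_mask("I want to play [MASK] with my musician friends."):
    print(prediction["token_str"], round(prediction["score"], 3))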

Now we’re in a position to understand that BERT is short for Bidirectional Encoder Representations from Transformers.

Of course, this is just an overview of BERT, since we’re not going to dive into its architecture. For more details about BERT, check the References section at the bottom.

Keeping it simple with Hugging Face

To develop our models, we’re going to use pretrained models based on BERT.

These models are all trained from BERT for specific NLP purposes and on different corpora. This is really helpful for many reasons. First, we don’t have to reinvent the wheel to tackle common problems. Also, we probably don’t have corpora that big to achieve similar results, not to mention the computing power and time (!) needed to create these huge models.

Hugging Face provides community transformers that are state of the art for NLP purposes. These transformers are backed by PyTorch and TensorFlow and provide a single interface for a quick implementation.

We can surf through all of its models, and we should choose the one that best fits our problem. This will be the biggest challenge for us when using Hugging Face, so it’s really important to know all the features of our corpus in order to find the best transformer. Even though it’s not exactly straightforward in every case, it’s really easy to switch between transformers. Once we have our pretrained model, we will fine-tune it to our dataset. For this, we will add the special BERT tokens to indicate where our text begins and ends, and provide the correct tag. By training this new model, it will learn our dataset and be able to generate predictions.

To start working with Hugging Face transformers:
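A minimal setup sketch (the original snippet lives in the linked notebook; the checkpoint name here is just an example loaded through the generic Auto classes, and the sample ticket text is made up):

# pip install transformers tensorflow
from transformers import AutoTokenizer, TFAutoModel

checkpoint = "roberta-base"  # any model id from the Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = TFAutoModel.from_pretrained(checkpoint)

# Tokenize a single ticket to check that everything loads correctly.
inputs = tokenizer("No puedo acceder al sistema de facturación", return_tensors="tf")
outputs = encoder(inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)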

Training With TensorFlow and Hugging Face

Let’s first recap a little bit about our dataset. It’s written mainly in Spanish, but it has some English words, and some Spanglish words as well, to be fair. It has nearly 14k training rows and it’s really dirty, with special characters, HTML tags, accents, etc.

Sorry for the blur, it’s confidential information

So, let’s do some cleaning first…
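As a rough sketch of that cleaning (the real code is in the linked notebook), something like this handles the HTML tags, special characters and accents mentioned above:

import re
import unicodedata

def clean_ticket(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML tags
    text = unicodedata.normalize("NFKD", text)        # split accents from letters
    text = text.encode("ascii", "ignore").decode()    # drop the accent marks
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)       # remove special characters
    return re.sub(r"\s+", " ", text).strip().lower()  # collapse whitespace

print(clean_ticket("<p>No puedo acceder al módulo de facturación!!</p>"))
# -> "no puedo acceder al modulo de facturacion"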

OK, now it’s time to use BERT through Hugging Face. For this particular job, we tried a couple of models and kept the best one:

roberta-base gave us the best results for the same training setup.

Once we have our model and our dataset ready, we need to encode the dataset to add the special tokens and vectorize it. We can do this manually, adding the tokens to every input and then calling the tokenizer’s encode method. Alternatively, Hugging Face provides batch_encode_plus, which automatically performs tokenization and vectorization, inserting the correct tokens for the selected model.

batch_encode_plus does tokenization and vectorization for us
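For reference, a call like this encodes the whole set in one go (assuming the tokenizer loaded earlier and a list of cleaned ticket strings called train_texts; max_length is just an illustrative value):

encodings = tokenizer.batch_encode_plus(
    train_texts,                 # list of cleaned ticket strings
    add_special_tokens=True,     # <s> ... </s> for RoBERTa, [CLS] ... [SEP] for BERT
    max_length=128,
    padding="max_length",
    truncation=True,
    return_tensors="tf",
)
print(encodings["input_ids"].shape)       # (number_of_tickets, 128)
print(encodings["attention_mask"].shape)  # (number_of_tickets, 128)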

The next steps include splitting our dataset into training, validation and test sets. We’re also going to create a Keras training model, so we need to convert the input into TF Datasets. The first Keras layer in the model will be fed with the embeddings from our BERT encoder. After that, we can add as many layers as our model needs, with normalizations and layer activations. For more details, check Keras’ official page in the References section. With that done, we’ll have a compiled model to train.
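A sketch of that setup, assuming the encodings and integer train_labels from the previous steps (the number of classes, layer sizes, dropout, sequence length and learning rate are illustrative choices, not necessarily the ones in the original notebook):

import tensorflow as tf
from transformers import TFAutoModel

NUM_CLASSES = 5  # assumed number of ticket tags
encoder = TFAutoModel.from_pretrained("roberta-base")

# Wrap the encoded inputs and labels into a TF Dataset for Keras.
train_ds = tf.data.Dataset.from_tensor_slices((dict(encodings), train_labels)).batch(32)

input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name="attention_mask")

# Feed the first Keras layers with the sentence embedding coming out of the
# BERT encoder (the hidden state of the first special token).
embeddings = encoder(input_ids, attention_mask=attention_mask)[0][:, 0, :]
x = tf.keras.layers.Dropout(0.2)(embeddings)
x = tf.keras.layers.Dense(128, activation="relu")(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds, epochs=30)  # validation_data would come from the validation split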


Results

After training for 30 epochs, we got the following results:

Between the two gray dashed lines lies the best fit for our model, and we can see that the training accuracy is ~0.90 and ~0.88 for the validation set. Considering the problems our dataset already has (check out the “It was always about the data” section in part one), the score is more than acceptable, and it’s around 10% better than what we achieved with our previous linear model.

To finish, these are the final results on a test set totally unseen by the model:

For the test set we only have 5 classes; it was the only unseen data we could get. And we can still see that we have problems predicting classes 4 and 5 (as described in part one).
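One common way to get per-class numbers like these on the test set (not necessarily how the original notebook produces them) is sklearn’s classification_report, assuming test_ds and test_labels are built the same way as the training split:

import numpy as np
from sklearn.metrics import classification_report

# Predict class probabilities on the unseen test set and report per-class metrics.
test_probs = model.predict(test_ds)
test_preds = np.argmax(test_probs, axis=1)
print(classification_report(test_labels, test_preds))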

To summarize, this is by far a more powerful model and it does a much better job than a linear one. Still, the inconsistencies in our data must be fixed in order to improve results even more, and to have some hope of predicting the minority classes.

This will be a future work to do, so stay tuned.

Love and understand your data in order to achieve great results

Here is the full notebook script. Feel free to share any thoughts with us.

References

