Welcome back to Posh’s AI Academy series. Our first post in this series touched on Artificial Intelligence (AI) and Machine Learning (ML). Now, let’s zoom in on a sub-area of the field called natural language processing, or NLP for short, which is basically AI/ML applied to just about any task that involves understanding or generating natural language. Usually, NLP refers to dealing with natural language as written text, and speech processing is used to refer to dealing with natural language as spoken utterances.
We’ll talk about the tasks involved in NLP, focusing on those most pertinent to building a chatbot, like our Posh chatbot, which helps credit unions answer common questions and execute banking tasks for their customers.
To build a viable chatbot, you need to be able to understand what the user says (“how do i join the credit union”) and generate the appropriate response (“Click here to learn how you can become a member…”). This corresponds to the two major sub-areas of NLP: natural language understanding (NLU) and natural language generation (NLG).
Probably the most common NLU task is classification: take a free-form text and classify it into one of a number of categories (also called classes). Chatbots need to be able to understand what a user said in order to give an intelligent response. Intent classification (also known as intent recognition) aims to categorize the user’s input into a set of known intents. In the example above, “how do i join the credit union?” would be an instance of the become-member intent. Another utterance like “transfer $50 from savings to my checking account” would be an instance of a different, transfer-money intent.
Another example of a classification task is sentiment analysis, where you take an utterance and determine if it is one of three classes: positive, negative, or neutral. This can also be useful in a chatbot! If you detect that your user is becoming angry and frustrated, for instance, you might direct them to a resource for extra help, perhaps transferring them to a human agent.
If your NLP pipeline is multilingual, you might first apply language identification to the utterance before funneling it to a specific intent classification module for the identified language. This is yet another example of text classification, where the categories are your supported languages.
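Here is a rough sketch of that routing step. It assumes a toy language identifier that counts stopword overlap, plus placeholder per-language intent classifiers. All function names and word lists are hypothetical; a production system would use trained models for both steps.

```python
# Hypothetical stopword lists; a real system would use a trained model.
STOPWORDS = {
    "en": {"the", "how", "do", "i", "to", "my", "from", "a"},
    "es": {"el", "la", "como", "de", "mi", "a", "que", "una"},
}

def identify_language(utterance):
    """Pick the language whose stopwords overlap most with the utterance."""
    words = set(utterance.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

# Placeholder classifiers, one per supported language (keyword checks only).
def classify_intent_en(utterance):
    return "become-member" if "join" in utterance.lower() else "unknown"

def classify_intent_es(utterance):
    return "become-member" if "unirme" in utterance.lower() else "unknown"

INTENT_CLASSIFIERS = {"en": classify_intent_en, "es": classify_intent_es}

def route(utterance):
    """Identify the language first, then hand off to that language's classifier."""
    lang = identify_language(utterance)
    return lang, INTENT_CLASSIFIERS[lang](utterance)

print(route("how do i join the credit union"))
print(route("como puedo unirme a la cooperativa de credito"))
```

Language identification is itself just a classifier here: its categories are the keys of `STOPWORDS`, and its output decides which downstream module sees the utterance.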
Now, let’s go back to that utterance, “transfer $50 from savings to my checking account”. Knowing that the intent is transfer-money isn’t sufficient. In order to actually execute the transaction, we need to know: how much should we transfer? From which account? To which account? This task is called entity extraction. The relevant entities are the amount to be transferred and the account names. Entity extraction is a different kind of task from plain classification, called structured prediction: instead of assigning one label to the whole utterance, the goal is to automatically identify spans of text within it that correspond to specific entity types, such as places or accounts. In this example, “$50” would be labeled as the amount, and “savings” and “checking account” would be labeled as accounts.
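One common way to represent this kind of span labeling is BIO tagging, where every token gets a tag marking the Beginning, Inside, or Outside of an entity. The sketch below only shows what the target output looks like for our example utterance. The entity type names (AMOUNT, SOURCE_ACCOUNT, TARGET_ACCOUNT) and the annotation itself are illustrative assumptions, not necessarily the labels we use in production.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = "transfer $50 from savings to my checking account".split()
# Hypothetical gold annotation for this utterance.
spans = [(1, 2, "AMOUNT"), (3, 4, "SOURCE_ACCOUNT"), (6, 8, "TARGET_ACCOUNT")]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token:10} {tag}")
```

A structured prediction model is then trained to output one tag per token, so that the two-token span “checking account” comes out as a B- tag followed by an I- tag.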
What if you want your chatbot to be able to handle voice content? Recognizing what your user says and translating it into a text representation that can go through intent classification and entity extraction is called speech-to-text (STT) or speech recognition. This is the first step in our Interactive Voice Response (IVR) bots, which allow our credit union customers to do their banking over the phone.
All of the steps above have enabled us to translate a raw utterance into something our chatbot can understand programmatically, but what about generating the bot’s response? There are two main options here. One is to use pre-configured responses, often with template slots for including information specific to users or transactions.
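As a minimal sketch of what slot-filled responses can look like, here is a version built on Python’s standard `string.Template`. The response wording, intent keys, and slot names are all made up for illustration.

```python
from string import Template

# Hypothetical pre-configured responses, keyed by intent.
RESPONSES = {
    "transfer-money": Template(
        "Okay $name, I've transferred $amount from your $source account "
        "to your $target account."
    ),
    "agent": Template("Let me connect you with a representative."),
}

def render_response(intent, **slots):
    """Fill the template for an intent with the given slot values."""
    return RESPONSES[intent].substitute(**slots)

reply = render_response(
    "transfer-money", name="Sam", amount="$50", source="savings", target="checking"
)
print(reply)
```

Because the wording lives in plain templates rather than in model weights, changing what the bot says is a matter of editing a string.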
The other is to let your AI generate your response for you. Such AI-assisted natural language generation algorithms are a hot topic of research in the field at the moment, with behemoths like Google and Facebook leading the way with their Meena and Blender chatbots respectively. These chatbot architectures usually don’t separate their NLU and NLG components. Instead, they read in the user’s message (often, including the history of previous utterances) as input and directly generate a message in response. This is an instance of a sequence-to-sequence task, often abbreviated Seq2Seq. Other tasks in this category that you may be familiar with are machine translation and text summarization (and speech recognition falls into this category as well!).
This end-to-end approach yields chatbots that are great at chit-chat, but aren’t particularly good at completing tasks. They might contradict themselves after some time as the influence of earlier utterances fades, and they tend to have a somewhat loose relationship with reality (to put it mildly!).
Imagine your banking chatbot inventing a new loan rate to offer to a customer! Or, worse yet, responding to an innocent query with offensive content (remember Microsoft Tay, anyone?). This is why Google Meena hasn’t been released to the public, and Blender is only available as parameters and code that you have to run yourself. (Another drawback: training an architecture like Google Meena costs over $1.4 million in compute time! 😬)
For all of these reasons, we’re sticking for now to more conventional approaches to language generation, which work well and have the bonus of being easily customized by our credit union customers. We keep a close eye, however, on advances in this space and are continually thinking about how to adapt them to our context.
And to round things off, voice-based bots require the additional task of translating our text response into a spoken response for our customers to listen to. This speech processing task is called text-to-speech, often abbreviated TTS, and is the last step in our NLP pipeline for our telephony bots. Pop quiz: is text-to-speech a text classification, structured prediction, or Seq2Seq task?
In our previous blog post, my colleague Dhairya talked about how AI can span everything from rules to machine learning to deep learning. What do these look like in the context of natural language processing?
Rules still have their place, though at Posh we only use them in our entity extraction module, to identify known and high-probability entities. These include currency amounts, the very standard “savings” and “checking” accounts, and custom names like “Harley Davidson account”, where we look for a string of proper names before the word “account”.
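To make the three kinds of rules concrete, here is a small regular-expression sketch. The patterns are our own illustrative guesses at what such rules could look like, not the actual rules in the Posh entity extraction module, and they would need hardening for real traffic.

```python
import re

# Currency amounts such as "$50" or "$1,250.75".
AMOUNT = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")
# Standard account names, optionally followed by the word "account".
STANDARD_ACCOUNT = re.compile(r"\b(savings|checking)(?:\s+account)?\b", re.IGNORECASE)
# Custom account names: a run of capitalized words right before "account".
CUSTOM_ACCOUNT = re.compile(r"\b((?:[A-Z][a-z]+\s+)+)account\b")

def extract_entities(utterance):
    """Return (entity_type, text) pairs found by the rules above."""
    entities = [("amount", m.group()) for m in AMOUNT.finditer(utterance)]
    entities += [
        ("account", m.group(1).lower()) for m in STANDARD_ACCOUNT.finditer(utterance)
    ]
    entities += [
        ("account", m.group(1).strip()) for m in CUSTOM_ACCOUNT.finditer(utterance)
    ]
    return entities

print(extract_entities("transfer $50 from savings to my checking account"))
print(extract_entities("move $1,250.75 to my Harley Davidson account"))
```

Rules like these are precise on predictable patterns, which is why they suit amounts and account names much better than open-ended intents.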
On the other hand, rules aren’t so great for intent classification. You could imagine maybe looking for certain words or templates, but there’s so much variation in how people can phrase things that this would become unwieldy very quickly. So instead, we turn to supervised machine learning. For this, we need a corpus of training data: example utterances, each paired with its correct intent.
For classical machine learning (ML), we need to build our own features. One simple feature type is word frequency: we split our sentences into words and count how many times each appears. Then we can apply a common ML algorithm such as logistic regression, which learns from the data which words are associated with which intents. For example, the word “transfer” might be positively associated with the “transfer-money” intent. But it’s also positively associated with the “agent” intent (think “transfer me to an agent”), and it will take more word features to disambiguate the two.
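The recipe above can be sketched in plain Python: word counts as features, and a small multiclass logistic regression trained by gradient steps. The six-utterance corpus, the learning rate, and the epoch count are all invented for illustration; in practice you would use far more data and an established ML library.

```python
import math
from collections import Counter

# Hypothetical toy corpus of (utterance, intent) pairs.
TRAIN = [
    ("how do i join the credit union", "become-member"),
    ("i want to become a member", "become-member"),
    ("transfer 50 dollars from savings to checking", "transfer-money"),
    ("move money to my checking account", "transfer-money"),
    ("transfer me to an agent", "agent"),
    ("can i talk to a representative", "agent"),
]
INTENTS = sorted({intent for _, intent in TRAIN})

def bag_of_words(text):
    """Split on whitespace and count how often each word appears."""
    return Counter(text.lower().split())

def probabilities(weights, bias, features):
    """Softmax over per-intent scores (bias plus weighted word counts)."""
    logits = {
        c: bias[c] + sum(weights[c].get(w, 0.0) * n for w, n in features.items())
        for c in INTENTS
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(z - top) for c, z in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def train(data, epochs=500, lr=0.1):
    """Fit weights with per-example gradient steps on the log-likelihood."""
    weights = {c: {} for c in INTENTS}
    bias = {c: 0.0 for c in INTENTS}
    examples = [(bag_of_words(text), label) for text, label in data]
    for _ in range(epochs):
        for features, label in examples:
            probs = probabilities(weights, bias, features)
            for c in INTENTS:
                error = (1.0 if c == label else 0.0) - probs[c]
                bias[c] += lr * error
                for w, n in features.items():
                    weights[c][w] = weights[c].get(w, 0.0) + lr * error * n
    return weights, bias

def predict(weights, bias, text):
    probs = probabilities(weights, bias, bag_of_words(text))
    return max(probs, key=probs.get)

weights, bias = train(TRAIN)
# Inspect how strongly a word pulls toward each intent.
print({c: round(weights[c].get("transfer", 0.0), 2) for c in INTENTS})
print(predict(weights, bias, "transfer me to an agent"))
```

Printing the learned weights for a word like “transfer” lets you see which intents it pulls toward, and why the model needs the surrounding words to decide between them.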
Besides ambiguity, this “bag-of-words” approach has other drawbacks: it cannot share information between words, for example by knowing that “representative” and “rep” basically mean the same thing in our environment. To this model, these words are as close as “representative” and “space”. Also, it doesn’t take into account any kind of context, including word order: “man bites dog” and “dog bites man” have the same representation.
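Both drawbacks are easy to check for yourself. In the sketch below, the three-word vocabulary and the one-hot encoding are simplifying assumptions chosen to keep the example tiny.

```python
from collections import Counter

def bag_of_words(text):
    """Split on whitespace and count how often each word appears."""
    return Counter(text.lower().split())

# Word order is lost: both sentences map to exactly the same counts.
same_representation = bag_of_words("man bites dog") == bag_of_words("dog bites man")

# Each word gets its own independent dimension (hypothetical tiny vocabulary).
VOCAB = ["representative", "rep", "space"]

def one_hot(word):
    return [1 if word == v else 0 for v in VOCAB]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

d_rep = squared_distance(one_hot("representative"), one_hot("rep"))
d_space = squared_distance(one_hot("representative"), one_hot("space"))
print(same_representation, d_rep, d_space)
```

The two distances come out identical, so nothing in this representation tells the model that “rep” is a better stand-in for “representative” than “space” is.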
While we could engineer more sophisticated features to extract such information from the text, it’s much more common nowadays to turn to deep learning (DL) techniques for NLP, such as the state-of-the-art transformer architecture. Transformers learn their own features and can capture linguistic information such as local and long-distance dependencies between words. Another advantage of such architectures is that they can first be trained on large corpora of plain text in order to learn the basic structure of English, or Swahili, or whatever language you’re working in. The pre-trained model can then be fine-tuned on your own dataset and task. This technique is called transfer learning, and it is one of the main reasons for the massive leaps NLP has made in the very recent past: it removes the bottleneck of having to come up with thousands of labeled examples before your algorithms achieve a reasonable degree of accuracy. Needless to say, natural language processing is an exciting field and we’re lucky to be working in it!
Thanks for paying attention to the very end! I hope this has given you insight into what NLP tasks chatbots solve and how different approaches like rules, ML, and DL can solve them.
If you’d like to learn even more, our very own Head of Customer Success, Keith Galli, recently gave a 1.5-hour tutorial on natural language processing at PyCon 2020. There are so many great resources out there, but some we have found especially useful are: