Tien Nguyen

NLP - Transformers

Updated: 2 days ago


In recent years, the Transformer has become the basic building block of many state-of-the-art natural language processing (NLP) models. Like recurrent neural networks (RNNs), the Transformer is a high-performing model that has proven useful for everyday NLP tasks such as intent recognition in a search engine, text generation in a chatbot engine, and classification. This blog explores the difference between the RNN and Transformer architectures, as well as how the Transformer can be applied to build a chatbot engine.


A recurrent neural network (RNN) processes a sequence one time step at a time: at every time step t, it makes a prediction based on the current input X(t) and the previous internal state h(t-1). Unlike in a standard feedforward network, each hidden state depends on the hidden state before it. Because the same weights are shared across time, an RNN behaves like a state machine whose actions depend on the sequence it has seen so far. For example, an RNN can be trained on a sequence of characters to predict the next character.
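As a minimal sketch of this recurrence, the snippet below computes h(t) = tanh(W_x X(t) + W_h h(t-1) + b) with plain Python lists. The weight names W_x, W_h and b, the tanh activation, and the toy numbers are illustrative assumptions, not taken from a specific library or from the original post.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """Return h(t) = tanh(W_x x(t) + W_h h(t-1) + b) using plain lists."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(inputs, W_x, W_h, b):
    """Feed a sequence through the RNN, reusing the same weights at every step."""
    h = [0.0] * len(b)          # initial state h(0)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states

# Toy example: 1-dimensional inputs, 2-dimensional hidden state (made-up weights).
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]
states = run_rnn([[1.0], [0.0], [-1.0]], W_x, W_h, b)
```

Each entry of `states` depends on all earlier inputs only through the previous hidden state, which is the sense in which the activation is fed back from one time step to the next.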

RNN - The activation at each time step is fed back to the next time step

For many years, the RNN and its gated variants were the most popular architectures used for NLP. However, one of the main problems with the RNN is the vanishing gradient problem: as the gradient is propagated back through many time steps it approaches zero, which prevents the RNN from learning further.
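The effect can be illustrated with a toy scalar RNN, h(t) = tanh(w * h(t-1) + x(t)). By the chain rule, the gradient of the last state with respect to the first is a product of one factor per time step, and each factor here has magnitude below 1. The recurrence, the weight value and the inputs are assumptions chosen only for illustration.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h(T) / d h(0) for the scalar RNN h(t) = tanh(w * h(t-1) + x(t))."""
    h = h0
    grad = 1.0
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        # Chain rule: d h(t) / d h(t-1) = w * (1 - tanh^2(...)) = w * (1 - h^2)
        grad *= w * (1.0 - h * h)
    return grad

grad_5 = gradient_through_time(0.5, [0.1] * 5)     # short sequence
grad_50 = gradient_through_time(0.5, [0.1] * 50)   # long sequence
```

With w = 0.5, every factor is at most 0.5, so the gradient over 50 steps is many orders of magnitude smaller than over 5 steps, and the earliest inputs barely influence learning.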

Another problem with the RNN is its short-term memory. As mentioned above, the RNN internal state retains information through time. The current hidden state is calculated using the information from the previous time step’s hidden state and the current input. However, the internal state has a fixed size, so when the input sequence is long, it becomes difficult for the RNN to remember the long-range context, hence short-term memory.

A solution to the above problems is the Transformer model, which has an encoder-decoder architecture with an attention mechanism.


The Transformer combines the encoder-decoder architecture and the attention mechanism in a single model. Instead of a fixed context vector, the Transformer builds a context vector filtered specifically for each output step.

The encoder-decoder model was introduced in Sutskever et al.’s paper “Sequence to Sequence Learning with Neural Networks” and Cho et al.’s paper “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.” The encoder is responsible for encoding the entire temporal input sequence into a fixed-length context vector. The decoder, in turn, is responsible for decoding the context vector step by step into a variable-length sequence.

The attention mechanism was introduced in Bahdanau et al.’s paper “Neural Machine Translation by Jointly Learning to Align and Translate.” The attention mechanism allows the model to focus on the most relevant information in the input sequence when producing each output.

Transformer Encoder with Attention Layer

The Transformer encoder consists of six identical encoder layers, each with an attention layer and a feedforward layer stacked together. Each word in the input sequence is converted to a 512-dimensional word embedding vector and serves as a query, a key, and a value. The Transformer encoder then computes the similarity between every query and every key across the whole input sequence at once, followed by a fully connected feedforward layer. The output is an n x 512 matrix, where n is a predefined length for the input sequence.
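The query-key similarity step can be sketched as scaled dot-product self-attention. This is a simplified illustration under stated assumptions: the learned query, key and value projections, the multiple attention heads and the feedforward layer are all omitted, so each embedding is used directly as query, key and value, and the 4-dimensional toy vectors stand in for the 512-dimensional embeddings.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """X is a list of n embedding vectors; each one acts as query, key and value."""
    d = len(X[0])
    outputs = []
    for q in X:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)]
        outputs.append(out)
    return outputs

# Three toy "word embeddings" of dimension 4 (made-up values).
X = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
Y = self_attention(X)
```

The output `Y` has the same shape as the input (n vectors of dimension d), which matches the n x 512 output described above when d is 512.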

Transformer Encoder with Attention Layer (Source)

Transformer Decoder with Attention Layer

The Transformer decoder also has six identical decoder layers, each with a masked attention layer, an attention layer over the encoder output, and a feedforward layer stacked together. The Transformer decoder’s input is a combination of two different sources. The key and value inputs come from the Transformer encoder output, while the query input comes from the output sequence predicted so far. The masked attention layer hides the positions that come after the current one, so that their attention weights are zero and the Transformer decoder cannot see future information during training. Like the Transformer encoder, the Transformer decoder computes the similarity between the queries and the keys for the entire sequence at once, followed by a fully connected feedforward layer.
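A minimal sketch of the masking idea, assuming the usual causal-mask formulation: before the softmax, the score for every future position is replaced with negative infinity, so its attention weight becomes exactly zero. The function name and the uniform toy scores are illustrative assumptions.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the similarity of query i with key j.

    Position i may only attend to positions j <= i; later positions are masked.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)                            # always finite: j == i is kept
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform scores make the effect of the mask easy to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
```

With uniform scores, the first position attends only to itself, the second splits its attention evenly over the first two positions, and so on, so no row places any weight on a later position.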

The Transformer encoder-decoder output is referred to as a hidden state output. This hidden state output is fed into additional fully connected layers depending on the application before exiting the model as a prediction Y(t).

Transformer Decoder with Attention Layer (Source)


A chatbot application is an artificial intelligence program that can answer questions automatically. Amazon Alexa and Google Assistant are some of the most popular chatbot applications in the market. Chatbot applications are a great way to engage customers and collect information that is valuable for the business.

For years, creating a chatbot was a tedious process that would take months or even years to complete. The engineering team needed to design the intent scenarios and then write thousands of answers to cover those scenarios. Usually, intents are customers’ inquiries or other essential conversation topics. Intent questions are sets of similar, easy-to-answer questions, such as “Are you open?” or “Can I schedule an appointment for next Thursday?”

With the recent progress in NLP deep learning, Google’s Dialogflow helps eliminate much of this manual work and makes it possible to build a far more powerful conversational chatbot in a matter of hours. Google’s Dialogflow is a no-coding web application that even non-developers can use. Dialogflow uses a large-scale pre-trained language model such as a GPT model or a masked language model.

GPT is a Transformer-like model that processes the input text left-to-right to predict the next word from the previous context. The masked language model is also a Transformer-like model, such as BERT or ALBERT, which predicts and identifies a small number of words that have been masked out in the input sequence. The difference is that GPT is a unidirectional model and the masked language model is a bidirectional model. The bidirectional model tries to understand the context from both the left and right of the masked token, instead of one side like the unidirectional model.
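The difference between the two objectives can be sketched by building training examples from the same sentence. The helper names and the `[MASK]` placeholder string below are illustrative assumptions, and real models operate on subword token IDs instead of whole words.

```python
def causal_lm_examples(tokens):
    """Left-to-right objective: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, masked_positions, mask_token="[MASK]"):
    """Masked objective: hide some tokens and predict them from both sides."""
    inputs = [mask_token if i in masked_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets

tokens = ["can", "i", "schedule", "an", "appointment"]

# Unidirectional: the context never includes words to the right of the target.
causal = causal_lm_examples(tokens)

# Bidirectional: the model sees "can i" and "an appointment" around the mask.
masked_inputs, masked_targets = masked_lm_example(tokens, {2})
```

In the causal case, the target "schedule" is predicted from `["can", "i"]` alone, whereas in the masked case the words on both sides of the hidden position are available.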

Language Modelling vs. Masked Language Modelling

Training a language model from scratch is an expensive operation. Collecting and processing the training data alone could take weeks or even months, and that time does not include the actual training and fine-tuning of the model to meet the end goal. Fortunately, Dialogflow uses transfer learning to speed up the training process. Google provides a pre-trained model, such as BERT or ALBERT, that can generate an answer for a given input question; these models are similar to the base Transformer described in the section above.

These models are called encoder or bidirectional models, which use the context on both sides of a masked word to predict it. The designer only needs to create the intent contexts for their application and provide a few examples for fine-tuning the model to fit their needs. The final Dialogflow model understands the examples provided, and it can also extend the provided context to numerous other phrases that have the same meaning.

The developer can easily build a web interface that sends a query to Dialogflow. The application needs to make an HTTP request with JSON describing the user intents to Dialogflow’s endpoint with an intents knowledge base and a dialog history. When Dialogflow receives the query, it combines the content of its knowledge base with the new query’s intent to generate a reply.
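As a rough sketch of what such a request could look like, the snippet below builds the URL and JSON body for a text query. It assumes the Dialogflow ES v2 REST `detectIntent` method and its `queryInput.text` request shape; verify both against Google's current API reference before relying on them. The project ID and session ID are hypothetical placeholders, authentication (an OAuth bearer token) is omitted, and nothing is actually sent over the network.

```python
import json

def build_detect_intent_request(project_id, session_id, text, language_code="en-US"):
    """Build the URL and JSON body for a text query (assumed v2 detectIntent shape)."""
    url = (
        "https://dialogflow.googleapis.com/v2/"
        f"projects/{project_id}/agent/sessions/{session_id}:detectIntent"
    )
    body = {
        "queryInput": {
            "text": {"text": text, "languageCode": language_code}
        }
    }
    return url, json.dumps(body)

# Hypothetical identifiers, used only for illustration.
url, payload = build_detect_intent_request(
    project_id="my-demo-project",
    session_id="session-123",
    text="Are you open?",
)
```

In this sketch the session ID is what ties consecutive queries to the same conversation, so the application would reuse it for every message from a given user.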


Transformer architecture is used primarily in the field of natural language processing and understanding. Transformer architecture combined with transfer learning can significantly reduce computing time and production costs. Dozens of pre-trained Transformer architectures can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as intent recognition and question answering.

Read more about NLP and methods to calculate syntactic similarity in text in “NLP - Finding Syntactic Similarity in Text.”


Tien Nguyen

Associate Consultant at Lumenci

Tien has a master's degree in Computer Science from Georgia Institute of Technology and is also an alumnus of the University of Texas. He is proficient in Python, Java, and C++. Tien spends his time off work reading and watching classics.


©  2020 by Lumenci Inc.