How We Built Training Insights at Ada
What is the problem we are solving?
Last year, we automated more than 1.5 billion customer interactions for hundreds of category leaders across the globe. That’s a lot of conversations! Most of the time we could answer the customer's inquiry. However, there were some cases where a client’s bot couldn't answer a question. We call these Not Understood (NU) questions, and they can occur because:
- the answer to their question does not exist in the Ada bot's knowledge base
- their question is outside of the bot's domain (example: What is the capital of Luxembourg?)
- misclassification, i.e. the answer exists but the bot is unable to respond with the correct answer
Currently, we surface all NU questions in a reverse chronological list, which means recent questions are shown first. Bot builders traverse this list searching for common NU questions for which to build new answers, or improve existing answers, thereby automating more questions from their customers. However, this is challenging for our builders, as they are attempting to draw insights and find a set of common topics among a really long list.
To address this issue, we built Training Insights, which groups NU questions into similar topics. Each group has a title, keywords, a question count and some sample questions. The builder can go inside each group and select questions to be added as training questions to a new or existing answer. If similar answers exist, they are suggested to the builder.
How do we group Not Understood questions?
Given that there are no labels associated with Not Understood questions, and that our wide variety of clients prevents us from defining a small set of common groups, we started looking into unsupervised approaches. Latent Dirichlet Allocation (LDA) is a commonly used topic modelling approach, where documents (in our case, questions) are represented as mixtures of topics and each word in the document is modelled as being generated by a topic. We quickly realized that LDA was not suitable for our task because it relies on a sparse bag-of-words representation. LDA is often used for topic modelling on news articles, where the documents contain many words and some of those words represent the topics well. Bag-of-words representations work well for domains like news, where an article containing the words Federer, Nadal, Wimbledon, grand slam belongs to the topic tennis. However, two short questions on the topic of operating hours (What are your operating hours?, When is your store open?) can have few overlapping words.
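To make the overlap problem concrete, here is a minimal sketch that compares bag-of-words cosine similarity for two news-style snippets and for the two operating-hours questions. The two tennis sentences and the helper name `bow_cosine` are made up for illustration; only the two questions come from the text above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts under a plain bag-of-words representation."""
    counts = CountVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(counts[0], counts[1])[0, 0])


# Invented news-style snippets: they share several topical words.
news_sim = bow_cosine(
    "Federer beat Nadal in the Wimbledon final",
    "Nadal and Federer meet again at Wimbledon",
)

# The two questions from the text: the only shared token is "your".
question_sim = bow_cosine(
    "What are your operating hours?",
    "When is your store open?",
)

print(f"news snippets:   {news_sim:.2f}")
print(f"short questions: {question_sim:.2f}")
```

The two questions mean nearly the same thing, yet their sparse vectors barely overlap, and the one word they do share carries no topical information.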
In recent years, models using Transformer-based architectures such as BERT and RoBERTa have become the dominant method of representing text for prominent benchmarks such as GLUE. These encoders can produce a contextualized dense representation for text data. At Ada, we have fine-tuned a BERT language model on an internal dataset in a multi-task fashion using MT-DNN. We have also trained word embeddings using fastText for different industry verticals. From our previous experiments, we knew that these methods were superior to bag-of-words for representing questions. Our next attempt to group NU questions was to embed them using these methods and then perform density-based clustering. This approach has also seen recent success in academia, for example in Top2Vec and BERTopic. Here is a t-SNE visualization of one of our bot's questions after embedding them with BERT, where colours represent different clusters. Since there are over 50 clusters, some colours are repeated.
If we zoom into a part of this plot, we can see that the clusters are quite meaningful. Here is a manually labelled sub-plot.
There are two main approaches for evaluating unsupervised models:
- Distance-based metrics: Examples include the Davies-Bouldin score and the Silhouette Coefficient. These metrics compare intra-cluster distances to inter-cluster distances. They do not require ground-truth labels.
- Comparing to ground-truth labels: Examples include V-Measure and AMI. These metrics are preferred since you can evaluate clustering performance against the labels relevant to the task, but obtaining the labels can be challenging.
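As a rough illustration of the two families, the sketch below computes both kinds of metrics with scikit-learn. The data is synthetic (`make_blobs` standing in for question embeddings) and KMeans is used only to produce some cluster assignment; neither is taken from our actual setup.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_mutual_info_score,
    davies_bouldin_score,
    silhouette_score,
    v_measure_score,
)

# Synthetic stand-in for embedded questions, with known group labels.
X, true_labels = make_blobs(
    n_samples=300, centers=4, n_features=10, cluster_std=1.0, random_state=0
)
pred_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Distance-based metrics: only the data and the predicted assignment are needed.
distance_metrics = {
    "silhouette": silhouette_score(X, pred_labels),          # higher is better
    "davies_bouldin": davies_bouldin_score(X, pred_labels),  # lower is better
}

# Ground-truth metrics: compare the predicted assignment to known labels.
label_metrics = {
    "v_measure": v_measure_score(true_labels, pred_labels),
    "ami": adjusted_mutual_info_score(true_labels, pred_labels),
}

print(distance_metrics)
print(label_metrics)
```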
In Ada's dashboard, bot builders create training questions for answers. For example, here is a screenshot of an About Me answer with four training questions.
This gives us a dataset of questions where we know the correct cluster assignment so we use performance on this task as a proxy. We clustered all the training questions for a bot, and since we had access to the answer label, we were able to calculate both types of metrics (distance and ground-truth) for this proxy task. After looking at some of the clustering results and the metrics, we settled on V-Measure as the main metric. V-Measure measures the agreement between the cluster assignment and the answer labels, considering both homogeneity and completeness. A higher score indicates better clustering. This allowed us to iterate faster and try out different modelling techniques while optimizing for a single metric.
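To show how homogeneity and completeness combine into V-Measure, here is a small worked example on four hypothetical training questions belonging to two answers. The label values and the three candidate clusterings are invented for illustration.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical answer labels: two questions for answer 0, two for answer 1.
answer_labels = [0, 0, 1, 1]

candidates = {
    "perfect": [0, 0, 1, 1],          # clusters match the answers exactly
    "over_split": [0, 1, 2, 3],       # every question in its own cluster
    "one_big_cluster": [0, 0, 0, 0],  # everything lumped together
}

scores = {}
for name, cluster_ids in candidates.items():
    h, c, v = homogeneity_completeness_v_measure(answer_labels, cluster_ids)
    scores[name] = (h, c, v)
    print(f"{name:16s} homogeneity={h:.2f} completeness={c:.2f} v_measure={v:.2f}")
```

Over-splitting keeps every cluster pure (homogeneity 1.0) but scatters each answer across clusters (completeness 0.5), while a single big cluster does the opposite. V-Measure is the harmonic mean of the two, so it penalizes both failure modes.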
We also found that the distance-based metrics were not meaningful for our purpose. And since the main task of grouping Not Understood questions differs from the proxy task of clustering training questions, we validated our main modelling choices by showing early clients the groupings created by different approaches.
We validated that clustering training questions was a good proxy task for Not Understood questions clustering because clients consistently preferred groupings from models that also performed better in our proxy task.
Experiment results and ML Pipeline
The final pipeline consisted of three main stages:
- feature extraction / embedding questions
- dimensionality reduction
- clustering
Dimensionality reduction is used because clustering methods rely on distance metrics, which can become unreliable in high dimensions, leading to lower clustering performance.
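The effect of dimensionality on distances can be seen with a small NumPy experiment. This is a generic sketch on uniformly random points, not on our embeddings; the function name `relative_contrast` is made up, and 768 is included only because it matches the hidden size of BERT-base.

```python
import numpy as np


def relative_contrast(n_dims: int, n_points: int = 1000, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one query to a random point cloud."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return float((distances.max() - distances.min()) / distances.min())


for n_dims in (2, 10, 100, 768):
    print(f"{n_dims:4d} dims -> relative contrast {relative_contrast(n_dims):.2f}")
```

As the number of dimensions grows, the nearest and farthest points end up at almost the same distance from the query, which makes it harder for a distance-based clustering method to tell dense regions from sparse ones.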
We ran experiments to compare methods at each stage of the pipeline, while controlling for the other stages.
BERT embeddings are essential to clustering performance. The number of BERT layers (1 vs. 4) used did not matter as much. BERT outperformed word embeddings (WE) across the board, and we ended up choosing a joint representation (BERT 1 layer + WE).
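One simple way to form a joint representation is to L2-normalize each embedding and concatenate them. This is only an assumption for illustration: the post does not specify how the BERT and word-embedding vectors were combined, and the dimensions (768 and 300) and random inputs below are placeholders.

```python
import numpy as np
from sklearn.preprocessing import normalize


def joint_representation(bert_vectors: np.ndarray, word_vectors: np.ndarray) -> np.ndarray:
    """Concatenate two embeddings per question after scaling each to unit length.

    Normalizing first keeps one embedding from dominating distances
    just because its raw values are larger.
    """
    return np.hstack([normalize(bert_vectors), normalize(word_vectors)])


# Placeholder inputs: 4 questions, random vectors instead of real embeddings.
rng = np.random.default_rng(0)
bert_vectors = rng.normal(size=(4, 768))
word_vectors = rng.normal(size=(4, 300))

joint = joint_representation(bert_vectors, word_vectors)
print(joint.shape)
```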
We looked at three dimensionality reduction methods: PCA, EDR and UMAP. UMAP was able to reconstruct the training questions better overall. However, this was mostly because UMAP produced far more clusters than PCA or EDR. PCA and EDR put around 50% of the questions in the noise cluster, but the clustered questions had higher precision. We initially went with PCA because the clusters were more precise but later changed to UMAP after getting feedback that "a lot of questions were missing in groups".
We also looked into several other factors that we've left out here, such as how to combine embeddings into sentence representations, min cluster size, clustering and dim reduction metric, number of dim reduction components, etc. A simple version of the final pipeline using scikit-learn looks like this:
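The sketch below is a stand-in for the three stages, written under loud assumptions rather than copied from production: the embeddings are synthetic (`make_blobs`), PCA is used for dimensionality reduction (our final choice was UMAP, which is not part of scikit-learn), and DBSCAN stands in for density-based clustering. The function name `build_pipeline` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import v_measure_score
from sklearn.pipeline import Pipeline


def build_pipeline(n_components: int = 5, eps: float = 4.0, min_samples: int = 5) -> Pipeline:
    """Dimensionality reduction followed by density-based clustering."""
    return Pipeline([
        ("reduce", PCA(n_components=n_components, random_state=0)),
        ("cluster", DBSCAN(eps=eps, min_samples=min_samples)),
    ])


# Stage 1 stand-in: pretend these 50-d vectors are question embeddings,
# and that answer_labels are the answers the training questions belong to.
embeddings, answer_labels = make_blobs(
    n_samples=300, centers=3, n_features=50, random_state=0
)

# Stages 2 and 3: reduce, then cluster. DBSCAN marks noise points with -1.
cluster_ids = build_pipeline().fit_predict(embeddings)

n_clusters = len(set(cluster_ids.tolist()) - {-1})
noise_fraction = float(np.mean(cluster_ids == -1))
score = v_measure_score(answer_labels, cluster_ids)

print(f"clusters={n_clusters} noise={noise_fraction:.1%} v_measure={score:.2f}")
```

Tracking the noise fraction alongside V-Measure is useful here, because a method can look precise simply by refusing to assign many questions to any cluster.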
Making it into the product
After building out the ML pipeline, we generated automated clustering CSV exports for all our clients to get further validation. After receiving positive feedback, the ML team collaborated with the design team to build out the Training Insights feature. The dashboard currently shows a Grouped view by default with the option to toggle back to the old Ungrouped view.
The title for each group is the most common n-gram among a cluster's questions. The keywords are selected using TF-IDF where a document is the concatenated text in each cluster, and then we do some post-processing to remove similar words.
Builders can go inside each group and select questions to create a new answer or train to an existing one.
Of course, to make the system perform well in the real world, we had to take care of engineering details that we have left out of this article.
We have built a feature that groups Not Understood questions into meaningful topics so that builders can create new answers or add better training questions to existing answers. A few takeaways:
- Feature extraction is key for clustering to work well. It is probably the main lever you can pull to generate clusters that are meaningful to your task.
- Evaluation of clustering methods can be tricky. Dog-food your cluster predictions and see if the predictions and metrics make sense.
- When possible, create a proxy task with available labels to iterate faster but make sure that the proxy task observations are consistent with the real task.
- Your first version will never be the final one, so ship a first version as fast as possible.
- Getting client feedback can be a slow process so build a good pipeline that you can use to quickly iterate when you do get feedback.
This work was done in collaboration with Gordon Gibson, Liam Bolton, Adam Sils, Viacheslav Tradunskyi. We would also like to thank Francisco Cho for his constant feedback throughout the project and Archy de Berker for proofreading.
If you’re interested in building impactful ML product features at scale - we’re hiring! You can apply on our careers page.