When it comes to choosing the best technology to power automated brand interaction, technical teams must navigate a sea of options, ranging from inferior “chatbots” to heavily siloed CX platforms and legacy developer platforms. Many of these options require custom integrations and dedicated support from technical teams, even for relatively simple changes.
The solution is relatively straightforward: technical teams need a system-agnostic, enterprise-grade platform that is both developer and business user friendly. But there’s a lot of work to be done behind the curtain to achieve this.
Over the past few years, advancements in hardware capabilities have allowed machine learning (ML) teams to train large language models that can generalize to downstream tasks. This led to the emergence of "in-context learning", where the instructions for a downstream task are stated in natural language (the prompt) and the model solves the task by conditioning on that prompt.
This means that models can reach state-of-the-art performance on many of these tasks with only a few task-specific training examples. That is no doubt a big win, but generative language models still have limitations.
The pitfalls of generative models
- Lack of interpretability: Generative language models are black box solutions that are hard to interpret and produce outputs that are hard to explain.
- Difficult to control the generation process: Generative language models require careful prompt engineering to guide the generation process and experimentation to find the right prompt structure.
- Generated outputs can be biased: The Common Crawl dataset, a collection of web data dating back to 2008, is commonly used to train large language models. Although training data is usually scrubbed for profanity and PII, it can still hold implicit biases.
Not to fear, there are ways around this. Let’s take a closer look at how we do this at Ada.
Introducing Answer Training Suggestions
The Answer Training Suggestions feature helps bot builders create more representative training questions in a fraction of the time. To generate the suggestions, we prompt a large language model with instructions to derive meaningful training questions from the client’s answer content. Bot builders are then able to accept or reject suggested training questions, and the accepted suggestions are funnelled back into the prompt to stimulate few-shot learning — thus completing the feedback cycle.
The results we’ve seen using this new method speak for themselves: over the beta period, 41.1% of generated training questions were accepted by our bot builders.
Our Answer Training Suggestions feature leverages advancements in large language models to build a fast and performant system. By deploying the feature in a human-in-the-loop setting, we are able to ensure biased and irrelevant training suggestions are rejected. Using an extrinsic evaluation framework for prompt engineering, the ML team can select metrics that are closely aligned with the business use case.
How Training Suggestions are generated
We use the following data from client answers to engineer the prompt, which is fed as input to GPT-3 through the OpenAI API:
- Answer title
- Textual answer content
- Existing training questions
Let’s take a closer look at how this is done through the few-shot generation process. A sample prompt can be formulated for the generative model that uses no training questions. We refer to this as zero-shot generation.
This prompt can be easily extended to incorporate any number of training questions, where each existing question can be seen as an example of the task.
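As a rough illustration of the zero-shot and few-shot cases described above, the sketch below assembles a prompt from the three inputs (answer title, answer content, existing training questions). The template wording and the `build_prompt` helper are assumptions made for this example, not the prompt used in production.

```python
from typing import Sequence


def build_prompt(title: str, content: str, training_questions: Sequence[str] = ()) -> str:
    """Assemble a prompt from answer data. The template text is hypothetical."""
    lines = [
        "Write questions a customer might ask that the following answer resolves.",
        f"Answer title: {title}",
        f"Answer content: {content}",
        "Questions:",
    ]
    # Each existing (or previously accepted) training question acts as one "shot".
    lines.extend(f"- {question}" for question in training_questions)
    # End on an open bullet so the model continues the list with new questions.
    lines.append("-")
    return "\n".join(lines)


# Zero-shot: no training questions are available for this answer.
zero_shot = build_prompt("Reset password", "Go to Settings > Security and click Reset.")

# Few-shot: existing training questions are included as examples of the task.
few_shot = build_prompt(
    "Reset password",
    "Go to Settings > Security and click Reset.",
    ["How do I change my password?", "I forgot my password"],
)
```

With this shape, feeding accepted suggestions back into the prompt only requires appending them to `training_questions` on the next call.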
In practice, we iterated through many prompt structures before settling on the one that powers the Answer Training Suggestions feature.
Once the input prompt is fixed, we parse the raw output from the generative model to retrieve the set of generated suggestions. Not all generations from the model are shown to the end user, and identifying the most useful suggestions to show the bot builder is not a trivial task. We run each generated suggestion through our own content filters and OpenAI’s content filter, a fine-tuned model that detects whether generated content is likely to be unsafe or sensitive. Any unsafe or sensitive suggestions are discarded. We also measure the n-gram overlap between each suggestion and the answer title and content, and remove suggestions that are likely to be semantically irrelevant.
The full process for generating the final suggestions, which bot builders then accept or reject, is outlined in the diagram below.
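The n-gram overlap check could look something like the sketch below. It uses unigram overlap and a cut-off of 0.2; the tokenizer, the choice of n, the threshold, and all function names are assumptions for illustration, since the post does not specify them.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(suggestion: str, reference: str, n: int = 1) -> float:
    """Fraction of the suggestion's n-grams that also occur in the reference."""
    suggestion_ngrams = ngrams(suggestion, n)
    if not suggestion_ngrams:
        return 0.0
    return len(suggestion_ngrams & ngrams(reference, n)) / len(suggestion_ngrams)


def filter_suggestions(suggestions, title, content, min_overlap=0.2):
    """Keep suggestions that share enough n-grams with the title plus content.

    min_overlap is an assumed threshold, not a value from the original system.
    """
    reference = f"{title} {content}"
    return [s for s in suggestions if overlap_ratio(s, reference) >= min_overlap]


kept = filter_suggestions(
    ["How do I reset my password?", "What is the weather today?"],
    title="Reset password",
    content="Go to Settings, open Security, and click Reset to choose a new password.",
)
# kept == ["How do I reset my password?"]
```

A purely lexical check like this can reject valid paraphrases that share no words with the answer, so the threshold would need tuning against labelled data.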
Evaluating Answer Training Suggestions
Evaluating generative language models is tricky: traditional metrics such as perplexity and entropy are a good proxy for the model’s confidence, but a confident model does not necessarily perform well on the task.
To ensure our feature is aligned with business goals, we designed extrinsic (task-specific) metrics to evaluate the suggestion generation capabilities. We used the evaluation framework below to select the best generative model, prompt structure, and generation hyperparameters:
- Randomly sample answer content, stratified by meaningful segments
- Use the generative model to suggest training questions for each answer content
- Present each generated suggestion (along with the answer content) to a human annotator
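The first step, stratified sampling, might be sketched as follows. The `segment` field and the sample size per segment are hypothetical; the post does not say which segments were used.

```python
import random
from collections import defaultdict


def stratified_sample(answers, key, per_segment, seed=0):
    """Sample up to per_segment answers from each segment, reproducibly."""
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for answer in answers:
        by_segment[key(answer)].append(answer)
    sample = []
    # Iterate segments in sorted order so the result is stable for a given seed.
    for segment in sorted(by_segment):
        group = by_segment[segment]
        sample.extend(rng.sample(group, min(per_segment, len(group))))
    return sample


# Toy data: the "segment" values are invented for this example.
answers = [
    {"id": 1, "segment": "retail"},
    {"id": 2, "segment": "retail"},
    {"id": 3, "segment": "retail"},
    {"id": 4, "segment": "fintech"},
]
sample = stratified_sample(answers, key=lambda a: a["segment"], per_segment=2)
```

Sampling per segment keeps small segments represented in the annotation set, which a plain random sample might miss.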
Based on the aggregate of the annotator’s labels (reject or accept), we derived the following metrics:
- Mean Suggestion Acceptance Rate (MSAR): Average ratio of accepted suggestions to generated suggestions
- Mean Suggestion Acceptance Value (MSAV): Average number of training suggestions accepted per generation
- Useful Generation Rate (UGR): Fraction of generations that have at least one accepted suggestion
Intuitively, we expect a high-performing generative model to have the following characteristics:
- Generates high quality suggestions that are mostly accepted by our bot builders
- Generates multiple usable suggestions
High values of UGR and MSAR correspond to the first characteristic, while high values of MSAV correspond to the second.
To further illustrate the complementary nature of UGR, MSAV and MSAR, consider two generative models: G1 and G2. Let’s assume G1 generates two suggestions, and a bot builder accepts both suggestions. G2 generates five suggestions, and a bot builder accepts all five. We assume just one answer block without loss of generality.
In this example, both G1 and G2 have UGR = MSAR = 1.0. However, G1 has MSAV = 2 while G2 has MSAV = 5, so the MSAV metric helps us reach a more fine-grained understanding of different models’ performance.
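The three metrics and the G1/G2 comparison can be reproduced with a short sketch. It assumes each generation is represented as a list of booleans (True for an accepted suggestion) and reads UGR as the fraction of generations with at least one accepted suggestion; the `evaluate` function and its input format are illustrative, not the team's actual tooling.

```python
from statistics import mean


def evaluate(generations):
    """Compute MSAR, MSAV and UGR from annotator labels.

    generations: one list per generation, with one boolean per suggestion
    (True = accepted). This input format is an assumption for the example.
    """
    return {
        # Mean ratio of accepted to generated suggestions (0.0 for an empty generation).
        "MSAR": mean(sum(g) / len(g) if g else 0.0 for g in generations),
        # Mean number of accepted suggestions per generation.
        "MSAV": mean(sum(g) for g in generations),
        # Fraction of generations with at least one accepted suggestion.
        "UGR": mean(1.0 if any(g) else 0.0 for g in generations),
    }


g1 = evaluate([[True, True]])  # two suggestions, both accepted
g2 = evaluate([[True] * 5])    # five suggestions, all accepted
# g1: MSAR 1.0, MSAV 2, UGR 1.0
# g2: MSAR 1.0, MSAV 5, UGR 1.0
```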
To generate useful and unique training suggestions, we focus on the stochasticity of the model. In language models, this is controlled by the parameters of temperature, top-k, and top-p, which are used in the decoding process. During prompt engineering, we observed that even our top performing prompts can have sub-optimal performance if the hyperparameters used for generation aren’t optimally set.
In particular, we found that sweeping the temperature and top_p values can result in noticeable performance variation for a fixed prompt. Using this evaluation framework, we were able to tune the hyperparameters and the prompt structure in order to select the most fitting model for this task.
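A sweep over temperature and top_p for a fixed prompt can be organized as a simple grid search. In the sketch below, `toy_score` is an invented stand-in for the expensive step (generate suggestions, collect annotator labels, compute a metric such as MSAR), and the grid values are arbitrary; neither reflects the settings actually selected.

```python
from itertools import product


def sweep(score_fn, temperatures, top_ps):
    """Score every (temperature, top_p) pair and return the results, best first."""
    results = [
        {"temperature": t, "top_p": p, "score": score_fn(t, p)}
        for t, p in product(temperatures, top_ps)
    ]
    return sorted(results, key=lambda r: r["score"], reverse=True)


def toy_score(temperature, top_p):
    # Placeholder for: generate with these settings, annotate, compute the metric.
    # The formula is made up so the example runs; it carries no empirical meaning.
    return 1.0 - abs(temperature - 0.7) - 0.5 * abs(top_p - 0.9)


ranking = sweep(toy_score, temperatures=[0.3, 0.7, 1.0], top_ps=[0.5, 0.9, 1.0])
best = ranking[0]
```

Because each grid point requires human annotation in the framework described here, a real sweep would likely use a coarse grid.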
Top performing models when using no training questions (zero-shot) compared to using training questions in the input prompt (few-shot).
In a sense, the few-shot case can be seen as a paraphrase task, while the zero-shot case is more like open-ended question answering. Our experimental results are consistent with those in the paper that introduced GPT-3: the more examples we provide as context, the better the model performs.
We also observed that higher temperature values lead to better suggestions when no training questions are present for a given answer. We believe this is because, in those scenarios, we want to encourage the model to be more creative. In contrast, when training questions are present, we want the generated suggestions to be similar to the existing questions, so lower temperature values lead to better performance.
The future of Answer Training Suggestions
In the first month after launch, Answer Training Suggestions created 25% of Ada’s training questions, reducing the time it takes for bot builders to come up with these on their own. This gives brands the fastest time to true value, and empowers teams to focus on driving the improvements that matter most: designing experiences, offers, and answers that anticipate needs.
With technology like this, every brand interaction is improved through automation, creating a more consistent and purposeful brand experience. Every customer and employee can receive a VIP experience that’s personalized, proactive, and accessible — no matter who they are, what channel they prefer, or what language they speak.
We’re always looking for opportunities to learn and improve, knowing that innovation doesn’t come without risks. But at Ada, we promote a culture that encourages employees — or Ada “Owners” as we call them — to try new things and break new ground. So we’re currently working on leveraging automatic evaluation metrics to get faster feedback for our experimentation. We are experimenting with both lexical similarity measures and contextual embedding based metrics to automatically determine the suitability of a generated suggestion. Stay tuned — we can’t wait to share the results.