Developing LLM benchmarks for conversational realism in lifelike AI agents
As a machine learning engineer at Inworld, my job is to design, develop and deploy machine learning models and systems. In my previous work on Google Search, I applied machine learning research findings to benchmark and improve search quality.
Many AI agents deployed at Google and in similar contexts are designed with specific tasks in mind, such as optimizing search results for user queries. The evaluation of AI agents powered by generative AI typically relies on established benchmarks and traditional metrics, such as relevance and accuracy, that are appropriate in use cases where factual accuracy is often the most critical consideration. For example, BLEU, ROUGE, and BERTScore are established metrics used to assess machine translation, summarization, and general text generation (by comparing contextual embeddings), respectively.
However, recent advancements in large language models (LLMs) have allowed developers to create remarkably human-like AI agents that can populate simulated environments and engage in conversations that improve the immersiveness and believability of video games and interactive entertainment. Because existing benchmarks were developed for very different use cases, they cannot properly evaluate conversational quality in these entertainment-focused use cases.
To measure and benchmark conversational realism in our own agentic applications of LLMs through our AI Engine, we’ve developed our Conversational Authenticity Metric (CAM). CAM evaluates conversations along the following dimensions:
- Naturalness
- Usefulness
- Factuality
Naturalness: Bridging the gap between human and machine
Naturalness measures how effectively our generated conversation mirrors human-like communication patterns and behaviors. It determines the degree to which the model's responses align with human expectations and norms. In the context of Inworld’s AI Engine, we also assess the naturalness of speech in relation to the unique personality and storyline of each individual character.
LLM agents often exhibit consistent patterns in their responses, including repetitive phrases or structures, especially when prompted with similar inputs. In contrast, human speech varies according to our moods and contexts, so responses need to reflect that variability and diversity in speech.
As part of naturalness, our CAM takes into account variables such as response length, response structure, and semantic similarity based on embeddings. These measures of lexical diversity, syntactic variation, and the exploration of different conversational pathways ensure our models are capable of generating diverse and nuanced responses.
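The variables above can be approximated with a few lines of code. The following is a minimal sketch, not CAM's actual implementation: the function names are hypothetical, and a bag-of-words vector stands in for a real sentence embedding so that the example runs without any model. It reports the spread of response lengths and the mean pairwise similarity across a character's responses, where lower similarity suggests more diverse output.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diversity_signals(responses):
    """Return the length spread and mean pairwise similarity of responses."""
    if not responses:
        raise ValueError("need at least one response")
    lengths = [len(r.split()) for r in responses]
    mean_len = sum(lengths) / len(lengths)
    length_std = math.sqrt(sum((n - mean_len) ** 2 for n in lengths) / len(lengths))
    vectors = [bag_of_words(r) for r in responses]
    pairs = list(combinations(vectors, 2))
    mean_sim = sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return {"length_std": length_std, "mean_pairwise_similarity": mean_sim}
```

In practice the `bag_of_words` stand-in would be replaced with whatever embedding model is available. The rest of the computation stays the same.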
Other measurements of naturalness include telltale signs of LLM-powered conversation such as:
- Repetition in the conversation history
- Frequency of vocative use such as “hey friend” or “my dear”
- Common AI phrases like “well well well.”
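As a rough illustration of how these telltale signs could be counted, here is a minimal sketch. The phrase lists contain only the examples named above, and the function name, the trigram-based repetition measure, and the output keys are all assumptions made for illustration rather than a description of how CAM computes them.

```python
import re
from collections import Counter

# Example phrases taken from the list above; a real list would be longer.
VOCATIVES = ("hey friend", "my dear")
STOCK_PHRASES = ("well well well",)


def normalize(text):
    """Lowercase and strip punctuation so 'Well, well, well.' matches."""
    return " ".join(re.findall(r"[a-z']+", text.lower()))


def telltale_signals(character_turns, n=3):
    """Rates of vocatives, stock phrases, and repeated n-grams in a history."""
    turns = [normalize(t) for t in character_turns]
    if not turns:
        return {"vocative_rate": 0.0, "stock_phrase_rate": 0.0,
                "repeated_ngram_share": 0.0}
    vocative_turns = sum(any(v in t for v in VOCATIVES) for t in turns)
    stock_turns = sum(any(p in t for p in STOCK_PHRASES) for t in turns)
    # Repetition (assumed measure): share of word n-grams that occur
    # more than once anywhere in the character's conversation history.
    ngrams = Counter()
    for t in turns:
        words = t.split()
        ngrams.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    total = sum(ngrams.values())
    repeated = sum(c for c in ngrams.values() if c > 1)
    return {
        "vocative_rate": vocative_turns / len(turns),
        "stock_phrase_rate": stock_turns / len(turns),
        "repeated_ngram_share": repeated / total if total else 0.0,
    }
```

Higher values on any of the three signals would point toward a less natural conversation. How they are weighted into a single score is left open here.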
Here are two illustrative excerpts that have been evaluated by our CAM:
Conversation #1: Non-diverse response
This conversation illustrates several breaks in naturalness with non-diverse vocatives and repetition in the conversation history. This explains the low CAM score.
Conversation #2: Diverse response
This second example achieves a much higher CAM score by more naturally addressing the player, maintaining a logical conversational structure, and showing conversational diversity.
Usefulness: Aligning developer inputs and LLM responses
In other AI applications, usefulness often comes down to task accuracy: for example, how accurately a voice assistant such as Siri on Apple devices transcribes spoken language.
When it comes to the outputs of Inworld’s AI Engine, we evaluate how well the generated responses match developer inputs and conversational context. Our goal is to determine the extent to which responses suit the intended character, ensuring they accurately portray their persona with responses that manifest the specific traits they were designed to have.
Specifically, when evaluating the AI Engine's outputs, we focus on whether the responses match the character's personality traits, guided by “hints” extracted from the character description or chosen dialogue style. These hints describe how the character should communicate, and we score each response on how well it aligns with them.
For example, if the character is supposed to be polite and helpful, we check whether the responses reflect these traits. This ensures that the response we serve, chosen from the handful of dialogue options we generate for each conversational turn, fits the character and the conversation. This process is essential for enhancing storytelling in virtual environments and ensuring that characters stay in-character, in-world, and on-topic.
The steps involved include:
- Extracting hints: particular traits that the character needs to exhibit.
  - Example: “The character must be polite to player” or “They must always say ‘Tra La La’ when they are confused.”
- Using LLMs to score dialogue output options with the help of character responses and policies (groups of hints).
- Providing an overall metric of how well the character adheres to the developer’s inputs.
- Choosing to serve the most appropriate response.
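The steps above can be sketched as a small selection loop. This is an illustrative sketch under stated assumptions, not Inworld's implementation: `Hint`, `Scorer`, `adherence`, and `choose_response` are hypothetical names, the overall metric is assumed to be a plain mean of per-hint scores, and `toy_scorer` is a keyword check standing in for the LLM that would score responses in practice.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Hint:
    """One extracted trait, e.g. 'The character must be polite to player'."""
    text: str


# Hypothetical judge signature: adherence in [0, 1] for one hint and response.
Scorer = Callable[[Hint, str], float]


def adherence(policy: Sequence[Hint], response: str, scorer: Scorer) -> float:
    """Mean per-hint score for a policy (a group of hints)."""
    if not policy:
        return 1.0  # assumption: with no hints there is nothing to violate
    return sum(scorer(h, response) for h in policy) / len(policy)


def choose_response(policy: Sequence[Hint], candidates: Sequence[str],
                    scorer: Scorer) -> Tuple[str, float]:
    """Score every candidate and return the best one with its score."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    scored = [(adherence(policy, c, scorer), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_scorer(hint: Hint, response: str) -> float:
    """Keyword stand-in for an LLM judge; only understands politeness hints."""
    if "polite" in hint.text.lower():
        markers = ("please", "thank", "welcome")
        return 1.0 if any(m in response.lower() for m in markers) else 0.0
    return 0.5  # assumption: neutral score for hints this toy judge cannot assess
```

Passing the scorer in as a function keeps the selection logic independent of whichever model does the judging, so the toy judge can be swapped for an LLM call without touching the rest.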
Factuality: Improving factuality and combatting hallucinations
One of the challenges inherent in LLMs is the occurrence of hallucinations, where the model generates text that lacks grounding in reality or context. Controlling hallucinations in a game world or entertainment experience can be particularly difficult because these worlds often have their own facts and reality, which differ from the data a large language model was trained on.
However, controlling hallucinations is critical since they can undermine players’ suspension of disbelief, eroding the game's realism and detracting from overall enjoyment and engagement. That’s why we created a CAM evaluation for factuality.
Our CAM evaluation focuses on era knowledge, niche knowledge and whether a character’s responses have taken into account the knowledge field inputs added by developers or users (i.e. facts about that character’s backstory or the game world that character inhabits). We measure:
- How well the model dynamically retrieves relevant information to deliver accurate responses.
- If the information retrieved is correctly referenced.
- Whether any responses contradict the facts and knowledge included when designing the character.
- How often the model provides misleading or false information.
Our evaluation uses the logic from our Knowledge Filters system to determine if information referenced in the conversation is appropriate for the character and game world. This logic groups the types of character-referenced information into the following categories to create a list of facts the character is likely, or unlikely, to know:
- Creator-specified information: This includes anything entered into the character's description, knowledge base, and scene knowledge.
- Mutations of creator-specified information: This refers to tangential information that is related to the facts directly specified by the creator.
- Probable character knowledge: Based on the character's setup, this category encompasses information that the character is likely to know but which is not explicitly mentioned in the character’s inputted fields.
- Improbable character knowledge: Based on the character's setup, this category encompasses information that the character is less likely to know.
Our CAM then measures factuality using the following method:
- The goal is to detect whether characters stick only to facts that they should know instead of hallucinating.
- To do so, we created an evaluator that takes the player query, the character response, and the list of facts configured in the character’s setup and knowledge. The facts may include information about the character’s backstory, their surroundings, or the narratives of the game. We then compare each fact in the list against the player query and character response to classify it into one of the following:
  - Contradicted facts: Facts that contradict the generated character responses, indicating that the responses are compromised in their factuality.
  - Mentioned facts: Facts that are in line with the generated character responses, indicating that the facts are used in the correct context.
  - Stated facts: Facts that the character states verbatim, indicating that the facts are used literally.
- We then aggregate the cases of each of the three fact types and use them to compute a composite metric in which a higher likelihood of stated and mentioned facts and a lower likelihood of contradicted facts correspond to better factuality.
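The aggregation step can be illustrated with a short sketch. The exact composite formula is not given here, so the one below is an assumption chosen only because it moves in the described direction: the share of referenced facts that are stated or mentioned rather than contradicted. The names `FactLabel`, `Classifier`, and `factuality_score` are hypothetical, as is the extra `UNUSED` label for facts a turn does not touch. The classifier is passed in as a function, standing in for the LLM-based evaluator.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class FactLabel(Enum):
    CONTRADICTED = "contradicted"
    MENTIONED = "mentioned"
    STATED = "stated"
    UNUSED = "unused"  # assumption: the fact is not referenced in this turn


# Hypothetical evaluator signature: (fact, player_query, character_response).
Classifier = Callable[[str, str, str], FactLabel]


def factuality_score(
    turns: Sequence[Tuple[str, str]],
    facts: Sequence[str],
    classify: Classifier,
) -> Tuple[Optional[float], Counter]:
    """Classify every fact against every turn, then aggregate the counts.

    Returns (score, counts). The score is the assumed composite:
    (stated + mentioned) / (stated + mentioned + contradicted),
    or None when no fact was referenced at all.
    """
    counts = Counter(
        classify(fact, query, response)
        for query, response in turns
        for fact in facts
    )
    supported = counts[FactLabel.STATED] + counts[FactLabel.MENTIONED]
    referenced = supported + counts[FactLabel.CONTRADICTED]
    score = supported / referenced if referenced else None
    return score, counts
```

Under this assumed formula, a conversation in which two facts are used correctly and one is contradicted would score 2/3, and a conversation that contradicts nothing it references would score 1.0.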
What’s next?
Our ML team is working to continuously improve conversation quality through prompt engineering and fine-tuning. Automated benchmarks like CAM give us the feedback we need to iterate on and improve conversational quality more quickly than we could by relying on human feedback alone.
As part of our commitment to provide the most realistic conversations, we also look at a number of other factors in our automated conversation quality evaluation. For example, we also measure Coherence to evaluate the model’s effectiveness in retrieving longer conversation histories and maintaining conversation logic – including storyline coherence, something particularly important in gaming and entertainment use cases. More on that process in a future post!
Stay tuned for more quality enhancements, such as a new character-specific DPO feature for fine-tuning. This feature will empower creators to provide useful dialogue quality feedback through the simulator, addressing preferred conversation styles, fact corrections, tone adjustments, and safety considerations.
Curious about our AI engine? Try it out today!