Role Adherence: A Metric for Evaluating Role Consistency in Multi-Turn Conversational AI
Axel Fritz
axel.fritz@alquimia.ai
Alquimia AI
Alex Fiorenza
alex.fiorenza@alquimia.ai
Alquimia AI
April 2026
Abstract
As conversational AI systems are deployed in role-constrained settings (customer support agents, educational tutors, medical assistants), measuring whether a model consistently adheres to its assigned role becomes a core quality requirement. This paper proposes Role Adherence, a metric that quantifies role consistency across multi-turn conversations under a constructive definition: an assistant must actively exhibit the behaviour its role demands, not merely avoid out-of-scope content. We evaluate eight scoring methods and the LLM judge in both output modes on a controlled synthetic benchmark of 150 turns (scope violations, tone violations, and constructive failures) across three dataset-size conditions (15, 50, and 150 turns). An LLM judge (llama-3.3-70b-versatile; Dubey et al., 2024) achieves macro F1 $= 0.979$ and Cohen's $\kappa = 0.958$ at 150 turns and is the only method that reads the role definition directly, detecting all violation types including tone deviations and constructive failures. Deterministic methods (BERTScore, cosine similarity, $k$-NN) reach AUC $\approx 1$, but this is a structural artefact: they measure semantic proximity to a ground-truth reference and are blind to violations that are semantically close to the correct response. Distribution-based methods degrade or invert with corpus size: Mahalanobis Distance declines monotonically (AUC: $0.963 \to 0.872 \to 0.856$), a PCA-based variant inverts the adherence signal entirely (AUC $= 0.000$ at 15 turns), and KL Divergence converges to a weak ceiling (AUC $\approx 0.88$). The LLM judge in continuous mode, which derives its score from the yes/no probability mass in the model's first-token distribution, discriminates adherent from violating turns reliably with Gemma 3-12B without requiring ground truth, and can be compared by AUC with deterministic methods on the same scale. The metric uses an LLM-as-a-judge as its default scoring path, as it is the only method capable of detecting tone violations and constructive failures.
Deterministic methods are available as a cost-free alternative when ground truth is present, but empirical results show they act as semantic drift detectors rather than genuine role adherence evaluators. The metric is part of the Gaussia evaluation framework (https://www.gaussia.ai/). The implementation is open-source at https://github.com/Alquimia-ai/pygaussia.
Introduction
Modern LLM-based assistants are typically deployed with a system prompt that defines their role: the topics they should cover, the tone they should adopt, the behaviours they should exhibit, and the boundaries they should respect. Despite this, models frequently deviate from their assigned roles --- answering out-of-scope questions, abandoning their defined persona, or failing to exhibit the behaviour their role requires.
Existing evaluation frameworks either measure role adherence at the level of individual turns in isolation [1], or rely entirely on LLM-based judges, which introduce reproducibility and cost concerns [2]. This paper proposes a more complete metric that evaluates adherence turn by turn with prior conversational context, uses an LLM-as-a-judge as its default scoring path, and offers deterministic semantic scoring as a reproducible alternative when ground-truth reference responses are available.
Problem Statement
Given a multi-turn conversation between a user and an AI assistant, and a definition of the role the assistant is expected to play, we want to compute a scalar score representing how consistently the assistant adhered to that role throughout the conversation.
Let a conversational session be a sequence of assistant turns $C = (a_1, \dots, a_T)$, and let $R$ denote the role definition. The Role Adherence score is defined as:
$$\mathrm{RA}(C, R) = \frac{1}{T} \sum_{i=1}^{T} s(a_i \mid h_i, R)$$
where $h_i$ is the conversation history up to turn $i$, and $s(a_i \mid h_i, R)$ is a per-turn adherence score. The session score lies in $[0, 1]$, where 1 indicates full adherence across all turns. The range of $s$ and how it is computed depend on the scoring path selected (see Scoring Methodology).
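As an illustration of the aggregation step only, the sketch below averages per-turn scores into a session score. The function name and the example scores are hypothetical; how each per-turn score is produced depends on the scoring path and is not shown.

```python
from statistics import fmean

def session_score(turn_scores):
    """Average per-turn adherence scores into one session-level score."""
    if not turn_scores:
        raise ValueError("a session needs at least one scored assistant turn")
    for score in turn_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"per-turn score out of range: {score}")
    return fmean(turn_scores)

# Hypothetical five-turn session: four adherent turns and one violation,
# scored in binary mode (1.0 = adherent, 0.0 = violation).
example_scores = [1.0, 1.0, 0.0, 1.0, 1.0]
print(session_score(example_scores))
```

With binary per-turn scores the session score is simply the fraction of adherent turns; with continuous scores it is their mean.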
Defining Adherence
Constructive vs. Restrictive Adherence
Two competing definitions of adherence are possible.
Restrictive adherence asks: did the assistant avoid going outside the scope of its role? A turn is compliant if it does not address topics or exhibit behaviours explicitly excluded by the role. This definition only penalises negative deviations.
Constructive adherence asks: did the assistant actively behave as its role requires? A turn is compliant only if it exhibits the behaviour, tone, and knowledge a properly role-playing assistant should exhibit. This definition penalises both negative deviations (scope violations) and positive failures (failing to act as the role demands).
Consider an assistant assigned the role of a technical support agent. If the user asks a question within the assistant's domain and the assistant responds "I don't know", restrictive adherence would record no violation, because no out-of-scope content was produced. Constructive adherence would record a failure, because a role-adherent support agent is expected to answer questions in its domain.
We adopt the constructive definition as the primary formulation for this metric, as it provides a more meaningful signal for production quality assessment. Restrictive adherence remains a valid and simpler formulation for use cases where the primary concern is containment rather than quality of execution. Both definitions are implementable within the proposed schema.
Conversational Context
The default mode evaluates each turn individually: turn $i$ is scored by passing the role definition, the prior conversation history $h_i$, and the candidate response $a_i$ to the judge in a single call. This produces one score per turn and preserves per-turn interpretability. Prior context matters in both directions: a turn evaluated without it may appear off-role yet be entirely appropriate given what came before; conversely, a neutral-looking turn may be part of an escalating deviation only visible across the conversational arc.
An alternative is to evaluate the full conversation in a single call
(evaluation_granularity='conversation'), which reduces inference
cost at the expense of per-turn granularity. We note this as a valid
trade-off for cost-sensitive deployments but do not adopt it as the default.
Role definition schema. The role is provided as a free-form natural
language string and is a property of the session, not of the metric.
chatbot_role is defined as a field on the Dataset object
because a dataset already carries implicit contextual assumptions: the way a
user interacts with a customer support agent differs from how they interact
with a coding assistant, even if the raw conversation text were identical.
Separating the role from the session would break this coupling. The string is
passed directly to all scoring methods: embedded alongside the candidate
response in deterministic path computations, and injected verbatim into the
LLM judge prompt.
Scoring Methodology
Scoring Strategy
The metric was originally designed with deterministic semantic scoring as its primary path, motivated by reproducibility and zero per-inference cost. Empirical evaluation (see Results) showed that deterministic methods, while reliable at detecting semantic drift from a reference response, fail to detect tone violations and constructive failures, the hardest and most practically relevant violation types in a role-constrained deployment.
The metric therefore uses an LLM-as-a-judge as its default scoring path: the judge reads the role definition directly and is the only evaluated method capable of detecting tone violations and constructive failures.
When ground-truth reference responses are available and the user prefers a deterministic, reproducible alternative, the metric can be configured to use a semantic scoring function instead (see Deterministic Path). Deterministic methods are faster and free of per-call cost, but detect semantic drift from the reference rather than role adherence per se.
LLM-as-a-Judge Path
Each assistant turn is evaluated by an LLM judge that receives: (1) the role definition $R$; (2) the conversation history $h_i$; and (3) the turn to evaluate $a_i$. The judge prompt implements the constructive adherence definition: the judge is asked not only whether the turn avoids scope violations, but whether it actively reflects the expected behaviour of the defined role.
Evaluation granularity. In the default 'turn' mode the judge evaluates each turn independently with its prior context, producing one score per turn, in line with the session-score definition in the Problem Statement. In 'conversation' mode the entire conversation is submitted in a single call, reducing inference cost at the expense of per-turn granularity.
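The two granularities differ mainly in how many judge calls a conversation produces and what each call sees. A minimal sketch, assuming a conversation is stored as a list of (user, assistant) pairs; the function name and dictionary keys are hypothetical and are not taken from the pygaussia implementation:

```python
def build_judge_inputs(role, conversation, granularity="turn"):
    """Return one judge input per call for the chosen granularity.

    conversation is a list of (user_message, assistant_message) pairs.
    """
    if granularity == "conversation":
        # One call: the judge sees the whole exchange at once.
        return [{"role": role, "history": [], "candidate": list(conversation)}]
    if granularity != "turn":
        raise ValueError(f"unknown granularity: {granularity!r}")
    inputs = []
    for i, (user_message, assistant_message) in enumerate(conversation):
        inputs.append({
            "role": role,
            "history": list(conversation[:i]),  # everything before turn i
            "user": user_message,
            "candidate": assistant_message,
        })
    return inputs

conversation = [
    ("Why was my card declined?", "Let me check your card status for you."),
    ("Should I buy crypto?", "I can't advise on investments, but I can help with your account."),
]
per_turn = build_judge_inputs("fintech support agent", conversation)
whole = build_judge_inputs("fintech support agent", conversation, "conversation")
print(len(per_turn), len(whole))
```

In turn mode the number of judge calls grows with the number of assistant turns, while conversation mode always issues a single call and therefore cannot attribute a score to an individual turn.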
Output mode. The judge supports two output modes (see Output Modes: Binary and Continuous). 'binary' (default) samples the model at temperature 0 and returns a hard yes/no score per turn. 'continuous' derives the score from the model's first-token log-probability distribution, producing a calibrated score in $[0, 1]$ without prompting the model for a number. When include_reason is enabled, the judge additionally returns a natural-language justification for each per-turn score.
Output Modes: Binary and Continuous
The judge supports two output modes. Binary mode (default) samples
the model at temperature 0 and returns a hard decision per turn.
Continuous mode extracts a calibrated adherence score from the
model's first-token probability distribution by querying at temperature 1
with logprobs=True, top_logprobs=10, avoiding the calibration
issues of prompting the model to produce a number directly.
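Let $\mathcal{Y}$ and $\mathcal{N}$ denote the sets of tokenisation variants for yes and no respectively (e.g. "YES", "Yes", "yes"). For each token $t$ in the top-$k$ list, let $\ell_t$ denote its log-probability. The continuous adherence score is:
$$s = \frac{\exp(L_{\mathcal{Y}})}{\exp(L_{\mathcal{Y}}) + \exp(L_{\mathcal{N}})}, \qquad L_{\mathcal{V}} = \operatorname{LSE}_{t \in \mathcal{V}} \, \ell_t$$
where $\operatorname{LSE}$ denotes log-sum-exp, taken over the variants in $\mathcal{V}$ that appear in the top-$k$ list, aggregating probability mass across tokenisation variants without numerical underflow. When neither yes nor no appears in the top-$k$ list, the score falls back to a fixed default value.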
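The computation can be sketched as follows, under several assumptions that are not confirmed by the text: the score is the yes mass normalised against the combined yes and no mass, the fallback value is 0.5, and a missing side is treated as zero mass. The function and constant names are hypothetical.

```python
import math

YES_VARIANTS = {"YES", "Yes", "yes"}
NO_VARIANTS = {"NO", "No", "no"}

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def continuous_score(top_logprobs, default=0.5):
    """Map first-token top-k log-probabilities to a score in [0, 1].

    top_logprobs: dict of token -> log-probability.
    default: assumed fallback when neither yes nor no is in the list.
    """
    yes = [lp for tok, lp in top_logprobs.items() if tok.strip() in YES_VARIANTS]
    no = [lp for tok, lp in top_logprobs.items() if tok.strip() in NO_VARIANTS]
    if not yes and not no:
        return default
    if not no:   # assumption: an absent side contributes zero mass
        return 1.0
    if not yes:
        return 0.0
    diff = logsumexp(yes) - logsumexp(no)
    # Stable sigmoid of the log-odds between the yes and no masses.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    odds = math.exp(diff)
    return odds / (1.0 + odds)

# A first-token distribution dominated by an unrelated token:
# yes and no are both rare, but yes is far more likely than no.
dominated = {"Okay": math.log(0.99), "yes": math.log(0.005), "no": math.log(0.00001)}
print(round(continuous_score(dominated), 4))
```

Because only the relative mass of yes versus no enters the score, a model that spends most of its first-token probability elsewhere can still produce a decisive score.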
A non-obvious property of instruction-tuned models. Some instruction-tuned models do not assign the highest probability to yes or no as their first token, even when explicitly instructed to do so. Gemma 3-12B assigns approximately 99% of its first-token probability to the token "Okay", with yes and no appearing as low-probability alternatives (0.5% and 0.001% respectively), a conversational prior from instruction tuning. Despite this, the relative probability of yes versus no at the first token position encodes the model's judgment correctly, and the AUC reported in the continuous-mode results confirms that this signal is a reliable discriminator.
Infrastructure requirement. Continuous mode requires a deployment that exposes per-token log-probabilities. This is available in open-weight deployments (HuggingFace TGI, vLLM) but not in all commercial APIs. When logprobs are unavailable, binary mode remains the only option.
Deterministic Path
When a ground_truth_assistant response is available for each turn, six scoring functions are proposed. BERTScore, cosine similarity, $k$-nearest neighbours, and NLI compare individual turn pairs and are independent of corpus size. Mahalanobis Distance and Kullback-Leibler Divergence are distribution-based: they build a model of the role from the full corpus of ground-truth responses, and their signal quality varies with the number of reference examples available. A PCA-based variant of Mahalanobis Distance is additionally evaluated as a candidate solution to the curse-of-dimensionality problem and reported in the Results section.
The per-turn score is continuous in $[0, 1]$ and is averaged across turns
to produce the session score.
A note on scope. Deterministic methods compare candidate responses against a ground-truth reference and measure semantic proximity, not role adherence directly. A response can achieve high semantic similarity to the ground truth while violating the role's tone, register, or constructive requirements: such violations are semantically indistinguishable from adherent responses but contextually non-compliant. This limitation is characterised empirically in the Results section and discussed in the Discussion.
BERTScore
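BERTScore [3] computes semantic similarity between two texts by extracting contextual token embeddings from a pre-trained language model and computing precision, recall, and F1 over token-level cosine similarities. Unlike surface-level metrics, it captures paraphrastic equivalence. BERTScore F1 is computed as:
$$P_{\text{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j, \qquad R_{\text{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j, \qquad F_{\text{BERT}} = \frac{2 \, P_{\text{BERT}} \, R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$
where $x$ is the ground-truth token sequence, $\hat{x}$ is the candidate token sequence, and $\mathbf{x}_i$, $\hat{\mathbf{x}}_j$ are their length-normalised contextual token embeddings.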
Role-adherent responses are not expected to reproduce the exact phrasing of a reference. BERTScore tolerates natural linguistic variation while remaining independent of a secondary judge model.
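The greedy token matching behind BERTScore can be illustrated without a language model. The sketch below operates on toy two-dimensional vectors standing in for contextual token embeddings; it omits the encoder, importance weighting, and baseline rescaling of the published metric, and the function names are hypothetical.

```python
import math

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_f1(candidate_tokens, reference_tokens):
    """Precision, recall and F1 from greedy token-level cosine matching."""
    precision = sum(
        max(_cosine(c, r) for r in reference_tokens) for c in candidate_tokens
    ) / len(candidate_tokens)
    recall = sum(
        max(_cosine(r, c) for c in candidate_tokens) for r in reference_tokens
    ) / len(reference_tokens)
    if precision + recall <= 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy embeddings: the candidate covers the reference token exactly
# but also contains one token the reference does not support.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
p, r, f1 = greedy_match_f1(candidate, reference)
print(p, r, round(f1, 4))
```

Recall stays at 1 because every reference token finds a perfect match, while precision is halved by the unsupported candidate token.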
Semantic Cosine Similarity
Cosine similarity between sentence-level embeddings [4] is a lighter-weight alternative. A sentence encoder maps each response to a dense vector; adherence is scored as the cosine of the angle between the candidate and reference vectors:
$$s_{\cos} = \frac{\mathbf{e}_c \cdot \mathbf{e}_g}{\lVert \mathbf{e}_c \rVert \, \lVert \mathbf{e}_g \rVert}$$
where $\mathbf{e}_c$ and $\mathbf{e}_g$ are the sentence embeddings of the candidate and the ground-truth response.
Cosine similarity operates at sentence granularity rather than token granularity, making it faster. BERTScore and cosine similarity measure semantic proximity at different levels of granularity; the Discussion characterises their practical differences in group separation.
$k$-Nearest Neighbours
$k$-Nearest Neighbours ($k$-NN) generalises cosine similarity to multi-reference settings [8]. Given a reference set $G$ of ground-truth embeddings for a turn, the adherence score is the cosine proximity to the nearest neighbours:
$$s_{k\text{NN}}(\mathbf{e}_c) = 1 - \frac{1}{k} \sum_{\mathbf{g} \in \mathrm{NN}_k(\mathbf{e}_c, G)} d(\mathbf{e}_c, \mathbf{g})$$
where $d$ denotes cosine distance and $\mathrm{NN}_k(\mathbf{e}_c, G)$ is the set of the $k$ references nearest to the candidate. When $k = 1$ and a single reference per turn is available, this score is mathematically equivalent to cosine similarity. The distinction materialises when multiple valid reference responses exist per turn: the distances to the $k$ nearest neighbours weight the score towards the most similar acceptable response, gaining robustness over a single-reference cosine comparison. We propose $k$-NN as the natural extension of cosine similarity for production deployments where multiple ground-truth responses per turn can be maintained.
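The relationship between the two scores can be checked on toy vectors. The sketch below assumes the score is one minus the mean cosine distance to the $k$ nearest references, which is one plausible reading of the definition; the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_score(candidate, references, k=1):
    """One minus the mean cosine distance to the k nearest references."""
    if not references:
        raise ValueError("at least one reference embedding is required")
    distances = sorted(1.0 - cosine_similarity(candidate, ref) for ref in references)
    nearest = distances[: min(k, len(distances))]
    return 1.0 - sum(nearest) / len(nearest)

candidate = [1.0, 0.0]
close_ref = [1.0, 1.0]   # 45 degrees away from the candidate
far_ref = [0.0, 1.0]     # orthogonal to the candidate

single = knn_score(candidate, [close_ref], k=1)
multi = knn_score(candidate, [far_ref, close_ref], k=1)
print(round(single, 4), round(multi, 4))
```

With one reference and $k = 1$ the score coincides with cosine similarity; with several references and $k = 1$ it follows the closest acceptable response regardless of the order in which references are supplied.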
Natural Language Inference
NLI models [5] classify the relationship between two texts as entailment, neutral, or contradiction. Applied to role adherence, a candidate response that contradicts what a ground-truth role-adherent response asserts is a strong signal of non-adherence. NLI is proposed as a complementary signal for detecting explicit violations rather than measuring overall similarity. A contradiction label serves as a hard penalty regardless of similarity scores.
Mahalanobis Distance
Cosine similarity computes the angle between embedding vectors but treats all dimensions equally, regardless of how much the role actually varies along each dimension. A role that discusses diverse topics (high variance) but maintains strict tone constraints (low variance) will be mis-scored: large topic-dimensional shifts are penalised as heavily as small tone-dimensional shifts.
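Mahalanobis Distance [6] corrects this by incorporating the empirical covariance structure of the role's embedding space. Given the set of ground-truth embeddings $G = \{\mathbf{g}_1, \dots, \mathbf{g}_n\}$, let $\boldsymbol{\mu}$ denote their mean and $\Sigma$ their covariance matrix. The Mahalanobis Distance of a candidate response embedding $\mathbf{e}_c$ from the role centroid is:
$$D_M(\mathbf{e}_c) = \sqrt{(\mathbf{e}_c - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{e}_c - \boldsymbol{\mu})}$$
In practice $\Sigma$ is approximated as a diagonal matrix (to avoid singularity in high-dimensional embedding spaces) and regularised by adding a small constant to its diagonal. The distance is converted to an adherence score by a decreasing transform whose scaling parameter is set to 1 by default.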
Sample sensitivity. Because $\Sigma$ is estimated from the ground-truth corpus, signal quality varies with the number of available reference examples. The Results section characterises this empirically across the three conditions.
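The effect of per-dimension variance can be seen in a two-dimensional toy case. The sketch below computes a diagonal-covariance Mahalanobis distance only; the population-variance estimator, the regularisation constant `eps`, and the function name are assumptions, and the conversion from distance to adherence score is deliberately left out.

```python
import math

def diagonal_mahalanobis(candidate, references, eps=1e-6):
    """Mahalanobis distance from the reference centroid, diagonal covariance.

    eps is an assumed regularisation constant added to each variance.
    """
    n, dim = len(references), len(candidate)
    mean = [sum(ref[j] for ref in references) / n for j in range(dim)]
    var = [
        sum((ref[j] - mean[j]) ** 2 for ref in references) / n + eps
        for j in range(dim)
    ]
    return math.sqrt(sum((candidate[j] - mean[j]) ** 2 / var[j] for j in range(dim)))

# Toy role: dimension 0 ("topic") varies widely, dimension 1 ("tone") barely moves.
references = [[-1.0, 0.0], [1.0, 0.0], [-1.0, 0.1], [1.0, 0.1]]
topic_shift = diagonal_mahalanobis([1.0, 0.05], references)  # moved 1.0 along topic
tone_shift = diagonal_mahalanobis([0.0, 0.55], references)   # moved 0.5 along tone
print(round(topic_shift, 3), round(tone_shift, 3))
```

The smaller absolute shift along the low-variance dimension yields the larger distance, which is the behaviour the covariance weighting is meant to provide.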
Mahalanobis Distance with PCA Pre-Reduction
Mahalanobis Distance in high-dimensional embedding spaces is susceptible to the curse of dimensionality [10]: as the number of dimensions grows, pairwise distances concentrate around a fixed value regardless of true class proximity, eroding discriminative power. Principal Component Analysis [9] offers a candidate solution by projecting embeddings to a lower-dimensional subspace before applying Mahalanobis Distance.
We evaluate a pre-reduction from 384 to 32 dimensions (top-32 principal components of the ground-truth embedding matrix) followed by Mahalanobis Distance in the compressed space. This variant is evaluated empirically but not proposed as a recommended scoring function; the Results section shows it produces inverted adherence scores, and it is reported as a negative finding.
Kullback-Leibler Divergence
For roles with explicit stylistic or vocabulary constraints, adherence can be framed as a distributional comparison. Let $P$ denote the vocabulary distribution of the role estimated from ground-truth responses, and let $Q$ denote the vocabulary distribution of the candidate turn. The Kullback-Leibler Divergence [7] from $Q$ to the reference $P$ is:
$$D_{\mathrm{KL}}(Q \,\|\, P) = \sum_{w \in V} Q(w) \log \frac{Q(w)}{P(w)}$$
We use $D_{\mathrm{KL}}(Q \,\|\, P)$, the divergence of the candidate from the role reference, because $P$ is the distributional target. To handle zero probabilities, both distributions are smoothed with Laplace smoothing. The vocabulary $V$ is restricted to the top-$K$ most frequent tokens in the ground-truth corpus (default $K = 500$) to reduce sparsity. The divergence is then mapped to an adherence score by a decreasing transform with a default scaling parameter.
KL Divergence is proposed as a signal for vocabulary drift: cases where a response uses phrasing or terminology the role explicitly avoids. The Results section characterises its empirical performance.
Sample sensitivity. Estimating $P$ reliably requires a representative vocabulary distribution from the ground-truth corpus. The Results section characterises how signal quality varies with corpus size.
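The pipeline of vocabulary restriction, Laplace smoothing, and divergence can be sketched in a few lines. Lowercased whitespace tokenisation, add-one smoothing with `alpha=1.0`, and all function names are assumptions made for the illustration; the mapping from divergence to adherence score is not shown.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for the real tokeniser.
    return text.lower().split()

def top_k_vocab(reference_texts, k=500):
    counts = Counter(tok for text in reference_texts for tok in tokenize(text))
    return [tok for tok, _ in counts.most_common(k)]

def smoothed_distribution(texts, vocab, alpha=1.0):
    """Laplace-smoothed token distribution restricted to vocab."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    total = sum(counts[tok] for tok in vocab) + alpha * len(vocab)
    return [(counts[tok] + alpha) / total for tok in vocab]

def kl_divergence(q, p):
    """D_KL(Q || P) for two aligned, strictly positive distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

references = ["your card is blocked", "your card is active"]
vocab = top_k_vocab(references)
p = smoothed_distribution(references, vocab)
q = smoothed_distribution(["please buy bitcoin now"], vocab)
print(round(kl_divergence(q, p), 4))
```

Tokens outside the restricted vocabulary are dropped before smoothing, so a candidate made entirely of unseen words collapses to the uniform smoothed distribution. With a corpus this small the resulting divergence is tiny, which mirrors the sparse-corpus instability described above.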
Lexical Metrics: Considered and Discarded
ROUGE and BLEU measure n-gram overlap between candidate and reference texts. Both were considered and discarded as primary metrics for this use case.
Role-adherent responses are not expected to reproduce the exact phrasing of a reference --- they are expected to exhibit the same role-consistent behaviour with natural linguistic variation. N-gram metrics penalise paraphrases and reward surface-level copying, neither of which is desirable here. ROUGE and BLEU are noted as valid lightweight baselines for benchmarking purposes only.
Hypotheses
We formulate the following hypotheses to guide the empirical evaluation in the sections that follow.
H1. At small corpus sizes, KL Divergence does not produce a stable discriminative signal because the ground-truth vocabulary distribution is too sparse to be representative. We additionally predict, from the concentration of measure in high-dimensional sentence embedding spaces [10], that Mahalanobis Distance will degrade as the corpus grows, a failure mode distinct from the sparse-distribution instability of KL Divergence.
H2. With a sufficient corpus, deterministic methods achieve agreement with the gold standard comparable to the LLM judge, while offering perfect reproducibility and zero per-inference cost.
H1 motivates evaluating distribution-based methods across corpus sizes rather than at a single fixed point. H2 motivated exploring deterministic methods as a cost-effective alternative to the LLM judge; the Discussion reports it refuted.
Experimental Setup
Benchmark Dataset
We construct a controlled synthetic benchmark for a single role: a customer-facing support agent for a fintech application. The role is defined with four explicit constraints: (1) scope limited to account inquiries, transaction history, and card management; (2) a professional, empathetic, and concise tone; (3) no financial or investment advice under any circumstances; and (4) a constructive requirement to always offer a next step or resolution path rather than an open-ended refusal.
The benchmark consists of 150 assistant turns organised in 30 simulated conversations of 5 turns each. Labels are assigned by design: each turn is generated to exhibit one of four target behaviours (adherent, scope violation, tone violation, or constructive failure), and the label reflects the construction intent rather than post-hoc annotation. The distribution is approximately 40% adherent and 20% per violation type.
Each record contains four fields:
- role: the full role definition string.
- assistant: the generated response, which may violate the role.
- ground_truth_assistant: a role-adherent reference response.
- label: the designed ground-truth class.
Evaluation Conditions
Three conditions evaluate the same benchmark at increasing sample size $n$:
- Condition A ($n = 15$): the first 15 turns.
- Condition B ($n = 50$): the first 50 turns.
- Condition C ($n = 150$): all 150 turns.
Conditions are cumulative: B includes all turns in A, and C includes all turns in B. This isolates the effect of dataset size on each scoring method while holding the role and violation distribution constant.
Scoring Methods
Scoring methods evaluated, with their data requirements.
| Method | Ground truth required | Corpus-dependent |
|---|---|---|
| LLM judge (continuous) | No | No |
| BERTScore F1 | Yes | No |
| Cosine similarity | Yes | No |
| $k$-NN | Yes | No |
| NLI | Yes | No |
| Mahalanobis Distance | Yes | Yes |
| PCA+Mahalanobis (eval. only) | Yes | Yes |
| KL Divergence | Yes | Yes |
Evaluation Protocol
Continuous methods (BERTScore, cosine similarity, $k$-NN, Mahalanobis Distance, PCA+Mahalanobis, and KL Divergence) require a threshold to produce binary predictions, and any fixed threshold is arbitrary in production. We therefore report two threshold-agnostic statistics:
AUC-ROC [12] is the probability that the method assigns a higher score to a randomly selected adherent turn than to a randomly selected violation. AUC $= 1.0$ indicates perfect ranking; AUC $= 0.5$ is chance.
Separation is the difference between the mean score of adherent turns and the mean score of violation turns. It quantifies the absolute gap between group centroids and reflects how reliably a threshold can be set in practice. High AUC with low separation indicates perfect ranking but fragile threshold selection.
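Both statistics follow directly from their definitions. The sketch below computes AUC as the pairwise win rate (ties counted as one half) and separation as the difference of class means; the scores are made-up values chosen to show perfect ranking with a small gap, and the function names are hypothetical.

```python
from statistics import fmean

def auc_roc(scores, labels):
    """Probability that a random adherent turn outscores a random violation."""
    adherent = [s for s, y in zip(scores, labels) if y == 1]
    violation = [s for s, y in zip(scores, labels) if y == 0]
    if not adherent or not violation:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if a > v else 0.5 if a == v else 0.0
        for a in adherent
        for v in violation
    )
    return wins / (len(adherent) * len(violation))

def separation(scores, labels):
    """Mean adherent score minus mean violation score."""
    adherent = [s for s, y in zip(scores, labels) if y == 1]
    violation = [s for s, y in zip(scores, labels) if y == 0]
    return fmean(adherent) - fmean(violation)

# Made-up scores: every adherent turn outranks every violation,
# yet the two groups sit close together on the score axis.
scores = [0.95, 0.94, 0.93, 0.88, 0.87, 0.86]
labels = [1, 1, 1, 0, 0, 0]
print(auc_roc(scores, labels), round(separation(scores, labels), 3))
```

Here the ranking is perfect, but any usable threshold must fall inside a window only a few hundredths wide, which is the fragility that separation is meant to expose.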
Binary classifiers (NLI and LLM Judge) produce hard predictions directly. We report three statistics. F1 per class is the harmonic mean of precision and recall for each class:
$$F_1 = \frac{2 \, P \, R}{P + R}$$
where $P$ and $R$ are precision and recall for the target class. Macro F1 is the unweighted average of per-class F1 scores, treating each class equally regardless of frequency, a relevant choice here because the adherent and violation classes are not balanced. Cohen's $\kappa$ [11] is a chance-corrected agreement coefficient:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ is the observed fraction of agreements between the classifier and the gold labels, and $p_e$ is the fraction expected by chance under the marginal class distributions. $\kappa = 1$ indicates perfect agreement; $\kappa = 0$ indicates agreement at chance level; negative values indicate worse-than-chance performance.
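A worked example makes the three statistics concrete. The labels below are invented (1 = adherent, 0 = violation) and the function names are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def cohens_kappa(y_true, y_pred):
    """Chance-corrected agreement between predictions and gold labels."""
    y_true, y_pred = list(y_true), list(y_pred)
    n = len(y_true)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    classes = set(y_true) | set(y_pred)
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    if p_e == 1.0:
        return 1.0  # degenerate case: a single class on both sides
    return (p_o - p_e) / (1.0 - p_e)

# Ten invented turns: one adherent turn is wrongly flagged as a violation.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(macro_f1(gold, pred), 3), round(cohens_kappa(gold, pred), 3))
```

In this example observed agreement is 0.9 and chance agreement is 0.54, so $\kappa = 0.36 / 0.46$; a single error moves $\kappa$ much further from 1 than it moves raw accuracy.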
For the LLM Judge, macro F1 and per-class F1 are additionally accompanied by 95% bootstrap confidence intervals (1,000 resamples, seed 42) to quantify estimation uncertainty at each sample size.
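One common way to obtain such intervals is the percentile bootstrap over turns, sketched below. The text does not state which bootstrap variant was used, so the percentile method is an assumption, and plain accuracy is used as a stand-in metric to keep the example self-contained; all names are hypothetical.

```python
import random

def accuracy(y_true, y_pred):
    # Stand-in metric; any function of (y_true, y_pred) can be passed instead.
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, seed=42, level=0.95):
    """Percentile bootstrap confidence interval for a classification metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample turns with replacement, keeping gold and prediction paired.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((1.0 - level) / 2.0 * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1.0 + level) / 2.0 * n_resamples))]
    return lower, upper

gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
low, high = bootstrap_ci(gold, pred, accuracy)
print(low, high)
```

Fixing the seed makes the interval reproducible; with small samples the upper bound frequently reaches 1.0 because some resamples contain no errors.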
Mahalanobis Distance uses diagonal covariance estimation with regularisation. KL Divergence uses Laplace-smoothed distributions restricted to the top-500 most frequent tokens in the ground-truth corpus. The LLM judge is run once per turn per condition at temperature 0.
Results
Continuous Scoring Methods
The table below reports mean score by class (adherent and violation) and AUC-ROC for each continuous method across the three conditions; separation is the difference between the two class means. A companion figure provides a visual comparison of per-class mean scores across methods and conditions.
Continuous scoring methods: mean score by class and AUC-ROC across evaluation conditions. Separation (Sep.) = mean(adh.) - mean(viol.), the difference of the two mean columns; higher is better. AUC: 1.0 = perfect ranking, 0.5 = chance. The PCA+Maha rows show an inverted signal (violations score higher than adherent turns).
| Method | $n$ | Mean (adh.) | Mean (viol.) | AUC |
|---|---|---|---|---|
| BERTScore | | 0.940 | 0.869 | 1.000 |
| | | 0.941 | 0.868 | 0.995 |
| Cosine | 15 | 0.905 | 0.466 | 1.000 |
| | 50 | 0.902 | 0.520 | 0.998 |
| | 150 | 0.877 | 0.497 | 0.994 |
| $k$-NN | 15 | 0.905 | 0.496 | 1.000 |
| | 50 | 0.902 | 0.544 | 0.998 |
| | 150 | 0.877 | 0.546 | 0.993 |
| Mahalanobis | 15 | 0.404 | 0.348 | 0.963 |
| | 50 | 0.391 | 0.348 | 0.872 |
| | 150 | 0.391 | 0.350 | 0.856 |
| PCA+Maha | 15 | 0.283 | 0.419 | 0.000 |
| | 50 | 0.318 | 0.393 | 0.110 |
| | 150 | 0.343 | 0.383 | 0.219 |
| KL Div. | 15 | 0.860 | 0.858 | 0.537 |
| | 50 | 0.693 | 0.671 | 0.870 |
| | 150 | 0.517 | 0.494 | 0.877 |
A companion figure shows the separation curve for Mahalanobis Distance, KL Divergence, and PCA+Mahalanobis over intermediate sample sizes from $n = 15$ to $n = 150$. By the class means tabulated above, Mahalanobis separation declines monotonically from roughly 0.06 to roughly 0.04. KL separation rises from near zero to roughly 0.02. PCA+Mahalanobis separation remains negative at every $n$, converging from below towards zero without crossing it.
Binary Classifiers
Binary classifiers: per-class F1, macro F1, and Cohen's $\kappa$.
| Method | $n$ | F1 (adh.) | F1 (viol.) | Macro F1 | $\kappa$ |
|---|---|---|---|---|---|
| NLI | | 0.625 | 0.333 | 0.479 | 0.167 |
| | | 0.612 | 0.269 | 0.441 | 0.128 |
| LLM Judge | 15 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 50 | 0.974 | 0.984 | 0.979 | 0.958 |
| | 150 | 0.974 | 0.984 | 0.979 | 0.958 |
LLM Judge: Bootstrap Confidence Intervals
The $n = 15$ result (macro F1 $= 1.000$, $\kappa = 1.000$) reflects perfect classification on a small and easy subsample and should not be taken as the representative estimate. The table below reports macro F1, per-class F1, and Cohen's $\kappa$ with 95% bootstrap confidence intervals (1,000 resamples, seed 42). The canonical result is macro F1 $= 0.979$ and $\kappa = 0.958$, with the confidence intervals given in the table.
LLM Judge (llama-3.3-70b-versatile): macro F1, per-class F1, and Cohen's $\kappa$ with 95% bootstrap confidence intervals (1,000 resamples, seed 42).
| $n$ | Macro F1 | 95% CI | F1 (adh.) | 95% CI | F1 (viol.) | 95% CI |
|---|---|---|---|---|---|---|
| | 0.979 | [0.934, 1.000] | 0.975 | [0.914, 1.000] | 0.984 | [0.947, 1.000] |
| | 0.979 | [0.952, 1.000] | 0.975 | [0.942, 1.000] | 0.984 | [0.963, 1.000] |
LLM Judge: Continuous Mode
Three models were evaluated using the judge's continuous mode on the full benchmark ($n = 150$), each chosen for a specific purpose: Llama 3.3-70B for direct comparison with the binary judge of the same family; Llama 3.1-8B to test whether the approach scales to a smaller, cheaper model [13]; and Gemma 3-12B (March 2025) as a more recent model better optimised for instruction following, to test generalisation beyond the Llama family. A fourth model, Qwen3-14B, was evaluated but is excluded from the table below: all 150 scores fall back to the default value because yes and no do not appear in the top-10 log-probability list for any record, even with chain-of-thought generation disabled. This is a structural incompatibility between Qwen3's instruction-tuning format and the logprob approach.
LLM Judge (continuous mode): AUC, Average Precision, and mean score by class ($n = 150$). Separation is the difference of the two mean columns. Llama 3.1-8B shows an inverted signal.
| Model | AUC | Average Precision | Mean (adh.) | Mean (viol.) |
|---|---|---|---|---|
| Llama 3.3-70B | 0.883 | 0.767 | 0.208 | 0.032 |
| Llama 3.1-8B | 0.303 | 0.295 | 0.536 | 0.688 |
The table below shows the per-violation breakdown for Gemma 3-12B, confirming uniform discrimination across all three violation types.
LLM Judge in continuous mode (Gemma 3-12B): mean score by class ($n = 150$).
| Class | Mean score |
|---|---|
| scope_violation | 0.000 |
| tone_violation | 0.000 |
| constructive_failure | 0.000 |
The table below places all continuous methods and the logprob judge side by side on AUC at $n = 150$, enabling a unified comparison across scoring families for the first time.
Unified AUC comparison across all methods ($n = 150$). GT = ground-truth reference responses required.
| Method | GT | AUC |
|---|---|---|
| Cosine similarity | Yes | 0.994 |
| $k$-NN | Yes | 0.993 |
| KL Divergence | Yes | 0.877 |
| Mahalanobis Dist. | Yes | 0.856 |
| PCA+Mahalanobis | Yes | 0.219 |
| LLM judge cont. (Llama 3.3-70B) | No | 0.883 |
Discussion
The Semantics Trap: Why AUC $\approx 1$ Is Not a Victory
BERTScore, Cosine, and $k$-NN achieve AUC $\approx 1$ across all conditions. Read naively, this looks like a definitive result. It is not. The benchmark was constructed so that adherent responses are paraphrases of the ground truth (semantically close by design) and violations are responses that mishandle or ignore the request (semantically distinct by design). Any method that measures semantic proximity to a per-turn ground-truth reference separates this benchmark perfectly because separability was built into the construction, not discovered by the method.
The per-class mean scores make the mechanism explicit: across the three conditions, Cosine similarity sits at roughly 0.88 to 0.91 for adherent responses and 0.47 to 0.52 for violations. The groups are nearly half a unit apart because adherent examples were generated as paraphrases of the ground truth and violations were not. In production, the hardest violations (tone deviations and constructive failures) are precisely the ones where semantic similarity to the ground truth is highest: a response may say the right things in the wrong register, or omit the required resolution path while remaining topically on-role. Cosine similarity would assign these a high score and miss the violation entirely.
The correct interpretation of deterministic methods with ground truth is that they are semantic drift detectors, not role adherence evaluators. They are useful for monitoring whether response content deviates from an expected reference, but they cannot substitute for a method that reads and understands the role definition.
Cosine vs. BERTScore: Practical Separation Matters
Of the methods that reach AUC $\approx 1$, Cosine similarity separates the class means by roughly 0.38 at $n = 150$ (0.877 versus 0.497), while BERTScore separates them by only about 0.07 (0.94 versus 0.87). Although BERTScore operates at finer granularity (token-level contextual embeddings versus sentence-level pooling), this additional resolution does not translate to better group separation in this setup: the two groups are sufficiently similar at the token level that BERTScore barely distinguishes them in absolute terms, even though its ranking remains nearly perfect.
For a deployment requiring threshold-based binary classification, BERTScore's small separation makes threshold selection fragile: a threshold at 0.90 and one at 0.91 produce meaningfully different decisions. Cosine similarity, with its large separation, is more robust to threshold choice.
$k$-NN is mathematically equivalent to Cosine in the single-reference-per-turn setup used here ($k = 1$). Its advantage materialises when multiple valid ground-truth responses exist per turn: the nearest-neighbour distance weights the score towards the most similar acceptable response, gaining robustness that a single-reference cosine comparison cannot provide.
Mahalanobis Distance: Curse of Dimensionality Confirmed
Mahalanobis Distance shows a monotonic degradation pattern: AUC is highest at $n = 15$ (0.963) and declines to 0.856 at $n = 150$. This is the opposite of what standard statistical intuition would predict: more data should produce better distributional estimates, not worse ones.
The explanation is the concentration of measure in high-dimensional spaces [10]. As $n$ grows, the estimated covariance of the 384-dimensional ground-truth embeddings converges towards the true distribution, but in 384D all pairwise distances concentrate around a fixed value regardless of class membership. The diagonal regularisation partially mitigates this at small $n$ by preventing covariance collapse; at larger $n$ the regularisation becomes relatively negligible and the concentration effect dominates. This confirms the prediction embedded in H1: Mahalanobis Distance is not viable for production deployments at standard embedding dimensions.
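The concentration effect itself is easy to reproduce on synthetic data. The sketch below draws isotropic Gaussian points (unit covariance, so Euclidean and Mahalanobis distance from the centroid coincide) and compares the relative spread of distances in 2 and 384 dimensions. It illustrates the general phenomenon only and makes no claim about the geometry of real sentence embeddings; the function name and sample sizes are hypothetical.

```python
import math
import random
from statistics import fmean, pstdev

def relative_spread(dim, n_points=200, seed=0):
    """Std/mean of distances from the origin for random Gaussian points."""
    rng = random.Random(seed)
    distances = [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(n_points)
    ]
    return pstdev(distances) / fmean(distances)

low_dim = relative_spread(2)
high_dim = relative_spread(384)
print(f"2-D: {low_dim:.3f}  384-D: {high_dim:.3f}")
```

In high dimension the distances cluster tightly around their mean, leaving little room for a distance-based score to distinguish one point from another.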
PCA+Mahalanobis: A Structural Failure
The PCA+Mahalanobis variant was motivated as a dimensionality reduction solution to the concentration problem. The result is unambiguous: the method inverts the adherence signal at every condition (the class-mean separation is about $-0.14$ at $n = 15$, never crossing zero). AUC $= 0.000$ at $n = 15$ means the method assigns higher scores to violations than to adherent responses with perfect consistency.
The mechanism is structural. PCA decomposes the variance of the ground-truth embedding matrix: the 32 retained principal components capture the directions of maximum variance within the reference set. This transformation has no relationship to the adherence-violation geometry --- it captures what varies most among adherent examples, not what separates adherent from violation. In the compressed space, violation embeddings happen to project near the centroid of the transformed reference distribution while adherent examples, spread across the principal components, project further away.
More data partially corrects the PCA estimate (components become less dominated by individual examples), which is why separation improves from to as grows. But the structural distortion never reverses. PCA+Mahalanobis is not recommended for any deployment scenario. H2 is refuted: no deterministic method approaches the LLM Judge at large , and the methods that appear to do so via AUC are benefiting from the benchmark artefact described above.
KL Divergence: The Wrong Level of Abstraction
KL Divergence operates on vocabulary distributions estimated from the ground-truth corpus. Its near-random signal at (AUC\,) is expected: with 15 turns the top-500 vocabulary distribution is too sparse to be representative. As grows the distribution stabilises and AUC converges to at .
The ceiling is not a sample-size problem --- it is a conceptual one. The role violations in the benchmark (scope, tone, constructive failures) do not produce distinctive vocabulary patterns. An agent that answers a question in the wrong tone uses largely the same vocabulary as one that answers correctly. KL Divergence is blind to this distinction because it operates at the lexical level, not the semantic or pragmatic level. The convergence to AUC\, suggests KL may serve as a rough vocabulary drift monitor but not as a primary adherence detector.
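The lexical blindness can be illustrated with a minimal sketch. It assumes whitespace tokenisation, add-one smoothing, and the direction KL(response || reference); these choices, the helper names, and the toy sentences are illustrative and may differ from the benchmark implementation.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive lowercase whitespace tokeniser (an assumption of this sketch)."""
    return text.lower().split()


def top_vocab(corpus: list[str], size: int = 500) -> list[str]:
    """The `size` most frequent tokens in the ground-truth corpus."""
    counts = Counter(tok for text in corpus for tok in tokenize(text))
    return [word for word, _ in counts.most_common(size)]


def distribution(texts: list[str], vocab: list[str], alpha: float = 1.0) -> list[float]:
    """Smoothed unigram distribution over `vocab`."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return [(counts[w] + alpha) / total for w in vocab]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


reference_corpus = [
    "you can reset your card pin in the app settings",
    "your transfer will arrive in two business days",
]
vocab = top_vocab(reference_corpus)
q = distribution(reference_corpus, vocab)

original = distribution(["your transfer will arrive in two business days"], vocab)
reordered = distribution(["in two business days your transfer will arrive"], vocab)

# Identical bags of words give identical divergences, whatever the phrasing.
kl_original = kl_divergence(original, q)
kl_reordered = kl_divergence(reordered, q)
```

Any two responses with the same word counts receive the same score, so a change in phrasing or tone that leaves the vocabulary largely intact moves the divergence very little.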
NLI: Insufficient for Role Adherence
NLI achieves and at --- better than chance but far below useful production thresholds. The model correctly flags some explicit scope violations (responses that contradict ground-truth assertions) but misses tone violations and constructive failures. These failure types do not produce logical contradictions with the ground-truth response: an agent that fails to offer a resolution path omits something, it does not contradict anything. NLI's contradiction signal is a necessary but insufficient condition for role violation. Performance is also unstable across : drops from at to at , suggesting the threshold used for binary classification is not stable.
LLM Judge: The Only Role-Aware Method
The LLM Judge achieves at with bootstrap CI [,\,] at , [,\,]. It is the only evaluated method that reads the role definition directly and evaluates the response against it. All other methods operate on signals that are at best correlated with adherence; the judge assesses adherence by definition and can detect tone violations and constructive failures that are invisible to semantic similarity methods.
The practical costs are non-determinism (mitigated by fixing temperature to 0) and per-call inference cost, both well-documented properties of LLM evaluation [1].
Judge model quality is a design variable. The result
was obtained with llama-3.3-70b-versatile. A smaller model will
produce substantially different results. Reporting LLM Judge performance
without specifying the judge model is methodologically equivalent to reporting
a deterministic metric without specifying its embedding model: a necessary
parameter omitted. Section extends the comparison by
placing the LLM judge's continuous mode on the same AUC scale as the deterministic methods.
LLM Judge Continuous Mode: Unified AUC Comparison
The continuous mode experiment was motivated by a comparison gap: the binary judge produces and , while continuous methods produce AUC. These are not directly comparable. By extracting from the judge model's token distribution (Section), the same judge family can be evaluated on AUC, placing all methods on a common scale.
On this unified scale (Table), Gemma 3-12B in continuous mode reaches AUC\,, matching the top deterministic methods. The meaningful difference lies not in the AUC value itself but in what each method measures. BERTScore, Cosine, and k-NN require ground-truth reference responses and measure semantic proximity to them, as established in Section. The LLM judge in continuous mode requires no ground truth and evaluates adherence by reading the role definition directly, including tone violations and constructive failures invisible to semantic similarity methods.
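For reference, AUC can be read as the probability that a randomly chosen adherent turn receives a higher score than a randomly chosen violation, with ties counted as one half [12]. A minimal sketch of that pairwise definition follows; the function name and the toy scores are illustrative.

```python
def pairwise_auc(adherent_scores: list[float], violation_scores: list[float]) -> float:
    """Fraction of (adherent, violation) pairs ranked correctly; ties count 0.5."""
    if not adherent_scores or not violation_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for a in adherent_scores:
        for v in violation_scores:
            if a > v:
                wins += 1.0
            elif a == v:
                wins += 0.5
    return wins / (len(adherent_scores) * len(violation_scores))


# Perfect separation, perfect inversion, and an uninformative scorer.
separated = pairwise_auc([0.9, 0.8], [0.2, 0.1])
inverted = pairwise_auc([0.2, 0.1], [0.9, 0.8])
uninformative = pairwise_auc([0.5, 0.5], [0.5, 0.5])
```

Under this reading, a value below 0.5 means the method ranks violations above adherent responses more often than not, which is the inversion pattern discussed for PCA+Mahalanobis and Llama 3.1-8B.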
Two additional findings from the logprob experiment carry practical significance. First, model quality and recency outweigh size: Gemma 3-12B (12B parameters, March 2025) substantially outperforms Llama 3.3-70B (70B parameters) on this task (AUC\, vs.~). This suggests that the approach is sensitive to the instruction-following quality of the judge model rather than its size.
Second, the 8B failure is a capacity problem, not a calibration problem. Llama 3.1-8B assigns systematically higher to violations than to adherent responses (mean separation , AUC\,). This cannot be corrected by threshold tuning: the probability assignments are inverted. The result establishes a practical lower bound on the model capability required for reliable continuous scoring.
The main practical constraint is infrastructure: continuous mode requires deployments that expose per-token log-probabilities (Section). When logprobs are unavailable, binary mode remains the only option.
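One way such a score could be derived is sketched below. It assumes the judge is prompted to open its answer with a verdict token, here the hypothetical labels "yes" and "no", and that the provider returns a mapping from candidate first tokens to log-probabilities. The exact quantity extracted in the experiments is defined in the referenced section; this block only illustrates the renormalisation idea and the fallback to binary mode.

```python
import math
from typing import Optional


def continuous_score(
    first_token_logprobs: dict[str, float],
    positive: tuple[str, ...] = ("yes",),  # hypothetical verdict labels
    negative: tuple[str, ...] = ("no",),
) -> Optional[float]:
    """Probability mass on the adherent verdict, renormalised over both verdicts.

    Returns None when neither verdict token appears among the candidates.
    """
    def mass(labels: tuple[str, ...]) -> float:
        return sum(
            math.exp(logprob)
            for token, logprob in first_token_logprobs.items()
            if token.strip().lower() in labels
        )

    p_pos, p_neg = mass(positive), mass(negative)
    if p_pos + p_neg == 0.0:
        return None
    return p_pos / (p_pos + p_neg)


def score_turn(first_token_logprobs: Optional[dict[str, float]], binary_verdict: bool) -> float:
    """Use the continuous score when logprobs are usable, else the binary verdict."""
    if first_token_logprobs is not None:
        score = continuous_score(first_token_logprobs)
        if score is not None:
            return score
    return 1.0 if binary_verdict else 0.0


example = continuous_score({"yes": math.log(0.8), "no": math.log(0.2)})
fallback = score_turn(None, binary_verdict=True)
```

Renormalising over the two verdict tokens keeps the score on a fixed scale even when the candidate list carries probability mass on unrelated tokens.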
Configuration Parameters
The metric exposes the following parameters.
scoring\_method (default: 'llm\_judge') ---
selects the scoring function. 'llm\_judge' (default) reads the
role definition directly and is the only option capable of detecting tone
violations and constructive failures. When ground-truth responses are
available, deterministic alternatives can be selected: 'cosine'
and 'knn' are recommended for cost-sensitive monitoring, with the
explicit caveat that they detect semantic drift from the reference rather
than role adherence per se; 'bertscore' and 'nli' are
also available. See Section for empirical characterisation
of each method.
output\_mode (default: 'binary') --- controls
the LLM judge output format. 'binary' samples the model at
temperature 0 and returns a hard score per turn.
'continuous' extracts from the model's first-token log-probability distribution,
producing a calibrated score in (Section).
Continuous mode requires a deployment that exposes per-token log-probabilities
(e.g.\ HuggingFace TGI, vLLM); if the provider does not support logprobs,
the framework falls back to 'binary' automatically. Applies only
when scoring\_method='llm\_judge'.
evaluation\_granularity (default: 'turn') ---
controls whether the judge evaluates each assistant turn with its prior
context ('turn', one API call per turn, aligned with
Equation) or the full conversation in a single call
('conversation', lower cost, no per-turn breakdown). Applies only
when scoring\_method='llm\_judge'.
strict\_mode (default: False) --- when
True, the session passes only if every turn is adherent (session
score ). Intended for high-criticality deployments where partial
adherence is not acceptable.
threshold (default: 0.5) --- the minimum session score
required to mark the conversation as adherent. Ignored when
strict\_mode=True. In binary output mode a threshold of 0.5
requires that at least half of all turns are adherent; the parameter is
most relevant when output\_mode='continuous'.
model --- the LLM used as judge. Follows the
Guardian abstraction in the Gaussia framework, allowing any
supported provider to be substituted. As shown in Section,
judge model quality significantly affects detection performance and should
be treated as a first-class design parameter.
include\_reason (default: False) --- when
enabled, the judge returns a natural-language justification alongside each
per-turn score. Applies only when scoring\_method='llm\_judge'.
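The parameters and their interactions can be summarised in a small configuration object. The sketch below is hypothetical: `RoleAdherenceConfig` and its methods are illustrative names rather than the Gaussia API, and the set of accepted values is limited to those named in the list above.

```python
from dataclasses import dataclass
from typing import Optional

SCORING_METHODS = {"llm_judge", "cosine", "knn", "bertscore", "nli"}
OUTPUT_MODES = {"binary", "continuous"}
GRANULARITIES = {"turn", "conversation"}


@dataclass
class RoleAdherenceConfig:
    """Hypothetical container mirroring the documented defaults."""

    scoring_method: str = "llm_judge"
    output_mode: str = "binary"
    evaluation_granularity: str = "turn"
    strict_mode: bool = False
    threshold: float = 0.5
    model: Optional[str] = None  # judge model identifier; no default is documented
    include_reason: bool = False

    def __post_init__(self) -> None:
        if self.scoring_method not in SCORING_METHODS:
            raise ValueError(f"unknown scoring_method: {self.scoring_method!r}")
        if self.output_mode not in OUTPUT_MODES:
            raise ValueError(f"unknown output_mode: {self.output_mode!r}")
        if self.evaluation_granularity not in GRANULARITIES:
            raise ValueError(f"unknown evaluation_granularity: {self.evaluation_granularity!r}")

    def effective_output_mode(self, provider_supports_logprobs: bool) -> Optional[str]:
        """Output mode actually used; None when the judge path is not active."""
        if self.scoring_method != "llm_judge":
            return None  # output_mode applies only to the LLM judge
        if self.output_mode == "continuous" and not provider_supports_logprobs:
            return "binary"  # automatic fallback described above
        return self.output_mode


config = RoleAdherenceConfig(output_mode="continuous")
mode_without_logprobs = config.effective_output_mode(provider_supports_logprobs=False)
```

Validating the values at construction time surfaces a misspelled option before any judge call is made, which matters when each call has an inference cost.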
Implementation Considerations
The metric extends the Gaussia base class. The batch()
method iterates over each assistant turn in a session, dispatches to the
scoring function selected via scoring\_method (default:
'llm\_judge'), and appends a per-turn score. The session score is
the mean of per-turn scores (Equation).
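The aggregation step can be sketched as follows. The function names are hypothetical and do not reflect the Gaussia base class; strict mode is interpreted here as requiring every per-turn score to equal 1, which assumes binary per-turn scores.

```python
from statistics import fmean


def session_score(turn_scores: list[float]) -> float:
    """Mean of the per-turn scores for one session."""
    if not turn_scores:
        raise ValueError("a session needs at least one assistant turn")
    return fmean(turn_scores)


def session_passes(
    turn_scores: list[float],
    threshold: float = 0.5,
    strict_mode: bool = False,
) -> bool:
    """Apply the strict_mode and threshold rules to a list of per-turn scores."""
    score = session_score(turn_scores)
    if strict_mode:
        # Assumption: strict mode requires every turn to be scored as adherent.
        return all(s == 1 for s in turn_scores)
    return score >= threshold


turns = [1.0, 0.0, 1.0, 1.0]  # one non-adherent turn out of four
mean_score = session_score(turns)
lenient = session_passes(turns)
strict = session_passes(turns, strict_mode=True)
```

A single non-adherent turn leaves the session above the default threshold but fails it under strict mode, which is the behaviour intended for high-criticality deployments.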
The chatbot\_role field is read from the Dataset object
and passed as a string to all scoring methods and, when the judge path is
active, injected verbatim into the judge prompt.
Statistical mode compatibility (Frequentist and Bayesian) follows the existing Gaussia pattern. No changes to the statistical layer are required.
Future Work
Three directions are identified for future development.
Continuous mode capacity threshold. The continuous mode experiment establishes that Llama 3.1-8B fails (AUC\,, inverted signal) while Gemma 3-12B succeeds (AUC\,). The minimum model capability required for reliable continuous scoring lies in this gap and has not been characterised. A systematic evaluation across model sizes and families would delimit this threshold and guide model selection in cost-constrained deployments.
Multimodal settings. As conversational AI systems process images, audio, and video alongside text, role adherence evaluation must generalise beyond text-only responses. The LLM judge path extends naturally by requiring a vision-capable model; the deterministic path would require modality-appropriate similarity functions, such as CLIP-based similarity for image content.
Re-evaluation on a larger and more diverse benchmark. The current benchmark is synthetic and scoped to a single domain. As discussed in Section, the near-perfect AUC of semantic methods is a structural property of this benchmark's construction rather than evidence of genuine role understanding. A multi-domain benchmark with naturally occurring violations --- including tone deviations that are semantically close to adherent responses --- would provide a more demanding test for all methods and may yield a different relative ordering.
Conclusion
We have proposed Role Adherence, a metric for the Gaussia evaluation framework that measures how consistently an AI assistant maintains its assigned role across a multi-turn conversation. The metric adopts a constructive definition of adherence, evaluates each turn in full conversational context, and uses an LLM-as-a-judge as its default scoring path. Deterministic semantic scoring is available as a reproducible alternative when ground-truth reference responses are present.
Empirical evaluation on a controlled synthetic benchmark surfaces a distinction with practical consequences: deterministic methods (BERTScore, Cosine, k-NN) measure semantic proximity to a reference response, not role adherence per se. Their near-perfect AUC on the benchmark is a structural artefact of how the benchmark was constructed, not evidence of role understanding. A response that violates the role in tone or fails a constructive requirement while remaining semantically close to the expected answer is invisible to these methods. This finding motivates using deterministic methods only as lightweight semantic drift monitors, not as primary adherence evaluators.
Among methods that model the reference distribution rather than individual turn pairs, Mahalanobis Distance in high-dimensional embedding space degrades monotonically with sample size (AUC ), confirming the curse of dimensionality. A PCA-based pre-reduction inverts the adherence signal entirely and is not recommended. KL Divergence converges to a weak ceiling (AUC ) because vocabulary overlap is a poor proxy for adherence. The LLM Judge (, , CI [,\,] at ) is the only method that reliably detects all violation types, at the cost of per-call inference and dependence on judge model quality --- a variable that must be treated as a first-class design parameter.
The LLM judge in continuous mode, which extracts from the model's token distribution rather than sampling a binary decision, provides a calibrated adherence score without ground truth and enables unified AUC comparison with the deterministic methods. Gemma 3-12B achieves AUC\,, matching the best deterministic methods on this benchmark while evaluating genuine role adherence rather than semantic proximity to a reference. The failure of Llama 3.1-8B (AUC\,, inverted signal) confirms that the approach requires a model capable of reliably reasoning about role adherence at the first-token level.
Primary directions for future work are: (1) per-violation-type evaluation to test whether any deterministic method retains diagnostic value for specific violation categories; (2) multi-domain benchmarking beyond the single fintech support agent domain used here; (3) cost-accuracy analysis of judge models to characterise the tradeoff between inference cost and detection quality; and (4) characterisation of the minimum model capability threshold for reliable continuous scoring, delimiting the gap observed between Llama 3.1-8B (AUC\,) and Gemma 3-12B (AUC\,).
References
- [1] Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.
- [2] Deriu et al. (2021). Survey on evaluation methods for dialogue systems. Artificial Intelligence Review.
- [3] Zhang et al. (2020). BERTScore: Evaluating Text Generation with BERT. International Conference on Learning Representations.
- [4] Reimers and Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
- [5] Bowman et al. (2015). A large annotated corpus for learning natural language inference. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- [6] Mahalanobis (1936). On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India.
- [7] Kullback and Leibler (1951). On information and sufficiency. The Annals of Mathematical Statistics.
- [8] Cover and Hart (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory.
- [9] Jolliffe (2002). Principal Component Analysis. Springer.
- [10] Beyer et al. (1999). When Is "Nearest Neighbor" Meaningful? Proceedings of the 7th International Conference on Database Theory.
- [11] Cohen (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement.
- [12] Hanley and McNeil (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.
- [13] Dubey et al. (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.