Role Adherence: A Metric for Evaluating Role Consistency in Multi-Turn Conversational AI

Axel Fritz

axel.fritz@alquimia.ai

Alquimia AI

Alex Fiorenza

alex.fiorenza@alquimia.ai

Alquimia AI

April 2026

Abstract

As conversational AI systems are deployed in role-constrained settings (customer support agents, educational tutors, medical assistants), measuring whether a model consistently adheres to its assigned role becomes a core quality requirement. This paper proposes Role Adherence, a metric that quantifies role consistency across multi-turn conversations under a constructive definition: an assistant must actively exhibit the behaviour its role demands, not merely avoid out-of-scope content. We evaluate eight scoring methods and the LLM judge in both output modes on a controlled synthetic benchmark of 150 turns (scope violations, tone violations, and constructive failures) across three dataset-size conditions ($n = 15$, $50$, $150$). An LLM judge (\texttt{llama-3.3-70b-versatile}; \citealt{dubey2024llama3}) achieves $F_1^{\text{macro}} = 0.979$ and $\kappa \approx 0.96$ at $n = 150$, and is the only method that reads the role definition directly, detecting all violation types including tone deviations and constructive failures. Deterministic methods (BERTScore, cosine similarity, $k$-NN) reach AUC\,$\approx 1.0$, but this is a structural artefact: they measure semantic proximity to a ground-truth reference and are blind to violations that are semantically close to the correct response. Distribution-based methods degrade or invert with corpus size: Mahalanobis Distance declines monotonically (AUC: $0.963 \to 0.872 \to 0.856$), a PCA-based variant inverts the adherence signal entirely (AUC\,$= 0.000$ at $n = 15$), and KL Divergence converges to a weak ceiling (AUC\,$\approx 0.88$). The LLM judge in continuous mode, which extracts $P(\textsc{yes})$ from the model's first-token distribution, achieves AUC\,$= 0.999$ with Gemma\,3-12B without requiring ground truth, enabling unified comparison with deterministic methods on the same scale. The metric uses an LLM-as-a-judge as its default scoring path, as it is the only method capable of detecting tone violations and constructive failures.
Deterministic methods are available as a cost-free alternative when ground truth is present, but empirical results show they act as semantic drift detectors rather than genuine role adherence evaluators. The metric is part of the Gaussia evaluation framework (\url{https://www.gaussia.ai/}). The implementation is open-source at \url{https://github.com/Alquimia-ai/pygaussia}.

Introduction

Modern LLM-based assistants are typically deployed with a system prompt that defines their role: the topics they should cover, the tone they should adopt, the behaviours they should exhibit, and the boundaries they should respect. Despite this, models frequently deviate from their assigned roles --- answering out-of-scope questions, abandoning their defined persona, or failing to exhibit the behaviour their role requires.

Existing evaluation frameworks either measure role adherence at the level of individual turns in isolation ^[1], or rely entirely on LLM-based judges, which introduce reproducibility and cost concerns ^[2]. This paper proposes a more complete metric that evaluates adherence turn by turn with prior conversational context, uses an LLM-as-a-judge as its default scoring path, and offers deterministic semantic scoring as a reproducible alternative when ground-truth reference responses are available.

Problem Statement

Given a multi-turn conversation between a user and an AI assistant, and a definition of the role the assistant is expected to play, we want to compute a scalar score representing how consistently the assistant adhered to that role throughout the conversation.

Let a conversational session $T$ be a sequence of assistant turns $t_1, t_2, \ldots, t_n$, and let $R$ denote the role definition. The Role Adherence score is defined as:

\text{RoleAdherence}(R, T) = \frac{1}{n} \sum_{i=1}^{n} \text{adhere}(t_i,\; T_{<i},\; R) \label{eq:score}

where $T_{<i}$ is the conversation history up to turn $i$, and $\text{adhere}(\cdot) \in [0, 1]$ is a per-turn adherence score. The session score therefore lies in $[0, 1]$, where 1 indicates full adherence across all turns. Whether $\text{adhere}(\cdot)$ takes hard values in $\{0, 1\}$ or continuous values in $[0, 1]$, and how it is computed, depends on the scoring path selected (see Scoring Methodology).
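The definition above can be illustrated with a minimal Python sketch. The names `role_adherence`, `AdhereFn`, and `toy_adhere` are hypothetical and are not part of the published implementation; the toy scorer only stands in for a real scoring path.

```python
from statistics import fmean
from typing import Callable, Sequence

# Hypothetical scorer signature: (turn, history, role) -> score in [0, 1].
AdhereFn = Callable[[str, Sequence[str], str], float]


def role_adherence(role: str, turns: Sequence[str], adhere: AdhereFn) -> float:
    """Mean per-turn adherence; turn i is scored with the turns before it."""
    if not turns:
        raise ValueError("a session needs at least one assistant turn")
    return fmean(adhere(turn, turns[:i], role) for i, turn in enumerate(turns))


def toy_adhere(turn: str, history: Sequence[str], role: str) -> float:
    # Toy stand-in for a real scoring path: penalise investment talk.
    return 0.0 if "stocks" in turn.lower() else 1.0


session = [
    "Your card is now frozen.",
    "You should buy stocks.",
    "Is there anything else I can help with?",
]
score = role_adherence("fintech support agent", session, toy_adhere)
print(round(score, 3))  # two of three turns adhere
```

Because the session score is a plain mean, a single violating turn in a short session moves the score substantially, which is worth keeping in mind when comparing sessions of different lengths.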

Defining Adherence

Constructive vs.\ Restrictive Adherence

Two competing definitions of adherence are possible.

Restrictive adherence asks: did the assistant avoid going outside the scope of its role? A turn is compliant if it does not address topics or exhibit behaviours explicitly excluded by the role. This definition only penalises negative deviations.

Constructive adherence asks: did the assistant actively behave as its role requires? A turn is compliant only if it exhibits the behaviour, tone, and knowledge a properly role-playing assistant should exhibit. This definition penalises both negative deviations (scope violations) and positive failures (failing to act as the role demands).

Consider an assistant assigned the role of a technical support agent. If the user asks a question within the assistant's domain and the assistant responds ``I don't know'', restrictive adherence would record no violation --- no out-of-scope content was produced. Constructive adherence would record a failure, because a role-adherent support agent is expected to answer questions in its domain.

We adopt the constructive definition as the primary formulation for this metric, as it provides a more meaningful signal for production quality assessment. Restrictive adherence remains a valid and simpler formulation for use cases where the primary concern is containment rather than quality of execution. Both definitions are implementable within the proposed schema.

Conversational Context

The default mode evaluates each turn individually: turn $t_i$ is scored by passing the role definition, the prior conversation history $T_{<i} = \{t_1, \ldots, t_{i-1}\}$ , and the candidate response to the judge in a single call. This produces one score per turn and preserves per-turn interpretability. A turn evaluated without prior context may appear off-role but be entirely appropriate given what came before; conversely, a neutral-looking turn may be part of an escalating deviation only visible across the conversational arc.

An alternative is to evaluate the full conversation in a single call (evaluation\_granularity='conversation'), which reduces inference cost at the expense of per-turn granularity. We note this as a valid trade-off for cost-sensitive deployments but do not adopt it as the default.

Role definition schema. The role is provided as a free-form natural language string and is a property of the session, not of the metric. chatbot\_role is defined as a field on the Dataset object because a dataset already carries implicit contextual assumptions: the way a user interacts with a customer support agent differs from how they interact with a coding assistant, even if the raw conversation text were identical. Separating the role from the session would break this coupling. The string is passed directly to all scoring methods: embedded alongside the candidate response in deterministic path computations, and injected verbatim into the LLM judge prompt.

Scoring Methodology

Scoring Strategy

The metric was originally designed with deterministic semantic scoring as its primary path, motivated by reproducibility and zero per-inference cost. Empirical evaluation (see Results) showed that deterministic methods, while reliable at detecting semantic drift from a reference response, fail to detect tone violations and constructive failures, the hardest and most practically relevant violation types in a role-constrained deployment. This finding motivated adopting the LLM judge as the default scoring path.

The judge reads the role definition directly and is the only evaluated method capable of detecting tone violations and constructive failures, a conclusion supported empirically in the Results and Discussion sections.

When ground-truth reference responses are available and the user prefers a deterministic, reproducible alternative, the metric can be configured to use a semantic scoring function instead (see Deterministic Path). Deterministic methods are faster and free of per-call cost, but detect semantic drift from the reference rather than role adherence per se.

LLM-as-a-Judge Path

Each assistant turn $t_i$ is evaluated by an LLM judge that receives: (1) the role definition $R$ ; (2) the conversation history $T_{<i}$ ; and (3) the turn to evaluate $t_i$ . The judge prompt implements the constructive adherence definition: the judge is asked not only whether the turn avoids scope violations, but whether it actively reflects the expected behaviour of the defined role.

Evaluation granularity. In the default 'turn' mode the judge evaluates each turn independently with its prior context, producing $n$ separate scores that are averaged as in the Role Adherence definition (see Problem Statement). In 'conversation' mode the entire conversation is submitted in a single call, reducing inference cost at the expense of per-turn granularity.

Output mode. The judge supports two output modes (see Output Modes: Binary and Continuous). 'binary' (default) samples the model at temperature 0 and returns a hard $\{0, 1\}$ score per turn. 'continuous' extracts $P(\textsc{yes})$ from the model's first-token log-probability distribution, producing a calibrated score in $[0, 1]$ without prompting the model for a number. When include\_reason is enabled, the judge additionally returns a natural-language justification for each per-turn score.

Output Modes: Binary and Continuous

The judge supports two output modes. Binary mode (default) samples the model at temperature 0 and returns a hard $\{0, 1\}$ decision per turn. Continuous mode extracts a calibrated adherence score from the model's first-token probability distribution by querying at temperature 1 with logprobs=True, top\_logprobs=10, avoiding the calibration issues of prompting the model to produce a number directly.

Let $\mathcal{Y}$ and $\mathcal{N}$ denote the sets of tokenisation variants for yes and no respectively (e.g.\ "YES", "Yes", "yes"). For each token $t_k$ in the top- $K$ list, let $\ell_k$ denote its log-probability. The continuous adherence score is:

P(\textsc{yes}) = \frac{ \exp\!\bigl(\mathrm{LSE}_{k:\, t_k \in \mathcal{Y}}\; \ell_k\bigr) }{ \exp\!\bigl(\mathrm{LSE}_{k:\, t_k \in \mathcal{Y}}\; \ell_k\bigr) + \exp\!\bigl(\mathrm{LSE}_{k:\, t_k \in \mathcal{N}}\; \ell_k\bigr) } \label{eq:logprob}

where $\mathrm{LSE}$ denotes log-sum-exp, aggregating probability mass across tokenisation variants without numerical underflow. When neither yes nor no appears in the top- $K$ list, the score defaults to $0.5$ .
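The computation of $P(\textsc{yes})$ can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the released implementation: the function names are hypothetical, tokens are matched after stripping surrounding whitespace, and the handling of the case where only one of the two answers appears in the top-$K$ list (returning 1.0 or 0.0) is our reading of the formula, not something the text specifies. The example values mirror the Gemma-style distribution described in the next paragraph.

```python
import math
from typing import Mapping

YES_VARIANTS = {"yes", "Yes", "YES"}
NO_VARIANTS = {"no", "No", "NO"}


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def p_yes(top_logprobs: Mapping[str, float]) -> float:
    """Continuous adherence score from first-token top-K log-probabilities."""
    yes = [lp for tok, lp in top_logprobs.items() if tok.strip() in YES_VARIANTS]
    no = [lp for tok, lp in top_logprobs.items() if tok.strip() in NO_VARIANTS]
    if not yes and not no:
        return 0.5  # documented default: neither answer is in the top-K list
    if not no:
        return 1.0  # assumption: all observed answer mass is on yes
    if not yes:
        return 0.0  # assumption: all observed answer mass is on no
    diff = logsumexp(yes) - logsumexp(no)
    # exp(a) / (exp(a) + exp(b)) equals the logistic function of a - b;
    # the two branches avoid overflow for large |diff|.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    return math.exp(diff) / (1.0 + math.exp(diff))


# Most mass on an unrelated token, yet yes still dominates no.
first_token = {"Okay": math.log(0.99), "Yes": math.log(0.005), "No": math.log(0.00001)}
score = p_yes(first_token)
```

Only the relative mass of yes and no matters, so a token such as "Okay" absorbing most of the probability does not affect the score.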

A non-obvious property of instruction-tuned models. Some instruction-tuned models do not assign the highest probability to yes or no as their first token, even when explicitly instructed to do so. Gemma 3-12B assigns approximately 99\% of its first-token probability to the token "Okay", with yes and no appearing as low-probability alternatives ($\approx 0.5\%$ and $\approx 0.001\%$ respectively), a conversational prior from instruction tuning. Despite this, the relative probability of yes versus no at the first token position still encodes the model's judgment, and AUC\,$= 0.999$ (see LLM Judge: Continuous Mode) confirms that this signal is a reliable discriminator.

Infrastructure requirement. Continuous mode requires a deployment that exposes per-token log-probabilities. This is available in open-weight deployments (HuggingFace TGI, vLLM) but not in all commercial APIs. When logprobs are unavailable, binary mode remains the only option.

Deterministic Path

When a ground\_truth\_assistant response is available for each turn, six scoring functions are proposed. BERTScore, cosine similarity, $k$-nearest neighbours, and NLI compare individual turn pairs and are independent of corpus size. Mahalanobis Distance and Kullback--Leibler Divergence are distribution-based: they build a model of the role from the full corpus of ground-truth responses, and their signal quality varies with the number of reference examples available. A PCA-based variant of Mahalanobis Distance is additionally evaluated as a candidate solution to the curse-of-dimensionality problem (see Mahalanobis Distance with PCA Pre-Reduction). The per-turn score is continuous in $[0, 1]$ and is averaged across turns to produce the session score.

A note on scope. Deterministic methods compare candidate responses against a ground-truth reference and measure semantic proximity, not role adherence directly. A response can achieve high semantic similarity to the ground truth while violating the role's tone, register, or constructive requirements; such violations are semantically indistinguishable from adherent responses but contextually non-compliant. This limitation is characterised empirically in the Results section and discussed in the Discussion.

BERTScore

BERTScore ^[3] computes semantic similarity between two texts by extracting contextual token embeddings from a pre-trained language model and computing precision, recall, and F1 over token-level cosine similarities. Unlike surface-level metrics, it captures paraphrastic equivalence. BERTScore F1 is computed as:

\text{BERTScore}_{F_1} = 2 \cdot \frac{P_{\text{BERT}} \cdot R_{\text{BERT}}} {P_{\text{BERT}} + R_{\text{BERT}}}

Role-adherent responses are not expected to reproduce the exact phrasing of a reference. BERTScore tolerates natural linguistic variation while remaining independent of a secondary judge model.
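To make the F1 formula concrete, the sketch below computes greedy-matching precision, recall, and F1 over token embeddings in plain Python. It is a simplified illustration: it omits the IDF weighting and baseline rescaling offered by the reference BERTScore implementation, the function names are hypothetical, and the two-dimensional vectors are toy values standing in for contextual embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def bertscore_f1(candidate, reference):
    """Greedy-matching F1 over token embeddings (no IDF weighting or rescaling)."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy 2-d "token embeddings"; a real run would take them from a contextual encoder.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
f1 = bertscore_f1(candidate, reference)  # precision 0.5, recall 1.0
```

In this toy case the extra candidate token lowers precision while recall stays at 1, which is how BERTScore penalises added content that the reference does not support.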

Semantic Cosine Similarity

Cosine similarity between sentence-level embeddings ^[4] is a lighter-weight alternative. A sentence encoder maps each response to a dense vector; adherence is scored as the cosine of the angle between the candidate and reference vectors:

\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}

Cosine similarity operates at sentence granularity rather than token granularity, making it faster. BERTScore and cosine similarity measure semantic proximity at different levels of granularity; the Discussion (Cosine vs.\ BERTScore) characterises their practical differences in group separation.

$k$ -Nearest Neighbours

$k$ -Nearest Neighbours ( $k$ -NN) generalises cosine similarity to multi-reference settings ^[8]. Given a reference set of $k$ ground-truth embeddings for a turn, the adherence score is the cosine proximity to the nearest neighbour:

\text{kNN}(\mathbf{x}) = 1 - \min_{j \in \{1,\ldots,k\}} d_{\cos}(\mathbf{x},\, \mathbf{g}_j) \label{eq:knn}

where $d_{\cos}$ denotes cosine distance and $\mathbf{g}_1, \ldots, \mathbf{g}_k$ are the reference embeddings for the turn. When $k = 1$, that is, when a single reference per turn is available, the score reduces to cosine similarity. The distinction materialises when multiple valid reference responses exist per turn: taking the minimum distance over the $k$ references scores the candidate against the most similar acceptable response, which is more robust than a single-reference cosine comparison. We propose $k$-NN as the natural extension of cosine similarity for production deployments where multiple ground-truth responses per turn can be maintained.
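The relationship between cosine similarity and the $k$-NN score can be checked with a short Python sketch. The function names and the toy two-dimensional embeddings are illustrative assumptions; a real deployment would use sentence-encoder vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def knn_score(x, references):
    """1 minus the smallest cosine distance from x to any reference embedding."""
    return 1.0 - min(1.0 - cosine(x, g) for g in references)


x = [1.0, 0.0]
# One reference: the score coincides with plain cosine similarity.
single = knn_score(x, [[1.0, 1.0]])
# Several acceptable references: the closest one determines the score.
multi = knn_score(x, [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
```

With several references the candidate is credited for matching any one acceptable phrasing, so adding a valid reference can only raise the score.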

Natural Language Inference

NLI models ^[5] classify the relationship between two texts as entailment, neutral, or contradiction. Applied to role adherence, a candidate response that contradicts what a ground-truth role-adherent response asserts is a strong signal of non-adherence. NLI is proposed as a complementary signal for detecting explicit violations rather than measuring overall similarity. A contradiction label serves as a hard penalty regardless of similarity scores.

Mahalanobis Distance

Cosine similarity computes the angle between embedding vectors but treats all dimensions equally, regardless of how much the role actually varies along each dimension. A role that discusses diverse topics (high variance) but maintains strict tone constraints (low variance) will be mis-scored: a shift along a topic dimension, where variation is expected, is penalised as heavily as an equal shift along a tone dimension, where it is not.

Mahalanobis Distance ^[6] corrects this by incorporating the empirical covariance structure of the role's embedding space. Given the set of ground-truth embeddings $\{\mathbf{g}_1, \ldots, \mathbf{g}_n\}$ , let $\boldsymbol{\mu}$ denote their mean and $\boldsymbol{\Sigma}$ their covariance matrix. The Mahalanobis Distance of a candidate response embedding $\mathbf{x}$ from the role centroid is:

D_M(\mathbf{x},\, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})} \label{eq:mahal}

In practice $\boldsymbol{\Sigma}$ is approximated as a diagonal matrix (to avoid singularity in high-dimensional embedding spaces) and regularised as $\boldsymbol{\Sigma} + \lambda \mathbf{I}$ . The distance is converted to an adherence score via $\text{score} = \exp(-\alpha \cdot D_M)$ , where $\alpha$ is a scaling parameter set to 1 by default.
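A minimal sketch of the diagonal, regularised variant follows. It assumes the population variance (division by $n$) for each dimension, which the text does not specify, and uses hypothetical function names and toy two-dimensional embeddings.

```python
import math


def fit_diagonal(references, lam=0.01):
    """Per-dimension mean and regularised variance of the reference embeddings."""
    n, dim = len(references), len(references[0])
    mu = [sum(g[j] for g in references) / n for j in range(dim)]
    # Diagonal of Sigma + lambda * I (population variance is an assumption).
    var = [sum((g[j] - mu[j]) ** 2 for g in references) / n + lam for j in range(dim)]
    return mu, var


def mahalanobis_score(x, mu, var, alpha=1.0):
    """Adherence score exp(-alpha * D_M) under a diagonal covariance."""
    d_squared = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mu, var))
    return math.exp(-alpha * math.sqrt(d_squared))


references = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu, var = fit_diagonal(references)
at_centroid = mahalanobis_score(mu, mu, var)
nearby = mahalanobis_score([2.0, 1.0], mu, var)
far = mahalanobis_score([5.0, 5.0], mu, var)
```

Note that the exponential mapping compresses scores quickly: a candidate roughly one standard deviation from the centroid already scores below 0.4, which is consistent with the low absolute scores this method produces.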

Sample sensitivity. Because $\boldsymbol{\Sigma}$ is estimated from the ground-truth corpus, signal quality varies with the number of available reference examples. The Results section characterises this empirically across the three evaluation conditions.

Mahalanobis Distance with PCA Pre-Reduction

Mahalanobis Distance in high-dimensional embedding spaces is susceptible to the curse of dimensionality ^[10]: as the number of dimensions grows, pairwise distances concentrate around a fixed value regardless of true class proximity, eroding discriminative power. Principal Component Analysis ^[9] offers a candidate solution by projecting embeddings to a lower-dimensional subspace before applying Mahalanobis Distance.

We evaluate a pre-reduction from 384 to 32 dimensions (top-32 principal components of the ground-truth embedding matrix) followed by Mahalanobis Distance in the compressed space. This variant is evaluated empirically but not proposed as a recommended scoring function; the Results section shows it produces inverted adherence scores, and it is reported as a negative finding.

Kullback--Leibler Divergence

For roles with explicit stylistic or vocabulary constraints, adherence can be framed as a distributional comparison. Let $P$ denote the vocabulary distribution of the role estimated from ground-truth responses, and let $Q$ denote the vocabulary distribution of the candidate turn. The Kullback--Leibler Divergence ^[7] from $Q$ to the reference $P$ is:

D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{w \in \mathcal{V}} P(w)\, \log \frac{P(w)}{Q(w)} \label{eq:kl}

We use $D_{\mathrm{KL}}(P \| Q)$ --- the divergence of the candidate from the role reference --- because $P$ is the distributional target. To handle zero probabilities, both distributions are smoothed with Laplace smoothing. The vocabulary $\mathcal{V}$ is restricted to the top- $k$ most frequent tokens in the ground-truth corpus (default $k = 500$ ) to reduce sparsity. The adherence score is $\text{score} = \exp(-\beta \cdot D_{\mathrm{KL}})$ , with $\beta = 1$ by default.
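The scoring steps above can be sketched as follows. Whitespace tokenisation, the add-one smoothing constant, and the function names are assumptions made for illustration; the text only specifies Laplace smoothing, a top-$k$ vocabulary, and the exponential mapping.

```python
import math
from collections import Counter


def smoothed(tokens, vocab, eps=1.0):
    """Laplace-smoothed distribution of `tokens` over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}


def kl_score(reference_tokens, candidate_tokens, top_k=500, beta=1.0):
    """exp(-beta * D_KL(P || Q)) with P from the references and Q from the candidate."""
    vocab = {w for w, _ in Counter(reference_tokens).most_common(top_k)}
    p = smoothed(reference_tokens, vocab)
    q = smoothed(candidate_tokens, vocab)
    d_kl = sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
    return math.exp(-beta * d_kl)


reference = "your card is frozen and your account is safe".split()
drifted = "buy these stocks now".split()

same = kl_score(reference, reference)    # identical distributions
shifted = kl_score(reference, drifted)   # candidate shares no vocabulary
```

With so few reference tokens the smoothed distributions stay close to uniform, so even a fully off-vocabulary candidate scores only slightly below 1. This toy behaviour is in line with the small-corpus instability the hypotheses anticipate.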

KL Divergence is proposed as a signal for vocabulary drift: cases where a response uses phrasing or terminology the role explicitly avoids. The Results section characterises its empirical performance.

Sample sensitivity. Estimating $P$ reliably requires a representative vocabulary distribution from the ground-truth corpus. The Results section characterises how signal quality varies with corpus size.

Lexical Metrics: Considered and Discarded

ROUGE and BLEU measure n-gram overlap between candidate and reference texts. Both were considered and discarded as primary metrics for this use case.

Role-adherent responses are not expected to reproduce the exact phrasing of a reference --- they are expected to exhibit the same role-consistent behaviour with natural linguistic variation. N-gram metrics penalise paraphrases and reward surface-level copying, neither of which is desirable here. ROUGE and BLEU are noted as valid lightweight baselines for benchmarking purposes only.

Hypotheses

We formulate two hypotheses to guide the empirical evaluation in the Results and Discussion sections.

[H1] At small corpus sizes, KL Divergence does not produce a stable discriminative signal because the ground-truth vocabulary distribution is too sparse to be representative. We additionally predict, from the concentration of measure in high-dimensional sentence embedding spaces ^[10], that Mahalanobis Distance will degrade as the corpus grows, a failure mode distinct from the sparse-distribution instability of KL Divergence.

[H2] With a sufficient corpus, deterministic methods achieve agreement with the gold standard comparable to the LLM judge, while offering perfect reproducibility and zero per-inference cost.

H1 motivates evaluating distribution-based methods across corpus sizes rather than at a single fixed point. H2 motivated exploring deterministic methods as a cost-effective alternative to the LLM judge; the empirical evaluation refutes it.

Experimental Setup

Benchmark Dataset

We construct a controlled synthetic benchmark for a single role: a customer-facing support agent for a fintech application. The role is defined with four explicit constraints: (1) scope limited to account inquiries, transaction history, and card management; (2) a professional, empathetic, and concise tone; (3) no financial or investment advice under any circumstances; and (4) a constructive requirement to always offer a next step or resolution path rather than an open-ended refusal.

The benchmark consists of 150 assistant turns organised in 30 simulated conversations of 5 turns each. Labels are assigned by design: each turn is generated to exhibit one of four target behaviours --- adherent, scope violation, tone violation, or constructive failure --- and the label reflects the construction intent rather than post-hoc annotation. The distribution is approximately 40\% adherent and 20\% per violation type.

Each record contains four fields:

role: the full role definition string.
assistant: the generated response, which may violate the role.
ground\_truth\_assistant: a role-adherent reference response.
label: the designed ground-truth class.

Evaluation Conditions

Three conditions evaluate the same benchmark at increasing $n$ :

Condition A ( $n = 15$ ): the first 15 turns.
Condition B ( $n = 50$ ): the first 50 turns.
Condition C ( $n = 150$ ): all 150 turns.

Conditions are cumulative: B includes all turns in A, and C includes all turns in B. This isolates the effect of dataset size on each scoring method while holding the role and violation distribution constant.

Scoring Methods

Table: Scoring methods evaluated, with their data requirements.

Method	Ground truth required	Corpus-size dependent
LLM judge (continuous)	No	No
BERTScore F1	Yes	No
Cosine similarity	Yes	No
$k$-NN	Yes	No
NLI	Yes	No
Mahalanobis Distance	Yes	Yes
PCA+Mahalanobis (eval.\ only)	Yes	Yes
KL Divergence	Yes	Yes

Evaluation Protocol

Continuous methods (BERTScore, cosine similarity, $k$ -NN, Mahalanobis Distance, PCA+Mahalanobis, and KL Divergence) require a threshold to produce binary predictions; any threshold is arbitrary in production. We report two threshold-agnostic statistics:

AUC-ROC ^[12] is the probability that the method assigns a higher score to a randomly selected adherent turn than to a randomly selected violation. AUC\, $= 1.0$ indicates perfect ranking; AUC\, $= 0.5$ is chance.

Separation is the difference between the mean score of adherent turns and the mean score of violation turns. It quantifies the absolute gap between group centroids and reflects how reliably a threshold can be set in practice. High AUC with low separation indicates perfect ranking but fragile threshold selection.
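Both statistics are simple to compute directly from per-class scores. The sketch below uses the pairwise definition of AUC given above, counting ties as half a win, which is a common convention that the text does not state. The scores are made-up values chosen to show the high-AUC, low-separation regime; the function names are hypothetical.

```python
from statistics import fmean


def auc_roc(adherent, violation):
    """Probability that a random adherent turn outscores a random violation."""
    wins = 0.0
    for a in adherent:
        for v in violation:
            if a > v:
                wins += 1.0
            elif a == v:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(adherent) * len(violation))


def separation(adherent, violation):
    """Gap between the class means: mean(adherent) - mean(violation)."""
    return fmean(adherent) - fmean(violation)


# Illustrative scores only: perfectly ranked, yet the groups sit close together.
adherent = [0.94, 0.95, 0.93]
violation = [0.87, 0.86, 0.88]

auc = auc_roc(adherent, violation)
sep = separation(adherent, violation)
```

Here every adherent score exceeds every violation score, so the ranking is perfect, but the class means differ by only about 0.07 and a threshold has very little room for error.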

Binary classifiers (NLI and LLM Judge) produce hard predictions directly. We report three statistics. F1 per class is the harmonic mean of precision and recall for each class:

F_1 = 2 \cdot \frac{P \cdot R}{P + R}

where $P$ and $R$ are precision and recall for the target class. Macro F1 is the unweighted average of per-class F1 scores, treating each class equally regardless of frequency --- a relevant choice here because the adherent and violation classes are not balanced. Cohen's $\kappa$ ^[11] is a chance-corrected agreement coefficient:

\kappa = \frac{p_o - p_e}{1 - p_e}

where $p_o$ is the observed fraction of agreements between the classifier and the gold labels, and $p_e$ is the fraction expected by chance under the marginal class distributions. $\kappa = 1.0$ indicates perfect agreement; $\kappa = 0$ indicates agreement at chance level; negative values indicate worse-than-chance performance.
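As a worked example of the three statistics, the sketch below derives per-class F1, macro F1, and Cohen's $\kappa$ from a binary confusion matrix. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the benchmark.

```python
def f1(tp, fp, fn):
    """F1 for one class from its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binary_report(tp, fn, fp, tn):
    """Per-class F1, macro F1 and Cohen's kappa, with 'adherent' as the positive class."""
    n = tp + fn + fp + tn
    f1_adherent = f1(tp, fp, fn)
    f1_violation = f1(tn, fn, fp)  # roles of the error types swap for the other class
    macro = (f1_adherent + f1_violation) / 2
    p_o = (tp + tn) / n
    # Chance agreement from the marginals of the gold labels and the predictions.
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    return f1_adherent, f1_violation, macro, kappa


# Hypothetical counts: 50 adherent and 50 violation turns, 15 errors in total.
f1_adh, f1_viol, macro_f1, kappa = binary_report(tp=40, fn=10, fp=5, tn=45)
```

With these counts the observed agreement is 0.85 and the chance agreement is 0.50, so $\kappa = 0.35 / 0.50 = 0.70$, noticeably lower than the raw accuracy.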

For the LLM Judge, macro F1 and per-class F1 are additionally accompanied by 95\% bootstrap confidence intervals (1\,000 resamples, seed 42) to quantify estimation uncertainty at each sample size.
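The interval computation can be sketched as a percentile bootstrap. The percentile method is an assumption, since the text states only the number of resamples and the seed; the function names are hypothetical and the labels below are made up (1 for adherent, 0 for violation).

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in the sample."""
    scores = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, seed=42, level=0.95):
    """Percentile bootstrap interval for `metric` over resampled turns."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample turns with replacement
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int(n_resamples * (1 - level) / 2)]
    upper = stats[min(n_resamples - 1, int(n_resamples * (1 + level) / 2))]
    return lower, upper


gold = [1] * 6 + [0] * 9
perfect = list(gold)
one_error = [0] + gold[1:]  # a single adherent turn judged as a violation

ci_perfect = bootstrap_ci(gold, perfect, macro_f1)
ci_one_error = bootstrap_ci(gold, one_error, macro_f1)
```

A perfect classifier yields a degenerate interval, because every resample also scores 1.0. This is one reason a perfect result on a small sample says little about uncertainty.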

Mahalanobis Distance uses diagonal covariance estimation with $\ell_2$ regularisation ( $\lambda = 0.01$ ). KL Divergence uses Laplace-smoothed distributions restricted to the top-500 most frequent tokens in the ground-truth corpus. The LLM judge is run once per turn per condition at temperature 0.

Results

Continuous Scoring Methods

The table below reports mean score by class (adherent and violation), separation, and AUC-ROC for each continuous method across the three conditions. The figure that follows it provides a visual comparison of per-class mean scores across methods and conditions.

Table: Continuous scoring methods: mean score by class, separation, and AUC-ROC across evaluation conditions. Sep.\,=\,mean(adh.)\,$-$\,mean(viol.); higher is better. AUC: 1.0\,=\,perfect ranking, 0.5\,=\,chance. Bold indicates inverted signal.

Method	Condition	Mean (adh.)	Mean (viol.)	Sep.	AUC
BERTScore	$n=50$	0.940	0.869	$+0.071$	1.000
	$n=150$	0.941	0.868	$+0.073$	0.995
Cosine	$n=15$	0.905	0.466	$+0.439$	1.000
	$n=50$	0.902	0.520	$+0.382$	0.998
	$n=150$	0.877	0.497	$+0.380$	0.994
$k$-NN	$n=15$	0.905	0.496	$+0.409$	1.000
	$n=50$	0.902	0.544	$+0.358$	0.998
	$n=150$	0.877	0.546	$+0.331$	0.993
Mahalanobis	$n=15$	0.404	0.348	$+0.056$	0.963
	$n=50$	0.391	0.348	$+0.043$	0.872
	$n=150$	0.391	0.350	$+0.041$	0.856
PCA+Maha	$n=15$	0.283	0.419	$\mathbf{-0.136}$	0.000
	$n=50$	0.318	0.393	$\mathbf{-0.075}$	0.110
	$n=150$	0.343	0.383	$\mathbf{-0.040}$	0.219
KL Div.	$n=15$	0.860	0.858	$+0.002$	0.537
	$n=50$	0.693	0.671	$+0.022$	0.870
	$n=150$	0.517	0.494	$+0.023$	0.877

Figure: Mean similarity score per class (adherent vs.\ violation) for BERTScore, Cosine similarity, Mahalanobis Distance, and KL Divergence across the three evaluation conditions.

The figure below shows the separation curve for Mahalanobis Distance, KL Divergence, and PCA+Mahalanobis over intermediate sample sizes from $n = 15$ to $n = 150$. Mahalanobis separation declines monotonically from $+0.056$ to $+0.041$. KL separation rises from near zero ($+0.002$) to $+0.023$. PCA+Mahalanobis separation remains negative at every $n$, converging from below towards zero without crossing it.

Figure: Score separation (adherent mean $-$ violation mean) as a function of sample size for Mahalanobis Distance, KL Divergence, and PCA+Mahalanobis. Positive values indicate the method ranks adherent responses higher on average; negative values indicate signal inversion.

Binary Classifiers

Table: Binary classifiers: per-class F1, macro F1, and Cohen's $\kappa$.

Method	Condition	F1 (adh.)	F1 (viol.)	Macro F1	$\kappa$
NLI	$n=50$	0.625	0.333	0.479	0.167
	$n=150$	0.612	0.269	0.441	0.128
LLM Judge	$n=15$	1.000	1.000	1.000	1.000
	$n=50$	0.974	0.984	0.979	0.958
	$n=150$	0.974	0.984	0.979	0.958

LLM Judge: Bootstrap Confidence Intervals

The $n = 15$ result ($F_1 = 1.000$, $\kappa = 1.000$) reflects perfect classification on a small and easy subsample and should not be taken as the representative estimate. The table below reports macro F1, per-class F1, and Cohen's $\kappa$ with 95\% bootstrap confidence intervals (1\,000 resamples, seed 42). The canonical result is $n = 150$: $F_1^{\text{macro}} = 0.979$ (95\% CI\,[$0.952$,\,$1.000$]) and $\kappa = 0.958$.

Table: LLM Judge (llama-3.3-70b-versatile): macro F1, per-class F1, and Cohen's $\kappa$ with 95\% bootstrap confidence intervals (1\,000 resamples, seed 42).

Condition	Macro F1	95\% CI	F1 (adh.)	95\% CI	F1 (viol.)	95\% CI
$n=50$	0.979	[0.934,\,1.000]	0.975	[0.914,\,1.000]	0.984	[0.947,\,1.000]
$n=150$	0.979	[0.952,\,1.000]	0.975	[0.942,\,1.000]	0.984	[0.963,\,1.000]

Figure: LLM Judge macro F1 with 95\% bootstrap confidence intervals across evaluation conditions.

LLM Judge: Continuous Mode

Three models were evaluated using the judge's continuous mode on the full benchmark ($n = 150$), each chosen for a specific purpose: Llama 3.3-70B for direct comparison with the binary judge of the same family; Llama 3.1-8B to test whether the approach scales to a smaller, cheaper model ^[13]; and Gemma 3-12B (March 2025) as a more recent model better optimised for instruction following, to test generalisation beyond the Llama family. A fourth model, Qwen3-14B, was evaluated but is excluded from the table below: all 150 scores default to $0.5$ because yes and no do not appear in the top-10 log-probability list for any record, even with chain-of-thought generation disabled. This is a structural incompatibility between Qwen3's instruction-tuning format and the logprob approach.

Table: LLM Judge (continuous mode): AUC, Average Precision, and mean $P(\textsc{yes})$ by class ($n = 150$). Sep.\,=\,mean(adh.)\,$-$\,mean(viol.). Italic: inverted signal.

Model	AUC	Avg. Precision	Mean $P(\textsc{yes})$ (adh.)	Mean $P(\textsc{yes})$ (viol.)	Sep.
Llama 3.3-70B	0.883	0.767	0.208	0.032	$+0.176$
Llama 3.1-8B	0.303	0.295	0.536	0.688	$-0.153$

The table below shows the per-violation breakdown for Gemma 3-12B, confirming uniform discrimination across all three violation types.

Table: LLM Judge in continuous mode (Gemma 3-12B): mean $P(\textsc{yes})$ by class ($n = 150$).

Class	Mean $P(\textsc{yes})$
scope\_violation	0.000
tone\_violation	0.000
constructive\_failure	0.000

The table below places all continuous methods and the logprob judge side by side on AUC at $n = 150$, enabling a unified comparison across scoring families for the first time.

Table: Unified AUC comparison across all methods ($n = 150$). GT\,=\,ground-truth reference responses required.

Method	GT	AUC
Cosine similarity	Yes	0.994
$k$-NN	Yes	0.993
KL Divergence	Yes	0.877
Mahalanobis Dist.\	Yes	0.856
PCA+Mahalanobis	Yes	0.219
LLM judge cont.\ (Llama 3.3-70B)	No	0.883

Discussion

The Semantics Trap: Why AUC\,$\approx 1$ Is Not a Victory

BERTScore, Cosine, and $k$ -NN achieve AUC\, $\approx 1.0$ across all conditions. Read naively, this looks like a definitive result. It is not. The benchmark was constructed so that adherent responses are paraphrases of the ground truth (semantically close by design) and violations are responses that mishandle or ignore the request (semantically distinct by design). Any method that measures semantic proximity to a per-turn ground-truth reference separates this benchmark perfectly because separability was built into the construction, not discovered by the method.

The per-class mean scores make the mechanism explicit: Cosine similarity sits at $0.877$ for adherent responses and $0.497$ for violations at $n = 150$ --- the groups are nearly half a unit apart because adherent examples were generated as paraphrases of the ground truth and violations were not. In production, the hardest violations --- tone deviations and constructive failures --- are precisely the ones where semantic similarity to the ground truth is highest: a response may say the right things in the wrong register, or omit the required resolution path while remaining topically on-role. Cosine similarity would assign these a high score and miss the violation entirely.

The correct interpretation of deterministic methods with ground truth is that they are semantic drift detectors, not role adherence evaluators. They are useful for monitoring whether response content deviates from an expected reference, but they cannot substitute for a method that reads and understands the role definition.

Cosine vs. BERTScore: Practical Separation Matters

Of the methods that reach AUC $\approx 1.0$, Cosine similarity achieves a separation of $+0.380$ at $n = 150$ while BERTScore achieves only $+0.073$. Although BERTScore operates at finer granularity (token-level contextual embeddings versus sentence-level pooling), this additional resolution does not translate to better group separation in this setup: the two groups are sufficiently similar at the token level that BERTScore barely distinguishes them in absolute terms, even though its ranking remains nearly perfect.

For a deployment requiring threshold-based binary classification, BERTScore's small separation makes threshold selection fragile: a threshold at 0.90 and one at 0.91 produce meaningfully different decisions. Cosine similarity, with its large separation, is more robust to threshold choice.
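The gap between ranking quality and absolute separation can be illustrated with a minimal sketch. The toy scores below are hypothetical (they are not benchmark data), and the helper names `auc`, `separation`, and `threshold_margin` are illustrative. Both score sets are perfectly ranked, yet the narrow set leaves far less room for placing a threshold.

```python
from typing import Sequence


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a random adherent score beats a random violation score.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def separation(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Difference between the mean adherent score and the mean violation score."""
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def threshold_margin(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Width of the interval in which a threshold classifies every example correctly."""
    return min(pos) - max(neg)


# Hypothetical toy scores, chosen only to illustrate the effect.
wide_adherent, wide_violation = [0.90, 0.85, 0.88], [0.50, 0.45, 0.55]
narrow_adherent, narrow_violation = [0.93, 0.92, 0.94], [0.86, 0.85, 0.87]

for name, pos, neg in [
    ("wide", wide_adherent, wide_violation),
    ("narrow", narrow_adherent, narrow_violation),
]:
    print(name, auc(pos, neg), round(separation(pos, neg), 3),
          round(threshold_margin(pos, neg), 3))
```

In this toy setting both sets reach AUC 1.0, but the interval of safe thresholds is several times narrower for the second set, which is the fragility described above.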

$k$-NN is mathematically equivalent to Cosine in the single-reference-per-turn setup used here ($k = 1$). Its advantage materialises when multiple valid ground-truth responses exist per turn: the nearest-neighbour distance weights the score towards the most similar acceptable response, gaining robustness that a single-reference cosine comparison cannot provide.
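A minimal sketch of the multi-reference case, assuming the $k = 1$ score is the cosine similarity to the closest of a turn's reference embeddings. The function names are illustrative, and the implementation evaluated here may differ in detail.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest_reference_score(response: np.ndarray, references: np.ndarray) -> float:
    """Similarity to the closest reference embedding (k = 1).

    `references` is either a single vector or a 2-D array with one row per
    acceptable ground-truth response for the turn.
    """
    refs = np.atleast_2d(references)
    return max(cosine(response, ref) for ref in refs)


response = np.array([1.0, 0.0])
ref_a = np.array([1.0, 1.0])   # one acceptable phrasing
ref_b = np.array([1.0, 0.1])   # a second acceptable phrasing, closer to the response

single = nearest_reference_score(response, ref_a)
multi = nearest_reference_score(response, np.stack([ref_a, ref_b]))
print(single, multi)
```

With one reference the score coincides with plain cosine similarity; with several, the response is credited for matching whichever acceptable phrasing it is closest to.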

Mahalanobis Distance: Curse of Dimensionality Confirmed

Mahalanobis Distance shows a monotonic degradation pattern: AUC is highest at $n = 15$ ($0.963$) and declines to $0.856$ at $n = 150$. This is the opposite of what standard statistical intuition would predict: more data should produce better distributional estimates, not worse ones.

The explanation is the concentration of measure in high-dimensional spaces ^[10]. As $n$ grows, the estimated covariance of the 384-dimensional ground-truth embeddings converges towards the true covariance, but in 384 dimensions all pairwise distances concentrate around a fixed value regardless of class membership. The diagonal regularisation ($\lambda = 0.01$) partially mitigates this at small $n$ by preventing covariance collapse; at larger $n$ the regularisation becomes relatively negligible and the concentration effect dominates. This confirms the prediction embedded in H1: Mahalanobis Distance is not viable for production deployments at standard embedding dimensions.
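A minimal sketch of a regularised Mahalanobis distance, assuming the score is the distance from a response embedding to the mean of the reference embeddings and that the diagonal regularisation is $\lambda I$ added to the sample covariance. The function name and these modelling choices are assumptions made for illustration.

```python
import numpy as np


def mahalanobis_to_reference(
    x: np.ndarray, reference: np.ndarray, lam: float = 0.01
) -> float:
    """Mahalanobis distance from `x` to the reference distribution.

    `reference` has shape (n, d) with n >= 2 rows (one embedding per turn).
    Adding lam * I keeps the covariance invertible even when n < d, which is
    the case for n = 15 turns and d = 384 dimensions.
    """
    mu = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False) + lam * np.eye(reference.shape[1])
    diff = x - mu
    # Solve the linear system instead of forming an explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


rng = np.random.default_rng(0)
reference = rng.normal(size=(5, 10))   # hypothetical data: fewer rows than dimensions
centre = reference.mean(axis=0)
print(mahalanobis_to_reference(centre, reference))
print(mahalanobis_to_reference(centre + 1.0, reference))
```

Without the $\lambda I$ term the covariance in this example would be singular, since five observations cannot span ten dimensions.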

PCA+Mahalanobis: A Structural Failure

The PCA+Mahalanobis variant was motivated as a dimensionality reduction solution to the concentration problem. The result is unambiguous: the method inverts the adherence signal at every condition (separation $= -0.136$ at $n = 15$, never crossing zero). AUC $= 0.000$ at $n = 15$ means the method assigns higher scores to violations than to adherent responses with perfect consistency.

The mechanism is structural. PCA decomposes the variance of the ground-truth embedding matrix: the 32 retained principal components capture the directions of maximum variance within the reference set. This transformation has no relationship to the adherence-violation geometry: it captures what varies most among adherent examples, not what separates adherent responses from violations. In the compressed space, violation embeddings happen to project near the centroid of the transformed reference distribution while adherent examples, spread across the principal components, project further away.
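The label-free nature of the projection step can be seen in a short sketch. It assumes PCA is computed via an SVD of the centred reference matrix; the function names are illustrative. Nothing in the fit depends on which responses are violations, so the retained directions are under no obligation to separate the two classes.

```python
import numpy as np


def fit_pca(reference: np.ndarray, n_components: int):
    """Fit PCA on the reference embeddings only (no adherence labels involved).

    Returns the reference mean and the top principal directions, one per row.
    """
    mu = reference.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(reference - mu, full_matrices=False)
    return mu, vt[:n_components]


def project(x: np.ndarray, mu: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Map embeddings into the reduced space spanned by the retained components."""
    return (x - mu) @ components.T


rng = np.random.default_rng(0)
reference = rng.normal(size=(50, 20))      # hypothetical reference embeddings
mu, components = fit_pca(reference, n_components=5)
reduced = project(reference, mu, components)
print(reduced.shape, reduced.var(axis=0))
```

A distance computed in the reduced space inherits this limitation: it measures position along directions of high reference variance, not along directions that distinguish adherent responses from violations.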

More data partially corrects the PCA estimate (components become less dominated by individual examples), which is why separation improves from $-0.136$ to $-0.040$ as $n$ grows. But the structural distortion never reverses. PCA+Mahalanobis is not recommended for any deployment scenario. H2 is refuted: no deterministic method approaches the LLM Judge at large $n$ , and the methods that appear to do so via AUC are benefiting from the benchmark artefact described above.

KL Divergence: The Wrong Level of Abstraction

KL Divergence operates on vocabulary distributions estimated from the ground-truth corpus. Its near-random signal at $n = 15$ (AUC $= 0.537$) is expected: with 15 turns the top-500 vocabulary distribution is too sparse to be representative. As $n$ grows the distribution stabilises and AUC converges to $0.877$ at $n = 150$.

The ceiling is not a sample-size problem; it is a conceptual one. The role violations in the benchmark (scope, tone, constructive failures) do not produce distinctive vocabulary patterns. An agent that answers a question in the wrong tone uses largely the same vocabulary as one that answers correctly. KL Divergence is blind to this distinction because it operates at the lexical level, not the semantic or pragmatic level. The convergence to AUC $= 0.877$ suggests KL may serve as a rough vocabulary drift monitor but not as a primary adherence detector.
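The lexical-level blindness can be made concrete with a minimal sketch. It assumes whitespace tokenisation, additive smoothing with a small constant, and the direction KL(response || reference); these choices, the helper names, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def build_vocab(corpus: Iterable[str], top_k: int = 500) -> List[str]:
    """The top_k most frequent tokens in the reference corpus."""
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    return [tok for tok, _ in counts.most_common(top_k)]


def distribution(texts: Iterable[str], vocab: List[str], eps: float = 1e-6) -> List[float]:
    """Smoothed unigram distribution of `texts` over a fixed vocabulary."""
    counts = Counter(tok for doc in texts for tok in tokenize(doc))
    raw = [counts[tok] + eps for tok in vocab]
    total = sum(raw)
    return [value / total for value in raw]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical reference corpus and responses.
references = [
    "please reset your password from the settings page",
    "your refund will arrive in five days",
]
vocab = build_vocab(references)
q = distribution(references, vocab)

p_ordered = distribution(["your password reset is done"], vocab)
p_scrambled = distribution(["done is your reset password"], vocab)
print(kl_divergence(p_ordered, q), kl_divergence(p_scrambled, q))
```

The two responses receive the same divergence because a unigram distribution discards word order, and with it most of what carries tone or register.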

NLI: Insufficient for Role Adherence

NLI achieves $F_1^{\text{macro}} = 0.441$ and $\kappa = 0.128$ at $n = 150$: better than chance but far below useful production thresholds. The model correctly flags some explicit scope violations (responses that contradict ground-truth assertions) but misses tone violations and constructive failures. These failure types do not produce logical contradictions with the ground-truth response: an agent that fails to offer a resolution path omits something, it does not contradict anything. NLI's contradiction signal is a necessary but insufficient condition for role violation. Performance is also unstable across $n$: $\kappa$ drops from $0.167$ at $n = 50$ to $0.128$ at $n = 150$, suggesting the threshold used for binary classification is not stable.

LLM Judge: The Only Role-Aware Method

The LLM Judge achieves $F_1^{\text{macro}} = 0.979$ at $n \geq 50$; at $n = 150$ the bootstrap CI is $[0.952, 1.000]$, with $\kappa = 0.959$ (CI $[0.904, 1.000]$). It is the only evaluated method that reads the role definition directly and evaluates the response against it. All other methods operate on signals that are at best correlated with adherence; the judge assesses adherence by definition and can detect tone violations and constructive failures that are invisible to semantic similarity methods.
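For readers who want to reproduce this style of reporting, the sketch below computes macro $F_1$, Cohen's $\kappa$ ^[11], and a bootstrap interval from binary labels. It assumes a percentile bootstrap over turns with 1000 resamples; the resampling scheme actually used for the intervals above is not specified here, and the function names are illustrative.

```python
import random
from typing import Callable, Sequence, Tuple


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    t, p = list(y_true), list(y_pred)
    n = len(t)
    p_o = sum(a == b for a, b in zip(t, p)) / n
    p_e = sum((t.count(c) / n) * (p.count(c) / n) for c in set(t) | set(p))
    # Degenerate case: both label sequences use one identical label.
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(a == label and b == label for a, b in zip(y_true, y_pred))
        fp = sum(a != label and b == label for a, b in zip(y_true, y_pred))
        fn = sum(a == label and b != label for a, b in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(
    metric: Callable[[Sequence[int], Sequence[int]], float],
    y_true: Sequence[int],
    y_pred: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval, resampling turns with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical labels: 1 = adherent, 0 = violation.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 9 + [0] + [0] * 9 + [1]
print(macro_f1(y_true, y_pred), cohen_kappa(y_true, y_pred))
print(bootstrap_ci(macro_f1, y_true, y_pred))
```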

The practical costs are non-determinism (mitigated by fixing temperature to 0) and per-call inference cost, both well-documented properties of LLM evaluation ^[1].

Judge model quality is a design variable. The $F_1 = 0.979$ result was obtained with llama-3.3-70b-versatile. A different judge model may produce substantially different results. Reporting LLM Judge performance without specifying the judge model is methodologically equivalent to reporting a deterministic metric without specifying its embedding model: a necessary parameter omitted. The next subsection extends the comparison by placing the LLM judge's continuous mode on the same AUC scale as the deterministic methods.

LLM Judge Continuous Mode: Unified AUC Comparison

The continuous mode experiment was motivated by a comparison gap: the binary judge produces $F_1$ and $\kappa$ , while continuous methods produce AUC. These are not directly comparable. By extracting $P(\textsc{yes})$ from the judge model's token distribution (Section), the same judge family can be evaluated on AUC, placing all methods on a common scale.

On this unified scale (Table), Gemma 3-12B in continuous mode reaches AUC $= 0.999$, matching the top deterministic methods. The meaningful difference lies not in the AUC value itself but in what each method measures. BERTScore, Cosine, and $k$-NN require ground-truth reference responses and measure semantic proximity to them, as established in Section. The LLM judge in continuous mode requires no ground truth and evaluates adherence by reading the role definition directly, including tone violations and constructive failures invisible to semantic similarity methods.

Two additional findings from the logprob experiment carry practical significance. First, model quality and recency can outweigh size: Gemma 3-12B (12B parameters, March 2025) substantially outperforms Llama 3.3-70B (70B parameters) on this task (AUC $0.999$ vs. $0.883$). This suggests that the approach is sensitive to the instruction-following quality of the judge model rather than its size.

Second, the 8B failure is a capacity problem, not a calibration problem. Llama 3.1-8B assigns systematically higher $P(\textsc{yes})$ to violations than to adherent responses (mean separation $-0.153$, AUC $= 0.303$). This cannot be corrected by threshold tuning: the probability assignments are inverted. The result establishes a practical lower bound on the model capability required for reliable continuous scoring.

The main practical constraint is infrastructure: continuous mode requires deployments that expose per-token log-probabilities (Section). When logprobs are unavailable, binary mode remains the only option.
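The continuous score itself is a small computation once first-token log-probabilities are available. The sketch below assumes they arrive as a mapping from token string to log-probability, which is a hypothetical input format (providers return this information in different structures), and it assumes that surface variants such as "Yes" and " yes" are pooled. It applies the normalisation $P(\textsc{yes}) / (P(\textsc{yes}) + P(\textsc{no}))$ given for the continuous output mode.

```python
import math
from typing import Mapping


def p_yes(first_token_logprobs: Mapping[str, float]) -> float:
    """Normalised P(yes) / (P(yes) + P(no)) from first-token log-probabilities.

    Tokens are matched case-insensitively after stripping whitespace, so that
    variants such as "Yes" and " yes" contribute to the same mass (assumption).
    """
    yes_mass = 0.0
    no_mass = 0.0
    for token, logprob in first_token_logprobs.items():
        key = token.strip().lower()
        if key == "yes":
            yes_mass += math.exp(logprob)
        elif key == "no":
            no_mass += math.exp(logprob)
    if yes_mass + no_mass == 0.0:
        # Neither answer token is among the returned candidates.
        raise ValueError("no 'yes' or 'no' token in the first-token distribution")
    return yes_mass / (yes_mass + no_mass)


# Hypothetical first-token distribution returned by a judge model.
example = {"Yes": math.log(0.6), "No": math.log(0.2), "Maybe": math.log(0.2)}
print(p_yes(example))
```

Renormalising over the two answer tokens discards probability mass assigned to other tokens, so the score stays in $[0, 1]$ even when the judge hedges.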

Configuration Parameters

The metric exposes the following parameters.

scoring_method (default: 'llm_judge'): selects the scoring function. 'llm_judge' reads the role definition directly and is the only option capable of detecting tone violations and constructive failures. When ground-truth responses are available, deterministic alternatives can be selected: 'cosine' and 'knn' are recommended for cost-sensitive monitoring, with the explicit caveat that they detect semantic drift from the reference rather than role adherence per se; 'bertscore' and 'nli' are also available. See Section for empirical characterisation of each method.

output_mode (default: 'binary'): controls the LLM judge output format. 'binary' samples the model at temperature 0 and returns a hard $\{0, 1\}$ score per turn. 'continuous' extracts $P(\textsc{yes}) / (P(\textsc{yes}) + P(\textsc{no}))$ from the model's first-token log-probability distribution, producing a calibrated score in $[0, 1]$ (Section). Continuous mode requires a deployment that exposes per-token log-probabilities (e.g. HuggingFace TGI, vLLM); if the provider does not support logprobs, the framework falls back to 'binary' automatically. Applies only when scoring_method='llm_judge'.

evaluation_granularity (default: 'turn'): controls whether the judge evaluates each assistant turn with its prior context ('turn', one API call per turn, aligned with Equation) or the full conversation in a single call ('conversation', lower cost, no per-turn breakdown). Applies only when scoring_method='llm_judge'.

strict_mode (default: False): when True, the session passes only if every turn is adherent (session score $= 1.0$). Intended for high-criticality deployments where partial adherence is not acceptable.

threshold (default: 0.5): the minimum session score required to mark the conversation as adherent. Ignored when strict_mode=True. In binary output mode a threshold of 0.5 requires that at least half of all turns are adherent; the parameter is most relevant when output_mode='continuous'.

model: the LLM used as judge. Follows the Guardian abstraction in the Gaussia framework, allowing any supported provider to be substituted. As shown in Section, judge model quality significantly affects detection performance and should be treated as a first-class design parameter.

include_reason (default: False): when enabled, the judge returns a natural-language justification alongside each per-turn score. Applies only when scoring_method='llm_judge'.

Implementation Considerations

The metric extends the Gaussia base class. The batch() method iterates over each assistant turn in a session, dispatches to the scoring function selected via scoring_method (default: 'llm_judge'), and appends a per-turn score. The session score is the mean of per-turn scores (Equation).
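The aggregation step and its interaction with strict_mode and threshold can be sketched as follows. This is a standalone illustration under the parameter semantics described above, not the Gaussia implementation; the function names `session_score` and `session_passes` are hypothetical.

```python
from typing import Sequence


def session_score(turn_scores: Sequence[float]) -> float:
    """Mean of the per-turn adherence scores for one session."""
    if not turn_scores:
        raise ValueError("a session needs at least one assistant turn")
    return sum(turn_scores) / len(turn_scores)


def session_passes(
    turn_scores: Sequence[float],
    threshold: float = 0.5,
    strict_mode: bool = False,
) -> bool:
    """Decide whether a session is marked adherent.

    In strict mode every turn must score 1.0 and the threshold is ignored;
    otherwise the mean score must reach the threshold.
    """
    if strict_mode:
        # Equivalent to requiring a session score of exactly 1.0.
        return all(score == 1.0 for score in turn_scores)
    return session_score(turn_scores) >= threshold


binary_turns = [1.0, 1.0, 0.0, 1.0]   # hypothetical binary-mode scores
print(session_score(binary_turns))                       # 0.75
print(session_passes(binary_turns))                      # True
print(session_passes(binary_turns, strict_mode=True))    # False
```

With binary scores and the default threshold, a session with exactly half of its turns adherent still passes, which matches the description of the threshold parameter.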

The chatbot_role field is read from the Dataset object and passed as a string to all scoring methods and, when the judge path is active, injected verbatim into the judge prompt.

Statistical mode compatibility (Frequentist and Bayesian) follows the existing Gaussia pattern. No changes to the statistical layer are required.

Future Work

Three directions are identified for future development.

Continuous mode capacity threshold. The continuous mode experiment establishes that Llama 3.1-8B fails (AUC $= 0.303$, inverted signal) while Gemma 3-12B succeeds (AUC $= 0.999$). The minimum model capability required for reliable continuous scoring lies in this gap and has not been characterised. A systematic evaluation across model sizes and families would delimit this threshold and guide model selection in cost-constrained deployments.

Multimodal settings. As conversational AI systems process images, audio, and video alongside text, role adherence evaluation must generalise beyond text-only responses. The LLM judge path extends naturally by requiring a vision-capable model; the deterministic path would require modality-appropriate similarity functions, such as CLIP-based similarity for image content.

Re-evaluation on a larger and more diverse benchmark. The current benchmark is synthetic and scoped to a single domain. As discussed in Section, the near-perfect AUC of semantic methods is a structural property of this benchmark's construction rather than evidence of genuine role understanding. A multi-domain benchmark with naturally occurring violations, including tone deviations that are semantically close to adherent responses, would provide a more demanding test for all methods and may yield a different relative ordering.

Conclusion

We have proposed Role Adherence, a metric for the Gaussia evaluation framework that measures how consistently an AI assistant maintains its assigned role across a multi-turn conversation. The metric adopts a constructive definition of adherence, evaluates each turn in full conversational context, and uses an LLM-as-a-judge as its default scoring path. Deterministic semantic scoring is available as a reproducible alternative when ground-truth reference responses are present.

Empirical evaluation on a controlled synthetic benchmark surfaces a distinction with practical consequences: deterministic methods (BERTScore, Cosine, $k$-NN) measure semantic proximity to a reference response, not role adherence per se. Their near-perfect AUC on the benchmark is a structural artefact of how the benchmark was constructed, not evidence of role understanding. A response that violates the role in tone or fails a constructive requirement while remaining semantically close to the expected answer is invisible to these methods. This finding motivates using deterministic methods only as lightweight semantic drift monitors, not as primary adherence evaluators.

Among methods that model the reference distribution rather than individual turn pairs, Mahalanobis Distance in high-dimensional embedding space degrades monotonically with sample size (AUC $0.963 \to 0.856$), confirming the curse of dimensionality. A PCA-based pre-reduction inverts the adherence signal entirely and is not recommended. KL Divergence converges to a weak ceiling (AUC $0.877$) because vocabulary overlap is a poor proxy for adherence. The LLM Judge ($F_1 = 0.979$ with CI $[0.952, 1.000]$, $\kappa = 0.959$, at $n = 150$) is the only method that reliably detects all violation types, at the cost of per-call inference and dependence on judge model quality, a variable that must be treated as a first-class design parameter.

The LLM judge in continuous mode, which extracts $P(\textsc{yes})$ from the model's token distribution rather than sampling a binary decision, provides a calibrated adherence score without ground truth and enables unified AUC comparison with the deterministic methods. Gemma 3-12B achieves AUC $= 0.999$, matching the best deterministic methods on this benchmark while evaluating genuine role adherence rather than semantic proximity to a reference. The failure of Llama 3.1-8B (AUC $= 0.303$, inverted signal) confirms that the approach requires a model capable of reliably reasoning about role adherence at the first-token level.

Primary directions for future work are: (1) per-violation-type evaluation to test whether any deterministic method retains diagnostic value for specific violation categories; (2) multi-domain benchmarking beyond the single fintech support agent domain used here; (3) cost-accuracy analysis of judge models to characterise the tradeoff between inference cost and detection quality; and (4) characterisation of the minimum model capability threshold for reliable continuous scoring, delimiting the gap observed between Llama 3.1-8B (AUC $= 0.303$) and Gemma 3-12B (AUC $= 0.999$).

References

[1] Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.
[2] Deriu et al. (2021). Survey on evaluation methods for dialogue systems. Artificial Intelligence Review.
[3] Zhang et al. (2020). BERTScore: Evaluating Text Generation with BERT. International Conference on Learning Representations.
[4] Reimers and Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
[5] Bowman et al. (2015). A large annotated corpus for learning natural language inference. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
[6] Mahalanobis (1936). On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India.
[7] Kullback and Leibler (1951). On information and sufficiency. The Annals of Mathematical Statistics.
[8] Cover and Hart (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory.
[9] Jolliffe (2002). Principal Component Analysis.
[10] Beyer et al. (1999). When Is "Nearest Neighbor" Meaningful? Proceedings of the 7th International Conference on Database Theory.
[11] Cohen (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement.
[12] Hanley and McNeil (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.
[13] Dubey et al. (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.