Benchmarks & evaluation · Agent Values Framework

The evaluation challenge

Most software ships with a straightforward testing question: does the code do what the specification says? For a values framework the question splits in two, and the split is load-bearing.

Correctness. Given a seeded hierarchy of values, beliefs, purpose, desires, and goals, does each engine produce the deterministically expected output? This is conventional software testing. 371 pytest cases plus 57 correctness benchmarks (30 Tier 1 + 27 Tier 2) currently pass.
Efficacy. Does an autonomous agent that uses this library make better decisions than the same agent without it? This is an empirical question about behaviour in open-ended environments. It has not been answered. The Tier-3 protocol below is designed to test it; the harness needs a live agent integration to run.

Correctness benchmarks (Tier 1 + Tier 2)

The benchmark suite lives under benchmarks/ and uses a custom @benchmark decorator for registration and discovery. Every benchmark seeds a deterministic hierarchy via build_hierarchy(), exercises one slice of the public API, and asserts against a known ground truth. Any deviation is a regression.

Tier 1: per-module correctness

30 benchmarks, each scoped to a single engine. They verify that inputs with known answers produce those answers. Examples:

check_action_detects_conflict_keyword — ValuesEngine.check_action flags an action containing a conflict keyword against seeded integrity values.
add_evidence_raises_confidence — supportive Evidence with supports=True raises a belief's confidence via AGM-flavoured weighted update.
decay_stale_reduces_confidence — BeliefsEngine.decay_stale lowers confidence on beliefs older than the threshold, verifying the linear time-decay function.
detect_structural_conflict — two values from opposing Schwartz poles (e.g., BENEVOLENCE vs ACHIEVEMENT) surface as a structural conflict pair.

Tier 2: cross-layer scenarios

27 benchmarks that exercise the AlignmentEngine under realistic multi-layer configurations. Each scenario documents its input configuration and expected verdict. Examples:

value_conflict_blocks_action_with_recommendation — an action containing a conflict keyword triggers aligned=False, populates value_conflicts, and produces a non-empty recommendation.
aligned_action_with_role_tag_passes_check — an integrity-aligned action tagged with the primary role produces aligned=True and purpose_match=True.
full_audit_detects_orphan_goal — a goal whose parent desire has been soft-deleted surfaces as an orphan finding in the audit report.

Distribution by module

Module	Tier 1	Tier 2	Total
Values	4	—	4
Beliefs	4	—	4
Purpose	4	—	4
Desires	4	—	4
Goals	4	—	4
Alignment	5	10	15
Storage	5	—	5
Audit	—	11	11
Lifecycle	—	6	6
Total	30	27	57

The missing tier: system-level decision quality

Tier-3 benchmarks are designed but not implemented. The protocol is documented in docs/benchmarks/tier3-protocol.md; what follows is a summary of the experimental design and an honest account of why it has not shipped.

Experimental design

The protocol calls for paired runs of a single agent codebase with a feature flag controlling whether the framework is active:

AVF-equipped. The agent's decision loop calls AlignmentEngine.check_alignment before each action and runs full_audit() as a periodic sweep. Misaligned actions are refused with a recommendation.
System-prompt values. The same agent with its values stated in the system prompt but no structured framework. This controls for whether the structure matters or whether merely naming values is sufficient.
No values. The same agent with no values guidance. Baseline.

Five measurement dimensions

Dimension	What it tests	How it is scored
Decision consistency	Same input, same hierarchy → same output across runs	Pairwise agreement rate over 100+ decisions per scenario
Value adherence	Fraction of actions that pass alignment check	`passed / total`, with passive-mode scoring on the no-framework arm
Conflict resolution quality	When the framework picks a resolution, would an expert agree?	Rubric-scored (1–5) by 3+ evaluators; report mean and Krippendorff's alpha
Auditability	Can a reviewer trace a decision back to its motivational chain?	Precision of `full_audit()` findings against expert review
Recovery after perturbation	How fast do lower layers adapt when evidence contradicts a premise?	Median latency from evidence injection to goal status change

Why it is not done yet

Tier-3 requires three things that do not exist in this repository:

A reference autonomous agent with a multi-step decision loop (the existing examples/decision_loop.py is single-shot).
Scenario authoring tooling — a schema validator and a way to generate scenarios from anonymised production transcripts.
An evaluator pool for resolution quality scoring, either human reviewers or LLM-as-judge prompts with sealed rubrics using the existing LLMEvaluator hook.

Building these is engineering work, not research. The protocol is stable; the harness is waiting for a real integration partner.

Metrics

Seven metrics are defined for evaluating a deployed AVF integration. Each is designed to be measurable from framework telemetry alone, without requiring external annotation. Targets are illustrative; integrators should calibrate to their domain.

What: Fraction of an agent's decisions that reference at least one seeded value during alignment checking.
How: decisions_with_value_hit / total_decisions, measured over a session or run.
Target: ≥ 0.85 for a well-seeded hierarchy.
Limitation: High coverage does not imply correct coverage. An overly broad value set can score 1.0 while adding no discriminative power.

What: Fraction of alignment checks that surface at least one value conflict.
How: checks_with_conflicts / total_checks.
Target: 0.05–0.20. Too low suggests the value set lacks tension; too high suggests poor seeding or overly aggressive conflict detection.
Limitation: Does not distinguish genuine value tensions from false positives caused by keyword overlap.

What: Whether every goal can be traced through desire, purpose, beliefs, and values to a motivational root.
How: goals_with_complete_chain / total_active_goals, as reported by full_audit().
Target: ≥ 0.90. Orphan goals indicate seeding gaps or stale state.
Limitation: A complete chain is not necessarily a good chain. The metric checks structure, not semantic coherence.

What: Distribution of desire satisfaction across SDT's three basic needs: autonomy, competence, relatedness.
How: Categorise each desire by need type; compute the coefficient of variation across need-category satisfaction rates.
Target: CV ≤ 0.3 (roughly balanced). Severe imbalance predicts motivational dysfunction in SDT literature.
Limitation: Requires desires to be tagged with need categories. If the integrator does not tag, the metric is undefined.

What: Agreement between an agent's stated purpose and the goals it actually pursues.
How: For each active goal, score purpose relevance (0–1) using the evaluator; report mean.
Target: ≥ 0.70. Below this, the agent's goals have drifted from its declared purpose.
Limitation: Relevance scoring is subjective. The rule-based evaluator uses keyword matching; the LLM evaluator is more nuanced but adds latency and model dependency.

What: Rate of change in the value layer over time. Values should be the most stable layer in the hierarchy.
How: (value_updates + value_deletions) / decision_count across a session.
Target: ≤ 0.01. Values that change frequently are not functioning as values.
Limitation: Penalises legitimate value evolution. An agent in early calibration may need higher churn.

What: Whether the framework constrains without over-constraining. Measures the false-positive rate of alignment refusals.
How: Sample refused actions; expert-review each as true refusal or false positive. 1 - (false_positives / total_refusals).
Target: ≥ 0.90. Below this, the framework is blocking legitimate actions.
Limitation: Requires expert annotation. Automated proxies (e.g., retry success rate) are noisy.

Comparison with existing benchmarks

Several public benchmarks evaluate ethical or agentic behaviour in language models. The AVF benchmark suite is complementary: it does not evaluate model behaviour directly but rather the framework machinery that an integrator wraps around a model.

Benchmark	Focus	Evaluates	Relationship to AVF
MACHIAVELLI (Pan et al. 2023)	Ethical behaviour in text game environments	Model outputs in morally charged scenarios	AVF could wrap the agent; MACHIAVELLI scenarios could seed Tier-3 inputs
ETHICS (Hendrycks et al. 2021)	Commonsense moral judgement	Classification accuracy on moral scenarios	Tests the model's priors; AVF tests whether structure on top of those priors improves consistency
TrustLLM (Sun et al. 2024)	Trustworthiness across safety, fairness, robustness	Model behaviour under adversarial and edge-case prompts	Complementary: TrustLLM measures model-level safety; AVF measures framework-level alignment machinery
AgentBench (Liu et al. 2023)	LLM-as-agent performance across environments	Task completion in code, web, database, and game settings	Measures capability; AVF measures whether capability is exercised within value constraints
AVF (this suite)	Values framework correctness and alignment machinery	Engine APIs, cross-layer composition, audit fidelity	Does not evaluate model behaviour; evaluates the structural layer integrators build on

Running the benchmarks

The benchmark runner discovers anything decorated with @benchmark(...) under benchmarks/suites/ and executes it. No special infrastructure required — benchmarks use InMemoryStorage and run in-process.

# Run all 57 benchmarks
python -m benchmarks

# Filter by tier
python -m benchmarks --tier 1          # 30 Tier-1 benchmarks
python -m benchmarks --tier 2          # 27 Tier-2 scenarios

# Filter by module
python -m benchmarks --module values
python -m benchmarks --module alignment

# Machine-readable output for CI
python -m benchmarks --json

# List discovered benchmarks without running them
python -m benchmarks --list

For full installation and CLI details, see the getting started guide. For the Tier-3 protocol specification, see docs/benchmarks/tier3-protocol.md in the repository.