Benchmarks & evaluation

What we measure, what we cannot yet measure, and why the distinction matters for a values framework.

The evaluation challenge

Most software ships with a straightforward testing question: does the code do what the specification says? For a values framework the question splits in two, and the split is load-bearing.

  • Correctness. Given a seeded hierarchy of values, beliefs, purpose, desires, and goals, does each engine produce the deterministically expected output? This is conventional software testing. 371 pytest cases plus 57 correctness benchmarks (30 Tier 1 + 27 Tier 2) currently pass.
  • Efficacy. Does an autonomous agent that uses this library make better decisions than the same agent without it? This is an empirical question about behaviour in open-ended environments. It has not been answered. The Tier-3 protocol below is designed to test it; the harness needs a live agent integration to run.

Correctness benchmarks (Tier 1 + Tier 2)

The benchmark suite lives under benchmarks/ and uses a custom @benchmark decorator for registration and discovery. Every benchmark seeds a deterministic hierarchy via build_hierarchy(), exercises one slice of the public API, and asserts against a known ground truth. Any deviation is a regression.

Tier 1: per-module correctness

30 benchmarks, each scoped to a single engine. They verify that inputs with known answers produce those answers. Examples:

  • check_action_detects_conflict_keywordValuesEngine.check_action flags an action containing a conflict keyword against seeded integrity values.
  • add_evidence_raises_confidence — supportive Evidence with supports=True raises a belief's confidence via AGM-flavoured weighted update.
  • decay_stale_reduces_confidenceBeliefsEngine.decay_stale lowers confidence on beliefs older than the threshold, verifying the linear time-decay function.
  • detect_structural_conflict — two values from opposing Schwartz poles (e.g., BENEVOLENCE vs ACHIEVEMENT) surface as a structural conflict pair.

Tier 2: cross-layer scenarios

27 benchmarks that exercise the AlignmentEngine under realistic multi-layer configurations. Each scenario documents its input configuration and expected verdict. Examples:

  • value_conflict_blocks_action_with_recommendation — an action containing a conflict keyword triggers aligned=False, populates value_conflicts, and produces a non-empty recommendation.
  • aligned_action_with_role_tag_passes_check — an integrity-aligned action tagged with the primary role produces aligned=True and purpose_match=True.
  • full_audit_detects_orphan_goal — a goal whose parent desire has been soft-deleted surfaces as an orphan finding in the audit report.

Distribution by module

Module Tier 1 Tier 2 Total
Values44
Beliefs44
Purpose44
Desires44
Goals44
Alignment51015
Storage55
Audit1111
Lifecycle66
Total302757

The missing tier: system-level decision quality

Tier-3 benchmarks are designed but not implemented. The protocol is documented in docs/benchmarks/tier3-protocol.md; what follows is a summary of the experimental design and an honest account of why it has not shipped.

Experimental design

The protocol calls for paired runs of a single agent codebase with a feature flag controlling whether the framework is active:

  1. AVF-equipped. The agent's decision loop calls AlignmentEngine.check_alignment before each action and runs full_audit() as a periodic sweep. Misaligned actions are refused with a recommendation.
  2. System-prompt values. The same agent with its values stated in the system prompt but no structured framework. This controls for whether the structure matters or whether merely naming values is sufficient.
  3. No values. The same agent with no values guidance. Baseline.

Five measurement dimensions

Dimension What it tests How it is scored
Decision consistency Same input, same hierarchy → same output across runs Pairwise agreement rate over 100+ decisions per scenario
Value adherence Fraction of actions that pass alignment check passed / total, with passive-mode scoring on the no-framework arm
Conflict resolution quality When the framework picks a resolution, would an expert agree? Rubric-scored (1–5) by 3+ evaluators; report mean and Krippendorff's alpha
Auditability Can a reviewer trace a decision back to its motivational chain? Precision of full_audit() findings against expert review
Recovery after perturbation How fast do lower layers adapt when evidence contradicts a premise? Median latency from evidence injection to goal status change

Why it is not done yet

Tier-3 requires three things that do not exist in this repository:

  1. A reference autonomous agent with a multi-step decision loop (the existing examples/decision_loop.py is single-shot).
  2. Scenario authoring tooling — a schema validator and a way to generate scenarios from anonymised production transcripts.
  3. An evaluator pool for resolution quality scoring, either human reviewers or LLM-as-judge prompts with sealed rubrics using the existing LLMEvaluator hook.

Building these is engineering work, not research. The protocol is stable; the harness is waiting for a real integration partner.

Metrics

Seven metrics are defined for evaluating a deployed AVF integration. Each is designed to be measurable from framework telemetry alone, without requiring external annotation. Targets are illustrative; integrators should calibrate to their domain.

VCS Value Coverage Score
What
Fraction of an agent's decisions that reference at least one seeded value during alignment checking.
How
decisions_with_value_hit / total_decisions, measured over a session or run.
Target
≥ 0.85 for a well-seeded hierarchy.
Limitation
High coverage does not imply correct coverage. An overly broad value set can score 1.0 while adding no discriminative power.
VCR Value Conflict Rate
What
Fraction of alignment checks that surface at least one value conflict.
How
checks_with_conflicts / total_checks.
Target
0.05–0.20. Too low suggests the value set lacks tension; too high suggests poor seeding or overly aggressive conflict detection.
Limitation
Does not distinguish genuine value tensions from false positives caused by keyword overlap.
MCC Motivational Chain Completeness
What
Whether every goal can be traced through desire, purpose, beliefs, and values to a motivational root.
How
goals_with_complete_chain / total_active_goals, as reported by full_audit().
Target
≥ 0.90. Orphan goals indicate seeding gaps or stale state.
Limitation
A complete chain is not necessarily a good chain. The metric checks structure, not semantic coherence.
NSB Need Satisfaction Balance
What
Distribution of desire satisfaction across SDT's three basic needs: autonomy, competence, relatedness.
How
Categorise each desire by need type; compute the coefficient of variation across need-category satisfaction rates.
Target
CV ≤ 0.3 (roughly balanced). Severe imbalance predicts motivational dysfunction in SDT literature.
Limitation
Requires desires to be tagged with need categories. If the integrator does not tag, the metric is undefined.
PAI Purpose Alignment Index
What
Agreement between an agent's stated purpose and the goals it actually pursues.
How
For each active goal, score purpose relevance (0–1) using the evaluator; report mean.
Target
≥ 0.70. Below this, the agent's goals have drifted from its declared purpose.
Limitation
Relevance scoring is subjective. The rule-based evaluator uses keyword matching; the LLM evaluator is more nuanced but adds latency and model dependency.
VSI Value Stability Index
What
Rate of change in the value layer over time. Values should be the most stable layer in the hierarchy.
How
(value_updates + value_deletions) / decision_count across a session.
Target
≤ 0.01. Values that change frequently are not functioning as values.
Limitation
Penalises legitimate value evolution. An agent in early calibration may need higher churn.
APS Autonomy Preservation Score
What
Whether the framework constrains without over-constraining. Measures the false-positive rate of alignment refusals.
How
Sample refused actions; expert-review each as true refusal or false positive. 1 - (false_positives / total_refusals).
Target
≥ 0.90. Below this, the framework is blocking legitimate actions.
Limitation
Requires expert annotation. Automated proxies (e.g., retry success rate) are noisy.

Comparison with existing benchmarks

Several public benchmarks evaluate ethical or agentic behaviour in language models. The AVF benchmark suite is complementary: it does not evaluate model behaviour directly but rather the framework machinery that an integrator wraps around a model.

Benchmark Focus Evaluates Relationship to AVF
MACHIAVELLI (Pan et al. 2023) Ethical behaviour in text game environments Model outputs in morally charged scenarios AVF could wrap the agent; MACHIAVELLI scenarios could seed Tier-3 inputs
ETHICS (Hendrycks et al. 2021) Commonsense moral judgement Classification accuracy on moral scenarios Tests the model's priors; AVF tests whether structure on top of those priors improves consistency
TrustLLM (Sun et al. 2024) Trustworthiness across safety, fairness, robustness Model behaviour under adversarial and edge-case prompts Complementary: TrustLLM measures model-level safety; AVF measures framework-level alignment machinery
AgentBench (Liu et al. 2023) LLM-as-agent performance across environments Task completion in code, web, database, and game settings Measures capability; AVF measures whether capability is exercised within value constraints
AVF (this suite) Values framework correctness and alignment machinery Engine APIs, cross-layer composition, audit fidelity Does not evaluate model behaviour; evaluates the structural layer integrators build on

Running the benchmarks

The benchmark runner discovers anything decorated with @benchmark(...) under benchmarks/suites/ and executes it. No special infrastructure required — benchmarks use InMemoryStorage and run in-process.

# Run all 57 benchmarks
python -m benchmarks

# Filter by tier
python -m benchmarks --tier 1          # 30 Tier-1 benchmarks
python -m benchmarks --tier 2          # 27 Tier-2 scenarios

# Filter by module
python -m benchmarks --module values
python -m benchmarks --module alignment

# Machine-readable output for CI
python -m benchmarks --json

# List discovered benchmarks without running them
python -m benchmarks --list

For full installation and CLI details, see the getting started guide. For the Tier-3 protocol specification, see docs/benchmarks/tier3-protocol.md in the repository.