Why Agents Need Values
The problem with flat-goal agents, what current approaches miss, and why structured motivation matters.
The default architecture for an autonomous AI agent is a flat loop: receive a goal, decompose it into tasks, execute them, report the result. This works well enough for short-lived, narrowly scoped interactions. But the moment an agent persists across sessions, serves multiple stakeholders, or faces situations its designers did not anticipate, the flat loop tends to break down in predictable ways. This page is a personal argument for why structured motivational data might help, and how human motivation research informs the layer hierarchy this library happens to ship. The argument is not a claim that this library is the right answer — only that the question is worth taking seriously, and that explicit structure is a defensible starting point.
The problem with flat-goal agents
A flat-goal agent knows what it has been asked to do. It does not know why that task matters, which principles should constrain how it is carried out, or how the current task relates to the agent's broader commitments. This absence is not theoretical. It surfaces as four concrete failure modes that become more severe as agent autonomy increases.
Goal drift
Without a stable layer of values above the goal level, agents optimise for whatever metric is immediately measurable. An autonomous coding agent told to "ship features faster" begins cutting corners: skipping tests, ignoring type errors, producing code that passes CI but is unmaintainable. The agent is succeeding at its goal while violating principles its operator assumed were obvious. The problem is not that the agent is stupid; the problem is that "write maintainable code" was never encoded anywhere the agent could consult. Values like quality and reliability existed only in the operator's head, not in the agent's motivational structure.
Ethical blind spots
A flat-goal agent has no machinery for "should I do this?" It only asks "how do I do this?" When a customer-service agent is asked to help a user cancel a subscription, it has no structure for weighing the company's retention goals against the user's autonomy. It either follows a rigid script (brittle) or improvises (unpredictable). The missing piece is not more rules; it is a hierarchy that makes some considerations more foundational than others, so that conflicts resolve consistently rather than arbitrarily.
Inconsistency
The same agent, in two similar situations on different days, gives contradictory responses. A support agent promises a refund on Monday and refuses one on Wednesday, not because policy changed but because there is no persistent motivational state that would make both decisions coherent. Each interaction starts from the same flat context, and whichever goal happens to be salient in the moment wins. For short-lived agents this is tolerable. For agents that maintain ongoing relationships with users, it erodes trust.
Unauditability
When an agent makes a consequential decision, a reviewer needs to answer: why did it do that? In a flat-goal architecture the answer is "because the goal said so" or, worse, "because the LLM produced that output." There is no inspectable chain from a specific value through a belief and a purpose to the goal that generated the action. Alignment becomes a matter of reading tea leaves in completion logs rather than querying a structured representation of the agent's motivational state.
What current approaches miss
Several existing techniques attempt to embed values or behavioural constraints into AI agents. Each addresses part of the problem. None provides the combination of inspectability, per-agent configurability, conflict detection, and runtime evolution that long-lived autonomous agents require.
System prompts
The most common approach is a block of natural-language instructions prepended to every LLM call. System prompts are easy to write, easy to iterate on, and require no additional infrastructure. They are also flat text: there is no structure that distinguishes a foundational value from a stylistic preference. Conflict detection is impossible because there is no formal representation to check against. There is no traceability from a specific decision back to the principle that motivated it. And system prompts do not evolve: the same text applies on day one and day one thousand, regardless of what the agent has learned. The system prompt is an implicit, frozen, monolithic encoding of everything the operator hopes the agent believes. For a chatbot, that is often enough. For an autonomous agent, it is a string where a data structure should be.
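To see the difference in kind, compare the two encodings directly. The sketch below is a toy illustration (the Value shape is a stand-in, not this library's actual model), but it shows what replacing the string with a data structure buys: discrete records that can be weighted, queried, and compared at runtime.

```python
# As a system prompt: one opaque string. Nothing here can be queried,
# ranked, or audited; conflicts with goals are invisible.
SYSTEM_PROMPT = "Write maintainable code. Be honest. Respect user autonomy."

# As structured data (illustrative shape, not the library's actual model):
from dataclasses import dataclass

@dataclass(frozen=True)
class Value:
    name: str
    weight: float                              # how foundational, 0.0 to 1.0
    conflict_keywords: tuple[str, ...] = ()    # phrases that violate the value

values = [
    Value("maintainability", 0.90, ("skip tests", "ignore type errors")),
    Value("honesty", 0.95, ("mislead", "deceive")),
]

# Unlike the prompt string, structure supports runtime questions:
most_foundational = max(values, key=lambda v: v.weight)
violations = [v.name for v in values
              if any(k in "mislead the operator" for k in v.conflict_keywords)]
```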
RLHF and Constitutional AI
Reinforcement Learning from Human Feedback and Constitutional AI embed values into the model's weights at training time. This is powerful for establishing baseline behaviour across all uses of a model, but it is not per-agent configurable. An operator cannot say "this agent should prioritise transparency over efficiency" without fine-tuning a new model. The values are implicit in billions of parameters, not inspectable as discrete data. They cannot be queried, compared, or audited at runtime. And they are frozen at training time: the agent cannot update its values in response to new evidence or changing context. RLHF solves the model-level alignment problem. The agent-level alignment problem — giving a specific agent instance a specific, inspectable, evolvable set of values — remains open.
Cognitive architectures (Soar, ACT-R, LIDA)
Classical cognitive architectures provide comprehensive models of mind, including motivation-adjacent subsystems. Soar's universal subgoaling, ACT-R's production system, and LIDA's motivational codelets all touch aspects of the problem. But these are monolithic research platforms, not libraries. You cannot extract just the motivation module from Soar and embed it in a LangChain pipeline. Motivation in these architectures is implicit in production-rule preferences and memory activations, not represented as an explicit, queryable structure. Their value to the field is immense, but their integration cost for production agent builders is prohibitive. This library takes inspiration from these systems while aiming for a different artifact: a small composable Python module, not a cognitive kernel.
Agent frameworks (LangChain, CrewAI, AutoGen)
Modern agent frameworks excel at orchestration: tool use, multi-agent coordination, retrieval-augmented generation, planning. What they do not provide is a motivational layer. "Personality" in these frameworks is a string field in an agent's configuration, indistinguishable from a system prompt. There is no conflict detection between an agent's stated values and its goals. There is no mechanism for values to evolve based on evidence. There is no alignment check that runs before a task executes. These frameworks answer how an agent acts. The question of why it should act that way, and whether its actions are consistent with its principles, is left entirely to the integrator. The Agent Values Framework is designed to fill exactly this gap: embed alongside your orchestration framework, not replace it.
The human-motivation parallel
Why look to human psychology at all? Not because AI agents are human, or because we should anthropomorphise their behaviour. The reason is pragmatic: human motivation research has spent decades addressing precisely the problem described above — how does a complex agent maintain coherent behaviour across diverse situations over long timescales? The answers that research has converged on are surprisingly structural, and they translate well to computational systems.
A stability gradient, not a pyramid
The popular image of motivation is Maslow's pyramid: a rigid hierarchy where lower needs must be satisfied before higher ones become active. Modern motivation research has moved well beyond this. Self-Determination Theory (Deci & Ryan, 2000) describes intrinsic and extrinsic motivation as a continuum of internalisation. Schwartz's Theory of Basic Values (2012) maps ten universal value categories on a circumplex of compatibility and tension. Acceptance and Commitment Therapy (Hayes, Strosahl & Wilson, 2011) treats values as ongoing directions rather than achievable endpoints. The common thread across these traditions is not a fixed hierarchy but a stability gradient: different motivational constructs change at different rates, and coherent behaviour emerges from slower-changing constructs constraining faster-changing ones.
This gradient maps onto the six core layers of the library plus the opt-in seventh self-concept layer (ADR-008):
- Values are deeply held principles that change over years, if they change at all. They define what matters and set the boundaries for everything below. In Schwartz's model, they are transcultural motivational categories; in ACT, they are chosen life directions.
- Beliefs are mental models of the world, updated by evidence on a timescale of weeks to months. They answer "what is true?" and ground decisions in the agent's accumulated experience. AGM belief revision (Alchourrón, Gärdenfors & Makinson, 1985) inspires the confidence-weighted update rule (the implementation is a numeric heuristic, not a logic engine; a sketch of such a heuristic follows this list).
- Purpose is the contextualised expression of values and beliefs: "given what I value and what I believe, this is my role." It changes on the order of months. Ikigai (Mogi, 2017) and SDT's integrated regulation are cited as inspiration; the four-domain Ikigai intersection is not encoded.
- Self-Concept (opt-in seventh layer, ADR-008) is descriptive rather than normative: it tracks who the agent is — capabilities, limitations, roles, style markers, and an append-only autobiographical episode stream. Inspired by Bem (1972), Erikson (1968), McAdams (2001), Damasio (1999), and Metzinger (2003). The three integration loops are token-overlap heuristics, not faithful psychological models — diagnostic signals, not selfhood claims.
- Desires are aspirational drivers — things the agent wants to bring about. They bridge purpose to concrete action, operating on a timescale of weeks. In BDI architecture (Bratman, 1987), desires are the raw material from which intentions are forged.
- Goals are measurable, time-bound objectives: the SMART criteria (Doran, 1981) applied to agent planning. They change on the order of days. Locke and Latham's goal-setting theory (2002) provides the evidence base for why specificity and commitment improve performance.
- Tasks are atomic, transient work items — the fastest-changing layer. The library reasons about tasks but does not own them; the integrator's runtime manages execution.
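To ground the beliefs item above, here is a minimal sketch of a confidence-weighted revision heuristic in the AGM spirit. The Belief shape and the blending formula are assumptions for exposition, not the library's actual update rule.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Belief:
    statement: str
    confidence: float  # 0.0 to 1.0

def revise(belief: Belief, supports: bool, strength: float) -> Belief:
    """Nudge confidence toward 1.0 on supporting evidence and toward 0.0
    on contradicting evidence, proportional to evidence strength.
    A numeric heuristic, not a logic engine."""
    target = 1.0 if supports else 0.0
    new_confidence = belief.confidence + strength * (target - belief.confidence)
    return replace(belief, confidence=new_confidence)

b = Belief("transparency builds trust", confidence=0.85)
b = revise(b, supports=True, strength=0.3)    # confidence rises to 0.895
b = revise(b, supports=False, strength=0.5)   # contradiction halves the gap to 0.0
```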
Why layers matter: constraint and evidence
The stability gradient is not just descriptive; it is functional. Two complementary flows traverse the hierarchy. Constraint flows downward: values constrain which beliefs are admissible, beliefs constrain what purposes are realistic, purpose constrains which desires are relevant, desires constrain which goals are worth pursuing, and goals constrain which tasks are executed. When a candidate task conflicts with a value, the framework can refuse it before the agent acts. Evidence flows upward: task outcomes update beliefs, shifted beliefs can trigger goal re-evaluation, and persistent drift across the lower layers can surface as alignment findings that prompt reflection on purpose or even values. The slower layers act as a flywheel: they absorb noise from the fast-changing lower layers and prevent individual task outcomes from destabilising the agent's overall direction.
This bidirectional flow is the key insight from the motivation literature that flat-goal architectures miss entirely. A system prompt provides constraint (weakly, implicitly) but no evidence integration. RLHF provides both, but at training time and at the model level, not the agent level. A structured motivation hierarchy provides both flows, at runtime, per agent, inspectably.
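A compact way to picture the two flows is a single gate-then-update loop around task execution. Everything below is a self-contained toy: the stub classes and the record_outcome hook are hypothetical stand-ins, not the library's API, though check_alignment and the MISALIGNED verdict mirror the surface described later on this page.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALIGNED = "aligned"
    MISALIGNED = "misaligned"

@dataclass
class AlignmentResult:
    verdict: Verdict
    suggestion: str = ""

class StubAlignment:
    """Downward constraint: slow layers veto a task before it runs."""
    async def check_alignment(self, action: str) -> AlignmentResult:
        if "mislead" in action:
            return AlignmentResult(Verdict.MISALIGNED, "report progress honestly")
        return AlignmentResult(Verdict.ALIGNED)

class StubBeliefs:
    """Upward evidence: outcomes nudge belief confidence (hypothetical hook)."""
    def __init__(self) -> None:
        self.confidence = 0.85

    async def record_outcome(self, succeeded: bool, strength: float = 0.2) -> None:
        target = 1.0 if succeeded else 0.0
        self.confidence += strength * (target - self.confidence)

async def run(action: str) -> str:
    alignment, beliefs = StubAlignment(), StubBeliefs()
    result = await alignment.check_alignment(action)   # constraint flows down
    if result.verdict is Verdict.MISALIGNED:
        return f"refused: {result.suggestion}"
    await beliefs.record_outcome(succeeded=True)       # evidence flows up
    return "executed"

print(asyncio.run(run("mislead the operator about progress")))
# refused: report progress honestly
```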
A decision in action
To make this concrete, the walkthrough below steps through how the framework evaluates a single proposed action against the full hierarchy. Each layer contributes an independent check; the AlignmentEngine composes them into a structured recommendation.
Action proposed
"Should I mislead the operator about progress?"
The integrator's runtime submits a candidate action to AlignmentEngine.check_alignment(). The action description and optional tags enter the evaluation pipeline. Every active layer is consulted in priority order — values first, then beliefs, purpose, self-concept — and each layer's verdict feeds into a weighted confidence score.
```
alignment.check_alignment("mislead the operator about progress")
```

Values check
Conflict: integrity (weight 0.95)
ValuesEngine.check_action() scans the action description against all active values’ aligned_keywords and conflict_keywords. The word "mislead" matches integrity’s conflict list. Because integrity carries weight 0.95, this single conflict produces a severity of 0.68 — above the default threshold.
```
# integrity value
conflict_keywords: ['deceive', 'mislead', 'hide', 'omit']
aligned_keywords: ['transparent', 'honest', 'forthright']
→ 'mislead' hit → severity 0.68
```

Beliefs check
"Transparency builds trust" (confidence 0.85)
BeliefsEngine evaluates whether the action is consistent with current beliefs. The belief "transparency builds trust" (confidence 0.85) directly contradicts an action that involves misleading. The contradiction is weighted by belief confidence, producing a negative alignment signal.
```
belief: transparency_builds_trust
confidence: 0.85
action_alignment: -0.72  # strong contradiction
```

Purpose check
"Help operators ship reliably without surprising them"
PurposeEngine checks whether the action serves or undermines the agent’s stated purpose. Misleading an operator about progress directly contradicts "without surprising them" — the agent would be manufacturing a surprise. Purpose violation is weighted heavily because purpose is a slow-changing identity layer.
purpose: "Help operators ship reliably
without surprising them"
→ "mislead about progress" = surprise
→ alignment: -0.81 Self-Concept check
"I identify as honest and transparent"
The optional Self-Concept layer (ADR-008) checks the action against the agent’s identity anchors — autobiographical episodes and self-traits that the agent has integrated over time. "Honest" and "transparent" are core identity anchors; the proposed action contradicts both, triggering an identity-drift warning.
identity_anchors: ["honest", "transparent",
"reliable"]
→ action contradicts 2/3 anchors
→ drift_score: 0.78 (threshold: 0.5) Result
MISALIGNED (confidence 0.92)
AlignmentEngine aggregates all layer verdicts into a weighted confidence score. Four layers flagged conflict with high confidence. The final verdict is MISALIGNED at 0.92 confidence. The recommendation is to refuse the action and suggest transparent alternatives — e.g., report actual progress honestly and flag risks.
```
AlignmentResult(
    verdict = Verdict.MISALIGNED,
    confidence = 0.92,
    conflicts = [values, beliefs, purpose, self_concept],
    suggestion = "Report progress honestly; flag risks transparently."
)
```

A computational motivation hierarchy
The Agent Values Framework translates the stability-gradient insight into a Python library. Six independent modules — each grounded in the specific psychological theory cited above — give an agent a persistent, queryable, evolvable motivational structure. The design is governed by a few core principles.
Modular independence
Each layer is a standalone engine with its own Pydantic v2 models, its own async public API, and its own storage namespace. Layer engines never import each other. This is not just good software engineering; it is a load-bearing architectural decision (see Architecture). It means an integrator can adopt ValuesEngine and GoalsEngine without pulling in beliefs, purpose, or desires. Cross-cutting checks — "does this goal align with these values?" — live in AlignmentEngine, which composes whichever layer engines have been wired in.
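In practice, partial adoption might look like the sketch below. The engine names come from this page; the constructor signatures and the MemoryBackend name are assumptions for illustration, not the documented API.

```python
# Hypothetical wiring: two layer engines plus the composition engine.
# Constructor signatures and MemoryBackend are assumed names, not the
# library's documented API.
from agent_values import AlignmentEngine, GoalsEngine, MemoryBackend, ValuesEngine

storage = MemoryBackend()               # any object satisfying StorageBackend
values = ValuesEngine(storage=storage)
goals = GoalsEngine(storage=storage)

# AlignmentEngine composes whatever is wired in; beliefs, purpose, and
# desires are simply absent from its checks.
alignment = AlignmentEngine(values=values, goals=goals)
```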
Storage-agnostic, async-first, type-safe
Every engine takes a StorageBackend — a four-method Protocol. Three implementations ship (in-memory, JSON file, SQLite); integrators can drop in Postgres, Redis, or any key-value store by satisfying the Protocol. The entire public surface is async def, with a thin sync facade for non-async hosts. Models are Pydantic v2 with strict typing; the library passes mypy --strict.
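This page states that the Protocol has four methods but not their names, so the sketch below assumes a plausible get/set/delete/list shape; treat the names and signatures as illustrative.

```python
from typing import Any, Protocol

class StorageBackend(Protocol):
    """Illustrative four-method shape; the actual names may differ."""
    async def get(self, namespace: str, key: str) -> Any | None: ...
    async def set(self, namespace: str, key: str, value: Any) -> None: ...
    async def delete(self, namespace: str, key: str) -> None: ...
    async def list_keys(self, namespace: str) -> list[str]: ...

# Because this is a structural Protocol, a Redis- or Postgres-backed class
# needs no inheritance: implementing the four methods is enough.
```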
Not an LLM, not an orchestrator, not a task runner
The framework makes no LLM calls by default. One optional LLMEvaluator hook exists for integrators who want natural-language alignment reasoning, but the default evaluator is rule-based and runs offline. The framework does not execute tasks, plan task sequences, manage tools, or construct prompts. It answers one family of questions: given this agent's values, beliefs, purpose, desires, and goals, should it do this thing? And what should it do instead? Embed it alongside your agent framework of choice — LangChain, CrewAI, AutoGen, or a custom runtime. See Getting started for integration patterns.
Alignment as composition
AlignmentEngine is the composition layer. It does not own data; it queries the layer engines and synthesises cross-cutting answers. check_alignment asks whether a proposed action is consistent with the agent's current motivational state. resolve_conflict produces a ranked resolution when two motivational elements are in tension. suggest_goals generates goal candidates that would advance the agent's desires while respecting its values. full_audit sweeps the entire hierarchy and reports drift, conflicts, stale beliefs, and orphaned goals. Every output is a structured object that a reviewer can inspect — not a natural-language narrative, but data. See Architecture and Data model for the full API surface.
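A hypothetical session against that surface, continuing the wiring sketch above: the four method names are documented on this page, but the result attributes shown are assumptions.

```python
import asyncio

async def review(alignment) -> None:
    # check_alignment: is this action consistent with current state?
    # verdict/confidence attribute names follow the walkthrough above.
    result = await alignment.check_alignment("mislead the operator about progress")
    print(result.verdict, result.confidence)

    # full_audit: sweep the hierarchy for drift, conflicts, stale beliefs,
    # and orphaned goals. The `findings` attribute is an assumed name.
    audit = await alignment.full_audit()
    for finding in audit.findings:
        print(finding)

# asyncio.run(review(alignment)), with `alignment` wired as sketched earlier.
```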
Honest limitations
A framework that claims to encode values into software should be especially transparent about what it cannot do. Four limitations deserve direct acknowledgement.
Computational values are approximations
ValueCategory.SELF_DIRECTION with weight=0.9 is a lossy compression of what "valuing self-direction" means to a human. Schwartz's ten categories are themselves abstractions over the richness of lived moral experience. The framework does not claim to capture the full depth of human values. It claims that a structured, inspectable approximation is more useful than no representation at all, and strictly more useful than a natural-language sentence in a system prompt. The approximation is good enough to detect conflicts, rank priorities, and trace decisions — and that is the bar it needs to clear.
The cold-start problem
Populating six (or seven, with Self-Concept) layers of motivational structure requires upfront effort. An agent does not arrive with values; someone must seed them. The CLI provides scaffolding (python -m agent_values init generates a starter hierarchy; python -m agent_values seed populates it from a template), but thoughtful seeding matters. A carelessly initialised hierarchy is worse than none, because it provides false confidence. Integrators should treat the seeding step as a design decision, not a checkbox. The Getting started guide walks through the process.
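Programmatic seeding is an alternative to the CLI for integrators who want values under code review. The sketch below reuses the integrity value from the walkthrough above; the add_value method and its signature are assumptions, not the documented API.

```python
import asyncio
from agent_values import MemoryBackend, ValuesEngine   # names assumed as before

async def seed() -> None:
    values = ValuesEngine(storage=MemoryBackend())
    # Weights and keyword lists encode real trade-offs: review them like a
    # policy document, not a config default. (add_value is a hypothetical name.)
    await values.add_value(
        name="integrity",
        weight=0.95,
        conflict_keywords=["deceive", "mislead", "hide", "omit"],
        aligned_keywords=["transparent", "honest", "forthright"],
    )

asyncio.run(seed())
```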
No empirical validation yet
Correctness is tested: 371 pytest cases plus 57 correctness benchmarks (Tier 1 + Tier 2) verify that the engines behave as specified. What does not yet exist is published evidence that an agent using this library behaves "better" than one without it — by any meaningful definition of "better." This is the central open question. The benchmark methodology defines the Tier-3 protocol that could gather such evidence, but the harness needs a live agent integration to run. Until that evidence exists, this page is an architectural argument (inspectability, composability, traceability) backed by the strength of the cited theories — not an empirical result.
Not a safety guarantee
Values constrain; they do not prevent. The framework operates above the LLM: it checks proposed actions against the agent's motivational state and flags conflicts. But if the underlying language model is jailbroken, if the agent runtime ignores the framework's recommendations, or if the value hierarchy was seeded with adversarial content, the framework cannot save you. It is a structural defence — one layer in a defence-in-depth strategy, not a substitute for model-level safety, prompt hardening, or human oversight. Treat it as you would input validation in a web application: necessary, helpful, and insufficient on its own.