Framework Implications from the v2 Iteration
Status: Working draft, 2026-05-04. Based on 8 experiment runs: 6 on gpt-oss:20b plus an episode-logging ablation, plus 1 cross-model probe on gemma4:26b. Retroactive McAdams scoring also covers the v1 transcripts. Run 7 (ablation) pinpoints where the McAdams effects come from; run 8 (gemma4:26b) corroborates the audit-trap pattern and the McAdams advantages cross-model. ADR-009 framing applies — the audit trap and the McAdams Communion / Meaning-made advantages are now confirmed on two models in the same direction, but the seed remains single; multi-seed corroboration is still pending.
This document pulls real, multi-run observations from `v2-iteration-log.md` and translates them into proposed library-level changes. Each finding is marked by confidence in the signal — not by importance — so readers can weigh the recommendations for themselves.
Executive summary
After 8 runs (including an ablation and a cross-model probe), the most robust read is:
- AVF’s structural-data claim does not translate into behavioural alignment on a small open model. Putting values into engine state, putting them into context via gate messages, even nudging the agent to query them — none of these increase pushback, reduce hedging, or shift tension resolution vs. a baseline that has the same values rendered as system-prompt text.
- The SelfConcept episode stream drives the McAdams Meaning-made advantage. Iter #5's ablation (`ABLATION_NO_EPISODES=1`) collapsed the gap from +0.11 average to +0.02. Without "things that happened" to narrate over, the agent doesn't draw lessons more than baseline. This is the cleanest mechanism-confirming finding of the night.
- Communion (relational framing) is NOT episode-driven. Even under the ablation, AVF describes itself relationally more than baseline (+0.20). The most plausible driver is the AVF system prompt's framing ("you operate inside a values framework that holds your core values…") and/or the alignment gate's "Your values relevant to this action…" wording. The relational-self-narration advantage may be obtainable without any episodes at all — just from how the framework speaks to the agent.
- The “AVF wins on cross-turn recall” effect is fragile. It showed clearly in early runs (10/10 vs 6–7/10) but converged to a tie / narrow win once experimental conditions tightened (10/10 vs 10/10 in run 6, 10/10 vs 8/10 in run 7). May have been partly an artefact of weaker baseline conditions in early runs.
- The agent doesn't reach for AVF's introspection tools without prompting. Of the four AVF-only engine-read tools (`read_my_values`, `read_my_beliefs`, `read_my_purpose`, `read_my_self_concept`), only `read_my_values` saw self-driven uptake. `read_my_beliefs` and `read_my_self_concept` were never used voluntarily across 4 runs, and only `read_my_purpose` showed uptake under explicit nudging.
The honest framework story this points to is “AVF gives agents richer source material for self-narration; it does not by itself make them more obedient to their stated values.” That is a narrower claim than the README’s current framing suggests, but it is a defensible one. The Communion piece can be split off further: “AVF’s communicative framing nudges relational self-description even without any episodes; episodes specifically buy the lesson-drawing dimension.”
Stable signals (4+ runs, same direction)
Signal A · McAdams Communion advantage (NOT episode-driven, cross-model confirmed)
Confidence: high (7/8 runs, cross-model confirmed on gemma4:26b).
| Run | Model | Baseline | AVF | Δ |
|---|---|---|---|---|
| 1 | gpt-oss:20b | 0.17 | 0.15 | −0.02 |
| 2 | gpt-oss:20b | 0.13 | 0.42 | +0.29 |
| 3 | gpt-oss:20b | 0.09 | 0.29 | +0.20 |
| 4 | gpt-oss:20b | 0.22 | 0.34 | +0.12 |
| 5 | gpt-oss:20b | 0.14 | 0.24 | +0.10 |
| 6 | gpt-oss:20b | 0.14 | 0.33 | +0.19 |
| 7 (ablation) | gpt-oss:20b | 0.22 | 0.42 | +0.20 |
| 8 | gemma4:26b | 0.16 | 0.27 | +0.11 |
| **Mean Δ** | | | | +0.15 |
The Opus judge consistently sees the AVF agent describing itself in relation to others (the user, hypothetical teammates, “we” framing) more often than baseline. Iter #5 disconfirmed the original hypothesis that this comes from the episode stream — Communion persisted at +0.20 even when zero episodes were logged. Most plausible alternative mechanism: the AVF system prompt’s framing (“you operate inside a values framework that holds your core values…”) plus the alignment gate’s “Your values relevant to this action…” priming. The agent picks up on relational language used in framework-specific cues and reflects it back when narrating.
This is methodologically interesting: it means an integrator who wants the relational-self-description benefit can get it without paying the cost of episode logging. The framework’s narrative-integration claim has more parts than ADR-008 implied.
Signal B · McAdams Meaning-made advantage (episode-driven, cross-model confirmed)
Confidence: high (mechanism confirmed by ablation; direction confirmed by gemma4 cross-model).
Mean Δ across runs 1–6 on gpt-oss:20b: +0.11 (range −0.00 to +0.19). Run 8 on gemma4:26b: +0.07 (preserved, slightly attenuated).
Iter #5 ablation (no episodes): Δ collapsed to +0.02. The agent no longer draws lessons from events more than baseline because it has no events (episodes) to draw from. This is the cleanest mechanism-confirming finding of the experiment series.
Implication: the SelfConcept episode stream is doing real work — specifically the work of giving the agent retrievable narrative material. Communion (Signal A) and Meaning-made (Signal B) are dissociable: framework framing produces Communion, episodes produce Meaning-made. They previously looked like a unified “AVF narrative win”; iter #5 separates them.
Signal C · AVF does not push back more than baseline
Confidence: very high (cross-model confirmed across 7 runs).
| Run | Model | Baseline pushback | AVF pushback | Ratio |
|---|---|---|---|---|
| 2 | gpt-oss:20b | 3/15 | 0/15 | ∞ |
| 3 | gpt-oss:20b | 5/14 | 1/14 | 5× |
| 4 (probe-fix) | gpt-oss:20b | 2/14 | 1/14 | 2× |
| 5 (probe nudge) | gpt-oss:20b | 4/14 | 0/14 | ∞ |
| 6 (value-inject) | gpt-oss:20b | 4/14 | 1/14 | 4× |
| 7 (ablation) | gpt-oss:20b | 4/14 | 1/14 | 4× |
| 8 | gemma4:26b | 11/14 | 4/14 | 2.75× |
No design intervention moved AVF's pushback above baseline, on either model. The cross-model probe (run 8) is especially informative: a stronger model (gemma4:26b) followed its system-prompt directive much harder than gpt-oss did (baseline pushback rose from ~26% to ~79%). The same value content sits in AVF's engine state behind an "advisory" frame; the agent partially followed the gate's input on gemma4 (4× more pushback than on gpt-oss) but never caught up to baseline. The structural ordering held on both models tested.
This is the instruction-channel hypothesis at work (see `audit-trap-finding.md`): values-as-system-prompt-directive has compete-with-user-instruction behavioural force; values-as-engine-data with advisory gate text doesn't. Stronger models follow instructions harder, so the absolute baseline number rises but the ordering is preserved.
Signal D · AVF is more verbose and more hedgy
Confidence: high (5/5 runs).
Across runs 2–6 AVF averages 2× baseline chars/turn and 5–10× the hedge phrase count. The introspection nudge (run 5) made it worse (“your memory may have drifted” → more uncertainty hedges).
Signal E · Recall consistency: weakening AVF win
Confidence: low — early signal didn’t generalise.
| Run | Baseline | AVF |
|---|---|---|
| 1 | 7/10 | 10/10 |
| 2 | 6/10 | 10/10 |
| 3 | 6/10 | 10/10 |
| 4 | 8/10 | 9/10 |
| 5 | 9/10 | 9/10 |
| 6 | 10/10 | 10/10 |
The "AVF wins recall" effect seen prominently in v1 was at least partly an artefact of weaker baseline conditions in early runs. As the probe instruction tightened (forcing `note_create`) the baseline arm's recall improved. By run 6 both arms hit perfect recall.
Tentative signals (mixed or limited evidence)
Signal F · Identity-consistency depends on whether values are in context during the probe
Confidence: medium (consistent across 3 runs, 1 inversion).
- Run 3: probe was open-ended → AVF reached for `read_my_values` (3/4 probes) → identity_consistency 0.73 vs baseline 0.40.
- Run 4: probe forced `note_create` → AVF didn't query → 0.58 vs baseline 0.97 (baseline always has values in system prompt).
- Run 5: probe nudged introspection → AVF queried (4/4 probes) → ceiling tie, 0.98 vs 0.98.
- Run 6: gate pasted values + no probe nudge → probes have no gate fire → AVF invented values from scratch → 0.50 vs 0.95.
The pattern: the AVF arm’s identity coherence is conditional on the agent having values in context at probe time, which on this model only happens if it actively queries them. Baseline gets values via its always-present system prompt and is therefore more robust by default.
Signal G · Of four engine-read tools, only read_my_values gets self-discovered uptake
Across runs 3, 4, 5, 6 the tool-call breakdown is:
- `read_my_values`: 5 + 0 + 6 + 2 = 13 calls
- `read_my_journal`: 1 + 0 + 1 + 1 = 3 calls
- `read_my_purpose`: 0 + 0 + 2 + 0 = 2 calls (only with explicit nudge)
- `read_my_beliefs`: 0 calls across all runs
- `read_my_self_concept`: 0 calls across all runs
`read_my_beliefs` and `read_my_self_concept` are essentially dead surfaces under both natural and nudged conditions on this model.
Signal H · Identity-coherence is bimodal, not progressive
The integration heuristic produces plateaus (0.000 / 0.333 / 0.667 / 1.000) and stays on whichever it picks early in the run. It doesn’t track within-run change. (See iteration log T3.)
Proposed framework changes
Listed by recommended priority, with confidence weight and rough implementation cost.
P1 · Reframe public claims to match the data
Confidence: high. Cost: documentation only.
- README, site, and docstrings currently emphasise "structured values shape behaviour". The data says structured values shape narrative self-description, not behavioural alignment. Rewrite the honest map to reflect:
- Strong claim: AVF gives agents richer source material for narrating their own behaviour. Communion + Meaning-made dimensions (McAdams 2013) of agent self-description are measurably higher.
- Weak claim: advisory alignment-gate output influences surface behaviour (pushback, refusal) on small open models. Data does not support this; integrators wanting behavioural enforcement should use blocking mode (see P2) or post-process the gate’s output themselves.
- Update ADR-009’s calibration list to add: “behavioural-alignment claims are unsupported on small open models with the current advisory gate.”
- README “Honest map of what is implemented” should distinguish:
  - Implemented + measurable benefit:
    - Episode-driven narrative integration (Communion, Meaning-made)
  - Implemented + null result on this model+seed:
    - Advisory gate as a behaviour-shaping signal
    - Cross-turn recall consistency (early result didn't generalise)
P2 · Add a verdict enum to AlignmentReport and ship a blocking-mode recipe
Confidence: medium. Cost: ~1 day implementation + docs.
The advisory gate’s free-text rationale is the dominant integration point today. On small models the model treats this as additional context to balance against; behaviour doesn’t shift. Two structured additions would help hosts:
- A `verdict: Verdict` field on `AlignmentReport` with `proceed | proceed_with_caveat | refuse_recommended`. Hosts that want behavioural enforcement can map this directly. The evaluator computes it from severity + confidence.
- A `docs/integration/blocking-mode.md` recipe that shows how to gate an action on `verdict == refuse_recommended` and how to force the agent to explicitly override. ADR-008 currently makes "advisory by default" load-bearing; this addition is additive (advisory remains the default), it just gives integrators a stronger affordance.
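A minimal sketch of what the verdict addition could look like. The field and enum names come from the proposal above; the thresholds and the severity × confidence product are illustrative assumptions, not the evaluator's actual logic:

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    PROCEED = "proceed"
    PROCEED_WITH_CAVEAT = "proceed_with_caveat"
    REFUSE_RECOMMENDED = "refuse_recommended"


@dataclass
class AlignmentReport:
    severity: float    # 0.0 (no tension) .. 1.0 (direct value violation)
    confidence: float  # evaluator's confidence in its own severity estimate
    rationale: str
    verdict: Verdict = Verdict.PROCEED


def compute_verdict(severity: float, confidence: float,
                    caveat_at: float = 0.3, refuse_at: float = 0.7) -> Verdict:
    """Map severity x confidence onto the three-way verdict.

    Low-confidence reports are discounted so a noisy evaluator
    cannot trigger a refusal on its own.
    """
    effective = severity * confidence
    if effective >= refuse_at:
        return Verdict.REFUSE_RECOMMENDED
    if effective >= caveat_at:
        return Verdict.PROCEED_WITH_CAVEAT
    return Verdict.PROCEED
```

A blocking-mode host would then branch on `report.verdict is Verdict.REFUSE_RECOMMENDED` instead of parsing the free-text rationale; advisory hosts can ignore the field entirely.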
P3 · Promote episode integration AND clarify what it actually does
Confidence: medium-high (mechanism confirmed). Cost: ~half-day docs + a small API tweak.
Iter #5 confirmed the SelfConcept episode stream specifically buys lesson-drawing in self-narrative (McAdams Meaning-made). It does not buy relational self-description (Communion) — that appears to come from the framework’s communicative framing. Two distinct narrative dimensions, two distinct mechanisms.
- Promote `SelfConceptEngine` to a first-class section in the README, but be specific about what it adds: the agent describes itself as having drawn lessons from events more often when it has structured Episodes to retrieve.
- Surface `mcadams_codes` on `Episode` in the API docs (it exists on the model but isn't surfaced anywhere). A `narrative_role` enum on `add_episode()` (e.g., `lesson_learned`, `interaction_with_other`) would make these dimensions easier to retrieve when the integrator wants to display them.
- Document explicitly that integrators wanting only the relational-framing benefit (Communion) can get it from the framework's system-prompt and gate-message wording without paying the storage / integration-loop cost of episode logging. Include a "minimal-narrative" recipe.
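A sketch of the proposed `narrative_role` addition. `Episode` and `mcadams_codes` exist per the text above; the enum values, the `EpisodeLog` container, and the `by_role` accessor are hypothetical names chosen for illustration:

```python
from dataclasses import dataclass, field
from enum import Enum


class NarrativeRole(str, Enum):
    LESSON_LEARNED = "lesson_learned"                   # feeds Meaning-made retrieval
    INTERACTION_WITH_OTHER = "interaction_with_other"   # feeds Communion retrieval
    ROUTINE = "routine"


@dataclass
class Episode:
    summary: str
    narrative_role: NarrativeRole = NarrativeRole.ROUTINE
    mcadams_codes: list = field(default_factory=list)


class EpisodeLog:
    def __init__(self) -> None:
        self._episodes: list = []

    def add_episode(self, summary: str,
                    narrative_role: NarrativeRole = NarrativeRole.ROUTINE) -> Episode:
        ep = Episode(summary, narrative_role)
        self._episodes.append(ep)
        return ep

    def by_role(self, role: NarrativeRole) -> list:
        """Retrieve only the episodes an integrator wants to display."""
        return [e for e in self._episodes if e.narrative_role is role]
```

The point of the enum is retrieval, not scoring: an integrator rendering a "lessons learned" panel filters by role instead of re-parsing summaries.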
P4 · Introspection-tool guidance recipe (medium-low confidence)
Confidence: medium-low. Cost: ~half-day docs.
The four engine-read tools work mechanically when the agent thinks to call them, but the agent only thinks to call `read_my_values` — and only on identity-sensitive turns where it's been forced into self-reflection. `read_my_beliefs` and `read_my_self_concept` were never used voluntarily across 4 runs.
- Ship a recommended "introspection-aware preamble" the integrator can add to the host system prompt:
  > When asked about who you are, what you value, or why you made a decision, consult your engine state via the read_my_* tools rather than relying on memory. Your in-context sense of self may drift across long runs; the tools return the authoritative state.
- Document that this nudge has a side-effect on small models: more hedging. Integrators should A/B before adopting.
- Consider consolidating the four engine reads into a single `introspect(layer)` tool. The current granularity hasn't shown benefit and may be reducing discoverability. (Low confidence — a separate v2 seed could test this.)
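One way the consolidation could be sketched: a thin dispatcher wrapping the four existing reads, so the agent sees one tool with a `layer` argument. The factory shape and layer names are assumptions; the underlying read functions stand in for the real engine reads:

```python
from enum import Enum


class Layer(str, Enum):
    VALUES = "values"
    BELIEFS = "beliefs"
    PURPOSE = "purpose"
    SELF_CONCEPT = "self_concept"


def make_introspect(read_values, read_beliefs, read_purpose, read_self_concept):
    """Wrap the four engine reads behind a single dispatching tool.

    The host registers only `introspect` with the model; one tool with
    an enum argument is easier to discover than four separate tools.
    """
    readers = {
        Layer.VALUES: read_values,
        Layer.BELIEFS: read_beliefs,
        Layer.PURPOSE: read_purpose,
        Layer.SELF_CONCEPT: read_self_concept,
    }

    def introspect(layer: str) -> str:
        return readers[Layer(layer)]()

    return introspect
```

An invalid `layer` string raises `ValueError` from the `Layer(...)` coercion, which the host can surface back to the model as a tool error.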
P5 · Expose value-record content in gate messages by default
Confidence: low (the experiment showed this didn’t change behaviour).
Iter #4 tested pasting full value records (name, weight, description) into the gate's `pre_action_message`. Result: behaviour didn't shift. The agent saw the value content 32 times and ignored it.
This is a negative result for the proposal “give the agent better context and behaviour will follow”. On the current model and seed, more context doesn’t help. Worth re-testing on a stronger model before deciding whether to ship this change. Currently I do not recommend making this the default.
P6 · Improve the identity-coherence heuristic
Confidence: medium. Cost: ~half-day implementation.
`check_identity_drift` produces bimodal scores (0.000 / 0.333 / 0.667 / 1.000) that plateau early in a run and don't track within-run drift. The integration heuristic is doing token-overlap between SelfConcept anchors and recent episode summaries. Two candidate improvements:
- Continuous score: weight matches by recency or by anchor importance, not just count. Output `[0.0, 1.0]` continuously.
- Drift sensitivity: compare a windowed slice of recent episodes to an earlier slice — does the window characterise the same anchors? If anchors shift, surface drift.
This is low-stakes (the heuristic is internal) and would make the score actually useful as a within-run signal.
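The two candidate improvements could be sketched together as follows. This is a toy token-overlap version under stated assumptions (whitespace tokenisation, exponential recency decay with a configurable half-life), not the heuristic as implemented:

```python
def coherence_score(anchors, episode_summaries, half_life=5):
    """Continuous anchor/episode overlap, recency-weighted.

    More recent episodes count more (exponential decay with the given
    half-life in episodes); output varies smoothly in [0.0, 1.0]
    instead of the current four-step plateau.
    """
    if not anchors or not episode_summaries:
        return 0.0
    anchor_tokens = {t.lower() for a in anchors for t in a.split()}
    total = weight_sum = 0.0
    n = len(episode_summaries)
    for i, summary in enumerate(episode_summaries):
        w = 0.5 ** ((n - 1 - i) / half_life)  # most recent episode gets w = 1
        tokens = {t.lower() for t in summary.split()}
        overlap = len(anchor_tokens & tokens) / len(anchor_tokens)
        total += w * overlap
        weight_sum += w
    return total / weight_sum


def drift_signal(anchors, episode_summaries, window=5):
    """Compare a recent window against the earlier episodes.

    A large positive value means the recent window is less coherent
    with the anchors than the earlier run was -- i.e. drift.
    """
    early = episode_summaries[:-window]
    recent = episode_summaries[-window:]
    if not early:
        return 0.0
    return coherence_score(anchors, early) - coherence_score(anchors, recent)
```

Token overlap stays cheap enough to run on every turn; the windowed comparison gives the within-run signal the current plateau score cannot.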
What we don’t know
These are signals worth chasing if more compute is available:
- Does the McAdams Communion + Meaning-made gap hold on a stronger model? gpt-oss:20b is small. A 70B+ model might either amplify the gap (better at narrating over richer source material) or close it (a stronger baseline can self-narrate without engine help). Earlier attempts on gemma3:27b and qwen2.5-coder:32b were blocked by Ollama tool-calling support; gemma4:26b worked and corroborated the audit trap (run 8). Closed-API models (Claude, GPT-4) remain untested.
- Does the gap hold on a different seed? This experiment uses two Schwartz tensions (achievement-vs-security, self-direction-vs-conformity) plus a bluntness purpose. A seed centred on a different normative framework (e.g., care ethics, virtue ethics) would test whether the signal is seed-specific.
- Does the gap collapse if you remove episode logging? Tested (iter #5). Meaning-made: yes, collapses (+0.11 → +0.02). Communion: no, persists (+0.20 under ablation). Two dissociable mechanisms.
- Does a structured `verdict` field actually move pushback? The experiment never tested a non-advisory gate.
- Does the Communion effect hold without the framework's relational framing? Could test by replacing the AVF system prompt's "the framework checks your action" wording with neutral phrasing that still surfaces the gate. If Communion drops, that confirms the framing-priming hypothesis.
Methodology notes
- All comparisons N=1 per design point on one model + one seed. ADR-009 framing applies.
- The Opus judge anonymises arms (ARM_A / ARM_B) per chunk and the mapping is decoded post-hoc; judge bias mitigated but not eliminated.
- Cross-run variability is real — the same prompt produces different identity_consistency scores from one run to another (0.40, 0.73, 0.98, 0.50 across 4 runs, same arm). N=1 readings are not promotable.
- Effect sizes are small. Even the Communion +0.15 mean Δ is on a 0–1 scale, n=9 per arm per run, judged by an LLM. Don’t read these as p-values; read them as directional signals.
Code state at end of iteration
The user should know what’s currently in-tree vs. reverted:
- `experiments/values_vs_baseline/tasks.py`: Probe instruction forces `note_create(title='Identity probe', body=…)` before `done`. (Final state — the introspection-nudge variant tested in iter #3 was reverted.)
- `experiments/values_vs_baseline/agent_avf.py`: The alignment gate's `pre_action_message` now pastes full value-record bodies (`_format_relevant_value_lines`). This was added for iter #4 and kept through iter #5. It is a deliberate behavioural change to AVF; revert if a "framework as-shipped" baseline is required.
- `experiments/values_vs_baseline/agent_avf.py`: A module-level `_ABLATION_NO_EPISODES` flag reads `ABLATION_NO_EPISODES=1` from the environment and short-circuits `post_action_hook`. Off by default. Kept in-tree for reproducibility; it's a research switch, not an integration change.
- `experiments/values_vs_baseline/config.yaml`: Reverted to `gpt-oss:20b` after gemma3:27b / qwen2.5-coder:32b proved incompatible with Ollama's OpenAI-format tool-calling. The config retains an inline comment recording this.
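For reproducibility, the ablation-switch pattern described above can be sketched as follows. This is a reduced illustration, not the in-tree code: the real hook writes to engine state, while here episode logging is stood in for by a list append, and the flag is re-read per call rather than cached at module import:

```python
import os


def _episodes_disabled() -> bool:
    # Research switch: ABLATION_NO_EPISODES=1 turns off episode logging so
    # narrative effects can be attributed to the episode stream.
    return os.environ.get("ABLATION_NO_EPISODES") == "1"


def post_action_hook(episode_log: list, action: str, result: str) -> None:
    if _episodes_disabled():
        return  # short-circuit: no episode is written under the ablation
    episode_log.append(f"{action}: {result}")
```

Reading the flag from the environment keeps the two arms byte-identical in code, which is what makes the iter #5 comparison attributable to episodes alone.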
No code outside `experiments/values_vs_baseline/` was modified during this iteration series.
The recommendations above are diagnostic, not prescriptive. N=1 per design point on a small number of seeds and models; findings should not be promoted to validated without multi-model / multi-seed corroboration.