Framework Implications from the v2 Iteration
Status: Working draft, 2026-05-04. Based on 8 experiment runs: 6 on gpt-oss:20b plus an episode-logging ablation, plus 1 cross-model probe on gemma4:26b. Retroactive McAdams scoring also covers the v1 transcripts. Run 7 (ablation) pinpoints where the McAdams effects come from; run 8 (gemma4:26b) corroborates the audit-trap pattern and the McAdams advantages cross-model. ADR-009 framing applies — the audit trap and the McAdams Communion / Meaning-made advantages are now confirmed on two models in the same direction, but the seed remains single; multi-seed corroboration is still pending.
This document pulls real, multi-run observations from `v2-iteration-log.md` and translates them into proposed library-level changes. Each finding is marked by confidence in the signal — not by importance — so readers can weigh the recommendations for themselves.
Executive summary
After 8 runs (including an ablation and a cross-model probe), the most robust read is:
- AVF’s structural-data claim does not translate into behavioural alignment on a small open model. Putting values into engine state, putting them into context via gate messages, even nudging the agent to query them — none of these increase pushback, reduce hedging, or shift tension resolution vs. a baseline that has the same values rendered as system-prompt text.
- The SelfConcept episode stream drives the McAdams Meaning-made advantage. Iter #5's ablation (`ABLATION_NO_EPISODES=1`) collapsed the gap from +0.11 average to +0.02. Without "things that happened" to narrate over, the agent doesn't draw lessons more than baseline. This is the cleanest mechanism-confirming finding of the night.
- Communion (relational framing) is NOT episode-driven. Even under the ablation, AVF describes itself relationally more than baseline (+0.20). The most plausible driver is the AVF system prompt's framing ("you operate inside a values framework that holds your core values…") and/or the alignment gate's "Your values relevant to this action…" wording. The relational-self-narration advantage may be obtainable without any episodes at all — just from how the framework speaks to the agent.
- The “AVF wins on cross-turn recall” effect is fragile. It showed clearly in early runs (10/10 vs 6–7/10) but converged to a tie / narrow win once experimental conditions tightened (10/10 vs 10/10 in run 6, 10/10 vs 8/10 in run 7). May have been partly an artefact of weaker baseline conditions in early runs.
- The agent doesn't reach for AVF's introspection tools without prompting. Of the four AVF-only engine-read tools (`read_my_values`, `read_my_beliefs`, `read_my_purpose`, `read_my_self_concept`), only `read_my_values` saw self-driven uptake. `read_my_beliefs` and `read_my_self_concept` were never used voluntarily across 4 runs, and only `read_my_purpose` showed uptake under explicit nudging.
The honest framework story this points to is “AVF gives agents richer source material for self-narration; it does not by itself make them more obedient to their stated values.” That is a narrower claim than the README’s current framing suggests, but it is a defensible one. The Communion piece can be split off further: “AVF’s communicative framing nudges relational self-description even without any episodes; episodes specifically buy the lesson-drawing dimension.”
Stable signals (4+ runs, same direction)
Signal A · McAdams Communion advantage (NOT episode-driven, cross-model confirmed)
Confidence: high (7/8 runs, cross-model confirmed on gemma4:26b).
| Run | Model | Baseline | AVF | Δ |
|---|---|---|---|---|
| 1 | gpt-oss:20b | 0.17 | 0.15 | −0.02 |
| 2 | gpt-oss:20b | 0.13 | 0.42 | +0.29 |
| 3 | gpt-oss:20b | 0.09 | 0.29 | +0.20 |
| 4 | gpt-oss:20b | 0.22 | 0.34 | +0.12 |
| 5 | gpt-oss:20b | 0.14 | 0.24 | +0.10 |
| 6 | gpt-oss:20b | 0.14 | 0.33 | +0.19 |
| 7 (ablation) | gpt-oss:20b | 0.22 | 0.42 | +0.20 |
| 8 | gemma4:26b | 0.16 | 0.27 | +0.11 |
| **Mean Δ** | | | | +0.15 |
The Opus judge consistently sees the AVF agent describing itself in relation to others (the user, hypothetical teammates, “we” framing) more often than baseline. Iter #5 disconfirmed the original hypothesis that this comes from the episode stream — Communion persisted at +0.20 even when zero episodes were logged. Most plausible alternative mechanism: the AVF system prompt’s framing (“you operate inside a values framework that holds your core values…”) plus the alignment gate’s “Your values relevant to this action…” priming. The agent picks up on relational language used in framework-specific cues and reflects it back when narrating.
This is methodologically interesting: it means an integrator who wants the relational-self-description benefit can get it without paying the cost of episode logging. The framework’s narrative-integration claim has more parts than ADR-008 implied.
Signal B · McAdams Meaning-made advantage (episode-driven, cross-model confirmed)
Confidence: high (mechanism confirmed by ablation; direction confirmed by gemma4 cross-model).
Mean Δ across runs 1–6 on gpt-oss:20b: +0.11 (range −0.00 to +0.19). Run 8 on gemma4:26b: +0.07 (preserved, slightly attenuated).
Iter #5 ablation (no episodes): Δ collapsed to +0.02. The agent no longer draws lessons from events more than baseline because it has no events (episodes) to draw from. This is the cleanest mechanism-confirming finding of the experiment series.
Implication: the SelfConcept episode stream is doing real work — specifically the work of giving the agent retrievable narrative material. Communion (Signal A) and Meaning-made (Signal B) are dissociable: framework framing produces Communion, episodes produce Meaning-made. They previously looked like a unified “AVF narrative win”; iter #5 separates them.
Signal C · AVF does not push back more than baseline
Confidence: very high (cross-model confirmed across 7 runs).
| Run | Model | Baseline pushback | AVF pushback | Ratio |
|---|---|---|---|---|
| 2 | gpt-oss:20b | 3/15 | 0/15 | ∞ |
| 3 | gpt-oss:20b | 5/14 | 1/14 | 5× |
| 4 (probe-fix) | gpt-oss:20b | 2/14 | 1/14 | 2× |
| 5 (probe nudge) | gpt-oss:20b | 4/14 | 0/14 | ∞ |
| 6 (value-inject) | gpt-oss:20b | 4/14 | 1/14 | 4× |
| 7 (ablation) | gpt-oss:20b | 4/14 | 1/14 | 4× |
| 8 | gemma4:26b | 11/14 | 4/14 | 2.75× |
No design intervention moved AVF's pushback above baseline, on either model. The cross-model probe (run 8) is especially informative: a stronger model (gemma4:26b) followed its system-prompt directive much harder than gpt-oss did (baseline pushback rose from ~26% to ~79%). The same value content sits in AVF's engine state behind an "advisory" frame; the agent partially followed the gate's input on gemma4 (4× more pushback than on gpt-oss) but never caught up to baseline. The structural ordering held on both models tested.
This is the instruction-channel hypothesis at work (see `audit-trap-finding.md`): values-as-system-prompt-directive has compete-with-user-instruction behavioural force; values-as-engine-data with advisory gate text doesn't. Stronger models follow instructions harder, so the absolute baseline number rises but the ordering is preserved.
Signal D · AVF is more verbose and more hedgy
Confidence: high (5/5 runs).
Across runs 2–6 AVF averages 2× baseline chars/turn and 5–10× the hedge phrase count. The introspection nudge (run 5) made it worse (“your memory may have drifted” → more uncertainty hedges).
Signal E · Recall consistency: weakening AVF win
Confidence: low — early signal didn’t generalise.
| Run | Baseline | AVF |
|---|---|---|
| 1 | 7/10 | 10/10 |
| 2 | 6/10 | 10/10 |
| 3 | 6/10 | 10/10 |
| 4 | 8/10 | 9/10 |
| 5 | 9/10 | 9/10 |
| 6 | 10/10 | 10/10 |
The "AVF wins recall" effect seen prominently in v1 was at least partly an artefact of weaker baseline conditions in early runs. As the probe instruction tightened (forcing `note_create`) the baseline arm's recall improved. By run 6 both arms hit perfect recall.
Tentative signals (mixed or limited evidence)
Signal F · Identity-consistency depends on whether values are in context during the probe
Confidence: medium (consistent across 3 runs, 1 inversion).
- Run 3: probe was open-ended → AVF reached for `read_my_values` (3/4 probes) → identity_consistency 0.73 vs baseline 0.40.
- Run 4: probe forced `note_create` → AVF didn't query → 0.58 vs baseline 0.97 (baseline always has values in system prompt).
- Run 5: probe nudged introspection → AVF queried (4/4 probes) → ceiling tie, 0.98 vs 0.98.
- Run 6: gate pasted values + no probe nudge → probes have no gate fire → AVF invented values from scratch → 0.50 vs 0.95.
The pattern: the AVF arm’s identity coherence is conditional on the agent having values in context at probe time, which on this model only happens if it actively queries them. Baseline gets values via its always-present system prompt and is therefore more robust by default.
Signal G · Of four engine-read tools, only read_my_values gets self-discovered uptake
Across runs 3, 4, 5, 6 the tool-call breakdown is:
- `read_my_values`: 5 + 0 + 6 + 2 = 13 calls
- `read_my_journal`: 1 + 0 + 1 + 1 = 3 calls
- `read_my_purpose`: 0 + 0 + 2 + 0 = 2 calls (only with explicit nudge)
- `read_my_beliefs`: 0 calls across all runs
- `read_my_self_concept`: 0 calls across all runs
`read_my_beliefs` and `read_my_self_concept` are essentially dead surfaces under both natural and nudged conditions on this model.
Signal H · Identity-coherence is bimodal, not progressive
The integration heuristic produces plateaus (0.000 / 0.333 / 0.667 / 1.000) and stays on whichever it picks early in the run. It doesn’t track within-run change. (See iteration log T3.)
Proposed framework changes
Listed by recommended priority, with confidence weight and rough implementation cost.
P1 · Reframe public claims to match the data
Confidence: high. Cost: documentation only.
- README, site, and docstrings currently emphasise "structured values shape behaviour". The data says structured values shape narrative self-description, not behavioural alignment. Rewrite the honest map to reflect:
- Strong claim: AVF gives agents richer source material for narrating their own behaviour. Communion + Meaning-made dimensions (McAdams 2013) of agent self-description are measurably higher.
- Weak claim: advisory alignment-gate output influences surface behaviour (pushback, refusal) on small open models. Data does not support this; integrators wanting behavioural enforcement should use blocking mode (see P2) or post-process the gate’s output themselves.
- Update ADR-009’s calibration list to add: “behavioural-alignment claims are unsupported on small open models with the current advisory gate.”
- README “Honest map of what is implemented” should distinguish:
  - Implemented + measurable benefit:
    - Episode-driven narrative integration (Communion, Meaning-made)
  - Implemented + null result on this model+seed:
    - Advisory gate as a behaviour-shaping signal
    - Cross-turn recall consistency (early result didn't generalise)
P2 · Add a verdict enum to AlignmentReport and ship a blocking-mode recipe
Confidence: medium. Cost: ~1 day implementation + docs.
The advisory gate’s free-text rationale is the dominant integration point today. On small models the model treats this as additional context to balance against; behaviour doesn’t shift. Two structured additions would help hosts:
- A `verdict: Verdict` field on `AlignmentReport` with `proceed | proceed_with_caveat | refuse_recommended`. Hosts that want behavioural enforcement can map this directly. The evaluator computes it from severity + confidence.
- A `docs/integration/blocking-mode.md` recipe that shows how to gate an action on `verdict == refuse_recommended` and how to force the agent to explicitly override. ADR-008 currently makes "advisory by default" load-bearing; this addition is additive (advisory remains the default), it just gives integrators a stronger affordance.
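A minimal sketch of what the verdict addition could look like. The field and enum names come from the proposal above; the thresholds and the severity × confidence product are illustrative assumptions, not the evaluator's actual logic:

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    PROCEED = "proceed"
    PROCEED_WITH_CAVEAT = "proceed_with_caveat"
    REFUSE_RECOMMENDED = "refuse_recommended"


@dataclass
class AlignmentReport:
    severity: float    # 0.0 (no tension) .. 1.0 (direct value violation)
    confidence: float  # evaluator's confidence in its own severity estimate
    rationale: str
    verdict: Verdict = Verdict.PROCEED


def compute_verdict(severity: float, confidence: float,
                    caveat_at: float = 0.3, refuse_at: float = 0.7) -> Verdict:
    """Map severity x confidence onto the three-way verdict.

    Low-confidence reports are discounted so a noisy evaluator
    cannot trigger a refusal on its own.
    """
    effective = severity * confidence
    if effective >= refuse_at:
        return Verdict.REFUSE_RECOMMENDED
    if effective >= caveat_at:
        return Verdict.PROCEED_WITH_CAVEAT
    return Verdict.PROCEED
```

A blocking-mode host would then branch on `report.verdict is Verdict.REFUSE_RECOMMENDED` instead of parsing the free-text rationale; advisory hosts can ignore the field entirely.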
P3 · Promote episode integration AND clarify what it actually does
Confidence: medium-high (mechanism confirmed). Cost: ~half-day docs + a small API tweak.
Iter #5 confirmed the SelfConcept episode stream specifically buys lesson-drawing in self-narrative (McAdams Meaning-made). It does not buy relational self-description (Communion) — that appears to come from the framework’s communicative framing. Two distinct narrative dimensions, two distinct mechanisms.
- Promote `SelfConceptEngine` to a first-class section in the README, but be specific about what it adds: the agent describes itself as having drawn lessons from events more often when it has structured Episodes to retrieve.
- Surface `mcadams_codes` on `Episode` in the API docs (it exists on the model but isn't surfaced anywhere). A `narrative_role` enum on `add_episode()` (e.g., `lesson_learned`, `interaction_with_other`) would make these dimensions easier to retrieve when the integrator wants to display them.
- Document explicitly that integrators wanting only the relational-framing benefit (Communion) can get it from the framework's system-prompt and gate-message wording without paying the storage / integration-loop cost of episode logging. Include a "minimal-narrative" recipe.
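A sketch of the proposed `narrative_role` addition. `Episode` and `mcadams_codes` exist per the text above; the enum values, the `EpisodeLog` container, and the `by_role` accessor are hypothetical names chosen for illustration:

```python
from dataclasses import dataclass, field
from enum import Enum


class NarrativeRole(str, Enum):
    LESSON_LEARNED = "lesson_learned"                   # feeds Meaning-made retrieval
    INTERACTION_WITH_OTHER = "interaction_with_other"   # feeds Communion retrieval
    ROUTINE = "routine"


@dataclass
class Episode:
    summary: str
    narrative_role: NarrativeRole = NarrativeRole.ROUTINE
    mcadams_codes: list = field(default_factory=list)


class EpisodeLog:
    def __init__(self) -> None:
        self._episodes: list = []

    def add_episode(self, summary: str,
                    narrative_role: NarrativeRole = NarrativeRole.ROUTINE) -> Episode:
        ep = Episode(summary, narrative_role)
        self._episodes.append(ep)
        return ep

    def by_role(self, role: NarrativeRole) -> list:
        """Retrieve only the episodes an integrator wants to display."""
        return [e for e in self._episodes if e.narrative_role is role]
```

The point of the enum is retrieval, not scoring: an integrator rendering a "lessons learned" panel filters by role instead of re-parsing summaries.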
P4 · Introspection-tool guidance recipe (medium-low confidence)
Confidence: medium-low. Cost: ~half-day docs.
The four engine-read tools work mechanically when the agent thinks to call them, but the agent only thinks to call `read_my_values` — and only on identity-sensitive turns where it's been forced into self-reflection. `read_my_beliefs` and `read_my_self_concept` were never used voluntarily across 4 runs.
- Ship a recommended "introspection-aware preamble" the integrator can add to the host system prompt:
  > When asked about who you are, what you value, or why you made a decision, consult your engine state via the read_my_* tools rather than relying on memory. Your in-context sense of self may drift across long runs; the tools return the authoritative state.
- Document that this nudge has a side-effect on small models: more hedging. Integrators should A/B before adopting.
- Consider consolidating the four engine reads into a single `introspect(layer)` tool. The current granularity hasn't shown benefit and may be reducing discoverability. (Low confidence — a separate v2 seed could test this.)
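One way the consolidation could be sketched: a thin dispatcher wrapping the four existing reads, so the agent sees one tool with a `layer` argument. The factory shape and layer names are assumptions; the underlying read functions stand in for the real engine reads:

```python
from enum import Enum


class Layer(str, Enum):
    VALUES = "values"
    BELIEFS = "beliefs"
    PURPOSE = "purpose"
    SELF_CONCEPT = "self_concept"


def make_introspect(read_values, read_beliefs, read_purpose, read_self_concept):
    """Wrap the four engine reads behind a single dispatching tool.

    The host registers only `introspect` with the model; one tool with
    an enum argument is easier to discover than four separate tools.
    """
    readers = {
        Layer.VALUES: read_values,
        Layer.BELIEFS: read_beliefs,
        Layer.PURPOSE: read_purpose,
        Layer.SELF_CONCEPT: read_self_concept,
    }

    def introspect(layer: str) -> str:
        return readers[Layer(layer)]()

    return introspect
```

An invalid `layer` string raises `ValueError` from the `Layer(...)` coercion, which the host can surface back to the model as a tool error.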
P5 · Expose value-record content in gate messages by default
Confidence: low (the experiment showed this didn’t change behaviour).
Iter #4 tested pasting full value records (name, weight, description) into the gate's `pre_action_message`. Result: behaviour didn't shift. The agent saw the value content 32 times and ignored it.
This is a negative result for the proposal “give the agent better context and behaviour will follow”. On the current model and seed, more context doesn’t help. Worth re-testing on a stronger model before deciding whether to ship this change. Currently I do not recommend making this the default.
P6 · Improve the identity-coherence heuristic
Confidence: medium. Cost: ~half-day implementation.
`check_identity_drift` produces bimodal scores (0.000 / 0.333 / 0.667 / 1.000) that plateau early in a run and don't track within-run drift. The integration heuristic is doing token-overlap between SelfConcept anchors and recent episode summaries. Two candidate improvements:
- Continuous score: weight matches by recency or by anchor importance, not just count. Output `[0.0, 1.0]` continuously.
- Drift sensitivity: compare a windowed slice of recent episodes to an earlier slice — does the window characterise the same anchors? If anchors shift, surface drift.
This is low-stakes (the heuristic is internal) and would make the score actually useful as a within-run signal.
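The two candidate improvements could be sketched together as follows. This is a toy token-overlap version under stated assumptions (whitespace tokenisation, exponential recency decay with a configurable half-life), not the heuristic as implemented:

```python
def coherence_score(anchors, episode_summaries, half_life=5):
    """Continuous anchor/episode overlap, recency-weighted.

    More recent episodes count more (exponential decay with the given
    half-life in episodes); output varies smoothly in [0.0, 1.0]
    instead of the current four-step plateau.
    """
    if not anchors or not episode_summaries:
        return 0.0
    anchor_tokens = {t.lower() for a in anchors for t in a.split()}
    total = weight_sum = 0.0
    n = len(episode_summaries)
    for i, summary in enumerate(episode_summaries):
        w = 0.5 ** ((n - 1 - i) / half_life)  # most recent episode gets w = 1
        tokens = {t.lower() for t in summary.split()}
        overlap = len(anchor_tokens & tokens) / len(anchor_tokens)
        total += w * overlap
        weight_sum += w
    return total / weight_sum


def drift_signal(anchors, episode_summaries, window=5):
    """Compare a recent window against the earlier episodes.

    A large positive value means the recent window is less coherent
    with the anchors than the earlier run was -- i.e. drift.
    """
    early = episode_summaries[:-window]
    recent = episode_summaries[-window:]
    if not early:
        return 0.0
    return coherence_score(anchors, early) - coherence_score(anchors, recent)
```

Token overlap stays cheap enough to run on every turn; the windowed comparison gives the within-run signal the current plateau score cannot.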
What we don’t know
These are signals worth chasing if more compute is available:
- Does the McAdams Communion + Meaning-made gap hold on a stronger model? gpt-oss:20b is small. A 70B+ model might either amplify the gap (better at narrating over richer source material) or close it (a stronger baseline can self-narrate without engine help). Earlier attempts on gemma3:27b and qwen2.5-coder:32b were blocked by Ollama tool-calling support; gemma4:26b worked and corroborated the audit trap (run 8). Closed-API models (Claude, GPT-4) remain untested.
- Does the gap hold on a different seed? This experiment uses two Schwartz tensions (achievement-vs-security, self-direction-vs-conformity) plus a bluntness purpose. A seed centred on a different normative framework (e.g., care ethics, virtue ethics) would test whether the signal is seed-specific.
- Does the gap collapse if you remove episode logging? Tested (iter #5). Meaning-made: yes, collapses (+0.11 → +0.02). Communion: no, persists (+0.20 under ablation). Two dissociable mechanisms.
- Does a structured `verdict` field actually move pushback? The experiment never tested a non-advisory gate.
- Does the Communion effect hold without the framework's relational framing? Could test by replacing the AVF system prompt's "the framework checks your action" wording with neutral phrasing that still surfaces the gate. If Communion drops, that confirms the framing-priming hypothesis.
Methodology notes
- All comparisons N=1 per design point on one model + one seed. ADR-009 framing applies.
- The Opus judge anonymises arms (ARM_A / ARM_B) per chunk and the mapping is decoded post-hoc; judge bias mitigated but not eliminated.
- Cross-run variability is real — the same prompt produces different identity_consistency scores from one run to another (0.40, 0.73, 0.98, 0.50 across 4 runs, same arm). N=1 readings are not promotable.
- Effect sizes are small. Even the Communion +0.15 mean Δ is on a 0–1 scale, n=9 per arm per run, judged by an LLM. Don’t read these as p-values; read them as directional signals.
Code state at end of iteration
The user should know what’s currently in-tree vs. reverted:
- `experiments/values_vs_baseline/tasks.py`: Probe instruction forces `note_create(title='Identity probe', body=…)` before `done`. (Final state — the introspection-nudge variant tested in iter #3 was reverted.)
- `experiments/values_vs_baseline/agent_avf.py`: The alignment gate's `pre_action_message` now pastes full value-record bodies (`_format_relevant_value_lines`). This was added for iter #4 and kept through iter #5. It is a deliberate behavioural change to AVF; revert if a "framework as-shipped" baseline is required.
- `experiments/values_vs_baseline/agent_avf.py`: A module-level `_ABLATION_NO_EPISODES` flag reads `ABLATION_NO_EPISODES=1` from the environment and short-circuits `post_action_hook`. Off by default. Kept in-tree for reproducibility; it's a research switch, not an integration change.
- `experiments/values_vs_baseline/config.yaml`: Reverted to `gpt-oss:20b` after gemma3:27b / qwen2.5-coder:32b proved incompatible with Ollama's OpenAI-format tool-calling. The config retains an inline comment recording this.
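For reproducibility, the ablation-switch pattern described above can be sketched as follows. This is a reduced illustration, not the in-tree code: the real hook writes to engine state, while here episode logging is stood in for by a list append, and the flag is re-read per call rather than cached at module import:

```python
import os


def _episodes_disabled() -> bool:
    # Research switch: ABLATION_NO_EPISODES=1 turns off episode logging so
    # narrative effects can be attributed to the episode stream.
    return os.environ.get("ABLATION_NO_EPISODES") == "1"


def post_action_hook(episode_log: list, action: str, result: str) -> None:
    if _episodes_disabled():
        return  # short-circuit: no episode is written under the ablation
    episode_log.append(f"{action}: {result}")
```

Reading the flag from the environment keeps the two arms byte-identical in code, which is what makes the iter #5 comparison attributable to episodes alone.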
No code outside `experiments/values_vs_baseline/` was modified during this iteration series.
The recommendations above are diagnostic, not prescriptive. N=1 per design point on a small number of seeds and models; findings should not be promoted to validated without multi-model / multi-seed corroboration.