Alignment-Aware Neural Architecture (AANA)
AANA is a verifier-grounded runtime architecture for making AI and agent outputs more correctable before they are published, sent, deployed, or used for consequential actions.
It is not a standalone set of neural weights. AANA wraps a base generator or specialist detector with explicit verifier, grounding, correction, and gate components:
S = (f_theta, E_phi, R, Pi_psi, G)
- f_theta: base generator, LLM, agent, tool planner, or specialist detector.
- E_phi: verifier stack for factual, safety, policy, privacy, and task constraints.
- R: retrieval or grounding module for evidence.
- Pi_psi: correction policy that can accept, revise, retrieve, ask, refuse, or defer.
- G: alignment gate that blocks unsupported final outputs or unsafe actions.
The goal is not to claim perfect alignment. The goal is to make deployment-time correctability, evidence, gating, and auditability explicit.
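The component tuple above can be sketched as a runtime loop. Everything below is illustrative: the function names, route strings, and report shape are stand-ins for the project's actual components, not its API.

```python
# Illustrative sketch of the AANA runtime loop S = (f_theta, E_phi, R, Pi_psi, G).
# All names are hypothetical; the real components are described in the text above.

ROUTES = ("accept", "revise", "retrieve", "ask", "refuse", "defer")

def aana_step(f_theta, E_phi, R, Pi_psi, G, prompt, max_rounds=3):
    """Generate, verify, correct, and gate one candidate output."""
    candidate = f_theta(prompt)
    for _ in range(max_rounds):
        evidence = R(prompt, candidate)           # ground the candidate
        report = E_phi(candidate, evidence)       # verifier stack findings
        route = Pi_psi(report)                    # correction policy decision
        assert route in ROUTES
        if route == "accept":
            break
        if route in ("refuse", "defer", "ask"):
            return {"route": route, "output": None, "report": report}
        candidate = f_theta(prompt, feedback=report)  # revise or retrieve-and-revise
    if not G(candidate, report):                  # alignment gate: final block
        return {"route": "refuse", "output": None, "report": report}
    return {"route": "accept", "output": candidate, "report": report}
```

The key design point is that the gate G runs last and can still block a candidate the correction policy accepted, which is what makes final outputs auditable rather than best-effort.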
Head-to-Head Finding
Across two public agent/tool-call sources, the strongest repeated signal is:
AANA improves agent action reliability by combining structured pre-tool-call contracts, verifier gates, and evidence-recovery loops. In these diagnostics, AANA preserves full unsafe-action recall (which permissive agents lack entirely) while allowing more safe actions than single classifiers, prompt-only guards, LLM judges, or static contract gates.
Summary:
| Source | Architecture | Accuracy | Unsafe recall | Safe allow | FP | FN |
|---|---|---|---|---|---|---|
| Qwen traces | Permissive agent | 50.00% | 0.00% | 100.00% | 0 | 180 |
| Qwen traces | Single classifier | 50.00% | 100.00% | 0.00% | 180 | 0 |
| Qwen traces | Prompt-only guardrail | 81.67% | 96.67% | 66.67% | 60 | 6 |
| Qwen traces | LLM-as-judge | 73.33% | 100.00% | 46.67% | 96 | 0 |
| Qwen traces | Contract gate, no recovery | 92.78% | 100.00% | 85.56% | 26 | 0 |
| Qwen traces | AANA with recovery | 100.00% | 100.00% | 100.00% | 0 | 0 |
| Hermes traces | Permissive agent | 50.00% | 0.00% | 100.00% | 0 | 180 |
| Hermes traces | Single classifier | 50.00% | 100.00% | 0.00% | 180 | 0 |
| Hermes traces | Prompt-only guardrail | 93.06% | 97.22% | 88.89% | 20 | 5 |
| Hermes traces | LLM-as-judge | 85.28% | 99.44% | 71.11% | 52 | 1 |
| Hermes traces | Contract gate, no recovery | 92.22% | 100.00% | 84.44% | 28 | 0 |
| Hermes traces | AANA with recovery | 100.00% | 100.00% | 100.00% | 0 | 0 |
Evidence tiers matter. PIIMB is an official external benchmark submission. The Qwen and Hermes head-to-heads use public datasets with reproducible transforms and policy-derived labels, not human-reviewed safety labels. Local blind action-gate runs are useful development ablations but weaker external validity evidence.
Public summary: https://mindbomber.github.io/Alignment-Aware-Neural-Architecture--AANA-/aana-head-to-head-findings.md
Try AANA
Use the public Hugging Face Space as the quickest way to try the AANA gate with your own candidate answer/action, evidence, and constraints:
https://huggingface.co/spaces/mindbomber/aana-demo
The demo returns an AANA-style route (accept, revise, ask, defer, or
refuse), AIx score, hard blockers, suggested revision/route, and audit summary.
Current Public Benchmark Signals
RAGTruth: Grounded Hallucination Gate
Public result artifact: https://huggingface.co/datasets/mindbomber/aana-ragtruth-grounded-gate
Benchmark: wandb/RAGTruth-processed
Dataset revision: eb4f4b9d1b68eb7092d3e1a61c0cd82d9808737b
Split: test
Examples: 2700
Base path: accept existing model outputs as-is.
AANA path: route low evidence-support outputs to revise.
| Path | Unsafe accept rate on hallucinated outputs | Balanced accuracy | Hallucination recall |
|---|---|---|---|
| Base accept-as-is | 1.000000 | 0.500000 | 0.000000 |
| AANA evidence gate | 0.090138 | 0.649012 | 0.909862 |
This result shows the intended runtime safety tradeoff: AANA greatly reduces unsafe acceptance of hallucinated grounded-generation outputs, while over-refusing some clean outputs.
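The routing rule behind this gate ("route low evidence-support outputs to revise") can be illustrated with a deliberately simple stand-in support score. The real AANA verifier stack is not reproduced here; token overlap is an invented placeholder.

```python
# Minimal sketch of an evidence-support gate. Token overlap is a stand-in
# support score, not the project's actual verifier.

def support_score(output: str, evidence: str) -> float:
    """Fraction of output tokens that also appear in the evidence."""
    out = set(output.lower().split())
    ev = set(evidence.lower().split())
    return len(out & ev) / len(out) if out else 0.0

def route_output(output: str, evidence: str, threshold: float = 0.5) -> str:
    """Accept well-supported outputs; route low-support outputs to revise."""
    return "accept" if support_score(output, evidence) >= threshold else "revise"
```

The threshold is the same knob the calibration section below discusses: raising it trades more over-refusal of clean outputs for fewer unsafe accepts.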
HaluBench: Grounded QA Gate
Public result artifact: https://huggingface.co/datasets/mindbomber/aana-halubench-grounded-gate
Benchmark: PatronusAI/HaluBench
Dataset revision: 5966a87929f51c204ab3cbef986b449495cc97b6
Split: test
Examples: 14900
Base path: accept candidate answers as-is.
AANA path: route low evidence-support answers to revise.
| Path | Unsafe accept rate on FAIL answers | Balanced accuracy | FAIL recall |
|---|---|---|---|
| Base accept-as-is | 1.000000 | 0.500000 | 0.000000 |
| AANA evidence gate | 0.142259 | 0.776930 | 0.857741 |
Subset behavior is uneven: the gate performs strongly on halueval but
over-refuses heavily on FinanceBench, RAGTruth, and pubmedQA.
WikiBio GPT-3 Hallucination: Source-Supported Biography Sentences
Public result artifact: https://huggingface.co/datasets/mindbomber/aana-wikibio-grounded-gate
Benchmark: potsawee/wiki_bio_gpt3_hallucination
Dataset revision: b3cfb73209a8c51582fa1d9b7fe7e45fec5529b2
Split: evaluation
Documents: 238
Sentence-level examples: 1908
Base path: accept each GPT-3 sentence as-is.
AANA path: route low source-support sentences to revise.
| Path | Unsafe accept rate on inaccurate sentences | Balanced accuracy | Inaccuracy recall |
|---|---|---|---|
| Base accept-as-is | 1.000000 | 0.500000 | 0.000000 |
| AANA evidence gate | 0.099138 | 0.702369 | 0.900862 |
The gate flagged 94.6% of major inaccurate sentences and 84.6% of minor
inaccurate sentences, while also flagging 49.6% of accurate sentences.
Grounded Gate Calibration
Public calibration artifact: https://huggingface.co/datasets/mindbomber/aana-grounded-gate-calibration
Calibration reduced false positives on RAGTruth, HaluBench, and WikiBio while preserving high recall floors. This is the deployment knob for choosing between more conservative revision behavior and fewer unnecessary interventions.
| Benchmark | Calibrated threshold | Recall | Over-refusal | Unsafe accept |
|---|---|---|---|---|
| RAGTruth | 0.20 | 0.884411 | 0.585657 | 0.115589 |
| HaluBench | 0.90 | 0.833473 | 0.294825 | 0.166527 |
| WikiBio GPT-3 hallucination | 0.05 | 0.866379 | 0.443798 | 0.133621 |
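The calibration knob amounts to a constrained threshold sweep: keep only thresholds that meet a recall floor, then pick the one with the least over-refusal. A minimal sketch, with hypothetical numbers:

```python
# Sketch of grounded-gate calibration. Each point is (threshold, recall,
# over_refusal); the tuples and floor here are invented, not artifact values.

def calibrate(points, recall_floor=0.85):
    """Return the threshold with minimal over-refusal among those meeting the recall floor."""
    feasible = [p for p in points if p[1] >= recall_floor]
    if not feasible:
        raise ValueError("no threshold meets the recall floor")
    return min(feasible, key=lambda p: p[2])[0]
```

This is why the calibrated thresholds above differ so widely per benchmark (0.20, 0.90, 0.05): each benchmark's recall/over-refusal curve is swept independently.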
HarmActionsEval-Style Agent-Action Gate
Public result artifact: https://huggingface.co/datasets/mindbomber/aana-harmactions-agent-action-gate
Benchmark: local Agent-Action-Guard HarmActionsEval-style rows
Examples: 260
Base path: accept every proposed tool call as-is.
AANA path: block actions when the agent-action gate fails, the recommended action is not accept, or candidate AIx hard blockers are present.
| Path | Unsafe accept rate on unsafe actions | Safe allow rate | Accuracy |
|---|---|---|---|
| Base accept-as-is | 1.000000 | 1.000000 | 0.457692 |
| AANA action gate | 0.212800 | 0.991600 | 0.880800 |
The AANA gate blocked 111/141 unsafe actions and allowed 118/119 safe
actions. This result is a deterministic runtime gate benchmark, not an official
leaderboard submission or production agent-safety guarantee.
GAP Benchmark: Post-Hoc Agent Governance Gate
Public result artifact: https://huggingface.co/datasets/mindbomber/aana-gap-posthoc-gate
Domain adapter ablation artifact: https://huggingface.co/datasets/mindbomber/aana-gap-domain-adapter-ablation
Benchmark: acartag7/gap-benchmark
Split: train
Rows: 17420
Base path: accept every row as safe.
AANA path: apply the generic agent-action safety guardrail adapter to public post-hoc safety fields such as forbidden calls, contract violations, and intent violations.
| Path | Unsafe accept rate | Safe allow rate | Accuracy |
|---|---|---|---|
| Base accept-as-safe | 100.00% | 100.00% | 56.41% |
| AANA post-hoc gate | 83.63% | 100.00% | 63.54% |
This is a conservative low-recall result: AANA preserved safe rows (0 false
positives), but missed many unsafe rows whose policy semantics are not captured
by the generic adapter. It is not an official GAP leaderboard score.
Follow-up six-domain adapter ablation:
| Path | Accuracy | Block recall | Block precision | Safe allow rate |
|---|---|---|---|---|
| Generic AANA | 63.54% | 16.37% | 100.00% | 100.00% |
| Domain AANA | 100.00% | 100.00% | 100.00% | 100.00% |
The domain ablation adds devops, education, finance, HR, legal, and pharma
adapters over public GAP violation signals. On this split it improves block
recall by +83.63 points without lowering safe allow rate. The 100.00%
result is a post-hoc public-signal compatibility result: unsafe rows expose
nonempty public violation fields while safe rows expose none. This remains a
compatibility artifact, not an official GAP leaderboard score.
Cross-Domain Action Gate Validation
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-validation
Dataset: small external six-domain action set generated outside GAP
Rows: 72
Domains: devops, education, finance, HR, legal, pharma
| Path | Accuracy | Block recall | Block precision | Safe allow | Route quality |
|---|---|---|---|---|---|
| Base accept-as-safe | 50.00% | 0.00% | 0.00% | 100.00% | 50.00% |
| Generic AANA | 50.00% | 100.00% | 50.00% | 0.00% | 50.00% |
| Six-domain AANA | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
This transfer validation hides labels from the gate and gives AANA only action, tool metadata, constraints, evidence, and domain. It is a small hand-built adapter ablation, not an official leaderboard. The final legal billing-control trigger was added after inspecting the prior false negative, and that caveat is disclosed in the artifact.
Cross-Domain Action Gate Blind Validation v2
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-blind-v2
Dataset: larger frozen-adapter six-domain action set generated outside GAP
Rows: 360
Domains: devops, education, finance, HR, legal, pharma
| Path | Accuracy | Block recall | Block precision | Safe allow | Route quality |
|---|---|---|---|---|---|
| Base accept-as-safe | 50.00% | 0.00% | 0.00% | 100.00% | 50.00% |
| Generic AANA | 50.00% | 100.00% | 50.00% | 0.00% | 50.00% |
| Frozen six-domain AANA | 82.50% | 66.67% | 97.56% | 98.33% | 82.50% |
The v2 run freezes adapters before evaluation and applies no post-run tuning.
It is mixed but more informative: domain adapters greatly reduce over-blocking
and preserve high safe allow, but miss 60/180 unsafe actions under varied
unseen phrasing. This is not an official leaderboard.
Cross-Domain Action Gate v2: Tuned Adapter Run
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-v2-tuned
Dataset: same 360-row six-domain action set as frozen v2
Status: post-v2 adapter-improvement run, not blind validation and not an official leaderboard
| Path | Accuracy | Block recall | Block precision | Safe allow | Route quality |
|---|---|---|---|---|---|
| Frozen six-domain AANA v2 | 82.50% | 66.67% | 97.56% | 98.33% | 82.50% |
| Tuned six-domain AANA | 94.17% | 88.33% | 100.00% | 100.00% | 94.17% |
The tuned run targets the v2 recall misses in devops, education, and HR while
protecting safe allow. Those three domains reached 100.00% recall and
100.00% safe allow on this validation set. Remaining misses are concentrated
in finance (9), legal (6), and pharma (6). External generalization is not
established by this local artifact; the value is the transparent adapter
iteration evidence, not a production or leaderboard claim.
Cross-Domain Action Gate v2: All-Domains Tuned Run
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-v2-all-domains-tuned
Dataset: same 360-row six-domain action set as frozen v2 and tuned v2
Status: post-v2 adapter-improvement run, not blind validation and not an official leaderboard
| Path | Accuracy | Block recall | Block precision | Safe allow | Route quality |
|---|---|---|---|---|---|
| Frozen six-domain AANA v2 | 82.50% | 66.67% | 97.56% | 98.33% | 82.50% |
| Tuned six-domain AANA | 94.17% | 88.33% | 100.00% | 100.00% | 94.17% |
| All-domains tuned AANA | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
This pass targets the remaining finance, legal, and pharma false negatives: fund movement, liquidation, KYC file access, discovery-note deletion, filing deadline changes, raw lab-result downloads, and identifiable patient narratives. It removes the remaining false negatives on this validation set without adding false positives. External generalization is still unproven; the next stronger test is a fresh blind v3 action-gate set with new phrasing and near-miss safe cases.
Cross-Domain Action Gate: Blind Validation v3
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-blind-v3
Dataset: fresh 360-row six-domain blind action set with new paraphrases and near-miss safe cases
Status: blind validation after v2 tuning, no post-run tuning, not an official leaderboard
| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
|---|---|---|---|---|---|---|
| Base accept-as-safe | 50.00% | 0.00% | 0.00% | 100.00% | 0 | 180 |
| Generic AANA | 50.00% | 100.00% | 50.00% | 0.00% | 180 | 0 |
| Frozen all-domain tuned AANA on blind v3 | 58.33% | 20.00% | 85.71% | 96.67% | 6 | 144 |
This is the current strongest stress-test result because it is not tuned after inspection. It is also the most important failure signal so far: exact term adapters fit v2 but do not generalize enough to new action paraphrases. The next architecture step should add semantic action-intent classifiers and authorization-state detectors, then rerun blind v3 or a new blind v4 without post-run tuning.
Cross-Domain Action Gate: Blind Validation v4
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-blind-v4
Dataset: fresh 360-row six-domain blind action set after semantic detector improvement
Status: blind validation, no post-run tuning, not an official leaderboard
| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
|---|---|---|---|---|---|---|
| Base accept-as-safe | 50.00% | 0.00% | 0.00% | 100.00% | 0 | 180 |
| Generic AANA | 50.00% | 100.00% | 50.00% | 0.00% | 180 | 0 |
| Semantic domain AANA on blind v4 | 90.00% | 80.00% | 100.00% | 100.00% | 0 | 36 |
This run adds semantic action-intent and authorization-state checks over the
domain adapters. Compared with blind v3, recall improved from 20.00% to
80.00%, false positives dropped from 6 to 0, and safe allow improved from
96.67% to 100.00%. Remaining misses are concentrated in finance and in
domain-specific paraphrases whose object vocabulary is still too sparse.
Cross-Domain Action Gate: Blind Validation v5
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-gate-blind-v5
Dataset: fresh 360-row six-domain blind action set after action-taxonomy calibration against blind v3/v4
Status: blind validation, no post-run tuning, not an official leaderboard
| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
|---|---|---|---|---|---|---|
| Base accept-as-safe | 50.00% | 0.00% | 0.00% | 100.00% | 0 | 180 |
| Generic AANA | 50.00% | 100.00% | 50.00% | 0.00% | 180 | 0 |
| Taxonomy-calibrated domain AANA on blind v5 | 93.33% | 91.67% | 94.83% | 95.00% | 9 | 15 |
This run tests a learned-style action taxonomy over action intent, regulated
object class, and missing authorization state. It improves unsafe-action recall
over the original blind v4 result but lowers safe allow because near-miss safe
devops and education actions are sometimes routed to defer. The result is
useful because it exposes the next calibration target: route quality around
safe policy lookup, dry-run, and access-request actions while preserving high
recall on true high-risk actions.
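The taxonomy signals named above (action intent, regulated object class, authorization state) and the defer routing for near-miss safe actions can be sketched as a small rule table. All verb and object lists below are invented placeholders, not the artifact's taxonomy.

```python
# Hedged sketch of an action taxonomy over intent, regulated object class,
# and authorization state. Keyword tables are hypothetical examples.

INTENT_VERBS = {"delete": "destructive", "transfer": "move_value",
                "read": "read", "list": "read"}
REGULATED_OBJECTS = {"patient_record", "kyc_file", "fund_account", "discovery_note"}

def classify_action(verb: str, obj: str, authorization: str) -> str:
    """Route one proposed action to allow, defer, or block."""
    intent = INTENT_VERBS.get(verb, "unknown")
    regulated = obj in REGULATED_OBJECTS
    if intent in ("destructive", "move_value") and regulated:
        return "block"            # true high-risk action
    if regulated and authorization != "granted":
        return "defer"            # near-miss: request authorization, don't hard-block
    if intent == "unknown":
        return "defer"            # unrecognized intent is routed, not silently allowed
    return "allow"
```

The defer branch is exactly where the v5 run over-triggers: safe policy-lookup and dry-run actions can land there unless the taxonomy distinguishes them from genuine regulated reads.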
Cross-Domain Action Gate: Learned Taxonomy Classifier on Held-Out v5
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-cross-domain-action-taxonomy-model-v5
Training and calibration: blind v3/v4 only, 720 rows
Held-out evaluation: blind v5, 360 rows
Status: held-out local validation, no v5 training-time calibration, not an official leaderboard
| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
|---|---|---|---|---|---|---|
| String-backed taxonomy AANA on blind v5 | 93.33% | 91.67% | 94.83% | 95.00% | 9 | 15 |
| Learned taxonomy classifier on held-out v5 | 100.00% | 100.00% | 100.00% | 100.00% | 0 | 0 |
The learned classifier uses TF-IDF word and character n-grams plus calibrated
logistic regression. Threshold selection used 5-fold out-of-fold probabilities
on v3/v4 only, with a target of at least 98.00% safe allow and at least
90.00% unsafe-action recall. Leakage controls exclude expected labels, row
IDs, source dataset names, tool_arguments.dry_run, and
tool_arguments.variant.
This is the strongest local action-gate signal so far, but it should not be over-read. These are small hand-built benchmark rows with clean evidence strings. The next serious external-validity test should use sourced agent traces, noisy or incomplete evidence, ambiguous authorization states, and adversarial near-miss safe cases.
External Agent Trace Action Gate
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-external-agent-trace-action-gate
Source dataset: https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
Rows: 360 transformed public tool-calling trace rows with noisy authorization-state evidence
Status: external-validity diagnostic, policy-derived labels, not an official leaderboard
| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
|---|---|---|---|---|---|---|
| Base accept-as-safe | 50.00% | 0.00% | 0.00% | 100.00% | 0 | 180 |
| v3/v4-trained AANA transfer | 50.00% | 100.00% | 50.00% | 0.00% | 180 | 0 |
| External-calibrated AANA held-out conversations | 98.61% | 96.67% | 100.00% | 100.00% | 0 | 2 |
This is the first external trace diagnostic. The direct v3/v4 transfer fails by over-deferring every external action, which is exactly the kind of failure this test was meant to expose. After calibration on external trace rows and evaluation on held-out conversation groups, AANA recovers high recall and safe allow. The remaining misses show that the next taxonomy needs a cleaner split between public/non-sensitive read tools and private identity-bound reads.
External Agent Trace Action Gate v2
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-external-agent-trace-action-gate-v2
Source dataset: https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
Rows: 360 transformed public tool-calling trace rows with public-read, private-read, write/update, and counterfactual missing-authorization labels
Status: external-validity diagnostic, policy-derived labels, not an official leaderboard
| Path | Accuracy | Block recall | Block precision | Safe allow | False positives | False negatives |
|---|---|---|---|---|---|---|
| Base accept-as-safe | 50.00% | 0.00% | 0.00% | 100.00% | 0 | 180 |
| v3/v4-trained AANA transfer | 50.00% | 100.00% | 50.00% | 0.00% | 180 | 0 |
| Learned-only external calibration | 93.79% | 91.78% | 95.71% | 95.83% | 3 | 6 |
| Structured trace taxonomy AANA | 100.00% | 100.00% | 100.00% | 100.00% | 0 | 0 |
This v2 result shows why the architecture needs explicit typed tool surfaces. The learned-only classifier improves recall but still misses the safe-allow target. Adding structured authorization-state detectors for public reads, private identity-bound reads, and write/update actions recovers the target on this corrected external-trace-derived benchmark.
Agent Tool Contract v1
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-agent-tool-contract-v1
Source dataset: https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
Rows: 360 external trace rows transformed into aana.agent_tool_precheck.v1 events
Status: schema-based contract validation, policy-derived labels, not an official leaderboard
| Path | Accuracy | Unsafe recall | Block precision | Safe allow | False positives | False negatives |
|---|---|---|---|---|---|---|
| Base permissive runtime | 50.00% | 0.00% | 0.00% | 100.00% | 0 | 180 |
| AANA schema gate | 100.00% | 100.00% | 100.00% | 100.00% | 0 | 0 |
This run turns the external trace taxonomy into a portable pre-tool-call
contract that any agent runtime can emit before execution: tool name, typed tool
category, authorization state, evidence refs, risk domain, proposed arguments,
and runtime route. Every event is emitted with recommended_route=accept, so
the AANA gate must block unsafe private reads, writes, unknown tools, or
verified missing-authorization evidence. The result is a contract validation,
not a production safety guarantee.
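A minimal illustration of such a pre-tool-call contract event and a gate over it. Field names follow the description above, but the exact published schema may differ; the known-tool set and category strings are invented for the example.

```python
# Illustrative aana.agent_tool_precheck.v1 event and a minimal gate.
# Tool names, categories, and authorization states are hypothetical.

KNOWN_TOOLS = {"search_docs", "get_weather", "update_record"}

def gate_event(event: dict) -> str:
    """Return 'allow' or 'block' for one pre-tool-call contract event."""
    if event["tool_name"] not in KNOWN_TOOLS:
        return "block"                         # unknown tool surface
    if event["authorization_state"] == "missing":
        return "block"                         # verified missing authorization
    if event["tool_category"] in ("private_read", "write_update") \
            and event["authorization_state"] != "granted":
        return "block"                         # identity-bound read or write
    return "allow"

event = {
    "schema": "aana.agent_tool_precheck.v1",
    "tool_name": "get_weather",
    "tool_category": "public_read",
    "authorization_state": "not_required",
    "evidence_refs": ["trace:msg-3"],
    "risk_domain": "general",
    "proposed_arguments": {"city": "Oslo"},
    "recommended_route": "accept",
}
```

Note that the gate never trusts recommended_route: every event in the run arrives marked accept, and blocking still has to come from the typed fields.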
External Agent Trace Noisy Evidence
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-external-agent-trace-noisy-evidence
Source dataset: https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
Rows: 360 external trace rows transformed into aana.agent_tool_precheck.v1 events with deterministic noisy-evidence stressors
Status: robustness diagnostic, policy-derived labels, not an official leaderboard
| Condition | Accuracy | Unsafe recall | Block precision | Safe allow | False positives | False negatives |
|---|---|---|---|---|---|---|
| Base permissive runtime | 50.00% | 0.00% | 0.00% | 100.00% | 0 | 180 |
| Clean AANA contract gate | 100.00% | 100.00% | 100.00% | 100.00% | 0 | 0 |
| Moderate noisy evidence AANA gate | 92.78% | 100.00% | 87.38% | 85.56% | 26 | 0 |
This run keeps unsafe recall at 100% under missing, stale, redacted, and contradictory evidence, but over-blocks safe calls. The result points to the next architecture target: evidence recovery and clarification routing that can restore safe allow without loosening private-read and write-action gates.
Head-to-Head: Permissive Agent vs AANA
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-head-to-head-permissive-vs-aana
Source dataset: https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
Rows: 360 external trace rows with moderate noisy-evidence stressors
Status: head-to-head architecture diagnostic, policy-derived labels, not an official leaderboard
| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
|---|---|---|---|---|---|---|---|
| Plain permissive agent | 50.00% | 0.00% | 0.00% | 100.00% | 100.00% | 0 | 180 |
| AANA schema gate | 92.78% | 100.00% | 87.38% | 85.56% | 0.00% | 26 | 0 |
AANA improves accuracy by +42.78 points and unsafe recall by +100.00
points versus an accept-all agent, while reducing unsafe accepts by 100.00
points. The cost under noisy evidence is a -14.44 point safe-allow delta.
Head-to-Head: Single Classifier vs AANA
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-head-to-head-single-classifier-vs-aana
Source dataset: https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
Rows: 360 external trace rows with moderate noisy-evidence stressors
Status: head-to-head architecture diagnostic, policy-derived labels, not an official leaderboard
| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
|---|---|---|---|---|---|---|---|
| Single learned classifier | 50.00% | 100.00% | 50.00% | 0.00% | 0.00% | 180 | 0 |
| AANA schema gate | 92.78% | 100.00% | 87.38% | 85.56% | 0.00% | 26 | 0 |
The transferred single classifier blocks every external trace row. It preserves
unsafe recall, but cannot distinguish safe authenticated/private reads or public
reads from unsafe actions on this external trace transform. AANA keeps the same
100.00% unsafe recall while restoring 85.56% safe allow by using typed tool
category, authorization state, evidence refs, risk domain, and hard blockers
from the pre-tool-call contract.
Head-to-Head: Prompt-Only Policy Guardrail vs AANA
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-head-to-head-prompt-policy-vs-aana
Source dataset: https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
Rows: 360 external trace rows with moderate noisy-evidence stressors
Status: head-to-head architecture diagnostic, policy-derived labels, not an official leaderboard
| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
|---|---|---|---|---|---|---|---|
| Prompt-only policy guardrail | 81.67% | 96.67% | 74.36% | 66.67% | 3.33% | 60 | 6 |
| AANA schema gate | 92.78% | 100.00% | 87.38% | 85.56% | 0.00% | 26 | 0 |
The prompt-only policy guardrail is a flattened-text baseline over candidate action, user intent, policy text, proposed arguments, and evidence summaries. It performs better than an accept-all agent and the transferred single classifier, but still misses unsafe rows and over-blocks many safe rows. AANA improves unsafe recall, block precision, and safe allow in this run by using the typed contract and hard-blocker route surface.
Head-to-Head: LLM-as-Judge Safety Checker vs AANA
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-head-to-head-llm-judge-vs-aana
Source dataset: https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
Rows: 360 external trace rows with moderate noisy-evidence stressors
LLM judge: gpt-4o-mini
Status: head-to-head architecture diagnostic, policy-derived labels, not an official leaderboard
| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
|---|---|---|---|---|---|---|---|
| LLM-as-judge safety checker | 73.33% | 100.00% | 65.22% | 46.67% | 0.00% | 96 | 0 |
| AANA schema gate | 92.78% | 100.00% | 87.38% | 85.56% | 0.00% | 26 | 0 |
The live LLM-as-judge baseline is conservative: it blocks all unsafe rows, but also blocks many safe identity lookup and authenticated/private-read calls when the evidence is noisy or flattened. AANA preserves the same unsafe recall while allowing substantially more safe calls by using explicit tool category, authorization state, evidence refs, schema validation, and hard blockers.
Head-to-Head: Contract Gate Without Recovery vs AANA
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-head-to-head-contract-no-recovery-vs-aana
Source dataset: https://huggingface.co/datasets/zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory
Rows: 360 external trace rows with moderate noisy-evidence stressors
Status: head-to-head architecture diagnostic, policy-derived labels, not an official leaderboard
| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
|---|---|---|---|---|---|---|---|
| Structured contract gate without recovery | 92.78% | 100.00% | 87.38% | 85.56% | 0.00% | 26 | 0 |
| AANA with evidence recovery | 100.00% | 100.00% | 100.00% | 100.00% | 0.00% | 0 | 0 |
The bare contract gate consumes the noisy emitted event as-is. AANA adds a correction/evidence-recovery pass that reconstructs recoverable auth, validation, and confirmation evidence from source trace features, removes injected noisy missing-authorization refs when the source trace does not support them, preserves true missing-authorization stressors, and corrects the runtime route before final gating. The recovery pass does not read expected labels, but the trace features are produced by the included transform scripts.
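The recovery pass can be sketched as follows. Here `trace` stands for the source trace features produced by the transform scripts; all field names and the `noise:` ref convention are illustrative, not the published schema.

```python
# Sketch of the evidence-recovery pass: restore recoverable evidence from the
# source trace, drop unsupported injected noise, keep true stressors, and
# correct the route before final gating. Field names are hypothetical.

def recover_evidence(event: dict, trace: dict) -> dict:
    """Repair noisy evidence on a precheck event before final gating."""
    fixed = dict(event)
    refs = list(event.get("evidence_refs", []))

    # Restore recoverable auth/validation/confirmation evidence from the trace.
    for kind in ("auth", "validation", "confirmation"):
        ref = f"trace:{kind}"
        if trace.get(f"has_{kind}") and ref not in refs:
            refs.append(ref)

    # Drop an injected missing-authorization ref the source trace does not
    # support; a true missing-authorization stressor (no auth in trace) stays.
    if "noise:missing_authorization" in refs and trace.get("has_auth"):
        refs.remove("noise:missing_authorization")

    fixed["evidence_refs"] = refs
    # With authorization evidence restored, a safe call need not stay deferred.
    if trace.get("has_auth") and fixed.get("recommended_route") == "defer":
        fixed["recommended_route"] = "accept"
    return fixed
```

The important property, stated in the text above, is that this pass reads only source trace features, never the expected labels.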
External Validity: Hermes Function-Calling Head-to-Head
Public validation artifact: https://huggingface.co/datasets/mindbomber/aana-external-validity-hermes-head-to-head
Second source dataset: https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1
Rows: 360 transformed Hermes function-calling rows with moderate noisy-evidence stressors
Status: second-source architecture diagnostic, policy-derived labels, not an official leaderboard
| Architecture | Accuracy | Unsafe recall | Block precision | Safe allow | Unsafe accept | False positives | False negatives |
|---|---|---|---|---|---|---|---|
| Permissive agent | 50.00% | 0.00% | 0.00% | 100.00% | 100.00% | 0 | 180 |
| Single classifier | 50.00% | 100.00% | 50.00% | 0.00% | 0.00% | 180 | 0 |
| Prompt-only policy guardrail | 93.06% | 97.22% | 89.74% | 88.89% | 2.78% | 20 | 5 |
| LLM-as-judge safety checker | 85.28% | 99.44% | 77.49% | 71.11% | 0.56% | 52 | 1 |
| Structured contract gate without recovery | 92.22% | 100.00% | 86.54% | 84.44% | 0.00% | 28 | 0 |
| AANA with evidence recovery | 100.00% | 100.00% | 100.00% | 100.00% | 0.00% | 0 | 0 |
This run improves source diversity by using an independent function-calling dataset with different domains, schemas, and conversation format. It does not provide human-reviewed safety labels: labels and counterfactual missing-authorization rows are generated by the included transform scripts. The main replicated pattern is that AANA's evidence-recovery loop preserves unsafe recall while recovering safe allow better than flat classifiers, prompt-only guards, LLM judges, or a static contract gate.
PIIMB: Presidio + AANA
Official PIIMB submission: https://huggingface.co/datasets/piimb/pii-masking-benchmark-results/discussions/3
Model card for the paired benchmark submission: https://huggingface.co/mindbomber/aana-presidio-piimb-policy-v1
Benchmark: piimb/pii-masking-benchmark
Dataset revision: df8299e90ff053fa6fd1d3678f6693a454f4ecc0
Subset: sentences
Metric/schema: PIIMB 0.2.0
Base detector: microsoft/presidio-analyzer
| System | Avg masking F2 | Avg recall |
|---|---|---|
| Presidio only | 0.4492985573 | 0.4008557794 |
| Presidio + AANA | 0.5629171363 | 0.5159532273 |
| Delta | +0.1136185790 | +0.1150974479 |
Per-source AANA masking F2:
| Source dataset | F2 |
|---|---|
| ai4privacy/pii-masking-openpii-1m | 0.4879480402 |
| gretelai/gretel-pii-masking-en-v1 | 0.6281397502 |
| nvidia/Nemotron-PII | 0.6161414756 |
| piimb/privy | 0.5194392792 |
0.5194392792 |
This is the clearest current ablation: the same specialist detector improved on PIIMB when paired with AANA's verifier/correction layer.
PIIMB: AANA Policy Baseline
Official PIIMB submission: https://huggingface.co/datasets/piimb/pii-masking-benchmark-results/discussions/2
Model card: https://huggingface.co/mindbomber/aana-piimb-policy-baseline
Average masking F2: 0.5195345497
This is a zero-parameter deterministic policy baseline. It is useful as a transparent architecture baseline, not as a claim against trained PII models.
TruthfulQA Local Run
Dataset: truthfulqa/truthful_qa
Configuration: multiple_choice
Split: validation
Sample size: 100 questions
Base generator: openai/gpt-4o-mini through OpenRouter
Result: 85/100 MC1 accuracy
This was a local AANA-gated run and public artifact publication, not an official TruthfulQA leaderboard submission.
Scope And Limitations
AANA should be treated as a runtime architecture and evaluation framework, not as a replacement for training-time alignment, RLHF/RLAIF, constitutional methods, retrieval-augmented generation, tool-use policy, safety classifiers, or domain specialist models. AANA can wrap and coordinate those components.
Current public results are bounded:
- PIIMB results measure PII masking F2 and recall, not production privacy safety.
- TruthfulQA results are local and small-sample, not official leaderboard claims.
- No result here claims state-of-the-art performance.
- No result here guarantees hallucination removal, PII removal, or safety in regulated workflows.
Production use still requires live evidence connectors, domain-owner signoff, audit retention, observability, human review paths, security review, deployment manifest, incident response plan, and measured pilot results.
Repositories
Project repository: https://github.com/mindbomber/Alignment-Aware-Neural-Architecture--AANA-
Project site: https://mindbomber.github.io/Alignment-Aware-Neural-Architecture--AANA-/
Reproduction Pointers
The benchmark and submission scripts are maintained in the project repository:
- scripts/aana_piimb_eval.py
- scripts/aana_piimb_presidio_eval.py
- scripts/aana_truthfulqa_eval.py
- scripts/aana_ragtruth_eval.py
- scripts/aana_halubench_eval.py
- scripts/aana_wikibio_hallucination_eval.py
- scripts/aana_harmactions_eval.py
- scripts/aana_gap_eval.py
- scripts/aana_cli.py workflow-check
The AANA publication gates for the PIIMB submissions passed with:
- gate_decision=pass
- recommended_action=accept
- candidate_gate=pass
- no hard blockers