Authors committed about 1 month ago

Commit

7f59fb7

verified

1 Parent(s): 587e704

Initial anonymous NeurIPS 2026 E&D code and results release

This view is limited to 50 files because the commit contains too many changes.

Files changed (50)

README.md +13 -0
README_RELEASE.md +5 -0
eval_code/configs/recap/vllm_serve_gemma4_31b_it.yaml +40 -0
eval_code/scripts/build_caption_cbu_requests.py +196 -0
eval_code/scripts/build_cbu_vqa_requests.py +139 -0
eval_code/scripts/build_grounded_cbu_verify_requests.py +197 -0
eval_code/scripts/caption_embedding_vendi.py +1330 -0
eval_code/scripts/compute_longclip_retrieval_margin.py +368 -0
eval_code/scripts/export_cbu_metric_tables.py +386 -0
eval_code/scripts/export_cbu_vqa_tables.py +84 -0
eval_code/scripts/pack_recap_ed_metrics.py +223 -0
eval_code/scripts/plot_caption_survey_curves.py +251 -0
eval_code/scripts/run_cbu_vqa_requests.py +261 -0
eval_code/scripts/run_grounded_cbu_verify_requests.py +289 -0
eval_code/scripts/run_text_json_requests.py +256 -0
eval_code/scripts/summarize_cbu_responses.py +296 -0
eval_code/scripts/summarize_cbu_vqa_responses.py +153 -0
eval_code/scripts/summarize_grounded_cbu_verify.py +135 -0
eval_code/scripts/vllm/serve_gemma4_31b_it.sh +72 -0
eval_results/ALL_EVAL_RESULTS_INDEX.md +28 -0
eval_results/README.md +14 -0
eval_results/all_cbu_b64_summary.csv +15 -0
eval_results/all_vqa_b64_summary.csv +17 -0
eval_results/cc12m_budget_frontier_plot.csv +17 -0
eval_results/cc12m_cbu_budget_frontier.png +0 -0
eval_results/cc12m_cbu_vqa_bootstrap_ci.tsv +5 -0
eval_results/cc12m_cbu_yield_efficiency_scatter.png +0 -0
eval_results/cc12m_gemma4_vqa_bootstrap_ci.tsv +5 -0
eval_results/cc12m_longclip_plot.csv +9 -0
eval_results/cc12m_vqa_supported_risk_pareto.csv +9 -0
eval_results/cc12m_vqa_supported_risk_pareto.png +0 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/README.md +106 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/cpu_text_metrics/cpu_text_comparison.md +4 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/cpu_text_metrics/cpu_text_comparison.tsv +3 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/cpu_text_metrics/cpu_text_summary.json +56 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/cbu_bootstrap_summary.json +238 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/cbu_vqa_gemma4_table.md +3 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/cbu_vqa_gemma4_table.tex +7 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/claimed_cbu_ci.tsv +2 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/grounded_cbu_category_ci.tsv +9 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/grounded_cbu_ci.tsv +2 -0
eval_results/datacomp-naive-qwen35-baseline-2026-05-02/naive_qwen35_caption.summary.json +15 -0
eval_results/embeddinggemma_pair_summary.tsv +8 -0
eval_results/eval_results_summary.md +34 -0
eval_results/gemma-cross-corpus-2026-05-02/README.md +3 -0
eval_results/gemma-cross-corpus-2026-05-02/cbu_bootstrap_summary.json +1375 -0
eval_results/gemma-cross-corpus-2026-05-02/cbu_vqa_gemma4_cross_corpus_table.md +10 -0
eval_results/gemma-cross-corpus-2026-05-02/cbu_vqa_gemma4_cross_corpus_table.tex +14 -0
eval_results/gemma-cross-corpus-2026-05-02/claimed_cbu_ci.tsv +1 -0
eval_results/gemma-cross-corpus-2026-05-02/grounded_cbu_category_ci.tsv +65 -0

README.md ADDED

	@@ -0,0 +1,13 @@

+# NeurIPS E&D Recap Evaluation Export Bundle
+This sanitized bundle groups review-facing material for a recaptioned T2I supervision evaluation submission.
+## Layout
+- `dataset_release/`: Hugging Face-oriented caption metadata records, grouped pair records, a dataset card draft, and a Croissant metadata template. Source images are not included.
+- `eval_code/`: reproducible evaluation scripts and vLLM configuration copies.
+- `eval_results/`: compact result tables, plot-ready CSV files, and generated figure drafts.
+- `paper_drafts/`: sanitized writing drafts and appendix notes.
+- `metadata/`: auxiliary export metadata.
+Local machine paths, usernames, and repository identifiers have been replaced with placeholders.

README_RELEASE.md ADDED

	@@ -0,0 +1,5 @@

+# Anonymous Recap T2I Evaluation Code and Results
+This repository stages executable evaluation scripts, compact result tables, and manifests for the NeurIPS 2026 E&D review package. Dataset metadata is staged separately at `https://huggingface.co/datasets/Anonymous1557/recap-t2i-evaluation-metadata-2026`.
+Large image audit tarballs and unredacted source metadata are excluded from this code package and retained in the private SMB archive unless explicitly approved for release.

eval_code/configs/recap/vllm_serve_gemma4_31b_it.yaml ADDED

	@@ -0,0 +1,40 @@

+# vLLM serve config: google/gemma-4-31B-it, DP=8
+#
+# Intended use:
+#   VLLM_CONFIG=configs/recap/vllm_serve_gemma4_31b_it.yaml \
+#   VLLM_LOG=/tmp/vllm_gemma4_31b_it.log \
+#   bash scripts/vllm/serve_gemma4_31b_it.sh start
+#
+# This config is for cross-family VQA/CBU judging. It uses one replica per H200
+# to maximize throughput on image-conditioned yes/no/uncertain audit requests.
+model: "<HF_CACHE>/models--google--gemma-4-31B-it/snapshots/439edf5652646a0d1bd8b46bfdc1d3645761a445"
+served-model-name: "google/gemma-4-31B-it"
+host: "0.0.0.0"
+port: 8000
+# Parallelism
+data-parallel-size: 8
+tensor-parallel-size: 1
+# Memory and concurrency
+dtype: "auto"
+gpu-memory-utilization: 0.94
+max-model-len: 4096
+max-num-seqs: 512
+max-num-batched-tokens: 65536
+max-cudagraph-capture-size: 512
+# Keep KV compact for high-concurrency VQA judge workloads.
+kv-cache-dtype: "fp8"
+# Multimodal / throughput
+enable-chunked-prefill: true
+enable-prefix-caching: true
+limit-mm-per-prompt: '{"image": 1}'
+mm-processor-kwargs: '{"max_pixels": 1003520}'
+allowed-local-media-path: "/"
+# Logging
+disable-uvicorn-access-log: true
+uvicorn-log-level: "warning"
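
The config above serves `google/gemma-4-31B-it` on port 8000 and allows at most one image per prompt. The runner scripts that talk to this server are not shown in this view, so the following is only a minimal sketch of how a client might assemble one image-conditioned judge request for an OpenAI-compatible chat endpoint. The helper name `build_judge_payload`, the JPEG MIME type, and the `temperature` and `max_tokens` values are assumptions made for illustration, not settings taken from the repository.

```python
import base64
import json


def build_judge_payload(
    system_prompt: str,
    user_prompt: str,
    image_bytes: bytes,
    model: str = "google/gemma-4-31B-it",
) -> dict:
    """Assemble a chat payload with one inline image (hypothetical helper)."""
    # Assumption: the image is a JPEG; adjust the MIME type for other formats.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = "data:image/jpeg;base64," + encoded
    return {
        "model": model,  # matches served-model-name in the config
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    # One image only, in line with limit-mm-per-prompt.
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": user_prompt},
                ],
            },
        ],
        "temperature": 0.0,  # assumption: deterministic judging
        "max_tokens": 512,   # assumption: leaves room inside max-model-len
    }


if __name__ == "__main__":
    payload = build_judge_payload("You are a judge.", "questions=[]", b"\xff\xd8\xff")
    print(json.dumps(payload)[:120])
```

A client would then POST the payload as JSON to the server's `/v1/chat/completions` route on the configured host and port.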

eval_code/scripts/build_caption_cbu_requests.py ADDED

	@@ -0,0 +1,196 @@

+#!/usr/bin/env python3
+"""Build text-only claimed-CBU extraction requests from caption JSONL files."""
+from __future__ import annotations
+import argparse
+import hashlib
+import json
+from pathlib import Path
+from typing import Any
+UNIT_CATEGORIES = [
+    "object",
+    "attribute",
+    "relation",
+    "style",
+    "camera",
+    "lighting",
+    "count",
+    "text_rendering",
+]
+SYSTEM_PROMPT = """You extract atomic controllable visual content units from captions for text-to-image training-data evaluation.
+Return only valid compact JSON. Extract only facts explicitly claimed by the caption. Do not infer image content beyond the caption."""
+CBU_JSON_SCHEMA: dict[str, Any] = {
+    "type": "object",
+    "properties": {
+        "caption_id": {"type": "string"},
+        "claimed_units": {
+            "type": "array",
+            "items": {
+                "type": "object",
+                "properties": {
+                    "category": {"type": "string", "enum": UNIT_CATEGORIES},
+                    "unit": {"type": "string", "maxLength": 80},
+                    "span": {"type": "string", "maxLength": 120},
+                    "target": {"type": "string", "maxLength": 80},
+                },
+                "required": ["category", "unit", "span", "target"],
+                "additionalProperties": False,
+            },
+        },
+    },
+    "required": ["caption_id", "claimed_units"],
+    "additionalProperties": False,
+}
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Build claimed-CBU extraction request JSONL")
+    parser.add_argument("--input", required=True, help="Caption JSONL")
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--text-field", default="caption")
+    parser.add_argument("--id-field", default=None)
+    parser.add_argument("--surface", required=True)
+    parser.add_argument("--max-records", type=int, default=None)
+    parser.add_argument("--sample-records", type=int, default=None)
+    parser.add_argument("--sample-seed", type=int, default=0)
+    parser.add_argument("--max-caption-chars", type=int, default=1800)
+    parser.add_argument(
+        "--token-budget",
+        type=int,
+        default=None,
+        help="Optional whitespace token prefix budget for length-controlled CBU@B requests",
+    )
+    parser.add_argument(
+        "--max-units",
+        type=int,
+        default=None,
+        help="Optional maximum atomic units in the JSON schema; use only for stress/debug caps",
+    )
+    return parser.parse_args()
+def stable_float(*parts: object) -> float:
+    raw = ":".join(str(part) for part in parts)
+    digest = hashlib.blake2b(raw.encode("utf-8"), digest_size=8).digest()
+    return int.from_bytes(digest, "big") / 2**64
+def iter_rows(args: argparse.Namespace) -> list[tuple[int, str | None, str]]:
+    rows: list[tuple[int, str | None, str]] = []
+    with Path(args.input).open("r", encoding="utf-8") as handle:
+        for row_index, line in enumerate(handle):
+            if args.max_records is not None and args.sample_records is None and len(rows) >= args.max_records:
+                break
+            if not line.strip():
+                continue
+            row = json.loads(line)
+            text = row.get(args.text_field)
+            if not isinstance(text, str) or not text.strip():
+                continue
+            row_id = row.get(args.id_field) if args.id_field else None
+            rows.append((row_index, str(row_id) if row_id is not None else None, text))
+    if args.sample_records is not None:
+        rows.sort(key=lambda item: stable_float(args.sample_seed, args.surface, item[0], item[1] or ""))
+        rows = rows[: args.sample_records]
+        rows.sort(key=lambda item: item[0])
+    return rows
+def schema_with_max_units(max_units: int | None) -> dict[str, Any]:
+    schema = json.loads(json.dumps(CBU_JSON_SCHEMA))
+    if max_units is not None:
+        schema["properties"]["claimed_units"]["maxItems"] = max_units
+    return schema
+def build_user_prompt(caption_id: str, caption: str, max_caption_chars: int, max_units: int | None) -> str:
+    clipped = caption[:max_caption_chars].replace("\n", " ")
+    schema = json.dumps(schema_with_max_units(max_units), ensure_ascii=False, separators=(",", ":"))
+    categories = ", ".join(UNIT_CATEGORIES)
+    return (
+        "Extract caption-claimed controllable visual units as atomic records.\n"
+        f"Unit categories: {categories}.\n"
+        "Rules:\n"
+        "- Each record must contain exactly one visual control fact.\n"
+        "- Use each semantic fact once; choose the single best category.\n"
+        "- unit is a short canonical phrase, not a full clause.\n"
+        "- span is the shortest caption span supporting the unit.\n"
+        "- target is the object or scene element modified by the unit; use \"scene\" when global.\n"
+        "- relation units must include both the relation and participating objects; do not output lone verbs or prepositions.\n"
+        "- count units must attach a number to a target object; never output articles such as a, an, or the.\n"
+        "- text_rendering units are only visible rendered text explicitly claimed by the caption; absent text claims are not units.\n"
+        "- Do not output negative or absent facts, metadata, captioner phrases, or duplicate paraphrases.\n"
+        "- Keep text_rendering units short; do not copy long copyright, table, or legal text blocks.\n"
+        "- Use [] when the caption contains no controllable visual units.\n"
+        "Return only JSON matching this schema:\n"
+        f"{schema}\n\n"
+        f"caption_id={caption_id}\ncaption={clipped}"
+    )
+def apply_token_budget(caption: str, token_budget: int | None) -> str:
+    if token_budget is None:
+        return caption
+    return " ".join(caption.split()[:token_budget])
+def main() -> int:
+    args = parse_args()
+    if args.max_records is not None and args.sample_records is not None:
+        raise SystemExit("--max-records and --sample-records are mutually exclusive")
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    rows = iter_rows(args)
+    with output.open("w", encoding="utf-8") as handle:
+        for emitted_index, (source_row, row_id, caption) in enumerate(rows):
+            caption_id = row_id or f"{args.surface}:{source_row}"
+            request_caption = apply_token_budget(caption, args.token_budget)
+            budget_tag = f"b{args.token_budget}" if args.token_budget is not None else "full"
+            request_id = hashlib.blake2b(
+                f"claimed_cbu_v2:{budget_tag}:{args.surface}:{source_row}:{caption_id}".encode("utf-8"),
+                digest_size=16,
+            ).hexdigest()
+            row = {
+                "request_id": request_id,
+                "task": "claimed_cbu_v2",
+                "token_budget": args.token_budget,
+                "surface": args.surface,
+                "caption_id": caption_id,
+                "source_row": source_row,
+                "emitted_index": emitted_index,
+                "caption": request_caption,
+                "source_caption": caption,
+                "system_prompt": SYSTEM_PROMPT,
+                "user_prompt": build_user_prompt(caption_id, request_caption, args.max_caption_chars, args.max_units),
+            }
+            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
+    manifest = {
+        "task": "claimed_cbu_v2",
+        "input": args.input,
+        "output": str(output),
+        "surface": args.surface,
+        "text_field": args.text_field,
+        "id_field": args.id_field,
+        "max_records": args.max_records,
+        "sample_records": args.sample_records,
+        "sample_seed": args.sample_seed,
+        "token_budget": args.token_budget,
+        "max_units": args.max_units,
+        "rows": len(rows),
+        "schema": schema_with_max_units(args.max_units),
+    }
+    manifest_path = output.with_suffix(".manifest.json")
+    manifest_path.write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output": str(output), "manifest": str(manifest_path), "requests": len(rows)}, indent=2))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
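
The script above samples rows deterministically by ranking them on a BLAKE2b-derived key and optionally truncates captions to a whitespace-token prefix. The sketch below isolates those two steps so they can be checked without a caption file. `stable_float` and `apply_token_budget` are copied from the script; `stable_sample` is a hypothetical wrapper that mirrors the sort, slice, re-sort sequence in `iter_rows` for rows that have no id field.

```python
from __future__ import annotations

import hashlib


def stable_float(*parts: object) -> float:
    """Map arbitrary parts to a reproducible value in [0, 1]."""
    raw = ":".join(str(part) for part in parts)
    digest = hashlib.blake2b(raw.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def stable_sample(row_indices: list[int], k: int, seed: int, surface: str) -> list[int]:
    """Pick k rows by hash rank, then restore source order (hypothetical wrapper)."""
    # The empty string stands in for a missing row id, as in iter_rows.
    ranked = sorted(row_indices, key=lambda i: stable_float(seed, surface, i, ""))
    return sorted(ranked[:k])


def apply_token_budget(caption: str, token_budget: int | None) -> str:
    """Keep only the first token_budget whitespace-separated tokens."""
    if token_budget is None:
        return caption
    return " ".join(caption.split()[:token_budget])


if __name__ == "__main__":
    print(stable_sample(list(range(1000)), 5, seed=0, surface="demo"))
    print(apply_token_budget("a red car parked near a tree", 3))
```

Because the rank depends only on the seed, surface, row index, and id, the same rows are selected on every run and on every machine, which a `random.sample` call without a fixed seed would not guarantee.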

eval_code/scripts/build_cbu_vqa_requests.py ADDED

	@@ -0,0 +1,139 @@

+#!/usr/bin/env python3
+"""Build VQA-style yes/no question requests from grounded-CBU request JSONL."""
+from __future__ import annotations
+import argparse
+import hashlib
+import json
+from pathlib import Path
+from typing import Any
+SYSTEM_PROMPT = """You are a strict visual question answering judge.
+Return only valid compact JSON. Answer each question using only visible image evidence."""
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Build VQA-style requests from CBU verification requests")
+    parser.add_argument("--input", required=True, help="grounded-CBU request JSONL")
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--max-requests", type=int, default=None)
+    parser.add_argument("--sample-records", type=int, default=None)
+    parser.add_argument("--sample-seed", type=int, default=0)
+    parser.add_argument("--max-questions-per-request", type=int, default=None)
+    return parser.parse_args()
+def stable_float(*parts: object) -> float:
+    raw = ":".join(str(part) for part in parts)
+    digest = hashlib.blake2b(raw.encode("utf-8"), digest_size=8).digest()
+    return int.from_bytes(digest, "big") / 2**64
+def question_for(unit: dict[str, Any]) -> str:
+    category = str(unit.get("category", ""))
+    phrase = str(unit.get("unit", "")).strip()
+    target = str(unit.get("target", "")).strip()
+    if category == "text_rendering":
+        return f"Is the rendered text claim '{phrase}' visibly supported by the image?"
+    if target:
+        return f"Is the visual claim '{target}: {phrase}' supported by the image?"
+    return f"Is the visual claim '{phrase}' supported by the image?"
+def user_prompt(questions: list[dict[str, str]]) -> str:
+    question_json = json.dumps(questions, ensure_ascii=False, separators=(",", ":"))
+    return (
+        "Answer each visual question using only the image.\n"
+        "Rules:\n"
+        "- Do not use any caption text or outside knowledge.\n"
+        "- Use yes when the image visibly supports the question.\n"
+        "- Use no when the image contradicts the question or lacks visible support.\n"
+        "- Use uncertain when the question is too fine-grained, occluded, unreadable, or visually ambiguous.\n"
+        "- Keep evidence short and grounded in visible image content.\n"
+        "- Return exactly one answer for each input question_id.\n\n"
+        f"questions={question_json}"
+    )
+def iter_rows(args: argparse.Namespace) -> list[dict[str, Any]]:
+    rows: list[dict[str, Any]] = []
+    with Path(args.input).open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if args.max_requests is not None and args.sample_records is None and len(rows) >= args.max_requests:
+                break
+            if line.strip():
+                rows.append(json.loads(line))
+    if args.sample_records is not None:
+        rows.sort(key=lambda row: stable_float(args.sample_seed, row.get("request_id", "")))
+        rows = rows[: args.sample_records]
+        rows.sort(key=lambda row: row.get("source_row") or 0)  # tolerate a null source_row
+    return rows
+def main() -> int:
+    args = parse_args()
+    rows = iter_rows(args)
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    written = 0
+    skipped = 0
+    with output.open("w", encoding="utf-8") as handle:
+        for row in rows:
+            units = row.get("claimed_units", [])
+            if args.max_questions_per_request is not None:
+                units = units[: args.max_questions_per_request]
+            questions = [
+                {
+                    "question_id": str(unit["unit_id"]),
+                    "category": str(unit.get("category", "")),
+                    "question": question_for(unit),
+                }
+                for unit in units
+                if isinstance(unit, dict) and isinstance(unit.get("unit_id"), str)
+            ]
+            if not questions:
+                skipped += 1
+                continue
+            request_id = hashlib.blake2b(
+                f"cbu_vqa_v1:{row.get('request_id')}:{row.get('caption_id')}".encode("utf-8"),
+                digest_size=16,
+            ).hexdigest()
+            out = {
+                "request_id": request_id,
+                "task": "cbu_vqa_v1",
+                "surface": row.get("surface"),
+                "caption_id": row.get("caption_id"),
+                "source_row": row.get("source_row"),
+                "token_budget": row.get("token_budget"),
+                "questions": questions,
+                "system_prompt": SYSTEM_PROMPT,
+                "user_prompt": user_prompt(questions),
+                "image_url": row.get("image_url"),
+                "image_path": row.get("image_path"),
+                "image_sha256": row.get("image_sha256"),
+                "pair_id": row.get("pair_id"),
+                "pair_key": row.get("pair_key"),
+                "public_lookup_key": row.get("public_lookup_key"),
+                "family": row.get("family"),
+            }
+            handle.write(json.dumps(out, ensure_ascii=False) + "\n")
+            written += 1
+    manifest = {
+        "task": "cbu_vqa_v1",
+        "input": args.input,
+        "output": str(output),
+        "requests": written,
+        "skipped": skipped,
+        "sample_records": args.sample_records,
+        "sample_seed": args.sample_seed,
+        "max_questions_per_request": args.max_questions_per_request,
+    }
+    output.with_suffix(".manifest.json").write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps(manifest, indent=2, ensure_ascii=False))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
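
The user prompt above asks the judge to return exactly one answer per input `question_id`, using `yes`, `no`, or `uncertain`. The response format and the downstream summarizer are not shown in this view, so the following is only a sketch of how that contract could be checked. It assumes each answer is a dict with `question_id` and `answer` keys; both the key names and the helper `validate_vqa_answers` are hypothetical.

```python
from __future__ import annotations

ALLOWED_ANSWERS = {"yes", "no", "uncertain"}


def validate_vqa_answers(questions: list[dict], answers: list[dict]) -> list[str]:
    """Return a list of contract violations; an empty list means the response is usable."""
    problems: list[str] = []
    expected = [str(q["question_id"]) for q in questions]
    seen: dict[str, int] = {}
    for item in answers:
        qid = str(item.get("question_id"))
        seen[qid] = seen.get(qid, 0) + 1
        # Assumption: the label lives under the "answer" key.
        if item.get("answer") not in ALLOWED_ANSWERS:
            problems.append(f"bad answer label for {qid}: {item.get('answer')!r}")
    for qid in expected:
        count = seen.get(qid, 0)
        if count == 0:
            problems.append(f"missing answer for {qid}")
        elif count > 1:
            problems.append(f"duplicate answers for {qid}")
    for qid in seen:
        if qid not in expected:
            problems.append(f"unexpected question_id {qid}")
    return problems


if __name__ == "__main__":
    qs = [{"question_id": "c1:u0000"}, {"question_id": "c1:u0001"}]
    print(validate_vqa_answers(qs, [{"question_id": "c1:u0000", "answer": "yes"}]))
```

Rejecting responses that fail this check before aggregation keeps missing or duplicated answers from silently shifting per-category rates.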

eval_code/scripts/build_grounded_cbu_verify_requests.py ADDED

	@@ -0,0 +1,197 @@

+#!/usr/bin/env python3
+"""Build exact-unit image audit requests from claimed-CBU responses."""
+from __future__ import annotations
+import argparse
+import hashlib
+import json
+from pathlib import Path
+from typing import Any
+SYSTEM_PROMPT = """You are a strict visual grounding judge for text-to-image training captions.
+Return only valid compact JSON. Judge only whether each provided caption-derived unit is visibly supported by the image."""
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Build exact-unit grounded-CBU verification requests")
+    parser.add_argument("--claimed-responses", required=True)
+    parser.add_argument("--source-jsonl", required=True, help="Fair-slice JSONL used to build the claimed requests")
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--max-requests", type=int, default=None)
+    parser.add_argument("--max-units-per-request", type=int, default=None, help="Debug cap only; omit for main audit")
+    parser.add_argument("--image-path-field", default=None)
+    parser.add_argument(
+        "--require-local-image",
+        action="store_true",
+        help="Skip rows without a local image path. Use for reproducible image-grounded audits.",
+    )
+    parser.add_argument(
+        "--surface-filter",
+        default=None,
+        help="If set, keep only claimed responses whose request.surface exactly matches this value.",
+    )
+    return parser.parse_args()
+def iter_ok_claims(path: Path, surface_filter: str | None = None) -> list[dict[str, Any]]:
+    rows: list[dict[str, Any]] = []
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if not line.strip():
+                continue
+            row = json.loads(line)
+            parsed = row.get("parsed")
+            request = row.get("request", {})
+            if surface_filter is not None and request.get("surface") != surface_filter:
+                continue
+            units = parsed.get("claimed_units") if isinstance(parsed, dict) else None
+            if not row.get("ok") or not isinstance(units, list) or not units:
+                continue
+            rows.append({"request": request, "parsed": parsed})
+    return rows
+def load_source_rows(source_jsonl: Path, needed: set[int]) -> dict[int, dict[str, Any]]:
+    out: dict[int, dict[str, Any]] = {}
+    with source_jsonl.open("r", encoding="utf-8") as handle:
+        for index, line in enumerate(handle):
+            if index in needed and line.strip():
+                out[index] = json.loads(line)
+                if len(out) == len(needed):
+                    break
+    return out
+def image_fields(source_row: dict[str, Any], image_path_field: str | None) -> dict[str, Any]:
+    image = source_row.get("image") if isinstance(source_row.get("image"), dict) else {}
+    metadata = source_row.get("metadata") if isinstance(source_row.get("metadata"), dict) else {}
+    local_record = source_row.get("local_record") if isinstance(source_row.get("local_record"), dict) else {}
+    public_record = source_row.get("public_record") if isinstance(source_row.get("public_record"), dict) else {}
+    if image_path_field:
+        image_path = source_row.get(image_path_field)
+    else:
+        image_path = (
+            image.get("local_abs_path")
+            or local_record.get("image_abs_path")
+            or source_row.get("image_abs_path")
+            or source_row.get("image_path")
+        )
+    image_url = (
+        image.get("url")
+        or source_row.get("url")
+        or source_row.get("image_url")
+        or metadata.get("canonical_url")
+        or public_record.get("url")
+        or source_row.get("pair_key")
+    )
+    return {
+        "image_url": image_url,
+        "image_path": image_path,
+        "image_sha256": image.get("sha256") or source_row.get("sha256"),
+        "pair_id": source_row.get("pair_id"),
+        "pair_key": source_row.get("pair_key"),
+        "public_lookup_key": source_row.get("public_lookup_key"),
+        "family": source_row.get("family"),
+    }
+def normalize_unit(raw: dict[str, Any], caption_id: str, index: int) -> dict[str, str]:
+    return {
+        "unit_id": f"{caption_id}:u{index:04d}",
+        "category": str(raw.get("category", "")),
+        "unit": str(raw.get("unit", "")),
+        "span": str(raw.get("span", "")),
+        "target": str(raw.get("target", "")),
+    }
+def user_prompt(caption: str, units: list[dict[str, str]]) -> str:
+    unit_json = json.dumps(units, ensure_ascii=False, separators=(",", ":"))
+    return (
+        "Verify the visual grounding of each provided caption-derived unit.\n"
+        "Rules:\n"
+        "- Do not add, remove, split, merge, rename, or reinterpret unit_id values.\n"
+        "- Use grounded when the image visibly supports the unit.\n"
+        "- Use unsupported when the image contradicts the unit or lacks visible support.\n"
+        "- Use uncertain when the unit is too fine-grained, occluded, unreadable, or visually ambiguous.\n"
+        "- Use invalid_text_unit only when the unit is not a meaningful visual claim from the caption.\n"
+        "- Use not_a_visual_claim only for non-visual metadata or captioner-language units.\n"
+        "- Keep evidence short; cite only visible image evidence.\n"
+        "Return JSON with caption_id and unit_results, exactly one result for each input unit_id.\n\n"
+        f"caption={caption}\n"
+        f"claimed_units={unit_json}"
+    )
+def main() -> int:
+    args = parse_args()
+    claims = iter_ok_claims(Path(args.claimed_responses), args.surface_filter)
+    if args.max_requests is not None:
+        claims = claims[: args.max_requests]
+    needed = {int(item["request"]["source_row"]) for item in claims if item["request"].get("source_row") is not None}
+    sources = load_source_rows(Path(args.source_jsonl), needed)
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    written = 0
+    skipped = 0
+    with output.open("w", encoding="utf-8") as handle:
+        for item in claims:
+            req = item["request"]
+            source_row = sources.get(int(req["source_row"])) if req.get("source_row") is not None else None
+            if source_row is None:
+                skipped += 1
+                continue
+            image_info = image_fields(source_row, args.image_path_field)
+            if args.require_local_image and not image_info.get("image_path"):
+                skipped += 1
+                continue
+            caption_id = str(item["parsed"].get("caption_id") or req.get("caption_id"))
+            units = [
+                normalize_unit(raw, caption_id, index)
+                for index, raw in enumerate(item["parsed"].get("claimed_units", []))
+                if isinstance(raw, dict)
+            ]
+            if args.max_units_per_request is not None:
+                units = units[: args.max_units_per_request]
+            if not units:
+                skipped += 1
+                continue
+            row = {
+                "request_id": hashlib.blake2b(
+                    f"grounded_cbu_verify_v2:{req.get('request_id')}:{caption_id}".encode("utf-8"),
+                    digest_size=16,
+                ).hexdigest(),
+                "task": "grounded_cbu_verify_v2",
+                "surface": req.get("surface"),
+                "caption_id": caption_id,
+                "source_row": req.get("source_row"),
+                "token_budget": req.get("token_budget"),
+                "caption": req.get("caption"),
+                "source_caption": req.get("source_caption"),
+                "claimed_units": units,
+                "system_prompt": SYSTEM_PROMPT,
+                "user_prompt": user_prompt(str(req.get("caption", "")), units),
+                **image_info,
+            }
+            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
+            written += 1
+    manifest = {
+        "task": "grounded_cbu_verify_v2",
+        "claimed_responses": args.claimed_responses,
+        "source_jsonl": args.source_jsonl,
+        "output": str(output),
+        "requests": written,
+        "skipped": skipped,
+        "max_requests": args.max_requests,
+        "max_units_per_request": args.max_units_per_request,
+        "surface_filter": args.surface_filter,
+    }
+    output.with_suffix(".manifest.json").write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps(manifest, indent=2, ensure_ascii=False))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
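
The verification prompt above defines five verdict labels: `grounded`, `unsupported`, `uncertain`, `invalid_text_unit`, and `not_a_visual_claim`. The released summarizer is not shown in this view, so the sketch below is only one plausible way to tally them. It assumes each entry in `unit_results` stores its label under a `verdict` key and that the grounded rate is taken over the three visual verdicts; the key name, the denominator choice, and the helper `tally_verdicts` are all assumptions.

```python
from __future__ import annotations

from collections import Counter

VERDICTS = ("grounded", "unsupported", "uncertain", "invalid_text_unit", "not_a_visual_claim")


def tally_verdicts(unit_results: list[dict], verdict_field: str = "verdict") -> dict:
    """Count verdict labels and compute a grounded rate over visual units (hypothetical)."""
    counts: Counter = Counter()
    for result in unit_results:
        label = result.get(verdict_field)
        # Anything outside the prompt's label set is bucketed separately.
        counts[label if label in VERDICTS else "unrecognized"] += 1
    # Assumption: invalid and non-visual units are excluded from the denominator.
    visual = counts["grounded"] + counts["unsupported"] + counts["uncertain"]
    rate = counts["grounded"] / visual if visual else None
    return {"counts": dict(counts), "visual_units": visual, "grounded_rate": rate}


if __name__ == "__main__":
    demo = [
        {"unit_id": "c1:u0000", "verdict": "grounded"},
        {"unit_id": "c1:u0001", "verdict": "unsupported"},
    ]
    print(tally_verdicts(demo))
```

Whether `uncertain` belongs in the denominator is a reporting decision; excluding it would raise the rate, so the convention should be stated next to any table built this way.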

eval_code/scripts/caption_embedding_vendi.py ADDED

	@@ -0,0 +1,1330 @@

+#!/usr/bin/env python3
+"""Encode caption text and compute block Vendi scores.
+The script is intentionally split into subcommands, including:
+- `inspect`: report tokenizer/config limits for candidate encoders
+- `encode`: cache normalized text embeddings from JSONL captions
+- `encode-bge-m3`: cache official BGE-M3 dense embeddings via FlagEmbedding
+- `vendi`: compute sampled block Vendi/effective-rank summaries from caches
+The encoder path is GPU-ready but the same code can be sanity-checked on CPU
+with a tiny sample before H200 allocation.
+"""
+from __future__ import annotations
+import argparse
+import json
+import math
+import random
+import sys
+import time
+import types
+from dataclasses import asdict, dataclass
+from pathlib import Path
+from typing import Any, Iterable
+import numpy as np
+import torch
+@dataclass
+class EmbeddingShard:
+    path: str
+    rows: int
+    dim: int
+    dtype: str
+    start_row: int
+    end_row: int
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Caption embedding cache and Vendi utilities")
+    subparsers = parser.add_subparsers(dest="cmd", required=True)
+    inspect = subparsers.add_parser("inspect", help="Inspect tokenizer/model text limits")
+    inspect.add_argument("--model", action="append", required=True, help="HF model id/path; may be repeated")
+    inspect.add_argument("--trust-remote-code", action="store_true")
+    inspect.add_argument(
+        "--compat-remote-code",
+        action="store_true",
+        help="Install small compatibility shims for older HF remote-code embedding models.",
+    )
+    encode = subparsers.add_parser("encode", help="Extract normalized text embeddings")
+    encode.add_argument("--input", required=True, help="JSONL input")
+    encode.add_argument("--text-field", default="caption")
+    encode.add_argument("--id-field", default=None)
+    encode.add_argument("--model", required=True)
+    encode.add_argument("--output-dir", required=True)
+    encode.add_argument("--max-records", type=int, default=None)
+    encode.add_argument(
+        "--sample-records",
+        type=int,
+        default=None,
+        help="Reservoir-sample this many records before modulo splitting. Mutually exclusive with --max-records.",
+    )
+    encode.add_argument("--sample-seed", type=int, default=0)
+    encode.add_argument("--split-count", type=int, default=1, help="Modulo split count for multi-GPU extraction")
+    encode.add_argument("--split-index", type=int, default=0, help="Modulo split index for this worker")
+    encode.add_argument("--batch-size", type=int, default=256)
+    encode.add_argument("--max-length", type=int, default=None)
+    encode.add_argument("--device", default="cuda")
+    encode.add_argument("--dtype", default="float16", choices=["float16", "bfloat16", "float32"])
+    encode.add_argument("--embedding-dtype", default="float16", choices=["float16", "float32"])
+    encode.add_argument("--shard-rows", type=int, default=100_000)
+    encode.add_argument("--pooling", default="auto", choices=["auto", "cls", "mean", "pooler", "last"])
+    encode.add_argument("--padding-side", default=None, choices=["left", "right"], help="Override tokenizer padding side")
+    encode.add_argument("--text-prefix", default="", help="Prefix applied to every text before tokenization")
+    encode.add_argument(
+        "--text-template",
+        default=None,
+        help="Python format template applied before tokenization. Must contain '{text}'. Overrides --text-prefix.",
+    )
+    encode.add_argument("--trust-remote-code", action="store_true")
+    encode.add_argument(
+        "--compat-remote-code",
+        action="store_true",
+        help="Install small compatibility shims for older HF remote-code embedding models.",
+    )
+    encode.add_argument("--compile", action="store_true")
+    bge = subparsers.add_parser("encode-bge-m3", help="Extract official BGE-M3 dense embeddings via FlagEmbedding")
+    bge.add_argument("--input", required=True, help="JSONL input")
+    bge.add_argument("--text-field", default="caption")
+    bge.add_argument("--id-field", default=None)
+    bge.add_argument("--model", default="BAAI/bge-m3")
+    bge.add_argument("--output-dir", required=True)
+    bge.add_argument("--max-records", type=int, default=None)
+    bge.add_argument("--sample-records", type=int, default=None)
+    bge.add_argument("--sample-seed", type=int, default=0)
+    bge.add_argument("--split-count", type=int, default=1)
+    bge.add_argument("--split-index", type=int, default=0)
+    bge.add_argument("--batch-size", type=int, default=256)
+    bge.add_argument("--max-length", type=int, default=512)
+    bge.add_argument("--device", default="cuda")
+    bge.add_argument("--use-fp16", action=argparse.BooleanOptionalAction, default=True)
+    bge.add_argument("--embedding-dtype", default="float16", choices=["float16", "float32"])
+    bge.add_argument("--shard-rows", type=int, default=100_000)
+    bge.add_argument("--text-prefix", default="", help="Prefix applied to every text before encoding")
+    bge.add_argument("--text-template", default=None, help="Python format template containing '{text}'")
+    bge.add_argument("--encode-mode", default="corpus", choices=["corpus", "queries", "encode"])
+    bge.add_argument("--query-instruction", default=None, help="Optional BGEM3 query_instruction_for_retrieval")
+    bge.add_argument("--query-instruction-format", default="{}{}", help="BGEM3 query_instruction_format")
+    st = subparsers.add_parser(
+        "encode-sentence-transformer",
+        help="Extract embeddings with SentenceTransformer's model-specific encode protocol",
+    )
+    st.add_argument("--input", required=True, help="JSONL input")
+    st.add_argument("--text-field", default="caption")
+    st.add_argument("--id-field", default=None)
+    st.add_argument("--model", required=True)
+    st.add_argument("--output-dir", required=True)
+    st.add_argument("--max-records", type=int, default=None)
+    st.add_argument("--sample-records", type=int, default=None)
+    st.add_argument("--sample-seed", type=int, default=0)
+    st.add_argument("--split-count", type=int, default=1)
+    st.add_argument("--split-index", type=int, default=0)
+    st.add_argument("--batch-size", type=int, default=256)
+    st.add_argument("--max-length", type=int, default=None)
+    st.add_argument("--device", default="cuda")
+    st.add_argument("--embedding-dtype", default="float16", choices=["float16", "float32"])
+    st.add_argument("--shard-rows", type=int, default=100_000)
+    st.add_argument("--text-prefix", default="", help="Prefix applied to every text before encoding")
+    st.add_argument("--text-template", default=None, help="Python format template containing '{text}'")
+    st.add_argument("--prompt-name", default=None, help="SentenceTransformer prompt_name, e.g. document or query")
+    vendi = subparsers.add_parser("vendi", help="Compute sampled block Vendi from embedding cache")
+    vendi.add_argument("--manifest", required=True)
+    vendi.add_argument("--output", required=True)
+    vendi.add_argument("--block-size", type=int, default=4096)
+    vendi.add_argument("--blocks", type=int, default=64)
+    vendi.add_argument(
+        "--sampling",
+        choices=["random", "partition"],
+        default="random",
+        help="random samples blocks; partition shuffles once and uses every row in disjoint blocks.",
+    )
+    vendi.add_argument("--seed", type=int, default=0)
+    vendi.add_argument("--device", default="cuda")
+    vendi.add_argument("--matrix-device", default=None, help="Override device for eigvalsh; defaults to --device")
+    vendi.add_argument("--dtype", default="float32", choices=["float16", "bfloat16", "float32"])
+    geom = subparsers.add_parser("geometry", help="Compute embedding-distribution geometry summaries")
+    geom.add_argument("--manifest", required=True)
+    geom.add_argument("--output", required=True)
+    geom.add_argument("--max-rows", type=int, default=100_000)
+    geom.add_argument("--seed", type=int, default=0)
+    geom.add_argument("--device", default="cuda")
+    geom.add_argument("--dtype", default="float32", choices=["float16", "bfloat16", "float32"])
+    knn = subparsers.add_parser("knn", help="Compute exact nearest-neighbor support between two embedding caches")
+    knn.add_argument("--query-manifest", required=True)
+    knn.add_argument("--gallery-manifest", required=True)
+    knn.add_argument("--output", required=True)
+    knn.add_argument("--query-max-rows", type=int, default=None)
+    knn.add_argument("--gallery-max-rows", type=int, default=None)
+    knn.add_argument("--seed", type=int, default=0)
+    knn.add_argument("--device", default="cuda")
+    knn.add_argument("--dtype", default="float16", choices=["float16", "bfloat16", "float32"])
+    knn.add_argument("--query-batch-size", type=int, default=1024)
+    knn.add_argument(
+        "--gallery-chunk-size",
+        type=int,
+        default=0,
+        help="0 keeps the full gallery resident on device; positive values stream gallery chunks.",
+    )
+    knn.add_argument("--thresholds", default="0.60,0.70,0.75,0.80,0.85,0.90")
+    knn.add_argument("--save-scores", default=None, help="Optional .npy path for per-query nearest-neighbor cosine scores")
+    support = subparsers.add_parser("support", help="Compute PRDC-style query-in-gallery manifold support")
+    support.add_argument("--query-manifest", required=True, help="Prompt/query embedding manifest P")
+    support.add_argument("--gallery-manifest", required=True, help="Caption/support embedding manifest C")
+    support.add_argument("--output", required=True)
+    support.add_argument("--query-max-rows", type=int, default=None)
+    support.add_argument("--gallery-max-rows", type=int, default=None)
+    support.add_argument("--seed", type=int, default=0)
+    support.add_argument("--k", type=int, default=10)
+    support.add_argument("--device", default="cuda")
+    support.add_argument("--dtype", default="float16", choices=["float16", "bfloat16", "float32"])
+    support.add_argument("--query-batch-size", type=int, default=512)
+    support.add_argument("--gallery-batch-size", type=int, default=512)
+    support.add_argument("--save-scores", default=None, help="Optional .npz path for per-query support scores")
+    return parser.parse_args()
+def torch_dtype(name: str) -> torch.dtype:
+    return {"float16": torch.float16, "bfloat16": torch.bfloat16, "float32": torch.float32}[name]
+def numpy_dtype(name: str) -> np.dtype:
+    return {"float16": np.float16, "float32": np.float32}[name]
+def load_transformers():
+    try:
+        from transformers import AutoConfig, AutoModel, AutoTokenizer
+    except ImportError as exc:  # pragma: no cover - depends on uv environment
+        raise SystemExit("transformers is required. Run through `uv run` after sourcing .env.") from exc
+    return AutoConfig, AutoModel, AutoTokenizer
+def install_remote_code_compat() -> None:
+    """Compatibility shims for embedding-model remote code.
+    Jina v2 imports `transformers.onnx.OnnxConfig`, which is absent in the
+    current Transformers build used by this project. Jina v3 also expects the
+    legacy `all_tied_weights_keys` property on PreTrainedModel. This function
+    also backfills `find_pruneable_heads_and_indices` and `prune_linear_layer`
+    on `transformers.pytorch_utils` when they are missing. The shims are
+    intentionally minimal and only installed when requested.
+    """
+    try:
+        import transformers
+        from transformers import PreTrainedModel
+    except ImportError:
+        return
+    if "transformers.onnx" not in sys.modules:
+        onnx_module = types.ModuleType("transformers.onnx")
+        class OnnxConfig:  # pragma: no cover - exercised by remote code import
+            pass
+        onnx_module.OnnxConfig = OnnxConfig
+        sys.modules["transformers.onnx"] = onnx_module
+        setattr(transformers, "onnx", onnx_module)
+    if not hasattr(PreTrainedModel, "all_tied_weights_keys"):
+        def all_tied_weights_keys(self: Any) -> dict[str, None]:
+            stored = getattr(self, "_compat_all_tied_weights_keys", None)
+            if stored is not None:
+                return stored
+            keys = getattr(self, "_tied_weights_keys", None) or []
+            return {key: None for key in keys}
+        def set_all_tied_weights_keys(self: Any, value: Any) -> None:
+            if isinstance(value, dict):
+                self._compat_all_tied_weights_keys = value
+            elif value is None:
+                self._compat_all_tied_weights_keys = {}
+            else:
+                self._compat_all_tied_weights_keys = {key: None for key in value}
+        PreTrainedModel.all_tied_weights_keys = property(  # type: ignore[attr-defined]
+            all_tied_weights_keys,
+            set_all_tied_weights_keys,
+        )
+    try:
+        import transformers.pytorch_utils as pytorch_utils
+        if not hasattr(pytorch_utils, "find_pruneable_heads_and_indices"):
+            def find_pruneable_heads_and_indices(
+                heads: list[int] | set[int],
+                n_heads: int,
+                head_size: int,
+                already_pruned_heads: set[int],
+            ) -> tuple[set[int], torch.Tensor]:
+                heads = set(heads) - already_pruned_heads
+                mask = torch.ones(n_heads, head_size)
+                for head in heads:
+                    pruned_before = sum(1 if pruned_head < head else 0 for pruned_head in already_pruned_heads)
+                    mask[head - pruned_before] = 0
+                mask = mask.view(-1).contiguous().eq(1)
+                index = torch.arange(len(mask))[mask].long()
+                return heads, index
+            pytorch_utils.find_pruneable_heads_and_indices = find_pruneable_heads_and_indices
+        if not hasattr(pytorch_utils, "prune_linear_layer"):
+            from transformers.modeling_utils import prune_linear_layer
+            pytorch_utils.prune_linear_layer = prune_linear_layer
+    except Exception:
+        # Best-effort shim: skip silently if the pruning helpers cannot be restored.
+        pass
+def iter_jsonl(
+    path: Path,
+    text_field: str,
+    id_field: str | None,
+    max_records: int | None,
+    split_count: int,
+    split_index: int,
+) -> Iterable[tuple[str, str | None, int]]:
+    emitted = 0
+    seen = 0
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if max_records is not None and emitted >= max_records:
+                break
+            line = line.strip()
+            if not line:
+                # Blank lines still advance `seen`, so row_index is the physical line number.
+                seen += 1
+                continue
+            row_index = seen
+            seen += 1
+            if row_index % split_count != split_index:
+                continue
+            row = json.loads(line)
+            text = row.get(text_field)
+            if not isinstance(text, str):
+                text = ""
+            row_id = str(row.get(id_field)) if id_field and row.get(id_field) is not None else None
+            emitted += 1
+            yield text, row_id, row_index
+def iter_jsonl_sampled(
+    path: Path,
+    text_field: str,
+    id_field: str | None,
+    sample_records: int,
+    sample_seed: int,
+    split_count: int,
+    split_index: int,
+) -> Iterable[tuple[str, str | None, int]]:
+    if sample_records < 1:
+        raise SystemExit("--sample-records must be >= 1")
+    rng = random.Random(sample_seed)
+    reservoir: list[tuple[str, str | None, int]] = []
+    seen = 0
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            line = line.strip()
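+            if not line:
+                # Blank lines are skipped without advancing `seen`, so row_index counts
+                # non-blank records here (iter_jsonl counts physical lines instead).
+                continue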
+            row_index = seen
+            seen += 1
+            row = json.loads(line)
+            text = row.get(text_field)
+            if not isinstance(text, str):
+                text = ""
+            row_id = str(row.get(id_field)) if id_field and row.get(id_field) is not None else None
+            item = (text, row_id, row_index)
+            if len(reservoir) < sample_records:
+                reservoir.append(item)
+            else:
+                replace_index = rng.randrange(seen)
+                if replace_index < sample_records:
+                    reservoir[replace_index] = item
+    reservoir.sort(key=lambda item: item[2])
+    for emitted, item in enumerate(reservoir):
+        if emitted % split_count == split_index:
+            yield item
+def batched(items: Iterable[tuple[str, str | None, int]], batch_size: int) -> Iterable[list[tuple[str, str | None, int]]]:
+    batch: list[tuple[str, str | None, int]] = []
+    for item in items:
+        batch.append(item)
+        if len(batch) >= batch_size:
+            yield batch
+            batch = []
+    if batch:
+        yield batch
+def config_text_limit(config: Any) -> int | None:
+    candidates = []
+    for obj in [config, getattr(config, "text_config", None)]:
+        if obj is None:
+            continue
+        for name in ["max_position_embeddings", "max_sequence_length", "context_length", "seq_length"]:
+            value = getattr(obj, name, None)
+            if isinstance(value, int) and value > 0:
+                candidates.append(value)
+    return min(candidates) if candidates else None
+def inspect_models(args: argparse.Namespace) -> int:
+    if args.compat_remote_code:
+        install_remote_code_compat()
+    AutoConfig, _AutoModel, AutoTokenizer = load_transformers()
+    rows = []
+    for model_id in args.model:
+        tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=args.trust_remote_code)
+        config = AutoConfig.from_pretrained(model_id, trust_remote_code=args.trust_remote_code)
+        rows.append(
+            {
+                "model": model_id,
+                "model_type": getattr(config, "model_type", None),
+                "tokenizer_model_max_length": getattr(tokenizer, "model_max_length", None),
+                "config_text_limit": config_text_limit(config),
+                "text_config_max_position_embeddings": getattr(getattr(config, "text_config", None), "max_position_embeddings", None),
+                "max_position_embeddings": getattr(config, "max_position_embeddings", None),
+                "projection_dim": getattr(config, "projection_dim", None) or getattr(config, "projection_size", None),
+                "hidden_size": getattr(config, "hidden_size", None) or getattr(getattr(config, "text_config", None), "hidden_size", None),
+            }
+        )
+    print(json.dumps(rows, indent=2, ensure_ascii=False))
+    return 0
+def load_encoder(
+    model_id: str,
+    device: str,
+    dtype: str,
+    trust_remote_code: bool,
+    compile_model: bool,
+    compat_remote_code: bool,
+    padding_side: str | None,
+):
+    if compat_remote_code:
+        install_remote_code_compat()
+    AutoConfig, AutoModel, AutoTokenizer = load_transformers()
+    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=trust_remote_code)
+    if padding_side is not None:
+        tokenizer.padding_side = padding_side
+    config = None
+    if compat_remote_code:
+        config = AutoConfig.from_pretrained(model_id, trust_remote_code=trust_remote_code)
+        for name, value in {
+            "is_decoder": False,
+            "add_cross_attention": False,
+            "chunk_size_feed_forward": 0,
+            "use_return_dict": True,
+            "output_attentions": False,
+            "output_hidden_states": False,
+        }.items():
+            if not hasattr(config, name):
+                setattr(config, name, value)
+    model = AutoModel.from_pretrained(
+        model_id,
+        config=config,
+        dtype=torch_dtype(dtype),
+        trust_remote_code=trust_remote_code,
+    )
+    model.eval().to(device)
+    if compile_model:
+        model = torch.compile(model)
+    return tokenizer, model
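+def pool_outputs(model: Any, outputs: Any, encoded: dict[str, torch.Tensor], pooling: str) -> torch.Tensor:
+    """Reduce model outputs to one embedding per input text.
+    Precedence: `text_embeds` if present (regardless of `pooling`), then
+    `pooler_output` for "auto"/"pooler", then the requested strategy over the
+    last hidden state. "last" takes the final attended token (position -1 when
+    every row ends in an attended token, i.e. left padding or no padding).
+    "auto"/"mean" use attention-masked mean pooling. Any remaining case, such
+    as "pooler" without a pooler output or a missing attention mask, falls
+    back to the first token.
+    """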
+    if hasattr(outputs, "text_embeds") and outputs.text_embeds is not None:
+        return outputs.text_embeds
+    if hasattr(outputs, "pooler_output") and outputs.pooler_output is not None and pooling in {"auto", "pooler"}:
+        return outputs.pooler_output
+    hidden = outputs.last_hidden_state if hasattr(outputs, "last_hidden_state") else outputs[0]
+    if pooling == "last":
+        attention = encoded.get("attention_mask")
+        if attention is None:
+            return hidden[:, -1]
+        left_padding = bool((attention[:, -1].sum() == attention.shape[0]).item())
+        if left_padding:
+            return hidden[:, -1]
+        sequence_lengths = attention.sum(dim=1) - 1
+        batch_size = hidden.shape[0]
+        return hidden[torch.arange(batch_size, device=hidden.device), sequence_lengths]
+    if pooling == "cls":
+        return hidden[:, 0]
+    attention = encoded.get("attention_mask")
+    if pooling in {"auto", "mean"} and attention is not None:
+        weights = attention.to(hidden.dtype).unsqueeze(-1)
+        return (hidden * weights).sum(dim=1) / weights.sum(dim=1).clamp_min(1.0)
+    return hidden[:, 0]
+@torch.inference_mode()
+def encode_batch(
+    tokenizer: Any,
+    model: Any,
+    texts: list[str],
+    device: str,
+    max_length: int | None,
+    pooling: str,
+) -> torch.Tensor:
+    encoded = tokenizer(
+        texts,
+        padding=True,
+        truncation=True,
+        max_length=max_length,
+        return_tensors="pt",
+    )
+    encoded = {key: value.to(device) for key, value in encoded.items()}
+    if hasattr(model, "get_text_features"):
+        features = model.get_text_features(**encoded)
+        if not isinstance(features, torch.Tensor):
+            features = pool_outputs(model, features, encoded, pooling)
+    else:
+        outputs = model(**encoded)
+        features = pool_outputs(model, outputs, encoded, pooling)
+    features = torch.nn.functional.normalize(features.float(), dim=-1)
+    return features.cpu()
+def flush_shard(
+    output_dir: Path,
+    shard_index: int,
+    start_row: int,
+    rows: list[np.ndarray],
+    embedding_dtype: str,
+) -> EmbeddingShard:
+    array = np.asarray(rows, dtype=numpy_dtype(embedding_dtype))
+    path = output_dir / f"embeddings-{shard_index:05d}.npy"
+    np.save(path, array)
+    return EmbeddingShard(
+        path=str(path),
+        rows=int(array.shape[0]),
+        dim=int(array.shape[1]) if array.ndim == 2 else 0,
+        dtype=embedding_dtype,
+        start_row=start_row,
+        end_row=start_row + int(array.shape[0]),
+    )
+def encode_main(args: argparse.Namespace) -> int:
+    # Validate CLI arguments before the (potentially slow) model load so bad input fails fast.
+    if args.split_count < 1:
+        raise SystemExit("--split-count must be >= 1")
+    if not (0 <= args.split_index < args.split_count):
+        raise SystemExit("--split-index must satisfy 0 <= split_index < split_count")
+    if args.sample_records is not None and args.max_records is not None:
+        raise SystemExit("--sample-records and --max-records are mutually exclusive")
+    if args.text_template is not None and "{text}" not in args.text_template:
+        raise SystemExit("--text-template must contain '{text}'")
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    tokenizer, model = load_encoder(
+        args.model,
+        args.device,
+        args.dtype,
+        args.trust_remote_code,
+        args.compile,
+        args.compat_remote_code,
+        args.padding_side,
+    )
+    config_limit = config_text_limit(getattr(model, "config", None))
+    max_length = args.max_length or config_limit or getattr(tokenizer, "model_max_length", None)
+    # HF tokenizers report a huge sentinel `model_max_length` when no limit is configured; treat it as unset.
+    if isinstance(max_length, int) and max_length > 1_000_000:
+        max_length = None
+    rows: list[np.ndarray] = []
+    row_ids: list[str | None] = []
+    row_indices: list[int] = []
+    shards: list[EmbeddingShard] = []
+    total = 0
+    shard_start = 0
+    started = time.time()
+    if args.sample_records is not None:
+        source = iter_jsonl_sampled(
+            Path(args.input),
+            args.text_field,
+            args.id_field,
+            args.sample_records,
+            args.sample_seed,
+            args.split_count,
+            args.split_index,
+        )
+    else:
+        source = iter_jsonl(
+            Path(args.input),
+            args.text_field,
+            args.id_field,
+            args.max_records,
+            args.split_count,
+            args.split_index,
+        )
+    for batch in batched(source, args.batch_size):
+        texts = [text for text, _row_id, _row_index in batch]
+        if args.text_template is not None:
+            texts = [args.text_template.format(text=text) for text in texts]
+        elif args.text_prefix:
+            texts = [f"{args.text_prefix}{text}" for text in texts]
+        ids = [row_id for _text, row_id, _row_index in batch]
+        indices = [row_index for _text, _row_id, row_index in batch]
+        features = encode_batch(tokenizer, model, texts, args.device, max_length, args.pooling)
+        rows.extend(features.numpy())
+        row_ids.extend(ids)
+        row_indices.extend(indices)
+        total += len(batch)
+        if len(rows) >= args.shard_rows:
+            shards.append(flush_shard(output_dir, len(shards), shard_start, rows, args.embedding_dtype))
+            shard_start += len(rows)
+            rows = []
+    if rows:
+        shards.append(flush_shard(output_dir, len(shards), shard_start, rows, args.embedding_dtype))
+    if row_indices:
+        with (output_dir / "row_ids.jsonl").open("w", encoding="utf-8") as handle:
+            for index, (row_id, row_index) in enumerate(zip(row_ids, row_indices, strict=True)):
+                handle.write(
+                    json.dumps(
+                        {"split_row": index, "source_row": row_index, "id": row_id},
+                        ensure_ascii=False,
+                    )
+                    + "\n"
+                )
+    elapsed = time.time() - started
+    manifest = {
+        "input": args.input,
+        "text_field": args.text_field,
+        "id_field": args.id_field,
+        "model": args.model,
+        "max_length": max_length,
+        "max_records": args.max_records,
+        "sample_records": args.sample_records,
+        "sample_seed": args.sample_seed,
+        "split_count": args.split_count,
+        "split_index": args.split_index,
+        "pooling": args.pooling,
+        "padding_side": getattr(tokenizer, "padding_side", None),
+        "text_prefix": args.text_prefix,
+        "text_template": args.text_template,
+        "compat_remote_code": args.compat_remote_code,
+        "device": args.device,
+        "dtype": args.dtype,
+        "embedding_dtype": args.embedding_dtype,
+        "rows": total,
+        "seconds": round(elapsed, 3),
+        "rows_per_second": round(total / max(elapsed, 1e-6), 3),
+        "shards": [asdict(shard) for shard in shards],
+    }
+    (output_dir / "manifest.json").write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output_dir": str(output_dir), "rows": total, "shards": len(shards), "max_length": max_length}, indent=2))
+    return 0
+def encode_bge_m3_main(args: argparse.Namespace) -> int:
+    try:
+        from FlagEmbedding import BGEM3FlagModel
+    except ImportError as exc:
+        raise SystemExit("FlagEmbedding is required for encode-bge-m3. Install with `uv sync --extra eval`.") from exc
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    if args.split_count < 1:
+        raise SystemExit("--split-count must be >= 1")
+    if not (0 <= args.split_index < args.split_count):
+        raise SystemExit("--split-index must satisfy 0 <= split_index < split_count")
+    if args.sample_records is not None and args.max_records is not None:
+        raise SystemExit("--sample-records and --max-records are mutually exclusive")
+    if args.text_template is not None and "{text}" not in args.text_template:
+        raise SystemExit("--text-template must contain '{text}'")
+    model = BGEM3FlagModel(
+        args.model,
+        normalize_embeddings=True,
+        use_fp16=args.use_fp16,
+        devices=args.device,
+        pooling_method="cls",
+        batch_size=args.batch_size,
+        query_max_length=args.max_length,
+        passage_max_length=args.max_length,
+        return_dense=True,
+        return_sparse=False,
+        return_colbert_vecs=False,
+        query_instruction_for_retrieval=args.query_instruction,
+        query_instruction_format=args.query_instruction_format,
+    )
+    if args.sample_records is not None:
+        source = iter_jsonl_sampled(
+            Path(args.input),
+            args.text_field,
+            args.id_field,
+            args.sample_records,
+            args.sample_seed,
+            args.split_count,
+            args.split_index,
+        )
+    else:
+        source = iter_jsonl(
+            Path(args.input),
+            args.text_field,
+            args.id_field,
+            args.max_records,
+            args.split_count,
+            args.split_index,
+        )
+    rows: list[np.ndarray] = []
+    row_ids: list[str | None] = []
+    row_indices: list[int] = []
+    shards: list[EmbeddingShard] = []
+    total = 0
+    shard_start = 0
+    started = time.time()
+    for batch in batched(source, args.batch_size):
+        texts = [text for text, _row_id, _row_index in batch]
+        if args.text_template is not None:
+            texts = [args.text_template.format(text=text) for text in texts]
+        elif args.text_prefix:
+            texts = [f"{args.text_prefix}{text}" for text in texts]
+        ids = [row_id for _text, row_id, _row_index in batch]
+        indices = [row_index for _text, _row_id, row_index in batch]
+        encode_fn = {
+            "corpus": model.encode_corpus,
+            "queries": model.encode_queries,
+            "encode": model.encode,
+        }[args.encode_mode]
+        encoded = encode_fn(
+            texts,
+            batch_size=args.batch_size,
+            max_length=args.max_length,
+            return_dense=True,
+            return_sparse=False,
+            return_colbert_vecs=False,
+        )
+        features = np.asarray(encoded["dense_vecs"], dtype=np.float32)
+        features /= np.maximum(np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
+        rows.extend(features)
+        row_ids.extend(ids)
+        row_indices.extend(indices)
+        total += len(batch)
+        if len(rows) >= args.shard_rows:
+            shards.append(flush_shard(output_dir, len(shards), shard_start, rows, args.embedding_dtype))
+            shard_start += len(rows)
+            rows = []
+    if rows:
+        shards.append(flush_shard(output_dir, len(shards), shard_start, rows, args.embedding_dtype))
+    if row_indices:
+        with (output_dir / "row_ids.jsonl").open("w", encoding="utf-8") as handle:
+            for index, (row_id, row_index) in enumerate(zip(row_ids, row_indices, strict=True)):
+                handle.write(
+                    json.dumps(
+                        {"split_row": index, "source_row": row_index, "id": row_id},
+                        ensure_ascii=False,
+                    )
+                    + "\n"
+                )
+    elapsed = time.time() - started
+    manifest = {
+        "input": args.input,
+        "text_field": args.text_field,
+        "id_field": args.id_field,
+        "model": args.model,
+        "backend": "FlagEmbedding.BGEM3FlagModel",
+        "max_length": args.max_length,
+        "max_records": args.max_records,
+        "sample_records": args.sample_records,
+        "sample_seed": args.sample_seed,
+        "split_count": args.split_count,
+        "split_index": args.split_index,
+        "pooling": "cls",
+        "encode_mode": args.encode_mode,
+        "normalize_embeddings": True,
+        "text_prefix": args.text_prefix,
+        "text_template": args.text_template,
+        "query_instruction": args.query_instruction,
+        "query_instruction_format": args.query_instruction_format,
+        "device": args.device,
+        "use_fp16": args.use_fp16,
+        "embedding_dtype": args.embedding_dtype,
+        "rows": total,
+        "seconds": round(elapsed, 3),
+        "rows_per_second": round(total / max(elapsed, 1e-6), 3),
+        "shards": [asdict(shard) for shard in shards],
+    }
+    (output_dir / "manifest.json").write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output_dir": str(output_dir), "rows": total, "shards": len(shards), "max_length": args.max_length}, indent=2))
+    return 0
+def encode_sentence_transformer_main(args: argparse.Namespace) -> int:
+    try:
+        from sentence_transformers import SentenceTransformer
+    except ImportError as exc:
+        raise SystemExit("sentence-transformers is required. Run `uv sync --extra eval`.") from exc
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    if args.split_count < 1:
+        raise SystemExit("--split-count must be >= 1")
+    if not (0 <= args.split_index < args.split_count):
+        raise SystemExit("--split-index must satisfy 0 <= split_index < split_count")
+    if args.sample_records is not None and args.max_records is not None:
+        raise SystemExit("--sample-records and --max-records are mutually exclusive")
+    if args.text_template is not None and "{text}" not in args.text_template:
+        raise SystemExit("--text-template must contain '{text}'")
+    model = SentenceTransformer(args.model, device=args.device)
+    if args.max_length is not None:
+        model.max_seq_length = args.max_length
+    max_length = int(model.max_seq_length) if getattr(model, "max_seq_length", None) is not None else args.max_length
+    if args.sample_records is not None:
+        source = iter_jsonl_sampled(
+            Path(args.input),
+            args.text_field,
+            args.id_field,
+            args.sample_records,
+            args.sample_seed,
+            args.split_count,
+            args.split_index,
+        )
+    else:
+        source = iter_jsonl(
+            Path(args.input),
+            args.text_field,
+            args.id_field,
+            args.max_records,
+            args.split_count,
+            args.split_index,
+        )
+    rows: list[np.ndarray] = []
+    row_ids: list[str | None] = []
+    row_indices: list[int] = []
+    shards: list[EmbeddingShard] = []
+    total = 0
+    shard_start = 0
+    started = time.time()
+    for batch in batched(source, args.batch_size):
+        texts = [text for text, _row_id, _row_index in batch]
+        if args.text_template is not None:
+            texts = [args.text_template.format(text=text) for text in texts]
+        elif args.text_prefix:
+            texts = [f"{args.text_prefix}{text}" for text in texts]
+        ids = [row_id for _text, row_id, _row_index in batch]
+        indices = [row_index for _text, _row_id, row_index in batch]
+        encode_kwargs = {
+            "batch_size": args.batch_size,
+            "normalize_embeddings": True,
+            "convert_to_numpy": True,
+            "show_progress_bar": False,
+        }
+        if args.prompt_name is not None:
+            encode_kwargs["prompt_name"] = args.prompt_name
+        features = model.encode(texts, **encode_kwargs)
+        features = np.asarray(features, dtype=np.float32)
+        # encode() already L2-normalizes; re-normalize defensively after the float32 cast.
+        features /= np.maximum(np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
+        rows.extend(features)
+        row_ids.extend(ids)
+        row_indices.extend(indices)
+        total += len(batch)
+        if len(rows) >= args.shard_rows:
+            shards.append(flush_shard(output_dir, len(shards), shard_start, rows, args.embedding_dtype))
+            shard_start += len(rows)
+            rows = []
+    if rows:
+        shards.append(flush_shard(output_dir, len(shards), shard_start, rows, args.embedding_dtype))
+    if row_indices:
+        with (output_dir / "row_ids.jsonl").open("w", encoding="utf-8") as handle:
+            for index, (row_id, row_index) in enumerate(zip(row_ids, row_indices, strict=True)):
+                handle.write(
+                    json.dumps(
+                        {"split_row": index, "source_row": row_index, "id": row_id},
+                        ensure_ascii=False,
+                    )
+                    + "\n"
+                )
+    elapsed = time.time() - started
+    manifest = {
+        "input": args.input,
+        "text_field": args.text_field,
+        "id_field": args.id_field,
+        "model": args.model,
+        "backend": "sentence_transformers.SentenceTransformer",
+        "max_length": max_length,
+        "max_records": args.max_records,
+        "sample_records": args.sample_records,
+        "sample_seed": args.sample_seed,
+        "split_count": args.split_count,
+        "split_index": args.split_index,
+        "pooling": "model_default",
+        "normalize_embeddings": True,
+        "text_prefix": args.text_prefix,
+        "text_template": args.text_template,
+        "prompt_name": args.prompt_name,
+        "available_prompts": getattr(model, "prompts", None),
+        "device": args.device,
+        "embedding_dtype": args.embedding_dtype,
+        "rows": total,
+        "seconds": round(elapsed, 3),
+        "rows_per_second": round(total / max(elapsed, 1e-6), 3),
+        "shards": [asdict(shard) for shard in shards],
+    }
+    (output_dir / "manifest.json").write_text(json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output_dir": str(output_dir), "rows": total, "shards": len(shards), "max_length": max_length}, indent=2))
+    return 0
+def load_embedding_manifest(path: Path) -> tuple[dict[str, Any], np.ndarray]:
+    manifest = json.loads(path.read_text(encoding="utf-8"))
+    arrays = [np.load(shard["path"], mmap_mode="r") for shard in manifest["shards"]]
+    if not arrays:
+        return manifest, np.zeros((0, 0), dtype=np.float32)
+    return manifest, np.concatenate(arrays, axis=0)
+def sample_embeddings(embeddings: np.ndarray, max_rows: int | None, seed: int) -> tuple[np.ndarray, list[int]]:
+    n = int(embeddings.shape[0])
+    if max_rows is None or max_rows >= n:
+        indices = list(range(n))
+    else:
+        rng = random.Random(seed)
+        indices = sorted(rng.sample(range(n), max_rows))
+    return np.asarray(embeddings[indices], dtype=np.float32), indices
+def vendi_from_block(block: torch.Tensor) -> dict[str, float]:
+    block = torch.nn.functional.normalize(block.float(), dim=-1)
+    kernel = block @ block.T
+    eigenvalues = torch.linalg.eigvalsh(kernel).clamp_min(0)
+    total = eigenvalues.sum().clamp_min(1e-12)
+    probs = eigenvalues / total
+    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
+    vendi = torch.exp(entropy)
+    return {
+        "vendi": float(vendi.item()),
+        "effective_rank": float(vendi.item()),
+        "trace": float(total.item()),
+        "max_eigen_prob": float(probs.max().item()),
+    }
+def mean_ci(values: list[float]) -> dict[str, float]:
+    if not values:
+        return {"mean": 0.0, "ci95_low": 0.0, "ci95_high": 0.0}
+    mean = sum(values) / len(values)
+    if len(values) == 1:
+        return {"mean": mean, "ci95_low": mean, "ci95_high": mean}
+    variance = sum((value - mean) ** 2 for value in values) / (len(values) - 1)
+    half = 1.96 * math.sqrt(variance / len(values))
+    return {"mean": mean, "ci95_low": mean - half, "ci95_high": mean + half}
+def parse_thresholds(text: str) -> list[float]:
+    values = []
+    for part in text.split(","):
+        part = part.strip()
+        if not part:
+            continue
+        value = float(part)
+        if not -1.0 <= value <= 1.0:
+            raise SystemExit(f"invalid cosine threshold outside [-1, 1]: {value}")
+        values.append(value)
+    if not values:
+        raise SystemExit("--thresholds must contain at least one value")
+    return values
+def summarize_scores(scores: np.ndarray, thresholds: list[float]) -> dict[str, Any]:
+    percentiles = {
+        f"p{percentile:02d}": float(np.percentile(scores, percentile))
+        for percentile in [1, 5, 10, 25, 50, 75, 90, 95, 99]
+    }
+    support = {
+        f"support_at_{threshold:.2f}": float(np.mean(scores >= threshold))
+        for threshold in thresholds
+    }
+    return {
+        "mean_nn_cosine": float(np.mean(scores)),
+        "std_nn_cosine": float(np.std(scores, ddof=1)) if scores.size > 1 else 0.0,
+        **percentiles,
+        **support,
+    }
+def summarize_support(covered: np.ndarray, density: np.ndarray, nn_cosine: np.ndarray) -> dict[str, Any]:
+    nn_distance = 1.0 - nn_cosine
+    return {
+        "coverage": float(np.mean(covered)),
+        "density": float(np.mean(density)),
+        "density_p50": float(np.percentile(density, 50)),
+        "density_p95": float(np.percentile(density, 95)),
+        "nn_cosine_mean": float(np.mean(nn_cosine)),
+        "nn_cosine_p50": float(np.percentile(nn_cosine, 50)),
+        "nn_cosine_p05": float(np.percentile(nn_cosine, 5)),
+        "nn_distance_p95": float(np.percentile(nn_distance, 95)),
+        "nn_distance_p99": float(np.percentile(nn_distance, 99)),
+    }
+@torch.inference_mode()
+def exact_nn_cosine(
+    query: np.ndarray,
+    gallery: np.ndarray,
+    device: str,
+    dtype: torch.dtype,
+    query_batch_size: int,
+    gallery_chunk_size: int,
+) -> np.ndarray:
+    if query.ndim != 2 or gallery.ndim != 2:
+        raise SystemExit("query and gallery embeddings must be 2D arrays")
+    if query.shape[1] != gallery.shape[1]:
+        raise SystemExit(f"dimension mismatch: query dim {query.shape[1]} vs gallery dim {gallery.shape[1]}")
+    if query.shape[0] == 0 or gallery.shape[0] == 0:
+        raise SystemExit("query and gallery embeddings must be non-empty")
+    if query_batch_size < 1:
+        raise SystemExit("--query-batch-size must be >= 1")
+    if gallery_chunk_size < 0:
+        raise SystemExit("--gallery-chunk-size must be >= 0")
+    scores: list[np.ndarray] = []
+    if gallery_chunk_size == 0:
+        gallery_tensor = torch.from_numpy(gallery).to(device=device, dtype=dtype)
+        gallery_tensor = torch.nn.functional.normalize(gallery_tensor.float(), dim=-1).to(dtype)
+        gallery_t = gallery_tensor.T.contiguous()
+        for start in range(0, query.shape[0], query_batch_size):
+            query_tensor = torch.from_numpy(query[start : start + query_batch_size]).to(device=device, dtype=dtype)
+            query_tensor = torch.nn.functional.normalize(query_tensor.float(), dim=-1).to(dtype)
+            sims = query_tensor @ gallery_t
+            scores.append(sims.float().max(dim=1).values.cpu().numpy())
+        return np.concatenate(scores, axis=0)
+    for start in range(0, query.shape[0], query_batch_size):
+        query_tensor = torch.from_numpy(query[start : start + query_batch_size]).to(device=device, dtype=dtype)
+        query_tensor = torch.nn.functional.normalize(query_tensor.float(), dim=-1).to(dtype)
+        best = torch.full((query_tensor.shape[0],), -2.0, device=device, dtype=torch.float32)
+        for gallery_start in range(0, gallery.shape[0], gallery_chunk_size):
+            gallery_tensor = torch.from_numpy(gallery[gallery_start : gallery_start + gallery_chunk_size]).to(device=device, dtype=dtype)
+            gallery_tensor = torch.nn.functional.normalize(gallery_tensor.float(), dim=-1).to(dtype)
+            sims = query_tensor @ gallery_tensor.T
+            best = torch.maximum(best, sims.float().max(dim=1).values)
+        scores.append(best.cpu().numpy())
+    return np.concatenate(scores, axis=0)
+@torch.inference_mode()
+def kth_self_neighbor_cosine(
+    gallery: np.ndarray,
+    k: int,
+    device: str,
+    dtype: torch.dtype,
+    batch_size: int,
+) -> np.ndarray:
+    if k < 1:
+        raise SystemExit("--k must be >= 1")
+    if gallery.shape[0] <= k:
+        raise SystemExit(f"gallery rows ({gallery.shape[0]}) must be > k ({k})")
+    if batch_size < 1:
+        raise SystemExit("--gallery-batch-size must be >= 1")
+    gallery_tensor = torch.from_numpy(gallery).to(device=device, dtype=dtype)
+    gallery_tensor = torch.nn.functional.normalize(gallery_tensor.float(), dim=-1).to(dtype)
+    gallery_t = gallery_tensor.T.contiguous()
+    thresholds: list[np.ndarray] = []
+    for start in range(0, gallery.shape[0], batch_size):
+        stop = min(start + batch_size, gallery.shape[0])
+        sims = gallery_tensor[start:stop] @ gallery_t
+        row_indices = torch.arange(stop - start, device=device)
+        sims[row_indices, torch.arange(start, stop, device=device)] = -2.0
+        kth = torch.topk(sims.float(), k=k, dim=1).values[:, -1]
+        thresholds.append(kth.cpu().numpy())
+    return np.concatenate(thresholds, axis=0)
+@torch.inference_mode()
+def prdc_query_in_gallery_support(
+    query: np.ndarray,
+    gallery: np.ndarray,
+    gallery_thresholds: np.ndarray,
+    k: int,
+    device: str,
+    dtype: torch.dtype,
+    query_batch_size: int,
+) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+    if query_batch_size < 1:
+        raise SystemExit("--query-batch-size must be >= 1")
+    gallery_tensor = torch.from_numpy(gallery).to(device=device, dtype=dtype)
+    gallery_tensor = torch.nn.functional.normalize(gallery_tensor.float(), dim=-1).to(dtype)
+    gallery_t = gallery_tensor.T.contiguous()
+    thresholds = torch.from_numpy(gallery_thresholds.astype(np.float32)).to(device=device)
+    covered_rows: list[np.ndarray] = []
+    density_rows: list[np.ndarray] = []
+    nn_rows: list[np.ndarray] = []
+    for start in range(0, query.shape[0], query_batch_size):
+        query_tensor = torch.from_numpy(query[start : start + query_batch_size]).to(device=device, dtype=dtype)
+        query_tensor = torch.nn.functional.normalize(query_tensor.float(), dim=-1).to(dtype)
+        sims = (query_tensor @ gallery_t).float()
+        support_hits = sims >= thresholds.unsqueeze(0)
+        hit_counts = support_hits.sum(dim=1).float()
+        covered_rows.append((hit_counts > 0).cpu().numpy())
+        density_rows.append((hit_counts / float(k)).cpu().numpy())
+        nn_rows.append(sims.max(dim=1).values.cpu().numpy())
+    return (
+        np.concatenate(covered_rows, axis=0),
+        np.concatenate(density_rows, axis=0),
+        np.concatenate(nn_rows, axis=0),
+    )
+def vendi_main(args: argparse.Namespace) -> int:
+    manifest, embeddings = load_embedding_manifest(Path(args.manifest))
+    n = int(embeddings.shape[0])
+    if n == 0:
+        raise SystemExit("empty embedding cache")
+    block_size = min(args.block_size, n)
+    rng = random.Random(args.seed)
+    matrix_device = args.matrix_device or args.device
+    dtype = torch_dtype(args.dtype)
+    block_rows = []
+    if args.sampling == "partition":
+        order = list(range(n))
+        rng.shuffle(order)
+        index_blocks = [order[start : start + block_size] for start in range(0, n, block_size)]
+        if len(index_blocks) > 1 and len(index_blocks[-1]) < max(2, block_size // 2):
+            # Avoid a tiny tail block with a non-comparable Vendi scale.
+            index_blocks[-2].extend(index_blocks[-1])
+            index_blocks.pop()
+    else:
+        index_blocks = [
+            rng.sample(range(n), block_size) if block_size < n else list(range(n))
+            for _ in range(args.blocks)
+        ]
+    for block_index, indices in enumerate(index_blocks):
+        array = np.asarray(embeddings[indices], dtype=np.float32)
+        block = torch.from_numpy(array).to(matrix_device, dtype=dtype)
+        stats = vendi_from_block(block)
+        stats.update({"block_index": block_index, "block_size": len(indices)})
+        block_rows.append(stats)
+    vendi_values = [row["vendi"] for row in block_rows]
+    payload = {
+        "embedding_manifest": args.manifest,
+        "source_model": manifest.get("model"),
+        "source_rows": n,
+        "block_size": block_size,
+        "blocks": len(block_rows),
+        "requested_blocks": args.blocks,
+        "sampling": args.sampling,
+        "seed": args.seed,
+        "device": matrix_device,
+        "summary": {
+            "vendi": mean_ci(vendi_values),
+            "max_eigen_prob": mean_ci([row["max_eigen_prob"] for row in block_rows]),
+        },
+        "block_rows": block_rows,
+        "boundary": "Vendi is an embedding-space semantic diversity metric; it does not measure faithfulness, density, or downstream utility.",
+    }
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output": str(output), "vendi_mean": payload["summary"]["vendi"]["mean"], "blocks": len(block_rows)}, indent=2))
+    return 0
+def geometry_main(args: argparse.Namespace) -> int:
+    manifest, embeddings = load_embedding_manifest(Path(args.manifest))
+    n = int(embeddings.shape[0])
+    if n == 0:
+        raise SystemExit("empty embedding cache")
+    rng = np.random.default_rng(args.seed)
+    take = min(args.max_rows, n)
+    indices = rng.choice(n, size=take, replace=False) if take < n else np.arange(n)
+    x = torch.from_numpy(np.asarray(embeddings[indices], dtype=np.float32)).to(args.device, dtype=torch_dtype(args.dtype))
+    x = torch.nn.functional.normalize(x.float(), dim=-1)
+    centroid = torch.nn.functional.normalize(x.mean(dim=0, keepdim=True), dim=-1)
+    cosine_to_centroid = (x @ centroid.T).squeeze(1)
+    centered = x - x.mean(dim=0, keepdim=True)
+    cov = centered.T @ centered / max(take - 1, 1)
+    eig = torch.linalg.eigvalsh(cov).clamp_min(0)
+    eig_sum = eig.sum().clamp_min(1e-12)
+    probs = eig / eig_sum
+    spectral_entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
+    erank = torch.exp(spectral_entropy)
+    participation = eig_sum.square() / eig.square().sum().clamp_min(1e-12)
+    payload = {
+        "embedding_manifest": args.manifest,
+        "source_model": manifest.get("model"),
+        "source_rows": n,
+        "sample_rows": take,
+        "seed": args.seed,
+        "device": args.device,
+        "metrics": {
+            "mean_cosine_to_centroid": float(cosine_to_centroid.mean().item()),
+            "std_cosine_to_centroid": float(cosine_to_centroid.std(unbiased=True).item()) if take > 1 else 0.0,
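+            # Mean off-diagonal cosine for unit-norm rows: (n * ||mean(x)||^2 - 1) / (n - 1),
+            # computed from the mean vector without forming the n x n similarity matrix.
+            "mean_pairwise_cosine_estimate": float((x.mean(dim=0).square().sum().item() * take - 1.0) / max(take - 1, 1)),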
+            "cov_effective_rank": float(erank.item()),
+            "cov_participation_ratio": float(participation.item()),
+            "cov_top1_mass": float((eig.max() / eig_sum).item()),
+            "cov_top10_mass": float((eig.topk(min(10, eig.numel())).values.sum() / eig_sum).item()),
+            "cov_trace": float(eig_sum.item()),
+        },
+        "boundary": "Geometry metrics describe embedding distribution shape: concentration, anisotropy, and effective dimensionality. They do not measure faithfulness or prompt support.",
+    }
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output": str(output), **payload["metrics"]}, indent=2))
+    return 0
+def knn_main(args: argparse.Namespace) -> int:
+    query_manifest, query_embeddings_all = load_embedding_manifest(Path(args.query_manifest))
+    gallery_manifest, gallery_embeddings_all = load_embedding_manifest(Path(args.gallery_manifest))
+    query_embeddings, query_indices = sample_embeddings(query_embeddings_all, args.query_max_rows, args.seed)
+    gallery_embeddings, gallery_indices = sample_embeddings(gallery_embeddings_all, args.gallery_max_rows, args.seed + 1)
+    started = time.time()
+    scores = exact_nn_cosine(
+        query_embeddings,
+        gallery_embeddings,
+        args.device,
+        torch_dtype(args.dtype),
+        args.query_batch_size,
+        args.gallery_chunk_size,
+    )
+    thresholds = parse_thresholds(args.thresholds)
+    payload = {
+        "query_manifest": args.query_manifest,
+        "gallery_manifest": args.gallery_manifest,
+        "query_model": query_manifest.get("model"),
+        "gallery_model": gallery_manifest.get("model"),
+        "query_source_rows": int(query_embeddings_all.shape[0]),
+        "gallery_source_rows": int(gallery_embeddings_all.shape[0]),
+        "query_rows": int(query_embeddings.shape[0]),
+        "gallery_rows": int(gallery_embeddings.shape[0]),
+        "query_seed": args.seed,
+        "gallery_seed": args.seed + 1,
+        "query_indices_preview": query_indices[:10],
+        "gallery_indices_preview": gallery_indices[:10],
+        "device": args.device,
+        "dtype": args.dtype,
+        "query_batch_size": args.query_batch_size,
+        "gallery_chunk_size": args.gallery_chunk_size,
+        "seconds": round(time.time() - started, 3),
+        "metrics": summarize_scores(scores, thresholds),
+        "boundary": (
+            "kNN support measures nearest-neighbor coverage in the chosen embedding space. "
+            "It is directional, encoder-dependent, and not a faithfulness or density metric."
+        ),
+    }
+    if args.save_scores is not None:
+        score_path = Path(args.save_scores)
+        score_path.parent.mkdir(parents=True, exist_ok=True)
+        np.save(score_path, scores.astype(np.float32))
+        payload["score_path"] = str(score_path)
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output": str(output), "query_rows": payload["query_rows"], "gallery_rows": payload["gallery_rows"], **payload["metrics"]}, indent=2))
+    return 0
+def support_main(args: argparse.Namespace) -> int:
+    query_manifest, query_embeddings_all = load_embedding_manifest(Path(args.query_manifest))
+    gallery_manifest, gallery_embeddings_all = load_embedding_manifest(Path(args.gallery_manifest))
+    query_embeddings, query_indices = sample_embeddings(query_embeddings_all, args.query_max_rows, args.seed)
+    gallery_embeddings, gallery_indices = sample_embeddings(gallery_embeddings_all, args.gallery_max_rows, args.seed + 1)
+    started = time.time()
+    gallery_thresholds = kth_self_neighbor_cosine(
+        gallery_embeddings,
+        args.k,
+        args.device,
+        torch_dtype(args.dtype),
+        args.gallery_batch_size,
+    )
+    covered, density, nn_cosine = prdc_query_in_gallery_support(
+        query_embeddings,
+        gallery_embeddings,
+        gallery_thresholds,
+        args.k,
+        args.device,
+        torch_dtype(args.dtype),
+        args.query_batch_size,
+    )
+    payload = {
+        "query_manifest": args.query_manifest,
+        "gallery_manifest": args.gallery_manifest,
+        "query_model": query_manifest.get("model"),
+        "gallery_model": gallery_manifest.get("model"),
+        "query_source_rows": int(query_embeddings_all.shape[0]),
+        "gallery_source_rows": int(gallery_embeddings_all.shape[0]),
+        "query_rows": int(query_embeddings.shape[0]),
+        "gallery_rows": int(gallery_embeddings.shape[0]),
+        "query_seed": args.seed,
+        "gallery_seed": args.seed + 1,
+        "query_indices_preview": query_indices[:10],
+        "gallery_indices_preview": gallery_indices[:10],
+        "k": args.k,
+        "device": args.device,
+        "dtype": args.dtype,
+        "query_batch_size": args.query_batch_size,
+        "gallery_batch_size": args.gallery_batch_size,
+        "seconds": round(time.time() - started, 3),
+        "gallery_thresholds": {
+            "mean_kth_neighbor_cosine": float(np.mean(gallery_thresholds)),
+            "p05_kth_neighbor_cosine": float(np.percentile(gallery_thresholds, 5)),
+            "p50_kth_neighbor_cosine": float(np.percentile(gallery_thresholds, 50)),
+            "p95_kth_neighbor_cosine": float(np.percentile(gallery_thresholds, 95)),
+        },
+        "metrics": summarize_support(covered, density, nn_cosine),
+        "boundary": (
+            "P-in-C support is a PRDC-style embedding-manifold estimate: query points are covered "
+            "when they fall inside at least one gallery kNN ball. It measures support in the chosen "
+            "embedding space, not image faithfulness or overall caption quality."
+        ),
+    }
+    if args.save_scores is not None:
+        score_path = Path(args.save_scores)
+        score_path.parent.mkdir(parents=True, exist_ok=True)
+        np.savez_compressed(
+            score_path,
+            covered=covered.astype(np.bool_),
+            density=density.astype(np.float32),
+            nn_cosine=nn_cosine.astype(np.float32),
+            gallery_thresholds=gallery_thresholds.astype(np.float32),
+        )
+        payload["score_path"] = str(score_path)
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output": str(output), "query_rows": payload["query_rows"], "gallery_rows": payload["gallery_rows"], **payload["metrics"]}, indent=2))
+    return 0
+def main() -> int:
+    args = parse_args()
+    if args.cmd == "inspect":
+        return inspect_models(args)
+    if args.cmd == "encode":
+        return encode_main(args)
+    if args.cmd == "encode-bge-m3":
+        return encode_bge_m3_main(args)
+    if args.cmd == "encode-sentence-transformer":
+        return encode_sentence_transformer_main(args)
+    if args.cmd == "vendi":
+        return vendi_main(args)
+    if args.cmd == "geometry":
+        return geometry_main(args)
+    if args.cmd == "knn":
+        return knn_main(args)
+    if args.cmd == "support":
+        return support_main(args)
+    raise AssertionError(args.cmd)
+if __name__ == "__main__":
+    raise SystemExit(main())
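
As a quick sanity check on the Vendi computation in `vendi_from_block`, the sketch below repeats the same steps (row normalization, cosine kernel, clamped eigenvalues, exponential of the spectral entropy) in NumPy rather than PyTorch. The NumPy port and the name `vendi_score` are assumptions made here for illustration; they are not part of the script. Under these assumptions, a block of identical vectors should score close to 1 and a block of mutually orthogonal vectors should score close to the block size.

```python
import numpy as np


def vendi_score(embeddings: np.ndarray) -> float:
    """Exponential of the Shannon entropy of the cosine-kernel spectrum.

    Hypothetical NumPy re-statement of the steps in vendi_from_block.
    """
    # L2-normalize rows so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    kernel = unit @ unit.T
    # eigvalsh may return tiny negative values for a PSD matrix; clamp them.
    eigenvalues = np.clip(np.linalg.eigvalsh(kernel), 0.0, None)
    probs = eigenvalues / max(float(eigenvalues.sum()), 1e-12)
    entropy = -float(np.sum(probs * np.log(np.maximum(probs, 1e-12))))
    return float(np.exp(entropy))


# Four copies of one vector: a single effective "mode".
identical = np.tile(np.array([[1.0, 2.0, 3.0]]), (4, 1))
# Four mutually orthogonal vectors: four effective "modes".
orthogonal = np.eye(4)

print(round(vendi_score(identical), 3), round(vendi_score(orthogonal), 3))
```

Because the score is bounded by the number of rows in the block, values from blocks of different sizes are not directly comparable, which is presumably why the partition mode merges an undersized tail block into its predecessor.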

eval_code/scripts/compute_longclip_retrieval_margin.py ADDED

	@@ -0,0 +1,368 @@

+#!/usr/bin/env python3
+"""Compute LongCLIP-style image-caption retrieval separability.
+This metric is a frozen dual-encoder compatibility diagnostic, not a
+faithfulness certificate. It reports whether each caption distinguishes its
+paired image from same-slice negatives, while also reporting text truncation.
+"""
+from __future__ import annotations
+import argparse
+import hashlib
+import json
+import random
+import time
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+from PIL import Image, ImageFile
+ImageFile.LOAD_TRUNCATED_IMAGES = True
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--surface", action="append", required=True, metavar="LABEL=JSONL")
+    parser.add_argument("--output-dir", required=True)
+    parser.add_argument("--model", default="zer0int/LongCLIP-GmP-ViT-L-14")
+    parser.add_argument("--max-records", type=int, default=None)
+    parser.add_argument("--sample-records", type=int, default=None)
+    parser.add_argument("--sample-seed", type=int, default=0)
+    parser.add_argument("--batch-size", type=int, default=64)
+    parser.add_argument("--retrieval-block-size", type=int, default=512)
+    parser.add_argument("--max-length", type=int, default=248)
+    parser.add_argument("--device", default="cuda")
+    parser.add_argument("--dtype", default="float16", choices=["float16", "bfloat16", "float32"])
+    parser.add_argument("--bootstrap-reps", type=int, default=1000)
+    parser.add_argument("--trust-remote-code", action="store_true")
+    parser.add_argument("--save-embeddings", action="store_true")
+    return parser.parse_args()
+def torch_dtype(name: str) -> torch.dtype:
+    return {"float16": torch.float16, "bfloat16": torch.bfloat16, "float32": torch.float32}[name]
+def parse_surface(spec: str) -> tuple[str, Path]:
+    if "=" not in spec:
+        raise ValueError(f"--surface must be LABEL=JSONL: {spec}")
+    label, path = spec.split("=", 1)
+    return label, Path(path)
+def stable_float(*parts: object) -> float:
+    raw = ":".join(str(part) for part in parts)
+    digest = hashlib.blake2b(raw.encode("utf-8"), digest_size=8).digest()
+    return int.from_bytes(digest, "big") / 2**64
+def image_path(row: dict[str, Any]) -> str | None:
+    image = row.get("image") if isinstance(row.get("image"), dict) else {}
+    local = image.get("local_abs_path") or row.get("image_abs_path") or row.get("image_path")
+    if isinstance(local, str) and local:
+        return local
+    return None
+def load_surface(path: Path) -> list[dict[str, Any]]:
+    rows: list[dict[str, Any]] = []
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if not line.strip():
+                continue
+            row = json.loads(line)
+            caption = row.get("caption")
+            if isinstance(caption, str) and caption.strip():
+                rows.append(row)
+    return rows
+def align_rows(surface_rows: dict[str, list[dict[str, Any]]], sample_records: int | None, max_records: int | None, seed: int) -> dict[str, list[dict[str, Any]]]:
+    labels = list(surface_rows)
+    n = min(len(surface_rows[label]) for label in labels)
+    indices = list(range(n))
+    if sample_records is not None:
+        indices.sort(key=lambda i: stable_float(seed, i))
+        indices = indices[:sample_records]
+        indices.sort()
+    elif max_records is not None:
+        indices = indices[:max_records]
+    return {label: [surface_rows[label][i] for i in indices] for label in labels}
+def load_model(model_id: str, device: str, dtype_name: str, trust_remote_code: bool):
+    from transformers import AutoImageProcessor, AutoModel, AutoTokenizer
+    dtype = torch_dtype(dtype_name)
+    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=trust_remote_code)
+    image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=trust_remote_code)
+    model = AutoModel.from_pretrained(model_id, trust_remote_code=trust_remote_code, torch_dtype=dtype)
+    model.eval().to(device)
+    return tokenizer, image_processor, model
+def normalize(x: torch.Tensor) -> torch.Tensor:
+    return torch.nn.functional.normalize(x.float(), dim=-1)
+def pooled_tensor(output: Any) -> torch.Tensor:
+    """Return a tensor embedding from HF tensor/model-output variants."""
+    if isinstance(output, torch.Tensor):
+        return output
+    pooler_output = getattr(output, "pooler_output", None)
+    if isinstance(pooler_output, torch.Tensor):
+        return pooler_output
+    image_embeds = getattr(output, "image_embeds", None)
+    if isinstance(image_embeds, torch.Tensor):
+        return image_embeds
+    text_embeds = getattr(output, "text_embeds", None)
+    if isinstance(text_embeds, torch.Tensor):
+        return text_embeds
+    last_hidden_state = getattr(output, "last_hidden_state", None)
+    if isinstance(last_hidden_state, torch.Tensor):
+        return last_hidden_state[:, 0]
+    if isinstance(output, (tuple, list)) and output and isinstance(output[0], torch.Tensor):
+        first = output[0]
+        return first[:, 0] if first.ndim == 3 else first
+    raise TypeError(f"Cannot extract pooled tensor from {type(output)!r}")
+def encode_texts(tokenizer: Any, model: Any, texts: list[str], device: str, max_length: int, batch_size: int) -> tuple[np.ndarray, np.ndarray]:
+    embs: list[np.ndarray] = []
+    lengths: list[int] = []
+    with torch.inference_mode():
+        for start in range(0, len(texts), batch_size):
+            batch = texts[start : start + batch_size]
+            raw = tokenizer(batch, padding=False, truncation=False, add_special_tokens=True)
+            lengths.extend(len(ids) for ids in raw["input_ids"])
+            encoded = tokenizer(batch, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
+            encoded = {k: v.to(device) for k, v in encoded.items()}
+            if hasattr(model, "get_text_features"):
+                features = pooled_tensor(model.get_text_features(**encoded))
+            else:
+                features = pooled_tensor(model(**encoded))
+            embs.append(normalize(features).cpu().numpy().astype("float32"))
+    return np.concatenate(embs, axis=0), np.asarray(lengths, dtype=np.int32)
+def encode_images(image_processor: Any, model: Any, rows: list[dict[str, Any]], device: str, batch_size: int) -> tuple[np.ndarray, dict[str, Any]]:
+    embs: list[np.ndarray] = []
+    kept_indices: list[int] = []
+    failures: list[dict[str, Any]] = []
+    batch_images: list[Image.Image] = []
+    batch_indices: list[int] = []
+    def flush() -> None:
+        if not batch_images:
+            return
+        inputs = image_processor(images=batch_images, return_tensors="pt")
+        inputs = {k: v.to(device) for k, v in inputs.items()}
+        with torch.inference_mode():
+            if hasattr(model, "get_image_features"):
+                features = pooled_tensor(model.get_image_features(**inputs))
+            else:
+                features = pooled_tensor(model(**inputs))
+        embs.append(normalize(features).cpu().numpy().astype("float32"))
+        kept_indices.extend(batch_indices)
+        batch_images.clear()
+        batch_indices.clear()
+    for index, row in enumerate(rows):
+        path = image_path(row)
+        if path is None:
+            failures.append({"index": index, "reason": "missing_image_path"})
+            continue
+        try:
+            with Image.open(path) as handle:
+                image = handle.convert("RGB")
+        except Exception as exc:  # noqa: BLE001
+            failures.append({"index": index, "path": path, "reason": repr(exc)[:500]})
+            continue
+        batch_images.append(image)
+        batch_indices.append(index)
+        if len(batch_images) >= batch_size:
+            flush()
+    flush()
+    if embs:
+        arr = np.concatenate(embs, axis=0)
+    else:
+        arr = np.zeros((0, 0), dtype=np.float32)
+    return arr, {"kept_indices": kept_indices, "failures": failures}
+def mean_ci(values: np.ndarray, reps: int, rng: np.random.Generator) -> dict[str, float]:
+    values = np.asarray(values, dtype=np.float64)
+    if values.size == 0:
+        return {"mean": float("nan"), "ci95_low": float("nan"), "ci95_high": float("nan")}
+    if reps <= 0 or values.size == 1:
+        mean = float(values.mean())
+        return {"mean": mean, "ci95_low": mean, "ci95_high": mean}
+    means = np.empty(reps, dtype=np.float64)
+    n = values.size
+    for i in range(reps):
+        means[i] = values[rng.integers(0, n, n)].mean()
+    return {
+        "mean": float(values.mean()),
+        "ci95_low": float(np.percentile(means, 2.5)),
+        "ci95_high": float(np.percentile(means, 97.5)),
+    }
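+def retrieval_metrics(image_emb: np.ndarray, text_emb: np.ndarray, block_size: int) -> dict[str, np.ndarray]:
+    """Compute positive scores, hardest-negative margins, and R@1/R@5 in both directions.
+    Row i of image_emb is assumed to pair with row i of text_emb. Similarities are computed
+    block by block so the full n x n matrix is never materialized. Ties do not worsen a rank.
+    """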
+    n = min(len(image_emb), len(text_emb))
+    pos = np.sum(image_emb[:n] * text_emb[:n], axis=1).astype(np.float32)
+    max_i2t = np.full(n, -np.inf, dtype=np.float32)
+    max_t2i = np.full(n, -np.inf, dtype=np.float32)
+    rank_i2t = np.ones(n, dtype=np.int32)
+    rank_t2i = np.ones(n, dtype=np.int32)
+    for image_start in range(0, n, block_size):
+        image_end = min(image_start + block_size, n)
+        image_block = image_emb[image_start:image_end]
+        image_idx = np.arange(image_start, image_end)
+        for text_start in range(0, n, block_size):
+            text_end = min(text_start + block_size, n)
+            text_block = text_emb[text_start:text_end]
+            text_idx = np.arange(text_start, text_end)
+            sims = image_block @ text_block.T
+            diag_mask = image_idx[:, None] == text_idx[None, :]
+            masked = sims.copy()
+            masked[diag_mask] = -np.inf
+            max_i2t[image_start:image_end] = np.maximum(max_i2t[image_start:image_end], masked.max(axis=1))
+            max_t2i[text_start:text_end] = np.maximum(max_t2i[text_start:text_end], masked.max(axis=0))
+            greater_i2t = sims > pos[image_start:image_end, None]
+            greater_i2t[diag_mask] = False
+            rank_i2t[image_start:image_end] += greater_i2t.sum(axis=1).astype(np.int32)
+            greater_t2i = sims > pos[text_start:text_end][None, :]
+            greater_t2i[diag_mask] = False
+            rank_t2i[text_start:text_end] += greater_t2i.sum(axis=0).astype(np.int32)
+    return {
+        "pos": pos,
+        "i2t_margin": (pos - max_i2t).astype(np.float32),
+        "t2i_margin": (pos - max_t2i).astype(np.float32),
+        "i2t_r1": (rank_i2t <= 1).astype(np.float32),
+        "i2t_r5": (rank_i2t <= 5).astype(np.float32),
+        "t2i_r1": (rank_t2i <= 1).astype(np.float32),
+        "t2i_r5": (rank_t2i <= 5).astype(np.float32),
+    }
+def main() -> int:
+    args = parse_args()
+    started = time.time()
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    surface_specs = dict(parse_surface(spec) for spec in args.surface)
+    raw_rows = {label: load_surface(path) for label, path in surface_specs.items()}
+    rows = align_rows(raw_rows, args.sample_records, args.max_records, args.sample_seed)
+    labels = list(rows)
+    if not labels:
+        raise SystemExit("No surfaces provided")
+    tokenizer, image_processor, model = load_model(args.model, args.device, args.dtype, args.trust_remote_code)
+    image_emb, image_info = encode_images(image_processor, model, rows[labels[0]], args.device, args.batch_size)
+    kept_indices = image_info["kept_indices"]
+    rng = np.random.default_rng(args.sample_seed)
+    summaries: dict[str, Any] = {}
+    text_cache: dict[str, np.ndarray] = {}
+    token_cache: dict[str, np.ndarray] = {}
+    for label in labels:
+        kept_rows = [rows[label][index] for index in kept_indices]
+        texts = [str(row["caption"]) for row in kept_rows]
+        text_emb, token_lengths = encode_texts(tokenizer, model, texts, args.device, args.max_length, args.batch_size)
+        text_cache[label] = text_emb
+        token_cache[label] = token_lengths
+        metrics = retrieval_metrics(image_emb, text_emb, args.retrieval_block_size)
+        summaries[label] = {
+            "rows": int(len(texts)),
+            "token_mean": float(token_lengths.mean()) if len(token_lengths) else 0.0,
+            "token_p50": float(np.percentile(token_lengths, 50)) if len(token_lengths) else 0.0,
+            "token_p95": float(np.percentile(token_lengths, 95)) if len(token_lengths) else 0.0,
+            "truncated_rate_gt_limit": float((token_lengths > args.max_length).mean()) if len(token_lengths) else 0.0,
+            "pos_score": mean_ci(metrics["pos"], args.bootstrap_reps, rng),
+            "i2t_margin": mean_ci(metrics["i2t_margin"], args.bootstrap_reps, rng),
+            "t2i_margin": mean_ci(metrics["t2i_margin"], args.bootstrap_reps, rng),
+            "i2t_r_at_1": mean_ci(metrics["i2t_r1"], args.bootstrap_reps, rng),
+            "i2t_r_at_5": mean_ci(metrics["i2t_r5"], args.bootstrap_reps, rng),
+            "t2i_r_at_1": mean_ci(metrics["t2i_r1"], args.bootstrap_reps, rng),
+            "t2i_r_at_5": mean_ci(metrics["t2i_r5"], args.bootstrap_reps, rng),
+        }
+    payload = {
+        "model": args.model,
+        "max_length": args.max_length,
+        "surface_inputs": {label: str(path) for label, path in surface_specs.items()},
+        "labels": labels,
+        "image_rows": len(rows[labels[0]]),
+        "image_kept": len(kept_indices),
+        "image_failures": image_info["failures"][:100],
+        "retrieval_block_size": args.retrieval_block_size,
+        "bootstrap_reps": args.bootstrap_reps,
+        "seconds": round(time.time() - started, 2),
+        "summaries": summaries,
+    }
+    summary_path = output_dir / "longclip_retrieval_summary.json"
+    summary_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
+    rows_tsv = [
+        [
+            "surface",
+            "rows",
+            "trunc_gt_248",
+            "tok_mean",
+            "tok_p95",
+            "pos_mean",
+            "pos_ci95",
+            "i2t_margin_mean",
+            "i2t_margin_ci95",
+            "i2t_r1",
+            "i2t_r5",
+            "t2i_margin_mean",
+            "t2i_margin_ci95",
+            "t2i_r1",
+            "t2i_r5",
+        ]
+    ]
+    for label in labels:
+        s = summaries[label]
+        rows_tsv.append(
+            [
+                label,
+                str(s["rows"]),
+                f"{s['truncated_rate_gt_limit']:.4f}",
+                f"{s['token_mean']:.2f}",
+                f"{s['token_p95']:.1f}",
+                f"{s['pos_score']['mean']:.6f}",
+                f"[{s['pos_score']['ci95_low']:.6f},{s['pos_score']['ci95_high']:.6f}]",
+                f"{s['i2t_margin']['mean']:.6f}",
+                f"[{s['i2t_margin']['ci95_low']:.6f},{s['i2t_margin']['ci95_high']:.6f}]",
+                f"{s['i2t_r_at_1']['mean']:.4f}",
+                f"{s['i2t_r_at_5']['mean']:.4f}",
+                f"{s['t2i_margin']['mean']:.6f}",
+                f"[{s['t2i_margin']['ci95_low']:.6f},{s['t2i_margin']['ci95_high']:.6f}]",
+                f"{s['t2i_r_at_1']['mean']:.4f}",
+                f"{s['t2i_r_at_5']['mean']:.4f}",
+            ]
+        )
+    (output_dir / "longclip_retrieval_summary.tsv").write_text(
+        "\n".join("\t".join(row) for row in rows_tsv) + "\n",
+        encoding="utf-8",
+    )
+    if args.save_embeddings:
+        np.save(output_dir / "image_embeddings.npy", image_emb.astype(np.float16))
+        for label, emb in text_cache.items():
+            np.save(output_dir / f"text_embeddings_{label}.npy", emb.astype(np.float16))
+            np.save(output_dir / f"token_lengths_{label}.npy", token_cache[label])
+    print(json.dumps({"summary": str(summary_path), "rows": len(kept_indices), "labels": labels}, indent=2))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
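
For intuition, the margins and ranks produced by `retrieval_metrics` can be checked against an unblocked reference on a toy input. The sketch below is a minimal illustration and not part of the release: `toy_retrieval_metrics` and the 2-D embeddings are hypothetical, and it holds the whole similarity matrix in memory, which the blocked version avoids.

```python
import numpy as np


def toy_retrieval_metrics(image_emb: np.ndarray, text_emb: np.ndarray) -> dict[str, np.ndarray]:
    """Unblocked reference for the same quantities, using O(n^2) memory."""
    sims = image_emb @ text_emb.T        # sims[i, j] = image i vs caption j
    pos = np.diag(sims).copy()           # similarity of each matched pair
    off = sims.copy()
    np.fill_diagonal(off, -np.inf)       # hide the positive when taking the hardest negative
    greater = sims > pos[:, None]        # strict comparison, so ties keep rank 1
    return {
        "pos": pos,
        "i2t_margin": pos - off.max(axis=1),
        "t2i_margin": pos - off.max(axis=0),
        "i2t_rank": 1 + greater.sum(axis=1),
    }


# Hypothetical unit-norm embeddings: captions 1 and 2 are swapped relative to
# images 1 and 2, so both of those images rank their own caption second.
image_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
text_emb = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
out = toy_retrieval_metrics(image_emb, text_emb)
print(out["i2t_rank"].tolist())  # [1, 2, 2]
```

A negative margin means at least one non-matching caption scores higher than the paired one, which is why images 1 and 2 fall out of R@1 here.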

eval_code/scripts/export_cbu_metric_tables.py ADDED

	@@ -0,0 +1,386 @@

+#!/usr/bin/env python3
+"""Export paper-facing CBU tables with caption-level bootstrap CIs.
+The script consumes existing CBU response JSONL artifacts. It does not call a
+model and does not modify source captions.
+"""
+from __future__ import annotations
+import argparse
+import csv
+import json
+import re
+from collections import Counter, defaultdict
+from pathlib import Path
+from typing import Any
+import numpy as np
+UNIT_CATEGORIES = [
+    "object",
+    "attribute",
+    "relation",
+    "style",
+    "camera",
+    "lighting",
+    "count",
+    "text_rendering",
+]
+VISUAL_STATUSES = {"grounded", "unsupported", "uncertain"}
+TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*", re.UNICODE)
+ARTICLE_UNITS = {"a", "an", "the"}
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--claimed", action="append", default=[], metavar="LABEL=PATH")
+    parser.add_argument("--grounded", action="append", default=[], metavar="LABEL=PATH")
+    parser.add_argument("--output-dir", required=True)
+    parser.add_argument("--bootstrap-reps", type=int, default=2000)
+    parser.add_argument("--seed", type=int, default=0)
+    return parser.parse_args()
+def parse_label_path(value: str) -> tuple[str, Path]:
+    if "=" not in value:
+        raise ValueError(f"Expected LABEL=PATH, got {value!r}")
+    label, path = value.split("=", 1)
+    return label, Path(path)
+def normalize_unit(text: str) -> str:
+    tokens = TOKEN_RE.findall(text.lower())
+    while tokens and tokens[0] in ARTICLE_UNITS:
+        tokens.pop(0)
+    return " ".join(tokens)
+def normalize_key_part(text: str) -> str:
+    return normalize_unit(text) or ""
+def unit_records(group: Any) -> list[dict[str, str]]:
+    records: list[dict[str, str]] = []
+    if not isinstance(group, list):
+        return records
+    for item in group:
+        if not isinstance(item, dict):
+            continue
+        category = item.get("category")
+        unit = item.get("unit")
+        if category not in UNIT_CATEGORIES or not isinstance(unit, str) or not unit.strip():
+            continue
+        target = item.get("target", "")
+        records.append(
+            {
+                "category": category,
+                "unit": unit.strip(),
+                "target": target.strip() if isinstance(target, str) else "",
+            }
+        )
+    return records
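+def dedup_counts(group: Any) -> tuple[int, dict[str, int], int]:
+    """Count units per category after deduplicating on (category, normalized unit, normalized target).
+    Returns (total unique units, per-category counts, number of duplicates skipped).
+    """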
+    counts = {category: 0 for category in UNIT_CATEGORIES}
+    seen: set[str] = set()
+    duplicate = 0
+    for record in unit_records(group):
+        norm = normalize_unit(record["unit"])
+        if not norm:
+            continue
+        key = f"{record['category']}|{norm}|{normalize_key_part(record.get('target', ''))}"
+        if key in seen:
+            duplicate += 1
+            continue
+        seen.add(key)
+        counts[record["category"]] += 1
+    return sum(counts.values()), counts, duplicate
+def caption_tokens(request: dict[str, Any]) -> int:
+    caption = request.get("caption", "")
+    return len(TOKEN_RE.findall(caption)) if isinstance(caption, str) else 0
+def read_claimed(path: Path, label: str) -> list[dict[str, Any]]:
+    rows: list[dict[str, Any]] = []
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if not line.strip():
+                continue
+            raw = json.loads(line)
+            if not raw.get("ok") or not isinstance(raw.get("parsed"), dict):
+                continue
+            total, counts, duplicate = dedup_counts(raw["parsed"].get("claimed_units"))
+            request = raw.get("request", {})
+            rows.append(
+                {
+                    "label": label,
+                    "caption_id": request.get("caption_id"),
+                    "tokens": caption_tokens(request),
+                    "dedup_units": total,
+                    "duplicate_units": duplicate,
+                    **{f"{category}_units": counts[category] for category in UNIT_CATEGORIES},
+                }
+            )
+    return rows
+def request_unit_lookup(request: dict[str, Any]) -> dict[str, dict[str, Any]]:
+    return {
+        unit.get("unit_id"): unit
+        for unit in request.get("claimed_units", [])
+        if isinstance(unit, dict) and isinstance(unit.get("unit_id"), str)
+    }
+def read_grounded(path: Path, label: str) -> list[dict[str, Any]]:
+    rows: list[dict[str, Any]] = []
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if not line.strip():
+                continue
+            raw = json.loads(line)
+            if not raw.get("ok") or not isinstance(raw.get("parsed"), dict):
+                continue
+            lookup = request_unit_lookup(raw.get("request", {}))
+            counter: Counter[str] = Counter()
+            for result in raw["parsed"].get("unit_results", []):
+                if not isinstance(result, dict):
+                    continue
+                unit = lookup.get(result.get("unit_id"), {})
+                category = unit.get("category", "__unknown__")
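+                status = result.get("status")
+                if not isinstance(status, str):
+                    # A JSON null or non-string status would break the Counter key or the "_" filter below.
+                    status = "__bad_status__"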
+                counter["valid"] += 1
+                counter[status] += 1
+                if status in VISUAL_STATUSES:
+                    counter["visual"] += 1
+                    if category in UNIT_CATEGORIES:
+                        counter[f"{category}_visual"] += 1
+                        counter[f"{category}_{status}"] += 1
+            rows.append(
+                {
+                    "label": label,
+                    "caption_id": raw.get("request", {}).get("caption_id"),
+                    "valid": counter["valid"],
+                    "visual": counter["visual"],
+                    "grounded": counter["grounded"],
+                    "unsupported": counter["unsupported"],
+                    "uncertain": counter["uncertain"],
+                    **{key: counter[key] for key in counter if "_" in key},
+                }
+            )
+    return rows
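+def ci(values: np.ndarray) -> tuple[float, float]:
+    # With zero bootstrap replicates there is nothing to take quantiles of.
+    if np.size(values) == 0:
+        return float("nan"), float("nan")
+    return float(np.quantile(values, 0.025)), float(np.quantile(values, 0.975))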
+def bootstrap_indices(n: int, reps: int, rng: np.random.Generator) -> np.ndarray:
+    return rng.integers(0, n, size=(reps, n), endpoint=False)
+def summarize_claimed(rows: list[dict[str, Any]], reps: int, rng: np.random.Generator) -> dict[str, Any]:
+    n = len(rows)
+    units = np.asarray([row["dedup_units"] for row in rows], dtype=np.float64)
+    tokens = np.asarray([max(row["tokens"], 1) for row in rows], dtype=np.float64)
+    dups = np.asarray([row["duplicate_units"] for row in rows], dtype=np.float64)
+    idx = bootstrap_indices(n, reps, rng) if n else np.empty((0, 0), dtype=np.int64)
+    def mean_metric(arr: np.ndarray) -> dict[str, float]:
+        point = float(arr.mean()) if len(arr) else 0.0
+        boot = arr[idx].mean(axis=1) if len(arr) else np.asarray([0.0])
+        low, high = ci(boot)
+        return {"mean": point, "ci95_low": low, "ci95_high": high}
+    ratio = float(100.0 * units.sum() / tokens.sum()) if tokens.sum() else 0.0
+    ratio_boot = 100.0 * units[idx].sum(axis=1) / tokens[idx].sum(axis=1) if n else np.asarray([0.0])
+    low, high = ci(ratio_boot)
+    out: dict[str, Any] = {
+        "captions": n,
+        "dedup_units_per_caption": mean_metric(units),
+        "dedup_units_per_100_tokens": {"mean": ratio, "ci95_low": low, "ci95_high": high},
+        "duplicate_units_per_caption": mean_metric(dups),
+    }
+    for category in UNIT_CATEGORIES:
+        arr = np.asarray([row[f"{category}_units"] for row in rows], dtype=np.float64)
+        out[f"{category}_per_caption"] = mean_metric(arr)
+    return out
+def summarize_grounded(rows: list[dict[str, Any]], reps: int, rng: np.random.Generator) -> dict[str, Any]:
+    n = len(rows)
+    grounded = np.asarray([row["grounded"] for row in rows], dtype=np.float64)
+    unsupported = np.asarray([row["unsupported"] for row in rows], dtype=np.float64)
+    uncertain = np.asarray([row["uncertain"] for row in rows], dtype=np.float64)
+    visual = np.asarray([max(row["visual"], 0) for row in rows], dtype=np.float64)
+    idx = bootstrap_indices(n, reps, rng) if n else np.empty((0, 0), dtype=np.int64)
+    def ratio_metric(num: np.ndarray, den: np.ndarray) -> dict[str, float]:
+        point = float(num.sum() / den.sum()) if den.sum() else 0.0
+        if not n:
+            return {"mean": point, "ci95_low": point, "ci95_high": point}
+        boot_den = den[idx].sum(axis=1)
+        boot = np.divide(num[idx].sum(axis=1), boot_den, out=np.zeros_like(boot_den), where=boot_den != 0)
+        low, high = ci(boot)
+        return {"mean": point, "ci95_low": low, "ci95_high": high}
+    def mean_metric(arr: np.ndarray) -> dict[str, float]:
+        point = float(arr.mean()) if len(arr) else 0.0
+        boot = arr[idx].mean(axis=1) if len(arr) else np.asarray([0.0])
+        low, high = ci(boot)
+        return {"mean": point, "ci95_low": low, "ci95_high": high}
+    out: dict[str, Any] = {
+        "captions": n,
+        "visual_units": int(visual.sum()),
+        "grounded_units_per_caption": mean_metric(grounded),
+        "grounded_precision": ratio_metric(grounded, visual),
+        "unsupported_rate": ratio_metric(unsupported, visual),
+        "uncertain_rate": ratio_metric(uncertain, visual),
+    }
+    categories: dict[str, Any] = {}
+    for category in UNIT_CATEGORIES:
+        den = np.asarray([row.get(f"{category}_visual", 0) for row in rows], dtype=np.float64)
+        cat_grounded = np.asarray([row.get(f"{category}_grounded", 0) for row in rows], dtype=np.float64)
+        cat_unsupported = np.asarray([row.get(f"{category}_unsupported", 0) for row in rows], dtype=np.float64)
+        cat_uncertain = np.asarray([row.get(f"{category}_uncertain", 0) for row in rows], dtype=np.float64)
+        categories[category] = {
+            "visual_units": int(den.sum()),
+            "grounded_precision": ratio_metric(cat_grounded, den),
+            "unsupported_rate": ratio_metric(cat_unsupported, den),
+            "uncertain_rate": ratio_metric(cat_uncertain, den),
+        }
+    out["categories"] = categories
+    return out
+def write_tsv(path: Path, rows: list[dict[str, Any]], fieldnames: list[str]) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w", encoding="utf-8", newline="") as handle:
+        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
+        writer.writeheader()
+        writer.writerows(rows)
+def fmt_metric(metric: dict[str, float]) -> str:
+    return f"{metric['mean']:.4f} [{metric['ci95_low']:.4f}, {metric['ci95_high']:.4f}]"
+def main() -> int:
+    args = parse_args()
+    out_dir = Path(args.output_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    rng = np.random.default_rng(args.seed)
+    payload: dict[str, Any] = {
+        "bootstrap_reps": args.bootstrap_reps,
+        "seed": args.seed,
+        "claimed": {},
+        "grounded": {},
+    }
+    claimed_tsv: list[dict[str, Any]] = []
+    for item in args.claimed:
+        label, path = parse_label_path(item)
+        rows = read_claimed(path, label)
+        summary = summarize_claimed(rows, args.bootstrap_reps, rng)
+        payload["claimed"][label] = {"input": str(path), **summary}
+        claimed_tsv.append(
+            {
+                "surface": label,
+                "captions": summary["captions"],
+                "cbu_per_caption_ci95": fmt_metric(summary["dedup_units_per_caption"]),
+                "cbu_per_100_tokens_ci95": fmt_metric(summary["dedup_units_per_100_tokens"]),
+                "object_per_caption_ci95": fmt_metric(summary["object_per_caption"]),
+                "attribute_per_caption_ci95": fmt_metric(summary["attribute_per_caption"]),
+                "relation_per_caption_ci95": fmt_metric(summary["relation_per_caption"]),
+                "camera_per_caption_ci95": fmt_metric(summary["camera_per_caption"]),
+                "lighting_per_caption_ci95": fmt_metric(summary["lighting_per_caption"]),
+                "text_rendering_per_caption_ci95": fmt_metric(summary["text_rendering_per_caption"]),
+            }
+        )
+    grounded_tsv: list[dict[str, Any]] = []
+    category_tsv: list[dict[str, Any]] = []
+    for item in args.grounded:
+        label, path = parse_label_path(item)
+        rows = read_grounded(path, label)
+        summary = summarize_grounded(rows, args.bootstrap_reps, rng)
+        payload["grounded"][label] = {"input": str(path), **summary}
+        grounded_tsv.append(
+            {
+                "surface": label,
+                "captions": summary["captions"],
+                "visual_units": summary["visual_units"],
+                "grounded_units_per_caption_ci95": fmt_metric(summary["grounded_units_per_caption"]),
+                "grounded_precision_ci95": fmt_metric(summary["grounded_precision"]),
+                "unsupported_rate_ci95": fmt_metric(summary["unsupported_rate"]),
+                "uncertain_rate_ci95": fmt_metric(summary["uncertain_rate"]),
+            }
+        )
+        for category, cat in summary["categories"].items():
+            category_tsv.append(
+                {
+                    "surface": label,
+                    "category": category,
+                    "visual_units": cat["visual_units"],
+                    "grounded_precision_ci95": fmt_metric(cat["grounded_precision"]),
+                    "unsupported_rate_ci95": fmt_metric(cat["unsupported_rate"]),
+                    "uncertain_rate_ci95": fmt_metric(cat["uncertain_rate"]),
+                }
+            )
+    (out_dir / "cbu_bootstrap_summary.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")
+    write_tsv(
+        out_dir / "claimed_cbu_ci.tsv",
+        claimed_tsv,
+        [
+            "surface",
+            "captions",
+            "cbu_per_caption_ci95",
+            "cbu_per_100_tokens_ci95",
+            "object_per_caption_ci95",
+            "attribute_per_caption_ci95",
+            "relation_per_caption_ci95",
+            "camera_per_caption_ci95",
+            "lighting_per_caption_ci95",
+            "text_rendering_per_caption_ci95",
+        ],
+    )
+    write_tsv(
+        out_dir / "grounded_cbu_ci.tsv",
+        grounded_tsv,
+        [
+            "surface",
+            "captions",
+            "visual_units",
+            "grounded_units_per_caption_ci95",
+            "grounded_precision_ci95",
+            "unsupported_rate_ci95",
+            "uncertain_rate_ci95",
+        ],
+    )
+    write_tsv(
+        out_dir / "grounded_cbu_category_ci.tsv",
+        category_tsv,
+        [
+            "surface",
+            "category",
+            "visual_units",
+            "grounded_precision_ci95",
+            "unsupported_rate_ci95",
+            "uncertain_rate_ci95",
+        ],
+    )
+    print(json.dumps({"output_dir": str(out_dir), "claimed": len(claimed_tsv), "grounded": len(grounded_tsv)}, indent=2))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
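
The ratio metrics in `summarize_grounded` resample whole captions and recompute a ratio of sums on each replicate, so captions with many units keep their weight. A minimal standalone sketch of that idea follows, assuming a non-empty input; `ratio_bootstrap_ci` and the per-caption counts are hypothetical and only mirror the pattern used above.

```python
import numpy as np


def ratio_bootstrap_ci(num, den, reps: int = 2000, seed: int = 0) -> tuple[float, float, float]:
    """Percentile bootstrap for sum(num) / sum(den), resampling captions with replacement."""
    num = np.asarray(num, dtype=np.float64)
    den = np.asarray(den, dtype=np.float64)
    rng = np.random.default_rng(seed)
    n = len(num)
    idx = rng.integers(0, n, size=(reps, n))          # one row of caption indices per replicate
    boot_den = den[idx].sum(axis=1)
    boot = np.divide(
        num[idx].sum(axis=1),
        boot_den,
        out=np.zeros_like(boot_den),                  # replicates with no units count as 0.0
        where=boot_den != 0,
    )
    point = float(num.sum() / den.sum()) if den.sum() else 0.0
    low, high = np.quantile(boot, [0.025, 0.975])
    return point, float(low), float(high)


# Hypothetical per-caption counts: grounded units and visual (auditable) units.
grounded = [3, 0, 5, 2]
visual = [4, 1, 5, 4]
point, low, high = ratio_bootstrap_ci(grounded, visual)
print(f"{point:.4f} [{low:.4f}, {high:.4f}]")
```

Resampling at the caption level treats the caption as the independent unit, which is a deliberate choice: units inside one caption share an image and a captioner, so resampling them individually would likely understate the interval width.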

eval_code/scripts/export_cbu_vqa_tables.py ADDED

	@@ -0,0 +1,84 @@

+#!/usr/bin/env python3
+"""Export compact tables from CBU-VQA summary JSON files."""
+from __future__ import annotations
+import argparse
+import json
+from pathlib import Path
+from typing import Any
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--summary", action="append", required=True, help="CBU-VQA summary JSON")
+    parser.add_argument("--output-md", required=True)
+    parser.add_argument("--output-tex", default=None)
+    return parser.parse_args()
+def load_rows(paths: list[str]) -> list[dict[str, Any]]:
+    rows: list[dict[str, Any]] = []
+    for path in paths:
+        data = json.loads(Path(path).read_text(encoding="utf-8"))
+        for surface, stats in sorted(data.get("surfaces", {}).items()):
+            rows.append(
+                {
+                    "source": path,
+                    "surface": surface,
+                    "responses": stats.get("responses", 0),
+                    "ok": stats.get("ok", 0),
+                    "questions": stats.get("questions", 0),
+                    "support": stats.get("support_rate", 0.0),
+                    "risk": stats.get("risk_rate", 0.0),
+                    "uncertain": stats.get("uncertainty_rate", 0.0),
+                }
+            )
+    return rows
+def write_markdown(rows: list[dict[str, Any]], path: Path) -> None:
+    lines = [
+        "| Surface | Resp | OK | Q | Support ↑ | Risk ↓ | Uncertain ↓ |",
+        "|---|---:|---:|---:|---:|---:|---:|",
+    ]
+    for row in rows:
+        lines.append(
+            "| {surface} | {responses:,} | {ok:,} | {questions:,} | {support:.4f} | {risk:.4f} | {uncertain:.4f} |".format(
+                **row
+            )
+        )
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
+def write_latex(rows: list[dict[str, Any]], path: Path) -> None:
+    lines = [
+        r"\begin{tabular}{lrrrrrr}",
+        r"\toprule",
+        r"Surface & Resp. & OK & Q & Support $\uparrow$ & Risk $\downarrow$ & Uncertain $\downarrow$ \\",
+        r"\midrule",
+    ]
+    for row in rows:
+        lines.append(
+            "{surface} & {responses:,} & {ok:,} & {questions:,} & {support:.4f} & {risk:.4f} & {uncertain:.4f} \\\\".format(
+                **row
+            ).replace("_", r"\_")
+        )
+    lines.extend([r"\bottomrule", r"\end{tabular}"])
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
+def main() -> int:
+    args = parse_args()
+    rows = load_rows(args.summary)
+    write_markdown(rows, Path(args.output_md))
+    if args.output_tex:
+        write_latex(rows, Path(args.output_tex))
+    print(json.dumps({"rows": len(rows), "output_md": args.output_md, "output_tex": args.output_tex}, indent=2))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
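
As a rough guide to the input that `load_rows` expects, the sketch below builds a summary payload and formats one Markdown row the same way `write_markdown` does. The key names are taken from the script; the surface labels and all numbers are made-up placeholders, not results from this release.

```python
import json

# Hypothetical CBU-VQA summary; only the key names come from load_rows.
summary_text = json.dumps(
    {
        "surfaces": {
            "ref": {"responses": 1200, "ok": 1150, "questions": 9200,
                    "support_rate": 0.75, "risk_rate": 0.125, "uncertainty_rate": 0.125},
            "ours": {"responses": 1200, "ok": 1180, "questions": 9440,
                     "support_rate": 0.8125, "risk_rate": 0.0625, "uncertainty_rate": 0.125},
        }
    }
)

data = json.loads(summary_text)
rows = [
    {
        "surface": surface,
        "responses": stats.get("responses", 0),
        "ok": stats.get("ok", 0),
        "questions": stats.get("questions", 0),
        "support": stats.get("support_rate", 0.0),
        "risk": stats.get("risk_rate", 0.0),
        "uncertain": stats.get("uncertainty_rate", 0.0),
    }
    for surface, stats in sorted(data.get("surfaces", {}).items())  # alphabetical by label
]

row_template = (
    "| {surface} | {responses:,} | {ok:,} | {questions:,} "
    "| {support:.4f} | {risk:.4f} | {uncertain:.4f} |"
)
line = row_template.format(**rows[0])
print(line)  # | ours | 1,200 | 1,180 | 9,440 | 0.8125 | 0.0625 | 0.1250 |
```

Because surfaces are sorted by label within each summary file, row order in the exported table follows the label names rather than the order in the JSON.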

eval_code/scripts/pack_recap_ed_metrics.py ADDED

	@@ -0,0 +1,223 @@

+#!/usr/bin/env python3
+"""Pack small recap E&D metric artifacts into a release-friendly directory."""
+from __future__ import annotations
+import argparse
+import csv
+import json
+import shutil
+from pathlib import Path
+from typing import Any
+ROOT = Path("<PROJECT_ROOT>")
+NVME = Path("<LOCAL_CACHE>")
+EMBEDDING_RUNS = [
+    ("Qwen3-Embedding-4B", "ours", "qwen3-embedding-4b/datacomp_ours_50k"),
+    ("Qwen3-Embedding-4B", "ref", "qwen3-embedding-4b/datacomp_ref_llava15_50k"),
+    ("Qwen3-Embedding-8B", "ours", "qwen3-embedding-8b/datacomp_ours_50k"),
+    ("Qwen3-Embedding-8B", "ref", "qwen3-embedding-8b/datacomp_ref_llava15_50k"),
+    ("E5-Mistral-7B", "ours", "e5-mistral-7b-instruct/datacomp_ours_50k"),
+    ("E5-Mistral-7B", "ref", "e5-mistral-7b-instruct/datacomp_ref_llava15_50k"),
+    ("BGE-M3-official", "ours", "bge-m3-official/datacomp_ours_50k"),
+    ("BGE-M3-official", "ref", "bge-m3-official/datacomp_ref_llava15_50k"),
+]
+SUPPORT_RUNS = [
+    ("Qwen3-Embedding-4B raw/raw", "ours", "qwen3-embedding-4b/2026-04-25/diffusiondb_raw_to_ours_50k.support.json"),
+    ("Qwen3-Embedding-4B raw/raw", "ref", "qwen3-embedding-4b/2026-04-25/diffusiondb_raw_to_ref_50k.support.json"),
+    ("Qwen3-Embedding-4B query/doc", "ours", "qwen3-embedding-4b/2026-04-25/diffusiondb_query_to_ours_50k.support.json"),
+    ("Qwen3-Embedding-4B query/doc", "ref", "qwen3-embedding-4b/2026-04-25/diffusiondb_query_to_ref_50k.support.json"),
+    ("E5-Mistral raw/raw", "ours", "e5-mistral-7b-instruct/2026-04-25/diffusiondb_raw_to_ours_50k.support.json"),
+    ("E5-Mistral raw/raw", "ref", "e5-mistral-7b-instruct/2026-04-25/diffusiondb_raw_to_ref_50k.support.json"),
+    ("E5-Mistral query/doc", "ours", "e5-mistral-7b-instruct/2026-04-25/diffusiondb_query_to_ours_50k.support.json"),
+    ("E5-Mistral query/doc", "ref", "e5-mistral-7b-instruct/2026-04-25/diffusiondb_query_to_ref_50k.support.json"),
+    ("BGE-M3 raw/corpus", "ours", "bge-m3-official/2026-04-25/diffusiondb_raw_to_ours_50k.support.json"),
+    ("BGE-M3 raw/corpus", "ref", "bge-m3-official/2026-04-25/diffusiondb_raw_to_ref_50k.support.json"),
+    ("BGE-M3 query/corpus", "ours", "bge-m3-official/2026-04-25/diffusiondb_query_to_ours_50k.support.json"),
+    ("BGE-M3 query/corpus", "ref", "bge-m3-official/2026-04-25/diffusiondb_query_to_ref_50k.support.json"),
+]
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--output-dir", default="artifacts/recap-ed/metrics-2026-04-25")
+    return parser.parse_args()
+def load_json(path: Path) -> dict[str, Any]:
+    with path.open("r", encoding="utf-8") as handle:
+        return json.load(handle)
+def write_tsv(path: Path, rows: list[dict[str, Any]], fields: list[str]) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w", encoding="utf-8", newline="") as handle:
+        writer = csv.DictWriter(handle, fields, delimiter="\t")
+        writer.writeheader()
+        writer.writerows(rows)
+def rel_or_abs(path: Path) -> str:
+    try:
+        return str(path.relative_to(ROOT))
+    except ValueError:
+        return str(path)
+def pack_embedding(out_dir: Path, manifest: dict[str, Any]) -> None:
+    rows: list[dict[str, Any]] = []
+    for encoder, surface, rel in EMBEDDING_RUNS:
+        base = NVME / "caption-embeddings" / rel
+        vendi_path = base / "vendi_partition_b4096_seed0.json"
+        rel_path = Path(rel)
+        geometry_path = NVME / "caption-geometry" / rel_path.parent / f"{rel_path.name}.geometry.json"
+        if not geometry_path.exists():
+            geometry_path = base / "geometry_seed0.json"
+        vendi = load_json(vendi_path)
+        geometry = load_json(geometry_path)
+        geometry_metrics = geometry.get("metrics", geometry)
+        summary = vendi["summary"]["vendi"]
+        rows.append(
+            {
+                "encoder": encoder,
+                "surface": surface,
+                "rows": vendi.get("source_rows"),
+                "vendi_mean": f"{summary['mean']:.6f}",
+                "vendi_ci95_low": f"{summary['ci95_low']:.6f}",
+                "vendi_ci95_high": f"{summary['ci95_high']:.6f}",
+                "cov_effective_rank": f"{geometry_metrics.get('cov_effective_rank', 0):.6f}",
+                "cov_participation_ratio": f"{geometry_metrics.get('cov_participation_ratio', 0):.6f}",
+                "cov_top1_mass": f"{geometry_metrics.get('cov_top1_mass', 0):.6f}",
+            }
+        )
+        manifest["sources"].append(rel_or_abs(vendi_path))
+        manifest["sources"].append(rel_or_abs(geometry_path))
+    write_tsv(
+        out_dir / "embedding" / "caption_embedding_profile.tsv",
+        rows,
+        [
+            "encoder",
+            "surface",
+            "rows",
+            "vendi_mean",
+            "vendi_ci95_low",
+            "vendi_ci95_high",
+            "cov_effective_rank",
+            "cov_participation_ratio",
+            "cov_top1_mass",
+        ],
+    )
+def pack_support(out_dir: Path, manifest: dict[str, Any]) -> None:
+    rows: list[dict[str, Any]] = []
+    for protocol, surface, rel in SUPPORT_RUNS:
+        path = NVME / "prompt-caption-support" / rel
+        data = load_json(path)
+        metrics = data["metrics"]
+        rows.append(
+            {
+                "protocol": protocol,
+                "surface": surface,
+                "prompt_rows": data.get("query_rows"),
+                "caption_rows": data.get("gallery_rows"),
+                "k": data.get("k"),
+                "coverage": f"{metrics['coverage']:.6f}",
+                "density": f"{metrics['density']:.6f}",
+                "nn_cosine_mean": f"{metrics['nn_cosine_mean']:.6f}",
+                "nn_distance_p95": f"{metrics['nn_distance_p95']:.6f}",
+            }
+        )
+        manifest["sources"].append(rel_or_abs(path))
+    write_tsv(
+        out_dir / "embedding" / "prompt_caption_support.tsv",
+        rows,
+        [
+            "protocol",
+            "surface",
+            "prompt_rows",
+            "caption_rows",
+            "k",
+            "coverage",
+            "density",
+            "nn_cosine_mean",
+            "nn_distance_p95",
+        ],
+    )
+def pack_cpu(out_dir: Path, manifest: dict[str, Any]) -> None:
+    cpu_dir = out_dir / "cpu"
+    cpu_dir.mkdir(parents=True, exist_ok=True)
+    small_files = [
+        ROOT / "artifacts/caption-survey/cpu_remaining_2026-04-24/paired_delta_ci.tsv",
+        NVME / "caption-survey/local_long_1m.json",
+        NVME / "caption-survey/hf_manifest_1m.json",
+    ]
+    for src in small_files:
+        dst = cpu_dir / src.name
+        shutil.copy2(src, dst)
+        manifest["sources"].append(rel_or_abs(src))
+        manifest["packed_files"].append(rel_or_abs(dst))
+def write_readme(out_dir: Path) -> None:
+    readme = """# Recap E&D Metric Artifacts
+Date: 2026-04-25
+This directory contains small, paper-facing metric artifacts for the recap E&D draft.
+Large intermediate embedding arrays, VLM response JSONL files, and source image data are
+not copied here. The manifest records local source paths for reproducibility.
+Contents:
+- `cpu/paired_delta_ci.tsv`: paired CPU lexical/surface metric deltas with CIs.
+- `cpu/local_long_1m.json`: local long-caption corpus survey summaries.
+- `cpu/hf_manifest_1m.json`: public-reference corpus survey summaries.
+- `cbu/claimed_cbu_ci.tsv`: caption-level bootstrap CIs for claimed CBU density.
+- `cbu/grounded_cbu_ci.tsv`: caption-level bootstrap CIs for exact-unit grounded CBU audits.
+- `cbu/grounded_cbu_category_ci.tsv`: category-level grounded CBU audit CIs.
+- `embedding/caption_embedding_profile.tsv`: Vendi and covariance-geometry profiles.
+- `embedding/prompt_caption_support.tsv`: PRDC-style prompt-in-caption support metrics.
+Boundary:
+- Text-only metrics describe caption/supervision structure.
+- `GroundedCBU` is a sampled VLM proxy audit, not a human-certified faithfulness score.
+- Embedding metrics are encoder-sensitive and should be reported as profiles, not a single scalar quality score.
+"""
+    (out_dir / "README.md").write_text(readme, encoding="utf-8")
+def main() -> int:
+    args = parse_args()
+    out_dir = Path(args.output_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    manifest: dict[str, Any] = {
+        "date": "2026-04-25",
+        "purpose": "paper-facing recap E&D metric artifact bundle",
+        "sources": [],
+        "packed_files": [],
+    }
+    pack_cpu(out_dir, manifest)
+    pack_embedding(out_dir, manifest)
+    pack_support(out_dir, manifest)
+    write_readme(out_dir)
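+    # Rebuild the list instead of extending it: pack_cpu() already recorded its
+    # copies, so extending would list those files twice and inflate the count.
+    manifest["packed_files"] = [
+        rel_or_abs(path)
+        for path in sorted(out_dir.rglob("*"))
+        if path.is_file() and path.name != "manifest.json"
+    ]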
+    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
+    print(json.dumps({"output_dir": str(out_dir), "files": len(manifest["packed_files"])}, indent=2))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
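
The manifest step in `main()` lists packed files by walking the output directory. Below is a minimal sketch of that listing, assuming a hypothetical helper name `list_packed_files`, a throwaway temp directory, and paths made relative with `Path.relative_to` (the release's own `rel_or_abs` helper is not reproduced here). It illustrates that a single `rglob` pass also picks up files copied in earlier steps.

```python
import tempfile
from pathlib import Path


def list_packed_files(out_dir: Path) -> list[str]:
    """Return relative paths of every file under out_dir except manifest.json.

    Hypothetical helper for illustration; not part of the released script.
    """
    return [
        path.relative_to(out_dir).as_posix()
        for path in sorted(out_dir.rglob("*"))
        if path.is_file() and path.name != "manifest.json"
    ]


def demo() -> list[str]:
    """Build a tiny bundle layout in a temp directory and list it."""
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)
        (out_dir / "cpu").mkdir()
        (out_dir / "cpu" / "paired_delta_ci.tsv").write_text("a\tb\n", encoding="utf-8")
        (out_dir / "README.md").write_text("# demo\n", encoding="utf-8")
        (out_dir / "manifest.json").write_text("{}", encoding="utf-8")
        return list_packed_files(out_dir)
```

Calling `demo()` returns the README and the copied TSV but not the manifest itself, mirroring the `path.name != "manifest.json"` filter in the script.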

eval_code/scripts/plot_caption_survey_curves.py ADDED

	@@ -0,0 +1,251 @@

+#!/usr/bin/env python3
+"""Plot budget-curve metrics from caption survey JSON outputs."""
+from __future__ import annotations
+import argparse
+import json
+from pathlib import Path
+from typing import Any
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+METRICS = [
+    ("coverage_rate", "Budget Eligibility@B", "up"),
+    ("distinct_n.2", "Distinct-2@B", "up"),
+    ("distinct_n.3", "Distinct-3@B", "up"),
+    ("ngram_top_k_mass.2", "Top-100 Bigram Mass@B", "down"),
+    ("ngram_top_k_mass.3", "Top-100 Trigram Mass@B", "down"),
+    ("violation_rate", "Violation Rate@B", "down"),
+    ("repeated_4gram_rate", "Repeated 4-gram Rate@B", "down"),
+]
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Plot caption survey budget curves")
+    parser.add_argument("--input", action="append", required=True, help="Survey JSON path (repeatable)")
+    parser.add_argument("--output-dir", required=True, help="Directory for output PNG plots")
+    parser.add_argument(
+        "--long-coverage-threshold",
+        type=float,
+        default=0.5,
+        help="Budget Eligibility@64 threshold used to split long vs. short regimes",
+    )
+    return parser.parse_args()
+def nested_get(mapping: dict[str, Any], path: str) -> float | None:
+    current: Any = mapping
+    for part in path.split("."):
+        if not isinstance(current, dict) or part not in current:
+            return None
+        current = current[part]
+    return float(current) if isinstance(current, (int, float)) else None
+def load_rows(paths: list[str]) -> list[dict[str, Any]]:
+    rows: list[dict[str, Any]] = []
+    for raw_path in paths:
+        payload = json.loads(Path(raw_path).read_text(encoding="utf-8"))
+        if "results" in payload:
+            for item in payload.get("results", []):
+                summary = item.get("summary") or item.get("survey_summary")
+                if not isinstance(summary, dict):
+                    continue
+                entry = item.get("entry") or {}
+                length_controlled = summary.get("length_controlled") or {}
+                budgets = sorted(int(key) for key in length_controlled.keys())
+                if not budgets:
+                    continue
+                cov64 = nested_get(length_controlled.get("64", {}), "coverage_rate") or 0.0
+                full = summary.get("full_length_reference") or {}
+                avg_tokens = full.get("avg_tokens", full.get("avg_lexical_tokens", 0.0))
+                rows.append(
+                    {
+                        "name": entry.get("name", Path(raw_path).stem),
+                        "family": entry.get("source_family", "unknown"),
+                        "group": entry.get("group", "unknown"),
+                        "description": entry.get("description", ""),
+                        "captioner": entry.get("captioner", ""),
+                        "avg_tokens": float(avg_tokens),
+                        "coverage64": float(cov64),
+                        "budgets": budgets,
+                        "length_controlled": length_controlled,
+                    }
+                )
+            continue
+        if "length_controlled" in payload:
+            length_controlled = payload.get("length_controlled") or {}
+            budgets = sorted(int(key) for key in length_controlled.keys())
+            if not budgets:
+                continue
+            cov64 = nested_get(length_controlled.get("64", {}), "coverage_rate") or 0.0
+            full = payload.get("full_length_reference") or {}
+            avg_tokens = full.get("avg_tokens", full.get("avg_lexical_tokens", 0.0))
+            stem = Path(raw_path).stem
+            name = stem.removesuffix("_1m").removesuffix("_50k")
+            family = "unknown"
+            if "datacomp" in name:
+                family = "datacomp"
+            elif "pd12m" in name:
+                family = "pd12m"
+            rows.append(
+                {
+                    "name": name,
+                    "family": family,
+                    "group": "direct_summary",
+                    "description": "",
+                    "captioner": "",
+                    "avg_tokens": float(avg_tokens),
+                    "coverage64": float(cov64),
+                    "budgets": budgets,
+                    "length_controlled": length_controlled,
+                }
+            )
+    return rows
+def label_for_row(row: dict[str, Any]) -> str:
+    name = row["name"]
+    if name.startswith("ours_"):
+        label = f"ours:{name.removeprefix('ours_')}"
+    elif name.startswith("ref_"):
+        label = f"ref:{name.removeprefix('ref_')}"
+    else:
+        label = name
+    if name == "ref_cc12m_qwen3vl8b":
+        label += "†"
+    return label
+def decorate_metric_label(metric_label: str, direction: str) -> str:
+    arrow = "↑" if direction == "up" else "↓"
+    return f"{metric_label} {arrow}"
+def style_for_row(row: dict[str, Any]) -> dict[str, Any]:
+    if row["name"].startswith("ours_"):
+        return {"linewidth": 2.8, "alpha": 0.95, "linestyle": "-"}
+    return {"linewidth": 1.6, "alpha": 0.85, "linestyle": "--"}
+def series_for_metric(row: dict[str, Any], metric_key: str) -> tuple[list[int], list[float]]:
+    xs: list[int] = []
+    ys: list[float] = []
+    for budget in row["budgets"]:
+        summary = row["length_controlled"].get(str(budget), {})
+        value = nested_get(summary, metric_key)
+        if value is None:
+            continue
+        xs.append(budget)
+        ys.append(value)
+    return xs, ys
+def save_metric_plot(
+    rows: list[dict[str, Any]],
+    metric_key: str,
+    metric_label: str,
+    direction: str,
+    regime_name: str,
+    output_path: Path,
+) -> None:
+    fig, ax = plt.subplots(figsize=(10.5, 6.2))
+    for row in sorted(rows, key=lambda item: (item["family"], item["name"])):
+        xs, ys = series_for_metric(row, metric_key)
+        if not xs:
+            continue
+        ax.plot(xs, ys, marker="o", label=label_for_row(row), **style_for_row(row))
+    decorated_label = decorate_metric_label(metric_label, direction)
+    ax.set_title(f"{decorated_label} by Budget ({regime_name})")
+    ax.set_xlabel("Token Budget")
+    ax.set_ylabel(decorated_label)
+    ax.set_xticks(sorted({budget for row in rows for budget in row["budgets"]}))
+    ax.grid(True, alpha=0.25)
+    ax.legend(fontsize=8, ncol=2)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.tight_layout()
+    fig.savefig(output_path, dpi=180)
+    plt.close(fig)
+def save_family_plot(rows: list[dict[str, Any]], family: str, output_path: Path) -> None:
+    family_rows = [row for row in rows if row["family"] == family]
+    if not family_rows:
+        return
+    fig, axes = plt.subplots(2, 3, figsize=(14, 8.5))
+    axes = axes.flatten()
+    for axis, (metric_key, metric_label, direction) in zip(axes, METRICS[:6], strict=False):
+        for row in sorted(family_rows, key=lambda item: item["name"]):
+            xs, ys = series_for_metric(row, metric_key)
+            if not xs:
+                continue
+            axis.plot(xs, ys, marker="o", label=label_for_row(row), **style_for_row(row))
+        axis.set_title(decorate_metric_label(metric_label, direction))
+        axis.set_xlabel("Budget")
+        axis.grid(True, alpha=0.25)
+    handles, labels = axes[0].get_legend_handles_labels()
+    if handles:
+        fig.legend(handles, labels, loc="lower center", ncol=2, fontsize=8)
+    fig.suptitle(f"{family} Budget Curves", y=0.98)
+    fig.tight_layout(rect=(0, 0.05, 1, 0.96))
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(output_path, dpi=180)
+    plt.close(fig)
+def main() -> int:
+    args = parse_args()
+    rows = load_rows(args.input)
+    if not rows:
+        raise SystemExit("No survey rows loaded")
+    output_dir = Path(args.output_dir)
+    long_rows = [row for row in rows if row["coverage64"] >= args.long_coverage_threshold]
+    short_rows = [row for row in rows if row["coverage64"] < args.long_coverage_threshold]
+    for metric_key, metric_label, direction in METRICS:
+        if long_rows:
+            save_metric_plot(
+                long_rows,
+                metric_key,
+                metric_label,
+                direction,
+                "long-regime",
+                output_dir / "overview" / "long" / f"{metric_key.replace('.', '_')}.png",
+            )
+        if short_rows:
+            save_metric_plot(
+                short_rows,
+                metric_key,
+                metric_label,
+                direction,
+                "short-regime",
+                output_dir / "overview" / "short" / f"{metric_key.replace('.', '_')}.png",
+            )
+    for family in sorted({row["family"] for row in rows}):
+        save_family_plot(rows, family, output_dir / "families" / f"{family}.png")
+    manifest = {
+        "inputs": args.input,
+        "output_dir": str(output_dir),
+        "long_coverage_threshold": args.long_coverage_threshold,
+        "rows_loaded": len(rows),
+        "long_rows": [row["name"] for row in long_rows],
+        "short_rows": [row["name"] for row in short_rows],
+        "metrics": [metric_key for metric_key, _, _ in METRICS],
+    }
+    (output_dir / "plot_manifest.json").write_text(json.dumps(manifest, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
+    print(json.dumps(manifest, indent=2, ensure_ascii=False))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
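
The plotting script reads each metric through a dotted path such as `distinct_n.2` and assigns a corpus to the long or short regime by its coverage rate at budget 64. The sketch below repeats the script's `nested_get` logic on a small payload to show both steps. The numeric values are illustrative assumptions, not released results, and `Optional[float]` stands in for the script's `float | None` annotation.

```python
from typing import Any, Optional


def nested_get(mapping: dict[str, Any], path: str) -> Optional[float]:
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    current: Any = mapping
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return float(current) if isinstance(current, (int, float)) else None


# Illustrative values only (assumed for this sketch).
length_controlled = {
    "32": {"coverage_rate": 0.9, "distinct_n": {"2": 0.41}},
    "64": {"coverage_rate": 0.35, "distinct_n": {"2": 0.47}},
}

# Budgets are stored as string keys and sorted numerically for the x-axis.
budgets = sorted(int(key) for key in length_controlled)
distinct2 = [nested_get(length_controlled[str(b)], "distinct_n.2") for b in budgets]

# A missing budget-64 entry falls back to 0.0, which lands in the short regime.
cov64 = nested_get(length_controlled.get("64", {}), "coverage_rate") or 0.0
regime = "long" if cov64 >= 0.5 else "short"
```

With the default `--long-coverage-threshold` of 0.5, this sample corpus is plotted in the short-regime overview.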

eval_code/scripts/run_cbu_vqa_requests.py ADDED

	@@ -0,0 +1,261 @@

+#!/usr/bin/env python3
+"""Run VQA-style CBU question requests against an OpenAI-compatible VLM server."""
+from __future__ import annotations
+import argparse
+import asyncio
+import base64
+import json
+import time
+from io import BytesIO
+from pathlib import Path
+from typing import Any
+import aiohttp
+from PIL import Image, ImageFile
+ImageFile.LOAD_TRUNCATED_IMAGES = True
+ANSWERS = ["yes", "no", "uncertain"]
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Run CBU VQA requests")
+    parser.add_argument("--input", required=True)
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--urls", default="http://localhost:8000")
+    parser.add_argument("--model", default="Qwen/Qwen3.5-397B-A17B-FP8")
+    parser.add_argument("--max-requests", type=int, default=None)
+    parser.add_argument("--concurrency", type=int, default=512)
+    parser.add_argument("--max-tokens", type=int, default=2048)
+    parser.add_argument("--temperature", type=float, default=0.0)
+    parser.add_argument("--timeout-sec", type=int, default=2400)
+    parser.add_argument("--image-mode", choices=["auto", "file", "data", "url"], default="file")
+    parser.add_argument("--structured-json", action="store_true")
+    parser.add_argument(
+        "--no-evidence",
+        action="store_true",
+        help="Use compact answer-only schema: question_id, answer, confidence.",
+    )
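+    parser.add_argument("--resume", action="store_true", help="Append to output and skip request_ids already present.")
+    parser.add_argument(
+        "--resume-ok-only",
+        action="store_true",
+        help="With --resume, skip only previously successful request_ids so failed requests are retried.",
+    )
+    parser.add_argument(
+        "--skip-ok-from",
+        default=None,
+        help="JSONL response log whose successful request_ids should be skipped.",
+    )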
+    return parser.parse_args()
+def iter_requests(path: Path, max_requests: int | None) -> list[dict[str, Any]]:
+    rows = []
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if max_requests is not None and len(rows) >= max_requests:
+                break
+            if line.strip():
+                rows.append(json.loads(line))
+    return rows
+def image_url_for(row: dict[str, Any], mode: str) -> str:
+    if mode in {"auto", "data"} and row.get("image_path"):
+        path = Path(row["image_path"])
+        with Image.open(path) as image:
+            if image.mode != "RGB":
+                image = image.convert("RGB")
+            buffer = BytesIO()
+            image.save(buffer, format="JPEG", quality=88)
+        return f"data:image/jpeg;base64,{base64.b64encode(buffer.getvalue()).decode('ascii')}"
+    if mode in {"auto", "file"} and row.get("image_path"):
+        return Path(row["image_path"]).resolve().as_uri()
+    if mode == "file":
+        raise ValueError(f"request {row.get('request_id')} has no image_path")
+    return row["image_url"]
+def response_schema(question_ids: list[str], include_evidence: bool) -> dict[str, Any]:
+    item_properties: dict[str, Any] = {
+        "question_id": {"type": "string", "enum": question_ids},
+        "answer": {"type": "string", "enum": ANSWERS},
+        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
+    }
+    required = ["question_id", "answer", "confidence"]
+    if include_evidence:
+        item_properties["evidence"] = {"type": "string", "maxLength": 160}
+        required.append("evidence")
+    return {
+        "type": "object",
+        "properties": {
+            "caption_id": {"type": "string"},
+            "question_results": {
+                "type": "array",
+                "minItems": len(question_ids),
+                "maxItems": len(question_ids),
+                "items": {
+                    "type": "object",
+                    "properties": item_properties,
+                    "required": required,
+                    "additionalProperties": False,
+                },
+            },
+        },
+        "required": ["caption_id", "question_results"],
+        "additionalProperties": False,
+    }
+def validate(parsed: Any, row: dict[str, Any], include_evidence: bool) -> str | None:
+    if not isinstance(parsed, dict):
+        return "top-level response is not an object"
+    if not isinstance(parsed.get("caption_id"), str):
+        return "caption_id is not a string"
+    results = parsed.get("question_results")
+    if not isinstance(results, list):
+        return "question_results is not an array"
+    expected = [question["question_id"] for question in row.get("questions", [])]
+    seen = []
+    for index, result in enumerate(results):
+        if not isinstance(result, dict):
+            return f"question_results[{index}] is not an object"
+        question_id = result.get("question_id")
+        if not isinstance(question_id, str):
+            return f"question_results[{index}].question_id is not a string"
+        seen.append(question_id)
+        if result.get("answer") not in set(ANSWERS):
+            return f"question_results[{index}].answer has invalid value"
+        if not isinstance(result.get("confidence"), int | float):
+            return f"question_results[{index}].confidence is not numeric"
+        if include_evidence and not isinstance(result.get("evidence"), str):
+            return f"question_results[{index}].evidence is not a string"
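+    # Report duplicates before the set comparison; otherwise a repeated id is
+    # reported as a generic mismatch and the duplicate check never fires.
+    if len(seen) != len(set(seen)):
+        return "duplicate question_id in response"
+    if sorted(seen) != sorted(expected):
+        return f"question_id set mismatch: expected={len(expected)} seen={len(seen)}"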
+    return None
+def payload_for(row: dict[str, Any], args: argparse.Namespace) -> dict[str, Any]:
+    question_ids = [question["question_id"] for question in row.get("questions", [])]
+    user_prompt = row["user_prompt"]
+    if args.no_evidence:
+        user_prompt = user_prompt.replace(
+            "- Keep evidence short and grounded in visible image content.\n",
+            "- Return only question_id, answer, and confidence for each question; do not include evidence text.\n",
+        )
+    payload: dict[str, Any] = {
+        "model": args.model,
+        "max_tokens": args.max_tokens,
+        "temperature": args.temperature,
+        "messages": [
+            {"role": "system", "content": row["system_prompt"]},
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": user_prompt},
+                    {"type": "image_url", "image_url": {"url": image_url_for(row, args.image_mode)}},
+                ],
+            },
+        ],
+        "chat_template_kwargs": {"enable_thinking": False},
+    }
+    if args.structured_json:
+        payload["structured_outputs"] = {"json": response_schema(question_ids, include_evidence=not args.no_evidence)}
+    return payload
+async def post_one(session: aiohttp.ClientSession, url: str, row: dict[str, Any], args: argparse.Namespace) -> dict[str, Any]:
+    endpoint = f"{url.rstrip('/')}/v1/chat/completions"
+    start = time.perf_counter()
+    try:
+        async with session.post(endpoint, json=payload_for(row, args), headers={"Authorization": "Bearer sk-fake"}) as response:
+            text = await response.text()
+            elapsed = time.perf_counter() - start
+            if response.status >= 400:
+                return {"request_id": row["request_id"], "ok": False, "status": response.status, "elapsed_sec": round(elapsed, 4), "error": text[:4000], "request": row}
+            body = json.loads(text)
+            content = body["choices"][0]["message"]["content"]
+            parsed = None
+            parse_error = None
+            schema_error = None
+            try:
+                parsed = json.loads(content)
+                schema_error = validate(parsed, row, include_evidence=not args.no_evidence)
+            except Exception as exc:  # noqa: BLE001
+                parse_error = repr(exc)
+            return {
+                "request_id": row["request_id"],
+                "ok": parse_error is None and schema_error is None,
+                "status": response.status,
+                "elapsed_sec": round(elapsed, 4),
+                "model": args.model,
+                "usage": body.get("usage", {}),
+                "response_text": content,
+                "parsed": parsed,
+                "parse_error": parse_error,
+                "schema_error": schema_error,
+                "request": row,
+            }
+    except Exception as exc:  # noqa: BLE001
+        return {"request_id": row["request_id"], "ok": False, "status": None, "elapsed_sec": round(time.perf_counter() - start, 4), "error": repr(exc), "request": row}
+def load_seen(args: argparse.Namespace, output: Path) -> set[str]:
+    seen: set[str] = set()
+    paths: list[Path] = []
+    if args.skip_ok_from:
+        paths.append(Path(args.skip_ok_from))
+    if args.resume and output.exists():
+        paths.append(output)
+    for path in paths:
+        with path.open("r", encoding="utf-8") as handle:
+            for line in handle:
+                if not line.strip():
+                    continue
+                try:
+                    row = json.loads(line)
+                except json.JSONDecodeError:
+                    continue
+                if (path != output or args.resume_ok_only) and not row.get("ok"):
+                    continue
+                request_id = row.get("request_id")
+                if isinstance(request_id, str):
+                    seen.add(request_id)
+    return seen
+async def run(args: argparse.Namespace) -> int:
+    rows = iter_requests(Path(args.input), args.max_requests)
+    urls = [item.strip() for item in args.urls.split(",") if item.strip()]
+    if not urls:
+        raise SystemExit("--urls must list at least one server URL")
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    seen_request_ids = load_seen(args, output)
+    rows = [row for row in rows if row.get("request_id") not in seen_request_ids]
+    timeout = aiohttp.ClientTimeout(total=args.timeout_sec)
+    connector = aiohttp.TCPConnector(limit=args.concurrency)
+    sem = asyncio.Semaphore(args.concurrency)
+    ok = 0
+    total = 0
+    mode = "a" if args.resume else "w"
+    with output.open(mode, encoding="utf-8") as handle:
+        async with aiohttp.ClientSession(timeout=timeout, connector=connector) as session:
+            async def guarded(index: int, row: dict[str, Any]) -> dict[str, Any]:
+                async with sem:
+                    return await post_one(session, urls[index % len(urls)], row, args)
+            tasks = [asyncio.create_task(guarded(index, row)) for index, row in enumerate(rows)]
+            for task in asyncio.as_completed(tasks):
+                result = await task
+                handle.write(json.dumps(result, ensure_ascii=False) + "\n")
+                handle.flush()
+                total += 1
+                ok += int(bool(result.get("ok")))
+                if total % 10 == 0 or total == len(rows):
+                    print(json.dumps({"completed": total, "ok": ok, "total": len(rows), "skipped_existing": len(seen_request_ids)}, ensure_ascii=False))
+    print(json.dumps({"output": str(output), "completed": total, "ok": ok, "skipped_existing": len(seen_request_ids)}, indent=2))
+    return 0
+def main() -> int:
+    return asyncio.run(run(parse_args()))
+if __name__ == "__main__":
+    raise SystemExit(main())
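
The runner spreads requests across servers round-robin (`urls[index % len(urls)]`) and caps in-flight work with an `asyncio.Semaphore`. The sketch below isolates that pattern without any network access. `fake_post` and `dispatch` are hypothetical stand-ins for the script's `post_one` and `run`, and the URLs are placeholders.

```python
import asyncio


async def fake_post(url: str, request_id: str) -> dict:
    """Stand-in for the HTTP call: yields to the event loop once and echoes its routing."""
    await asyncio.sleep(0)
    return {"request_id": request_id, "url": url, "ok": True}


async def dispatch(request_ids: list[str], urls: list[str], concurrency: int) -> list[dict]:
    """Send every request, round-robin over urls, with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(index: int, request_id: str) -> dict:
        async with sem:
            return await fake_post(urls[index % len(urls)], request_id)

    tasks = [asyncio.create_task(guarded(i, rid)) for i, rid in enumerate(request_ids)]
    results = []
    # as_completed yields in completion order, not submission order.
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results


results = asyncio.run(dispatch(["r0", "r1", "r2", "r3"], ["http://a", "http://b"], concurrency=2))
routing = {item["request_id"]: item["url"] for item in results}
```

Because results arrive in completion order, each record carries its `request_id`; that is also what lets a resumed run skip requests already present in the output log.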

eval_code/scripts/run_grounded_cbu_verify_requests.py ADDED

	@@ -0,0 +1,289 @@

+#!/usr/bin/env python3
+"""Run exact-unit grounded-CBU verification requests against vLLM."""
+from __future__ import annotations
+import argparse
+import asyncio
+import base64
+import json
+import time
+from io import BytesIO
+from pathlib import Path
+from typing import Any
+import aiohttp
+from PIL import Image, ImageFile
+ImageFile.LOAD_TRUNCATED_IMAGES = True
+STATUSES = [
+    "grounded",
+    "unsupported",
+    "uncertain",
+    "invalid_text_unit",
+    "not_a_visual_claim",
+    "image_unavailable",
+]
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Run exact-unit grounded-CBU verification requests")
+    parser.add_argument("--input", required=True)
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--urls", default="http://localhost:8000")
+    parser.add_argument("--model", default="Qwen/Qwen3.5-122B-A10B-FP8")
+    parser.add_argument("--max-requests", type=int, default=None)
+    parser.add_argument("--concurrency", type=int, default=32)
+    parser.add_argument("--max-tokens", type=int, default=2048)
+    parser.add_argument("--temperature", type=float, default=0.0)
+    parser.add_argument("--timeout-sec", type=int, default=600)
+    parser.add_argument("--image-mode", choices=["auto", "file", "data", "url"], default="auto")
+    parser.add_argument("--structured-json", action="store_true")
+    parser.add_argument("--resume", action="store_true", help="Append to output and skip request_ids already present.")
+    parser.add_argument(
+        "--resume-ok-only",
+        action="store_true",
+        help="With --resume, skip only previously successful request_ids so timeout/schema failures are retried.",
+    )
+    parser.add_argument(
+        "--skip-ok-from",
+        default=None,
+        help="JSONL response log whose successful request_ids should be skipped while writing a separate output.",
+    )
+    return parser.parse_args()
+def iter_requests(path: Path, max_requests: int | None) -> list[dict[str, Any]]:
+    rows = []
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if max_requests is not None and len(rows) >= max_requests:
+                break
+            if line.strip():
+                rows.append(json.loads(line))
+    return rows
+def image_url_for(row: dict[str, Any], mode: str) -> str:
+    if mode in {"auto", "data"} and row.get("image_path"):
+        path = Path(row["image_path"])
+        with Image.open(path) as image:
+            if image.mode != "RGB":
+                image = image.convert("RGB")
+            buffer = BytesIO()
+            image.save(buffer, format="JPEG", quality=88)
+        return f"data:image/jpeg;base64,{base64.b64encode(buffer.getvalue()).decode('ascii')}"
+    if mode in {"auto", "file"} and row.get("image_path"):
+        return Path(row["image_path"]).resolve().as_uri()
+    if mode == "file":
+        raise ValueError(f"request {row.get('request_id')} has no image_path")
+    return row["image_url"]
+def response_schema(unit_ids: list[str]) -> dict[str, Any]:
+    return {
+        "type": "object",
+        "properties": {
+            "caption_id": {"type": "string"},
+            "unit_results": {
+                "type": "array",
+                "minItems": len(unit_ids),
+                "maxItems": len(unit_ids),
+                "items": {
+                    "type": "object",
+                    "properties": {
+                        "unit_id": {"type": "string", "enum": unit_ids},
+                        "status": {"type": "string", "enum": STATUSES},
+                        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
+                        "evidence": {"type": "string", "maxLength": 180},
+                    },
+                    "required": ["unit_id", "status", "confidence", "evidence"],
+                    "additionalProperties": False,
+                },
+            },
+        },
+        "required": ["caption_id", "unit_results"],
+        "additionalProperties": False,
+    }
+def validate(parsed: Any, row: dict[str, Any]) -> str | None:
+    if not isinstance(parsed, dict):
+        return "top-level response is not an object"
+    if not isinstance(parsed.get("caption_id"), str):
+        return "caption_id is not a string"
+    results = parsed.get("unit_results")
+    if not isinstance(results, list):
+        return "unit_results is not an array"
+    expected = [unit["unit_id"] for unit in row.get("claimed_units", [])]
+    seen = []
+    for index, result in enumerate(results):
+        if not isinstance(result, dict):
+            return f"unit_results[{index}] is not an object"
+        unit_id = result.get("unit_id")
+        if not isinstance(unit_id, str):
+            return f"unit_results[{index}].unit_id is not a string"
+        seen.append(unit_id)
+        if result.get("status") not in set(STATUSES):
+            return f"unit_results[{index}].status has invalid value"
+        if not isinstance(result.get("confidence"), int | float):
+            return f"unit_results[{index}].confidence is not numeric"
+        if not isinstance(result.get("evidence"), str):
+            return f"unit_results[{index}].evidence is not a string"
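+    # Report duplicates before the set comparison; otherwise a repeated id is
+    # reported as a generic mismatch and the duplicate check never fires.
+    if len(seen) != len(set(seen)):
+        return "duplicate unit_id in response"
+    if sorted(seen) != sorted(expected):
+        return f"unit_id set mismatch: expected={len(expected)} seen={len(seen)}"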
+    return None
+def payload_for(row: dict[str, Any], args: argparse.Namespace) -> dict[str, Any]:
+    unit_ids = [unit["unit_id"] for unit in row.get("claimed_units", [])]
+    payload: dict[str, Any] = {
+        "model": args.model,
+        "max_tokens": args.max_tokens,
+        "temperature": args.temperature,
+        "messages": [
+            {"role": "system", "content": row["system_prompt"]},
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": row["user_prompt"]},
+                    {"type": "image_url", "image_url": {"url": image_url_for(row, args.image_mode)}},
+                ],
+            },
+        ],
+        "chat_template_kwargs": {"enable_thinking": False},
+    }
+    if args.structured_json:
+        payload["structured_outputs"] = {"json": response_schema(unit_ids)}
+    return payload
+async def post_one(session: aiohttp.ClientSession, url: str, row: dict[str, Any], args: argparse.Namespace) -> dict[str, Any]:
+    endpoint = f"{url.rstrip('/')}/v1/chat/completions"
+    start = time.perf_counter()
+    try:
+        async with session.post(endpoint, json=payload_for(row, args), headers={"Authorization": "Bearer sk-fake"}) as response:
+            text = await response.text()
+            elapsed = time.perf_counter() - start
+            if response.status >= 400:
+                return {
+                    "request_id": row["request_id"],
+                    "ok": False,
+                    "status": response.status,
+                    "elapsed_sec": round(elapsed, 4),
+                    "error": text[:4000],
+                    "request": row,
+                }
+            body = json.loads(text)
+            content = body["choices"][0]["message"]["content"]
+            parsed = None
+            parse_error = None
+            schema_error = None
+            try:
+                parsed = json.loads(content)
+                schema_error = validate(parsed, row)
+            except Exception as exc:  # noqa: BLE001
+                parse_error = repr(exc)
+            return {
+                "request_id": row["request_id"],
+                "ok": parse_error is None and schema_error is None,
+                "status": response.status,
+                "elapsed_sec": round(elapsed, 4),
+                "model": args.model,
+                "usage": body.get("usage", {}),
+                "response_text": content,
+                "parsed": parsed,
+                "parse_error": parse_error,
+                "schema_error": schema_error,
+                "request": row,
+            }
+    except Exception as exc:  # noqa: BLE001
+        return {
+            "request_id": row["request_id"],
+            "ok": False,
+            "status": None,
+            "elapsed_sec": round(time.perf_counter() - start, 4),
+            "error": repr(exc),
+            "request": row,
+        }
+async def run(args: argparse.Namespace) -> int:
+    rows = iter_requests(Path(args.input), args.max_requests)
+    urls = [item.strip() for item in args.urls.split(",") if item.strip()]
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    seen_request_ids: set[str] = set()
+    if args.skip_ok_from:
+        with Path(args.skip_ok_from).open("r", encoding="utf-8") as handle:
+            for line in handle:
+                if not line.strip():
+                    continue
+                try:
+                    row = json.loads(line)
+                except json.JSONDecodeError:
+                    continue
+                if not row.get("ok"):
+                    continue
+                request_id = row.get("request_id")
+                if isinstance(request_id, str):
+                    seen_request_ids.add(request_id)
+    if args.resume and output.exists():
+        with output.open("r", encoding="utf-8") as handle:
+            for line in handle:
+                if not line.strip():
+                    continue
+                try:
+                    row = json.loads(line)
+                except json.JSONDecodeError:
+                    continue
+                if args.resume_ok_only and not row.get("ok"):
+                    continue
+                request_id = row.get("request_id")
+                if isinstance(request_id, str):
+                    seen_request_ids.add(request_id)
+    rows = [row for row in rows if row.get("request_id") not in seen_request_ids]
+    timeout = aiohttp.ClientTimeout(total=args.timeout_sec)
+    connector = aiohttp.TCPConnector(limit=args.concurrency)
+    sem = asyncio.Semaphore(args.concurrency)
+    ok = 0
+    total = 0
+    mode = "a" if args.resume else "w"
+    with output.open(mode, encoding="utf-8") as handle:
+        async with aiohttp.ClientSession(timeout=timeout, connector=connector) as session:
+            async def guarded(index: int, row: dict[str, Any]) -> dict[str, Any]:
+                async with sem:
+                    return await post_one(session, urls[index % len(urls)], row, args)
+            tasks = [asyncio.create_task(guarded(index, row)) for index, row in enumerate(rows)]
+            for task in asyncio.as_completed(tasks):
+                result = await task
+                handle.write(json.dumps(result, ensure_ascii=False) + "\n")
+                handle.flush()
+                total += 1
+                ok += int(bool(result.get("ok")))
+                if total % 10 == 0 or total == len(rows):
+                    print(
+                        json.dumps(
+                            {
+                                "completed": total,
+                                "ok": ok,
+                                "total": len(rows),
+                                "skipped_existing": len(seen_request_ids),
+                            },
+                            ensure_ascii=False,
+                        )
+                    )
+    print(json.dumps({"output": str(output), "completed": total, "ok": ok, "skipped_existing": len(seen_request_ids)}, indent=2))
+    return 0
+def main() -> int:
+    return asyncio.run(run(parse_args()))
+if __name__ == "__main__":
+    raise SystemExit(main())
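
Reviewer sketch (not part of the commit): the bounded fan-out used by `run` above, a shared `asyncio.Semaphore` plus round-robin endpoint selection via `index % len(urls)`, can be tried without a server. The name `run_bounded`, the `peak` counter, and the `asyncio.sleep(0)` stand-in for the HTTP round trip are assumptions made purely for illustration.

```python
import asyncio
from typing import Any


async def run_bounded(items: list[Any], urls: list[str], concurrency: int) -> tuple[list[dict[str, Any]], int]:
    """Hypothetical stand-in for run(): same semaphore and round-robin pattern, no network."""
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def guarded(index: int, item: Any) -> dict[str, Any]:
        nonlocal in_flight, peak
        async with sem:  # at most `concurrency` bodies run at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # placeholder for the POST to the endpoint
            in_flight -= 1
            return {"item": item, "url": urls[index % len(urls)]}

    tasks = [asyncio.create_task(guarded(index, item)) for index, item in enumerate(items)]
    results = []
    for task in asyncio.as_completed(tasks):  # results arrive in completion order
        results.append(await task)
    return results, peak


if __name__ == "__main__":
    demo_results, demo_peak = asyncio.run(run_bounded(list(range(5)), ["u0", "u1"], 2))
    print(len(demo_results), demo_peak)
```

Because every task is created up front, memory grows with the number of requests; the semaphore only limits how many are in flight at a time.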

eval_code/scripts/run_text_json_requests.py ADDED

	@@ -0,0 +1,256 @@

+#!/usr/bin/env python3
+"""Run text-only structured JSON requests against OpenAI-compatible endpoints."""
+from __future__ import annotations
+import argparse
+import asyncio
+import json
+import time
+from pathlib import Path
+from typing import Any
+import aiohttp
+from build_caption_cbu_requests import CBU_JSON_SCHEMA, UNIT_CATEGORIES
+def request_schema(row: dict[str, Any]) -> dict[str, Any]:
+    manifest_schema = row.get("schema")
+    if isinstance(manifest_schema, dict):
+        return manifest_schema
+    prompt = row.get("user_prompt", "")
+    if isinstance(prompt, str):
+        marker = "Return only JSON matching this schema:\n"
+        if marker in prompt:
+            rest = prompt.split(marker, 1)[1]
+            schema_text = rest.split("\n\n", 1)[0]
+            try:
+                parsed = json.loads(schema_text)
+                if isinstance(parsed, dict):
+                    return parsed
+            except Exception:  # noqa: BLE001
+                pass
+    return CBU_JSON_SCHEMA
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Run text-only JSON-schema requests")
+    parser.add_argument("--input", required=True)
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--urls", default="http://localhost:8000")
+    parser.add_argument("--model", default="Qwen/Qwen3.5-35B-A3B")
+    parser.add_argument("--max-requests", type=int, default=None)
+    parser.add_argument("--concurrency", type=int, default=8)
+    parser.add_argument("--max-tokens", type=int, default=1024)
+    parser.add_argument("--temperature", type=float, default=0.0)
+    parser.add_argument("--timeout-sec", type=int, default=240)
+    parser.add_argument("--thinking", action="store_true")
+    parser.add_argument("--structured-json", action="store_true")
+    parser.add_argument("--response-format-schema", action="store_true")
+    parser.add_argument("--response-format-json", action="store_true")
+    parser.add_argument("--resume", action="store_true", help="Append to output and skip previously seen request_ids.")
+    parser.add_argument(
+        "--resume-ok-only",
+        action="store_true",
+        help="With --resume, skip only previously successful request_ids so failures are retried.",
+    )
+    parser.add_argument(
+        "--skip-ok-from",
+        default=None,
+        help="JSONL response log whose successful request_ids should be skipped while writing a separate output.",
+    )
+    return parser.parse_args()
+def iter_requests(path: Path, max_requests: int | None) -> list[dict[str, Any]]:
+    rows = []
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            if max_requests is not None and len(rows) >= max_requests:
+                break
+            if line.strip():
+                rows.append(json.loads(line))
+    return rows
+def validate_cbu_response(parsed: Any) -> str | None:
+    if not isinstance(parsed, dict):
+        return "top-level response is not an object"
+    if not isinstance(parsed.get("caption_id"), str):
+        return "caption_id is not a string"
+    claimed = parsed.get("claimed_units")
+    if not isinstance(claimed, list):
+        return "claimed_units is not an array"
+    for index, unit in enumerate(claimed):
+        if not isinstance(unit, dict):
+            return f"claimed_units[{index}] is not an object"
+        extra = sorted(set(unit) - {"category", "unit", "span", "target"})
+        if extra:
+            return f"claimed_units[{index}] has unexpected fields: {extra}"
+        missing = [field for field in ["category", "unit", "span", "target"] if field not in unit]
+        if missing:
+            return f"claimed_units[{index}] is missing fields: {missing}"
+        if unit["category"] not in UNIT_CATEGORIES:
+            return f"claimed_units[{index}].category has invalid value"
+        for field in ["unit", "span", "target"]:
+            if not isinstance(unit[field], str):
+                return f"claimed_units[{index}].{field} is not a string"
+    return None
+def payload_for(row: dict[str, Any], args: argparse.Namespace) -> dict[str, Any]:
+    payload: dict[str, Any] = {
+        "model": args.model,
+        "max_tokens": args.max_tokens,
+        "temperature": args.temperature,
+        "messages": [
+            {"role": "system", "content": row["system_prompt"]},
+            {"role": "user", "content": row["user_prompt"]},
+        ],
+        "chat_template_kwargs": {"enable_thinking": args.thinking},
+    }
+    if args.structured_json:
+        payload["structured_outputs"] = {"json": request_schema(row)}
+    if args.response_format_schema:
+        payload["response_format"] = {
+            "type": "json_schema",
+            "json_schema": {"name": "claimed_cbu", "schema": request_schema(row)},
+        }
+    if args.response_format_json:
+        payload["response_format"] = {"type": "json_object"}
+    return payload
+async def post_one(
+    session: aiohttp.ClientSession,
+    url: str,
+    row: dict[str, Any],
+    args: argparse.Namespace,
+) -> dict[str, Any]:
+    endpoint = f"{url.rstrip('/')}/v1/chat/completions"
+    payload = payload_for(row, args)
+    start = time.perf_counter()
+    try:
+        async with session.post(endpoint, json=payload, headers={"Authorization": "Bearer sk-fake"}) as response:
+            text = await response.text()
+            elapsed = time.perf_counter() - start
+            if response.status >= 400:
+                return {
+                    "request_id": row["request_id"],
+                    "ok": False,
+                    "status": response.status,
+                    "elapsed_sec": round(elapsed, 4),
+                    "error": text[:4000],
+                    "request": row,
+                }
+            body = json.loads(text)
+            content = body["choices"][0]["message"]["content"]
+            parsed = None
+            parse_error = None
+            schema_error = None
+            try:
+                parsed = json.loads(content)
+                schema_error = validate_cbu_response(parsed)
+            except Exception as exc:  # noqa: BLE001
+                parse_error = repr(exc)
+            return {
+                "request_id": row["request_id"],
+                "ok": parse_error is None and schema_error is None,
+                "status": response.status,
+                "elapsed_sec": round(elapsed, 4),
+                "model": args.model,
+                "usage": body.get("usage", {}),
+                "response_text": content,
+                "parsed": parsed,
+                "parse_error": parse_error,
+                "schema_error": schema_error,
+                "request": row,
+            }
+    except Exception as exc:  # noqa: BLE001
+        return {
+            "request_id": row["request_id"],
+            "ok": False,
+            "status": None,
+            "elapsed_sec": round(time.perf_counter() - start, 4),
+            "error": repr(exc),
+            "request": row,
+        }
+async def run(args: argparse.Namespace) -> int:
+    rows = iter_requests(Path(args.input), args.max_requests)
+    urls = [item.strip() for item in args.urls.split(",") if item.strip()]
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    seen_request_ids: set[str] = set()
+    if args.skip_ok_from:
+        with Path(args.skip_ok_from).open("r", encoding="utf-8") as handle:
+            for line in handle:
+                if not line.strip():
+                    continue
+                try:
+                    row = json.loads(line)
+                except json.JSONDecodeError:
+                    continue
+                if not row.get("ok"):
+                    continue
+                request_id = row.get("request_id")
+                if isinstance(request_id, str):
+                    seen_request_ids.add(request_id)
+    if args.resume and output.exists():
+        with output.open("r", encoding="utf-8") as handle:
+            for line in handle:
+                if not line.strip():
+                    continue
+                try:
+                    row = json.loads(line)
+                except json.JSONDecodeError:
+                    continue
+                if args.resume_ok_only and not row.get("ok"):
+                    continue
+                request_id = row.get("request_id")
+                if isinstance(request_id, str):
+                    seen_request_ids.add(request_id)
+    rows = [row for row in rows if row.get("request_id") not in seen_request_ids]
+    timeout = aiohttp.ClientTimeout(total=args.timeout_sec)
+    connector = aiohttp.TCPConnector(limit=args.concurrency)
+    sem = asyncio.Semaphore(args.concurrency)
+    ok = 0
+    total = 0
+    mode = "a" if args.resume else "w"
+    with output.open(mode, encoding="utf-8") as handle:
+        async with aiohttp.ClientSession(timeout=timeout, connector=connector) as session:
+            async def guarded(index: int, row: dict[str, Any]) -> dict[str, Any]:
+                async with sem:
+                    return await post_one(session, urls[index % len(urls)], row, args)
+            tasks = [asyncio.create_task(guarded(index, row)) for index, row in enumerate(rows)]
+            for task in asyncio.as_completed(tasks):
+                result = await task
+                handle.write(json.dumps(result, ensure_ascii=False) + "\n")
+                handle.flush()
+                total += 1
+                ok += int(bool(result.get("ok")))
+                if total % 100 == 0 or total == len(rows):
+                    print(
+                        json.dumps(
+                            {
+                                "completed": total,
+                                "ok": ok,
+                                "total": len(rows),
+                                "skipped_existing": len(seen_request_ids),
+                            },
+                            ensure_ascii=False,
+                        )
+                    )
+    print(json.dumps({"output": str(output), "completed": total, "ok": ok, "skipped_existing": len(seen_request_ids)}, indent=2))
+    return 0
+def main() -> int:
+    return asyncio.run(run(parse_args()))
+if __name__ == "__main__":
+    raise SystemExit(main())
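
Reviewer sketch (not part of the commit): the difference between `--resume` and `--resume --resume-ok-only` comes down to which `request_id`s are collected from the existing log. The helper name `collect_seen_ids` and the sample log lines are hypothetical; the loop body follows the resume loop in `run` above.

```python
import json
from typing import Iterable


def collect_seen_ids(lines: Iterable[str], ok_only: bool) -> set[str]:
    """Hypothetical helper: gather request_ids that a resumed run would skip."""
    seen: set[str] = set()
    for line in lines:
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a truncated last line from an interrupted run
        if ok_only and not row.get("ok"):
            continue  # failed rows stay eligible for retry
        request_id = row.get("request_id")
        if isinstance(request_id, str):
            seen.add(request_id)
    return seen


sample_log = [
    '{"request_id": "a", "ok": true}',
    '{"request_id": "b", "ok": false}',
    '{"request_id": "c", "ok"',  # truncated line, ignored
    "",
]
skip_all_seen = collect_seen_ids(sample_log, ok_only=False)
skip_ok_only = collect_seen_ids(sample_log, ok_only=True)
```

With plain `--resume`, the failed request `b` is skipped as well; adding `--resume-ok-only` leaves it in the queue, so the appended log can hold several rows per `request_id`. That appears to be the case the `--latest-by-request` flag in the summarizers addresses.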

eval_code/scripts/summarize_cbu_responses.py ADDED

	@@ -0,0 +1,296 @@

+#!/usr/bin/env python3
+"""Summarize claimed or grounded CBU response JSONL into table-ready metrics."""
+from __future__ import annotations
+import argparse
+import json
+import re
+import statistics
+from collections import Counter, defaultdict
+from pathlib import Path
+from typing import Any
+UNIT_CATEGORIES = [
+    "object",
+    "attribute",
+    "relation",
+    "style",
+    "camera",
+    "lighting",
+    "count",
+    "text_rendering",
+]
+TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*", re.UNICODE)
+ARTICLE_UNITS = {"a", "an", "the"}
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Summarize CBU extraction/audit responses")
+    parser.add_argument("--input", required=True)
+    parser.add_argument("--output", required=True)
+    parser.add_argument("--mode", choices=["claimed", "grounded"], required=True)
+    parser.add_argument("--latest-by-request", action="store_true")
+    parser.add_argument("--include", action="append", default=[])
+    return parser.parse_args()
+def normalize_unit(text: str) -> str:
+    tokens = TOKEN_RE.findall(text.lower())
+    while tokens and tokens[0] in ARTICLE_UNITS:
+        tokens.pop(0)
+    return " ".join(tokens)
+def normalize_key_part(text: str) -> str:
+    normalized = normalize_unit(text)
+    return normalized or ""
+def caption_token_count(request: dict[str, Any]) -> int:
+    caption = request.get("caption", "")
+    return len(TOKEN_RE.findall(caption)) if isinstance(caption, str) else 0
+def percentile(values: list[float], q: float) -> float | None:
+    if not values:
+        return None
+    index = round((len(values) - 1) * q)  # nearest rank; round() breaks ties to even
+    return sorted(values)[index]
+def trimmed_mean(values: list[float], trim: float = 0.1) -> float | None:
+    if not values:
+        return None
+    ordered = sorted(values)
+    k = int(len(ordered) * trim)
+    trimmed = ordered[k : len(ordered) - k] if len(ordered) - 2 * k > 0 else ordered
+    return statistics.fmean(trimmed)
+def empty_category_counts() -> dict[str, int]:
+    return {category: 0 for category in UNIT_CATEGORIES}
+def unit_records(group: Any) -> list[dict[str, str]]:
+    """Normalize both legacy category arrays and v2 atomic record arrays."""
+    records: list[dict[str, str]] = []
+    if isinstance(group, dict):
+        for category in UNIT_CATEGORIES:
+            items = group.get(category, [])
+            if not isinstance(items, list):
+                continue
+            for item in items:
+                if isinstance(item, str) and item.strip():
+                    records.append({"category": category, "unit": item.strip(), "span": item.strip(), "target": ""})
+        return records
+    if isinstance(group, list):
+        for item in group:
+            if not isinstance(item, dict):
+                continue
+            category = item.get("category")
+            unit = item.get("unit")
+            if category not in UNIT_CATEGORIES or not isinstance(unit, str) or not unit.strip():
+                continue
+            span = item.get("span", "")
+            target = item.get("target", "")
+            records.append(
+                {
+                    "category": category,
+                    "unit": unit.strip(),
+                    "span": span.strip() if isinstance(span, str) else "",
+                    "target": target.strip() if isinstance(target, str) else "",
+                }
+            )
+    return records
+def count_unit_group(group: Any) -> tuple[int, dict[str, int]]:
+    counts = {category: 0 for category in UNIT_CATEGORIES}
+    for record in unit_records(group):
+        counts[record["category"]] += 1
+    return sum(counts.values()), counts
+def count_deduped_unit_group(group: Any) -> tuple[int, dict[str, int], int, int]:
+    counts = empty_category_counts()
+    seen: set[str] = set()
+    duplicate = 0
+    suspicious = 0
+    for record in unit_records(group):
+        norm = normalize_unit(record["unit"])
+        if not norm:
+            continue
+        key = f"{record['category']}|{norm}|{normalize_key_part(record.get('target', ''))}"
+        if key in seen:
+            duplicate += 1
+            continue
+        seen.add(key)
+        category = record["category"]
+        if category == "count" and norm in ARTICLE_UNITS:
+            suspicious += 1
+            continue
+        if category == "text_rendering" and any(marker in norm for marker in ["no text", "no visible", "not visible", "without text"]):
+            suspicious += 1
+            continue
+        counts[category] += 1
+    return sum(counts.values()), counts, duplicate, suspicious
+def add_counts(dst: Counter[str], counts: dict[str, int], prefix: str) -> None:
+    for category, count in counts.items():
+        dst[f"{prefix}_{category}"] += count
+def summarize_claimed_row(parsed: dict[str, Any], request: dict[str, Any]) -> list[tuple[str, Counter[str]]]:
+    surface = request.get("surface", "unknown")
+    total, counts = count_unit_group(parsed.get("claimed_units"))
+    dedup_total, dedup_counts, duplicate, suspicious = count_deduped_unit_group(parsed.get("claimed_units"))
+    tokens = caption_token_count(request)
+    counter: Counter[str] = Counter()
+    counter["captions"] += 1
+    counter["claimed_total"] += total
+    counter["claimed_dedup_total"] += dedup_total
+    counter["duplicate_units"] += duplicate
+    counter["suspicious_units"] += suspicious
+    counter["caption_tokens"] += tokens
+    counter["rows_with_duplicate"] += int(duplicate > 0)
+    counter["rows_with_suspicious"] += int(suspicious > 0)
+    add_counts(counter, counts, "claimed")
+    add_counts(counter, dedup_counts, "claimed_dedup")
+    return [(surface, counter)]
+def summarize_grounded_row(parsed: dict[str, Any], request: dict[str, Any]) -> list[tuple[str, Counter[str]]]:
+    rows = []
+    for result in parsed.get("results", []) if isinstance(parsed, dict) else []:
+        caption_id = result.get("caption_id")
+        surface = None
+        for caption in request.get("captions", []):
+            if caption.get("caption_id") == caption_id:
+                surface = caption.get("surface")
+                break
+        surface = surface or str(caption_id or "unknown")
+        grounded_total, grounded_counts = count_unit_group(result.get("grounded_units"))
+        unsupported_total, unsupported_counts = count_unit_group(result.get("unsupported_units"))
+        uncertain_total, uncertain_counts = count_unit_group(result.get("uncertain_units"))
+        claimed_total = grounded_total + unsupported_total + uncertain_total
+        counter: Counter[str] = Counter()
+        counter["captions"] += 1
+        counter["claimed_total"] += claimed_total
+        counter["grounded_total"] += grounded_total
+        counter["unsupported_total"] += unsupported_total
+        counter["uncertain_total"] += uncertain_total
+        counter[f"overall_{result.get('overall', 'missing')}"] += 1
+        add_counts(counter, grounded_counts, "grounded")
+        add_counts(counter, unsupported_counts, "unsupported")
+        add_counts(counter, uncertain_counts, "uncertain")
+        rows.append((surface, counter))
+    return rows
+def merge(dst: Counter[str], src: Counter[str]) -> None:
+    for key, value in src.items():
+        dst[key] += value
+def finalize(counter: Counter[str]) -> dict[str, Any]:
+    captions = max(counter["captions"], 1)
+    claimed = counter["claimed_total"]
+    output: dict[str, Any] = dict(counter)
+    output["claimed_per_caption"] = claimed / captions
+    output["claimed_dedup_per_caption"] = counter["claimed_dedup_total"] / captions
+    output["claimed_dedup_per_100_tokens"] = (
+        100 * counter["claimed_dedup_total"] / counter["caption_tokens"] if counter["caption_tokens"] else None
+    )
+    output["duplicate_units_per_caption"] = counter["duplicate_units"] / captions
+    output["suspicious_units_per_caption"] = counter["suspicious_units"] / captions
+    output["duplicate_row_rate"] = counter["rows_with_duplicate"] / captions
+    output["suspicious_row_rate"] = counter["rows_with_suspicious"] / captions
+    output["grounded_precision"] = counter["grounded_total"] / claimed if claimed else None
+    output["unsupported_rate"] = counter["unsupported_total"] / claimed if claimed else None
+    output["uncertain_rate"] = counter["uncertain_total"] / claimed if claimed else None
+    for category in UNIT_CATEGORIES:
+        output[f"claimed_{category}_per_caption"] = counter[f"claimed_{category}"] / captions
+        output[f"claimed_dedup_{category}_per_caption"] = counter[f"claimed_dedup_{category}"] / captions
+        denom = counter[f"grounded_{category}"] + counter[f"unsupported_{category}"] + counter[f"uncertain_{category}"]
+        if denom:
+            output[f"grounded_{category}_precision"] = counter[f"grounded_{category}"] / denom
+            output[f"unsupported_{category}_rate"] = counter[f"unsupported_{category}"] / denom
+    return output
+def main() -> int:
+    args = parse_args()
+    by_surface: dict[str, Counter[str]] = defaultdict(Counter)
+    per_surface_values: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
+    status = Counter()
+    input_paths = [Path(args.input), *[Path(item) for item in args.include]]
+    if args.latest_by_request:
+        latest: dict[str, dict[str, Any]] = {}
+        for input_path in input_paths:
+            with input_path.open("r", encoding="utf-8") as handle:
+                for line in handle:
+                    if not line.strip():
+                        continue
+                    row = json.loads(line)
+                    request_id = row.get("request_id")
+                    if isinstance(request_id, str):
+                        latest[request_id] = row
+        rows = list(latest.values())
+    else:
+        rows = []
+        for input_path in input_paths:
+            with input_path.open("r", encoding="utf-8") as handle:
+                rows.extend(json.loads(line) for line in handle if line.strip())
+    for row in rows:
+        status["responses"] += 1
+        if not row.get("ok"):
+            status["bad"] += 1
+            continue
+        parsed = row.get("parsed")
+        request = row.get("request", {})
+        items = (
+            summarize_claimed_row(parsed, request)
+            if args.mode == "claimed"
+            else summarize_grounded_row(parsed, request)
+        )
+        for surface, counter in items:
+            merge(by_surface[surface], counter)
+            merge(by_surface["__all__"], counter)
+            status["captions"] += counter["captions"]
+            if args.mode == "claimed":
+                tokens = max(counter["caption_tokens"], 1)
+                for key_surface in [surface, "__all__"]:
+                    per_surface_values[key_surface]["claimed"].append(float(counter["claimed_total"]))
+                    per_surface_values[key_surface]["claimed_dedup"].append(float(counter["claimed_dedup_total"]))
+                    per_surface_values[key_surface]["claimed_dedup_per_100_tokens"].append(
+                        100.0 * counter["claimed_dedup_total"] / tokens
+                    )
+                    per_surface_values[key_surface]["caption_tokens"].append(float(counter["caption_tokens"]))
+    surfaces = {surface: finalize(counter) for surface, counter in sorted(by_surface.items())}
+    for surface, metrics in per_surface_values.items():
+        if surface not in surfaces:
+            continue
+        for name, values in metrics.items():
+            surfaces[surface][f"{name}_median"] = statistics.median(values) if values else None
+            surfaces[surface][f"{name}_p25"] = percentile(values, 0.25)
+            surfaces[surface][f"{name}_p75"] = percentile(values, 0.75)
+            surfaces[surface][f"{name}_trimmed_mean"] = trimmed_mean(values)
+    payload = {
+        "input": args.input,
+        "mode": args.mode,
+        "status": dict(status),
+        "surfaces": surfaces,
+    }
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output": str(output), **payload["status"]}, indent=2))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
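
Reviewer sketch (not part of the commit): the two robust-statistics helpers above behave in ways worth checking on small inputs. `percentile` picks a nearest-rank element rather than interpolating, and `trimmed_mean` drops `int(len * trim)` values from each end. The functions below are copied from the diff; the sample values are made up for illustration.

```python
from __future__ import annotations

import statistics


def percentile(values: list[float], q: float) -> float | None:
    if not values:
        return None
    index = round((len(values) - 1) * q)  # nearest rank, ties go to the even index
    return sorted(values)[index]


def trimmed_mean(values: list[float], trim: float = 0.1) -> float | None:
    if not values:
        return None
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values dropped from each end
    trimmed = ordered[k : len(ordered) - k] if len(ordered) - 2 * k > 0 else ordered
    return statistics.fmean(trimmed)


four = [10.0, 20.0, 30.0, 40.0]
p50_nearest_rank = percentile(four, 0.5)  # index round(1.5) == 2
p50_interpolated = statistics.median(four)  # mean of the two middle values

with_outlier = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
robust = trimmed_mean(with_outlier)  # drops 1.0 and 100.0
plain = statistics.fmean(with_outlier)
```

Since the summary reports `*_median` via `statistics.median` but `*_p25` and `*_p75` via `percentile`, the quartiles are always observed values while the median of an even-sized sample may not be. With fewer than ten values and the default `trim`, `k` is zero and `trimmed_mean` reduces to the ordinary mean.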

eval_code/scripts/summarize_cbu_vqa_responses.py ADDED

	@@ -0,0 +1,153 @@

+#!/usr/bin/env python3
+"""Summarize CBU VQA response JSONL files."""
+from __future__ import annotations
+import argparse
+import json
+from collections import Counter, defaultdict
+from pathlib import Path
+from typing import Any
+ANSWERS = ["yes", "no", "uncertain"]
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Summarize CBU VQA responses")
+    parser.add_argument("--input", required=True)
+    parser.add_argument("--output", required=True)
+    parser.add_argument(
+        "--include",
+        action="append",
+        default=[],
+        help="Additional response JSONL to merge before latest-by-request summarization.",
+    )
+    parser.add_argument(
+        "--latest-by-request",
+        action="store_true",
+        help="Use only the last response per request_id.",
+    )
+    return parser.parse_args()
+def load_rows(paths: list[Path], latest_by_request: bool) -> list[dict[str, Any]]:
+    if not latest_by_request:
+        rows: list[dict[str, Any]] = []
+        for path in paths:
+            if not path.exists():
+                continue
+            with path.open("r", encoding="utf-8") as handle:
+                rows.extend(json.loads(line) for line in handle if line.strip())
+        return rows
+    latest: dict[str, dict[str, Any]] = {}
+    for path in paths:
+        if not path.exists():
+            continue
+        with path.open("r", encoding="utf-8") as handle:
+            for line in handle:
+                if not line.strip():
+                    continue
+                row = json.loads(line)
+                request_id = row.get("request_id")
+                if isinstance(request_id, str):
+                    latest[request_id] = row
+    return list(latest.values())
+def question_lookup(row: dict[str, Any]) -> dict[str, dict[str, Any]]:
+    request = row.get("request", {})
+    return {
+        question["question_id"]: question
+        for question in request.get("questions", [])
+        if isinstance(question, dict) and isinstance(question.get("question_id"), str)
+    }
+def add_rates(stats: dict[str, Any]) -> dict[str, Any]:
+    total = stats.get("questions", 0)
+    for answer in ANSWERS:
+        stats[f"{answer}_rate"] = stats.get(answer, 0) / total if total else 0.0
+    stats["support_rate"] = stats.get("yes", 0) / total if total else 0.0
+    stats["risk_rate"] = stats.get("no", 0) / total if total else 0.0
+    stats["uncertainty_rate"] = stats.get("uncertain", 0) / total if total else 0.0
+    return stats
+def main() -> int:
+    args = parse_args()
+    paths = [Path(args.input), *[Path(item) for item in args.include]]
+    rows = load_rows(paths, args.latest_by_request)
+    surface_stats: dict[str, Counter[str]] = defaultdict(Counter)
+    category_stats: dict[str, Counter[str]] = defaultdict(Counter)
+    examples: dict[str, list[dict[str, Any]]] = defaultdict(list)
+    responses = 0
+    ok = 0
+    for row in rows:
+        responses += 1
+        request = row.get("request", {})
+        surface = request.get("surface", "__unknown__")
+        surface_stats[surface]["responses"] += 1
+        if not row.get("ok"):
+            surface_stats[surface]["bad"] += 1
+            if len(examples["bad_response"]) < 20:
+                examples["bad_response"].append(
+                    {
+                        "surface": surface,
+                        "caption_id": request.get("caption_id"),
+                        "error": row.get("parse_error") or row.get("schema_error") or row.get("error"),
+                    }
+                )
+            continue
+        ok += 1
+        surface_stats[surface]["ok"] += 1
+        lookup = question_lookup(row)
+        for result in (row.get("parsed") or {}).get("question_results", []):
+            if not isinstance(result, dict):
+                continue
+            question_id = result.get("question_id")
+            answer = result.get("answer")
+            if answer not in ANSWERS:
+                continue
+            question = lookup.get(question_id, {})
+            category = question.get("category", "__unknown__")
+            surface_stats[surface]["questions"] += 1
+            surface_stats[surface][answer] += 1
+            category_stats[category]["questions"] += 1
+            category_stats[category][answer] += 1
+            if answer in {"no", "uncertain"} and len(examples[answer]) < 20:
+                examples[answer].append(
+                    {
+                        "surface": surface,
+                        "caption_id": request.get("caption_id"),
+                        "category": category,
+                        "question": question.get("question"),
+                        "answer": answer,
+                        "confidence": result.get("confidence"),
+                        "evidence": result.get("evidence"),
+                    }
+                )
+    out = {
+        "input": args.input,
+        "include": args.include,
+        "latest_by_request": args.latest_by_request,
+        "responses": responses,
+        "ok": ok,
+        "bad": responses - ok,
+        "surfaces": {surface: add_rates(dict(counter)) for surface, counter in surface_stats.items()},
+        "categories": {category: add_rates(dict(counter)) for category, counter in category_stats.items()},
+        "examples": examples,
+    }
+    output = Path(args.output)
+    output.parent.mkdir(parents=True, exist_ok=True)
+    output.write_text(json.dumps(out, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output": str(output), "responses": responses, "ok": ok, "bad": responses - ok}, indent=2))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())

eval_code/scripts/summarize_grounded_cbu_verify.py ADDED

	@@ -0,0 +1,135 @@

+#!/usr/bin/env python3
+"""Summarize exact-unit grounded-CBU verification responses."""
+from __future__ import annotations
+import argparse
+import json
+from collections import Counter, defaultdict
+from pathlib import Path
+from typing import Any
+STATUSES = [
+    "grounded",
+    "unsupported",
+    "uncertain",
+    "invalid_text_unit",
+    "not_a_visual_claim",
+    "image_unavailable",
+]
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Summarize grounded-CBU verification responses")
+    parser.add_argument("--input", required=True)
+    parser.add_argument("--output", required=True)
+    parser.add_argument(
+        "--include",
+        action="append",
+        default=[],
+        help="Additional response JSONL to merge before latest-by-request summarization.",
+    )
+    parser.add_argument(
+        "--latest-by-request",
+        action="store_true",
+        help="Use only the last response per request_id. Useful for append/resume retry logs.",
+    )
+    return parser.parse_args()
+def unit_lookup(row: dict[str, Any]) -> dict[str, dict[str, Any]]:
+    return {unit["unit_id"]: unit for unit in row.get("claimed_units", []) if isinstance(unit, dict) and "unit_id" in unit}
+def add_rates(stats: dict[str, Any]) -> dict[str, Any]:
+    """Add per-status rates to stats in place; a rate is 0.0 when its denominator is 0."""
+    valid = stats.get("valid_units", 0)  # every unit result that was counted
+    visual = stats.get("visual_units", 0)  # grounded + unsupported + uncertain
+    for status in STATUSES:
+        stats[f"{status}_rate_all"] = stats.get(status, 0) / valid if valid else 0.0
+        stats[f"{status}_rate_visual"] = stats.get(status, 0) / visual if visual else 0.0
+    # Headline aliases: each equals the matching "<status>_rate_visual" value.
+    stats["grounded_precision"] = stats.get("grounded", 0) / visual if visual else 0.0
+    stats["unsupported_rate"] = stats.get("unsupported", 0) / visual if visual else 0.0
+    stats["uncertain_rate"] = stats.get("uncertain", 0) / visual if visual else 0.0
+    return stats
+def main() -> int:
+    args = parse_args()
+    surface_stats: dict[str, Counter[str]] = defaultdict(Counter)
+    category_stats: dict[str, Counter[str]] = defaultdict(Counter)
+    status_examples: dict[str, list[dict[str, Any]]] = defaultdict(list)
+    total = 0
+    ok = 0
+    rows: list[dict[str, Any]] = []
+    input_paths = [Path(args.input), *[Path(item) for item in args.include]]
+    if args.latest_by_request:
+        latest: dict[str, dict[str, Any]] = {}
+        for input_path in input_paths:
+            with input_path.open("r", encoding="utf-8") as handle:
+                for line in handle:
+                    if not line.strip():
+                        continue
+                    row = json.loads(line)
+                    request_id = row.get("request_id")
+                    if isinstance(request_id, str):
+                        latest[request_id] = row
+        rows = list(latest.values())
+    else:
+        rows = []
+        for input_path in input_paths:
+            with input_path.open("r", encoding="utf-8") as handle:
+                rows.extend(json.loads(line) for line in handle if line.strip())
+    for row in rows:
+        total += 1
+        surface = row.get("request", {}).get("surface", "__unknown__")
+        surface_stats[surface]["responses"] += 1
+        if not row.get("ok"):
+            surface_stats[surface]["bad"] += 1
+            continue
+        ok += 1
+        surface_stats[surface]["ok"] += 1
+        lookup = unit_lookup(row.get("request", {}))
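+        for result in row.get("parsed", {}).get("unit_results", []):
+            # Skip malformed entries so one non-dict item cannot crash the summary.
+            if not isinstance(result, dict):
+                continue
+            unit_id = result.get("unit_id")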
+            unit = lookup.get(unit_id, {})
+            category = unit.get("category", "__unknown__")
+            status = result.get("status", "__bad_status__")
+            surface_stats[surface]["valid_units"] += 1
+            surface_stats[surface][status] += 1
+            category_stats[category]["valid_units"] += 1
+            category_stats[category][status] += 1
+            if status in {"grounded", "unsupported", "uncertain"}:
+                surface_stats[surface]["visual_units"] += 1
+                category_stats[category]["visual_units"] += 1
+            if status in {"unsupported", "uncertain", "invalid_text_unit", "not_a_visual_claim"} and len(status_examples[status]) < 20:
+                status_examples[status].append(
+                    {
+                        "surface": surface,
+                        "caption_id": row.get("request", {}).get("caption_id"),
+                        "category": category,
+                        "unit": unit.get("unit"),
+                        "target": unit.get("target"),
+                        "status": status,
+                        "evidence": result.get("evidence"),
+                    }
+                )
+    surfaces = {surface: add_rates(dict(counter)) for surface, counter in surface_stats.items()}
+    categories = {category: add_rates(dict(counter)) for category, counter in category_stats.items()}
+    out = {
+        "input": args.input,
+        "responses": total,
+        "ok": ok,
+        "bad": total - ok,
+        "surfaces": surfaces,
+        "categories": categories,
+        "examples": status_examples,
+    }
+    Path(args.output).parent.mkdir(parents=True, exist_ok=True)
+    Path(args.output).write_text(json.dumps(out, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({"output": args.output, "responses": total, "ok": ok, "bad": total - ok}, indent=2))
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
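
The two core ideas in the script above, last-row-wins merging for retry logs and rates taken over visual units only, can be sketched in isolation. This is a minimal illustration, not the release code: the helper names `latest_by_request` and `visual_rates` and the toy rows are assumptions made for the example.

```python
from collections import Counter
from typing import Any

# Statuses that the summarizer counts as visual units.
VISUAL = {"grounded", "unsupported", "uncertain"}


def latest_by_request(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the last row seen for each string request_id."""
    latest: dict[str, dict[str, Any]] = {}
    for row in rows:
        request_id = row.get("request_id")
        if isinstance(request_id, str):
            latest[request_id] = row  # later rows overwrite earlier ones
    return list(latest.values())


def visual_rates(statuses: list[str]) -> dict[str, float]:
    """Per-status rates with visual units as the denominator."""
    counts = Counter(statuses)
    visual = sum(counts[s] for s in VISUAL)
    return {s: (counts[s] / visual if visual else 0.0) for s in sorted(VISUAL)}


# Toy retry log (hypothetical): request "a" failed first, then succeeded.
log = [
    {"request_id": "a", "ok": False},
    {"request_id": "b", "ok": True},
    {"request_id": "a", "ok": True},
]
kept = latest_by_request(log)

# "not_a_visual_claim" is excluded from the denominator.
rates = visual_rates(["grounded", "grounded", "unsupported", "not_a_visual_claim"])
print(kept)
print(rates)
```

Note that rows without a string `request_id` are dropped in latest-by-request mode, which matches the behavior of the loop in `main()`.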

eval_code/scripts/vllm/serve_gemma4_31b_it.sh ADDED

	@@ -0,0 +1,72 @@

+#!/usr/bin/env bash
+# Launch google/gemma-4-31B-it as a DP=8 (data-parallel) vLLM server for cross-family audits.
+set -euo pipefail
+PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
+VENV_DIR="${VLLM_VENV:-<WORKSPACE_ROOT>/vllm-env}"
+VLLM_BIN="${VENV_DIR}/bin/vllm"
+CONFIG="${VLLM_CONFIG:-${PROJECT_ROOT}/configs/recap/vllm_serve_gemma4_31b_it.yaml}"
+PORT="${VLLM_PORT:-8000}"
+LOG="${VLLM_LOG:-/tmp/vllm_gemma4_31b_it.log}"
+PID_FILE="${VLLM_PID_FILE:-/tmp/vllm_gemma4_31b_it.pid}"
+export TMPDIR="${TMPDIR:-/tmp}"
+export OMP_NUM_THREADS="${OMP_NUM_THREADS:-1}"
+export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
+export VLLM_WORKER_MULTIPROC_METHOD="${VLLM_WORKER_MULTIPROC_METHOD:-spawn}"
+export TRITON_CACHE_DIR="${TRITON_CACHE_DIR:-/tmp/triton-cache}"
+export TORCH_HOME="${TORCH_HOME:-/tmp/torch-home}"
+export TORCH_EXTENSIONS_DIR="${TORCH_EXTENSIONS_DIR:-/tmp/torch-extensions}"
+export TORCHINDUCTOR_CACHE_DIR="${TORCHINDUCTOR_CACHE_DIR:-/tmp/torchinductor-cache}"
+export HF_HOME="${HF_HOME:-<LOCAL_CACHE>/hf}"
+export HF_HUB_CACHE="${HF_HUB_CACHE:-<HF_CACHE>}"
+export TRANSFORMERS_CACHE="${TRANSFORMERS_CACHE:-<LOCAL_CACHE>/transformers}"
+if [[ ! -x "${VLLM_BIN}" ]]; then
+  echo "ERROR: vllm binary not found at ${VLLM_BIN}" >&2
+  exit 1
+fi
+# Report whether the server answers on /v1/models; return 1 if it does not.
+status() {
+  if curl -fsS "http://localhost:${PORT}/v1/models" >/dev/null 2>&1; then
+    echo "vLLM gemma-4-31B-it :${PORT} ready"
+    curl -fsS "http://localhost:${PORT}/v1/models"
+  else
+    echo "vLLM gemma-4-31B-it :${PORT} not ready"
+    return 1
+  fi
+}
+# Stop the recorded PID (TERM, then KILL after 2 s), then sweep stray servers and /dev/shm files.
+stop() {
+  if [[ -f "${PID_FILE}" ]]; then
+    pid="$(cat "${PID_FILE}")"
+    if [[ -n "${pid}" ]] && ps -p "${pid}" -o command= 2>/dev/null | grep -q "vllm serve"; then
+      kill "${pid}" 2>/dev/null || true
+      sleep 2
+      kill -9 "${pid}" 2>/dev/null || true
+    fi
+    rm -f "${PID_FILE}"
+  fi
+  pgrep -f "vllm serve --config ${CONFIG}" 2>/dev/null | xargs -r kill 2>/dev/null || true
+  rm -f /dev/shm/vllm* 2>/dev/null || true
+  echo "stopped vLLM gemma-4-31B-it on :${PORT}"
+}
+# Launch the server detached in its own session and record its PID.
+start() {
+  mkdir -p "$(dirname "${LOG}")" "${TRITON_CACHE_DIR}" "${TORCH_HOME}" "${TORCH_EXTENSIONS_DIR}" "${TORCHINDUCTOR_CACHE_DIR}"
+  echo "starting vLLM gemma-4-31B-it"
+  echo "  config: ${CONFIG}"
+  echo "  log:    ${LOG}"
+  CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}" \
+    setsid "${VLLM_BIN}" serve --config "${CONFIG}" > "${LOG}" 2>&1 < /dev/null &
+  echo "$!" > "${PID_FILE}"
+  echo "  pid: $!"
+}
+case "${1:-start}" in
+  start) start ;;
+  stop) stop ;;
+  restart) stop; sleep 2; start ;;
+  status) status ;;
+  *) echo "usage: $0 {start|stop|restart|status}" >&2; exit 2 ;;
+esac
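
The launcher's `start` returns immediately, so a client that runs right after it may hit a server that is still loading. A readiness wait that polls the same `/v1/models` endpoint as `status()` could look like the sketch below. The function names, retry counts, and delays are assumptions for illustration; only the endpoint path and the default port 8000 come from the script.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_models(port: int, timeout: float = 5.0) -> dict:
    """GET /v1/models, the same endpoint the launcher's status() curls."""
    url = f"http://localhost:{port}/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_ready(
    port: int = 8000,
    attempts: int = 60,
    delay_s: float = 10.0,
    fetch: Callable[[int], dict] = fetch_models,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the endpoint answers; return its JSON, or None on timeout."""
    for attempt in range(attempts):
        try:
            return fetch(port)
        except (urllib.error.URLError, OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not ready yet.
            if attempt + 1 < attempts:
                sleep(delay_s)
    return None


# Example use after `serve_gemma4_31b_it.sh start` (not executed here):
#     models = wait_until_ready(port=8000)
#     if models is None:
#         raise SystemExit("server did not become ready")
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a running server.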

eval_results/ALL_EVAL_RESULTS_INDEX.md ADDED

	@@ -0,0 +1,28 @@

+# Complete Evaluation Results Index
+This directory contains sanitized summary artifacts for all completed recap-evaluation experiments, not only the CC12M frontier table. Raw VLM response JSONL files and source images are excluded; summary JSON/TSV/CSV/Markdown artifacts are included.
+## Result Families
+| Family | Included results | Main files |
+|---|---|---|
+| CPU text metrics | 1M paired lexical/surface metrics, violation-code breakdown, tokenizer truncation, fair-slice manifests | raw_summaries/cpu_text_metrics/ |
+| Prompt support | n-gram prompt-pool JSD/support metrics and bootstrap direction tables | raw_summaries/prompt_support/ |
+| Embedding/Vendi/support | caption Vendi, covariance effective rank, PRDC-style prompt-caption support, dtype sanity | raw_summaries/embedding_vendi_support/ |
+| CBU claimed density | B=64 claimed CBU summaries across CC12M/DataComp/PD12M/LAION-pop/Danbooru and CI tables | raw_summaries/cbu_claimed/ |
+| Grounded CBU legacy | earlier exact-unit grounded audit summaries retained for traceability | raw_summaries/cbu_grounded_legacy/ |
+| Image-conditioned VQA | Qwen-family VQA across DataComp/PD12M/LAION-pop/Danbooru/CC12M plus Gemma cross-family CC12M | raw_summaries/vqa_image_conditioned/ |
+| LongCLIP retrieval | corrected CC12M full-caption and input-64 retrieval separability diagnostics | raw_summaries/longclip_retrieval/ |
+| Plot-ready rollups | curated CSV/PNG files for the paper figures and tables | eval_results/*.csv, eval_results/*.png |
+## Dataset Coverage
+- CC12M: four-caption corrected same-image slice, CBU B-grid, Qwen VQA, Gemma VQA, LongCLIP.
+- DataComp: paired ours/reference CBU@64, Qwen VQA@64, CPU text metrics, prompt-support metrics, embedding/Vendi diagnostics.
+- PD12M: paired ours/reference CBU@64 and Qwen VQA@64; metadata records included in `dataset_release/`.
+- LAION-pop: paired ours/reference CBU@64 and Qwen VQA@64; metadata records included in `dataset_release/`.
+- Danbooru2023: paired ours/reference CBU@64 and Qwen VQA@64; metadata records included in `dataset_release/`.
+## Boundary
+The export is metadata/results only. Source images and raw VLM response streams are intentionally excluded. The included result summaries are enough to reproduce reported tables and inspect completed experiment outcomes.

eval_results/README.md ADDED

	@@ -0,0 +1,14 @@

+# Recap Evaluation Result Index
+Date: 2026-04-27
+This file indexes plot-ready recap-evaluation outputs. All CSV files are derived from existing JSON/TSV artifacts; no model inference is performed here.
+## Plot-ready files
+- `cc12m_budget_frontier_plot.csv`: CBU budget grid for B in {16,32,48,64}; plot x=B or CBU/100tok, y=CBU/cap, color=surface.
+- `cc12m_vqa_supported_risk_pareto.csv`: VQA@64 Pareto data; plot x=unsupported_cap or risk, y=supported_cap, facet=judge.
+- `cc12m_longclip_plot.csv`: LongCLIP full/input64 retrieval diagnostics; plot x=tok_mean or mode, y=I2T/T2I R@1.
+- `all_vqa_b64_summary.csv`: All available VQA@64 summaries across CC12M/DataComp/PD12M/LAION-pop/Danbooru plus Gemma CC12M.
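
The columns of `all_vqa_b64_summary.csv` appear to be related by simple arithmetic: the `*_cap` columns are per-response counts, while `support`, `risk`, and `uncertain` are per-question rates. The sketch below checks that reading against the `cc12m_qwen` / `Ours` row. The relationship is inferred from the numbers, not taken from the export script, and the function name is hypothetical.

```python
def rates_from_per_caption(
    responses: int,
    questions: int,
    supported_cap: float,
    unsupported_cap: float,
) -> dict[str, float]:
    """Convert per-caption counts into per-question rates."""
    supported = supported_cap * responses      # total supported answers
    unsupported = unsupported_cap * responses  # total unsupported answers
    support = supported / questions
    risk = unsupported / questions
    # Assumes every question is supported, unsupported, or uncertain.
    return {"support": support, "risk": risk, "uncertain": 1.0 - support - risk}


# Inputs copied from the cc12m_qwen / Ours row of all_vqa_b64_summary.csv.
ours = rates_from_per_caption(
    responses=4467,
    questions=68019,
    supported_cap=14.59592567718827,
    unsupported_cap=0.46384598164316093,
)
print(ours)
```

The results agree with that row's `support`, `risk`, and `uncertain` columns to within floating-point error, which is why high-yield surfaces can show low risk rates and still carry more unsupported claims per caption.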
+## Current CC12M VQA Pareto interpretation
+- Under both Qwen-family and Gemma-family judges, the Pareto frontier for supported yield vs unsupported cost contains `Ours` and the short-caption `Qwen3-VL-8B` baseline.
+- `Ours` is the high-yield endpoint; `Qwen3-VL-8B` is the low-risk endpoint.
+- `LLaVA-NeXT` and `PixelProse` are dominated in the corrected CC12M VQA@64 frontier because each has lower supported yield and higher unsupported cost than `Ours` under the same judge.
+- The CBU budget frontier similarly separates token efficiency from absolute yield: Qwen3-VL-8B is most efficient per token, while Ours has the highest CBU/cap from B=32 onward.
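
The dominance claims above can be re-derived from the summary values. The following is a small sketch of a two-objective Pareto check (maximize `supported_cap`, minimize `unsupported_cap`); the class and function names are illustrative and the inputs are the VQA@64 values from the summary tables rounded to four decimals. It is not the script that produced the `pareto_supported_cost` column.

```python
from typing import NamedTuple


class Surface(NamedTuple):
    label: str
    supported_cap: float    # higher is better
    unsupported_cap: float  # lower is better


def dominates(a: Surface, b: Surface) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.supported_cap >= b.supported_cap and a.unsupported_cap <= b.unsupported_cap
    better = a.supported_cap > b.supported_cap or a.unsupported_cap < b.unsupported_cap
    return no_worse and better


def pareto_front(surfaces: list[Surface]) -> list[str]:
    """Labels of surfaces that no other surface dominates."""
    return [s.label for s in surfaces if not any(dominates(o, s) for o in surfaces)]


QWEN_JUDGE = [
    Surface("Ours", 14.5959, 0.4638),
    Surface("Qwen3-VL-8B", 6.3122, 0.0878),
    Surface("LLaVA-NeXT", 9.8421, 0.7442),
    Surface("PixelProse", 10.7301, 1.6233),
]
GEMMA_JUDGE = [
    Surface("Ours", 13.8234, 1.0112),
    Surface("Qwen3-VL-8B", 6.1944, 0.1822),
    Surface("LLaVA-NeXT", 9.4390, 1.0442),
    Surface("PixelProse", 10.1964, 2.0220),
]

print(pareto_front(QWEN_JUDGE))
print(pareto_front(GEMMA_JUDGE))
```

Under the Gemma judge the gap between `Ours` and `LLaVA-NeXT` on unsupported cost is small (1.0112 vs 1.0442), so that dominance relation is the one most sensitive to the confidence intervals.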

eval_results/all_cbu_b64_summary.csv ADDED

	@@ -0,0 +1,15 @@

+surface,label,captions,cbu_cap,cbu_100tok,dup_row,object_cap,attribute_cap,relation_cap,style_cap,camera_cap,lighting_cap,count_cap,text_rendering_cap
+cc12m_llavanext_paired__ours_cc12m,CC12M/Ours vs LLaVA-NeXT,4874,15.041649569142388,22.877426199837732,0.0016413623307345096,4.298522773902339,6.580426754205991,2.5888387361510055,0.25851456709068527,0.22404595814526057,0.3459171112022979,0.4708658186294625,0.2745178498153467
+cc12m_llavanext_paired__ref_cc12m_llavanext,CC12M/LLaVA-NeXT,4832,10.586299668874172,20.342157693179512,0.0035182119205298015,3.7547599337748343,3.5384933774834435,1.7119205298013245,0.5428394039735099,0.13927980132450332,0.1490066225165563,0.35513245033112584,0.39486754966887416
+cc12m_pixelprose_paired__ours_cc12m,CC12M/Ours vs PixelProse,4851,14.903318903318903,22.672633589342333,0.00041228612657184083,4.211296639868069,6.548958977530406,2.549577406720264,0.27231498660070086,0.23232323232323232,0.3438466295609153,0.4813440527726242,0.2636569779426922
+cc12m_pixelprose_paired__ref_pixelprose_cc12m,CC12M/PixelProse,4874,11.973943372999589,19.893580033132675,0.005334427574887156,3.8738202708247846,3.8153467377923675,2.540623717685679,0.4185473943373,0.13089864587607714,0.0734509643003693,0.794009027492819,0.32724661469019284
+cc12m_qwen3vl8b_paired__ours_cc12m,CC12M/Ours vs Qwen3-VL-8B,4971,15.103399718366527,22.98503258909574,0.001408167370750352,4.518004425668879,6.555823777911889,2.627439147052907,0.27016696841681753,0.2055924361295514,0.39026352846509754,0.45463689398511364,0.08147254073627037
+cc12m_qwen3vl8b_paired__ref_cc12m_qwen3vl8b,CC12M/Qwen3-VL-8B,4999,6.493098619723945,55.98406319529485,0.000600120024004801,2.995799159831966,1.5659131826365273,1.5679135827165434,0.11502300460092019,0.015403080616123225,0.08621724344868974,0.13522704540908181,0.011602320464092819
+danbooru2023_florence2_paired__ours_danbooru2023,Danbooru/Ours,4995,14.325325325325325,21.97641884649523,0.0,3.8724724724724724,6.966366366366366,2.536136136136136,0.3747747747747748,0.22702702702702704,0.11571571571571572,0.2032032032032032,0.02962962962962963
+danbooru2023_florence2_paired__ref_danbooru_florence2,Danbooru/Florence2,4979,8.184374372363928,19.25657795251777,0.001004217714400482,3.0321349668608155,2.4661578630247036,1.5631652942357903,0.5987146013255674,0.006025306286402892,0.03293834103233581,0.1899979915645712,0.29524000803374173
+datacomp_recap_llava15_paired_url__ours_datacomp_forward,DataComp/Ours,4848,14.451320132013201,21.994788559947256,0.0008250825082508251,3.8803630363036303,6.445957095709571,2.3116749174917492,0.24711221122112212,0.26485148514851486,0.34385313531353134,0.511963696369637,0.44554455445544555
+datacomp_recap_llava15_paired_url__ref_datacomp_recap_llava15_llama3_8b,DataComp/LLaVA1.5-Llama3,5000,10.4414,22.070735254329005,0.0072,3.6,3.5036,1.8032,0.315,0.1404,0.0958,0.3536,0.6298
+laion_pop_llama32_paired__ours_laion_pop,LAION-pop/Ours,4964,14.81809024979855,22.521976356471658,0.0012087026591458502,4.014302981466559,6.941780821917808,2.421232876712329,0.3058017727639001,0.2639000805801773,0.40370668815471394,0.36583400483481066,0.10153102336825141
+laion_pop_llama32_paired__ref_laion_pop_llama32_11b,LAION-pop/Llama3.2-11B,4947,11.909642207398424,18.396730136327587,0.0028299979785728724,3.5611481706084493,4.942793612290277,1.9140893470790379,0.5712553062462098,0.34242975540731757,0.11663634525975339,0.3393976147159895,0.12189205579138872
+pd12m_full_paired__ours_pd12m_img2dataset,PD12M/Ours,4958,15.017144009681322,22.863644180219133,0.0014118596208148447,4.166397741024607,6.718636546994756,2.580072609923356,0.3100040338846309,0.22166196046793063,0.2989108511496571,0.5350947962888262,0.1863654699475595
+pd12m_full_paired__ref_pd12m_full,PD12M/ref full,4992,9.779246794871796,24.892029839026307,0.001201923076923077,4.379607371794871,2.020232371794872,2.371794871794872,0.4238782051282051,0.10616987179487179,0.030448717948717948,0.24739583333333334,0.1997195512820513

eval_results/all_vqa_b64_summary.csv ADDED

	@@ -0,0 +1,17 @@

+source,surface,label,responses,questions,supported_cap,unsupported_cap,support,risk,uncertain
+cc12m_qwen,ours_cc12m,Ours,4467,68019,14.59592567718827,0.46384598164316093,0.9585556976727091,0.03046207677266646,0.010982225554624444
+cc12m_qwen,ref_cc12m_qwen3vl8b,Qwen3-VL-8B,4490,28911,6.3122494432071266,0.08775055679287305,0.9803189097575318,0.01362803085330843,0.006053059389159835
+cc12m_qwen,ref_cc12m_llavanext,LLaVA-NeXT,4453,48092,9.842128901863912,0.7442173815405345,0.911315811361557,0.06890958995259086,0.019774598685852116
+cc12m_qwen,ref_pixelprose_cc12m,PixelProse,4449,56043,10.730051697010564,1.623286131714992,0.8518102171546847,0.12886533554592008,0.019324447299395107
+datacomp_qwen,datacomp_recap_llava15_paired_url__ours_datacomp_forward,Ours DataComp,4648,67724,13.839285714285714,0.5094664371772806,0.9498109975784065,0.03496544799480243,0.015223554426791094
+datacomp_qwen,datacomp_recap_llava15_paired_url__ref_datacomp_recap_llava15_llama3_8b,Ref DataComp LLaVA1.5/Llama3,4995,52194,8.495895895895895,1.8472472472472472,0.8130628041537341,0.17678277196612638,0.01015442388013948
+noncc12m_qwen,danbooru2023_florence2_paired__ours_danbooru2023,Ours Danbooru,4993,71491,12.731824554376127,0.8343681153615061,0.8892028367207061,0.05827306933739911,0.052524093941894785
+noncc12m_qwen,danbooru2023_florence2_paired__ref_danbooru_florence2,Ref Danbooru Florence2,4969,40755,6.379955725498088,1.7832561883678808,0.777867746288799,0.21742117531591215,0.004711078395288922
+noncc12m_qwen,laion_pop_llama32_paired__ours_laion_pop,Ours LAION-pop,4964,73489,14.21676067687349,0.45447219983883963,0.9603069847188015,0.030698471880145326,0.008994543401053219
+noncc12m_qwen,laion_pop_llama32_paired__ref_laion_pop_llama32_11b,Ref LAION-pop Llama3.2,4947,58903,10.79725085910653,0.9189407721851627,0.9068128957778042,0.07717773288287523,0.016009371339320577
+noncc12m_qwen,pd12m_full_paired__ours_pd12m_img2dataset,Ours PD12M,4957,74392,14.289086140810975,0.5045390357070809,0.9521319496720078,0.033619206366275946,0.014248843961716313
+noncc12m_qwen,pd12m_full_paired__ref_pd12m_full,Ref PD12M full,4989,48825,8.605532170775707,1.008819402685909,0.8793241167434716,0.10308243727598566,0.017593445980542754
+cc12m_gemma,ours_cc12m,Ours,4467,68019,13.823371390194762,1.011193194537721,0.9078198738587747,0.06640791543539305,0.025772210705832195
+cc12m_gemma,ref_cc12m_qwen3vl8b,Qwen3-VL-8B,4490,28911,6.194432071269488,0.1821826280623608,0.9620213759468714,0.02829372903047283,0.009684895022655737
+cc12m_gemma,ref_cc12m_llavanext,LLaVA-NeXT,4453,48092,9.439029867505052,1.0442398383112508,0.8739915162605008,0.09668967811694253,0.029318805622556766
+cc12m_gemma,ref_pixelprose_cc12m,PixelProse,4449,56043,10.195999100921556,2.02202742189256,0.8094141998108595,0.16051960102064486,0.03006619916849562

eval_results/cc12m_budget_frontier_plot.csv ADDED

	@@ -0,0 +1,17 @@

+budget,surface,label,valid,bad_json,cbu_per_cap,cbu_per_100tok,dup_row_rate,object_per_cap,attribute_per_cap,relation_per_cap,style_per_cap,camera_per_cap,lighting_per_cap,count_per_cap,text_rendering_per_cap,pareto_efficiency_yield
+16,ours_cc12m,Ours,1000,1,5.758,34.88851187590887,0.0,1.694,2.624,0.806,0.164,0.14,0.111,0.211,0.008,0
+16,ref_cc12m_llavanext,LLaVA-NeXT,999,1,4.842842842842843,29.99194098320005,0.0,1.9289289289289289,1.4964964964964964,0.6506506506506506,0.42342342342342343,0.06706706706706707,0.057057057057057055,0.18618618618618618,0.03303303303303303,0
+16,ref_cc12m_qwen3vl8b,Qwen3-VL-8B,1000,1,6.422,56.39269406392694,0.001,2.984,1.548,1.531,0.106,0.017,0.095,0.133,0.008,1
+16,ref_pixelprose_cc12m,PixelProse,1000,1,4.205,26.021039603960396,0.001,1.814,1.176,0.668,0.189,0.038,0.028,0.28,0.012,0
+32,ours_cc12m,Ours,999,9,9.70870870870871,29.478451157984317,0.0,2.71971971971972,4.43043043043043,1.6076076076076076,0.2122122122122122,0.2132132132132132,0.20520520520520522,0.3053053053053053,0.015015015015015015,1
+32,ref_cc12m_llavanext,LLaVA-NeXT,996,9,7.975903614457831,26.3491326412153,0.001004016064257028,3.0261044176706826,2.568273092369478,1.3052208835341366,0.5040160642570282,0.09337349397590361,0.11546184738955824,0.2791164658634538,0.08433734939759036,0
+32,ref_cc12m_qwen3vl8b,Qwen3-VL-8B,999,9,6.44044044044044,56.423748136455316,0.0,2.984984984984985,1.5425425425425425,1.5375375375375375,0.11011011011011011,0.01701701701701702,0.1001001001001001,0.13913913913913914,0.009009009009009009,1
+32,ref_pixelprose_cc12m,PixelProse,997,9,7.742226680040121,24.0497258225324,0.0010030090270812437,3.00802407221665,2.358074222668004,1.510531594784353,0.2086258776328987,0.05917753259779338,0.05315947843530592,0.4974924774322969,0.04714142427281846,0
+48,ours_cc12m,Ours,996,20,12.66566265060241,25.655365967745215,0.0,3.604417670682731,5.731927710843373,2.1546184738955825,0.22088353413654618,0.2319277108433735,0.29417670682730923,0.3815261044176707,0.04618473895582329,1
+48,ref_cc12m_llavanext,LLaVA-NeXT,992,20,9.808467741935484,23.807776064988133,0.0020161290322580645,3.6985887096774195,3.2056451612903225,1.6330645161290323,0.5453629032258065,0.13205645161290322,0.17237903225806453,0.3165322580645161,0.10483870967741936,0
+48,ref_cc12m_qwen3vl8b,Qwen3-VL-8B,999,20,6.425425425425425,56.29220380601596,0.0,2.991991991991992,1.5345345345345345,1.5315315315315314,0.10810810810810811,0.018018018018018018,0.0970970970970971,0.13613613613613615,0.008008008008008008,1
+48,ref_pixelprose_cc12m,PixelProse,993,20,10.506545820745217,21.99709038773746,0.002014098690835851,3.8298086606243706,3.2255790533736155,2.2678751258811682,0.27794561933534745,0.08257804632426989,0.08761329305135952,0.6576032225579054,0.07754279959718026,0
+64,ours_cc12m,Ours,995,25,15.484422110552764,23.56747330743109,0.0,4.5768844221105525,6.7658291457286435,2.7326633165829146,0.25125628140703515,0.24824120603015076,0.37386934673366834,0.46030150753768845,0.07537688442211055,1
+64,ref_cc12m_llavanext,LLaVA-NeXT,992,25,10.904233870967742,21.646555001901103,0.004032258064516129,4.047379032258065,3.595766129032258,1.8074596774193548,0.6280241935483871,0.13608870967741934,0.23387096774193547,0.3286290322580645,0.12701612903225806,0
+64,ref_cc12m_qwen3vl8b,Qwen3-VL-8B,1000,25,6.453,56.5061295971979,0.001,2.99,1.528,1.572,0.107,0.015,0.095,0.137,0.009,1
+64,ref_pixelprose_cc12m,PixelProse,988,25,12.625506072874494,20.52049746660525,0.004048582995951417,4.394736842105263,3.8856275303643724,2.729757085020243,0.458502024291498,0.11842105263157894,0.12246963562753037,0.791497975708502,0.12449392712550607,0

eval_results/cc12m_cbu_budget_frontier.png ADDED

eval_results/cc12m_cbu_vqa_bootstrap_ci.tsv ADDED

	@@ -0,0 +1,5 @@

+surface	cbu_cap	cbu_cap_ci	support	support_ci	risk	risk_ci	unsupported_cap	unsupported_cap_ci	supported_cap	supported_cap_ci
+Ours	15.2148	[15.1034,15.3244]	0.9591	[0.9572,0.9611]	0.0305	[0.0289,0.0322]	0.4638	[0.4376,0.4907]	14.5959	[14.4822,14.7058]
+Qwen3-VL-8B	6.4371	[6.3892,6.4850]	0.9796	[0.9775,0.9817]	0.0135	[0.0118,0.0152]	0.0878	[0.0773,0.0987]	6.3122	[6.2615,6.3613]
+LLaVA-NeXT	10.7809	[10.6874,10.8762]	0.9107	[0.9073,0.9142]	0.0681	[0.0651,0.0711]	0.7442	[0.7117,0.7777]	9.8421	[9.7471,9.9367]
+PixelProse	12.5706	[12.4826,12.6605]	0.8502	[0.8460,0.8544]	0.1293	[0.1254,0.1333]	1.6233	[1.5727,1.6748]	10.7301	[10.6350,10.8236]

eval_results/cc12m_cbu_yield_efficiency_scatter.png ADDED

eval_results/cc12m_gemma4_vqa_bootstrap_ci.tsv ADDED

	@@ -0,0 +1,5 @@

+surface	claim_cbu	claim_cbu_ci	support	support_ci	risk	risk_ci	unsupported_cap	unsupported_cap_ci	supported_cap	supported_cap_ci
+Ours	15.2270	[15.1208,15.3432]	0.9078	[0.9050,0.9109]	0.0664	[0.0638,0.0689]	1.0112	[0.9716,1.0495]	13.8234	[13.7174,13.9335]
+Qwen3-VL-8B	6.4390	[6.3893,6.4873]	0.9620	[0.9591,0.9648]	0.0283	[0.0259,0.0308]	0.1822	[0.1666,0.1984]	6.1944	[6.1427,6.2468]
+LLaVA-NeXT	10.7999	[10.7070,10.8905]	0.8740	[0.8701,0.8780]	0.0967	[0.0933,0.1002]	1.0442	[1.0069,1.0840]	9.4390	[9.3468,9.5261]
+PixelProse	12.5969	[12.5088,12.6876]	0.8094	[0.8049,0.8137]	0.1605	[0.1566,0.1648]	2.0220	[1.9694,2.0788]	10.1964	[10.1090,10.2890]

eval_results/cc12m_longclip_plot.csv ADDED

	@@ -0,0 +1,9 @@

+mode,surface,rows,trunc_gt_248,tok_mean,tok_p95,pos_mean,pos_ci95,i2t_margin_mean,i2t_margin_ci95,i2t_r1,i2t_r5,t2i_margin_mean,t2i_margin_ci95,t2i_r1,t2i_r5
+full,ours,4494,0.3238,231.67,320.0,0.322264,"[0.321269,0.323159]",0.052692,"[0.051671,0.053724]",0.9079,0.9878,0.051062,"[0.049759,0.052290]",0.9023,0.9855
+full,qwen3vl8b,4494,0.0000,15.41,20.0,0.329667,"[0.328596,0.330746]",0.047944,"[0.046669,0.049288]",0.8611,0.9791,0.048947,"[0.047556,0.050413]",0.8420,0.9706
+full,llavanext,4494,0.0040,81.88,153.0,0.335460,"[0.334478,0.336478]",0.057289,"[0.055980,0.058576]",0.9065,0.9893,0.057517,"[0.056162,0.058858]",0.8994,0.9862
+full,pixelprose,4494,0.0167,108.35,193.0,0.325949,"[0.324966,0.327034]",0.052218,"[0.050927,0.053507]",0.8903,0.9771,0.052438,"[0.051127,0.053663]",0.8741,0.9755
+input64,ours,4494,0.0000,81.76,90.0,0.316327,"[0.315241,0.317304]",0.045948,"[0.044788,0.047165]",0.8587,0.9713,0.045064,"[0.043627,0.046384]",0.8502,0.9651
+input64,qwen3vl8b,4494,0.0000,15.41,20.0,0.329667,"[0.328596,0.330746]",0.047944,"[0.046669,0.049288]",0.8611,0.9791,0.048947,"[0.047556,0.050413]",0.8420,0.9706
+input64,llavanext,4494,0.0000,60.62,83.0,0.334477,"[0.333420,0.335557]",0.055338,"[0.053996,0.056670]",0.8968,0.9849,0.055819,"[0.054487,0.057197]",0.8876,0.9829
+input64,pixelprose,4494,0.0007,73.39,82.3,0.324152,"[0.323114,0.325235]",0.049075,"[0.047727,0.050442]",0.8687,0.9677,0.049699,"[0.048287,0.051025]",0.8563,0.9631

eval_results/cc12m_vqa_supported_risk_pareto.csv ADDED

	@@ -0,0 +1,9 @@

+dataset,judge,surface,label,claim_cbu_cap,claim_cbu_cap_ci_half,supported_cap,supported_cap_ci_half,unsupported_cap,unsupported_cap_ci_half,support,support_ci_half,risk,risk_ci_half,pareto_supported_cost
+cc12m,Qwen3.5-397B-A17B-FP8,ours_cc12m,Ours,15.21476510067114,0.1114093959731548,14.59592567718827,0.11373404969778456,0.46384598164316093,0.026863666890530602,0.9591174271295853,0.0019560148290445056,0.03052671847444831,0.0016744650009232857,1
+cc12m,Qwen3.5-397B-A17B-FP8,ref_cc12m_qwen3vl8b,Qwen3-VL-8B,6.437096415052327,0.04787909151636516,6.3122494432071266,0.05078507795100151,0.08775055679287305,0.010913140311804001,0.979630050103324,0.002106863054663677,0.013491260039144228,0.0017132916421613506,1
+cc12m,Qwen3.5-397B-A17B-FP8,ref_cc12m_llavanext,LLaVA-NeXT,10.780892576810944,0.09532967032967044,9.842128901863912,0.09499775432292878,0.7442173815405345,0.03346058836739274,0.9107345292570356,0.0034450716214012855,0.06810172039716729,0.003039160276583866,0
+cc12m,Qwen3.5-397B-A17B-FP8,ref_pixelprose_cc12m,PixelProse,12.570563159075611,0.08997644155261497,10.730051697010564,0.09507754551584569,1.623286131714992,0.051472240953023274,0.8502215852149048,0.004186420482200637,0.12925547153139433,0.004071021279282744,0
+cc12m,Gemma-4-31B-it,ours_cc12m,Ours,15.226997985224983,0.11619812810340768,13.823371390194762,0.11008591877938656,1.011193194537721,0.03964272454927065,0.9078198738587747,0.003033190165912525,0.06640791543539305,0.002597111181184858,1
+cc12m,Gemma-4-31B-it,ref_cc12m_qwen3vl8b,Qwen3-VL-8B,6.438975501113585,0.04967149220489908,6.194432071269488,0.05235385643843937,0.1821826280623608,0.016260562688326985,0.9620213759468714,0.0028781310644838687,0.02829372903047283,0.0025388759897655086,1
+cc12m,Gemma-4-31B-it,ref_cc12m_llavanext,LLaVA-NeXT,10.799910172917135,0.09292042436137748,9.439029867505052,0.09223563302542459,1.0442398383112508,0.03980323956484666,0.8739915162605008,0.003965850263161097,0.09668967811694253,0.003529553703450483,0
+cc12m,Gemma-4-31B-it,ref_pixelprose_cc12m,PixelProse,12.596850393700787,0.09077978696827138,10.196400449943757,0.09259006956107108,2.0220472440944883,0.05678535826217557,0.8094404657725073,0.004546951426257162,0.16052006500812602,0.004311023541536035,0

eval_results/cc12m_vqa_supported_risk_pareto.png ADDED

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/README.md ADDED

	@@ -0,0 +1,106 @@

+# DataComp naive-Qwen35 policy ablation
+Date: 2026-05-02
+This package is the DataComp-side same-captioner policy ablation:
+- `ours_datacomp_forward`: Qwen3.5 captioner with the grounded recap policy.
+- `naive_qwen35_datacomp`: Qwen3.5 captioner with a single naive captioning prompt.
+This is not a three-way attribution against the DataComp LLaVA-1.5 + LLaMA-3 reference captioner. The clean comparison here is `ours_datacomp_forward` vs `naive_qwen35_datacomp` on the same DataComp image surface.
+## Caption generation
+- Model: `Qwen/Qwen3.5-35B-A3B-FP8`
+- Prompt: `Please generate a detailed caption of this image. Please be as descriptive as possible.`
+- System prompt: none
+- Message policy: single user message with image, no system prompt
+- Image mode: local file first pass, resized data-URI retry for over-context images
+- Materialized requests: 4,775 / 4,997
+- Caption responses: 4,775 unique requests, bad 0
+- Caption token mean / median: 296.46 / 296
+Primary surface:
+- `naive_qwen35_datacomp.jsonl`
+- `naive_qwen35_caption.summary.json`
+## Judge settings
+All judge runs used the local runners' default sampling settings (temperature 0.0):
+| Stage | Runner | Model | Temperature | Notes |
+|---|---|---|---:|---|
+| Claim extraction | `run_text_json_requests.py` | `google/gemma-4-31B-it` | 0.0 | default `--temperature`; not passed explicitly |
+| Grounded CBU verify | `run_grounded_cbu_verify_requests.py` | `google/gemma-4-31B-it` | 0.0 | default `--temperature`; not passed explicitly |
+| CBU VQA | `run_cbu_vqa_requests.py` | `google/gemma-4-31B-it` | 0.0 | default `--temperature`; not passed explicitly |
+The response JSONL rows do not currently persist sampling parameters. Traceability for this run is through the runner defaults plus the command log and this README.
+## Gemma judge results
+CBU extraction and grounded audit:
+| Metric | Value |
+|---|---:|
+| Captions | 4,775 |
+| Claimed CBU / caption | 12.2119 [12.1177, 12.3129] |
+| Visual units | 58,260 |
+| Grounded units / caption | 11.7801 [11.6785, 11.8796] |
+| Grounded precision | 0.9655 [0.9638, 0.9674] |
+| Unsupported rate | 0.0104 [0.0093, 0.0116] |
+| Uncertain rate | 0.0241 [0.0227, 0.0255] |
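
The bracketed intervals in the table above are reported as confidence intervals, but the resampling procedure is not described in this README. One common way to obtain such an interval for a per-caption mean is a percentile bootstrap over captions, sketched below. The choice of resampling unit, the number of resamples, the 95% level, and the synthetic input are all assumptions for illustration and may differ from what the release scripts do.

```python
import random
import statistics


def bootstrap_mean_ci(
    values: list[float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over items."""
    rng = random.Random(seed)
    n = len(values)
    # Resample n items with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return statistics.fmean(values), lower, upper


# Synthetic per-caption unit counts (not taken from the release data).
per_caption_units = [10, 12, 11, 14, 13, 12, 9, 15, 12, 11] * 20
mean, lower, upper = bootstrap_mean_ci(per_caption_units)
print(f"{mean:.4f} [{lower:.4f}, {upper:.4f}]")
```

Resampling whole captions rather than individual units keeps units from the same caption together, which matters when per-unit outcomes within a caption are correlated.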
+CBU VQA:
+| Surface | Resp | OK | Q | Support | Risk | Uncertain |
+|---|---:|---:|---:|---:|---:|---:|
+| `naive_qwen35_datacomp` | 4,775 | 4,775 | 58,335 | 0.9307 | 0.0403 | 0.0290 |
+Compared to the existing Gemma DataComp forward table:
+| Surface | Grounded precision | Unsupported | CBU-VQA support | CBU-VQA risk |
+|---|---:|---:|---:|---:|
+| `ours_datacomp_forward` | 0.9457 | 0.0246 | 0.8886 | 0.0840 |
+| `naive_qwen35_datacomp` | 0.9655 | 0.0104 | 0.9307 | 0.0403 |
+The naive surface has fewer claimed visual units than ours, so its higher precision/support should be read together with coverage: 58,260 visual units over 4,775 captions for naive vs 70,894 visual units over 4,975 captions for ours in the existing DataComp Gemma table.
+## CPU text diagnostics
+CPU lexical diagnostics were run on the exact 4,775-image intersection between `ours_datacomp_forward` and `naive_qwen35_datacomp`.
+| Surface | Captions | Mean Lex | P95 Lex | Cov64 | Cov128 | Cov248 | Cov320 | D2 | D3 | M3 Top100 | Prefix Top100 | Rep4 | Viol/64 | Newline | Bullet | Top Opening |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---|
+| `ours_datacomp_forward` | 4,775 | 175.08 | 249.0 | 1.0000 | 0.9125 | 0.0515 | 0.0098 | 0.286004 | 0.606353 | 0.0411 | 0.0821 | 0.1969 | 0.020223 | 0.3818 | 0.0010 | close up view |
+| `naive_qwen35_datacomp` | 4,775 | 298.02 | 392.0 | 0.9979 | 0.9927 | 0.7786 | 0.3929 | 0.262878 | 0.579717 | 0.0443 | 0.6921 | 0.2972 | 0.114788 | 0.9799 | 0.3749 | is a close |
+Interpretation: naive-Qwen35 is more verbose and scores better on the judged metrics (higher support, lower risk), but it is much more template-concentrated and format-heavy. Its high `Prefix Top100`, newline rate, bullet rate, and repetition rate should be reported as a caveat against treating the naive improvement as purely semantic quality.
+The reference LLaVA/Recap-DataComp surface is not included in this CPU table: its available request slices use a different URL-paired surface, matching by `source_row` is invalid, and URL overlap with the naive materialized slice is only a small residual subset.
+## Tables
+Paper-facing tables are under `gemma4_metric_tables/`:
+- `claimed_cbu_ci.tsv`
+- `grounded_cbu_ci.tsv`
+- `grounded_cbu_category_ci.tsv`
+- `cbu_bootstrap_summary.json`
+- `cbu_vqa_gemma4_table.md`
+- `cbu_vqa_gemma4_table.tex`
+- `../cpu_text_metrics/cpu_text_comparison.md`
+- `../cpu_text_metrics/cpu_text_comparison.tsv`
+- `../cpu_text_metrics/cpu_text_summary.json`
+## Portable image package
+Reusable E&D image package:
+- Directory: `image_packages/datacomp_naive_qwen35_policy_ablation/`
+- Tarball: `image_packages/datacomp_naive_qwen35_policy_ablation.tar.gz`
+- Images: 4,775
+- Missing rows: 0
+- Packaged requests: grounded CBU and CBU VQA, rewritten to package-relative `image_path`
+Verified with `gzip -t`.
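
One way to read the coverage caveat in the README above is to normalize visual units by caption count, since the two Gemma runs cover different numbers of captions. The sketch below uses only aggregate counts reported in this commit; the helper name `units_per_caption` is illustrative and is not part of the release scripts.

```python
# Aggregate counts copied from the Gemma grounded-CBU summaries in this commit.
SURFACES = {
    "naive_qwen35_datacomp": {"visual_units": 58_260, "captions": 4_775},
    "ours_datacomp_forward": {"visual_units": 70_894, "captions": 4_975},
}


def units_per_caption(visual_units: int, captions: int) -> float:
    """Mean number of judged visual units per caption (hypothetical helper)."""
    return visual_units / captions


per_caption = {
    name: units_per_caption(**counts) for name, counts in SURFACES.items()
}

for name, value in per_caption.items():
    print(f"{name}: {value:.2f} visual units per caption")
```

On these counts the naive surface carries roughly 12.2 judged units per caption against roughly 14.25 for ours, so the precision gap and the coverage gap point in opposite directions.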

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/cpu_text_metrics/cpu_text_comparison.md ADDED

	@@ -0,0 +1,4 @@

+| Surface | Captions | Mean Lex | P95 Lex | Cov64 | Cov128 | Cov248 | Cov320 | D2 | D3 | M3 Top100 | Prefix Top100 | Rep4 | Viol/64 | Newline | Bullet | Top Opening |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---|
+| `ours_datacomp_forward` | 4,775 | 175.08 | 249.0 | 1.0000 | 0.9125 | 0.0515 | 0.0098 | 0.286004 | 0.606353 | 0.0411 | 0.0821 | 0.1969 | 0.020223 | 0.3818 | 0.0010 | close up view |
+| `naive_qwen35_datacomp` | 4,775 | 298.02 | 392.0 | 0.9979 | 0.9927 | 0.7786 | 0.3929 | 0.262878 | 0.579717 | 0.0443 | 0.6921 | 0.2972 | 0.114788 | 0.9799 | 0.3749 | is a close |

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/cpu_text_metrics/cpu_text_comparison.tsv ADDED

	@@ -0,0 +1,3 @@

+surface	records	avg_tokens	p95_tokens	cov64	cov128	cov248	cov320	distinct2_full	distinct3_full	m3_top100_full	prefix_raw_top100_full	rep4_full	within_d3_full	within_d4_full	violation_rate_full	viol_ind_per64_full	control_hits_per64_full	format_newline_rate	format_bullet_rate	format_numbered_list_rate	top_opening_1	top_opening_1_count
+ours_datacomp_forward	4775	175.081047	249.0	1.0	0.9125	0.0515	0.0098	0.286004	0.606353	0.041105	0.0821	0.1969	0.991174	0.995628	0.0595	0.020223	8.2583	0.3818	0.001	0.001	close up view	77
+naive_qwen35_datacomp	4775	298.016126	392.0	0.9979	0.9927	0.7786	0.3929	0.262878	0.579717	0.04432	0.6921	0.2972	0.989423	0.997198	0.5504	0.114788	6.2184	0.9799	0.3749	0.067	is a close	726
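
The format-heaviness of the naive surface is easier to see as ratios between the two rows. The following is a minimal sketch that parses a subset of the columns above with the standard `csv` module; the `ratio` helper is an illustrative name, and the embedded values are copied from `cpu_text_comparison.tsv`.

```python
import csv
import io

# Subset of columns copied from cpu_text_comparison.tsv in this commit.
TSV = (
    "surface\tavg_tokens\tprefix_raw_top100_full\tformat_newline_rate\n"
    "ours_datacomp_forward\t175.081047\t0.0821\t0.3818\n"
    "naive_qwen35_datacomp\t298.016126\t0.6921\t0.9799\n"
)

rows = {r["surface"]: r for r in csv.DictReader(io.StringIO(TSV), delimiter="\t")}


def ratio(column: str, numerator: str, denominator: str) -> float:
    """Ratio of one surface's value to another's for a numeric column."""
    return float(rows[numerator][column]) / float(rows[denominator][column])


for column in ("avg_tokens", "prefix_raw_top100_full", "format_newline_rate"):
    value = ratio(column, "naive_qwen35_datacomp", "ours_datacomp_forward")
    print(f"{column}: naive / ours = {value:.2f}")
```

Read this way, the naive captions are about 1.7 times longer, while their top-100 prefix concentration is more than 8 times higher.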

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/cpu_text_metrics/cpu_text_summary.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "tokenizer": "regex lexical units: [^\\W_]+(?:'[^\\W_]+)*",
+  "matched_rows": 4775,
+  "surfaces": {
+    "ours_datacomp_forward": {
+      "surface": "ours_datacomp_forward",
+      "records": 4775,
+      "avg_tokens": 175.08104712041884,
+      "p95_tokens": 249.0,
+      "cov64": 1.0,
+      "cov128": 0.9125,
+      "cov248": 0.0515,
+      "cov320": 0.0098,
+      "distinct2_full": 0.286004,
+      "distinct3_full": 0.606353,
+      "m3_top100_full": 0.041105,
+      "prefix_raw_top100_full": 0.0821,
+      "rep4_full": 0.1969,
+      "within_d3_full": 0.991174,
+      "within_d4_full": 0.995628,
+      "violation_rate_full": 0.0595,
+      "viol_ind_per64_full": 0.020223,
+      "control_hits_per64_full": 8.2583,
+      "format_newline_rate": 0.3818,
+      "format_bullet_rate": 0.001,
+      "format_numbered_list_rate": 0.001,
+      "top_opening_1": "close up view",
+      "top_opening_1_count": 77
+    },
+    "naive_qwen35_datacomp": {
+      "surface": "naive_qwen35_datacomp",
+      "records": 4775,
+      "avg_tokens": 298.01612565445026,
+      "p95_tokens": 392.0,
+      "cov64": 0.9979,
+      "cov128": 0.9927,
+      "cov248": 0.7786,
+      "cov320": 0.3929,
+      "distinct2_full": 0.262878,
+      "distinct3_full": 0.579717,
+      "m3_top100_full": 0.04432,
+      "prefix_raw_top100_full": 0.6921,
+      "rep4_full": 0.2972,
+      "within_d3_full": 0.989423,
+      "within_d4_full": 0.997198,
+      "violation_rate_full": 0.5504,
+      "viol_ind_per64_full": 0.114788,
+      "control_hits_per64_full": 6.2184,
+      "format_newline_rate": 0.9799,
+      "format_bullet_rate": 0.3749,
+      "format_numbered_list_rate": 0.067,
+      "top_opening_1": "is a close",
+      "top_opening_1_count": 726
+    }
+  }
+}
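
The `tokenizer` field above defines what a "token" means in these CPU metrics. As a small sketch, the same pattern can be applied with Python's `re` module; the function name `lexical_units` and the sample sentence are illustrative, and whether the release script lowercases or otherwise normalizes text before matching is not stated here.

```python
import re

# Pattern copied from the "tokenizer" field of cpu_text_summary.json.
LEXICAL_UNIT = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def lexical_units(text: str) -> list:
    """Split text into letter/digit runs, keeping internal apostrophes."""
    return LEXICAL_UNIT.findall(text)


caption = "It's a close-up view of a red_brick wall."
units = lexical_units(caption)
print(len(units), units)
```

Hyphens and underscores split units while an internal apostrophe does not, so these lexical counts are not directly comparable to the model-tokenizer counts reported as "Caption token mean / median" in the README.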

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/cbu_bootstrap_summary.json ADDED

	@@ -0,0 +1,238 @@

+{
+  "bootstrap_reps": 2000,
+  "seed": 0,
+  "claimed": {
+    "naive_qwen35_datacomp": {
+      "input": "artifacts/cbu/datacomp-naive-qwen35-baseline-2026-05-02/claimed_cbu_v2_naive_qwen35_datacomp_b64.responses.gemma4_31b_it_c512_mt1024.jsonl",
+      "captions": 4775,
+      "dedup_units_per_caption": {
+        "mean": 12.211937172774869,
+        "ci95_low": 12.117685863874346,
+        "ci95_high": 12.312884816753927
+      },
+      "dedup_units_per_100_tokens": {
+        "mean": 18.625390477772314,
+        "ci95_low": 18.485901003129428,
+        "ci95_high": 18.776241307080532
+      },
+      "duplicate_units_per_caption": {
+        "mean": 0.0035602094240837698,
+        "ci95_low": 0.0018848167539267015,
+        "ci95_high": 0.005235602094240838
+      },
+      "object_per_caption": {
+        "mean": 2.9212565445026177,
+        "ci95_low": 2.8699267015706806,
+        "ci95_high": 2.974874345549738
+      },
+      "attribute_per_caption": {
+        "mean": 5.352670157068063,
+        "ci95_low": 5.280413612565445,
+        "ci95_high": 5.423680628272251
+      },
+      "relation_per_caption": {
+        "mean": 1.381151832460733,
+        "ci95_low": 1.3505759162303665,
+        "ci95_high": 1.4121465968586386
+      },
+      "style_per_caption": {
+        "mean": 1.0284816753926702,
+        "ci95_low": 1.0054450261780106,
+        "ci95_high": 1.0517329842931937
+      },
+      "camera_per_caption": {
+        "mean": 0.660523560209424,
+        "ci95_low": 0.6376963350785341,
+        "ci95_high": 0.683565445026178
+      },
+      "lighting_per_caption": {
+        "mean": 0.2393717277486911,
+        "ci95_low": 0.22450261780104713,
+        "ci95_high": 0.2550837696335078
+      },
+      "count_per_caption": {
+        "mean": 0.3392670157068063,
+        "ci95_low": 0.3231413612565445,
+        "ci95_high": 0.35539267015706805
+      },
+      "text_rendering_per_caption": {
+        "mean": 0.28921465968586385,
+        "ci95_low": 0.27036649214659686,
+        "ci95_high": 0.3082827225130889
+      }
+    }
+  },
+  "grounded": {
+    "naive_qwen35_datacomp": {
+      "input": "artifacts/grounded-cbu/datacomp-naive-qwen35-baseline-2026-05-02/grounded_verify_v2_naive_qwen35_datacomp_b64.responses.gemma4_31b_c512_local_file_mt2048.jsonl",
+      "captions": 4775,
+      "visual_units": 58260,
+      "grounded_units_per_caption": {
+        "mean": 11.780104712041885,
+        "ci95_low": 11.678528795811518,
+        "ci95_high": 11.879596858638743
+      },
+      "grounded_precision": {
+        "mean": 0.9654994850669413,
+        "ci95_low": 0.9637811360068513,
+        "ci95_high": 0.9674050979253277
+      },
+      "unsupported_rate": {
+        "mean": 0.010384483350497768,
+        "ci95_low": 0.009253022043574697,
+        "ci95_high": 0.011574997848566833
+      },
+      "uncertain_rate": {
+        "mean": 0.024116031582560933,
+        "ci95_low": 0.0227306797964753,
+        "ci95_high": 0.025492158685210573
+      },
+      "categories": {
+        "object": {
+          "visual_units": 13943,
+          "grounded_precision": {
+            "mean": 0.9749695187549308,
+            "ci95_low": 0.9722321157964497,
+            "ci95_high": 0.9777625675102546
+          },
+          "unsupported_rate": {
+            "mean": 0.00817614573621172,
+            "ci95_low": 0.006573729389796407,
+            "ci95_high": 0.009942820581087936
+          },
+          "uncertain_rate": {
+            "mean": 0.016854335508857492,
+            "ci95_low": 0.014621329422784338,
+            "ci95_high": 0.019047263158799014
+          }
+        },
+        "attribute": {
+          "visual_units": 25508,
+          "grounded_precision": {
+            "mean": 0.9510741728085307,
+            "ci95_low": 0.9483290055013888,
+            "ci95_high": 0.9538571731863305
+          },
+          "unsupported_rate": {
+            "mean": 0.01003606711619884,
+            "ci95_low": 0.008637039503512835,
+            "ci95_high": 0.011506914261990438
+          },
+          "uncertain_rate": {
+            "mean": 0.0388897600752705,
+            "ci95_low": 0.03648831232635523,
+            "ci95_high": 0.041265412517918716
+          }
+        },
+        "relation": {
+          "visual_units": 6595,
+          "grounded_precision": {
+            "mean": 0.979226686884003,
+            "ci95_low": 0.9756088423314924,
+            "ci95_high": 0.9829912549589104
+          },
+          "unsupported_rate": {
+            "mean": 0.012736921910538287,
+            "ci95_low": 0.009840657602093744,
+            "ci95_high": 0.015594546249207836
+          },
+          "uncertain_rate": {
+            "mean": 0.008036391205458682,
+            "ci95_low": 0.005879463614752104,
+            "ci95_high": 0.010196506005969915
+          }
+        },
+        "style": {
+          "visual_units": 4903,
+          "grounded_precision": {
+            "mean": 0.9946971242096676,
+            "ci95_low": 0.992361424443466,
+            "ci95_high": 0.9967565868866474
+          },
+          "unsupported_rate": {
+            "mean": 0.0012237405669997961,
+            "ci95_low": 0.0004036204900748543,
+            "ci95_high": 0.0022689886006859402
+          },
+          "uncertain_rate": {
+            "mean": 0.004079135223332654,
+            "ci95_low": 0.0022297001685589526,
+            "ci95_high": 0.006222442286005553
+          }
+        },
+        "camera": {
+          "visual_units": 3153,
+          "grounded_precision": {
+            "mean": 0.987313669521091,
+            "ci95_low": 0.9833699876449492,
+            "ci95_high": 0.9911234711442108
+          },
+          "unsupported_rate": {
+            "mean": 0.009514747859181731,
+            "ci95_low": 0.006189461424091486,
+            "ci95_high": 0.012894916409768217
+          },
+          "uncertain_rate": {
+            "mean": 0.003171582619727244,
+            "ci95_low": 0.0012767112046430675,
+            "ci95_high": 0.0051597925278928725
+          }
+        },
+        "lighting": {
+          "visual_units": 1141,
+          "grounded_precision": {
+            "mean": 0.9877300613496932,
+            "ci95_low": 0.9809899256700093,
+            "ci95_high": 0.9941029042176894
+          },
+          "unsupported_rate": {
+            "mean": 0.0043821209465381246,
+            "ci95_low": 0.0008849166640607899,
+            "ci95_high": 0.008410605919016347
+          },
+          "uncertain_rate": {
+            "mean": 0.007887817703768623,
+            "ci95_low": 0.002643113637773779,
+            "ci95_high": 0.013688086732727058
+          }
+        },
+        "count": {
+          "visual_units": 1622,
+          "grounded_precision": {
+            "mean": 0.9340320591861899,
+            "ci95_low": 0.921660662579409,
+            "ci95_high": 0.946229913473424
+          },
+          "unsupported_rate": {
+            "mean": 0.04315659679408138,
+            "ci95_low": 0.03341480895617374,
+            "ci95_high": 0.05351880954899317
+          },
+          "uncertain_rate": {
+            "mean": 0.02281134401972873,
+            "ci95_low": 0.015662198229714857,
+            "ci95_high": 0.030853274629741658
+          }
+        },
+        "text_rendering": {
+          "visual_units": 1395,
+          "grounded_precision": {
+            "mean": 0.9362007168458781,
+            "ci95_low": 0.9220679765830679,
+            "ci95_high": 0.9496207804424454
+          },
+          "unsupported_rate": {
+            "mean": 0.02867383512544803,
+            "ci95_low": 0.019870493081955463,
+            "ci95_high": 0.038360938578329874
+          },
+          "uncertain_rate": {
+            "mean": 0.03512544802867384,
+            "ci95_low": 0.025244999513490084,
+            "ci95_high": 0.04539113137815038
+          }
+        }
+      }
+    }
+  }
+}
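
The summary above records `bootstrap_reps: 2000` and `seed: 0` but not the resampling scheme. The sketch below shows one common choice, a percentile bootstrap that resamples captions and recomputes a pooled ratio such as grounded precision. It is an assumption about the method, not the release implementation; `bootstrap_ratio_ci` is a hypothetical name and the toy counts are invented for illustration.

```python
import random


def bootstrap_ratio_ci(per_caption, reps=2000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for sum(grounded) / sum(total), resampling captions.

    `per_caption` is a list of (grounded_units, total_units) pairs, one per caption.
    """
    rng = random.Random(seed)
    n = len(per_caption)
    estimates = []
    for _ in range(reps):
        # Draw n captions with replacement and pool their unit counts.
        sample = [per_caption[rng.randrange(n)] for _ in range(n)]
        grounded = sum(g for g, _ in sample)
        total = sum(t for _, t in sample)
        estimates.append(grounded / total)
    estimates.sort()
    low = estimates[int((alpha / 2) * reps)]
    high = estimates[int((1 - alpha / 2) * reps) - 1]
    point = sum(g for g, _ in per_caption) / sum(t for _, t in per_caption)
    return point, low, high


# Made-up toy counts for illustration only; these are not results from the release.
toy = [(11, 12), (9, 10), (14, 14), (7, 9), (12, 13), (10, 10), (8, 11), (13, 13)]
point, low, high = bootstrap_ratio_ci(toy)
print(f"precision={point:.4f} ci95=[{low:.4f}, {high:.4f}]")
```

Resampling whole captions rather than individual units keeps units from the same caption together, which matters if judged units within a caption are correlated.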

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/cbu_vqa_gemma4_table.md ADDED

	@@ -0,0 +1,3 @@

+| Surface | Resp | OK | Q | Support ↑ | Risk ↓ | Uncertain ↓ |
+|---|---:|---:|---:|---:|---:|---:|
+| naive_qwen35_datacomp | 4,775 | 4,775 | 58,335 | 0.9307 | 0.0403 | 0.0290 |

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/cbu_vqa_gemma4_table.tex ADDED

	@@ -0,0 +1,7 @@

+\begin{tabular}{lrrrrrr}
+\toprule
+Surface & Resp. & OK & Q & Support $\uparrow$ & Risk $\downarrow$ & Uncertain $\downarrow$ \\
+\midrule
+naive\_qwen35\_datacomp & 4,775 & 4,775 & 58,335 & 0.9307 & 0.0403 & 0.0290 \\
+\bottomrule
+\end{tabular}

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/claimed_cbu_ci.tsv ADDED

	@@ -0,0 +1,2 @@


+surface	captions	cbu_per_caption_ci95	cbu_per_100_tokens_ci95	object_per_caption_ci95	attribute_per_caption_ci95	relation_per_caption_ci95	camera_per_caption_ci95	lighting_per_caption_ci95	text_rendering_per_caption_ci95
+naive_qwen35_datacomp	4775	12.2119 [12.1177, 12.3129]	18.6254 [18.4859, 18.7762]	2.9213 [2.8699, 2.9749]	5.3527 [5.2804, 5.4237]	1.3812 [1.3506, 1.4121]	0.6605 [0.6377, 0.6836]	0.2394 [0.2245, 0.2551]	0.2892 [0.2704, 0.3083]

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/grounded_cbu_category_ci.tsv ADDED

	@@ -0,0 +1,9 @@

+surface	category	visual_units	grounded_precision_ci95	unsupported_rate_ci95	uncertain_rate_ci95
+naive_qwen35_datacomp	object	13943	0.9750 [0.9722, 0.9778]	0.0082 [0.0066, 0.0099]	0.0169 [0.0146, 0.0190]
+naive_qwen35_datacomp	attribute	25508	0.9511 [0.9483, 0.9539]	0.0100 [0.0086, 0.0115]	0.0389 [0.0365, 0.0413]
+naive_qwen35_datacomp	relation	6595	0.9792 [0.9756, 0.9830]	0.0127 [0.0098, 0.0156]	0.0080 [0.0059, 0.0102]
+naive_qwen35_datacomp	style	4903	0.9947 [0.9924, 0.9968]	0.0012 [0.0004, 0.0023]	0.0041 [0.0022, 0.0062]
+naive_qwen35_datacomp	camera	3153	0.9873 [0.9834, 0.9911]	0.0095 [0.0062, 0.0129]	0.0032 [0.0013, 0.0052]
+naive_qwen35_datacomp	lighting	1141	0.9877 [0.9810, 0.9941]	0.0044 [0.0009, 0.0084]	0.0079 [0.0026, 0.0137]
+naive_qwen35_datacomp	count	1622	0.9340 [0.9217, 0.9462]	0.0432 [0.0334, 0.0535]	0.0228 [0.0157, 0.0309]
+naive_qwen35_datacomp	text_rendering	1395	0.9362 [0.9221, 0.9496]	0.0287 [0.0199, 0.0384]	0.0351 [0.0252, 0.0454]
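
The `mean [low, high]` cells in this table are strings, so ranking categories needs a small parser. The sketch below, with the illustrative helper `parse_ci`, parses the `unsupported_rate_ci95` column copied from the rows above and orders categories from most to least unsupported.

```python
import re

# category -> unsupported_rate_ci95, copied from grounded_cbu_category_ci.tsv.
CELLS = {
    "object": "0.0082 [0.0066, 0.0099]",
    "attribute": "0.0100 [0.0086, 0.0115]",
    "relation": "0.0127 [0.0098, 0.0156]",
    "style": "0.0012 [0.0004, 0.0023]",
    "camera": "0.0095 [0.0062, 0.0129]",
    "lighting": "0.0044 [0.0009, 0.0084]",
    "count": "0.0432 [0.0334, 0.0535]",
    "text_rendering": "0.0287 [0.0199, 0.0384]",
}

CI_CELL = re.compile(r"^\s*([\d.]+)\s*\[([\d.]+),\s*([\d.]+)\]\s*$")


def parse_ci(cell):
    """Turn 'mean [low, high]' into a (mean, low, high) tuple of floats."""
    match = CI_CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected CI cell: {cell!r}")
    return tuple(float(x) for x in match.groups())


parsed = {name: parse_ci(cell) for name, cell in CELLS.items()}
ranked = sorted(parsed, key=lambda name: parsed[name][0], reverse=True)
print(ranked)
```

`count` and `text_rendering` rank highest, and the lower bounds of both intervals sit above the upper bound of every other category, so for this surface those two categories are where unsupported claims concentrate.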

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/gemma4_metric_tables/grounded_cbu_ci.tsv ADDED

	@@ -0,0 +1,2 @@


+surface	captions	visual_units	grounded_units_per_caption_ci95	grounded_precision_ci95	unsupported_rate_ci95	uncertain_rate_ci95
+naive_qwen35_datacomp	4775	58260	11.7801 [11.6785, 11.8796]	0.9655 [0.9638, 0.9674]	0.0104 [0.0093, 0.0116]	0.0241 [0.0227, 0.0255]
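
The grounded metrics in this row are tied together by simple identities, which gives a cheap sanity check on any re-export. The sketch below uses the unrounded means copied from the `grounded` block of `cbu_bootstrap_summary.json` for `naive_qwen35_datacomp`; the variable names are illustrative.

```python
# Unrounded means copied from cbu_bootstrap_summary.json (naive_qwen35_datacomp).
captions = 4775
visual_units = 58260
grounded_units_per_caption = 11.780104712041885
grounded_precision = 0.9654994850669413
unsupported_rate = 0.010384483350497768
uncertain_rate = 0.024116031582560933

# Recover integer unit counts from the reported rates.
grounded_units = round(grounded_units_per_caption * captions)
unsupported_units = round(unsupported_rate * visual_units)
uncertain_units = round(uncertain_rate * visual_units)

# The three verdicts should partition the judged visual units.
assert grounded_units + unsupported_units + uncertain_units == visual_units
# Precision should equal grounded units over all judged units.
assert abs(grounded_units / visual_units - grounded_precision) < 1e-9
# The three rates should sum to one.
assert abs(grounded_precision + unsupported_rate + uncertain_rate - 1.0) < 1e-9

print(grounded_units, unsupported_units, uncertain_units)
```

The recovered counts are 56,250 grounded, 605 unsupported, and 1,405 uncertain units, which sum to the 58,260 visual units in the table.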

eval_results/datacomp-naive-qwen35-baseline-2026-05-02/naive_qwen35_caption.summary.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "responses": 4980,
+  "unique_requests": 4775,
+  "captions": 4775,
+  "bad": 0,
+  "surface": "naive_qwen35_datacomp",
+  "output_jsonl": "artifacts/recap-ed/datacomp-naive-qwen35-baseline-2026-05-02/naive_qwen35_datacomp.jsonl",
+  "prompt": "Please generate a detailed caption of this image. Please be as descriptive as possible.",
+  "system_prompt": null,
+  "messages_policy": "single_user_message_with_image_no_system_prompt",
+  "token_mean": 296.45968586387437,
+  "token_median": 296,
+  "token_min": 12,
+  "token_max": 432
+}
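
The summary reports 4,980 responses but 4,775 unique requests, so 205 responses are repeats of a request already answered. The README mentions a resized retry pass for over-context images, which may account for them, though that is not stated explicitly. The sketch below shows one way such a response log could be collapsed to one row per request. The field names `request_id` and `caption` and the keep-last policy are assumptions for illustration, not the release schema.

```python
def dedupe_responses(rows, key="request_id"):
    """Keep the last response seen for each request id, in first-seen order."""
    latest = {}
    for row in rows:
        latest[row[key]] = row  # later rows (e.g. retries) overwrite earlier ones
    return list(latest.values())


# Toy rows for illustration only; field names are hypothetical.
toy_rows = [
    {"request_id": "a", "caption": "first try"},
    {"request_id": "b", "caption": "only try"},
    {"request_id": "a", "caption": "resized retry"},
]
unique_rows = dedupe_responses(toy_rows)
print(len(toy_rows), "responses ->", len(unique_rows), "unique requests")

# Counts copied from naive_qwen35_caption.summary.json.
repeat_responses = 4980 - 4775
print("repeat responses in the release summary:", repeat_responses)
```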

eval_results/embeddinggemma_pair_summary.tsv ADDED

	@@ -0,0 +1,8 @@

+pair	ours_surface	ref_surface	vendi_ours	vendi_ref	delta_vendi_o_minus_r	erank_ours	erank_ref	delta_erank_o_minus_r	top1mass_ours	top1mass_ref	delta_top1mass_o_minus_r	nn_o_to_r	nn_r_to_o	delta_nn_o_minus_r	support_o_in_r	support_r_in_o	delta_support_o_minus_r	density_o_in_r	density_r_in_o	delta_density_o_minus_r
+cc12m_llavanext_paired	ours_cc12m	ref_cc12m_llavanext	66.730734	90.160599	-23.429866	269.042328	286.025848	-16.983521	0.038855	0.034805	0.004051	0.769103	0.764140	0.004963	0.957660	0.843140	0.114520	0.497990	0.236742	0.261248
+cc12m_qwen3vl8b_paired	ours_cc12m	ref_cc12m_qwen3vl8b	51.634607	57.110133	-5.475526	222.048950	224.160461	-2.111511	0.053513	0.049385	0.004128	0.704297	0.707163	-0.002866	0.375800	0.354180	0.021620	0.049584	0.044978	0.004606
+cc12m_pixelprose_paired	ours_cc12m	ref_pixelprose_cc12m	66.854479	73.743682	-6.889202	269.228516	288.698914	-19.470398	0.038886	0.034999	0.003888	0.680594	0.676938	0.003656	0.602020	0.500780	0.101240	0.172554	0.115546	0.057008
+laion_pop_llama32_paired	ours_laion_pop	ref_laion_pop_llama32_11b	47.474582	63.807505	-16.332923	218.463516	241.209625	-22.746109	0.055081	0.048184	0.006897	0.794170	0.787970	0.006200	0.961740	0.845120	0.116620	0.506744	0.264092	0.242652
+pd12m_full_paired	ours_pd12m_img2dataset	ref_pd12m_full	51.291509	37.793293	13.498217	211.494385	169.747513	41.746872	0.069870	0.062775	0.007095	0.692069	0.704875	-0.012806	0.122240	0.257560	-0.135320	0.017254	0.038008	-0.020754
+danbooru2023_florence2_paired	ours_danbooru2023	ref_danbooru_florence2	42.259565	20.049109	22.210456	254.459122	104.725281	149.733841	0.043924	0.112505	-0.068580	0.638019	0.668428	-0.030409	0.028080	0.413040	-0.384960	0.003374	0.102292	-0.098918
+datacomp_recap_llava15_paired_url	ours_datacomp_forward	ref_datacomp_recap_llava15_llama3_8b	72.805853	85.648504	-12.842651	296.891449	285.083344	11.808105	0.038364	0.035212	0.003152	0.735544	0.738841	-0.003297	0.813480	0.815360	-0.001880	0.243068	0.208822	0.034246
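
Each `delta_*_o_minus_r` column above is "ours minus reference", so the sign tells which surface scores higher on that diagnostic. As a quick sketch, the Vendi columns copied from this table can be rechecked and the pairs where ours is higher listed; only the standard library is used, and the variable names are illustrative.

```python
import csv
import io

# pair, vendi_ours, vendi_ref, delta_vendi_o_minus_r copied from the TSV above.
TSV = (
    "pair\tvendi_ours\tvendi_ref\tdelta_vendi_o_minus_r\n"
    "cc12m_llavanext_paired\t66.730734\t90.160599\t-23.429866\n"
    "cc12m_qwen3vl8b_paired\t51.634607\t57.110133\t-5.475526\n"
    "cc12m_pixelprose_paired\t66.854479\t73.743682\t-6.889202\n"
    "laion_pop_llama32_paired\t47.474582\t63.807505\t-16.332923\n"
    "pd12m_full_paired\t51.291509\t37.793293\t13.498217\n"
    "danbooru2023_florence2_paired\t42.259565\t20.049109\t22.210456\n"
    "datacomp_recap_llava15_paired_url\t72.805853\t85.648504\t-12.842651\n"
)

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))

# Largest gap between the reported delta and a recomputed ours - ref.
max_gap = max(
    abs(float(r["vendi_ours"]) - float(r["vendi_ref"]) - float(r["delta_vendi_o_minus_r"]))
    for r in rows
)
ours_higher = [r["pair"] for r in rows if float(r["delta_vendi_o_minus_r"]) > 0]

print(f"max recompute gap: {max_gap:.1e}")
print("ours has higher Vendi on:", ours_higher)
```

On these rows the reported deltas match a recomputation to within rounding, and ours has the higher Vendi score only on the PD12M and Danbooru pairs.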

eval_results/eval_results_summary.md ADDED

	@@ -0,0 +1,34 @@

+# Recap Evaluation Results Summary
+Date: 2026-04-27
+## Evaluation Families
+| Family | Main artifact | Paper role | Status |
+|---|---|---|---|
+| Mechanical text metrics | artifacts/caption-survey/cpu_remaining_2026-04-24 | surface concentration, violations, repetition, lexical diversity proxies | done |
+| Prompt-pool support | artifacts/caption-survey/prompt_support_bootstrap_b64_n2_250k_2026-04-24.tsv | caption-prompt distribution support over declared prompt pools | done |
+| Embedding diversity/support | artifacts/recap-ed/metrics-2026-04-25/embedding | Vendi/effective-rank/embedding support diagnostics | done; model-sensitive |
+| Claimed CBU | artifacts/cbu/pair5k-local/claimed_cbu_v2_all7_b64_5k.responses.qwen397_c1024_mt4096.summary.json | text-side controllable-unit density at B=64 | done |
+| CBU budget frontier | artifacts/cbu/cc12m-four-caption-llava-url-bridge-bgrid-1k | CC12M budget sensitivity B={16,32,48,64} | done |
+| Image-conditioned VQA | artifacts/vqa-cbu | supported yield / support rate / risk | done for Qwen; Gemma cross-family done on CC12M |
+| LongCLIP retrieval | artifacts/longclip | dual-encoder retrieval separability diagnostic | done for corrected CC12M |
+## Plot-Ready Outputs
+- `cc12m_budget_frontier_plot.csv`: B-grid CBU yield/efficiency; plot `budget` vs `cbu_per_cap`, or `cbu_per_100tok` vs `cbu_per_cap`.
+- `cc12m_vqa_supported_risk_pareto.csv`: CC12M VQA Pareto; plot `unsupported_cap` vs `supported_cap`, facet by `judge`, use `pareto_supported_cost`.
+- `cc12m_longclip_plot.csv`: LongCLIP full/input64 retrieval diagnostic.
+- `all_cbu_b64_summary.csv`: All available paired surfaces CBU@64.
+- `all_vqa_b64_summary.csv`: All available VQA@64 summaries.
+- `prompt_support_direction_summary.csv`: Prompt-pool support direction counts over prompt pools.
+## CC12M Pareto State
+- `VQA@64`: Ours and Qwen3-VL-8B form the Pareto frontier under both Qwen and Gemma judges. Ours is the high supported-yield endpoint; Qwen3-VL-8B is the low-risk short-caption endpoint. LLaVA-NeXT and PixelProse are dominated on supported yield vs unsupported cost.
+- `CBU@B`: Qwen3-VL-8B is the token-efficiency endpoint; Ours becomes the absolute-yield endpoint from B=32 onward. This is the cleanest plot for showing why length and density cannot be collapsed.
+- `LongCLIP`: LLaVA-NeXT is strongest on input-64 retrieval margin/R@1; Ours remains locally separable but LongCLIP should stay appendix/diagnostic, not headline faithfulness.
+## Main Readout
+The current evidence supports a Pareto framing rather than a scalar ranking. Ours increases supported controllable-unit yield; short-clean captions minimize risk and maximize per-token efficiency; public long baselines can be dense without matching the supported-yield/risk frontier.
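
The Pareto framing in the summary above can be made concrete with a short dominance check over (supported yield, unsupported cost) pairs, in the spirit of the `supported_cap` and `unsupported_cap` columns of `cc12m_vqa_supported_risk_pareto.csv`. This is a minimal sketch, not the release plotting code: `pareto_frontier` is a hypothetical helper, and the toy points are invented and do not correspond to any surface in the results.

```python
def pareto_frontier(points):
    """Names of points not dominated on (supported: higher is better,
    unsupported: lower is better)."""
    frontier = []
    for name, supported, unsupported in points:
        dominated = any(
            other_s >= supported
            and other_u <= unsupported
            and (other_s > supported or other_u < unsupported)
            for _, other_s, other_u in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Invented toy points: (name, supported units per caption, unsupported units per caption).
toy_points = [
    ("long_dense", 9.0, 0.9),
    ("short_clean", 4.0, 0.1),
    ("mid_noisy", 6.0, 1.2),
    ("mid_sparse", 3.5, 0.4),
]
frontier = pareto_frontier(toy_points)
print(frontier)
```

In the toy data a high-yield point and a low-risk point both survive while the other two are dominated, which is the shape of result the summary describes: two endpoints on a frontier rather than a single scalar winner.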

eval_results/gemma-cross-corpus-2026-05-02/README.md ADDED

	@@ -0,0 +1,3 @@


+# 2026-05-02 Gemma Cross-Corpus Add-on
+
+This public-safe summary retains aggregate Gemma CBU/VQA tables only. The original run README contains local execution paths and remains in the private archive. Raw requests, raw responses, and image packages are not included in this anonymous code/results package.

eval_results/gemma-cross-corpus-2026-05-02/cbu_bootstrap_summary.json ADDED

	@@ -0,0 +1,1375 @@

+{
+  "bootstrap_reps": 2000,
+  "seed": 0,
+  "claimed": {},
+  "grounded": {
+    "datacomp_ours_gemma4": {
+      "input": "artifacts/grounded-cbu/grounded_verify_v2_ours_datacomp_forward_b64_5k.responses.gemma4_31b_c12_file_mt2048.jsonl",
+      "captions": 4975,
+      "visual_units": 70894,
+      "grounded_units_per_caption": {
+        "mean": 13.475979899497487,
+        "ci95_low": 13.371256281407035,
+        "ci95_high": 13.587547738693468
+      },
+      "grounded_precision": {
+        "mean": 0.9456794651169351,
+        "ci95_low": 0.9434585074215828,
+        "ci95_high": 0.9480059620159873
+      },
+      "unsupported_rate": {
+        "mean": 0.024586001636245663,
+        "ci95_low": 0.022810711845593374,
+        "ci95_high": 0.02637851194763035
+      },
+      "uncertain_rate": {
+        "mean": 0.029734533246819194,
+        "ci95_low": 0.028394080187231784,
+        "ci95_high": 0.03114928846803639
+      },
+      "categories": {
+        "object": {
+          "visual_units": 17907,
+          "grounded_precision": {
+            "mean": 0.9642597866756017,
+            "ci95_low": 0.9610363237643356,
+            "ci95_high": 0.9674664999321825
+          },
+          "unsupported_rate": {
+            "mean": 0.019154520578544703,
+            "ci95_low": 0.016682863959804863,
+            "ci95_high": 0.02152086512348487
+          },
+          "uncertain_rate": {
+            "mean": 0.016585692745853576,
+            "ci95_low": 0.014508736743501513,
+            "ci95_high": 0.018943918295760965
+          }
+        },
+        "attribute": {
+          "visual_units": 36712,
+          "grounded_precision": {
+            "mean": 0.9358247984310307,
+            "ci95_low": 0.9329203137056817,
+            "ci95_high": 0.9387029984241348
+          },
+          "unsupported_rate": {
+            "mean": 0.023343865765962084,
+            "ci95_low": 0.02125146772343254,
+            "ci95_high": 0.02541166652055019
+          },
+          "uncertain_rate": {
+            "mean": 0.04083133580300719,
+            "ci95_low": 0.038792389054040935,
+            "ci95_high": 0.04300736625807117
+          }
+        },
+        "relation": {
+          "visual_units": 8429,
+          "grounded_precision": {
+            "mean": 0.9569343931664491,
+            "ci95_low": 0.9522466319118269,
+            "ci95_high": 0.9616699703646194
+          },
+          "unsupported_rate": {
+            "mean": 0.027642662237513348,
+            "ci95_low": 0.023718871209624296,
+            "ci95_high": 0.03163265585009613
+          },
+          "uncertain_rate": {
+            "mean": 0.01542294459603749,
+            "ci95_low": 0.012585268641188942,
+            "ci95_high": 0.018206130406180576
+          }
+        },
+        "style": {
+          "visual_units": 1109,
+          "grounded_precision": {
+            "mean": 0.9828674481514879,
+            "ci95_low": 0.9745385100240553,
+            "ci95_high": 0.9901079136690647
+          },
+          "unsupported_rate": {
+            "mean": 0.007213706041478809,
+            "ci95_low": 0.0026977811281971727,
+            "ci95_high": 0.012545192457534672
+          },
+          "uncertain_rate": {
+            "mean": 0.009918845807033363,
+            "ci95_low": 0.004408880233418492,
+            "ci95_high": 0.017197192268298016
+          }
+        },
+        "camera": {
+          "visual_units": 678,
+          "grounded_precision": {
+            "mean": 0.9808259587020649,
+            "ci95_low": 0.9688426248938727,
+            "ci95_high": 0.9908814589665653
+          },
+          "unsupported_rate": {
+            "mean": 0.017699115044247787,
+            "ci95_low": 0.008570211038961039,
+            "ci95_high": 0.02886261350438178
+          },
+          "uncertain_rate": {
+            "mean": 0.0014749262536873156,
+            "ci95_low": 0.0,
+            "ci95_high": 0.004665811543436868
+          }
+        },
+        "lighting": {
+          "visual_units": 1616,
+          "grounded_precision": {
+            "mean": 0.9628712871287128,
+            "ci95_low": 0.9523794970310863,
+            "ci95_high": 0.9724717111656482
+          },
+          "unsupported_rate": {
+            "mean": 0.01608910891089109,
+            "ci95_low": 0.009432195604550915,
+            "ci95_high": 0.023197884730650722
+          },
+          "uncertain_rate": {
+            "mean": 0.02103960396039604,
+            "ci95_low": 0.014229015493625012,
+            "ci95_high": 0.028957791095690914
+          }
+        },
+        "count": {
+          "visual_units": 2519,
+          "grounded_precision": {
+            "mean": 0.9174275506153236,
+            "ci95_low": 0.9051838517773609,
+            "ci95_high": 0.9293143510760096
+          },
+          "unsupported_rate": {
+            "mean": 0.059944422389837236,
+            "ci95_low": 0.04997500859645053,
+            "ci95_high": 0.07063250288500046
+          },
+          "uncertain_rate": {
+            "mean": 0.02262802699483922,
+            "ci95_low": 0.016627515214735912,
+            "ci95_high": 0.029262822135353646
+          }
+        },
+        "text_rendering": {
+          "visual_units": 1924,
+          "grounded_precision": {
+            "mean": 0.9002079002079002,
+            "ci95_low": 0.883913116269738,
+            "ci95_high": 0.9158168932758676
+          },
+          "unsupported_rate": {
+            "mean": 0.058731808731808735,
+            "ci95_low": 0.04716448643471811,
+            "ci95_high": 0.07090820750728064
+          },
+          "uncertain_rate": {
+            "mean": 0.04106029106029106,
+            "ci95_low": 0.03121766477392487,
+            "ci95_high": 0.05085673527047604
+          }
+        }
+      }
+    },
+    "datacomp_ref_llava15_llama3_gemma4": {
+      "input": "artifacts/grounded-cbu/grounded_verify_v2_ref_datacomp_recap_llava15_b64_5k.responses.gemma4_31b_c12_file_mt2048.jsonl",
+      "captions": 4993,
+      "visual_units": 49844,
+      "grounded_units_per_caption": {
+        "mean": 8.284398157420389,
+        "ci95_low": 8.18925996394953,
+        "ci95_high": 8.376146605247348
+      },
+      "grounded_precision": {
+        "mean": 0.8298691918786614,
+        "ci95_low": 0.8248098325178892,
+        "ci95_high": 0.8346372137552281
+      },
+      "unsupported_rate": {
+        "mean": 0.1364256480218281,
+        "ci95_low": 0.13176538720325284,
+        "ci95_high": 0.14123288049770946
+      },
+      "uncertain_rate": {
+        "mean": 0.033705160099510474,
+        "ci95_low": 0.03192668676571995,
+        "ci95_high": 0.035440948366925316
+      },
+      "categories": {
+        "object": {
+          "visual_units": 17553,
+          "grounded_precision": {
+            "mean": 0.8627015325015667,
+            "ci95_low": 0.8561038616458624,
+            "ci95_high": 0.8689074008006997
+          },
+          "unsupported_rate": {
+            "mean": 0.11000968495413889,
+            "ci95_low": 0.10447249094349349,
+            "ci95_high": 0.11623041776294255
+          },
+          "uncertain_rate": {
+            "mean": 0.027288782544294423,
+            "ci95_low": 0.024654990286546868,
+            "ci95_high": 0.030052245129642233
+          }
+        },
+        "attribute": {
+          "visual_units": 18950,
+          "grounded_precision": {
+            "mean": 0.8221108179419525,
+            "ci95_low": 0.8151779991843955,
+            "ci95_high": 0.8291072843433401
+          },
+          "unsupported_rate": {
+            "mean": 0.1404221635883905,
+            "ci95_low": 0.13385510183117233,
+            "ci95_high": 0.14709778344098867
+          },
+          "uncertain_rate": {
+            "mean": 0.03746701846965699,
+            "ci95_low": 0.034650721292589025,
+            "ci95_high": 0.040226820003506435
+          }
+        },
+        "relation": {
+          "visual_units": 6437,
+          "grounded_precision": {
+            "mean": 0.8017710113406866,
+            "ci95_low": 0.7898056755645372,
+            "ci95_high": 0.8134298562953918
+          },
+          "unsupported_rate": {
+            "mean": 0.1673139661332919,
+            "ci95_low": 0.1566262814618043,
+            "ci95_high": 0.17846895550463168
+          },
+          "uncertain_rate": {
+            "mean": 0.03091502252602144,
+            "ci95_low": 0.02641733861850204,
+            "ci95_high": 0.03542643825320441
+          }
+        },
+        "style": {
+          "visual_units": 1363,
+          "grounded_precision": {
+            "mean": 0.909024211298606,
+            "ci95_low": 0.8905430120969045,
+            "ci95_high": 0.9264275304716482
+          },
+          "unsupported_rate": {
+            "mean": 0.08290535583272193,
+            "ci95_low": 0.06550163949711728,
+            "ci95_high": 0.10098194051552681
+          },
+          "uncertain_rate": {
+            "mean": 0.008070432868672046,
+            "ci95_low": 0.0036574442409453276,
+            "ci95_high": 0.012908330997616794
+          }
+        },
+        "camera": {
+          "visual_units": 586,
+          "grounded_precision": {
+            "mean": 0.9539249146757679,
+            "ci95_low": 0.9340990245313002,
+            "ci95_high": 0.971190084888673
+          },
+          "unsupported_rate": {
+            "mean": 0.03924914675767918,
+            "ci95_low": 0.023449907193625785,
+            "ci95_high": 0.0568568009573195
+          },
+          "uncertain_rate": {
+            "mean": 0.006825938566552901,
+            "ci95_low": 0.0,
+            "ci95_high": 0.016158570241643567
+          }
+        },
+        "lighting": {
+          "visual_units": 557,
+          "grounded_precision": {
+            "mean": 0.9425493716337523,
+            "ci95_low": 0.9209545841641741,
+            "ci95_high": 0.9637415139693121
+          },
+          "unsupported_rate": {
+            "mean": 0.03052064631956912,
+            "ci95_low": 0.016860036861274973,
+            "ci95_high": 0.04584162481699221
+          },
+          "uncertain_rate": {
+            "mean": 0.026929982046678635,
+            "ci95_low": 0.014284443314692424,
+            "ci95_high": 0.04159139842234417
+          }
+        },
+        "count": {
+          "visual_units": 1621,
+          "grounded_precision": {
+            "mean": 0.7717458359037631,
+            "ci95_low": 0.7495150656472152,
+            "ci95_high": 0.7925675387235496
+          },
+          "unsupported_rate": {
+            "mean": 0.2066625539790253,
+            "ci95_low": 0.18633889356134942,
+            "ci95_high": 0.2282719491411228
+          },
+          "uncertain_rate": {
+            "mean": 0.0215916101172116,
+            "ci95_low": 0.014293566934707975,
+            "ci95_high": 0.02975907779730107
+          }
+        },
+        "text_rendering": {
+          "visual_units": 2777,
+          "grounded_precision": {
+            "mean": 0.6867122794382428,
+            "ci95_low": 0.6659418299092212,
+            "ci95_high": 0.7066677204074464
+          },
+          "unsupported_rate": {
+            "mean": 0.23154483255311487,
+            "ci95_low": 0.2122531154567533,
+            "ci95_high": 0.2510471273839642
+          },
+          "uncertain_rate": {
+            "mean": 0.08174288800864242,
+            "ci95_low": 0.07049427394217786,
+            "ci95_high": 0.09295352323838081
+          }
+        }
+      }
+    },
+    "laion_pop_ours_gemma4": {
+      "input": "artifacts/grounded-cbu/gemma-cross-corpus-2026-05-02/responses/grounded_verify_v2_laion_pop_llama32_paired__ours_laion_pop_b64_5k.responses.gemma4_31b_file_mt2048.merged.jsonl",
+      "captions": 5235,
+      "visual_units": 76840,
+      "grounded_units_per_caption": {
+        "mean": 14.014708691499523,
+        "ci95_low": 13.919369627507164,
+        "ci95_high": 14.106432664756447
+      },
+      "grounded_precision": {
+        "mean": 0.9548021863612701,
+        "ci95_low": 0.9529896799722325,
+        "ci95_high": 0.9566572837454304
+      },
+      "unsupported_rate": {
+        "mean": 0.01773815720978657,
+        "ci95_low": 0.016448365431438447,
+        "ci95_high": 0.019014632987482634
+      },
+      "uncertain_rate": {
+        "mean": 0.02745965642894326,
+        "ci95_low": 0.02619943061801823,
+        "ci95_high": 0.028701005707437294
+      },
+      "categories": {
+        "object": {
+          "visual_units": 20835,
+          "grounded_precision": {
+            "mean": 0.9669786417086633,
+            "ci95_low": 0.9641924658755847,
+            "ci95_high": 0.9696334048225462
+          },
+          "unsupported_rate": {
+            "mean": 0.013486921046316295,
+            "ci95_low": 0.011761759215509404,
+            "ci95_high": 0.01523684166812232
+          },
+          "uncertain_rate": {
+            "mean": 0.019534437245020398,
+            "ci95_low": 0.01751867601543315,
+            "ci95_high": 0.02170879413425821
+          }
+        },
+        "attribute": {
+          "visual_units": 35923,
+          "grounded_precision": {
+            "mean": 0.9427664727333464,
+            "ci95_low": 0.9399496630663411,
+            "ci95_high": 0.9454086042005748
+          },
+          "unsupported_rate": {
+            "mean": 0.017565348105670463,
+            "ci95_low": 0.016032057535235764,
+            "ci95_high": 0.019248168499202422
+          },
+          "uncertain_rate": {
+            "mean": 0.03966817916098321,
+            "ci95_low": 0.037560816636684535,
+            "ci95_high": 0.0418013189080644
+          }
+        },
+        "relation": {
+          "visual_units": 12589,
+          "grounded_precision": {
+            "mean": 0.9656048931606959,
+            "ci95_low": 0.9620350212749554,
+            "ci95_high": 0.9689216951647477
+          },
+          "unsupported_rate": {
+            "mean": 0.021606164111525935,
+            "ci95_low": 0.018942978341140136,
+            "ci95_high": 0.024367183441305597
+          },
+          "uncertain_rate": {
+            "mean": 0.012788942727778219,
+            "ci95_low": 0.010787349834146821,
+            "ci95_high": 0.014976207495865219
+          }
+        },
+        "style": {
+          "visual_units": 1565,
+          "grounded_precision": {
+            "mean": 0.9763578274760384,
+            "ci95_low": 0.9657621612521076,
+            "ci95_high": 0.9846898671346335
+          },
+          "unsupported_rate": {
+            "mean": 0.021725239616613417,
+            "ci95_low": 0.013766778785982479,
+            "ci95_high": 0.031727679468470917
+          },
+          "uncertain_rate": {
+            "mean": 0.0019169329073482429,
+            "ci95_low": 0.0,
+            "ci95_high": 0.004405355696357039
+          }
+        },
+        "camera": {
+          "visual_units": 1402,
+          "grounded_precision": {
+            "mean": 0.9800285306704708,
+            "ci95_low": 0.9726416958689414,
+            "ci95_high": 0.9869196271807313
+          },
+          "unsupported_rate": {
+            "mean": 0.01783166904422254,
+            "ci95_low": 0.011387697868312216,
+            "ci95_high": 0.024673887724202186
+          },
+          "uncertain_rate": {
+            "mean": 0.0021398002853067048,
+            "ci95_low": 0.0,
+            "ci95_high": 0.004847729414903929
+          }
+        },
+        "lighting": {
+          "visual_units": 2120,
+          "grounded_precision": {
+            "mean": 0.9636792452830188,
+            "ci95_low": 0.9544776513590889,
+            "ci95_high": 0.9722486967620492
+          },
+          "unsupported_rate": {
+            "mean": 0.01650943396226415,
+            "ci95_low": 0.010505957588748659,
+            "ci95_high": 0.02313187443083103
+          },
+          "uncertain_rate": {
+            "mean": 0.01981132075471698,
+            "ci95_low": 0.013888390722142515,
+            "ci95_high": 0.02617801047120419
+          }
+        },
+        "count": {
+          "visual_units": 1882,
+          "grounded_precision": {
+            "mean": 0.9399574920297555,
+            "ci95_low": 0.9286488557993069,
+            "ci95_high": 0.9508577205269301
+          },
+          "unsupported_rate": {
+            "mean": 0.036663124335812966,
+            "ci95_low": 0.028046849152647695,
+            "ci95_high": 0.045934815491309684
+          },
+          "uncertain_rate": {
+            "mean": 0.023379383634431455,
+            "ci95_low": 0.01678006329983653,
+            "ci95_high": 0.030721138855536626
+          }
+        },
+        "text_rendering": {
+          "visual_units": 524,
+          "grounded_precision": {
+            "mean": 0.9217557251908397,
+            "ci95_low": 0.89171974522293,
+            "ci95_high": 0.9460091281751263
+          },
+          "unsupported_rate": {
+            "mean": 0.030534351145038167,
+            "ci95_low": 0.015685507079544147,
+            "ci95_high": 0.048290478163096215
+          },
+          "uncertain_rate": {
+            "mean": 0.04770992366412214,
+            "ci95_low": 0.027535899481451933,
+            "ci95_high": 0.07203408000697735
+          }
+        }
+      }
+    },
+    "laion_pop_ref_llama32_11b_gemma4": {
+      "input": "artifacts/grounded-cbu/gemma-cross-corpus-2026-05-02/responses/grounded_verify_v2_laion_pop_llama32_paired__ref_laion_pop_llama32_11b_b64_5k.responses.gemma4_31b_file_mt2048.merged.jsonl",
+      "captions": 4934,
+      "visual_units": 58530,
+      "grounded_units_per_caption": {
+        "mean": 10.883461694365627,
+        "ci95_low": 10.802391568706932,
+        "ci95_high": 10.96169943250912
+      },
+      "grounded_precision": {
+        "mean": 0.9174611310439091,
+        "ci95_low": 0.9144809990321734,
+        "ci95_high": 0.9203529591516613
+      },
+      "unsupported_rate": {
+        "mean": 0.04741158380317786,
+        "ci95_low": 0.045078869871934914,
+        "ci95_high": 0.049798940810027836
+      },
+      "uncertain_rate": {
+        "mean": 0.03512728515291304,
+        "ci95_low": 0.03354652742291209,
+        "ci95_high": 0.03688470763792352
+      },
+      "categories": {
+        "object": {
+          "visual_units": 17499,
+          "grounded_precision": {
+            "mean": 0.9395394022515572,
+            "ci95_low": 0.9354296734324157,
+            "ci95_high": 0.9435548006534396
+          },
+          "unsupported_rate": {
+            "mean": 0.030116006628950226,
+            "ci95_low": 0.02722223088526751,
+            "ci95_high": 0.03316094958286966
+          },
+          "uncertain_rate": {
+            "mean": 0.03034459111949254,
+            "ci95_low": 0.02755354972686743,
+            "ci95_high": 0.03310257103714487
+          }
+        },
+        "attribute": {
+          "visual_units": 24300,
+          "grounded_precision": {
+            "mean": 0.9122222222222223,
+            "ci95_low": 0.9080765707555861,
+            "ci95_high": 0.9160479385733995
+          },
+          "unsupported_rate": {
+            "mean": 0.04477366255144033,
+            "ci95_low": 0.041743665031443666,
+            "ci95_high": 0.04783829164134451
+          },
+          "uncertain_rate": {
+            "mean": 0.04300411522633745,
+            "ci95_low": 0.04052670604232754,
+            "ci95_high": 0.04575259635168874
+          }
+        },
+        "relation": {
+          "visual_units": 9388,
+          "grounded_precision": {
+            "mean": 0.8931614827439284,
+            "ci95_low": 0.8862970693301438,
+            "ci95_high": 0.8998709940861319
+          },
+          "unsupported_rate": {
+            "mean": 0.07754580315296122,
+            "ci95_low": 0.07152723210962283,
+            "ci95_high": 0.0835563589120201
+          },
+          "uncertain_rate": {
+            "mean": 0.029292714103110355,
+            "ci95_low": 0.025967829133300344,
+            "ci95_high": 0.03268904460200468
+          }
+        },
+        "style": {
+          "visual_units": 2803,
+          "grounded_precision": {
+            "mean": 0.981805208704959,
+            "ci95_low": 0.9766441929800787,
+            "ci95_high": 0.9867575739654666
+          },
+          "unsupported_rate": {
+            "mean": 0.008205494113449875,
+            "ci95_low": 0.004992822832673466,
+            "ci95_high": 0.01167728237791932
+          },
+          "uncertain_rate": {
+            "mean": 0.009989297181591153,
+            "ci95_low": 0.006585731512074666,
+            "ci95_high": 0.013724197034056185
+          }
+        },
+        "camera": {
+          "visual_units": 1689,
+          "grounded_precision": {
+            "mean": 0.9579632918886916,
+            "ci95_low": 0.9480353474320242,
+            "ci95_high": 0.9678189056986748
+          },
+          "unsupported_rate": {
+            "mean": 0.037892243931320305,
+            "ci95_low": 0.02838515125969358,
+            "ci95_high": 0.047281925510680764
+          },
+          "uncertain_rate": {
+            "mean": 0.0041444641799881585,
+            "ci95_low": 0.0012570118775073402,
+            "ci95_high": 0.007229680974351121
+          }
+        },
+        "lighting": {
+          "visual_units": 577,
+          "grounded_precision": {
+            "mean": 0.9428076256499134,
+            "ci95_low": 0.9221392337020697,
+            "ci95_high": 0.9619077757685353
+          },
+          "unsupported_rate": {
+            "mean": 0.025996533795493933,
+            "ci95_low": 0.01296236393509961,
+            "ci95_high": 0.040544726142071254
+          },
+          "uncertain_rate": {
+            "mean": 0.03119584055459272,
+            "ci95_low": 0.018032786885245903,
+            "ci95_high": 0.04659290720251998
+          }
+        },
+        "count": {
+          "visual_units": 1672,
+          "grounded_precision": {
+            "mean": 0.8038277511961722,
+            "ci95_low": 0.7818174008122463,
+            "ci95_high": 0.8253329734260568
+          },
+          "unsupported_rate": {
+            "mean": 0.13337320574162678,
+            "ci95_low": 0.11472343041732015,
+            "ci95_high": 0.15234714667072424
+          },
+          "uncertain_rate": {
+            "mean": 0.06279904306220095,
+            "ci95_low": 0.05067097561692991,
+            "ci95_high": 0.0765492834474793
+          }
+        },
+        "text_rendering": {
+          "visual_units": 602,
+          "grounded_precision": {
+            "mean": 0.7441860465116279,
+            "ci95_low": 0.7047773322456619,
+            "ci95_high": 0.7833348375451263
+          },
+          "unsupported_rate": {
+            "mean": 0.1777408637873754,
+            "ci95_low": 0.1421428961077926,
+            "ci95_high": 0.21182370376555062
+          },
+          "uncertain_rate": {
+            "mean": 0.07807308970099668,
+            "ci95_low": 0.053881497816718076,
+            "ci95_high": 0.10283172675481708
+          }
+        }
+      }
+    },
+    "pd12m_ours_gemma4": {
+      "input": "artifacts/grounded-cbu/gemma-cross-corpus-2026-05-02/responses/grounded_verify_v2_pd12m_full_paired__ours_pd12m_img2dataset_b64_5k.responses.gemma4_31b_file_mt2048.merged.jsonl",
+      "captions": 4878,
+      "visual_units": 72226,
+      "grounded_units_per_caption": {
+        "mean": 13.98339483394834,
+        "ci95_low": 13.892783927839279,
+        "ci95_high": 14.080975809758097
+      },
+      "grounded_precision": {
+        "mean": 0.9444106000609199,
+        "ci95_low": 0.9423015111074523,
+        "ci95_high": 0.9466077303227699
+      },
+      "unsupported_rate": {
+        "mean": 0.017168332733364718,
+        "ci95_low": 0.015683672986280997,
+        "ci95_high": 0.01852629414933296
+      },
+      "uncertain_rate": {
+        "mean": 0.03842106720571539,
+        "ci95_low": 0.03688762531728115,
+        "ci95_high": 0.039992564217287176
+      },
+      "categories": {
+        "object": {
+          "visual_units": 19969,
+          "grounded_precision": {
+            "mean": 0.9584355751414693,
+            "ci95_low": 0.9551936511076642,
+            "ci95_high": 0.9616730404750264
+          },
+          "unsupported_rate": {
+            "mean": 0.014622665130952978,
+            "ci95_low": 0.012649269110116218,
+            "ci95_high": 0.016574620430186544
+          },
+          "uncertain_rate": {
+            "mean": 0.026941759727577747,
+            "ci95_low": 0.024444098859569916,
+            "ci95_high": 0.029422380226120055
+          }
+        },
+        "attribute": {
+          "visual_units": 32324,
+          "grounded_precision": {
+            "mean": 0.9330528399950501,
+            "ci95_low": 0.9301680446952336,
+            "ci95_high": 0.935928329231497
+          },
+          "unsupported_rate": {
+            "mean": 0.01385967083281772,
+            "ci95_low": 0.012391638907511236,
+            "ci95_high": 0.015346772960610868
+          },
+          "uncertain_rate": {
+            "mean": 0.053087489172132164,
+            "ci95_low": 0.050520440024859774,
+            "ci95_high": 0.05560571293156231
+          }
+        },
+        "relation": {
+          "visual_units": 12392,
+          "grounded_precision": {
+            "mean": 0.954728857327308,
+            "ci95_low": 0.9505503287086146,
+            "ci95_high": 0.9587225334593927
+          },
+          "unsupported_rate": {
+            "mean": 0.02485474499677211,
+            "ci95_low": 0.02186390817849199,
+            "ci95_high": 0.028033882851901566
+          },
+          "uncertain_rate": {
+            "mean": 0.020416397675919948,
+            "ci95_low": 0.017864292388336762,
+            "ci95_high": 0.023058749646890597
+          }
+        },
+        "style": {
+          "visual_units": 1524,
+          "grounded_precision": {
+            "mean": 0.9862204724409449,
+            "ci95_low": 0.9797202731861921,
+            "ci95_high": 0.9920013245033112
+          },
+          "unsupported_rate": {
+            "mean": 0.007217847769028871,
+            "ci95_low": 0.003164208003318099,
+            "ci95_high": 0.012418503699714024
+          },
+          "uncertain_rate": {
+            "mean": 0.006561679790026247,
+            "ci95_low": 0.002668268089127504,
+            "ci95_high": 0.010596553830468846
+          }
+        },
+        "camera": {
+          "visual_units": 1082,
+          "grounded_precision": {
+            "mean": 0.9852125693160814,
+            "ci95_low": 0.9784240150093808,
+            "ci95_high": 0.9918867525372248
+          },
+          "unsupported_rate": {
+            "mean": 0.009242144177449169,
+            "ci95_low": 0.003787878787878788,
+            "ci95_high": 0.014955437709322918
+          },
+          "uncertain_rate": {
+            "mean": 0.005545286506469501,
+            "ci95_low": 0.0018095891857640724,
+            "ci95_high": 0.010428031210044313
+          }
+        },
+        "lighting": {
+          "visual_units": 1452,
+          "grounded_precision": {
+            "mean": 0.9545454545454546,
+            "ci95_low": 0.9434580065647575,
+            "ci95_high": 0.9655643556525163
+          },
+          "unsupported_rate": {
+            "mean": 0.014462809917355372,
+            "ci95_low": 0.009001684290707909,
+            "ci95_high": 0.02069425901201602
+          },
+          "uncertain_rate": {
+            "mean": 0.030991735537190084,
+            "ci95_low": 0.0222833477322651,
+            "ci95_high": 0.040167411791790744
+          }
+        },
+        "count": {
+          "visual_units": 2571,
+          "grounded_precision": {
+            "mean": 0.9237650719564372,
+            "ci95_low": 0.9131429308480155,
+            "ci95_high": 0.9345090816892005
+          },
+          "unsupported_rate": {
+            "mean": 0.03695060287825749,
+            "ci95_low": 0.029422150757030592,
+            "ci95_high": 0.04478193146417445
+          },
+          "uncertain_rate": {
+            "mean": 0.03928432516530533,
+            "ci95_low": 0.03158238405367036,
+            "ci95_high": 0.047523771907299846
+          }
+        },
+        "text_rendering": {
+          "visual_units": 912,
+          "grounded_precision": {
+            "mean": 0.8234649122807017,
+            "ci95_low": 0.7946618486235656,
+            "ci95_high": 0.8521741564827888
+          },
+          "unsupported_rate": {
+            "mean": 0.06030701754385965,
+            "ci95_low": 0.04179600692755259,
+            "ci95_high": 0.07931262648307606
+          },
+          "uncertain_rate": {
+            "mean": 0.1162280701754386,
+            "ci95_low": 0.09346544774388123,
+            "ci95_high": 0.13969858929173884
+          }
+        }
+      }
+    },
+    "pd12m_ref_gemma4": {
+      "input": "artifacts/grounded-cbu/gemma-cross-corpus-2026-05-02/responses/grounded_verify_v2_pd12m_full_paired__ref_pd12m_full_b64_5k.responses.gemma4_31b_file_mt2048.merged.jsonl",
+      "captions": 4989,
+      "visual_units": 48670,
+      "grounded_units_per_caption": {
+        "mean": 8.67849268390459,
+        "ci95_low": 8.588093806374022,
+        "ci95_high": 8.76830026057326
+      },
+      "grounded_precision": {
+        "mean": 0.8896034518183686,
+        "ci95_low": 0.8854235063494741,
+        "ci95_high": 0.8938611245657099
+      },
+      "unsupported_rate": {
+        "mean": 0.07507704951715635,
+        "ci95_low": 0.07145497249553306,
+        "ci95_high": 0.07861058520678543
+      },
+      "uncertain_rate": {
+        "mean": 0.035319498664475035,
+        "ci95_low": 0.03349925960458296,
+        "ci95_high": 0.03722167373614036
+      },
+      "categories": {
+        "object": {
+          "visual_units": 21867,
+          "grounded_precision": {
+            "mean": 0.924909681254859,
+            "ci95_low": 0.9211441883611063,
+            "ci95_high": 0.9290266397219379
+          },
+          "unsupported_rate": {
+            "mean": 0.04856633283029222,
+            "ci95_low": 0.04521409002469015,
+            "ci95_high": 0.05165961129865847
+          },
+          "uncertain_rate": {
+            "mean": 0.02652398591484886,
+            "ci95_low": 0.024442020449743983,
+            "ci95_high": 0.028826580589740913
+          }
+        },
+        "attribute": {
+          "visual_units": 10053,
+          "grounded_precision": {
+            "mean": 0.817268477071521,
+            "ci95_low": 0.8081304603115136,
+            "ci95_high": 0.8263230891360528
+          },
+          "unsupported_rate": {
+            "mean": 0.11150900228787426,
+            "ci95_low": 0.10416651002506266,
+            "ci95_high": 0.119007775466157
+          },
+          "uncertain_rate": {
+            "mean": 0.07122252064060479,
+            "ci95_low": 0.0659271432900421,
+            "ci95_high": 0.076343588817552
+          }
+        },
+        "relation": {
+          "visual_units": 11840,
+          "grounded_precision": {
+            "mean": 0.8788851351351351,
+            "ci95_low": 0.8722028400918369,
+            "ci95_high": 0.8860438985440879
+          },
+          "unsupported_rate": {
+            "mean": 0.09628378378378379,
+            "ci95_low": 0.08994703483955481,
+            "ci95_high": 0.10233539423457812
+          },
+          "uncertain_rate": {
+            "mean": 0.02483108108108108,
+            "ci95_low": 0.021855507466231116,
+            "ci95_high": 0.02752990768992023
+          }
+        },
+        "style": {
+          "visual_units": 2000,
+          "grounded_precision": {
+            "mean": 0.9705,
+            "ci95_low": 0.9620309397129904,
+            "ci95_high": 0.9784141959009558
+          },
+          "unsupported_rate": {
+            "mean": 0.008,
+            "ci95_low": 0.0039485699710470005,
+            "ci95_high": 0.012525678867775709
+          },
+          "uncertain_rate": {
+            "mean": 0.0215,
+            "ci95_low": 0.014712355895481156,
+            "ci95_high": 0.028557222991185252
+          }
+        },
+        "camera": {
+          "visual_units": 529,
+          "grounded_precision": {
+            "mean": 0.9905482041587902,
+            "ci95_low": 0.9815481071477382,
+            "ci95_high": 0.9980694980694981
+          },
+          "unsupported_rate": {
+            "mean": 0.00945179584120983,
+            "ci95_low": 0.0019305019305019305,
+            "ci95_high": 0.018451892852261845
+          },
+          "uncertain_rate": {
+            "mean": 0.0,
+            "ci95_low": 0.0,
+            "ci95_high": 0.0
+          }
+        },
+        "lighting": {
+          "visual_units": 151,
+          "grounded_precision": {
+            "mean": 0.7880794701986755,
+            "ci95_low": 0.7133333333333334,
+            "ci95_high": 0.8633610954263128
+          },
+          "unsupported_rate": {
+            "mean": 0.1390728476821192,
+            "ci95_low": 0.07842094284522319,
+            "ci95_high": 0.2025359219979473
+          },
+          "uncertain_rate": {
+            "mean": 0.0728476821192053,
+            "ci95_low": 0.031007751937984496,
+            "ci95_high": 0.1171875
+          }
+        },
+        "count": {
+          "visual_units": 1235,
+          "grounded_precision": {
+            "mean": 0.891497975708502,
+            "ci95_low": 0.8735133051522816,
+            "ci95_high": 0.9087186268570522
+          },
+          "unsupported_rate": {
+            "mean": 0.08016194331983806,
+            "ci95_low": 0.06572418315561385,
+            "ci95_high": 0.0962532336425011
+          },
+          "uncertain_rate": {
+            "mean": 0.02834008097165992,
+            "ci95_low": 0.0197850851610843,
+            "ci95_high": 0.03760887772194304
+          }
+        },
+        "text_rendering": {
+          "visual_units": 995,
+          "grounded_precision": {
+            "mean": 0.7688442211055276,
+            "ci95_low": 0.7408910471345435,
+            "ci95_high": 0.7962583053180506
+          },
+          "unsupported_rate": {
+            "mean": 0.19095477386934673,
+            "ci95_low": 0.16630810749564806,
+            "ci95_high": 0.21827528625954196
+          },
+          "uncertain_rate": {
+            "mean": 0.04020100502512563,
+            "ci95_low": 0.02758440751161421,
+            "ci95_high": 0.05378738930289204
+          }
+        }
+      }
+    },
+    "danbooru_ours_gemma4": {
+      "input": "artifacts/grounded-cbu/gemma-cross-corpus-2026-05-02/responses/grounded_verify_v2_danbooru2023_florence2_paired__ours_danbooru2023_b64_5k.responses.gemma4_31b_file_mt2048.merged.jsonl",
+      "captions": 4879,
+      "visual_units": 69427,
+      "grounded_units_per_caption": {
+        "mean": 13.343103094896495,
+        "ci95_low": 13.263358270137322,
+        "ci95_high": 13.420009223201475
+      },
+      "grounded_precision": {
+        "mean": 0.937689947714866,
+        "ci95_low": 0.9353210266173098,
+        "ci95_high": 0.9400678140345307
+      },
+      "unsupported_rate": {
+        "mean": 0.03569216587206706,
+        "ci95_low": 0.033793547632205025,
+        "ci95_high": 0.037533687334456604
+      },
+      "uncertain_rate": {
+        "mean": 0.026617886413066963,
+        "ci95_low": 0.025264995170242218,
+        "ci95_high": 0.0279962721024009
+      },
+      "categories": {
+        "object": {
+          "visual_units": 18718,
+          "grounded_precision": {
+            "mean": 0.9326851159311892,
+            "ci95_low": 0.928437766000394,
+            "ci95_high": 0.9372108650745524
+          },
+          "unsupported_rate": {
+            "mean": 0.03168073512127364,
+            "ci95_low": 0.0288041447965154,
+            "ci95_high": 0.03464041901717594
+          },
+          "uncertain_rate": {
+            "mean": 0.03563414894753713,
+            "ci95_low": 0.03228536573086078,
+            "ci95_high": 0.03895369338886005
+          }
+        },
+        "attribute": {
+          "visual_units": 33800,
+          "grounded_precision": {
+            "mean": 0.9344970414201184,
+            "ci95_low": 0.9313771418556193,
+            "ci95_high": 0.9374098248629728
+          },
+          "unsupported_rate": {
+            "mean": 0.03644970414201183,
+            "ci95_low": 0.03409606310795786,
+            "ci95_high": 0.03888285322216871
+          },
+          "uncertain_rate": {
+            "mean": 0.02905325443786982,
+            "ci95_low": 0.027214098892000937,
+            "ci95_high": 0.03089446256587214
+          }
+        },
+        "relation": {
+          "visual_units": 12258,
+          "grounded_precision": {
+            "mean": 0.9447707619513787,
+            "ci95_low": 0.940250251476046,
+            "ci95_high": 0.9488500322338025
+          },
+          "unsupported_rate": {
+            "mean": 0.04470549844999184,
+            "ci95_low": 0.040868106622205656,
+            "ci95_high": 0.04874403893744779
+          },
+          "uncertain_rate": {
+            "mean": 0.010523739598629466,
+            "ci95_low": 0.008758328912112647,
+            "ci95_high": 0.012379041421155493
+          }
+        },
+        "style": {
+          "visual_units": 1846,
+          "grounded_precision": {
+            "mean": 0.9739978331527628,
+            "ci95_low": 0.9665969610035756,
+            "ci95_high": 0.9811124814761271
+          },
+          "unsupported_rate": {
+            "mean": 0.008125677139761646,
+            "ci95_low": 0.004282540736754114,
+            "ci95_high": 0.012575177692728267
+          },
+          "uncertain_rate": {
+            "mean": 0.017876489707475622,
+            "ci95_low": 0.01222721386094503,
+            "ci95_high": 0.024109205741154226
+          }
+        },
+        "camera": {
+          "visual_units": 1112,
+          "grounded_precision": {
+            "mean": 0.9820143884892086,
+            "ci95_low": 0.9739271259004545,
+            "ci95_high": 0.9893428063943162
+          },
+          "unsupported_rate": {
+            "mean": 0.015287769784172662,
+            "ci95_low": 0.008795074758135445,
+            "ci95_high": 0.02246401628939388
+          },
+          "uncertain_rate": {
+            "mean": 0.002697841726618705,
+            "ci95_low": 0.0,
+            "ci95_high": 0.006227758007117438
+          }
+        },
+        "lighting": {
+          "visual_units": 572,
+          "grounded_precision": {
+            "mean": 0.9493006993006993,
+            "ci95_low": 0.9295981100021236,
+            "ci95_high": 0.966789667896679
+          },
+          "unsupported_rate": {
+            "mean": 0.02972027972027972,
+            "ci95_low": 0.01688358270137323,
+            "ci95_high": 0.04407311276502592
+          },
+          "uncertain_rate": {
+            "mean": 0.02097902097902098,
+            "ci95_low": 0.009208103130755065,
+            "ci95_high": 0.034488458250213704
+          }
+        },
+        "count": {
+          "visual_units": 977,
+          "grounded_precision": {
+            "mean": 0.9426816786079836,
+            "ci95_low": 0.9268797805121384,
+            "ci95_high": 0.9575213489201178
+          },
+          "unsupported_rate": {
+            "mean": 0.04503582395087001,
+            "ci95_low": 0.03157728881026095,
+            "ci95_high": 0.05983183483183482
+          },
+          "uncertain_rate": {
+            "mean": 0.012282497441146366,
+            "ci95_low": 0.005916946478678972,
+            "ci95_high": 0.02020355530993828
+          }
+        },
+        "text_rendering": {
+          "visual_units": 144,
+          "grounded_precision": {
+            "mean": 0.8472222222222222,
+            "ci95_low": 0.7883211678832117,
+            "ci95_high": 0.9056698746359376
+          },
+          "unsupported_rate": {
+            "mean": 0.08333333333333333,
+            "ci95_low": 0.041666666666666664,
+            "ci95_high": 0.1283812094217122
+          },
+          "uncertain_rate": {
+            "mean": 0.06944444444444445,
+            "ci95_low": 0.02938069594034797,
+            "ci95_high": 0.11489179396788078
+          }
+        }
+      }
+    },
+    "danbooru_ref_florence2_gemma4": {
+      "input": "artifacts/grounded-cbu/gemma-cross-corpus-2026-05-02/responses/grounded_verify_v2_danbooru2023_florence2_paired__ref_danbooru_florence2_b64_5k.responses.gemma4_31b_file_mt2048.merged.jsonl",
+      "captions": 4968,
+      "visual_units": 40646,
+      "grounded_units_per_caption": {
+        "mean": 6.439009661835748,
+        "ci95_low": 6.360904790660225,
+        "ci95_high": 6.514900362318841
+      },
+      "grounded_precision": {
+        "mean": 0.7870147123948236,
+        "ci95_low": 0.7811164877031025,
+        "ci95_high": 0.7933524180366222
+      },
+      "unsupported_rate": {
+        "mean": 0.17389164985484426,
+        "ci95_low": 0.16829958528079125,
+        "ci95_high": 0.17941198580066595
+      },
+      "uncertain_rate": {
+        "mean": 0.03909363775033214,
+        "ci95_low": 0.037210508029968115,
+        "ci95_high": 0.04105116788616811
+      },
+      "categories": {
+        "object": {
+          "visual_units": 15099,
+          "grounded_precision": {
+            "mean": 0.8533015431485529,
+            "ci95_low": 0.8464048689769926,
+            "ci95_high": 0.8600571230864825
+          },
+          "unsupported_rate": {
+            "mean": 0.12696205046691833,
+            "ci95_low": 0.12043951118207459,
+            "ci95_high": 0.13358764830785877
+          },
+          "uncertain_rate": {
+            "mean": 0.01973640638452878,
+            "ci95_low": 0.017507491853620417,
+            "ci95_high": 0.02205217334408175
+          }
+        },
+        "attribute": {
+          "visual_units": 12265,
+          "grounded_precision": {
+            "mean": 0.696942519364044,
+            "ci95_low": 0.6871524158965265,
+            "ci95_high": 0.7068862971725645
+          },
+          "unsupported_rate": {
+            "mean": 0.22380758255197716,
+            "ci95_low": 0.21517133587705695,
+            "ci95_high": 0.23231732751904313
+          },
+          "uncertain_rate": {
+            "mean": 0.0792498980839788,
+            "ci95_low": 0.07454460333231255,
+            "ci95_high": 0.08410270043229699
+          }
+        },
+        "relation": {
+          "visual_units": 7781,
+          "grounded_precision": {
+            "mean": 0.7528595296234417,
+            "ci95_low": 0.7421206268636046,
+            "ci95_high": 0.7633506200025479
+          },
+          "unsupported_rate": {
+            "mean": 0.2223364606091762,
+            "ci95_low": 0.2125031040559888,
+            "ci95_high": 0.2323494647471591
+          },
+          "uncertain_rate": {
+            "mean": 0.024804009767382083,
+            "ci95_low": 0.021322743895831952,
+            "ci95_high": 0.02829033983002319
+          }
+        },
+        "style": {
+          "visual_units": 2893,
+          "grounded_precision": {
+            "mean": 0.9633598340822676,
+            "ci95_low": 0.9559915586069002,
+            "ci95_high": 0.9707531672740175
+          },
+          "unsupported_rate": {
+            "mean": 0.02212236432768752,
+            "ci95_low": 0.016421652562371717,
+            "ci95_high": 0.02826700942420635
+          },
+          "uncertain_rate": {
+            "mean": 0.014517801590044937,
+            "ci95_low": 0.00987937478763167,
+            "ci95_high": 0.019464889394231646
+          }
+        },
+        "camera": {
+          "visual_units": 29,
+          "grounded_precision": {
+            "mean": 1.0,
+            "ci95_low": 1.0,
+            "ci95_high": 1.0
+          },
+          "unsupported_rate": {
+            "mean": 0.0,
+            "ci95_low": 0.0,
+            "ci95_high": 0.0
+          },
+          "uncertain_rate": {
+            "mean": 0.0,
+            "ci95_low": 0.0,
+            "ci95_high": 0.0
+          }
+        },
+        "lighting": {
+          "visual_units": 164,
+          "grounded_precision": {
+            "mean": 0.8048780487804879,
+            "ci95_low": 0.7405001643925694,
+            "ci95_high": 0.8647140021652833
+          },
+          "unsupported_rate": {
+            "mean": 0.1402439024390244,
+            "ci95_low": 0.08749273255813952,
+            "ci95_high": 0.1985839086938143
+          },
+          "uncertain_rate": {
+            "mean": 0.054878048780487805,
+            "ci95_low": 0.023668639053254437,
+            "ci95_high": 0.09090909090909091
+          }
+        },
+        "count": {
+          "visual_units": 946,
+          "grounded_precision": {
+            "mean": 0.9386892177589852,
+            "ci95_low": 0.9228329809725159,
+            "ci95_high": 0.953977018739048
+          },
+          "unsupported_rate": {
+            "mean": 0.05708245243128964,
+            "ci95_low": 0.04231166150670795,
+            "ci95_high": 0.0721654956552192
+          },
+          "uncertain_rate": {
+            "mean": 0.004228329809725159,
+            "ci95_low": 0.0010192387616229663,
+            "ci95_high": 0.008832707471540264
+          }
+        },
+        "text_rendering": {
+          "visual_units": 1469,
+          "grounded_precision": {
+            "mean": 0.5874744724302247,
+            "ci95_low": 0.5621583756000362,
+            "ci95_high": 0.6121796184489994
+          },
+          "unsupported_rate": {
+            "mean": 0.3641933287950987,
+            "ci95_low": 0.3403043178190916,
+            "ci95_high": 0.3885517920790823
+          },
+          "uncertain_rate": {
+            "mean": 0.04833219877467665,
+            "ci95_low": 0.03775967475698478,
+            "ci95_high": 0.05977279311715422
+          }
+        }
+      }
+    }
+  }
+}
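
Note on reading this summary: each per-category block above carries `visual_units` plus three statistics with `mean`, `ci95_low`, and `ci95_high`. The sketch below is one hedged way to flatten such a block into `mean [low, high]` cells like those in `grounded_cbu_category_ci.tsv`. The helper names `fmt_ci` and `category_rows` are hypothetical and are not taken from the release scripts; the four-decimal rounding is an assumption inferred from the TSV cells.

```python
# Hypothetical helpers (not from the release scripts): flatten one surface's
# "categories" block into tab-separated rows with "mean [low, high]" cells.
METRICS = ("grounded_precision", "unsupported_rate", "uncertain_rate")


def fmt_ci(stat: dict) -> str:
    """Format a {"mean", "ci95_low", "ci95_high"} dict with 4 decimals (assumed)."""
    return f"{stat['mean']:.4f} [{stat['ci95_low']:.4f}, {stat['ci95_high']:.4f}]"


def category_rows(surface: str, summary: dict) -> list[str]:
    """Return one TSV row per category of a single surface summary."""
    rows = []
    for category, stats in summary["categories"].items():
        cells = [surface, category, str(stats["visual_units"])]
        cells += [fmt_ci(stats[m]) for m in METRICS]
        rows.append("\t".join(cells))
    return rows


# Values copied from the danbooru_ref_florence2_gemma4 text_rendering block.
example = {
    "categories": {
        "text_rendering": {
            "visual_units": 1469,
            "grounded_precision": {
                "mean": 0.5874744724302247,
                "ci95_low": 0.5621583756000362,
                "ci95_high": 0.6121796184489994,
            },
            "unsupported_rate": {
                "mean": 0.3641933287950987,
                "ci95_low": 0.3403043178190916,
                "ci95_high": 0.3885517920790823,
            },
            "uncertain_rate": {
                "mean": 0.04833219877467665,
                "ci95_low": 0.03775967475698478,
                "ci95_high": 0.05977279311715422,
            },
        }
    }
}

rows = category_rows("danbooru_ref_florence2_gemma4", example)
print(rows[0])
```

For this block the formatted row agrees with the `danbooru_ref_florence2_gemma4` / `text_rendering` line of the category TSV, which is a quick way to confirm that the two files describe the same numbers.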

eval_results/gemma-cross-corpus-2026-05-02/cbu_vqa_gemma4_cross_corpus_table.md ADDED

	@@ -0,0 +1,10 @@

+| Surface | Resp | OK | Q | Support ↑ | Risk ↓ | Uncertain ↓ |
+|---|---:|---:|---:|---:|---:|---:|
+| danbooru2023_florence2_paired__ours_danbooru2023 | 4,993 | 4,993 | 71,555 | 0.8276 | 0.0938 | 0.0786 |
+| danbooru2023_florence2_paired__ref_danbooru_florence2 | 4,969 | 4,969 | 40,755 | 0.7494 | 0.2345 | 0.0161 |
+| laion_pop_llama32_paired__ours_laion_pop | 4,964 | 4,964 | 73,564 | 0.9192 | 0.0601 | 0.0207 |
+| laion_pop_llama32_paired__ref_laion_pop_llama32_11b | 4,947 | 4,947 | 58,935 | 0.8583 | 0.1131 | 0.0286 |
+| ours_datacomp_forward | 4,775 | 4,775 | 68,500 | 0.8886 | 0.0840 | 0.0274 |
+| pd12m_full_paired__ours_pd12m_img2dataset | 4,957 | 4,957 | 74,463 | 0.9013 | 0.0659 | 0.0328 |
+| pd12m_full_paired__ref_pd12m_full | 4,989 | 4,989 | 48,825 | 0.8405 | 0.1308 | 0.0287 |
+| ref_datacomp_recap_llava15_llama3_8b | 4,779 | 4,779 | 47,878 | 0.7662 | 0.2170 | 0.0168 |
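
Note on reading this table: in every row, Support, Risk, and Uncertain add up to 1.0000 and Resp equals OK, so the three rates appear to partition the Q answers. The sketch below is a hedged sanity check for rows of this shape. The function names `parse_md_row` and `check_row` are hypothetical, and the tolerance of 1e-3 is an assumption chosen to absorb four-decimal rounding.

```python
# Hypothetical sanity check (not from the release scripts) for one Markdown
# table row: Resp == OK, and Support + Risk + Uncertain is 1 within rounding.

def parse_md_row(line: str) -> list[str]:
    """Split a Markdown table row into stripped cells, ignoring a diff '+'."""
    body = line.lstrip("+").strip().strip("|")
    return [cell.strip() for cell in body.split("|")]


def check_row(line: str, tol: float = 1e-3) -> dict:
    """Return consistency flags for one data row of the cross-corpus table."""
    surface, resp, ok, _q, support, risk, uncertain = parse_md_row(line)
    total = float(support) + float(risk) + float(uncertain)
    return {
        "surface": surface,
        "all_ok": int(resp.replace(",", "")) == int(ok.replace(",", "")),
        "sums_to_one": abs(total - 1.0) <= tol,
    }


row = "+| ours_datacomp_forward | 4,775 | 4,775 | 68,500 | 0.8886 | 0.0840 | 0.0274 |"
result = check_row(row)
print(result)
```

A row that fails either flag would point to dropped responses or a rounding problem in the export step rather than to a modelling difference.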

eval_results/gemma-cross-corpus-2026-05-02/cbu_vqa_gemma4_cross_corpus_table.tex ADDED

	@@ -0,0 +1,14 @@

+\begin{tabular}{lrrrrrr}
+\toprule
+Surface & Resp. & OK & Q & Support $\uparrow$ & Risk $\downarrow$ & Uncertain $\downarrow$ \\
+\midrule
+danbooru2023\_florence2\_paired\_\_ours\_danbooru2023 & 4,993 & 4,993 & 71,555 & 0.8276 & 0.0938 & 0.0786 \\
+danbooru2023\_florence2\_paired\_\_ref\_danbooru\_florence2 & 4,969 & 4,969 & 40,755 & 0.7494 & 0.2345 & 0.0161 \\
+laion\_pop\_llama32\_paired\_\_ours\_laion\_pop & 4,964 & 4,964 & 73,564 & 0.9192 & 0.0601 & 0.0207 \\
+laion\_pop\_llama32\_paired\_\_ref\_laion\_pop\_llama32\_11b & 4,947 & 4,947 & 58,935 & 0.8583 & 0.1131 & 0.0286 \\
+ours\_datacomp\_forward & 4,775 & 4,775 & 68,500 & 0.8886 & 0.0840 & 0.0274 \\
+pd12m\_full\_paired\_\_ours\_pd12m\_img2dataset & 4,957 & 4,957 & 74,463 & 0.9013 & 0.0659 & 0.0328 \\
+pd12m\_full\_paired\_\_ref\_pd12m\_full & 4,989 & 4,989 & 48,825 & 0.8405 & 0.1308 & 0.0287 \\
+ref\_datacomp\_recap\_llava15\_llama3\_8b & 4,779 & 4,779 & 47,878 & 0.7662 & 0.2170 & 0.0168 \\
+\bottomrule
+\end{tabular}

eval_results/gemma-cross-corpus-2026-05-02/claimed_cbu_ci.tsv ADDED

	@@ -0,0 +1 @@


+surface	captions	cbu_per_caption_ci95	cbu_per_100_tokens_ci95	object_per_caption_ci95	attribute_per_caption_ci95	relation_per_caption_ci95	camera_per_caption_ci95	lighting_per_caption_ci95	text_rendering_per_caption_ci95

eval_results/gemma-cross-corpus-2026-05-02/grounded_cbu_category_ci.tsv ADDED

	@@ -0,0 +1,65 @@

+surface	category	visual_units	grounded_precision_ci95	unsupported_rate_ci95	uncertain_rate_ci95
+datacomp_ours_gemma4	object	17907	0.9643 [0.9610, 0.9675]	0.0192 [0.0167, 0.0215]	0.0166 [0.0145, 0.0189]
+datacomp_ours_gemma4	attribute	36712	0.9358 [0.9329, 0.9387]	0.0233 [0.0213, 0.0254]	0.0408 [0.0388, 0.0430]
+datacomp_ours_gemma4	relation	8429	0.9569 [0.9522, 0.9617]	0.0276 [0.0237, 0.0316]	0.0154 [0.0126, 0.0182]
+datacomp_ours_gemma4	style	1109	0.9829 [0.9745, 0.9901]	0.0072 [0.0027, 0.0125]	0.0099 [0.0044, 0.0172]
+datacomp_ours_gemma4	camera	678	0.9808 [0.9688, 0.9909]	0.0177 [0.0086, 0.0289]	0.0015 [0.0000, 0.0047]
+datacomp_ours_gemma4	lighting	1616	0.9629 [0.9524, 0.9725]	0.0161 [0.0094, 0.0232]	0.0210 [0.0142, 0.0290]
+datacomp_ours_gemma4	count	2519	0.9174 [0.9052, 0.9293]	0.0599 [0.0500, 0.0706]	0.0226 [0.0166, 0.0293]
+datacomp_ours_gemma4	text_rendering	1924	0.9002 [0.8839, 0.9158]	0.0587 [0.0472, 0.0709]	0.0411 [0.0312, 0.0509]
+datacomp_ref_llava15_llama3_gemma4	object	17553	0.8627 [0.8561, 0.8689]	0.1100 [0.1045, 0.1162]	0.0273 [0.0247, 0.0301]
+datacomp_ref_llava15_llama3_gemma4	attribute	18950	0.8221 [0.8152, 0.8291]	0.1404 [0.1339, 0.1471]	0.0375 [0.0347, 0.0402]
+datacomp_ref_llava15_llama3_gemma4	relation	6437	0.8018 [0.7898, 0.8134]	0.1673 [0.1566, 0.1785]	0.0309 [0.0264, 0.0354]
+datacomp_ref_llava15_llama3_gemma4	style	1363	0.9090 [0.8905, 0.9264]	0.0829 [0.0655, 0.1010]	0.0081 [0.0037, 0.0129]
+datacomp_ref_llava15_llama3_gemma4	camera	586	0.9539 [0.9341, 0.9712]	0.0392 [0.0234, 0.0569]	0.0068 [0.0000, 0.0162]
+datacomp_ref_llava15_llama3_gemma4	lighting	557	0.9425 [0.9210, 0.9637]	0.0305 [0.0169, 0.0458]	0.0269 [0.0143, 0.0416]
+datacomp_ref_llava15_llama3_gemma4	count	1621	0.7717 [0.7495, 0.7926]	0.2067 [0.1863, 0.2283]	0.0216 [0.0143, 0.0298]
+datacomp_ref_llava15_llama3_gemma4	text_rendering	2777	0.6867 [0.6659, 0.7067]	0.2315 [0.2123, 0.2510]	0.0817 [0.0705, 0.0930]
+laion_pop_ours_gemma4	object	20835	0.9670 [0.9642, 0.9696]	0.0135 [0.0118, 0.0152]	0.0195 [0.0175, 0.0217]
+laion_pop_ours_gemma4	attribute	35923	0.9428 [0.9399, 0.9454]	0.0176 [0.0160, 0.0192]	0.0397 [0.0376, 0.0418]
+laion_pop_ours_gemma4	relation	12589	0.9656 [0.9620, 0.9689]	0.0216 [0.0189, 0.0244]	0.0128 [0.0108, 0.0150]
+laion_pop_ours_gemma4	style	1565	0.9764 [0.9658, 0.9847]	0.0217 [0.0138, 0.0317]	0.0019 [0.0000, 0.0044]
+laion_pop_ours_gemma4	camera	1402	0.9800 [0.9726, 0.9869]	0.0178 [0.0114, 0.0247]	0.0021 [0.0000, 0.0048]
+laion_pop_ours_gemma4	lighting	2120	0.9637 [0.9545, 0.9722]	0.0165 [0.0105, 0.0231]	0.0198 [0.0139, 0.0262]
+laion_pop_ours_gemma4	count	1882	0.9400 [0.9286, 0.9509]	0.0367 [0.0280, 0.0459]	0.0234 [0.0168, 0.0307]
+laion_pop_ours_gemma4	text_rendering	524	0.9218 [0.8917, 0.9460]	0.0305 [0.0157, 0.0483]	0.0477 [0.0275, 0.0720]
+laion_pop_ref_llama32_11b_gemma4	object	17499	0.9395 [0.9354, 0.9436]	0.0301 [0.0272, 0.0332]	0.0303 [0.0276, 0.0331]
+laion_pop_ref_llama32_11b_gemma4	attribute	24300	0.9122 [0.9081, 0.9160]	0.0448 [0.0417, 0.0478]	0.0430 [0.0405, 0.0458]
+laion_pop_ref_llama32_11b_gemma4	relation	9388	0.8932 [0.8863, 0.8999]	0.0775 [0.0715, 0.0836]	0.0293 [0.0260, 0.0327]
+laion_pop_ref_llama32_11b_gemma4	style	2803	0.9818 [0.9766, 0.9868]	0.0082 [0.0050, 0.0117]	0.0100 [0.0066, 0.0137]
+laion_pop_ref_llama32_11b_gemma4	camera	1689	0.9580 [0.9480, 0.9678]	0.0379 [0.0284, 0.0473]	0.0041 [0.0013, 0.0072]
+laion_pop_ref_llama32_11b_gemma4	lighting	577	0.9428 [0.9221, 0.9619]	0.0260 [0.0130, 0.0405]	0.0312 [0.0180, 0.0466]
+laion_pop_ref_llama32_11b_gemma4	count	1672	0.8038 [0.7818, 0.8253]	0.1334 [0.1147, 0.1523]	0.0628 [0.0507, 0.0765]
+laion_pop_ref_llama32_11b_gemma4	text_rendering	602	0.7442 [0.7048, 0.7833]	0.1777 [0.1421, 0.2118]	0.0781 [0.0539, 0.1028]
+pd12m_ours_gemma4	object	19969	0.9584 [0.9552, 0.9617]	0.0146 [0.0126, 0.0166]	0.0269 [0.0244, 0.0294]
+pd12m_ours_gemma4	attribute	32324	0.9331 [0.9302, 0.9359]	0.0139 [0.0124, 0.0153]	0.0531 [0.0505, 0.0556]
+pd12m_ours_gemma4	relation	12392	0.9547 [0.9506, 0.9587]	0.0249 [0.0219, 0.0280]	0.0204 [0.0179, 0.0231]
+pd12m_ours_gemma4	style	1524	0.9862 [0.9797, 0.9920]	0.0072 [0.0032, 0.0124]	0.0066 [0.0027, 0.0106]
+pd12m_ours_gemma4	camera	1082	0.9852 [0.9784, 0.9919]	0.0092 [0.0038, 0.0150]	0.0055 [0.0018, 0.0104]
+pd12m_ours_gemma4	lighting	1452	0.9545 [0.9435, 0.9656]	0.0145 [0.0090, 0.0207]	0.0310 [0.0223, 0.0402]
+pd12m_ours_gemma4	count	2571	0.9238 [0.9131, 0.9345]	0.0370 [0.0294, 0.0448]	0.0393 [0.0316, 0.0475]
+pd12m_ours_gemma4	text_rendering	912	0.8235 [0.7947, 0.8522]	0.0603 [0.0418, 0.0793]	0.1162 [0.0935, 0.1397]
+pd12m_ref_gemma4	object	21867	0.9249 [0.9211, 0.9290]	0.0486 [0.0452, 0.0517]	0.0265 [0.0244, 0.0288]
+pd12m_ref_gemma4	attribute	10053	0.8173 [0.8081, 0.8263]	0.1115 [0.1042, 0.1190]	0.0712 [0.0659, 0.0763]
+pd12m_ref_gemma4	relation	11840	0.8789 [0.8722, 0.8860]	0.0963 [0.0899, 0.1023]	0.0248 [0.0219, 0.0275]
+pd12m_ref_gemma4	style	2000	0.9705 [0.9620, 0.9784]	0.0080 [0.0039, 0.0125]	0.0215 [0.0147, 0.0286]
+pd12m_ref_gemma4	camera	529	0.9905 [0.9815, 0.9981]	0.0095 [0.0019, 0.0185]	0.0000 [0.0000, 0.0000]
+pd12m_ref_gemma4	lighting	151	0.7881 [0.7133, 0.8634]	0.1391 [0.0784, 0.2025]	0.0728 [0.0310, 0.1172]
+pd12m_ref_gemma4	count	1235	0.8915 [0.8735, 0.9087]	0.0802 [0.0657, 0.0963]	0.0283 [0.0198, 0.0376]
+pd12m_ref_gemma4	text_rendering	995	0.7688 [0.7409, 0.7963]	0.1910 [0.1663, 0.2183]	0.0402 [0.0276, 0.0538]
+danbooru_ours_gemma4	object	18718	0.9327 [0.9284, 0.9372]	0.0317 [0.0288, 0.0346]	0.0356 [0.0323, 0.0390]
+danbooru_ours_gemma4	attribute	33800	0.9345 [0.9314, 0.9374]	0.0364 [0.0341, 0.0389]	0.0291 [0.0272, 0.0309]
+danbooru_ours_gemma4	relation	12258	0.9448 [0.9403, 0.9489]	0.0447 [0.0409, 0.0487]	0.0105 [0.0088, 0.0124]
+danbooru_ours_gemma4	style	1846	0.9740 [0.9666, 0.9811]	0.0081 [0.0043, 0.0126]	0.0179 [0.0122, 0.0241]
+danbooru_ours_gemma4	camera	1112	0.9820 [0.9739, 0.9893]	0.0153 [0.0088, 0.0225]	0.0027 [0.0000, 0.0062]
+danbooru_ours_gemma4	lighting	572	0.9493 [0.9296, 0.9668]	0.0297 [0.0169, 0.0441]	0.0210 [0.0092, 0.0345]
+danbooru_ours_gemma4	count	977	0.9427 [0.9269, 0.9575]	0.0450 [0.0316, 0.0598]	0.0123 [0.0059, 0.0202]
+danbooru_ours_gemma4	text_rendering	144	0.8472 [0.7883, 0.9057]	0.0833 [0.0417, 0.1284]	0.0694 [0.0294, 0.1149]
+danbooru_ref_florence2_gemma4	object	15099	0.8533 [0.8464, 0.8601]	0.1270 [0.1204, 0.1336]	0.0197 [0.0175, 0.0221]
+danbooru_ref_florence2_gemma4	attribute	12265	0.6969 [0.6872, 0.7069]	0.2238 [0.2152, 0.2323]	0.0792 [0.0745, 0.0841]
+danbooru_ref_florence2_gemma4	relation	7781	0.7529 [0.7421, 0.7634]	0.2223 [0.2125, 0.2323]	0.0248 [0.0213, 0.0283]
+danbooru_ref_florence2_gemma4	style	2893	0.9634 [0.9560, 0.9708]	0.0221 [0.0164, 0.0283]	0.0145 [0.0099, 0.0195]
+danbooru_ref_florence2_gemma4	camera	29	1.0000 [1.0000, 1.0000]	0.0000 [0.0000, 0.0000]	0.0000 [0.0000, 0.0000]
+danbooru_ref_florence2_gemma4	lighting	164	0.8049 [0.7405, 0.8647]	0.1402 [0.0875, 0.1986]	0.0549 [0.0237, 0.0909]
+danbooru_ref_florence2_gemma4	count	946	0.9387 [0.9228, 0.9540]	0.0571 [0.0423, 0.0722]	0.0042 [0.0010, 0.0088]
+danbooru_ref_florence2_gemma4	text_rendering	1469	0.5875 [0.5622, 0.6122]	0.3642 [0.3403, 0.3886]	0.0483 [0.0378, 0.0598]
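
Note on reading this TSV: every statistic is packed into one cell as `mean [low, high]`, and a few rows have very small `visual_units` (for example `camera` with 29 units on `danbooru_ref_florence2_gemma4`, reported as `1.0000 [1.0000, 1.0000]`). A zero-width interval like that reflects a sample with no variation, not certainty about the underlying rate. The sketch below is a hedged way to parse these cells and flag such rows; `parse_ci`, `parse_row`, and `is_degenerate` are hypothetical names and are not part of the release scripts.

```python
import re

# Hypothetical parser (not from the release scripts) for cells of the form
# "0.9643 [0.9610, 0.9675]" and for whole rows of the category TSV.
CI_CELL = re.compile(r"^\s*([0-9.]+) \[([0-9.]+), ([0-9.]+)\]\s*$")


def parse_ci(cell: str) -> tuple[float, float, float]:
    """Return (mean, low, high) from a 'mean [low, high]' cell."""
    match = CI_CELL.match(cell)
    if match is None:
        raise ValueError(f"not a CI cell: {cell!r}")
    mean, low, high = (float(g) for g in match.groups())
    return mean, low, high


def parse_row(line: str) -> dict:
    """Parse one data row; a leading diff '+' is ignored."""
    surface, category, units, prec, unsup, unc = (
        line.lstrip("+").rstrip("\n").split("\t")
    )
    return {
        "surface": surface,
        "category": category,
        "visual_units": int(units),
        "grounded_precision": parse_ci(prec),
        "unsupported_rate": parse_ci(unsup),
        "uncertain_rate": parse_ci(unc),
    }


def is_degenerate(ci: tuple[float, float, float]) -> bool:
    """True when the interval has zero width (low == high)."""
    _mean, low, high = ci
    return low == high


row = parse_row(
    "+danbooru_ref_florence2_gemma4\tcamera\t29\t"
    "1.0000 [1.0000, 1.0000]\t0.0000 [0.0000, 0.0000]\t0.0000 [0.0000, 0.0000]"
)
print(row["visual_units"], is_degenerate(row["grounded_precision"]))
```

Filtering on `is_degenerate` together with a minimum `visual_units` threshold is one possible way to keep small-sample categories out of headline comparisons; any such threshold would be a reader's choice, not something the release specifies.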