{"experiment_id":"EXP-2026-02-24-INTERLEAVE-ONLY","status":"kept","area":"stage3_chunking","hypothesis":"Alternative chunk strategies (group_by_source, balance_quota) might improve selection quality over simple interleave.","setup":{"dataset":"caption_evident_n10","selection_mode":"chunked_map_union","n":10},"evaluation":"Compared interleave vs group_by_source vs balance_quota on same eval harness.","result_summary":"Differences were small and inconsistent; complexity increased and did not justify maintenance cost.","decision":"Keep interleave only; remove other strategies from mainline.","why_abandoned_or_kept":"Simplicity and maintainability dominated because quality deltas were not practically significant.","evidence_files":["data/eval_results/eval_caption_cogvlm_n10_seed42_20260224_200820.jsonl","data/eval_results/eval_caption_cogvlm_n10_seed42_20260224_200909.jsonl","data/eval_results/eval_caption_cogvlm_n10_seed42_20260224_200949.jsonl"],"notes":"If revisited, require >=0.02 absolute F1 gain on matched runs before reintroducing complexity."}
{"experiment_id":"EXP-2026-02-24-MODEL-SWAP-DEEPSEEK","status":"abandoned","area":"stage3_model","hypothesis":"A stronger/different model might reduce over-specific false positives and improve precision.","setup":{"models":["meta-llama/llama-3.1-8b-instruct","deepseek/deepseek-v3.2"],"dataset":"caption_evident_n10","n":10},"evaluation":"Ran matched evals via OpenRouter and compared precision/recall/F1 plus cost.","result_summary":"DeepSeek did not outperform Llama on this task while being materially more expensive per token.","decision":"Do not switch default model.","why_abandoned_or_kept":"No quality win and worse price/performance.","evidence_files":["data/eval_results/eval_caption_cogvlm_n10_seed42_20260224_200820.jsonl","data/eval_results/eval_caption_cogvlm_n10_seed42_20260224_204216.jsonl"],"notes":"Revisit only if prompt/task definition changes substantially or a cheaper model appears."}
{"experiment_id":"EXP-2026-02-24-STAGE3-PROMPT-CONSERVATIVE","status":"abandoned","area":"stage3_prompting","hypothesis":"Conservative prompt wording could sharply raise precision by discouraging speculative specifics.","setup":{"styles":["default","conservative_v1","conservative_v2"],"dataset":"caption_evident_n10","n":10},"evaluation":"A/B tested prompt variants with same retrieval and scoring pipeline.","result_summary":"Conservative variants did not produce a meaningful precision/F1 win versus default.","decision":"Keep default Stage 3 prompt style.","why_abandoned_or_kept":"Prompt-only rewording did not create the desired precision lever.","evidence_files":["data/eval_results/eval_caption_cogvlm_n10_seed42_20260224_205407.jsonl","data/eval_results/eval_caption_cogvlm_n10_seed42_20260224_205449.jsonl","data/eval_results/eval_caption_cogvlm_n10_seed42_20260224_205529.jsonl"],"notes":"Future prompting changes should be tested only with clearly different decision constraints."}
{"experiment_id":"EXP-2026-02-25-STAGE3-BOOTSTRAP-RERANK","status":"abandoned","area":"stage3_bootstrap","hypothesis":"Use high-precision anchor selections first, rerank via TF-IDF+FastText fusion, then run normal Stage 3 selection.","setup":{"implementation":"single-pass bootstrap (non-iterative)","anchors":"top-k-per-phrase interleaved, min_why explicit","rerank_formula":"(1-context_weight)*score_fasttext + context_weight*score_context","dataset":"caption_evident_n10"},"evaluation":"1) Controlled tests with skip_rewrite=true and tuning grid for k in {1,2,3}, context_weight in {0.3,0.5,0.7}. 2) Full end-to-end runs with structural inference + implication expansion.","result_summary":"Controlled skip_rewrite runs improved over a weak baseline, but full end-to-end impact was effectively flat/slightly negative (F1 delta about -0.0006) with substantial Stage3 latency increase (~+16s/sample).","decision":"Remove bootstrap code from mainline; do not keep as default path.","why_abandoned_or_kept":"Extra complexity and latency without measurable end-to-end gain on target evaluation.","evidence_files":["data/eval_results/eval_caption_cogvlm_n10_bootstrap_baseline_explicit_skiprewrite.jsonl","data/eval_results/eval_caption_cogvlm_n10_bootstrap_tune_k3_cw0p5.jsonl","data/eval_results/eval_caption_cogvlm_n10_e2e_structimp_baseline_default.jsonl","data/eval_results/eval_caption_cogvlm_n10_e2e_structimp_bootstrap_k3_cw0p5.jsonl"],"notes":"If reconsidered, require matched end-to-end improvement (not just skip_rewrite) and bounded latency overhead."}
{"experiment_id": "EXP-2026-03-01-STRUCTURAL-TAG-FAILURE-DRILLDOWN", "status": "active_investigation", "area": "stage3_structural", "hypothesis": "Systematic false positives for ambiguous_gender and looking_at_viewer are driven by structural prompt/policy mismatch more than missing glossary.", "setup": {"dataset": "caption_evident_n10", "run": "data/eval_results/eval_caption_cogvlm_n10_seed42_20260301_045007.jsonl", "focus_tags": ["ambiguous_gender", "looking_at_viewer"]}, "result_summary": "Both tags were false positives predominantly from structural source. Descriptions already exist in structural prompt; failures appear tied to decision policy and caption style mismatch.", "decision": "Hold broader changes; investigate these two tags first.", "why_abandoned_or_kept": "Need targeted fixes before acting on wider cleanup list.", "evidence_files": ["data/eval_results/eval_caption_cogvlm_n10_seed42_20260301_045007.jsonl", "data/eval_results/eval_caption_cogvlm_n10_seed42_20260301_045007_detail.jsonl"], "pinned_followups": ["Reduce structural over-selection globally (especially looking_at_viewer, ambiguous_gender, anthro).", "Audit explicit-overconfident Stage3 false positives in clothing/color/species variants.", "Investigate recurrent leaf misses: fur/hair/human/tail/claws family.", "Revisit implication expansion inflation after structural gating changes.", "Review probe set edge cases (<3 and low-support tags)."], "date": "2026-03-01"}
{"experiment_id": "EXP-2026-03-01-STRUCTURAL-DEFINITION-TUNING-V1", "status": "kept", "area": "stage3_structural", "hypothesis": "Definition-only rewrites in structural_tag_definitions.csv can reduce structural false positives without changing gating logic.", "setup": {"dataset": "caption_evident_n10", "method": "structural-only evaluation using llm_infer_structural_tags", "comparison": "baseline definitions vs tuned definitions", "runs_per_config": 3}, "evaluation": "Controlled A/B by swapping CSV definitions and running 3 repeated passes over the same 10 captions for each config.", "result_summary": "Tuned definitions improved average micro metrics: baseline P/R/F1=0.5306/0.8642/0.6571 vs tuned P/R/F1=0.5923/0.8889/0.7103.", "decision": "Keep tuned structural definitions; revert later micro-tweak that reduced average F1.", "why_abandoned_or_kept": "Description-only changes gave measurable quality lift; additional tweaks to looking_at_viewer/intersex/taur did not help in controlled comparison.", "evidence_files": ["data/structural_tag_definitions.csv", "data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident.jsonl"], "notes": "Persistent issues remain for looking_at_viewer, group/trio, taur/intersex false positives; likely requires non-definition controls if further reduction is needed.", "date": "2026-03-01"}