State of Project

What works

Question-first pipeline with JSONL chunk source by default; deterministic chunk IDs and JSONL caches for questions, generations, verifications, and rewards.
Providers: Ollama/OpenAI/HTTP plus mock provider; mock pathway enables full pipeline tests without GPUs or ES.
Verifier parsing tolerates distributor format (SCORE as number or PASS/FAIL with noisy prefixes); caching and retry logic in place.
Tests: 42 passing (retrieval mock/real, generator, verifier, reward, pipeline behaviour, cache, full mock pipeline).
CLI defaults: verbose on, question-first, JSONL chunks; chunk/question limits respected.
Generator parsing/logging: preserves the provider's structured .thinking field and the parsed thought/answer/confidence/evidence/limitations; verbose mode prints both parsed and raw output (raw is pretty-printed as JSON when parsable). Gold stores all generator fields.
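
The tolerant verifier parsing described above can be illustrated with a minimal Python sketch. This is not the project's parser: the function name parse_verifier_score, the accepted separators, and the mapping of PASS/FAIL to 1.0/0.0 are all assumptions made for illustration.

```python
import re
from typing import Optional

# Finds a SCORE field anywhere in the text, so leading noise is ignored.
# The value may be a non-negative number or the words PASS/FAIL.
_SCORE_RE = re.compile(
    r"\bSCORE\b\s*[:=]?\s*(PASS|FAIL|\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def parse_verifier_score(
    raw: str,
    pass_value: float = 1.0,  # assumed numeric value for PASS
    fail_value: float = 0.0,  # assumed numeric value for FAIL
) -> Optional[float]:
    """Return a numeric score, or None when no SCORE field is found."""
    match = _SCORE_RE.search(raw)
    if match is None:
        return None
    token = match.group(1).upper()
    if token == "PASS":
        return pass_value
    if token == "FAIL":
        return fail_value
    return float(token)
```

Returning None instead of raising lets the caller decide whether to log the raw text, retry, or skip the sample.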

Known issues

The real pipeline currently fails at question generation when Ollama or the question model is unreachable; a run requires a live Ollama instance with the specified model pulled.
Reward was still mocked in the most recent run; swap to a real reward provider/model when available.
Verifier prompt must stay distributor-provided; parsing is tolerant but malformed outputs still log raw text in verbose mode.
Deprecation warning from punycode (Node) shows during tests; benign but noisy.

Risks

Long generator outputs can inflate the verifier context; truncation or a smaller verifier model may be needed to avoid context overruns.
Cache growth: JSONL caches can grow large; add rotation/compaction if running many cycles.
ES mode, if used, defaults to fetching 100 chunks; confirm chunk limits when switching from JSONL to ES.
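
One possible approach to the cache-growth risk is compaction: rewrite a JSONL cache so that only the latest record per key survives. The sketch below is an illustration, not the project's code. The key field name "id" and the assumption that later lines supersede earlier ones are hypothetical and would need checking against the real cache schema.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Tuple


def compact_jsonl(path: Path, key_field: str = "id") -> Tuple[int, int]:
    """Rewrite a JSONL file keeping only the last record per key.

    Returns (records_read, records_kept). Blank lines are skipped;
    malformed lines raise so that no data is dropped silently.
    """
    latest: Dict[Any, dict] = {}
    records_read = 0
    with path.open("r", encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue
            record = json.loads(line)
            records_read += 1
            # A later record with the same key overwrites the earlier one.
            latest[record[key_field]] = record

    # Write to a temp file in the same directory, then swap it in with
    # os.replace so an interrupted run leaves the original intact.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as dst:
        for record in latest.values():
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    os.replace(tmp_name, path)
    return records_read, len(latest)
```

Compaction only pays off if the caches are append-only with repeated keys; if every line already has a unique key, size-based rotation would be the better fit.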

Next steps

Pull and start question/answer/verifier/reward models on Ollama (or configure OpenAI/HTTP) and re-run a small batch (--limit/--chunk-limit) to validate end-to-end with real models.
Add an optional single verifier retry when JSON parsing fails, and cap logged transcripts to reduce noise in verbose runs.
Consider a cache inspection/cleanup script for data/cache/*.jsonl.
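
The retry-and-cap item above could look roughly like the following Python sketch. The names verify_with_retry, cap_transcript, and the call_verifier callable are hypothetical, and the 2000-character cap is an arbitrary placeholder rather than a project setting.

```python
import json
from typing import Any, Callable


def cap_transcript(text: str, max_chars: int = 2000) -> str:
    """Shorten long transcripts for logging and note how much was cut."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return f"{text[:max_chars]}... [{omitted} chars omitted]"


def verify_with_retry(
    call_verifier: Callable[[], str],  # hypothetical: returns raw verifier text
    max_retries: int = 1,
    log: Callable[[str], None] = print,
) -> Any:
    """Call the verifier and parse its output as JSON.

    Re-invokes the verifier up to max_retries times when parsing fails.
    """
    attempts = max_retries + 1
    last_error = None
    for attempt in range(1, attempts + 1):
        raw = call_verifier()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            # Log a capped copy so verbose runs stay readable.
            log(f"verifier parse failed ({attempt}/{attempts}): {cap_transcript(raw)}")
    raise ValueError(f"verifier output was not valid JSON after {attempts} attempts") from last_error
```

Passing the verifier call in as a callable keeps the sketch independent of any particular provider, so the same wrapper would work with the mock pathway in tests.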