
State of Project

What works

  • Question-first pipeline with JSONL chunk source by default; deterministic chunk IDs and JSONL caches for questions, generations, verifications, and rewards.
  • Providers: Ollama/OpenAI/HTTP plus mock provider; mock pathway enables full pipeline tests without GPUs or ES.
  • Verifier parsing tolerates distributor format (SCORE as number or PASS/FAIL with noisy prefixes); caching and retry logic in place.
  • Tests: 42 passing (retrieval mock/real, generator, verifier, reward, pipeline behaviour, cache, full mock pipeline).
  • CLI defaults: verbose on, question-first, JSONL chunks; chunk/question limits respected.
  • Generator parsing/logging: preserves the provider's structured .thinking field and the parsed thought/answer/confidence/evidence/limitations; verbose mode prints both parsed and raw output (pretty-printed JSON if parsable). Gold records store all generator fields.
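
To illustrate why deterministic chunk IDs matter for the JSONL caches listed above, here is a minimal sketch. It is not the pipeline's actual code: the helper names `chunkId` and `loadCache`, the `id` field, the SHA-256 choice, and the 16-character truncation are all assumptions made for the example.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: derive a stable ID from a chunk's source name and text.
// Whitespace is normalized so trivial reformatting does not change the ID.
function chunkId(source, text) {
  const normalized = text.replace(/\s+/g, ' ').trim();
  return createHash('sha256')
    .update(source)
    .update('\0') // separator so ("ab", "c") and ("a", "bc") hash differently
    .update(normalized)
    .digest('hex')
    .slice(0, 16);
}

// Hypothetical loader: one JSON object per line, keyed by an assumed "id" field.
// Later lines overwrite earlier ones with the same id.
function loadCache(jsonlText) {
  const cache = new Map();
  for (const line of jsonlText.split('\n')) {
    if (!line.trim()) continue; // tolerate blank lines
    const record = JSON.parse(line);
    cache.set(record.id, record);
  }
  return cache;
}
```

Because the same source and text always produce the same ID, a later run can check `cache.has(chunkId(source, text))` and skip work that is already cached.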

What needs attention

  • The real pipeline currently fails at question generation when Ollama or the question model is unreachable; a run requires a live Ollama instance with the specified model pulled.
  • Reward was still mocked in the most recent run; swap to a real reward provider/model when available.
  • Verifier prompt must stay distributor-provided; parsing is tolerant but malformed outputs still log raw text in verbose mode.
  • A Node deprecation warning for the punycode module appears during tests; benign but noisy.
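
The tolerant verifier parsing mentioned above (SCORE as a number or PASS/FAIL, possibly with noisy prefixes) could look roughly like the sketch below. The distributor's exact output format is not shown in this document, so the regular expression, the function name `parseVerifierScore`, and the returned object shape are assumptions.

```javascript
// Hypothetical parser: finds "SCORE" followed by any punctuation or whitespace,
// then either PASS/FAIL or a number. Anything before "SCORE" is ignored.
function parseVerifierScore(raw) {
  const match = /SCORE[^A-Za-z0-9-]*(PASS|FAIL|-?\d+(?:\.\d+)?)/i.exec(raw);
  if (!match) return { ok: false, raw }; // caller can log the raw text
  const token = match[1].toUpperCase();
  if (token === 'PASS') return { ok: true, pass: true, score: null };
  if (token === 'FAIL') return { ok: true, pass: false, score: null };
  // Numeric score: no pass/fail threshold is assumed here.
  return { ok: true, pass: null, score: Number(token) };
}
```

Returning the raw text on failure keeps the existing behaviour of surfacing malformed outputs in verbose mode, while leaving the decision about retries to the caller.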

Risks

  • Long generator outputs can inflate verifier context; may need truncation or smaller verifier model to avoid context overruns.
  • Cache growth: JSONL caches can grow large; add rotation/compaction if running many cycles.
  • ES mode, if used, fetches 100 chunks by default; confirm chunk limits when switching from JSONL to ES.
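
Two of the mitigations named above, truncating long generator output before verification and compacting JSONL caches, can be sketched as follows. Both functions are hypothetical: the character budget is only a rough proxy for tokens, the 70/30 head/tail split is arbitrary, and the `id` key field is an assumption about the cache records.

```javascript
// Hypothetical truncation: keep the head and tail of the text within a
// character budget, joined by a visible marker.
function truncateForVerifier(text, maxChars = 4000) {
  if (text.length <= maxChars) return text;
  const marker = '\n[... truncated ...]\n';
  const keep = maxChars - marker.length;
  if (keep <= 0) return text.slice(0, maxChars);
  const head = Math.ceil(keep * 0.7);
  const tail = keep - head;
  // slice(-0) would return the whole string, so guard the zero case.
  return text.slice(0, head) + marker + (tail > 0 ? text.slice(-tail) : '');
}

// Hypothetical compaction: keep only the last line seen for each key.
// Unparsable lines and non-object records are dropped.
function compactJsonl(jsonlText, keyField = 'id') {
  const latest = new Map();
  for (const line of jsonlText.split('\n')) {
    if (!line.trim()) continue;
    let record;
    try {
      record = JSON.parse(line);
    } catch {
      continue; // skip corrupt lines
    }
    if (record === null || typeof record !== 'object') continue;
    latest.set(record[keyField], line);
  }
  return [...latest.values()].join('\n') + (latest.size ? '\n' : '');
}
```

Keeping the tail as well as the head is a design choice: generator outputs often end with the answer and confidence fields, which the verifier is likely to need.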

Next steps (suggested)

  • Pull and start question/answer/verifier/reward models on Ollama (or configure OpenAI/HTTP) and re-run a small batch (--limit/--chunk-limit) to validate end-to-end with real models.
  • Add optional verifier retry when JSON parse fails (1 retry) and cap logged transcripts to reduce noise in verbose runs.
  • Consider a cache inspection/cleanup script for data/cache/*.jsonl.
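
The suggested single verifier retry with capped logging might be structured as in this sketch. It assumes a hypothetical async `callVerifier(prompt)` function that returns the raw model text; the names `verifyWithRetry` and `capForLog`, the 500-character cap, and the result shape are illustrative only.

```javascript
// Hypothetical log cap: shorten long transcripts and note how much was cut.
function capForLog(text, maxChars = 500) {
  if (text.length <= maxChars) return text;
  return `${text.slice(0, maxChars)}... [${text.length - maxChars} more chars]`;
}

// Hypothetical wrapper: call the verifier, try to parse, and retry up to
// `retries` extra times (default 1) when parsing throws.
async function verifyWithRetry(callVerifier, prompt, options = {}) {
  const { retries = 1, parse = JSON.parse, log = console.warn } = options;
  let lastRaw = '';
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    lastRaw = await callVerifier(prompt);
    try {
      return { ok: true, value: parse(lastRaw), attempts: attempt + 1 };
    } catch (err) {
      log(`verifier parse failed (attempt ${attempt + 1}): ${capForLog(lastRaw)}`);
    }
  }
  return { ok: false, raw: lastRaw, attempts: retries + 1 };
}
```

Passing `parse` as an option lets the same wrapper be used with a non-JSON parser, and the returned `raw` text on failure leaves the final logging decision to the caller.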