# From Noisy Support Data to Reliable AI Workflow: A Forward-Deployed Simulation
## Problem
Enterprise support organizations generate thousands of tickets, emails, and chats daily. This text is noisy, multilingual, and manually classified, producing inconsistent labels, lagging reports, and no real-time visibility into churn drivers. The gap between raw text and operational decisions is structural, not staffing.
## Why AI fits
Extracting root cause, sentiment, risk, and next actions from unstructured text is high-volume, repetitive, and tolerant of human-in-the-loop correction. AI can structure; humans should decide. The critical choice is drawing that line before writing code.
## What I built
An end-to-end extraction pipeline: 40 real case bundles (drawn from two public HuggingFace datasets) flow through LLM structuring (forced JSON schema), post-validation, a 7-rule confidence/risk gate, and dual-write storage (SQLite + JSONL audit trail). A Streamlit dashboard provides case-level inspection and aggregate reliability views. Every extraction includes verbatim evidence quotes. Every gate decision records machine-readable reason codes.
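The gate and dual-write stages can be sketched as follows. This is a minimal illustration, not the actual implementation: the rule names, thresholds, table schema, and the `gate`/`dual_write` function names are all assumptions (the memo does not list its 7 rules), but the shape matches the described design of machine-readable reason codes plus SQLite-and-JSONL dual writes.

```python
import json
import sqlite3

# Hypothetical gate rules -- names and thresholds are illustrative,
# not the memo's actual 7-rule set.
GATE_RULES = [
    ("low_confidence", lambda c: c["confidence"] < 0.75),
    ("missing_evidence", lambda c: not c.get("evidence_quotes")),
    ("high_risk", lambda c: c.get("risk") == "high"),
]

def gate(case: dict) -> dict:
    """Route a case, recording machine-readable reason codes for the decision."""
    reasons = [name for name, rule in GATE_RULES if rule(case)]
    return {"route": "review" if reasons else "auto", "reasons": reasons}

def dual_write(case: dict, decision: dict,
               db: sqlite3.Connection, audit_path: str) -> None:
    """Dual-write: a structured row to SQLite plus the full record
    appended to a JSONL audit trail."""
    db.execute(
        "INSERT INTO cases (case_id, route, reasons) VALUES (?, ?, ?)",
        (case["case_id"], decision["route"], ",".join(decision["reasons"])),
    )
    db.commit()
    with open(audit_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"case": case, "decision": decision}) + "\n")
```

The JSONL file preserves the full record verbatim for audit, while SQLite holds only the queryable columns the dashboard needs.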
## What evaluation revealed
On a 10-case diverse batch (English, German, dialogues, 7–99 words): 100% schema pass rate, 100% evidence coverage, 97.3% evidence grounding. The gate correctly routed 8/10 cases to review. But confidence calibration had a gap: two short inputs (8 and 14 words) received 0.90 confidence despite insufficient context.
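Because every extraction carries verbatim evidence quotes, the grounding metric reduces to a substring check. A minimal sketch, assuming the metric is the fraction of quotes found verbatim in the source text (the function name and exact definition are mine, not the memo's):

```python
def grounding_rate(source_text: str, quotes: list[str]) -> float:
    """Fraction of evidence quotes that appear verbatim in the source text.
    Hypothetical reconstruction of the evidence-grounding metric."""
    if not quotes:
        return 0.0
    hits = sum(1 for q in quotes if q in source_text)
    return hits / len(quotes)
```

A hallucinated quote that never occurred in the ticket immediately lowers the rate, which is what makes grounding a useful reliability signal.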
## One iteration
I added a single prompt rule: "If the case text is very short (under ~30 words), cap confidence at 0.7." The overconfident cases dropped from 0.90 to 0.60–0.70. Long inputs were unaffected (confidence held at 0.90). One line of prompt text. Zero code changes. Measurable improvement.
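A prompt rule is only as reliable as the model's adherence to it, so a post-hoc regression check is a natural complement. A minimal sketch (the function name, word limit, and cap parameters mirror the prompt rule but are my own framing, not code from the project):

```python
def violates_short_input_cap(case_text: str, confidence: float,
                             word_limit: int = 30, cap: float = 0.7) -> bool:
    """Regression check: flag extractions where a short input
    (fewer than word_limit words) still carries confidence above the cap.
    Hypothetical companion check to the prompt-level rule."""
    return len(case_text.split()) < word_limit and confidence > cap
```

Run over each batch, this turns the prompt rule into a measurable invariant: any flagged case means the model ignored the cap.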
## System insight
The value of a forward-deployed AI system is not in the model call; it is in the reliability layer around it. Evidence grounding, gate logic, failure mode detection, and audit trails are what make the difference between a demo and a deployable system. The model is a component; the system is the product.
## Production next steps
Gold-label annotation for precision/recall, parallel async extraction for latency, a feedback loop that retrains gate thresholds, and a controlled root-cause taxonomy to replace free-text L2 categories.