prisma-chatbot / ARCHITECTURE.md
RolandM's picture
Rewrite ARCHITECTURE.md as finished v1 document
9dad4c8

Architecture

Design decisions and rationale for prisma-chatbot. This document captures why the system is built the way it is, not just what it does. See README.md for a user-facing overview and ROADMAP.md for the deployment plan and milestones.

System overview

Each user turn flows through a single, linear path. The Gradio app (app.py) appends the user message to the running history held in gr.State, prepended with the dual-role system prompt built once at startup by src.prompt.build_system_prompt. The full history is handed to PrismaInferenceClient.generate, which calls Llama 3.3 70B Instruct via the Hugging Face Inference API with response_format={"type": "json_object"} and returns the raw JSON content. parse_model_output strips any stray markdown fences, validates the schema (string response, integer scores per attribute in [1, 7]), and returns a ParsedTurn. The app then appends the assistant response to the chat display, records the evaluation in state, and re-renders the impressions panel (colored bar cells) and the trajectory plot. There is exactly one model call per user turn.

Key design decisions

Dual-role prompt, single LLM call per turn

The model is asked to produce both the conversational reply and the evaluation in a single structured response. Two reasons drive this. The first is operational: one network call keeps per-turn latency and cost predictable, and avoids the engineering of stitching two calls together under partial failure. The second is semantic: the response and the evaluation are grounded in exactly the same context window, so the evaluation reflects the model's actual current impression rather than a separately-prompted second-pass judgment. The trade-off is that the prompt is more elaborate than two narrowly-scoped calls would be, and the JSON contract has to be enforced both at the API boundary (response_format) and in the parser. This is accepted because the research thesis is precisely that the model's response and its perception of the user are two facets of one act of interpretation, and collapsing them into a single call mirrors that framing.

Structured JSON output

The model returns a JSON object with a string response field and an evaluation object mapping each attribute name to an integer score in [1, 7]. This makes downstream rendering deterministic: the chat display reads response directly, and the impressions panel and trajectory plot iterate over evaluation without any natural-language parsing. JSON output is enforced both by the prompt and by passing response_format={"type": "json_object"} to the Inference API — belt-and-suspenders, because Llama 3.3 70B occasionally drifts toward conversational preamble before the JSON when relying on prompt instructions alone. Two defensive choices in parse_model_output deserve a note. Markdown code fences are stripped if present, because some model snapshots wrap structured output in ``` blocks despite instructions otherwise. And because bool is a subclass of int in Python, the validator rejects True/False explicitly — without that check, a model returning true for a score would silently pass type validation and be coerced to 1.

Six evaluation attributes

The v1 attribute set is competent, likeable, considerate, polite, formal, demanding — defined once in src/config.DEFAULT_ATTRIBUTES and consumed by the prompt builder, the parser, and the UI renderer. The set is intentionally a general-purpose one, chosen to produce meaningful variance across the diverse, unconstrained conversational inputs a public demo will receive. It is not identical to the attribute set used in the CMCL 2026 / EMNLP 2026 studies: the research-specific dimensions there (notably pedantic and well-prepared) are highly informative around precision-related stimuli but produce flat scores in casual chat, which would make the live demo look broken. Those research attributes, alongside other candidates (pushy, knowledgeable, helpful, arrogant, warm, evasive), are slated for the v2 user-selectable extended list. The demo's research framing is therefore methodological — surfacing the model's ongoing social perception of the user — rather than a literal replication of the paper's stimuli.

Llama 3.3 70B Instruct via HF Inference API

Hosted inference on Hugging Face was chosen over self-hosting and over proprietary APIs (OpenAI, Anthropic) for three reasons. First, the deployment surface is minimal: no GPU provisioning, no model serving, no separate auth domain — the Space and the model live on the same platform and a single HF_TOKEN covers both. Second, the audience that arrives via the research papers is already familiar with the Hugging Face platform and trusts it, which removes a friction point that a custom-hosted endpoint or a third-party key requirement would introduce. Third, a 70B-class instruct model is empirically the threshold at which the structured JSON contract holds reliably across varied conversational inputs and the dual-role persona is maintained without prompt drift; smaller open-weight models tend to break the schema or leak the evaluation rationale into the reply. The trade-off is dependency on HF endpoint availability and the (low) per-call rate limits applied to public Spaces, which the per-session turn cap (SESSION_TURN_CAP = 12) helps absorb.

Gradio on Hugging Face Spaces

Gradio is the lowest-friction path to a public, shareable artifact: a single app.py declares the UI, the deployment is git push, and the HF Space provides the public URL and HTTPS termination. It also integrates natively with the HF Inference API used for the model call. The cost is limited UI flexibility compared to a custom React frontend, which is accepted because the demo's value is in the interaction itself, not in bespoke visual design.

Module responsibilities

  • src/config.py — single source of truth for the v1 attribute set, the model ID, decoding parameters, the session turn cap, and the attribute-to-color mapping. Centralizing these means the prompt, the parser, and the UI cannot drift out of sync with each other.
  • src/prompt.py — builds the dual-role system prompt from the attribute list at module import time. Templated rather than hardcoded so that the v2 selectable attributes (see above) plug in without touching the inference layer.
  • src/inference.py — thin wrapper around huggingface_hub. InferenceClient. Forces response_format={"type": "json_object"} uniformly, distinguishes API errors (InferenceError) from parse errors (EvaluationParseError) so the app layer can react differently, and validates the response envelope before handing the content to the parser.
  • src/evaluation.py — parses the JSON, validates the schema against the expected attribute list, and formats scores for display. Owns the intensifier scale (1 → "not at all", ..., 7 → "extremely") that pairs verbal labels with numeric scores in the UI.
  • app.py — Gradio Blocks assembly, theme/CSS, event wiring, and the rendering of the impressions panel (HTML bar cells) and trajectory plot (matplotlib). On parse or inference failure the user's message is rolled back from state and surfaced as a gr.Warning toast rather than as a fake assistant turn, so retries send clean history.

Open design questions

  • Single-call vs. two-call architecture if the attribute set grows. At six attributes the dual-role prompt is comfortable; if v2 lets users pick from a longer extended list, JSON-output reliability and attention to the response itself may degrade enough to justify splitting into two calls.
  • Whether to expose decoding controls (temperature) as a user setting. Currently fixed at 0.7 in config. Exposing it would let visitors probe how stochasticity affects the evaluation, which aligns with the research framing — but adds a knob that most visitors won't understand.
  • Behavior on non-English input. The persona prompt is English and the attribute labels are English; the model handles other languages reasonably but the social perception it produces has not been validated outside English. The honest disposition for v1 is to accept it silently; a more careful v2 might detect and either surface a disclaimer or refuse.