noteguard-agent / docs /user_guide.md
Chaeyoon
deploy: v0.2.0 Gold RAP structure
a08af1a
|
Raw
History Blame Contribute Delete
4.59 kB

User guide

Prerequisites

Installation

git clone https://github.com/chaeyoonyunakim/noteguard-agent.git
cd noteguard-agent

python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # macOS / Linux

pip install -e ".[dev]"

To also install the Streamlit demo or Superlinked retrieval:

pip install -e ".[demo]"       # Streamlit interactive demo
pip install -e ".[retrieval]"  # Superlinked NoteIndex (optional; not needed for the web UI)

Configuration

cp .env.example .env

Open .env and fill in your credentials:

GOOGLE_API_KEY=AIza...
TAVILY_API_KEY=tvly-...
LANGSMITH_API_KEY=ls__...
LANGSMITH_TRACING=true
# NOTEGUARD_MODEL=google_genai:gemini-2.5-flash   # optional override

Running the de-identification demo (no API keys needed)

python src/deid.py

Demonstrates the de-id core on a synthetic note β€” no network calls, no keys.

Running the interactive Streamlit demo (no API keys needed)

streamlit run streamlit_app.py

Lets you paste any text, click De-identify, and see the surrogate-token output alongside the vault contents and assert_clean result.

Downloading the dataset

python src/fetch_dataset.py

Downloads patients.csv, admissions.csv, and synthetic_clinical_notes.csv from the NHSEDataScience/synthetic_clinical_notes Hugging Face dataset into data/. Run once before starting the server to enable the note-picker and vault-based leakage metrics.

Running the clinician web UI (recommended for demos)

uvicorn app.api:app --reload --port 8000

Open http://localhost:8000.

  1. Click Load note (top-right) to open the note picker, or paste your own note.
  2. Click Generate (~20–30 s on first run; the model loads and de-identifies).
  3. Use the segmented toggle to switch views without re-calling the API:
    • Clinician view β€” the original note with every redacted identifier highlighted in red.
    • What the AI sees β€” the de-identified note; real identifiers are replaced by [TYPE_N] surrogate chips (e.g. [PERSON_1], [NHS_1]).
  4. The compact eDischarge card appears on the right, re-identified for the clinician.
  5. The trust panel below shows:
    • Re-id risk Β· model input β€” 0.0 % when the privacy guarantee holds; higher when leaks detected.
    • Identifiers removed β€” count of distinct tokens de-identified in this call.
    • Faithfulness β€” LLM-as-judge score (0–100 %), hidden when no retrieval context was available.
    • Grounded sources β€” number of distinct Tavily / NICE / NHS URLs cited by Gemini.
    • Leaked tokens β€” surrogate tokens that survived the model without being re-identified (should be empty).
  6. Click ← Edit note to reset and process a different note.

Running the LangGraph dev server

langgraph dev

Connect the Agent Chat UI and interact with the noteguard graph directly.

Running the LangSmith evaluations

python -m eval.run_eval

Needs LANGSMITH_API_KEY and LANGSMITH_TRACING=true. Runs two evaluators:

Evaluator Target What it measures
zero_phi_to_model 1.0 No known identifier appeared in any message seen by the model.
faithfulness 0.8+ Every clinical claim in the answer is supported by the source note.

Development

ruff check src agent app eval tests   # lint
ruff format src agent app eval tests  # format
pytest                                 # run the test suite
pytest --cov=src --cov-report=term    # with coverage

Loading a real patient vault

The de-id core can be seeded from the NHSEDataScience/synthetic_clinical_notes dataset (MIT licence, fully synthetic):

from src.deid import NoteGuard, load_known_from_csv

known = load_known_from_csv("data/patients.csv", "data/admissions.csv")
ng = NoteGuard(known=known)
result = ng.deidentify(raw_note)
ng.assert_clean(result.clean_text)

This builds the identifier vault from both structured tables β€” patient names and clinician names β€” so residual leakage is measured against ground truth.