mbochniak01 Claude Sonnet 4.6 committed on
Commit ·
5935cf6
1
Parent(s): e77a2f2
Fix Vectara label check and input format
Browse files

Two bugs:
1. Label check used 'in' so "inconsistent" matched "consistent" — always returned
raw score regardless of label. Fixed with startswith("factually consistent").
2. text_pair dict format causes T5 tokenizer to encode pairs incorrectly. Switched
to single concatenated string: "{chunk} {response}". T5 handles one sequence
cleanly; text_pair semantics are BERT-specific.
Also added debug logging of raw Vectara output for future calibration.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Dockerfile +2 -1
- backend/grader.py +6 -4
Dockerfile
CHANGED
|
@@ -16,7 +16,8 @@ from sentence_transformers import SentenceTransformer; \
|
|
| 16 |
from transformers import T5Tokenizer, pipeline; \
|
| 17 |
SentenceTransformer('all-MiniLM-L6-v2'); \
|
| 18 |
tok = T5Tokenizer.from_pretrained('t5-small'); \
|
| 19 |
-
pipeline('text-classification', model='vectara/hallucination_evaluation_model', tokenizer=tok, trust_remote_code=True)
|
|
|
|
| 20 |
|
| 21 |
COPY knowledge/ ./knowledge/
|
| 22 |
COPY backend/ ./backend/
|
|
|
|
| 16 |
from transformers import T5Tokenizer, pipeline; \
|
| 17 |
SentenceTransformer('all-MiniLM-L6-v2'); \
|
| 18 |
tok = T5Tokenizer.from_pretrained('t5-small'); \
|
| 19 |
+
pipe = pipeline('text-classification', model='vectara/hallucination_evaluation_model', tokenizer=tok, trust_remote_code=True); \
|
| 20 |
+
pipe(['test document test response'])"
|
| 21 |
|
| 22 |
COPY knowledge/ ./knowledge/
|
| 23 |
COPY backend/ ./backend/
|
backend/grader.py
CHANGED
|
@@ -163,11 +163,13 @@ def grade_faithfulness(response: str, context: str) -> GradeResult:
|
|
| 163 |
if not raw_chunks:
|
| 164 |
return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
|
| 165 |
chunks = [_strip_chunk_title(c) for c in raw_chunks]
|
| 166 |
-
# Vectara HHEM v2:
|
| 167 |
-
|
| 168 |
-
|
|
|
|
|
|
|
| 169 |
scores = [
|
| 170 |
-
r["score"] if
|
| 171 |
for r in results
|
| 172 |
]
|
| 173 |
score = float(max(scores))
|
|
|
|
| 163 |
if not raw_chunks:
|
| 164 |
return GradeResult(metric="faithfulness", passed=False, score=0.0, detail="No context")
|
| 165 |
chunks = [_strip_chunk_title(c) for c in raw_chunks]
|
| 166 |
+
# Vectara HHEM v2: single concatenated string avoids T5 text_pair encoding issues.
|
| 167 |
+
# Label "Factually Consistent" = faithful; use startswith to avoid "inconsistent" false match.
|
| 168 |
+
inputs = [f"{chunk} {response}" for chunk in chunks]
|
| 169 |
+
results = model(inputs)
|
| 170 |
+
log.debug("Vectara raw results: %s", results)
|
| 171 |
scores = [
|
| 172 |
+
r["score"] if r["label"].lower().startswith("factually consistent") else 1.0 - r["score"]
|
| 173 |
for r in results
|
| 174 |
]
|
| 175 |
score = float(max(scores))
|