Basitha committed on
Commit 7204280 · verified · 1 Parent(s): 29d1fd2

Upload validation_utils.py

Files changed (1)
  1. common/validation_utils.py +109 -0
common/validation_utils.py ADDED
@@ -0,0 +1,109 @@
+ import logging
+
+ def validate_response(question, answer, user_profile_str, fast_facts_str, interview_transcript_text, respondent_type, ai_evaluator_agent, processor_llm):
+     """
+     Validate a response (answer) to a question using the appropriate evaluation method (exploratory or fact-based).
+     Uses the LLM to determine whether the question is exploratory or fact-based, then applies the matching rating logic.
+     Returns True if the answer is valid (all relevant metrics >= 8/10), otherwise False.
+
+     ## Evaluation Criteria (0-10 Scores & Guidelines):
+     Ratings must be balanced and discriminative; do not default to high scores. Decimal ratings (e.g., 6.5, 8.2, 9.5) are allowed.
+     Always take into account what the question specifically asked for and any constraints it imposed.
+     If the prompt required a structured or limited response (e.g., numerical ratings, bullet points, or short reflections), do not penalize brevity or lack of elaboration; in these cases the format may naturally restrict how much nuance or personal detail can be included.
+     Instead, consider:
+     - Does the response feel authentic and realistic, given the format?
+     - Did the person make the most of what the prompt allowed, even in just a few words?
+     - Is the tone, reasoning, or phrasing consistent with what a real person might say in that situation?
+     Even short responses can show plausibility through subtle cues such as word choice, minor hedging, or a relatable reaction. Only deduct for lack of nuance if the prompt clearly left room for more depth and the answer failed to take advantage of it, especially when the limitations are a direct result of the questioning style.
+
+     Plausibility (Behavioral & Contextual Realism):
+     - Assesses how realistic and in-character the response feels given all available information, even if it is imagined.
+     - Focus on tone, lifestyle consistency, and alignment with demographic/cultural context.
+     - A response may be accurate yet implausible if it feels forced, overly polished, or culturally mismatched.
+     - Aligns with the user's known profile, preferences, and background.
+     - Shows realistic attitudes or concerns based on the persona.
+     - Avoids exaggeration or unlikely claims.
+     - Demonstrates a nuanced, thoughtful perspective.
+     - 10 = Fully realistic, natural, and authentic; matches what you'd expect from this user. Feels like a genuine, spontaneous reflection of the respondent. Matches expected behavior and tone precisely for the profile. No forced phrasing, exaggeration, or cultural mismatch.
+     - 9 = Highly plausible. Very realistic tone with appropriate behavioral details. Slight hints of polish or idealization, but nothing that disrupts the realism. Aligns well with the persona. The only minor limitation is that the reflection could include slightly more personal detail or specific examples to deepen authenticity.
+     - 8 = Strongly plausible. Mostly consistent with the respondent's profile and context. Minor moments of artificiality, vagueness, or slightly formal tone, but not enough to reduce credibility significantly.
+     - 7 = Generally plausible. The response aligns broadly with the respondent's background, but contains language that is slightly generic or lacks personal detail. May feel too polished or overly scripted.
+     - 4–6 = Questionable plausibility. Portions of the response feel off-tone, culturally inconsistent, or unnaturally phrased. The voice may deviate from expected norms for the persona (e.g., sounding too robotic or inauthentic).
+     - 1–3 = Implausible. The response feels unnatural, exaggerated, or stereotyped. Tone or behavior contradicts the expected norms for the respondent's profile. May reflect poor understanding of context.
+     - 0 = Entirely unrealistic or fabricated. The response is clearly out of character, culturally misaligned, or extremely artificial. Strong evidence that it was not written in the respondent's voice or context.
+     - If a score is 8 or higher, justify why it is not a 10.
+
+     Relevance:
+     - Focus on how directly and completely the response answers the specific question asked.
+     - A relevant answer should stay on-topic, address the core of the question, and avoid vague or generic filler.
+     - Consider whether the response engages with all parts of the question (e.g., both the "what" and the "why" if asked).
+     - Watch out for digressions, misunderstandings, or partial answers; these should lower the score.
+     - Covers all key points without going off-topic.
+     - Connects ideas clearly and logically.
+     - Includes considerations or barriers relevant to the context.
+     - 10 = Fully relevant and comprehensive. The response answers every aspect of the question in a direct, specific, and complete manner. No filler, no drift, no clarification needed.
+     - 9 = Extremely relevant. Responds to the core of the question clearly and thoughtfully, but may be slightly brief or omit a minor aspect. Still feels complete and on-point. If the only reason it falls short of a full mark is that it could have included a bit more detail, this is the appropriate score.
+     - 8 = Strongly relevant. Covers the main part of the question accurately, but may simplify or lightly gloss over a secondary part. No major content missing, but not fully comprehensive.
+     - 7 = Mostly relevant. Engages with the question but lacks specificity or detail. May partially skip a subpart of the question or rely too much on generalities.
+     - 4–6 = Partially relevant. Addresses only one component of the question or veers off-topic in parts. May misunderstand intent or provide vague, generic commentary.
+     - 1–3 = Barely relevant. The response is mostly off-topic, highly generic, or misinterprets the question. Gives the sense that the question was not understood or considered.
+     - 0 = Irrelevant. The answer does not address the question at all or responds to a completely different topic.
+     - If a score is 8 or higher, justify why it is not a 10.
+
+     Accuracy (Faithfulness to Profile & Transcript):
+     - Only include details that are clearly present in the user profile or transcript.
+     - Paraphrasing is allowed; responses do not need to match the source text word-for-word, and paraphrasing should not be penalized. However, all relevant details must be included; omitting information present in the source should be penalized.
+     - Use exact matches or clear paraphrases when justifying scores, and explain how any indirect clues from the transcript support the claim.
+     - If unsure whether the content is grounded, err on the side of penalizing it.
+     - If the user profile has missing values and the output explicitly states that the information is unavailable, this is considered accurate and faithful to the profile.
+     - If the profile lists multiple options (e.g., occupations, lifestages) and the output repeats them verbatim or as-is, it is considered accurate.
+       For example, if the profile lists the occupation as "IT Professional, Corporate Professional, Business Owner, Freelancer" and the output includes this exact list, it is accurate, even if it is unclear whether the respondent holds all roles simultaneously; the output is not inventing new information.
+       Do not penalize responses for repeating multiple profile-listed values unless the response adds or implies details that are not present.
+     - This repetition is not considered overgeneralization, even if it is unclear whether the user falls into all categories, because the information comes directly from the profile.
+     - Safe summarizations are also accurate as long as they do not introduce new, unsupported content or contradict the tone of the profile.
+     - 10 = Fully grounded in the profile or transcript. Every part of the answer is stated or clearly derivable from the profile or fast facts, with no additions.
+     - 9 = Extremely accurate. Almost entirely based on the profile or transcript. May include very minor, safe generalizations or syntheses that do not introduce error (e.g., grouping similar preferences). No contradictions or fabrications.
+     - 8 = Highly accurate. Mostly grounded in the profile. May contain a few inferred or generalized statements that are plausible and supported by context, but not explicitly stated. No factual errors or contradictions.
+     - 7 = Mostly accurate. Minor implied elements, assumptions, or harmless generalizations, but nothing clearly false.
+     - 4–6 = Includes inaccurate or unsupported claims. The answer mixes truth with unverifiable or invented content.
+     - 1–3 = Mostly fabricated. Few if any parts match known facts.
+     - 0 = Contradicts known facts or introduces major hallucinations.
+     """
+     # Use the LLM to determine the evaluation mode
+     llm_mode_prompt = f"""
+     You are an expert in market research interview analysis. Given the following question, determine if it is:
+     - Exploratory: subjective, open-ended, opinion-based, or reflective (e.g., feelings, motivations, preferences, aspirations, values, beliefs)
+     - Fact-based: objective, factual, or directly verifiable from the respondent's profile or transcript (e.g., age, location, occupation, education)
+     Respondent Type: {respondent_type}
+     Question: {question}
+     Output strictly in this format:
+     Evaluation Mode: <Exploratory or Fact-based>
+     """
+     response = processor_llm.invoke(llm_mode_prompt)
+     output = response.content.strip()
+     evaluation_mode = "exploratory"  # default when no parsable "Evaluation Mode:" line is found
+     for line in output.split("\n"):
+         if line.lower().startswith("evaluation mode:"):
+             val = line.split(":", 1)[1].strip().lower()
+             evaluation_mode = "factbased" if "fact" in val else "exploratory"
+     logging.info(f"LLM determined evaluation mode: {evaluation_mode}")
+     # Deferred import so this module can be imported without pulling in the chatbot module at load time
+     from .InteractiveInterviewChatbot import evaluate_response_exploratory, evaluate_response_factbased
+     if evaluation_mode == "exploratory":
+         evaluation_result = evaluate_response_exploratory(user_profile_str, fast_facts_str, interview_transcript_text, question, answer, ai_evaluator_agent, respondent_type)
+         plausibility = evaluation_result.get("plausibility_rating")
+         relevance = evaluation_result.get("relevance_rating")
+         logging.info(f"Exploratory evaluation result: {evaluation_result}")
+         if plausibility is not None and relevance is not None:
+             return float(plausibility) >= 8.0 and float(relevance) >= 8.0
+         return False
+     else:
+         evaluation_result = evaluate_response_factbased(user_profile_str, fast_facts_str, interview_transcript_text, question, answer, ai_evaluator_agent, respondent_type)
+         accuracy = evaluation_result.get("accuracy_rating")
+         logging.info(f"Fact-based evaluation result: {evaluation_result}")
+         if accuracy is not None:
+             return float(accuracy) >= 8.0
+         return False
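
The mode-detection step above relies on a simple prefix scan of the LLM's reply, falling back to "exploratory" whenever no parsable "Evaluation Mode:" line is found. A minimal standalone sketch of just that parsing logic, using hypothetical `FakeLLM`/`FakeResponse` stubs in place of the real `processor_llm` (these stubs are not part of the committed file):

```python
class FakeResponse:
    """Hypothetical stand-in for an LLM response object exposing .content."""
    def __init__(self, content):
        self.content = content

class FakeLLM:
    """Hypothetical stub for processor_llm: invoke() returns a canned reply."""
    def __init__(self, reply):
        self._reply = reply
    def invoke(self, prompt):
        return FakeResponse(self._reply)

def detect_evaluation_mode(llm, question):
    # Same parsing rule as validate_response: default to "exploratory",
    # switch to "factbased" only if the parsed value mentions "fact".
    output = llm.invoke(f"Question: {question}").content.strip()
    mode = "exploratory"
    for line in output.split("\n"):
        if line.lower().startswith("evaluation mode:"):
            val = line.split(":", 1)[1].strip().lower()
            mode = "factbased" if "fact" in val else "exploratory"
    return mode

print(detect_evaluation_mode(FakeLLM("Evaluation Mode: Fact-based"), "How old are you?"))   # factbased
print(detect_evaluation_mode(FakeLLM("Evaluation Mode: Exploratory"), "What motivates you?"))  # exploratory
print(detect_evaluation_mode(FakeLLM("no parsable line here"), "anything"))                 # exploratory
```

Because matching is case-insensitive and substring-based ("fact"), replies such as "evaluation mode: FACT-BASED" still resolve correctly, while any malformed reply degrades safely to the exploratory path.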