# AI Recruiting Assistant: Guide Book (Updated)
## 0) Overview
### What this tool does
This AI Recruiting Assistant is a **decision-support** system that helps recruiters and hiring managers:
* Extract **job requirements** from a job description (JD)
* Evaluate resumes against **verified requirements** using **evidence-based** matching
* Assess job-relevant **culture/working-style signals** using retrieved company documents
* Run **factuality checks** to detect ungrounded claims
* Run a **bias & fairness audit** across the JD, analyses, and the model’s final recommendation
### The problem it addresses
Recruiting teams often face three recurring issues when using AI:
1. **Hallucinated requirements**: LLMs may “invent” skills that are not explicitly required.
2. **Opaque scoring**: Many tools produce fit scores without clearly showing evidence.
3. **Bias risks**: Hiring language and reasoning can leak pedigree/class proxies or subjective criteria.
This tool addresses those issues by enforcing:
* **Deterministic verification gates** (requirements are verified before scoring)
* **Evidence-backed scoring** (only verified requirements are scored; each match includes a quote)
* **Self-verification and self-correction** (factuality checks can trigger automatic revision)
* **Bias auditing** (flags risky language and inconsistent standards)
### How it differentiates from typical recruiting tools
Compared with “black-box” resume screeners or generic LLM chatbots, this system emphasizes:
* **Transparency**: Outputs include *what was required*, *what was verified*, *what was dropped*, and *why*.
* **Auditability**: The scoring math is deterministic and traceable to inputs.
* **Self-verifying behavior**: Claims are checked against source text; unverified claims can be removed.
* **Bias checks by design**: Bias-sensitive content is audited explicitly instead of implicitly influencing scores.
* **Culture check that’s job-performance aligned**: Culture attributes are framed as job-relevant behaviors, not background proxies.
---
## 1) Inputs and Document Handling
### 1.1 What the user uploads
The tool operates on three inputs:
1. **Company culture / values documents** (PDF/DOCX)
2. **Resumes** (PDF/DOCX)
3. **Job description** (pasted text)
### 1.2 Resume anonymization
Before resumes are stored or analyzed, the tool applies heuristic redaction:
* Emails, phone numbers, URLs
* Addresses / location identifiers
* Explicit demographic fields
* Likely name header (first line)
This reduces exposure of personal identifiers and keeps analysis focused on job evidence.
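Below is a minimal sketch of this kind of heuristic redaction, assuming simple regular expressions. The patterns, the `[NAME]`/`[EMAIL]`/`[URL]`/`[PHONE]` placeholders and the function name `anonymize_resume` are hypothetical and do not come from the tool. Address and demographic-field redaction are omitted for brevity.

```python
import re

# Hypothetical patterns for illustration; not the tool's actual redaction rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# North American-style numbers only; date ranges such as "2019-2023" do not match.
PHONE_RE = re.compile(r"(?:\+\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def anonymize_resume(text: str) -> str:
    """Heuristically redact identifiers from resume text (sketch)."""
    lines = text.splitlines()
    # Treat the first non-empty line as a likely name header and mask it.
    for i, line in enumerate(lines):
        if line.strip():
            lines[i] = "[NAME]"
            break
    redacted = "\n".join(lines)
    redacted = EMAIL_RE.sub("[EMAIL]", redacted)
    # Redact URLs before phones so digits inside links are not partially masked.
    redacted = URL_RE.sub("[URL]", redacted)
    redacted = PHONE_RE.sub("[PHONE]", redacted)
    return redacted
```

Heuristics like these trade recall for precision. A stricter phone pattern avoids masking date ranges, but it will miss many international formats.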
### 1.3 Vector stores (retrieval)
The tool maintains two separate Chroma collections:
* **Resumes** (anonymized + chunked)
* **Culture docs** (chunked)
Chunking uses a recursive splitter with overlap to preserve context.
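The tool's splitter is recursive: it falls back to progressively smaller separators. The sketch below is deliberately simpler. It uses fixed-size character windows only to show how `chunk_size` and `chunk_overlap` interact. The function name and default values are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

For example, `chunk_text("abcdefghij", 4, 1)` yields `"abcd"`, `"defg"` and `"ghij"`. Each boundary character appears in two chunks, so a sentence cut at a boundary keeps some context.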
---
## 2) End-to-End Logic Flow (Step-by-Step)
Below is the stepwise flow executed when a recruiter clicks **Analyze Candidates**.
### Step 0: Prerequisite (documents exist in storage)
* Culture docs and resumes must be stored first.
* If not stored, retrieval will be empty or low-signal.
### Step 1: Extract required skills from the Job Description (LLM-driven)
**Goal:** Identify only skills that are explicitly required.
* The tool prompts the LLM to return **JSON only**:
  * `required_skills: [{skill, evidence_quote}]`
* The LLM is instructed to:
  * include only **MUST HAVE** / explicitly required skills
  * exclude “nice-to-haves” and implied skills
  * copy a short **verbatim quote** as evidence
**LLM role:** structured extraction.
**Failure behavior:** If JSON parsing fails, the tool stops and prints the raw output.
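Here is a sketch of the parse step and its documented failure behavior. It assumes the reply should be a bare JSON object with a `required_skills` list. The helper name and the shape check that discards malformed items are hypothetical additions.

```python
import json


def parse_required_skills(raw: str) -> list[dict]:
    """Parse the extractor's JSON reply; stop and surface the raw text on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        # Documented behavior: stop the run and show the raw LLM output.
        raise ValueError(f"Skill extraction did not return valid JSON:\n{raw}") from err
    items = data.get("required_skills", []) if isinstance(data, dict) else []
    # Keep only well-formed {skill, evidence_quote} string pairs (assumed shape check).
    return [
        {"skill": it["skill"].strip(), "evidence_quote": it["evidence_quote"].strip()}
        for it in items
        if isinstance(it, dict)
        and isinstance(it.get("skill"), str)
        and isinstance(it.get("evidence_quote"), str)
    ]
```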
### Step 2: Verify extracted skills against the JD (deterministic, Python)
**Goal:** Block hallucinated requirements from entering scoring.
Each extracted item is classified:
* **Quote-verified (strong):** the evidence quote appears verbatim in the JD
* **Name-only (weak):** the skill name appears in the JD, but the quote doesn’t match
* **Unverified (dropped):** neither quote nor name appears
**Deterministic gate:**
* Only **quote-verified** skills are used as the final required list for scoring.
* Name-only and dropped skills are reported for transparency.
**Output:** “Requirements Verification” section shows:
* extracted count
* quote-verified vs name-only vs dropped
* list of skills used for scoring
* list of retracted/dropped items (with reason)
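The three-way classification can be sketched as follows. The sketch assumes both comparisons are case-insensitive and whitespace-normalized, and that requirements are deduplicated by normalized skill name. The source only says quotes must appear "verbatim," so the real matching may be stricter.

```python
import re


def _norm(s: str) -> str:
    # Assumption: comparisons are case-insensitive with whitespace collapsed.
    return re.sub(r"\s+", " ", s).strip().lower()


def verify_requirements(extracted: list[dict], jd: str) -> dict:
    """Classify extracted skills as quote-verified, name-only, or dropped (sketch)."""
    jd_n = _norm(jd)
    result = {"quote_verified": [], "name_only": [], "dropped": []}
    seen = set()
    for item in extracted:
        skill = str(item.get("skill") or "").strip()
        key = _norm(skill)
        if not key or key in seen:
            continue  # skip blanks and duplicate requirements
        seen.add(key)
        quote = _norm(str(item.get("evidence_quote") or ""))
        if quote and quote in jd_n:
            result["quote_verified"].append(skill)  # strong: quote found in JD
        elif key in jd_n:
            result["name_only"].append(skill)  # weak: reported, not scored
        else:
            result["dropped"].append(
                {"skill": skill, "reason": "neither quote nor skill name found in JD"}
            )
    return result
```

The name-only check is a plain substring test, so a one-letter skill such as "R" would match almost any JD. A word-boundary regex would be stricter. In either case only `quote_verified` feeds scoring.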
### Step 3: Retrieve the most relevant culture chunks (deterministic retrieval)
**Goal:** Ground culture evaluation in actual company documents.
* The tool runs similarity search over culture docs using the JD as query.
* It selects the top **k** chunks (e.g., k=3).
**Deterministic component:** vector retrieval parameters.
**Output artifact:** `culture_context` is the concatenated text of retrieved culture chunks.
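The retrieval itself runs in Chroma. The self-contained sketch below shows what "top-k by similarity, then concatenate" means without a vector database: it ranks pre-computed embeddings by cosine similarity. The metric, the separator and the helper names are assumptions, and a Chroma collection may be configured with a different distance function.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunks, k=3):
    """chunks: (text, embedding) pairs; return the k texts most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_culture_context(query_vec, culture_chunks, k=3, sep="\n\n"):
    """Concatenate the retrieved culture chunks into one culture_context string."""
    return sep.join(top_k_chunks(query_vec, culture_chunks, k))
```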
### Step 4: Generate job-performance culture attributes (LLM-driven)
**Goal:** Create a small set of job-relevant behavioral attributes to evaluate consistently.
* The tool prompts the LLM to return JSON:
  * `cultural_attributes: ["...", "..."]` (4–6 items)
**Attribute rules:**
* Must be job-performance aligned behaviors (e.g., “evidence-based decision making”).
* Must avoid pedigree / class / prestige language.
* Must avoid non-performance preferences (e.g., remote-first, time zone).
**LLM role:** label generation from retrieved culture context.
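In the source, these rules are prompt instructions. A deterministic post-check could be added on top, as in the sketch below. The blocklist terms, the helper name, and the choice to reject a bad list rather than repair it are all hypothetical.

```python
import json

# Hypothetical blocklist for illustration; the source applies these rules via the prompt.
DISALLOWED_TERMS = ("ivy league", "prestigious", "elite", "pedigree", "remote-first", "time zone")


def parse_cultural_attributes(raw: str, min_n: int = 4, max_n: int = 6) -> list[str]:
    """Parse the generated attributes and enforce count and wording rules (sketch)."""
    data = json.loads(raw)
    attrs = [
        a.strip()
        for a in data.get("cultural_attributes", [])
        if isinstance(a, str) and a.strip()
    ]
    flagged = [a for a in attrs if any(term in a.lower() for term in DISALLOWED_TERMS)]
    if flagged:
        raise ValueError(f"Attributes break wording rules: {flagged}")
    if not min_n <= len(attrs) <= max_n:
        raise ValueError(f"Expected {min_n}-{max_n} attributes, got {len(attrs)}")
    return attrs
```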
### Step 5: Retrieve top resume chunks for the JD (deterministic retrieval)
**Goal:** Identify the most relevant candidates and their relevant resume text.
* The tool runs similarity search over resumes using the JD.
* It retrieves top **k** chunks (e.g., k=10) and groups them by `resume_id`.
**Note:** Only retrieved chunks are analyzed. If relevant evidence isn’t retrieved, it may be missed.
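Grouping can be as simple as the sketch below. It assumes each retrieved hit carries its text plus a `resume_id` in its metadata; the dict shape is an assumption. A candidate whose chunks all fall outside the top k never appears, which is the limitation described in the note.

```python
from collections import defaultdict


def group_by_resume(hits: list[dict]) -> dict[str, str]:
    """Group retrieved chunks by resume_id, preserving retrieval order."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["metadata"]["resume_id"]].append(hit["text"])
    return {rid: "\n".join(texts) for rid, texts in grouped.items()}
```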
### Step 6: Culture evidence matching per candidate (LLM + deterministic cleanup + deterministic scoring)
**Goal:** Determine which culture attributes are supported by resume evidence.
**LLM-driven matching:**
* For each attribute, the LLM may return a match with:
  * `evidence_type`: `direct` or `inferred`
  * `evidence_quotes`: 1–2 verbatim resume quotes
  * `inference`: required for inferred
  * `confidence`: 1–5
**Deterministic cleanup rules (Python):**
A match is kept only if:
* attribute is present
* evidence_type is `direct` or `inferred`
* at least one non-trivial quote exists
* confidence is an integer 1–5
* inferred matches include an inference sentence
* inferred matches can be required to meet a minimum confidence
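The cleanup gates translate naturally into a filter. This sketch reads "attribute is present" as "the attribute is one of the generated attributes". The minimum quote length (15 characters) and the minimum inferred confidence (3) are hypothetical values.

```python
VALID_TYPES = {"direct", "inferred"}


def clean_culture_matches(matches, attributes, min_inferred_confidence=3, min_quote_chars=15):
    """Keep only LLM culture matches that pass every validity gate (sketch)."""
    kept = []
    for m in matches:
        etype = m.get("evidence_type")
        conf = m.get("confidence")
        quotes = [q.strip() for q in m.get("evidence_quotes", []) if isinstance(q, str)]
        if m.get("attribute") not in attributes or etype not in VALID_TYPES:
            continue  # unknown attribute or evidence type
        if not any(len(q) >= min_quote_chars for q in quotes):
            continue  # no non-trivial quote
        if isinstance(conf, bool) or not isinstance(conf, int) or not 1 <= conf <= 5:
            continue  # confidence must be an integer 1-5
        if etype == "inferred" and (
            not str(m.get("inference") or "").strip() or conf < min_inferred_confidence
        ):
            continue  # inferred matches need an inference sentence and enough confidence
        kept.append(m)
    return kept
```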
**Deterministic culture scoring (Python):**
* Direct evidence weight: **1.0**
* Inferred evidence weight: **0.5**
Culture score is computed as:
* `(sum(weights for matched attributes) / number_of_required_attributes) * 100`
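Here is a sketch of the weighted score. The source does not say how two matches for the same attribute are handled. This version assumes each attribute counts once, at its strongest evidence weight, so the score cannot exceed 100.

```python
WEIGHTS = {"direct": 1.0, "inferred": 0.5}


def culture_score(cleaned_matches, attributes) -> float:
    """Weighted share of attributes supported by evidence, as a percentage."""
    if not attributes:
        return 0.0
    best = {}
    for m in cleaned_matches:
        weight = WEIGHTS[m["evidence_type"]]
        # Count each attribute once, using its strongest evidence.
        best[m["attribute"]] = max(best.get(m["attribute"], 0.0), weight)
    return sum(best.values()) / len(attributes) * 100
```

As a worked example, take 5 attributes with 3 direct matches and 1 inferred match. The score is (3 × 1.0 + 1 × 0.5) / 5 × 100 = 70.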
### Step 7: Skills matching per candidate (LLM + deterministic scoring)
**Goal:** Match only the verified required skills to resume evidence.
**Inputs:**
* Candidate resume text (retrieved chunks)
* Verified required skills list (quote-only)
**LLM output (JSON):**
* `matched: [{skill, evidence_snippet}]`
* `missing: [skill]` (treated as advisory; missing is recomputed deterministically)
**Deterministic missing calculation (Python):**
* Missing = required_set − matched_set
**Deterministic skills scoring (Python):**
* `(number_of_matched_required_skills / number_of_required_skills) * 100`
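The deterministic part of skill matching can be sketched as follows. The sketch assumes case-insensitive skill-name comparison. It drops LLM matches that name a non-required skill or lack an evidence snippet, and it ignores the LLM's own `missing` list, as described above.

```python
def skills_analysis(required: list[str], llm_matched: list[dict]) -> dict:
    """Recompute matched/missing sets and the skills score from verified requirements."""
    by_key = {s.strip().lower(): s for s in required}  # normalized -> display name
    matched = {}
    for m in llm_matched:
        key = str(m.get("skill") or "").strip().lower()
        snippet = str(m.get("evidence_snippet") or "").strip()
        if key in by_key and snippet:  # ignore non-required skills and empty evidence
            matched[key] = snippet
    # Missing = required_set - matched_set, computed here rather than trusted from the LLM.
    missing = [name for key, name in by_key.items() if key not in matched]
    score = len(matched) / len(by_key) * 100 if by_key else 0.0
    return {
        "matched": {by_key[k]: v for k, v in matched.items()},
        "missing": missing,
        "score": score,
    }
```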
### Step 8: Implied competencies (NOT SCORED) for phone-screen guidance (LLM-driven, advisory)
**Goal:** When a required skill is not explicitly evidenced in the resume, suggest whether it may be **implied** by adjacent evidence.
* This step is **not scored** and does not affect the proceed/do-not-proceed decision.
* The LLM may suggest implied competencies only if it:
  * uses conservative language (“may be implied”)
  * includes **verbatim resume quotes**
  * provides a **phone-screen validation question**
**Hard guardrail:** Tool-specific skills (e.g., R/SAS/MATLAB) must be explicitly present in the resume to be suggested.
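Below is a sketch of the advisory filter. It assumes each implied item carries hypothetical `skill`, `resume_quotes` and `phone_screen_question` fields, and that the tool-like skill set is {R, SAS, MATLAB}. A whole-word match matters for short names such as "R".

```python
import re

# Hypothetical set of tool-like skills that need an explicit resume mention.
TOOL_SKILLS = {"r", "sas", "matlab"}


def mentions_explicitly(skill: str, resume_text: str) -> bool:
    """Whole-word, case-insensitive search so "R" does not match inside "Research"."""
    pattern = r"(?<!\w)" + re.escape(skill.strip()) + r"(?!\w)"
    return re.search(pattern, resume_text, flags=re.IGNORECASE) is not None


def filter_implied(items: list[dict], resume_text: str) -> list[dict]:
    """Keep implied suggestions that quote the resume verbatim and carry a question."""
    kept = []
    for it in items:
        quotes = [q for q in it.get("resume_quotes", []) if q and q in resume_text]
        if not quotes or not str(it.get("phone_screen_question") or "").strip():
            continue
        skill = str(it.get("skill", ""))
        if skill.strip().lower() in TOOL_SKILLS and not mentions_explicitly(skill, resume_text):
            continue  # hard guardrail for tool-specific skills
        kept.append(it)
    return kept
```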
### Step 9: Factuality verification (LLM-driven verifier)
**Goal:** Detect ungrounded evidence claims.
* The verifier checks evidence-backed match lines (e.g., `- Skill: snippet`).
* It ignores:
  * numeric score lines
  * missing lists
  * policy text
**Outputs:**
* verified claims (✓)
* unverified claims (✗)
* factuality score
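The verifier is an LLM, but two parts can be deterministic: selecting which lines it checks, and turning its ✓/✗ counts into a score. The line formats below are assumptions, and so is the formula (verified ÷ checked × 100); the source does not define the factuality score.

```python
import re
from typing import Optional

# Assumed line formats: evidence lines look like "- Skill: snippet";
# score lines contain something like "Score: 75".
CLAIM_LINE = re.compile(r"^\s*-\s*[^:]+:\s*\S")
SCORE_LINE = re.compile(r"score\s*[:=]\s*\d", re.IGNORECASE)


def extract_claim_lines(analysis_text: str) -> list[str]:
    """Select evidence-backed match lines for the verifier; skip scores and missing lists."""
    return [
        line.strip()
        for line in analysis_text.splitlines()
        if CLAIM_LINE.match(line)
        and not SCORE_LINE.search(line)
        and not line.strip().lower().startswith("- missing")
    ]


def factuality_score(verified: int, unverified: int) -> Optional[float]:
    """Share of checked claims that were verified, as a percentage (assumed formula)."""
    total = verified + unverified
    return verified / total * 100 if total else None  # None: nothing to verify
```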
### Step 10: Final recommendation (LLM, policy-constrained)
**Goal:** Produce a structured recommendation without changing scores.
* The model is given:
  * skills analysis
  * culture analysis
  * fixed computed scores
  * deterministic decision policy
**Decision policy:**
* If skills_score ≥ 70 → PROCEED
* If skills_score < 60 → DO NOT PROCEED
* If 60 ≤ skills_score < 70 → PROCEED only if culture_score ≥ 70, else DO NOT PROCEED
**Non-negotiables:**
* LLM must not re-score.
* LLM must not introduce new claims.
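The policy is fixed, so it can be evaluated in Python before the LLM writes any prose. The model then only explains a decision it cannot change. A minimal sketch follows; the function name is hypothetical.

```python
def decide(skills_score: float, culture_score: float) -> str:
    """Apply the fixed decision policy to precomputed scores."""
    if skills_score >= 70:
        return "PROCEED"
    if skills_score < 60:
        return "DO NOT PROCEED"
    # 60 <= skills_score < 70: the culture score decides.
    return "PROCEED" if culture_score >= 70 else "DO NOT PROCEED"
```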
### Step 11: Self-correction (triggered by verification issues)
**Goal:** Remove/correct any unverified claims while preserving scores/policy.
* If any unverified claims exist:
  * The tool asks the LLM to revise the recommendation
  * Only the flagged claims may be removed/corrected
  * Scores and policy must remain unchanged
### Step 12: Bias audit (LLM-driven audit across docs + reasoning)
**Goal:** Flag biased reasoning, biased JD language, or inconsistent standards.
**Audit scope includes:**
* Job description
* Skills analysis
* Culture analysis
* Final recommendation text
* Culture context
**What it flags (examples):**
* Prestige/pedigree signals (elite employers/education as proxy)
* Vague “polish/executive presence” language not tied to job requirements
* Non-job-related culture screening
* Inconsistent standards (penalizing requirements not in JD)
* Overclaiming certainty
**Outputs:**
* structured list of bias indicators (category, severity, trigger text, why it matters, recommended fix)
* recruiter guidance
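The sketch below is a typed container for the structured indicator list. The field names follow the artifact list in this guide. The top-level `bias_indicators` key and the low/medium/high severity scale are assumptions.

```python
import json
from dataclasses import dataclass, fields

SEVERITIES = {"low", "medium", "high"}  # assumed severity scale


@dataclass
class BiasIndicator:
    category: str
    severity: str
    trigger_text: str
    why_it_matters: str
    recommended_fix: str


def parse_bias_indicators(raw: str) -> list[BiasIndicator]:
    """Parse the audit's JSON into typed indicators, dropping malformed entries."""
    data = json.loads(raw)
    indicators = []
    for item in data.get("bias_indicators", []):
        values = {f.name: str(item.get(f.name) or "").strip() for f in fields(BiasIndicator)}
        ind = BiasIndicator(**values)
        if ind.severity.lower() in SEVERITIES and ind.trigger_text:
            indicators.append(ind)
    return indicators
```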
---
## 3) Scoring and Decision Rules (Deterministic)
### 3.1 Skills score
* Only quote-verified required skills count.
* Score = (matches / required) × 100.
### 3.2 Culture score
* Score = (weighted matches / attributes) × 100.
* Direct = 1.0; inferred = 0.5.
### 3.3 Labels
* score ≥ 70: Strong fit
* 50 ≤ score < 70: Moderate fit
* score < 50: Not a fit
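Scores are ratios, so they are often non-integers (9 of 13 skills is about 69.2). The bands are therefore easiest to apply as half-open intervals. A minimal sketch:

```python
def fit_label(score: float) -> str:
    """Map a 0-100 score to its fit label using half-open bands."""
    if score >= 70:
        return "Strong fit"
    if score >= 50:
        return "Moderate fit"
    return "Not a fit"
```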
### 3.4 Recommendation
Recommendation follows the fixed policy described in Step 10.
---
## 4) System Flow Diagram (Textual)
Below is a simplified, end-to-end flow of how data moves through the system.
```
   [User Uploads]
          |
          v
+-------------------+              +-----------+
| Culture Documents |              | Job Desc  |
+-------------------+              +-----------+
          |                              |
          v                              |
+-------------------+                    |
| Culture Vector DB |<-------------------+
+-------------------+                    |
  |                                      v
  |                           +---------------------+
  |                           |  Skill Extraction   |
  |                           | (LLM, JSON Output)  |
  |                           +---------------------+
  |                                      |
  |                                      v
  |                           +---------------------+
  |                           |     Requirement     |
  |                           |    Verification     |
  |                           |   (Deterministic)   |
  |                           +---------------------+
  |                                      |
  |                                      v
  |                          Verified Required Skills
  |                                      |
  |                                      v
  |  +-------------------+    +---------------------+
  |  | Resume Documents  |--->|  Resume Vector DB   |
  |  +-------------------+    +---------------------+
  |                                      |
  |                                      v
  |                          Similarity Search (k=10)
  |                                      |
  |                                      v
  |                           Resume Chunks (Grouped)
  |                                      |
  |                                      v
  |                      +-------------------------------+
  +--------------------->|  Culture Attribute Generator  |
                         |      (LLM, JSON Output)       |
                         +-------------------------------+
                                         |
                                         v
                         +-------------------------------+
                         |   Culture Evidence Matching   |
                         |    (LLM + Rules + Weights)    |
                         +-------------------------------+
                                         |
                                         v
                           Culture Score (Deterministic)
                                         |
                                         v
                         +-------------------------------+
                         |   Technical Skill Matching    |
                         | (LLM + Deterministic Scoring) |
                         +-------------------------------+
                                         |
                                         v
                           Skills Score (Deterministic)
                                         |
                                         v
                         +-------------------------------+
                         |  Implied Competencies (LLM)   |
                         |    (Not Scored, Advisory)     |
                         +-------------------------------+
                                         |
                                         v
                         +-------------------------------+
                         |    Factuality Verification    |
                         |        (LLM Verifier)         |
                         +-------------------------------+
                                         |
                                         v
                         +-------------------------------+
                         |   Recommendation Generator    |
                         |   (Policy-Constrained LLM)    |
                         +-------------------------------+
                                         |
                                         v
                         +-------------------------------+
                         |     Bias & Fairness Audit     |
                         |          (LLM Audit)          |
                         +-------------------------------+
                                         |
                                         v
                              Final Recruiter Report
```
---
## 5) Audit Artifacts and Traceability
For every analysis run, the system produces and retains multiple audit artifacts that enable post-hoc review, regulatory defensibility, and debugging.
### 5.1 Input Artifacts
1. **Original Job Description**
   * Full pasted JD text
2. **Sanitized Resume Text**
   * Redacted resume content
   * Redaction summary (internal)
3. **Retrieved Culture Chunks**
   * Top-k (default: 3) culture document segments
   * Vector similarity scores (internal)
4. **Retrieved Resume Chunks**
   * Top-k (default: 10) resume segments
   * Resume ID metadata
---
### 5.2 Requirement Verification Artifacts
1. **Raw LLM Skill Extraction Output**
2. **Parsed Required Skills JSON**
3. **Verification Classification Table**
   * Quote-verified
   * Name-only
   * Dropped
4. **Dropped-Skill Justifications**
---
### 5.3 Culture Analysis Artifacts
1. **Generated Culture Attribute List**
2. **LLM Raw Matching Output**
3. **Cleaned Match Records**
   * Evidence type
   * Quotes
   * Inference
   * Confidence
4. **Weighted Match Table**
5. **Computed Culture Score**
---
### 5.4 Skills Analysis Artifacts
1. **Verified Required Skill List**
2. **LLM Raw Matching Output**
3. **Accepted Matched Skills**
4. **Deterministic Missing-Skill Set**
5. **Computed Skills Score**
---
### 5.5 Implied Competency Artifacts (Advisory)
1. **Missing Skill List**
2. **LLM Implied Output (JSON)**
3. **Accepted Implied Records**
   * Resume quotes
   * Explanation
   * Phone-screen questions
4. **Rejected Inferences (internal)**
---
### 5.6 Verification and Correction Artifacts
1. **Verifier Prompt and Output**
2. **Verified / Unverified Claim Lists**
3. **Factuality Scores**
4. **Self-Correction Prompts and Revisions (if triggered)**
---
### 5.7 Recommendation and Policy Artifacts
1. **Final Recommendation Prompt**
2. **Policy Threshold Snapshot**
3. **Immutable Score Values**
4. **Generated Recommendation Text**
---
### 5.8 Bias Audit Artifacts
1. **Bias Audit Prompt**
2. **Audit Input Bundle (JD + Analyses + Recommendation)**
3. **Structured Bias Indicator List**
4. **Severity and Mitigation Suggestions**
5. **Recruiter Guidance Text**
---
### 5.9 System Metadata
1. Timestamp of run
2. Model version
3. Prompt versions
4. Chunking parameters
5. Retrieval k-values
6. Scoring parameters
---
## 6) Known Limitations
1. **Retrieval scope**: evaluation depends on retrieved chunks; some evidence may be missed.
2. **Attribute generation variance**: culture attributes can vary per run unless cached or cataloged.
3. **LLM evidence overreach**: mitigated by verification and cleanup, but not eliminated.
4. **Bias audit is advisory**: it flags issues; it does not enforce policy changes unless you add an auto-rewrite step.
---
## 7) Governance and Change Control
* Prompt changes must preserve JSON contracts.
* Any change that affects scoring or policy should be versioned.
* Audit outputs should be retained for traceability.
---
## 8) Intended Use
This tool is built for:
* faster, evidence-based screening
* transparent reasoning
* safer use of LLMs via verification and audits
It is not a substitute for:
* human judgment
* legal review
* formal HR policy compliance
---
### High-level pipeline (inputs → outputs)
**Inputs uploaded by recruiter**
1. Company culture/values docs (PDF/DOCX)
2. Resumes (PDF/DOCX)
3. Job description (text)
⬇️
**Indexing (deterministic, Python)**
* Culture docs → chunk + embed → `culture_store`
* Resumes → anonymize → chunk + embed → `resume_store`
⬇️
**Candidate assessment (per JD run)**
1. **Extract required skills (LLM)** → JSON `required_skills[{skill,evidence_quote}]`
2. **Verify extracted skills (Python)** → quote-verified / name-only / dropped → *quote-only list used for scoring*
3. **Retrieve relevant culture context (deterministic retrieval)**
   * Query: JD
   * Retrieve: top-k culture chunks (**current: k=3**)
   * Output: `culture_context`
4. **Generate job-relevant culture attributes (LLM)** → JSON `cultural_attributes[4–6]`
5. **Retrieve relevant resume chunks (deterministic retrieval)**
   * Query: JD
   * Retrieve: top-k resume chunks (**current: k=10**)
   * Group by `resume_id`
6. **Per candidate: culture matching (LLM → cleanup → deterministic score)**
   * LLM proposes matches (direct/inferred) + quotes
   * Python enforces validity gates
   * Deterministic weighted culture score (direct=1.0, inferred=0.5)
7. **Per candidate: skills matching (LLM → deterministic score)**
   * LLM proposes matched skills + evidence snippets
   * Python recomputes missing list deterministically
   * Deterministic skills score using quote-verified requirements only
8. **Per candidate: implied competencies (LLM, NOT SCORED)**
   * Inputs: missing skills + matched skills + resume + JD
   * Output: implied items with quotes + phone-screen questions
   * Guardrail: tool-like skills (R/SAS/MATLAB) require explicit mention
9. **Factuality verification (LLM verifier)** → ✓/✗ for evidence-backed match lines + factuality score
10. **Recommendation (LLM, policy constrained)** → uses fixed scores + fixed decision policy
11. **Self-correction (conditional)** → triggered if any unverified claims exist
12. **Bias audit (LLM)** → audits JD + analyses + recommendation → structured bias indicators + guidance
⬇️
**Outputs per candidate**
* Requirements verification summary (global)
* Culture analysis + score
* Skills analysis + score
* Implied (not scored) follow-ups
* Fact-check results
* Final recommendation (+ revision note if corrected)
* Bias audit
---
### Component map (LLM vs deterministic)
**LLM-driven components**
* Required skill extraction (JSON)
* Culture attribute generation (JSON)
* Culture match proposals (JSON)
* Skills match proposals (JSON)
* Implied (not scored) follow-ups (JSON)
* Factuality verification (✓/✗)
* Final recommendation (policy constrained)
* Bias audit (structured)
**Deterministic / Python-enforced components**
* Resume anonymization
* Chunking + embedding + storage
* Retrieval parameters (top-k)
* Required-skill verification (quote/name-only/dropped)
* Deduplication of requirements
* Culture match cleanup rules (validity gates)
* Skills missing list recomputation
* Skills score computation
* Culture score computation with weights
* Decision thresholds (proceed / do not proceed)
* Self-correction trigger (presence of unverified claims)
---
## Audit Artifacts
This section lists the primary artifacts produced (or recommended to persist) to make runs reviewable and defensible.
### Inputs (source-of-truth)
* Job description text (as provided)
* Culture documents (original files)
* Resumes (original files)
### Pre-processing
* Sanitized resume text (post-anonymization)
* Redaction notes (what was removed/masked)
* Chunking configuration (chunk_size, chunk_overlap)
* Embedding configuration (embedding model + settings)
### Retrieval
* Culture retrieval query: JD text
* Culture retrieved chunks: top-k (**current: k=3**)
* Resume retrieval query: JD text
* Resume retrieved chunks: top-k (**current: k=10**)
* Candidate grouping: chunks grouped by `resume_id`
### Requirements verification
* LLM `required_skills` JSON (raw)
* Normalized required skill list (deduped)
* Verification output:
  * quote-verified list
  * name-only list
  * dropped/unverified list
  * counts and factuality score
* Final scoring-required list: quote-verified only
### Per-candidate analyses
**Culture analysis**
* Raw LLM culture-match JSON
* Post-cleanup matched culture list
* Missing culture attributes list
* Culture score + label
* Culture evidence lines shown to recruiters
**Skills analysis**
* Raw LLM skills-match JSON
* Matched skills list (with evidence snippets)
* Deterministically computed missing skills list
* Skills score + label
**Implied (NOT SCORED)**
* Raw LLM implied JSON
* Filtered implied list (must include resume quotes + phone-screen questions)
### Verification & correction
* Verifier raw output (✓/✗ lines)
* Verified claims list
* Unverified claims list
* Factuality score
* Self-correction trigger status (yes/no)
* Corrected recommendation (if triggered) + revision note
### Bias audit
* Bias audit raw output (structured)
* Bias indicators list (category, severity, trigger_text, why_it_matters, recommended_fix)
* Overall assessment
* Recruiter guidance
### Run-level trace (recommended)
For reproducibility/governance, also persist:
* Timestamp, model name, temperature, seed
* Prompt versions (hash or version ID)
* Retrieval parameters (k values)
* Score thresholds and policy version
* Any configuration overrides used during the run
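One way to persist this trace is as a small record serialized to JSON alongside the other artifacts. Everything here is hypothetical: the field names, the 12-character SHA-256 prompt hash and the threshold keys. The defaults mirror the current k values and policy thresholds described in this guide.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def prompt_version(prompt_template: str) -> str:
    """Short content hash, so any prompt edit yields a new version ID."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12]


@dataclass
class RunTrace:
    model_name: str
    temperature: float
    seed: Optional[int]
    prompt_versions: dict
    retrieval_k: dict = field(default_factory=lambda: {"culture": 3, "resume": 10})
    thresholds: dict = field(
        default_factory=lambda: {"proceed_at": 70, "reject_below": 60, "culture_tiebreak_at": 70}
    )
    policy_version: str = "v1"  # hypothetical version label
    overrides: dict = field(default_factory=dict)
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        """Serialize the whole trace with stable key order for diffing across runs."""
        return json.dumps(asdict(self), sort_keys=True)
```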
---
## End-to-End Pipeline (Swim-Lane View)
| Step | Recruiter / Input | Python / Deterministic Logic | LLM (Groq) | Storage / Output |
|------|------------------|------------------------------|-----------|------------------|
| 1 | Upload culture documents | Chunk + embed | – | `culture_store` (indexed) |
| 2 | Upload resumes | Anonymize → chunk → embed | – | `resume_store` (indexed) |
| 3 | Paste JD + Run | Send JD to LLM | Extract required skills + evidence quotes | `required_skills` JSON |
| 4 | – | Verify requirements (quote / name-only / dropped) | – | Verified list + debug report |
| 5 | – | Retrieve culture context (k=3) | – | `culture_context` |
| 6 | – | – | Generate culture attributes (job-performance aligned) | `cultural_attributes` JSON |
| 7 | – | Retrieve resume chunks (k=10), group by `resume_id` | – | Candidate chunks |
| 8 | – | – | Propose culture matches (direct/inferred + quotes) | Raw culture-match JSON |
| 9 | – | Cleanup + weighted scoring (direct=1.0, inferred=0.5) | – | Culture score + evidence |
| 10 | – | – | Propose skill matches + evidence snippets | Raw skills-match JSON |
| 11 | – | Compute missing list + skills score (verified reqs only) | – | Skills score + missing list |
| 12 | – | – | Infer implied skills (NOT SCORED) + phone questions | Implied follow-ups |
| 13 | – | – | Verify evidence (✓/✗) | Factuality report |
| 14 | – | – | Generate recommendation (policy constrained) | Final recommendation |
| 15 | – | Trigger self-correction (if needed) | Revise flagged claims only | Corrected recommendation |
| 16 | – | – | Run bias audit (JD + analyses + decision) | Bias indicators + guidance |
| 17 | Review output | Assemble final report | – | Full candidate report |
### Current Retrieval Parameters
- Culture store: `k = 3` chunks (JD query)
- Resume store: `k = 10` chunks (JD query)