	# AI Recruiting Assistant — Guide Book (Updated)

	## 0) Overview

	### What this tool does

	This AI Recruiting Assistant is a decision-support system that helps recruiters and hiring managers:

	* Extract job requirements from a job description (JD)
	* Evaluate resumes against verified requirements using evidence-based matching
	* Assess job-relevant culture/working-style signals using retrieved company documents
	* Run factuality checks to detect ungrounded claims
	* Run a bias & fairness audit across the JD, analyses, and the model’s final recommendation

	### The problem it addresses

	Recruiting teams often face three recurring issues when using AI:

	1. Hallucinated requirements: LLMs may “invent” skills that are not explicitly required.
	2. Opaque scoring: Many tools produce fit scores without clearly showing evidence.
	3. Bias risks: Hiring language and reasoning can leak pedigree/class proxies or subjective criteria.

	This tool addresses those issues by enforcing:

	* Deterministic verification gates (requirements are verified before scoring)
	* Evidence-backed scoring (only verified requirements are scored; each match includes a quote)
	* Self-verification and self-correction (factuality checks can trigger automatic revision)
	* Bias auditing (flags risky language and inconsistent standards)

	### How it differs from typical recruiting tools

	Compared with “black-box” resume screeners or generic LLM chatbots, this system emphasizes:

	* Transparency: Outputs include what was required, what was verified, what was dropped, and why.
	* Auditability: The scoring math is deterministic and traceable to inputs.
	* Self-verifying behavior: Claims are checked against source text; unverified claims can be removed.
	* Bias checks by design: Bias-sensitive content is audited explicitly instead of implicitly influencing scores.
	* Culture check that’s job-performance aligned: Culture attributes are framed as job-relevant behaviors, not background proxies.

	---

	## 1) Inputs and Document Handling

	### 1.1 What the user uploads

	The tool operates on three inputs:

	1. Company culture / values documents (PDF/DOCX)
	2. Resumes (PDF/DOCX)
	3. Job description (pasted text)

	### 1.2 Resume anonymization

	Before resumes are stored or analyzed, the tool applies heuristic redaction:

	* Emails, phone numbers, URLs
	* Addresses / location identifiers
	* Explicit demographic fields
	* Likely name header (first line)

	This reduces exposure of personal identifiers and keeps analysis focused on job evidence.
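
A minimal sketch of this kind of heuristic redaction is shown here. The patterns, placeholder tokens, and the `redact_resume` name are illustrative assumptions, not the tool's actual rules, and the sketch skips address and demographic-field handling.

```python
import re

# Hypothetical patterns for illustration; the tool's real rules are not shown in this guide.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def _mask_phone(match: re.Match) -> str:
    """Mask only digit runs long enough to be a phone number (keeps year ranges)."""
    digits = sum(ch.isdigit() for ch in match.group())
    return "[PHONE]" if digits >= 10 else match.group()


def redact_resume(text: str) -> str:
    """Mask the first non-empty line (likely the name), then emails, URLs, and phones."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip():
            lines[i] = "[NAME]"
            break
    redacted = "\n".join(lines)
    redacted = EMAIL_RE.sub("[EMAIL]", redacted)
    redacted = URL_RE.sub("[URL]", redacted)
    return PHONE_RE.sub(_mask_phone, redacted)
```

Because the rules are heuristic, a name that is not on the first line, or an identifier in an unusual format, can slip through; redaction output is worth spot-checking.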

	### 1.3 Vector stores (retrieval)

	The tool maintains two separate Chroma collections:

	* Resumes (anonymized + chunked)
	* Culture docs (chunked)

	Chunking uses a recursive splitter with overlap to preserve context.
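
To illustrate what overlap does, the sketch below uses a plain fixed-size character window. It is a simplification and not the recursive splitter itself, which also tries to break on separators such as paragraphs and sentences. The `chunk_text` name and the default sizes are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.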

	---

	## 2) End-to-End Logic Flow (Step-by-Step)

	Below is the stepwise flow executed when a recruiter clicks Analyze Candidates.

	### Step 0 — Prerequisite: Documents exist in storage

	* Culture docs and resumes must be stored first.
	* If not stored, retrieval will be empty or low-signal.

	### Step 1 — Extract required skills from the Job Description (LLM-driven)

	Goal: Identify only skills that are explicitly required.

	* The tool prompts the LLM to return JSON only:
	    * `required_skills: [{skill, evidence_quote}]`
	* The LLM is instructed to:
	    * include only MUST HAVE / explicitly required skills
	    * exclude “nice-to-haves” and implied skills
	    * copy a short verbatim quote as evidence

	LLM role: structured extraction.

	Failure behavior: If JSON parsing fails, the tool stops and prints the raw output.
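
The JSON contract and the fail-fast behavior could look roughly like the sketch below. The function name and error messages are hypothetical; only the `required_skills`, `skill`, and `evidence_quote` keys come from this guide.

```python
import json


def parse_required_skills(raw_output: str) -> list[dict]:
    """Parse the extraction response; raise with the raw text if it is not usable JSON."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        # Surface the raw model output so the failure can be inspected.
        raise ValueError(f"Skill extraction returned invalid JSON:\n{raw_output}") from exc
    if not isinstance(payload, dict) or not isinstance(payload.get("required_skills"), list):
        raise ValueError(f"Missing 'required_skills' list in output:\n{raw_output}")
    skills = []
    for item in payload["required_skills"]:
        if not isinstance(item, dict):
            continue
        skill = str(item.get("skill", "")).strip()
        quote = str(item.get("evidence_quote", "")).strip()
        if skill:  # skip entries without a usable skill name
            skills.append({"skill": skill, "evidence_quote": quote})
    return skills
```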

	### Step 2 — Verify extracted skills against the JD (deterministic, Python)

	Goal: Block hallucinated requirements from entering scoring.

	Each extracted item is classified:

	* Quote-verified (strong): the evidence quote appears verbatim in the JD
	* Name-only (weak): the skill name appears in the JD, but the quote doesn’t match
	* Unverified (dropped): neither quote nor name appears

	Deterministic gate:

	* Only quote-verified skills are used as the final required list for scoring.
	* Name-only and dropped skills are reported for transparency.

	Output: “Requirements Verification” section shows:

	* extracted count
	* quote-verified vs name-only vs dropped
	* list of skills used for scoring
	* list of retracted/dropped items (with reason)
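
The three-way classification can be sketched as below. Lowercasing and whitespace normalization before comparison are assumptions made for this example (the tool may compare strictly verbatim), and the function names are hypothetical.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not defeat the match."""
    return " ".join(text.lower().split())


def classify_requirement(item: dict, jd_text: str) -> str:
    """Return 'quote_verified', 'name_only', or 'dropped' for one extracted skill."""
    jd = _normalize(jd_text)
    quote = _normalize(item.get("evidence_quote", ""))
    skill = _normalize(item.get("skill", ""))
    if quote and quote in jd:
        return "quote_verified"
    if skill and skill in jd:
        return "name_only"
    return "dropped"


def verify_requirements(items: list[dict], jd_text: str) -> dict[str, list[str]]:
    """Bucket extracted skills; only the 'quote_verified' bucket feeds scoring."""
    report = {"quote_verified": [], "name_only": [], "dropped": []}
    for item in items:
        report[classify_requirement(item, jd_text)].append(item["skill"])
    return report
```

Plain substring matching on the skill name is crude: a very short name such as "R" would match almost any JD, which is one reason to score only quote-verified items.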

	### Step 3 — Retrieve the most relevant culture chunks (deterministic retrieval)

	Goal: Ground culture evaluation in actual company documents.

	* The tool runs similarity search over culture docs using the JD as query.
	* It selects the top k chunks (e.g., k=3).

	Deterministic component: vector retrieval parameters.

	Output artifact: `culture_context` is the concatenated text of retrieved culture chunks.

	### Step 4 — Generate job-performance culture attributes (LLM-driven)

	Goal: Create a small set of job-relevant behavioral attributes to evaluate consistently.

	* The tool prompts the LLM to return JSON:
	    * `cultural_attributes: ["...", "..."]` (4–6 items)

	Attribute rules:

	* Must be job-performance aligned behaviors (e.g., “evidence-based decision making”).
	* Must avoid pedigree / class / prestige language.
	* Must avoid non-performance preferences (e.g., remote-first, time zone).

	LLM role: label generation from retrieved culture context.

	### Step 5 — Retrieve top resume chunks for the JD (deterministic retrieval)

	Goal: Identify the most relevant candidates and their relevant resume text.

	* The tool runs similarity search over resumes using the JD.
	* It retrieves top k chunks (e.g., k=10) and groups them by `resume_id`.

	Note: Only retrieved chunks are analyzed. If relevant evidence isn’t retrieved, it may be missed.
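
Grouping retrieved chunks by candidate might look like the sketch below. The chunk shape (a dict with `resume_id` and `text` keys) is an assumption for illustration; the actual retrieval objects are not described in this guide.

```python
from collections import defaultdict


def group_chunks_by_resume(chunks: list[dict]) -> dict[str, str]:
    """Concatenate retrieved chunk texts per resume_id, preserving retrieval order."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["resume_id"]].append(chunk["text"])
    return {resume_id: "\n\n".join(texts) for resume_id, texts in grouped.items()}
```

A resume with no chunk in the top k never appears in the result, which is the retrieval-scope limitation noted above.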

	### Step 6 — Culture evidence matching per candidate (LLM + deterministic cleanup + deterministic scoring)

	Goal: Determine which culture attributes are supported by resume evidence.

	LLM-driven matching:

	* For each attribute, the LLM may return a match with:
	    * `evidence_type`: `direct` or `inferred`
	    * `evidence_quotes`: 1–2 verbatim resume quotes
	    * `inference`: required for inferred
	    * `confidence`: 1–5

	Deterministic cleanup rules (Python):

	A match is kept only if:

	* the attribute name is present
	* `evidence_type` is `direct` or `inferred`
	* at least one non-trivial quote exists
	* `confidence` is an integer from 1 to 5
	* inferred matches include an inference sentence
	* inferred matches meet the minimum confidence threshold, when one is configured

	Deterministic culture scoring (Python):

	* Direct evidence weight: 1.0
	* Inferred evidence weight: 0.5

	Culture score is computed as:

	* `(sum(weights for matched attributes) / number_of_required_attributes) * 100`
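
The cleanup gates and the weighted score can be sketched together as follows. The 10-character threshold for a "non-trivial" quote, the default minimum confidence, and the choice to count each attribute once at its strongest evidence type are assumptions made for this example.

```python
WEIGHTS = {"direct": 1.0, "inferred": 0.5}


def clean_matches(raw_matches, min_inferred_confidence=1, min_quote_chars=10):
    """Keep only matches that pass every validity gate."""
    kept = []
    for m in raw_matches:
        attribute = str(m.get("attribute") or "").strip()
        evidence_type = m.get("evidence_type")
        quotes = [q.strip() for q in (m.get("evidence_quotes") or [])
                  if isinstance(q, str) and len(q.strip()) >= min_quote_chars]
        confidence = m.get("confidence")
        if not attribute or evidence_type not in WEIGHTS or not quotes:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(confidence, bool) or not isinstance(confidence, int):
            continue
        if not 1 <= confidence <= 5:
            continue
        if evidence_type == "inferred":
            if not str(m.get("inference") or "").strip():
                continue
            if confidence < min_inferred_confidence:
                continue
        kept.append({**m, "attribute": attribute, "evidence_quotes": quotes})
    return kept


def culture_score(kept_matches, attributes):
    """Weighted share of attributes with evidence, scaled to 0-100."""
    if not attributes:
        return 0.0
    best = {}
    for m in kept_matches:
        if m["attribute"] in attributes:
            weight = WEIGHTS[m["evidence_type"]]
            best[m["attribute"]] = max(best.get(m["attribute"], 0.0), weight)
    return sum(best.values()) / len(attributes) * 100
```

For example, one direct and one inferred match against four attributes gives (1.0 + 0.5) / 4 × 100 = 37.5.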

	### Step 7 — Skills matching per candidate (LLM + deterministic scoring)

	Goal: Match only the verified required skills to resume evidence.

	Inputs:

	* Candidate resume text (retrieved chunks)
	* Verified required skills list (quote-verified only)

	LLM output (JSON):

	* `matched: [{skill, evidence_snippet}]`
	* `missing: [skill]` (treated as advisory; missing is recomputed deterministically)

	Deterministic missing calculation (Python):

	* Missing = required_set − matched_set

	Deterministic skills scoring (Python):

	* `(number_of_matched_required_skills / number_of_required_skills) * 100`
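
A minimal sketch of the deterministic recomputation, assuming skill names are compared case-insensitively (an assumption; the guide does not say how names are normalized). The `score_skills` name is hypothetical.

```python
def score_skills(required: list[str], matched: list[str]) -> tuple[float, list[str]]:
    """Return (skills score 0-100, sorted missing skills) from set arithmetic."""
    required_set = {s.strip().lower() for s in required}
    # Ignore any matched skill the LLM reports that is not a verified requirement.
    matched_set = {s.strip().lower() for s in matched} & required_set
    missing = sorted(required_set - matched_set)
    score = len(matched_set) / len(required_set) * 100 if required_set else 0.0
    return score, missing
```

Because the missing list is derived from the sets, the LLM's own `missing` field cannot change the score.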

	### Step 8 — Implied competencies (NOT SCORED) for phone-screen guidance (LLM-driven, advisory)

	Goal: When a required skill is missing explicitly, suggest whether it may be implied by adjacent evidence.

	* This step is not scored and does not affect proceed/do-not-proceed.
	* The LLM may suggest implied competencies only if it:
	    * uses conservative language (“may be implied”)
	    * includes verbatim resume quotes
	    * provides a phone-screen validation question

	Hard guardrail: Tool-specific skills (e.g., R/SAS/MATLAB) must be explicitly present in the resume to be suggested.

	### Step 9 — Factuality verification (LLM-driven verifier)

	Goal: Detect ungrounded evidence claims.

	* The verifier checks evidence-backed match lines (e.g., `- Skill: snippet`).
	* It ignores:
	    * numeric score lines
	    * missing lists
	    * policy text

	Outputs:

	* verified claims (✓)
	* unverified claims (✗)
	* factuality score
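
This guide does not define how the factuality score is calculated. The sketch below assumes the verifier emits one line per claim prefixed with ✓ or ✗ and assumes the score is the percentage of verified claims; both the format and the formula are assumptions for illustration only.

```python
def summarize_verifier_output(raw: str) -> dict:
    """Split verifier lines into verified/unverified claims and derive summary fields."""
    verified, unverified = [], []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("✓"):
            verified.append(line[1:].strip())
        elif line.startswith("✗"):
            unverified.append(line[1:].strip())
    total = len(verified) + len(unverified)
    return {
        "verified": verified,
        "unverified": unverified,
        # Assumed formula: share of checked claims that were verified.
        "factuality_score": len(verified) / total * 100 if total else None,
        # Any unverified claim is what triggers self-correction later in the flow.
        "needs_correction": bool(unverified),
    }
```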

	### Step 10 — Final recommendation (LLM, policy-constrained)

	Goal: Produce a structured recommendation without changing scores.

	* The model is given:
	    * skills analysis
	    * culture analysis
	    * fixed computed scores
	    * deterministic decision policy

	Decision policy:

	* If skills_score ≥ 70 → PROCEED
	* If skills_score < 60 → DO NOT PROCEED
	* If 60 ≤ skills_score < 70 → PROCEED only if culture_score ≥ 70 else DO NOT PROCEED
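
Expressed as code, the policy is a short function. The `decide` name is hypothetical; the thresholds are the ones listed above.

```python
def decide(skills_score: float, culture_score: float) -> str:
    """Apply the fixed decision policy to already-computed scores."""
    if skills_score >= 70:
        return "PROCEED"
    if skills_score < 60:
        return "DO NOT PROCEED"
    # Borderline band (60 <= skills_score < 70): culture score breaks the tie.
    return "PROCEED" if culture_score >= 70 else "DO NOT PROCEED"
```

Computing the decision in Python and passing it to the model as a fixed input is one way to keep the LLM from re-deciding it.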

	Non-negotiables:

	* LLM must not re-score.
	* LLM must not introduce new claims.

	### Step 11 — Self-correction (triggered by verification issues)

	Goal: Remove/correct any unverified claims while preserving scores/policy.

	* If any unverified claims exist:
	    * The tool asks the LLM to revise the recommendation
	    * Only the flagged claims may be removed/corrected
	    * Scores and policy must remain unchanged

	### Step 12 — Bias audit (LLM-driven audit across docs + reasoning)

	Goal: Flag biased reasoning, biased JD language, or inconsistent standards.

	Audit scope includes:

	* Job description
	* Skills analysis
	* Culture analysis
	* Final recommendation text
	* Culture context

	What it flags (examples):

	* Prestige/pedigree signals (elite employers/education as proxy)
	* Vague “polish/executive presence” language not tied to job requirements
	* Non-job-related culture screening
	* Inconsistent standards (penalizing requirements not in JD)
	* Overclaiming certainty

	Outputs:

	* structured list of bias indicators (category, severity, trigger text, why it matters, recommended fix)
	* recruiter guidance

	---

	## 3) Scoring and Decision Rules (Deterministic)

	### 3.1 Skills score

	* Only quote-verified required skills count.
	* Score = (matched required skills / required skills) × 100.

	### 3.2 Culture score

	* Score = (sum of match weights / number of culture attributes) × 100.
	* Direct = 1.0; inferred = 0.5.

	### 3.3 Labels

	* ≥70: Strong fit
	* 50–69: Moderate fit
	* <50: Not a fit
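
The label bands map to a small function. Treating fractional scores between 69 and 70 as "Moderate fit" is an assumption, since the bands above are stated in whole numbers.

```python
def fit_label(score: float) -> str:
    """Map a 0-100 score to the label bands listed above."""
    if score >= 70:
        return "Strong fit"
    if score >= 50:
        return "Moderate fit"
    return "Not a fit"
```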

	### 3.4 Recommendation

	Recommendation follows the fixed policy described in Step 10.

	---

	## 4) System Flow Diagram (Textual)

	Below is a simplified, end-to-end flow of how data moves through the system.

	```
	[User Uploads]
	          |
	          v
	+-------------------+
	| Culture Documents |
	+-------------------+             +-----------+
	          |                       | Job Desc  |
	          v                       +-----------+
	+-------------------+                   |
	| Culture Vector DB |<------------------+
	+-------------------+                   |
	          |                             v
	          |                  +---------------------+
	          |                  | Skill Extraction    |
	          |                  | (LLM, JSON Output)  |
	          |                  +---------------------+
	          |                             |
	          |                             v
	          |                  +---------------------+
	          |                  | Requirement         |
	          |                  | Verification        |
	          |                  | (Deterministic)     |
	          |                  +---------------------+
	          |                             |
	          |                             v
	          |                  Verified Required Skills
	          |                             |
	          |                             v
	+-------------------+        +---------------------+
	| Resume Documents  |------->| Resume Vector DB    |
	+-------------------+        +---------------------+
	                                        |
	                                        v
	                            Similarity Search (k=10)
	                                        |
	                                        v
	                             Resume Chunks (Grouped)
	                                        |
	                                        v
	                         +-----------------------------+
	                         | Culture Attribute Generator |
	                         | (LLM, JSON Output)          |
	                         +-----------------------------+
	                                        |
	                                        v
	                         +-----------------------------+
	                         | Culture Evidence Matching   |
	                         | (LLM + Rules + Weights)     |
	                         +-----------------------------+
	                                        |
	                                        v
	                          Culture Score (Deterministic)
	                                        |
	                                        v
	                         +-----------------------------+
	                         | Technical Skill Matching    |
	                         | (LLM + Deterministic Scoring)|
	                         +-----------------------------+
	                                        |
	                                        v
	                          Skills Score (Deterministic)
	                                        |
	                                        v
	                         +-----------------------------+
	                         | Implied Competencies (LLM)  |
	                         | (Not Scored, Advisory)      |
	                         +-----------------------------+
	                                        |
	                                        v
	                         +-----------------------------+
	                         | Factuality Verification     |
	                         | (LLM Verifier)              |
	                         +-----------------------------+
	                                        |
	                                        v
	                         +-----------------------------+
	                         | Recommendation Generator    |
	                         | (Policy-Constrained LLM)    |
	                         +-----------------------------+
	                                        |
	                                        v
	                         +-----------------------------+
	                         | Bias & Fairness Audit       |
	                         | (LLM Audit)                 |
	                         +-----------------------------+
	                                        |
	                                        v
	                             Final Recruiter Report
	```

	---

	## 5) Audit Artifacts and Traceability

	For every analysis run, the system produces and retains multiple audit artifacts that enable post-hoc review, regulatory defensibility, and debugging.

	### 5.1 Input Artifacts

	1. Original Job Description
	    * Full pasted JD text
	2. Sanitized Resume Text
	    * Redacted resume content
	    * Redaction summary (internal)
	3. Retrieved Culture Chunks
	    * Top-k (default: 3) culture document segments
	    * Vector similarity scores (internal)
	4. Retrieved Resume Chunks
	    * Top-k (default: 10) resume segments
	    * Resume ID metadata

	---

	### 5.2 Requirement Verification Artifacts

	1. Raw LLM Skill Extraction Output
	2. Parsed Required Skills JSON
	3. Verification Classification Table
	    * Quote-verified
	    * Name-only
	    * Dropped
	4. Dropped-Skill Justifications

	---

	### 5.3 Culture Analysis Artifacts

	1. Generated Culture Attribute List
	2. LLM Raw Matching Output
	3. Cleaned Match Records
	    * Evidence type
	    * Quotes
	    * Inference
	    * Confidence
	4. Weighted Match Table
	5. Computed Culture Score

	---

	### 5.4 Skills Analysis Artifacts

	1. Verified Required Skill List
	2. LLM Raw Matching Output
	3. Accepted Matched Skills
	4. Deterministic Missing-Skill Set
	5. Computed Skills Score

	---

	### 5.5 Implied Competency Artifacts (Advisory)

	1. Missing Skill List
	2. LLM Implied Output (JSON)
	3. Accepted Implied Records
	    * Resume quotes
	    * Explanation
	    * Phone-screen questions
	4. Rejected Inferences (internal)

	---

	### 5.6 Verification and Correction Artifacts

	1. Verifier Prompt and Output
	2. Verified / Unverified Claim Lists
	3. Factuality Scores
	4. Self-Correction Prompts and Revisions (if triggered)

	---

	### 5.7 Recommendation and Policy Artifacts

	1. Final Recommendation Prompt
	2. Policy Threshold Snapshot
	3. Immutable Score Values
	4. Generated Recommendation Text

	---

	### 5.8 Bias Audit Artifacts

	1. Bias Audit Prompt
	2. Audit Input Bundle (JD + Analyses + Recommendation)
	3. Structured Bias Indicator List
	4. Severity and Mitigation Suggestions
	5. Recruiter Guidance Text

	---

	### 5.9 System Metadata

	1. Timestamp of run
	2. Model version
	3. Prompt versions
	4. Chunking parameters
	5. Retrieval k-values
	6. Scoring parameters

	---

	## 6) Known Limitations

	1. Retrieval scope: evaluation depends on retrieved chunks; some evidence may be missed.
	2. Attribute generation variance: culture attributes can vary per run unless cached or cataloged.
	3. LLM evidence overreach: mitigated by verification and cleanup, but not eliminated.
	4. Bias audit is advisory: it flags issues; it does not enforce policy changes unless you add an auto-rewrite step.

	---

	## 7) Governance and Change Control

	* Prompt changes must preserve JSON contracts.
	* Any change that affects scoring or policy should be versioned.
	* Audit outputs should be retained for traceability.

	---

	## 8) Intended Use

	This tool is built for:

	* faster, evidence-based screening
	* transparent reasoning
	* safer use of LLMs via verification and audits

	It is not a substitute for:

	* human judgment
	* legal review
	* formal HR policy compliance

	---

	### High-level pipeline (inputs → outputs)

	Inputs uploaded by recruiter

	1. Company culture/values docs (PDF/DOCX)
	2. Resumes (PDF/DOCX)
	3. Job description (text)

	⬇️

	Indexing (deterministic, Python)

	* Culture docs → chunk + embed → `culture_store`
	* Resumes → anonymize → chunk + embed → `resume_store`

	⬇️

	Candidate assessment (per JD run)

	1. Extract required skills (LLM) → JSON `required_skills[{skill,evidence_quote}]`

	2. Verify extracted skills (Python) → quote-verified / name-only / dropped → only the quote-verified list is used for scoring

	3. Retrieve relevant culture context (deterministic retrieval)

	* Query: JD
	* Retrieve: top-k culture chunks (current: k=3)
	* Output: `culture_context`

	4. Generate job-relevant culture attributes (LLM) → JSON `cultural_attributes[4–6]`

	5. Retrieve relevant resume chunks (deterministic retrieval)

	* Query: JD
	* Retrieve: top-k resume chunks (current: k=10)
	* Group by `resume_id`

	6. Per candidate: culture matching (LLM → cleanup → deterministic score)

	* LLM proposes matches (direct/inferred) + quotes
	* Python enforces validity gates
	* Deterministic weighted culture score (direct=1.0, inferred=0.5)

	7. Per candidate: skills matching (LLM → deterministic score)

	* LLM proposes matched skills + evidence snippets
	* Python recomputes missing list deterministically
	* Deterministic skills score using quote-verified requirements only

	8. Per candidate: implied competencies (LLM, NOT SCORED)

	* Inputs: missing skills + matched skills + resume + JD
	* Output: implied items with quotes + phone-screen questions
	* Guardrail: tool-specific skills (e.g., R/SAS/MATLAB) require explicit mention in the resume

	9. Factuality verification (LLM verifier) → ✓/✗ for evidence-backed match lines + factuality score

	10. Recommendation (LLM, policy constrained) → uses fixed scores + fixed decision policy

	11. Self-correction (conditional) → triggered if any unverified claims exist

	12. Bias audit (LLM) → audits JD + analyses + recommendation → structured bias indicators + guidance

	⬇️

	Outputs per candidate

	* Requirements verification summary (global)
	* Culture analysis + score
	* Skills analysis + score
	* Implied (not scored) follow-ups
	* Fact-check results
	* Final recommendation (+ revision note if corrected)
	* Bias audit

	---

	### Component map (LLM vs deterministic)

	LLM-driven components

	* Required skill extraction (JSON)
	* Culture attribute generation (JSON)
	* Culture match proposals (JSON)
	* Skills match proposals (JSON)
	* Implied (not scored) follow-ups (JSON)
	* Factuality verification (✓/✗)
	* Final recommendation (policy constrained)
	* Bias audit (structured)

	Deterministic / Python-enforced components

	* Resume anonymization
	* Chunking + embedding + storage
	* Retrieval parameters (top-k)
	* Required-skill verification (quote/name-only/dropped)
	* Deduplication of requirements
	* Culture match cleanup rules (validity gates)
	* Skills missing list recomputation
	* Skills score computation
	* Culture score computation with weights
	* Decision thresholds (proceed / do not proceed)
	* Self-correction trigger (presence of unverified claims)

	---

	## Audit Artifacts

	This section lists the primary artifacts produced (or recommended to persist) to make runs reviewable and defensible.

	### Inputs (source-of-truth)

	* Job description text (as provided)
	* Culture documents (original files)
	* Resumes (original files)

	### Pre-processing

	* Sanitized resume text (post-anonymization)
	* Redaction notes (what was removed/masked)
	* Chunking configuration (chunk_size, chunk_overlap)
	* Embedding configuration (embedding model + settings)

	### Retrieval

	* Culture retrieval query: JD text
	* Culture retrieved chunks: top-k (current: k=3)
	* Resume retrieval query: JD text
	* Resume retrieved chunks: top-k (current: k=10)
	* Candidate grouping: chunks grouped by `resume_id`

	### Requirements verification

	* LLM `required_skills` JSON (raw)
	* Normalized required skill list (deduped)
	* Verification output:
	    * quote-verified list
	    * name-only list
	    * dropped/unverified list
	    * counts and factuality score
	* Final scoring-required list: quote-verified only

	### Per-candidate analyses

	Culture analysis

	* Raw LLM culture-match JSON
	* Post-cleanup matched culture list
	* Missing culture attributes list
	* Culture score + label
	* Culture evidence lines shown to recruiters

	Skills analysis

	* Raw LLM skills-match JSON
	* Matched skills list (with evidence snippets)
	* Deterministically computed missing skills list
	* Skills score + label

	Implied (NOT SCORED)

	* Raw LLM implied JSON
	* Filtered implied list (must include resume quotes + phone-screen questions)

	### Verification & correction

	* Verifier raw output (✓/✗ lines)
	* Verified claims list
	* Unverified claims list
	* Factuality score
	* Self-correction trigger status (yes/no)
	* Corrected recommendation (if triggered) + revision note

	### Bias audit

	* Bias audit raw output (structured)
	* Bias indicators list (category, severity, trigger_text, why_it_matters, recommended_fix)
	* Overall assessment
	* Recruiter guidance

	### Run-level trace (recommended)

	For reproducibility/governance, also persist:

	* Timestamp, model name, temperature, seed
	* Prompt versions (hash or version ID)
	* Retrieval parameters (k values)
	* Score thresholds and policy version
	* Any configuration overrides used during the run
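
One possible shape for such a trace record is sketched below. The field names, the 12-character SHA-256 prompt fingerprint, and the `build_run_trace` helper are illustrative assumptions, not the tool's actual schema.

```python
import hashlib
from datetime import datetime, timezone


def prompt_version(prompt_text: str) -> str:
    """Short, stable fingerprint of a prompt so that changes are detectable."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def build_run_trace(model_name, temperature, seed, prompts,
                    culture_k=3, resume_k=10, policy_version="v1", overrides=None):
    """Assemble a JSON-serializable record of the settings used for one run."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": {"name": model_name, "temperature": temperature, "seed": seed},
        "prompt_versions": {name: prompt_version(text) for name, text in prompts.items()},
        "retrieval": {"culture_k": culture_k, "resume_k": resume_k},
        "policy_version": policy_version,
        "overrides": overrides or {},
    }
```

Storing a hash instead of the full prompt keeps the record small while still showing when a prompt changed between runs.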


	## End-to-End Pipeline (Swim-Lane View)

	| Step | Recruiter / Input | Python / Deterministic Logic | LLM (Groq) | Storage / Output |
	|------|------------------|------------------------------|-----------|------------------|
	| 1 | Upload culture documents | Chunk + embed | - | `culture_store` (indexed) |
	| 2 | Upload resumes | Anonymize → chunk → embed | - | `resume_store` (indexed) |
	| 3 | Paste JD + Run | Send JD to LLM | Extract required skills + evidence quotes | `required_skills` JSON |
	| 4 | - | Verify requirements (quote / name-only / dropped) | - | Verified list + debug report |
	| 5 | - | Retrieve culture context (k=3) | - | `culture_context` |
	| 6 | - | - | Generate culture attributes (job-performance aligned) | `cultural_attributes` JSON |
	| 7 | - | Retrieve resume chunks (k=10), group by `resume_id` | - | Candidate chunks |
	| 8 | - | - | Propose culture matches (direct/inferred + quotes) | Raw culture-match JSON |
	| 9 | - | Cleanup + weighted scoring (direct=1.0, inferred=0.5) | - | Culture score + evidence |
	| 10 | - | - | Propose skill matches + evidence snippets | Raw skills-match JSON |
	| 11 | - | Compute missing list + skills score (verified reqs only) | - | Skills score + missing list |
	| 12 | - | - | Infer implied skills (NOT SCORED) + phone questions | Implied follow-ups |
	| 13 | - | - | Verify evidence (✓/✗) | Factuality report |
	| 14 | - | - | Generate recommendation (policy constrained) | Final recommendation |
	| 15 | - | Trigger self-correction (if needed) | Revise flagged claims only | Corrected recommendation |
	| 16 | - | - | Run bias audit (JD + analyses + decision) | Bias indicators + guidance |
	| 17 | Review output | Assemble final report | - | Full candidate report |

	### Current Retrieval Parameters

	- Culture store: `k = 3` chunks (JD query)
	- Resume store: `k = 10` chunks (JD query)