Full Software Guide

Purpose and Reader

This guide is the canonical technical walkthrough of the thereisnohr codebase for engineers who need to understand, operate, and extend the system. It describes the complete implementation as of the final stage.

System Snapshot

thereisnohr is a complete Applicant Tracking System (ATS) featuring:

Async API: FastAPI-based REST backend with database-backed background task management.
Modern UI: Multi-page Streamlit application for recruiters.
Advanced Ingestion: Multi-stage PDF parsing, identity resolution, and structured extraction.
Hybrid Ranking: Vector retrieval (pgvector) combined with deterministic heuristics and LLM reranking.
Full Persistence: Robust SQLAlchemy models and Alembic migrations.

Repository Map

src/api/: REST API surface and async task framework.
ui/: Streamlit frontend application.
src/ingest/: PDF parsing, candidate identity resolution, and Metaflow flows.
src/extract/: Signal extraction logic for resumes and job descriptions.
src/retrieval/: Semantic vector retrieval service.
src/ranking/: Hybrid scoring and LLM reranking services.
src/storage/: Database models, repositories, and migration runtime.
src/llm/: Provider-agnostic client and model alias registry.
src/core/: Runtime configuration and structured logging.
config/: Model routing aliases and fallback policies.
tests/: Unit and integration test suites.

Architecture Layers

1) Runtime & Configuration

Settings: Defined in src/core/config.py using pydantic-settings. Loads from .env.
Logging: Structured JSON logging with run_id correlation for all internal and external (LLM) calls.
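
As an illustration of run_id correlation, the following is a minimal standard-library sketch, not the project's actual logging module. The names current_run_id, JsonFormatter, build_logger, and log_demo are hypothetical, and the JSON field set is an assumption.

```python
import io
import json
import logging
import uuid
from contextvars import ContextVar

# Hypothetical context variable holding the correlation id for the current run.
current_run_id: ContextVar[str] = ContextVar("current_run_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "run_id": current_run_id.get(),  # same id on every line of a run
        }
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_demo() -> dict:
    """Emit one record under a fresh run_id and return the parsed JSON line."""
    buffer = io.StringIO()
    logger = build_logger("demo.ingest", buffer)
    token = current_run_id.set(uuid.uuid4().hex)
    try:
        logger.info("parsed %d sections", 4)
    finally:
        current_run_id.reset(token)
    return json.loads(buffer.getvalue())
```

Using a context variable means the id does not have to be threaded through every function signature; any code that logs during the run picks it up automatically.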

2) Persistence Layer

Database: PostgreSQL with pgvector for semantic search.
Models: SQLAlchemy 2.0 typed declarative models (Mapped annotations) in src/storage/models.py.
Repositories: Standardized CRUD and complex query logic in src/storage/repositories.py.
Migrations: Alembic-managed schema revisions.

3) LLM Abstraction

LLMClient: Abstract contract for structured generation and embeddings.
LiteLLM: Concrete implementation supporting hundreds of providers (OpenAI, Ollama, Anthropic, etc.).
Alias Registry: Decouples feature logic from specific models. Features request an alias (e.g., ranker_default), and the registry routes it to the configured provider.
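
The alias idea can be sketched as a small lookup table. This is an assumption-laden illustration rather than the code in src/llm/: ModelRoute, AliasRegistry, the fallback tuple, and the provider/model strings are all placeholders invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelRoute:
    """One alias entry: a primary model plus ordered fallbacks (hypothetical shape)."""

    primary: str
    fallbacks: tuple[str, ...] = ()


@dataclass
class AliasRegistry:
    """Maps feature-level aliases to concrete model identifiers."""

    routes: dict[str, ModelRoute] = field(default_factory=dict)

    def resolve(self, alias: str) -> list[str]:
        """Return the models to try for an alias, primary first."""
        try:
            route = self.routes[alias]
        except KeyError:
            raise KeyError(f"unknown model alias: {alias!r}") from None
        return [route.primary, *route.fallbacks]


# Placeholder model names; real values would come from configuration.
registry = AliasRegistry({
    "ranker_default": ModelRoute("provider-a/large-model", ("provider-b/small-model",)),
    "embedding_default": ModelRoute("provider-a/embedding-model"),
})
```

Feature code would call registry.resolve("ranker_default") and never mention a concrete model, so swapping providers becomes a configuration change.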

4) Ingestion & Parsing

Parser: Uses pymupdf4llm to convert PDFs to markdown, followed by custom cleaning and heading detection.
Identity Resolution: Determines whether a resume belongs to an existing candidate using deterministic signals (email, phone), with an LLM fallback for name matching.
Ingestion Service: Orchestrates the per-file pipeline (parse -> identify -> extract signals -> persist).
Metaflow: Used for high-volume batch ingestion from local directories.
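
The deterministic part of identity resolution can be illustrated with a short sketch. It assumes that matching is done on normalized email and digits-only phone values; the Candidate dataclass and the function names are hypothetical, and the real service may apply different normalization rules.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical minimal view of a stored candidate."""

    candidate_id: int
    email: Optional[str] = None
    phone: Optional[str] = None


def normalize_email(value: Optional[str]) -> Optional[str]:
    """Lowercase and trim so that casing differences do not split identities."""
    return value.strip().lower() if value and value.strip() else None


def normalize_phone(value: Optional[str]) -> Optional[str]:
    """Keep digits only; formatting such as spaces or dashes is ignored."""
    digits = re.sub(r"\D", "", value or "")
    return digits or None


def match_existing(
    resume_email: Optional[str],
    resume_phone: Optional[str],
    known: list[Candidate],
) -> Optional[int]:
    """Return the id of a candidate sharing an email or phone, else None."""
    email = normalize_email(resume_email)
    phone = normalize_phone(resume_phone)
    for candidate in known:
        if email and email == normalize_email(candidate.email):
            return candidate.candidate_id
        if phone and phone == normalize_phone(candidate.phone):
            return candidate.candidate_id
    return None  # caller would fall back to name-based (LLM) resolution
```

A None result is the point at which the pipeline would either hand off to the LLM name fallback or create a new candidate record.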

5) Retrieval & Ranking

Stage 1: Vector Retrieval: Wide-net search using cosine similarity on resume section embeddings.
Stage 2: Deterministic Scorer: Heuristic filter based on explicit hard-skill overlap between the JD and candidate signals.
Stage 3: LLM Reranking: Qualitative refinement of top candidates, generating fit summaries and gap/risk analysis.
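
The first two stages can be sketched in pure Python. In the real system the similarity search runs inside PostgreSQL via pgvector; the version below only illustrates the arithmetic. The function names, the 0.5 overlap threshold, and the top_k value are assumptions made for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def skill_overlap(required: set[str], candidate: set[str]) -> float:
    """Fraction of required hard skills that the candidate lists (case-insensitive)."""
    if not required:
        return 0.0
    normalized = {s.lower() for s in candidate}
    return sum(1 for s in required if s.lower() in normalized) / len(required)


def shortlist(jd_vector, jd_skills, candidates, top_k=2, min_overlap=0.5):
    """Score by similarity, drop low-overlap candidates, keep the top_k names.

    candidates is an iterable of (name, embedding, skills) tuples.
    """
    scored = [
        (name, cosine_similarity(jd_vector, vec), skill_overlap(jd_skills, skills))
        for name, vec, skills in candidates
    ]
    kept = [row for row in scored if row[2] >= min_overlap]
    kept.sort(key=lambda row: row[1], reverse=True)
    return [name for name, _, _ in kept[:top_k]]  # these would go to the LLM reranker
```

Keeping the expensive LLM call for a short list is the design motivation for running the two cheap stages first.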

6) API & UI (Interface Layer)

FastAPI: Provides endpoints for job management, resume uploads, and ranking triggers.
AsyncTask Runner: A zero-dependency worker that executes long-running functions using FastAPI's BackgroundTasks while updating the async_tasks table for UI polling.
Streamlit: A dashboard for recruiters to upload files, manage job postings, and review/export candidate rankings.
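
The task-runner pattern can be sketched without FastAPI or a database. Here a module-level dict stands in for the async_tasks table, and the status values (pending, running, completed, failed) are assumptions; create_task and run_task are hypothetical names.

```python
import uuid
from typing import Any, Callable

# In-memory stand-in for the async_tasks table; the real system persists rows in the database.
TASKS: dict[str, dict[str, Any]] = {}


def create_task() -> str:
    """Register a pending task and return its id (what the API would hand to the UI)."""
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {"status": "pending", "result": None, "error": None}
    return task_id


def run_task(task_id: str, func: Callable[..., Any], *args: Any, **kwargs: Any) -> None:
    """Execute func and record the outcome so a poller can observe it."""
    TASKS[task_id]["status"] = "running"
    try:
        TASKS[task_id]["result"] = func(*args, **kwargs)
        TASKS[task_id]["status"] = "completed"
    except Exception as exc:  # record the failure instead of losing it in the background
        TASKS[task_id]["error"] = str(exc)
        TASKS[task_id]["status"] = "failed"
```

In a FastAPI handler, a wrapper like this would typically be scheduled with background_tasks.add_task(run_task, task_id, func, ...), and the handler would return task_id immediately so the UI can start polling.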

Developer Runbook

Setup

uv sync --all-extras
cp .env.example .env
# Edit .env with your database and API keys

Database Operations

uv run alembic upgrade head

Running the App

Start Backend: uv run uvicorn src.api.app:app
Start Frontend: uv run streamlit run ui/app.py

Testing

uv run pytest -q
# For database integration tests (requires Docker)
uv run pytest -q -m integration

Technical Contracts

Public REST API

POST /api/ingest/upload: Upload PDF resumes.
POST /api/jobs: Create a job and extract requirements.
POST /api/jobs/{id}/rank: Start the ranking process.
GET /api/tasks/{id}: Poll for background task status.
GET /api/matches: List ranked candidates for a job.
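
A client typically triggers a long-running endpoint and then polls GET /api/tasks/{id}. The sketch below shows only the polling loop. The response shape (a JSON object with a status field) and the terminal values completed and failed are assumptions, and fetch_status is a hypothetical callable injected by the caller so that the HTTP library is left open.

```python
import time
from typing import Callable


def wait_for_task(
    fetch_status: Callable[[str], dict],
    task_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
) -> dict:
    """Poll until the task reaches a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(task_id)  # e.g. parsed JSON from GET /api/tasks/{id}
        if payload.get("status") in {"completed", "failed"}:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {payload.get('status')!r}")
        time.sleep(interval)
```

Passing the fetch function in keeps the loop testable with canned responses and usable from both scripts and the Streamlit UI.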

Model Alias System

Aliases are defined in config/model_aliases.yaml:

embedding_default: Used for semantic indexing.
extractor_default: Used for structured signal extraction.
ranker_default: Used for qualitative reranking and explanations.
explainer_default: Used for interview prep pack generation.