Spaces:

jbeiroa
/

thereisnohr

Sleeping

App Files Files Community

thereisnohr / docs /full-software-guide.md

jbeiroa

Initial clean deploy of demo app

74711df about 2 months ago

preview code

raw

history blame contribute delete

4.83 kB

	# Full Software Guide

	## Purpose and Reader
	This guide is the canonical technical walkthrough of the `thereisnohr` codebase for engineers who need to understand, operate, and extend the system. It describes the complete implementation as of the final stage.

	## System Snapshot
	`thereisnohr` is a complete Applicant Tracking System (ATS) featuring:
	- Async API: FastAPI-based REST backend with database-backed background task management.
	- Modern UI: Multi-page Streamlit application for recruiters.
	- Advanced Ingestion: Multi-stage PDF parsing, identity resolution, and structured extraction.
	- Hybrid Ranking: Vector retrieval (`pgvector`) combined with deterministic heuristics and LLM reranking.
	- Full Persitence: Robust SQLAlchemy models and Alembic migrations.

	## Repository Map
	- `src/api/`: REST API surface and async task framework.
	- `ui/`: Streamlit frontend application.
	- `src/ingest/`: PDF parsing, candidate identity resolution, and Metaflow flows.
	- `src/extract/`: Signal extraction logic for resumes and job descriptions.
	- `src/retrieval/`: Semantic vector retrieval service.
	- `src/ranking/`: Hybrid scoring and LLM reranking services.
	- `src/storage/`: Database models, repositories, and migration runtime.
	- `src/llm/`: Provider-agnostic client and model alias registry.
	- `src/core/`: Runtime configuration and structured logging.
	- `config/`: Model routing aliases and fallback policies.
	- `tests/`: Unit and integration test suites.

	---

	## Architecture Layers

	### 1) Runtime & Configuration
	- Settings: Defined in `src/core/config.py` using `pydantic-settings`. Loads from `.env`.
	- Logging: Structured JSON logging with `run_id` correlation for all internal and external (LLM) calls.

	### 2) Persistence Layer
	- Database: PostgreSQL with `pgvector` for semantic search.
	- Models: SQLAlchemy 2.0 `Mapped` styles in `src/storage/models.py`.
	- Repositories: Standardized CRUD and complex query logic in `src/storage/repositories.py`.
	- Migrations: Alembic-managed schema revisions.

	### 3) LLM Abstraction
	- LLMClient: Abstract contract for structured generation and embeddings.
	- LiteLLM: Concrete implementation supporting hundreds of providers (OpenAI, Ollama, Anthropic, etc.).
	- Alias Registry: Decouples feature logic from specific models. Features request an alias (e.g., `ranker_default`), and the registry routes it to the configured provider.

	### 4) Ingestion & Parsing
	- Parser: Uses `pymupdf4llm` to convert PDFs to markdown, followed by custom cleaning and heading detection.
	- Identity Resolution: Determines if a resume belongs to an existing candidate using deterministic signals (email, phone) and LLM fallback for names.
	- Ingestion Service: Orchestrates the per-file pipeline (parse -> identify -> extract signals -> persist).
	- Metaflow: Used for high-volume batch ingestion from local directories.

	### 5) Retrieval & Ranking
	- Stage 1: Vector Retrieval: Wide-net search using cosine similarity on resume section embeddings.
	- Stage 2: Deterministic Scorer: Heuristic filter based on explicit hard-skill overlap between the JD and candidate signals.
	- Stage 3: LLM Reranking: Qualitative refinement of top candidates, generating fit summaries and gap/risk analysis.

	### 6) API & UI (Interface Layer)
	- FastAPI: Provides endpoints for job management, resume uploads, and ranking triggers.
	- AsyncTask Runner: A zero-dependency worker that executes long-running functions using FastAPI's `BackgroundTasks` while updating the `async_tasks` table for UI polling.
	- Streamlit: A dashboard for recruiters to upload files, manage job postings, and review/export candidate rankings.

	---

	## Developer Runbook

	### Setup
	```bash
	uv sync --all-extras
	cp .env.example .env
	# Edit .env with your database and API keys
	```

	### Database Operations
	```bash
	uv run alembic upgrade head
	```

	### Running the App
	1. Start Backend: `uv run uvicorn src.api.app:app`
	2. Start Frontend: `uv run streamlit run ui/app.py`

	### Testing
	```bash
	uv run pytest -q
	# For database integration tests (requires Docker)
	uv run pytest -q -m integration
	```

	---

	## Technical Contracts

	### Public REST API
	- `POST /api/ingest/upload`: Upload PDF resumes.
	- `POST /api/jobs`: Create a job and extract requirements.
	- `POST /api/jobs/{id}/rank`: Start the ranking process.
	- `GET /api/tasks/{id}`: Poll for background task status.
	- `GET /api/matches`: List ranked candidates for a job.

	### Model Alias System
	Aliases are defined in `config/model_aliases.yaml`:
	- `embedding_default`: Used for semantic indexing.
	- `extractor_default`: Used for structured signal extraction.
	- `ranker_default`: Used for qualitative reranking and explanations.
	- `explainer_default`: Used for interview prep pack generation.