# Full Software Guide
## Purpose and Reader
This guide is the canonical technical walkthrough of the `thereisnohr` codebase for engineers who need to understand, operate, and extend the system. It describes the complete implementation as of the final stage.
## System Snapshot
`thereisnohr` is a complete Applicant Tracking System (ATS) featuring:
- **Async API**: FastAPI-based REST backend with database-backed background task management.
- **Modern UI**: Multi-page Streamlit application for recruiters.
- **Advanced Ingestion**: Multi-stage PDF parsing, identity resolution, and structured extraction.
- **Hybrid Ranking**: Vector retrieval (`pgvector`) combined with deterministic heuristics and LLM reranking.
- **Full Persistence**: Robust SQLAlchemy models and Alembic migrations.
## Repository Map
- `src/api/`: REST API surface and async task framework.
- `ui/`: Streamlit frontend application.
- `src/ingest/`: PDF parsing, candidate identity resolution, and Metaflow flows.
- `src/extract/`: Signal extraction logic for resumes and job descriptions.
- `src/retrieval/`: Semantic vector retrieval service.
- `src/ranking/`: Hybrid scoring and LLM reranking services.
- `src/storage/`: Database models, repositories, and migration runtime.
- `src/llm/`: Provider-agnostic client and model alias registry.
- `src/core/`: Runtime configuration and structured logging.
- `config/`: Model routing aliases and fallback policies.
- `tests/`: Unit and integration test suites.
---
## Architecture Layers
### 1) Runtime & Configuration
- **Settings**: Defined in `src/core/config.py` using `pydantic-settings`. Loads from `.env`.
- **Logging**: Structured JSON logging with `run_id` correlation for all internal and external (LLM) calls.
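The `run_id` correlation described above could look roughly like the following. This is a minimal sketch using only the standard library, not the project's actual logging module; `run_id_var`, `JsonFormatter`, and `build_logger` are hypothetical names chosen for illustration.

```python
import json
import logging
from contextvars import ContextVar

# Hypothetical context variable holding the correlation ID of the current run.
run_id_var: ContextVar[str] = ContextVar("run_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the active run_id."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "run_id": run_id_var.get(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Setting `run_id_var` once at the start of a pipeline run means every log line emitted during that run, including those around LLM calls, can be grouped by the same identifier.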
### 2) Persistence Layer
- **Database**: PostgreSQL with `pgvector` for semantic search.
- **Models**: SQLAlchemy 2.0 declarative models using `Mapped` type annotations, defined in `src/storage/models.py`.
- **Repositories**: Standardized CRUD and complex query logic in `src/storage/repositories.py`.
- **Migrations**: Alembic-managed schema revisions.
### 3) LLM Abstraction
- **LLMClient**: Abstract contract for structured generation and embeddings.
- **LiteLLM**: Concrete implementation that supports a wide range of providers (OpenAI, Ollama, Anthropic, etc.) behind a single interface.
- **Alias Registry**: Decouples feature logic from specific models. Features request an alias (e.g., `ranker_default`), and the registry routes it to the configured provider.
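To illustrate how an alias registry decouples feature code from concrete models, here is a minimal sketch. It is an assumption about the general shape, not the code in `src/llm/`; `ModelRoute`, `AliasRegistry`, and the provider and model strings are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelRoute:
    """A concrete provider/model pair that an alias points to."""
    provider: str
    model: str


@dataclass
class AliasRegistry:
    """Maps feature-level aliases to routes, with optional fallback aliases."""
    routes: dict[str, ModelRoute]
    fallbacks: dict[str, list[str]] = field(default_factory=dict)

    def resolve(self, alias: str) -> ModelRoute:
        if alias not in self.routes:
            raise KeyError(f"unknown model alias: {alias}")
        return self.routes[alias]

    def candidates(self, alias: str) -> list[ModelRoute]:
        """Primary route first, then any configured fallbacks in order."""
        chain = [alias] + self.fallbacks.get(alias, [])
        return [self.resolve(name) for name in chain]


# Placeholder values; the real aliases would be loaded from the config directory.
registry = AliasRegistry(
    routes={
        "ranker_default": ModelRoute("example-provider", "example-large"),
        "ranker_backup": ModelRoute("example-local", "example-small"),
    },
    fallbacks={"ranker_default": ["ranker_backup"]},
)
```

Feature code would ask only for `ranker_default`; switching providers then means editing configuration rather than call sites.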
### 4) Ingestion & Parsing
- **Parser**: Uses `pymupdf4llm` to convert PDFs to markdown, followed by custom cleaning and heading detection.
- **Identity Resolution**: Determines whether a resume belongs to an existing candidate using deterministic signals (email, phone), with an LLM fallback for name matching.
- **Ingestion Service**: Orchestrates the per-file pipeline (parse -> identify -> extract signals -> persist).
- **Metaflow**: Used for high-volume batch ingestion from local directories.
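The deterministic part of identity resolution can be sketched as follows. This is an illustrative approximation under stated assumptions, not the logic in `src/ingest/`: the `Candidate` shape, the normalization rules, and the email-before-phone precedence are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    id: int
    email: Optional[str] = None
    phone: Optional[str] = None


def normalize_email(email: str) -> str:
    """Lowercase and trim so that case and whitespace do not block a match."""
    return email.strip().lower()


def normalize_phone(phone: str) -> str:
    """Keep digits only, so '+1 (555) 010-9999' and '15550109999' compare equal."""
    return re.sub(r"\D", "", phone)


def match_candidate(
    email: Optional[str], phone: Optional[str], existing: list[Candidate]
) -> Optional[Candidate]:
    """Return an existing candidate matched by email, then by phone, else None."""
    if email:
        wanted = normalize_email(email)
        for cand in existing:
            if cand.email and normalize_email(cand.email) == wanted:
                return cand
    if phone:
        wanted = normalize_phone(phone)
        for cand in existing:
            if cand.phone and wanted and normalize_phone(cand.phone) == wanted:
                return cand
    # No deterministic hit: a caller would fall back to LLM-based name comparison.
    return None
```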
### 5) Retrieval & Ranking
- **Stage 1: Vector Retrieval**: Wide-net search using cosine similarity on resume section embeddings.
- **Stage 2: Deterministic Scorer**: Heuristic filter based on explicit hard-skill overlap between the JD and candidate signals.
- **Stage 3: LLM Reranking**: Qualitative refinement of top candidates, generating fit summaries and gap/risk analysis.
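The first two stages can be approximated in a few lines. The sketch below is a pure-Python stand-in: in the real system the similarity search runs inside PostgreSQL via `pgvector`, and the equal weighting in `hybrid_score` is an assumption made for illustration, not the project's tuned formula.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def skill_overlap(required: list[str], candidate: list[str]) -> float:
    """Fraction of required hard skills the candidate lists (case-insensitive)."""
    req = {s.strip().lower() for s in required}
    if not req:
        return 0.0
    have = {s.strip().lower() for s in candidate}
    return len(req & have) / len(req)


def hybrid_score(similarity: float, overlap: float, weight: float = 0.5) -> float:
    """Blend semantic similarity with skill overlap; the weight is hypothetical."""
    return weight * similarity + (1 - weight) * overlap


def shortlist(job_vec, job_skills, candidates, top_k=5):
    """Rank (name, vector, skills) tuples and return the top_k names."""
    scored = [
        (hybrid_score(cosine_similarity(job_vec, vec), skill_overlap(job_skills, skills)), name)
        for name, vec, skills in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]
```

Only the names returned by a shortlist like this would be passed on to the more expensive LLM reranking stage.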
### 6) API & UI (Interface Layer)
- **FastAPI**: Provides endpoints for job management, resume uploads, and ranking triggers.
- **AsyncTask Runner**: A zero-dependency worker that executes long-running functions using FastAPI's `BackgroundTasks` while updating the `async_tasks` table for UI polling.
- **Streamlit**: A dashboard for recruiters to upload files, manage job postings, and review/export candidate rankings.
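The task-runner pattern can be sketched without any framework. In this hypothetical version an in-memory dict stands in for the `async_tasks` table, and the status names (`pending`, `running`, `completed`, `failed`) are assumptions; in the application, a wrapper like `run_task` would be scheduled through FastAPI's `BackgroundTasks.add_task`.

```python
import uuid
from typing import Any, Callable

# Stand-in for the async_tasks table: task_id -> row.
TaskStore = dict[str, dict[str, Any]]


def create_task(store: TaskStore) -> str:
    """Register a new task row and return its ID for the client to poll."""
    task_id = uuid.uuid4().hex
    store[task_id] = {"status": "pending", "result": None, "error": None}
    return task_id


def run_task(store: TaskStore, task_id: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> None:
    """Execute fn and record its outcome so a poller can observe progress."""
    store[task_id]["status"] = "running"
    try:
        store[task_id]["result"] = fn(*args, **kwargs)
        store[task_id]["status"] = "completed"
    except Exception as exc:  # record the failure instead of crashing the worker
        store[task_id]["error"] = str(exc)
        store[task_id]["status"] = "failed"
```

Because the status is written to shared storage rather than held in the worker, the UI can poll for it from a separate process.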
---
## Developer Runbook
### Setup
```bash
uv sync --all-extras
cp .env.example .env
# Edit .env with your database and API keys
```
### Database Operations
```bash
uv run alembic upgrade head
```
### Running the App
1. **Start Backend**: `uv run uvicorn src.api.app:app`
2. **Start Frontend**: `uv run streamlit run ui/app.py`
### Testing
```bash
uv run pytest -q
# For database integration tests (requires Docker)
uv run pytest -q -m integration
```
---
## Technical Contracts
### Public REST API
- `POST /api/ingest/upload`: Upload PDF resumes.
- `POST /api/jobs`: Create a job and extract requirements.
- `POST /api/jobs/{id}/rank`: Start the ranking process.
- `GET /api/tasks/{id}`: Poll for background task status.
- `GET /api/matches`: List ranked candidates for a job.
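A client of these endpoints typically starts a job and then polls `GET /api/tasks/{id}` until it finishes. The helper below is a minimal sketch of that loop; the response shape (a `status` field with terminal values `completed` and `failed`) is an assumption, and `fetch_status` is a hypothetical callable that would wrap the actual HTTP request.

```python
import time
from typing import Any, Callable


def poll_task(
    fetch_status: Callable[[str], dict[str, Any]],
    task_id: str,
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_status until the task reaches a terminal state or attempts run out."""
    for _ in range(max_attempts):
        payload = fetch_status(task_id)
        if payload.get("status") in {"completed", "failed"}:
            return payload
        sleep(interval)  # wait before asking again
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a running server.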
### Model Alias System
Aliases are defined in `config/model_aliases.yaml`:
- `embedding_default`: Used for semantic indexing.
- `extractor_default`: Used for structured signal extraction.
- `ranker_default`: Used for qualitative reranking and explanations.
- `explainer_default`: Used for interview prep pack generation.