dbhavery committed on
Commit
227eab8
·
verified ·
1 Parent(s): 8fcc47f

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +281 -79
README.md CHANGED
@@ -1,79 +1,281 @@
- ---
- library_name: custom
- tags:
- - vector-search
- - hnsw
- - nearest-neighbor
- - information-retrieval
- - from-scratch
- license: mit
- ---
-
- # HNSW Vector Engine
-
- A zero-dependency implementation of Hierarchical Navigable Small World (HNSW) approximate nearest-neighbor search, built directly from the [Malkov & Yashunin 2018 paper](https://arxiv.org/abs/1603.09320).
-
- Part of [Citadel](https://github.com/dbhavery/citadel), an open-source AI operations platform.
-
- ## Why From Scratch?
-
- Most vector search tools wrap existing libraries (FAISS, Annoy, HNSWlib). This implementation builds the full HNSW index from first principles (multi-layer graph construction, greedy search with backtracking, configurable M/ef parameters) to demonstrate deep understanding of the algorithm, not just API usage.
-
- ## Features
-
- - **Multi-layer graph** with probabilistic level assignment
- - **Greedy beam search** with configurable ef (exploration factor)
- - **Cosine similarity** distance metric
- - **Batch insert and query** with REST API wrapper
- - **Persistent storage** to disk
- - **18 tests** covering index construction, search accuracy, edge cases
-
- ## Parameters
-
- | Parameter | Default | Description |
- |-----------|---------|-------------|
- | `M` | 16 | Max connections per node per layer |
- | `ef_construction` | 200 | Beam width during index building |
- | `ef_search` | 50 | Beam width during query |
- | `max_elements` | 10000 | Pre-allocated index size |
-
- ## Usage
-
- ```python
- from citadel_vector import VectorStore
-
- store = VectorStore(collection="docs", dim=384)
- store.add(vectors=embeddings, metadata=metadata)
- results = store.search(query_vector, k=10)
- ```
-
- ## Architecture
-
- ```
- Query Vector
-      |
-      v
- [Top Layer] -- sparse graph, long-range connections
-      |
-      v
- [Layer N-1] -- denser graph
-      |
-      v
- [Layer 0] -- full graph, all nodes, local connections
-      |
-      v
- Top-K Results (sorted by cosine similarity)
- ```
-
- ## Part of Citadel
-
- This vector engine is one of 6 independently installable packages in the Citadel AI Operations Platform:
-
- - **citadel-gateway** — LLM proxy with routing, caching, circuit breakers
- - **citadel-vector** — This package (HNSW vector search)
- - **citadel-agents** — ReAct agent runtime with tool registry
- - **citadel-ingest** — Document parsing and chunking pipeline
- - **citadel-trace** — LLM observability and cost tracking
- - **citadel-dashboard** — Real-time operations UI
-
- [GitHub Repository](https://github.com/dbhavery/citadel) | [Author](https://github.com/dbhavery)
+ ---
+ library_name: custom
+ license: mit
+ tags:
+ - vector-search
+ - hnsw
+ - nearest-neighbor
+ - information-retrieval
+ - from-scratch
+ - approximate-nearest-neighbor
+ language:
+ - en
+ pipeline_tag: feature-extraction
+ ---
+
+ # HNSW Vector Engine -- Zero-Dependency Approximate Nearest Neighbor Search
+
+ A from-scratch implementation of the HNSW (Hierarchical Navigable Small World) algorithm for approximate nearest neighbor search. No FAISS. No ChromaDB. No Annoy. Just the algorithm described in the Malkov & Yashunin 2018 paper, implemented directly in Python with NumPy.
+
+ This is the vector search engine from the [Citadel](https://github.com/dbhavery/citadel) AI operations platform, extracted here as a standalone reference. Citadel uses it as its built-in vector index -- no external vector database required.
+
+ **Paper**: [Efficient and robust approximate nearest neighbor using Hierarchical Navigable Small World graphs](https://arxiv.org/abs/1603.09320) (Malkov & Yashunin, 2018)
+
+ **Source**: [github.com/dbhavery/citadel](https://github.com/dbhavery/citadel) -- see `packages/citadel-vector/`
+
+ ---
+
+ ## Why Build HNSW From Scratch
+
+ Most projects that need vector search reach for a prebuilt library -- FAISS, Hnswlib, ChromaDB, Pinecone. These are excellent tools, but they are also opaque. When your recall drops, when your index gets corrupted, or when performance degrades on a specific data distribution, you are debugging a black box.
+
+ Building HNSW from the paper forces you to understand every decision the algorithm makes: how layers are assigned, why the greedy search converges, what happens when you prune neighbor lists, and how entry point selection affects traversal. That understanding is the difference between using a tool and knowing a tool.
+
+ This implementation prioritizes clarity and correctness over raw speed. It is intended as a readable, well-documented reference that you can study, modify, and extend. For production workloads at scale, FAISS or Hnswlib will be faster thanks to their C++/SIMD cores. For datasets in the tens-of-thousands to low-millions range -- which covers most RAG pipelines, recommendation engines, and local AI applications -- this implementation is both fast enough and transparent enough to use directly.
+
+ ---
+
+ ## How HNSW Works
+
+ HNSW builds a multi-layer navigable graph over the dataset. The core insight is borrowed from skip lists: maintain multiple layers of the same graph at decreasing density, so that search starts with coarse, long-range hops at the top and refines to precise, short-range hops at the bottom.
+
+ ### The Multi-Layer Structure
+
+ ```
+ Layer 3 (sparsest): [A] ------------------------- [F]
+                      |                             |
+ Layer 2:            [A] ------- [C] ------------- [F] ------- [H]
+                      |           |                 |           |
+ Layer 1:            [A] - [B] - [C] ------- [E] - [F] - [G] - [H]
+                      |     |     |           |     |     |     |
+ Layer 0 (densest):  [A] - [B] - [C] - [D] - [E] - [F] - [G] - [H] - [I] - [J]
+ ```
+
+ Every node exists in layer 0. Each higher layer contains a geometrically decreasing random subset of nodes. When a new vector is inserted, its maximum layer is drawn from an exponential distribution: `level = floor(-ln(uniform()) * mL)`, where `mL = 1/ln(M)` and `M` is the maximum connections per node. Most nodes land on layer 0, a few reach layer 1, and even fewer reach layer 2. This gives the probabilistic skip-list property without any explicit balancing.
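The level-assignment formula above fits in a few lines of plain Python. This is an illustrative sketch (`sample_level` is a hypothetical name, not part of the package API):

```python
import math
import random

def sample_level(M: int = 16) -> int:
    """Draw a node's maximum layer: floor(-ln(uniform) * mL) with mL = 1/ln(M)."""
    mL = 1.0 / math.log(M)
    u = 1.0 - random.random()          # uniform in (0, 1], avoids log(0)
    return math.floor(-math.log(u) * mL)

# P(level >= 1) = 1/M, so with M = 16 roughly 94% of nodes stay on layer 0.
levels = [sample_level(16) for _ in range(10_000)]
```

Each extra layer thins the population by another factor of M, which produces exactly the geometric decay shown in the diagram.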
+
+ ### Insertion Algorithm
+
+ When inserting a new vector `q` at assigned level `L`:
+
+ 1. **Top-down greedy phase**: Starting from the global entry point at the highest layer, perform greedy search (beam width = 1) down to layer `L+1`. At each layer, find the single nearest neighbor to `q` and use it as the entry point for the next layer down. This quickly navigates to the region of the graph closest to `q`.
+
+ 2. **Connection phase**: From layer `min(L, max_layer)` down to layer 0, perform beam search with `ef_construction` candidates. Select the `M` nearest neighbors from the candidates and create bidirectional edges between `q` and each selected neighbor.
+
+ 3. **Pruning**: If any neighbor now exceeds its maximum connection count (`M` for layers > 0, `2*M` for layer 0), prune its edge list back down to that maximum, keeping the nearest connections. Layer 0 is allowed double the connections because it carries all the data and needs higher connectivity for recall.
+
+ 4. **Entry point update**: If `L` exceeds the current maximum layer, `q` becomes the new global entry point.
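The top-down greedy phase that both insertion and search share can be sketched as follows. This is a simplified illustration under assumed data structures (`graph` as a dict of per-layer adjacency dicts); the name `greedy_descend` is hypothetical:

```python
def greedy_descend(graph, vectors, q, entry, top_layer, target_layer, dist):
    """Greedy search with beam width 1: at each layer, hop to any neighbor
    closer to q until none improves, then drop down. Stops above target_layer."""
    current = entry
    for layer in range(top_layer, target_layer, -1):
        improved = True
        while improved:
            improved = False
            for nb in graph[layer].get(current, []):
                if dist(vectors[nb], q) < dist(vectors[current], q):
                    current, improved = nb, True
    return current
```

Because each upper layer is sparse, every hop covers a large fraction of the space, which is what makes this phase O(log n) in expectation.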
+
+ ### Search Algorithm
+
+ Searching for the `k` nearest neighbors of a query vector `q`:
+
+ 1. **Top-down traversal**: Starting from the entry point, greedily descend from the top layer to layer 1, keeping only the single closest node at each layer. This narrows the search to the right neighborhood in O(log n) hops.
+
+ 2. **Layer 0 beam search**: At layer 0, perform a beam search with width `ef_search` (a tunable parameter). The search maintains two heaps -- a min-heap of candidates to expand, and a max-heap of the current best results. At each step, pop the closest unexpanded candidate, examine its neighbors, and add any neighbor closer than the current worst result to both heaps.
+
+ 3. **Termination**: The search stops when the closest unexpanded candidate is farther than the worst result in the beam and the beam is full. Return the top `k` results from the beam.
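The two-heap beam search at layer 0 can be sketched like this (an illustrative version with hypothetical names; the package's real implementation also has to handle deleted nodes and metadata filters):

```python
import heapq

def search_layer0(graph0, vectors, q, entry, ef, dist):
    """Beam search over the layer-0 adjacency dict with beam width ef."""
    d0 = dist(vectors[entry], q)
    candidates = [(d0, entry)]    # min-heap: closest unexpanded candidate first
    best = [(-d0, entry)]         # max-heap via negation: current ef best results
    visited = {entry}
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                 # closest candidate is worse than the worst result
        for nb in graph0.get(node, []):
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(vectors[nb], q)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # evict the current worst result
    return sorted((-negd, n) for negd, n in best)
```

Returning the `k` nearest is then just taking the first `k` entries of the sorted beam.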
+
+ ### Key Parameters
+
+ | Parameter | Default | Role |
+ |-----------|---------|------|
+ | `M` | 16 | Max edges per node per layer. Higher M = better recall, more memory, slower insertion. |
+ | `ef_construction` | 200 | Beam width during insertion. Higher = better graph quality, slower build. |
+ | `ef_search` | 50 | Beam width during search. Higher = better recall, slower queries. This is the primary recall/speed knob at query time. |
+
+ ---
+
+ ## Performance Characteristics
+
+ | Operation | Time Complexity | Notes |
+ |-----------|----------------|-------|
+ | Insert | O(log n * ef_construction * M) | Dominated by the beam search at each layer during connection. |
+ | Search | O(log n * ef_search * M) | Top-down traversal is O(log n) layers; layer 0 search is bounded by ef_search. |
+ | Delete | O(1) | Lazy deletion -- marks the node, does not rebuild the graph. Deleted nodes are still traversed but excluded from results. |
+ | Memory | O(n * M * layers) | Each node stores up to M neighbor IDs per layer (2*M at layer 0). The expected number of layers per node is 1 + 1/ln(M). |
+
+ For typical configurations (M=16, ef_construction=200, ef_search=50), this implementation achieves **>90% recall@10** against brute-force search on 1,000 random 64-dimensional vectors, verified by the test suite.
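A brute-force ground truth makes that recall figure easy to reproduce. The sketch below shows only the measurement side (the `recall_at_k` helper is illustrative, not package API):

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Exact top-10 under cosine distance for one query over 1,000 x 64 vectors
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64))
data /= np.linalg.norm(data, axis=1, keepdims=True)
q = rng.standard_normal(64)
q /= np.linalg.norm(q)
exact = np.argsort(1.0 - data @ q)[:10]   # smallest cosine distance first
```

Feeding an index's result IDs together with `exact` into `recall_at_k` yields recall@10 directly; averaging over many queries gives the reported figure.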
+
+ ---
+
+ ## Supported Distance Metrics
+
+ | Metric | Function | Range | Use Case |
+ |--------|----------|-------|----------|
+ | `cosine` | `1 - cos(a, b)` | [0, 2] | Normalized embeddings (most embedding models). Default. |
+ | `euclidean` | `\|\|a - b\|\|_2` | [0, inf) | Raw feature vectors, spatial data. |
+ | `dot` | `-dot(a, b)` | (-inf, inf) | Maximum inner product search (MIPS). |
+
+ All distance functions are implemented with NumPy for vectorized computation. Batch variants (`batch_cosine_distance`, `batch_euclidean_distance`) are provided for operations over multiple vectors.
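As a sketch of what the scalar and batch cosine variants look like in NumPy (illustrative versions, not the exact source):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(a, b): 0 for identical directions, 2 for opposite ones."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def batch_cosine_distance(A, b):
    """Cosine distance from every row of A to b in one vectorized pass."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    return 1.0 - An @ (b / np.linalg.norm(b))
```

The batch form normalizes once and reduces the whole computation to a single matrix-vector product, which is why beam-search candidate scoring stays cheap.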
+
+ ---
+
+ ## Usage
+
+ ### Basic Index Operations
+
+ ```python
+ import numpy as np
+ from citadel_vector import HNSWIndex
+
+ # Create an index for 384-dimensional vectors (e.g., sentence-transformers output)
+ index = HNSWIndex(
+     dim=384,
+     max_elements=50_000,
+     M=16,
+     ef_construction=200,
+     metric="cosine",
+ )
+
+ # Insert vectors with metadata
+ index.add(
+     vector=np.random.randn(384),
+     id="doc_001",
+     metadata={"source": "readme.md", "chunk": 0},
+ )
+
+ # Batch insert
+ vectors = np.random.randn(100, 384)
+ ids = [f"doc_{i:03d}" for i in range(100)]
+ metadatas = [{"source": f"file_{i}.txt"} for i in range(100)]
+ index.batch_add(vectors, ids, metadatas)
+
+ # Search
+ query = np.random.randn(384)
+ results = index.search(query, k=5, ef_search=100)
+ for doc_id, distance, metadata in results:
+     print(f"  {doc_id}: distance={distance:.4f}, source={metadata['source']}")
+
+ # Filtered search -- only return results matching a predicate
+ results = index.search(
+     query,
+     k=5,
+     filter_fn=lambda meta: meta is not None and meta.get("source") == "readme.md",
+ )
+
+ # Lazy deletion
+ index.delete("doc_001")
+ assert "doc_001" not in index
+ ```
+
+ ### Persistent Storage (VectorStore)
+
+ ```python
+ import numpy as np
+ from citadel_vector import VectorStore
+
+ # Create a persistent store -- data is saved to disk
+ store = VectorStore(path="./my_vectors", dim=384, metric="cosine")
+
+ # Add vectors (metadata is persisted to SQLite)
+ store.add(np.random.randn(384), "doc_1", metadata={"title": "Introduction"})
+ store.add(np.random.randn(384), "doc_2", metadata={"title": "Methods"})
+
+ # Save the index (vectors as .npy, graph as JSON, metadata in SQLite)
+ store.save()
+
+ # Load from disk in another process
+ store = VectorStore.load("./my_vectors")
+ results = store.search(np.random.randn(384), k=5)
+
+ # Inspect index statistics
+ print(store.stats())
+ # {'count': 2, 'dim': 384, 'metric': 'cosine', 'max_elements': 100000, 'layers': 1, ...}
+ ```
+
+ ### REST API Server
+
+ ```python
+ # Start the server from the command line:
+ #   citadel-vector serve --port 8082
+
+ # Or programmatically:
+ from citadel_vector.server import create_app
+ app = create_app(storage_dir="./vector_data")
+ ```
+
+ ```bash
+ # Create a collection
+ curl -X POST http://localhost:8082/collections \
+   -H "Content-Type: application/json" \
+   -d '{"name": "documents", "dim": 384, "metric": "cosine"}'
+
+ # Add vectors
+ curl -X POST http://localhost:8082/collections/documents/add \
+   -H "Content-Type: application/json" \
+   -d '{"vectors": [[0.1, 0.2, ...]], "ids": ["doc_1"], "metadatas": [{"title": "test"}]}'
+
+ # Search
+ curl -X POST http://localhost:8082/collections/documents/search \
+   -H "Content-Type: application/json" \
+   -d '{"query": [0.1, 0.2, ...], "k": 5}'
+ ```
+
+ ---
+
+ ## Architecture
+
+ ```
+ citadel-vector/
+   citadel_vector/
+     __init__.py        # Public API: HNSWIndex, VectorStore, distance functions
+     hnsw.py            # Core HNSW algorithm -- graph construction, beam search, deletion
+     distance.py        # Distance metrics: cosine, euclidean, dot product (scalar + batch)
+     storage.py         # Persistent storage: NumPy arrays + JSON graph + SQLite metadata
+     config.py          # VectorConfig dataclass with tunable defaults
+     server.py          # Optional FastAPI REST server for HTTP access
+   tests/
+     test_hnsw.py       # Unit tests + recall benchmark (>90% recall@10 on 1000 vectors)
+     test_distance.py   # Distance function correctness tests
+     test_storage.py    # Persistence round-trip tests
+ ```
+
+ ### Storage Format
+
+ The `VectorStore` persists three files:
+
+ | File | Format | Contents |
+ |------|--------|----------|
+ | `vectors.npy` | NumPy binary | All vector data as a float64 array |
+ | `graph.json` | JSON | Graph adjacency lists, node levels, entry point, index parameters |
+ | `metadata.db` | SQLite | Key-value metadata store (one row per vector ID) |
+
+ This separation means vectors can be memory-mapped for large datasets, the graph structure is human-readable for debugging, and metadata queries can use SQL.
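For instance, an `.npy` vector file can be memory-mapped so that large datasets never need to be fully resident in RAM. This is a generic NumPy sketch, independent of the VectorStore API:

```python
import numpy as np
import os
import tempfile

# Write a stand-in vectors.npy, then reopen it lazily via memory mapping.
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, np.zeros((1000, 64)))

vectors = np.load(path, mmap_mode="r")   # pages are read from disk on demand
row = np.asarray(vectors[42])            # touching a row pulls in only that region
```

With `mmap_mode="r"` the array is read-only and backed by the file, so random access during graph traversal costs one page fault per cold row rather than a full load.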
+
+ ---
+
+ ## Dependencies
+
+ - **Python 3.10+**
+ - **NumPy** -- the only required dependency. All distance calculations and vector storage use NumPy arrays.
+ - **FastAPI + Uvicorn** (optional) -- only needed for the REST server. Install with `pip install citadel-vector[server]`.
+
+ No C extensions. No compiled binaries. No CUDA. Pure Python + NumPy.
+
+ ---
+
+ ## Relationship to Citadel
+
+ This HNSW engine is one of six packages in the [Citadel](https://github.com/dbhavery/citadel) platform:
+
+ | Package | Purpose |
+ |---------|---------|
+ | **citadel-vector** (this) | HNSW vector search engine |
+ | citadel-gateway | LLM provider proxy with caching, rate limiting, circuit breaking |
+ | citadel-agents | YAML-defined ReAct agents with tool use and multi-agent orchestration |
+ | citadel-ingest | Document ingestion pipeline (parse, chunk, embed, deduplicate) |
+ | citadel-trace | LLM observability -- spans, cost tracking, latency metrics, alerts |
+ | citadel-dashboard | Single-file HTML operations dashboard |
+
+ Each package is independently installable and usable. The ingest pipeline writes to this vector store, the agent framework reads from it for RAG, and the gateway routes LLM calls for embedding generation. But none of these connections are required -- `citadel-vector` works entirely on its own.
+
+ ---
+
+ ## References
+
+ - Malkov, Y. A., & Yashunin, D. A. (2018). *Efficient and robust approximate nearest neighbor using Hierarchical Navigable Small World graphs*. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836. [arXiv:1603.09320](https://arxiv.org/abs/1603.09320)
+
+ ---
+
+ ## License
+
+ MIT License. See [LICENSE](https://github.com/dbhavery/citadel/blob/main/LICENSE) for details.