dbhavery committed on
Commit
227eab8
·
verified ·
1 Parent(s): 8fcc47f

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +281 -79
README.md CHANGED
@@ -1,79 +1,281 @@
- ---
- library_name: custom
- tags:
- - vector-search
- - hnsw
- - nearest-neighbor
- - information-retrieval
- - from-scratch
- license: mit
- ---
-
- # HNSW Vector Engine
-
- A zero-dependency implementation of Hierarchical Navigable Small World (HNSW) approximate nearest-neighbor search, built directly from the [Malkov & Yashunin 2018 paper](https://arxiv.org/abs/1603.09320).
-
- Part of [Citadel](https://github.com/dbhavery/citadel), an open-source AI operations platform.
-
- ## Why From Scratch?
-
- Most vector search tools wrap existing libraries (FAISS, Annoy, HNSWlib). This implementation builds the full HNSW index from first principles (multi-layer graph construction, greedy search with backtracking, configurable M/ef parameters) to demonstrate deep understanding of the algorithm, not just API usage.
-
- ## Features
-
- - **Multi-layer graph** with probabilistic level assignment
- - **Greedy beam search** with configurable ef (exploration factor)
- - **Cosine similarity** distance metric
- - **Batch insert and query** with REST API wrapper
- - **Persistent storage** to disk
- - **18 tests** covering index construction, search accuracy, edge cases
-
- ## Parameters
-
- | Parameter | Default | Description |
- |-----------|---------|-------------|
- | `M` | 16 | Max connections per node per layer |
- | `ef_construction` | 200 | Beam width during index building |
- | `ef_search` | 50 | Beam width during query |
- | `max_elements` | 10000 | Pre-allocated index size |
-
- ## Usage
-
- ```python
- from citadel_vector import VectorStore
-
- store = VectorStore(collection="docs", dim=384)
- store.add(vectors=embeddings, metadata=metadata)
- results = store.search(query_vector, k=10)
- ```
-
- ## Architecture
-
- ```
- Query Vector
-      |
-      v
- [Top Layer] -- sparse graph, long-range connections
-      |
-      v
- [Layer N-1] -- denser graph
-      |
-      v
- [Layer 0] -- full graph, all nodes, local connections
-      |
-      v
- Top-K Results (sorted by cosine similarity)
- ```
-
- ## Part of Citadel
-
- This vector engine is one of 6 independently installable packages in the Citadel AI Operations Platform:
-
- - **citadel-gateway** — LLM proxy with routing, caching, circuit breakers
- - **citadel-vector** — This package (HNSW vector search)
- - **citadel-agents** — ReAct agent runtime with tool registry
- - **citadel-ingest** — Document parsing and chunking pipeline
- - **citadel-trace** — LLM observability and cost tracking
- - **citadel-dashboard** — Real-time operations UI
-
- [GitHub Repository](https://github.com/dbhavery/citadel) | [Author](https://github.com/dbhavery)
+ ---
+ library_name: custom
+ license: mit
+ tags:
+ - vector-search
+ - hnsw
+ - nearest-neighbor
+ - information-retrieval
+ - from-scratch
+ - approximate-nearest-neighbor
+ language:
+ - en
+ pipeline_tag: feature-extraction
+ ---
+
+ # HNSW Vector Engine -- Zero-Dependency Approximate Nearest Neighbor Search
+
+ A from-scratch implementation of the HNSW (Hierarchical Navigable Small World) algorithm for approximate nearest neighbor search. No FAISS. No ChromaDB. No Annoy. Just the algorithm described in the Malkov & Yashunin 2018 paper, implemented directly in Python with NumPy.
+
+ This is the vector search engine from the [Citadel](https://github.com/dbhavery/citadel) AI operations platform, extracted here as a standalone reference. Citadel uses it as its built-in vector index -- no external vector database required.
+
+ **Paper**: [Efficient and robust approximate nearest neighbor using Hierarchical Navigable Small World graphs](https://arxiv.org/abs/1603.09320) (Malkov & Yashunin, 2018)
+
+ **Source**: [github.com/dbhavery/citadel](https://github.com/dbhavery/citadel) -- see `packages/citadel-vector/`
+
+ ---
+
+ ## Why Build HNSW From Scratch
+
+ Most projects that need vector search reach for a prebuilt library -- FAISS, Hnswlib, ChromaDB, Pinecone. These are excellent tools, but they are also opaque. When your recall drops, when your index gets corrupted, or when performance degrades on a specific data distribution, you are debugging a black box.
+
+ Building HNSW from the paper forces you to understand every decision the algorithm makes: how layers are assigned, why the greedy search converges, what happens when you prune neighbor lists, and how entry point selection affects traversal. That understanding is the difference between using a tool and knowing a tool.
+
+ This implementation prioritizes clarity and correctness over raw speed. It is intended as a readable, well-documented reference that you can study, modify, and extend. For production workloads at scale, FAISS or Hnswlib will be faster thanks to their C++/SIMD cores. For datasets in the tens-of-thousands to low-millions range -- which covers most RAG pipelines, recommendation engines, and local AI applications -- this implementation is both fast enough and transparent enough to use directly.
+
+ ---
+
+ ## How HNSW Works
+
+ HNSW builds a multi-layer navigable graph over the dataset. The core insight is borrowed from skip lists: maintain multiple layers of the same graph at decreasing density, so that search starts with coarse, long-range hops at the top and refines to precise, short-range hops at the bottom.
+
+ ### The Multi-Layer Structure
+
+ ```
+ Layer 3 (sparsest): [A] ------------------------- [F]
+                      |                             |
+ Layer 2:            [A] ------- [C] ------------- [F] ------- [H]
+                      |           |                 |           |
+ Layer 1:            [A] - [B] - [C] ------- [E] - [F] - [G] - [H]
+                      |     |     |           |     |     |     |
+ Layer 0 (densest):  [A] - [B] - [C] - [D] - [E] - [F] - [G] - [H] - [I] - [J]
+ ```
+
+ Every node exists in layer 0. Each higher layer contains a geometrically decreasing random subset of nodes. When a new vector is inserted, its maximum layer is drawn from an exponential distribution: `level = floor(-ln(uniform()) * mL)`, where `mL = 1/ln(M)` and `M` is the maximum connections per node. Most nodes land on layer 0, a few reach layer 1, and even fewer reach layer 2. This gives the probabilistic skip-list property without any explicit balancing.
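The level-assignment formula above fits in a few lines of plain Python. This is an illustrative sketch (`sample_level` is a hypothetical name, not part of the package API):

```python
import math
import random

def sample_level(M: int = 16) -> int:
    """Draw a node's maximum layer: floor(-ln(uniform) * mL) with mL = 1/ln(M)."""
    mL = 1.0 / math.log(M)
    u = 1.0 - random.random()          # uniform in (0, 1], avoids log(0)
    return math.floor(-math.log(u) * mL)

# P(level >= 1) = 1/M, so with M = 16 roughly 94% of nodes stay on layer 0.
levels = [sample_level(16) for _ in range(10_000)]
```

Each extra layer thins the population by another factor of M, which produces exactly the geometric decay shown in the diagram.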
+
+ ### Insertion Algorithm
+
+ When inserting a new vector `q` at assigned level `L`:
+
+ 1. **Top-down greedy phase**: Starting from the global entry point at the highest layer, perform greedy search (beam width = 1) down to layer `L+1`. At each layer, find the single nearest neighbor to `q` and use it as the entry point for the next layer down. This quickly navigates to the region of the graph closest to `q`.
+
+ 2. **Connection phase**: From layer `min(L, max_layer)` down to layer 0, perform beam search with `ef_construction` candidates. Select the `M` nearest neighbors from the candidates and create bidirectional edges between `q` and each selected neighbor.
+
+ 3. **Pruning**: If any neighbor now exceeds its maximum connection count (`M` for layers > 0, `2*M` for layer 0), prune its edge list back down to that maximum, keeping the nearest connections. Layer 0 is allowed double the connections because it carries all the data and needs higher connectivity for recall.
+
+ 4. **Entry point update**: If `L` exceeds the current maximum layer, `q` becomes the new global entry point.
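The top-down greedy phase that both insertion and search share can be sketched as follows. This is a simplified illustration under assumed data structures (`graph` as a dict of per-layer adjacency dicts); the name `greedy_descend` is hypothetical:

```python
def greedy_descend(graph, vectors, q, entry, top_layer, target_layer, dist):
    """Greedy search with beam width 1: at each layer, hop to any neighbor
    closer to q until none improves, then drop down. Stops above target_layer."""
    current = entry
    for layer in range(top_layer, target_layer, -1):
        improved = True
        while improved:
            improved = False
            for nb in graph[layer].get(current, []):
                if dist(vectors[nb], q) < dist(vectors[current], q):
                    current, improved = nb, True
    return current
```

Because each upper layer is sparse, every hop covers a large fraction of the space, which is what makes this phase O(log n) in expectation.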
+
+ ### Search Algorithm
+
+ Searching for the `k` nearest neighbors of a query vector `q`:
+
+ 1. **Top-down traversal**: Starting from the entry point, greedily descend from the top layer to layer 1, keeping only the single closest node at each layer. This narrows the search to the right neighborhood in O(log n) hops.
+
+ 2. **Layer 0 beam search**: At layer 0, perform a beam search with width `ef_search` (a tunable parameter). The search maintains two heaps -- a min-heap of candidates to expand, and a max-heap of the current best results. At each step, pop the closest unexpanded candidate, examine its neighbors, and add any neighbor closer than the current worst result to both heaps.
+
+ 3. **Termination**: The search stops when the closest unexpanded candidate is farther than the worst result in the beam and the beam is full. Return the top `k` results from the beam.
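The two-heap beam search at layer 0 can be sketched like this (an illustrative version with hypothetical names; the package's real implementation also has to handle deleted nodes and metadata filters):

```python
import heapq

def search_layer0(graph0, vectors, q, entry, ef, dist):
    """Beam search over the layer-0 adjacency dict with beam width ef."""
    d0 = dist(vectors[entry], q)
    candidates = [(d0, entry)]    # min-heap: closest unexpanded candidate first
    best = [(-d0, entry)]         # max-heap via negation: current ef best results
    visited = {entry}
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                 # closest candidate is worse than the worst result
        for nb in graph0.get(node, []):
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(vectors[nb], q)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # evict the current worst result
    return sorted((-negd, n) for negd, n in best)
```

Returning the `k` nearest is then just taking the first `k` entries of the sorted beam.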
+
+ ### Key Parameters
+
+ | Parameter | Default | Role |
+ |-----------|---------|------|
+ | `M` | 16 | Max edges per node per layer. Higher M = better recall, more memory, slower insertion. |
+ | `ef_construction` | 200 | Beam width during insertion. Higher = better graph quality, slower build. |
+ | `ef_search` | 50 | Beam width during search. Higher = better recall, slower queries. This is the primary recall/speed knob at query time. |
+
+ ---
+
+ ## Performance Characteristics
+
+ | Operation | Time Complexity | Notes |
+ |-----------|----------------|-------|
+ | Insert | O(log n * ef_construction * M) | Dominated by the beam search at each layer during connection. |
+ | Search | O(log n * ef_search * M) | Top-down traversal is O(log n) layers; layer 0 search is bounded by ef_search. |
+ | Delete | O(1) | Lazy deletion -- marks the node, does not rebuild the graph. Deleted nodes are still traversed but excluded from results. |
+ | Memory | O(n * M * layers) | Each node stores up to M neighbor IDs per layer (2*M at layer 0). The expected number of layers per node is 1 + 1/ln(M). |
+
+ For typical configurations (M=16, ef_construction=200, ef_search=50), this implementation achieves **>90% recall@10** against brute-force search on 1,000 random 64-dimensional vectors, verified by the test suite.
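A brute-force ground truth makes that recall figure easy to reproduce. The sketch below shows only the measurement side (the `recall_at_k` helper is illustrative, not package API):

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Exact top-10 under cosine distance for one query over 1,000 x 64 vectors
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64))
data /= np.linalg.norm(data, axis=1, keepdims=True)
q = rng.standard_normal(64)
q /= np.linalg.norm(q)
exact = np.argsort(1.0 - data @ q)[:10]   # smallest cosine distance first
```

Feeding an index's result IDs together with `exact` into `recall_at_k` yields recall@10 directly; averaging over many queries gives the reported figure.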
+
+ ---
+
+ ## Supported Distance Metrics
+
+ | Metric | Function | Range | Use Case |
+ |--------|----------|-------|----------|
+ | `cosine` | `1 - cos(a, b)` | [0, 2] | Normalized embeddings (most embedding models). Default. |
+ | `euclidean` | `\|\|a - b\|\|_2` | [0, inf) | Raw feature vectors, spatial data. |
+ | `dot` | `-dot(a, b)` | (-inf, inf) | Maximum inner product search (MIPS). |
+
+ All distance functions are implemented with NumPy for vectorized computation. Batch variants (`batch_cosine_distance`, `batch_euclidean_distance`) are provided for operations over multiple vectors.
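As a sketch of what the scalar and batch cosine variants look like in NumPy (illustrative versions, not the exact source):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(a, b): 0 for identical directions, 2 for opposite ones."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def batch_cosine_distance(A, b):
    """Cosine distance from every row of A to b in one vectorized pass."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    return 1.0 - An @ (b / np.linalg.norm(b))
```

The batch form normalizes once and reduces the whole computation to a single matrix-vector product, which is why beam-search candidate scoring stays cheap.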
+
+ ---
+
+ ## Usage
+
+ ### Basic Index Operations
+
+ ```python
+ import numpy as np
+ from citadel_vector import HNSWIndex
+
+ # Create an index for 384-dimensional vectors (e.g., sentence-transformers output)
+ index = HNSWIndex(
+     dim=384,
+     max_elements=50_000,
+     M=16,
+     ef_construction=200,
+     metric="cosine",
+ )
+
+ # Insert vectors with metadata
+ index.add(
+     vector=np.random.randn(384),
+     id="doc_001",
+     metadata={"source": "readme.md", "chunk": 0},
+ )
+
+ # Batch insert
+ vectors = np.random.randn(100, 384)
+ ids = [f"doc_{i:03d}" for i in range(100)]
+ metadatas = [{"source": f"file_{i}.txt"} for i in range(100)]
+ index.batch_add(vectors, ids, metadatas)
+
+ # Search
+ query = np.random.randn(384)
+ results = index.search(query, k=5, ef_search=100)
+ for doc_id, distance, metadata in results:
+     print(f"  {doc_id}: distance={distance:.4f}, source={metadata['source']}")
+
+ # Filtered search -- only return results matching a predicate
+ results = index.search(
+     query,
+     k=5,
+     filter_fn=lambda meta: meta is not None and meta.get("source") == "readme.md",
+ )
+
+ # Lazy deletion
+ index.delete("doc_001")
+ assert "doc_001" not in index
+ ```
+
+ ### Persistent Storage (VectorStore)
+
+ ```python
+ import numpy as np
+ from citadel_vector import VectorStore
+
+ # Create a persistent store -- data is saved to disk
+ store = VectorStore(path="./my_vectors", dim=384, metric="cosine")
+
+ # Add vectors (metadata is persisted to SQLite)
+ store.add(np.random.randn(384), "doc_1", metadata={"title": "Introduction"})
+ store.add(np.random.randn(384), "doc_2", metadata={"title": "Methods"})
+
+ # Save the index (vectors as .npy, graph as JSON, metadata in SQLite)
+ store.save()
+
+ # Load from disk in another process
+ store = VectorStore.load("./my_vectors")
+ results = store.search(np.random.randn(384), k=5)
+
+ # Inspect index statistics
+ print(store.stats())
+ # {'count': 2, 'dim': 384, 'metric': 'cosine', 'max_elements': 100000, 'layers': 1, ...}
+ ```
+
+ ### REST API Server
+
+ ```python
+ # Start the server from the command line:
+ #   citadel-vector serve --port 8082
+
+ # Or programmatically:
+ from citadel_vector.server import create_app
+ app = create_app(storage_dir="./vector_data")
+ ```
+
+ ```bash
+ # Create a collection
+ curl -X POST http://localhost:8082/collections \
+   -H "Content-Type: application/json" \
+   -d '{"name": "documents", "dim": 384, "metric": "cosine"}'
+
+ # Add vectors
+ curl -X POST http://localhost:8082/collections/documents/add \
+   -H "Content-Type: application/json" \
+   -d '{"vectors": [[0.1, 0.2, ...]], "ids": ["doc_1"], "metadatas": [{"title": "test"}]}'
+
+ # Search
+ curl -X POST http://localhost:8082/collections/documents/search \
+   -H "Content-Type: application/json" \
+   -d '{"query": [0.1, 0.2, ...], "k": 5}'
+ ```
+
+ ---
+
+ ## Architecture
+
+ ```
+ citadel-vector/
+   citadel_vector/
+     __init__.py        # Public API: HNSWIndex, VectorStore, distance functions
+     hnsw.py            # Core HNSW algorithm -- graph construction, beam search, deletion
+     distance.py        # Distance metrics: cosine, euclidean, dot product (scalar + batch)
+     storage.py         # Persistent storage: NumPy arrays + JSON graph + SQLite metadata
+     config.py          # VectorConfig dataclass with tunable defaults
+     server.py          # Optional FastAPI REST server for HTTP access
+   tests/
+     test_hnsw.py       # Unit tests + recall benchmark (>90% recall@10 on 1000 vectors)
+     test_distance.py   # Distance function correctness tests
+     test_storage.py    # Persistence round-trip tests
+ ```
+
+ ### Storage Format
+
+ The `VectorStore` persists three files:
+
+ | File | Format | Contents |
+ |------|--------|----------|
+ | `vectors.npy` | NumPy binary | All vector data as a float64 array |
+ | `graph.json` | JSON | Graph adjacency lists, node levels, entry point, index parameters |
+ | `metadata.db` | SQLite | Key-value metadata store (one row per vector ID) |
+
+ This separation means vectors can be memory-mapped for large datasets, the graph structure is human-readable for debugging, and metadata queries can use SQL.
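For instance, an `.npy` vector file can be memory-mapped so that large datasets never need to be fully resident in RAM. This is a generic NumPy sketch, independent of the VectorStore API:

```python
import numpy as np
import os
import tempfile

# Write a stand-in vectors.npy, then reopen it lazily via memory mapping.
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, np.zeros((1000, 64)))

vectors = np.load(path, mmap_mode="r")   # pages are read from disk on demand
row = np.asarray(vectors[42])            # touching a row pulls in only that region
```

With `mmap_mode="r"` the array is read-only and backed by the file, so random access during graph traversal costs one page fault per cold row rather than a full load.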
+
+ ---
+
+ ## Dependencies
+
+ - **Python 3.10+**
+ - **NumPy** -- the only required dependency. All distance calculations and vector storage use NumPy arrays.
+ - **FastAPI + Uvicorn** (optional) -- only needed for the REST server. Install with `pip install citadel-vector[server]`.
+
+ No C extensions. No compiled binaries. No CUDA. Pure Python + NumPy.
+
+ ---
+
+ ## Relationship to Citadel
+
+ This HNSW engine is one of six packages in the [Citadel](https://github.com/dbhavery/citadel) platform:
+
+ | Package | Purpose |
+ |---------|---------|
+ | **citadel-vector** (this) | HNSW vector search engine |
+ | citadel-gateway | LLM provider proxy with caching, rate limiting, circuit breaking |
+ | citadel-agents | YAML-defined ReAct agents with tool use and multi-agent orchestration |
+ | citadel-ingest | Document ingestion pipeline (parse, chunk, embed, deduplicate) |
+ | citadel-trace | LLM observability -- spans, cost tracking, latency metrics, alerts |
+ | citadel-dashboard | Single-file HTML operations dashboard |
+
+ Each package is independently installable and usable. The ingest pipeline writes to this vector store, the agent framework reads from it for RAG, and the gateway routes LLM calls for embedding generation. But none of these connections are required -- `citadel-vector` works entirely on its own.
+
+ ---
+
+ ## References
+
+ - Malkov, Y. A., & Yashunin, D. A. (2018). *Efficient and robust approximate nearest neighbor using Hierarchical Navigable Small World graphs*. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836. [arXiv:1603.09320](https://arxiv.org/abs/1603.09320)
+
+ ---
+
+ ## License
+
+ MIT License. See [LICENSE](https://github.com/dbhavery/citadel/blob/main/LICENSE) for details.