---
library_name: custom
license: mit
tags:
- vector-search
- hnsw
- nearest-neighbor
- information-retrieval
- from-scratch
- approximate-nearest-neighbor
language:
- en
pipeline_tag: feature-extraction
---
HNSW Vector Engine -- Zero-Dependency Approximate Nearest Neighbor Search
A from-scratch implementation of the HNSW (Hierarchical Navigable Small World) algorithm for approximate nearest neighbor search. No FAISS. No ChromaDB. No Annoy. Just the algorithm described in the Malkov & Yashunin 2018 paper, implemented directly in Python with NumPy.
This is the vector search engine from the Citadel AI operations platform, extracted here as a standalone reference. Citadel uses it as its built-in vector index -- no external vector database required.
Paper: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (Malkov & Yashunin, 2018)
Source: github.com/dbhavery/citadel -- see packages/citadel-vector/
Why Build HNSW From Scratch
Most projects that need vector search reach for a prebuilt library -- FAISS, Hnswlib, ChromaDB, Pinecone. These are excellent tools. But they are also opaque. When your recall drops, when your index gets corrupted, when performance degrades on a specific data distribution, you are debugging a black box.
Building HNSW from the paper forces you to understand every decision the algorithm makes: how layers are assigned, why the greedy search converges, what happens when you prune neighbor lists, and how the entry point selection affects traversal. That understanding is the difference between using a tool and knowing a tool.
This implementation prioritizes clarity and correctness over raw speed. It is intended as a readable, well-documented reference that you can study, modify, and extend. For production workloads at scale, FAISS or Hnswlib will be faster due to their C++/SIMD cores. For applications in the tens-of-thousands to low-millions range -- which covers most RAG pipelines, recommendation engines, and local AI applications -- this implementation is both fast enough and transparent enough to use directly.
How HNSW Works
HNSW builds a multi-layer navigable graph over the dataset. The core insight is borrowed from skip lists: maintain multiple layers of the same graph at decreasing density, so that search starts with coarse, long-range hops at the top and refines to precise, short-range hops at the bottom.
The Multi-Layer Structure
Layer 3 (sparsest): [A] -------------- [F]
| |
Layer 2: [A] --- [C] ------- [F] --- [H]
| | | |
Layer 1: [A] - [B] - [C] - [E] - [F] - [G] - [H]
| | | | | | |
Layer 0 (densest): [A] [B] - [C] - [D] - [E] - [F] - [G] - [H] - [I] - [J]
Every node exists in layer 0. Each higher layer contains a geometrically decreasing random subset of nodes. When a new vector is inserted, its maximum layer is drawn from an exponential distribution: `level = floor(-ln(uniform()) * mL)`, where `mL = 1/ln(M)` and `M` is the maximum connections per node. Most nodes land on layer 0. A few reach layer 1. Even fewer reach layer 2. This gives the probabilistic skip-list property without any explicit balancing.
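The level-assignment rule can be sketched in a few lines. This is a standalone illustration, not the package's internal code; the function name and seed are hypothetical:

```python
import math
import random

def assign_level(M: int = 16, rng: random.Random = random.Random(42)) -> int:
    """Draw a node's maximum layer: level = floor(-ln(u) * mL), mL = 1/ln(M).
    Uses 1 - rng.random() so u falls in (0, 1] and log() never sees zero."""
    mL = 1.0 / math.log(M)
    return int(-math.log(1.0 - rng.random()) * mL)

# With M=16, P(level = 0) = 1 - 1/M = 93.75%, and each higher
# layer holds roughly 1/M of the nodes in the layer below it.
levels = [assign_level() for _ in range(100_000)]
```

Sampling 100,000 levels confirms the geometric decay: about 15 in 16 nodes stay on layer 0, and layers above ~5 are essentially empty at this scale.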
Insertion Algorithm
When inserting a new vector `q` at assigned level `L`:
1. Top-down greedy phase: Starting from the global entry point at the highest layer, perform greedy search (beam width = 1) down to layer `L+1`. At each layer, find the single nearest neighbor to `q` and use it as the entry point for the next layer down. This quickly navigates to the region of the graph closest to `q`.
2. Connection phase: From layer `min(L, max_layer)` down to layer 0, perform beam search with `ef_construction` candidates. Select the `M` nearest neighbors from the candidates and create bidirectional edges between `q` and each selected neighbor.
3. Pruning: If any neighbor now exceeds its maximum connection count (`M` for layers > 0, `2*M` for layer 0), prune its edge list down to the `M` nearest connections. Layer 0 is allowed double the connections because it carries all the data and needs higher connectivity for recall.
4. Entry point update: If `L` exceeds the current maximum layer, `q` becomes the new global entry point.
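The pruning step can be illustrated with a hypothetical helper (not the package's internal API). Note that this sketch keeps the plain nearest neighbors; the paper's Algorithm 4 adds a diversity heuristic that is omitted here:

```python
import numpy as np

def prune_neighbors(node_vec, neighbor_ids, vectors, max_conn):
    """Trim a node's edge list back to its max_conn nearest neighbors.
    Simple distance-based pruning; Malkov & Yashunin's heuristic
    neighbor selection (Algorithm 4) is deliberately omitted."""
    ranked = sorted(neighbor_ids,
                    key=lambda nid: float(np.linalg.norm(node_vec - vectors[nid])))
    return ranked[:max_conn]

# A node at the origin connected to four points on a line, pruned to M=2:
vectors = {i: np.array([float(i), 0.0]) for i in range(1, 5)}
kept = prune_neighbors(np.zeros(2), [4, 2, 1, 3], vectors, max_conn=2)
```

After pruning, only the two closest neighbors (nodes 1 and 2) survive; the long-range edges to 3 and 4 are dropped.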
Search Algorithm
Searching for the `k` nearest neighbors of a query vector `q`:
1. Top-down traversal: Starting from the entry point, greedily descend from the top layer to layer 1, keeping only the single closest node at each layer. This narrows the search to the right neighborhood in O(log n) hops.
2. Layer 0 beam search: At layer 0, perform a beam search with width `ef_search` (a tunable parameter). The search maintains two heaps -- a min-heap of candidates to expand, and a max-heap of the current best results. At each step, pop the closest unexpanded candidate, examine its neighbors, and add any neighbor closer than the current worst result to both heaps.
3. Termination: The search stops when the closest unexpanded candidate is farther than the worst result in the beam and the beam is full. Return the top `k` results from the beam.
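The layer-0 beam search maps naturally onto Python's heapq. The sketch below is a minimal illustration over a plain adjacency dict, not the package internals; the function name and signature are assumptions:

```python
import heapq
import numpy as np

def beam_search_layer0(query, entry, vectors, adjacency, ef_search):
    """Beam search at layer 0: returns up to ef_search (distance, id)
    pairs, closest first. Two heaps drive the loop, as described above."""
    d0 = float(np.linalg.norm(query - vectors[entry]))
    candidates = [(d0, entry)]       # min-heap of unexpanded nodes
    results = [(-d0, entry)]         # max-heap of best-so-far (negated dist)
    visited = {entry}
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(results) >= ef_search and dist > -results[0][0]:
            break                    # nothing left can improve the beam
        for nb in adjacency.get(node, []):
            if nb in visited:
                continue
            visited.add(nb)
            d = float(np.linalg.norm(query - vectors[nb]))
            if len(results) < ef_search or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef_search:
                    heapq.heappop(results)  # evict the current worst
    return sorted((-neg, nid) for neg, nid in results)

# Tiny example: a path graph 0-1-2-...-9 embedded on the number line.
vectors = {i: np.array([float(i)]) for i in range(10)}
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
out = beam_search_layer0(np.array([7.2]), 0, vectors, adjacency, ef_search=3)
```

Starting from node 0, the beam walks down the path and settles on the three nodes nearest 7.2, demonstrating both the expansion loop and the early-termination rule.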
Key Parameters
| Parameter | Default | Role |
|---|---|---|
| `M` | 16 | Max edges per node per layer. Higher M = better recall, more memory, slower insertion. |
| `ef_construction` | 200 | Beam width during insertion. Higher = better graph quality, slower build. |
| `ef_search` | 50 | Beam width during search. Higher = better recall, slower queries. This is the primary recall/speed knob at query time. |
Performance Characteristics
| Operation | Time Complexity | Notes |
|---|---|---|
| Insert | O(log n * ef_construction * M) | Dominated by the beam search at each layer during connection. |
| Search | O(log n * ef_search * M) | Top-down traversal is O(log n) layers; layer 0 search is bounded by ef_search. |
| Delete | O(1) | Lazy deletion -- marks the node, does not rebuild the graph. Deleted nodes are still traversed but excluded from results. |
| Memory | O(n * M * layers) | Each node stores up to M neighbor IDs per layer (2M at layer 0). Average layers per node is ~1 + 1/ln(M). |
For typical configurations (M=16, ef_construction=200, ef_search=50), this implementation achieves >90% recall@10 against brute-force search on 1,000 random 64-dimensional vectors, verified by the test suite.
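Recall@k is measured the same way the test suite does it: compare the approximate result set against an exact brute-force scan. A generic sketch (helper names are hypothetical, not the test suite's):

```python
import numpy as np

def brute_force_topk(query, vectors, k):
    """Exact k-NN by scanning every vector; the ground truth for recall."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k].tolist()

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Worked example on points along a line: the exact top-3 for query 3.2
# is [3, 4, 2]; an approximate result of [3, 4, 5] scores 2/3.
vectors = np.arange(10, dtype=float).reshape(-1, 1)
exact = brute_force_topk(np.array([3.2]), vectors, k=3)
score = recall_at_k([3, 4, 5], exact)
```

Averaging this score over many random queries gives the recall@10 figure quoted above.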
Supported Distance Metrics
| Metric | Function | Range | Use Case |
|---|---|---|---|
| `cosine` | `1 - cos(a, b)` | [0, 2] | Normalized embeddings (most embedding models). Default. |
| `euclidean` | `\|\|a - b\|\|_2` | [0, inf) | Raw feature vectors, spatial data. |
| `dot` | `-dot(a, b)` | (-inf, inf) | Maximum inner product search (MIPS). |
All distance functions are implemented with NumPy for vectorized computation. Batch variants (`batch_cosine_distance`, `batch_euclidean_distance`) are provided for operations over multiple vectors.
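The three metrics reduce to a few lines of NumPy each. This is an illustrative sketch of the formulas in the table above, not the package's exact source; the epsilon guard is an assumption:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(a, b); a small epsilon in the denominator guards zero vectors."""
    denom = max(float(np.linalg.norm(a) * np.linalg.norm(b)), 1e-12)
    return 1.0 - float(np.dot(a, b)) / denom

def euclidean_distance(a, b):
    """Plain L2 distance ||a - b||_2."""
    return float(np.linalg.norm(a - b))

def dot_distance(a, b):
    """Negated inner product so that smaller still means more similar (MIPS)."""
    return -float(np.dot(a, b))

def batch_cosine_distance(q, X):
    """Cosine distance from one query to every row of X, fully vectorized."""
    denom = np.maximum(np.linalg.norm(q) * np.linalg.norm(X, axis=1), 1e-12)
    return 1.0 - (X @ q) / denom
```

The batch variant replaces a Python loop with one matrix-vector product, which is where NumPy's vectorization pays off during beam search.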
Usage
Basic Index Operations
import numpy as np
from citadel_vector import HNSWIndex
# Create an index for 384-dimensional vectors (e.g., sentence-transformers output)
index = HNSWIndex(
dim=384,
max_elements=50_000,
M=16,
ef_construction=200,
metric="cosine",
)
# Insert vectors with metadata
index.add(
vector=np.random.randn(384),
id="doc_001",
metadata={"source": "readme.md", "chunk": 0},
)
# Batch insert
vectors = np.random.randn(100, 384)
ids = [f"doc_{i:03d}" for i in range(100)]
metadatas = [{"source": f"file_{i}.txt"} for i in range(100)]
index.batch_add(vectors, ids, metadatas)
# Search
query = np.random.randn(384)
results = index.search(query, k=5, ef_search=100)
for doc_id, distance, metadata in results:
print(f" {doc_id}: distance={distance:.4f}, source={metadata['source']}")
# Filtered search -- only return results matching a predicate
results = index.search(
query,
k=5,
filter_fn=lambda meta: meta is not None and meta.get("source") == "readme.md",
)
# Lazy deletion
index.delete("doc_001")
assert "doc_001" not in index
Persistent Storage (VectorStore)
from citadel_vector import VectorStore
# Create a persistent store -- data is saved to disk
store = VectorStore(path="./my_vectors", dim=384, metric="cosine")
# Add vectors (metadata is persisted to SQLite)
store.add(np.random.randn(384), "doc_1", metadata={"title": "Introduction"})
store.add(np.random.randn(384), "doc_2", metadata={"title": "Methods"})
# Save the index (vectors as .npy, graph as JSON, metadata in SQLite)
store.save()
# Load from disk in another process
store = VectorStore.load("./my_vectors")
results = store.search(np.random.randn(384), k=5)
# Inspect index statistics
print(store.stats())
# {'count': 2, 'dim': 384, 'metric': 'cosine', 'max_elements': 100000, 'layers': 1, ...}
REST API Server
# Start the server
# citadel-vector serve --port 8082
# Or programmatically:
from citadel_vector.server import create_app
app = create_app(storage_dir="./vector_data")
# Create a collection
curl -X POST http://localhost:8082/collections \
-H "Content-Type: application/json" \
-d '{"name": "documents", "dim": 384, "metric": "cosine"}'
# Add vectors
curl -X POST http://localhost:8082/collections/documents/add \
-H "Content-Type: application/json" \
-d '{"vectors": [[0.1, 0.2, ...]], "ids": ["doc_1"], "metadatas": [{"title": "test"}]}'
# Search
curl -X POST http://localhost:8082/collections/documents/search \
-H "Content-Type: application/json" \
-d '{"query": [0.1, 0.2, ...], "k": 5}'
Architecture
citadel-vector/
citadel_vector/
__init__.py # Public API: HNSWIndex, VectorStore, distance functions
hnsw.py # Core HNSW algorithm -- graph construction, beam search, deletion
distance.py # Distance metrics: cosine, euclidean, dot product (scalar + batch)
storage.py # Persistent storage: NumPy arrays + JSON graph + SQLite metadata
config.py # VectorConfig dataclass with tunable defaults
server.py # Optional FastAPI REST server for HTTP access
tests/
test_hnsw.py # Unit tests + recall benchmark (>90% recall@10 on 1000 vectors)
test_distance.py # Distance function correctness tests
test_storage.py # Persistence round-trip tests
Storage Format
The VectorStore persists three files:
| File | Format | Contents |
|---|---|---|
| `vectors.npy` | NumPy binary | All vector data as a float64 array |
| `graph.json` | JSON | Graph adjacency lists, node levels, entry point, index parameters |
| `metadata.db` | SQLite | Key-value metadata store (one row per vector ID) |
This separation means vectors can be memory-mapped for large datasets, the graph structure is human-readable for debugging, and metadata queries can use SQL.
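Memory-mapping the vector file is one line with NumPy. A self-contained sketch (the temporary path and array shape are illustrative, not the store's layout on your machine):

```python
import os
import tempfile
import numpy as np

# Stand-in for a saved store: write a plain .npy file like vectors.npy.
store_dir = tempfile.mkdtemp()
path = os.path.join(store_dir, "vectors.npy")
np.save(path, np.zeros((1000, 384)))

# mmap_mode="r" pages vectors from disk on demand instead of loading
# the whole array into RAM -- useful once the index outgrows memory.
vectors = np.load(path, mmap_mode="r")
row = np.asarray(vectors[42])   # only this row is materialized
```

Because the graph and metadata live in separate files, the mapped array can stay read-only while SQLite handles metadata queries independently.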
Dependencies
- Python 3.10+
- NumPy -- the only required dependency. All distance calculations and vector storage use NumPy arrays.
- FastAPI + Uvicorn (optional) -- only needed for the REST server. Install with `pip install citadel-vector[server]`.
No C extensions. No compiled binaries. No CUDA. Pure Python + NumPy.
Relationship to Citadel
This HNSW engine is one of six packages in the Citadel platform:
| Package | Purpose |
|---|---|
| citadel-vector (this) | HNSW vector search engine |
| citadel-gateway | LLM provider proxy with caching, rate limiting, circuit breaking |
| citadel-agents | YAML-defined ReAct agents with tool use and multi-agent orchestration |
| citadel-ingest | Document ingestion pipeline (parse, chunk, embed, deduplicate) |
| citadel-trace | LLM observability -- spans, cost tracking, latency metrics, alerts |
| citadel-dashboard | Single-file HTML operations dashboard |
Each package is independently installable and usable. The ingest pipeline writes to this vector store, the agent framework reads from it for RAG, and the gateway routes LLM calls for embedding generation. But none of these connections are required -- citadel-vector works entirely on its own.
References
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836. arXiv:1603.09320
License
MIT License. See LICENSE for details.