| # aurekai/semantic-cache-bench |
|
|
Semantic caching benchmark and performance suite for Aurekai. It validates cache consistency, hit rates, and query latency across model architectures and corpus sizes.
|
|
| ## Overview |
|
|
Semantic caching is a core optimization in Aurekai that deduplicates semantically similar queries without requiring exact string matches. This repository hosts:
|
|
| - **Benchmark Datasets**: Query corpora with semantic similarity annotations |
| - **Evaluation Scripts**: Performance measurement and validation tools |
| - **Results**: Baseline metrics across different models and cache configurations |
| - **Methodology**: Detailed documentation of benchmark setup and evaluation protocols |
|
|
| ## Quick Start |
|
|
| ```bash |
| # Download benchmark suite |
| git clone https://huggingface.co/aurekai/semantic-cache-bench |
| cd semantic-cache-bench |
| |
| # Run quick benchmark |
| akai semantic-cache:bench \ |
| --dataset queries-10k.jsonl \ |
| --model qwen3-8b \ |
| --cache-size 1GB \ |
| --output results.json |
| |
| # Compare results |
| akai semantic-cache:compare \ |
| --baseline baseline-results.json \ |
| --current results.json |
| ``` |
|
|
| ## Benchmark Datasets |
|
|
| ### queries-1k (Minimal Validation) |
|
|
| - **Size**: 1,024 queries |
| - **Purpose**: Quick validation of cache functionality |
| - **Format**: JSONL with semantic similarity pairs |
| - **Runtime**: ~5 minutes on GPU |
|
|
| **Schema**: |
| ```json |
| { |
| "id": "q_001_234", |
| "query": "What are the benefits of renewable energy?", |
| "semantic_variations": [ |
| "Advantages of wind and solar power", |
| "Why should we invest in renewables?" |
| ], |
| "dissimilar_queries": [ |
| "How do fossil fuels work?" |
| ], |
| "expected_cache_hit": true, |
| "similarity_threshold": 0.87 |
| } |
| ``` |
|
|
| ### queries-10k (Standard Benchmark) |
|
|
| - **Size**: 10,240 queries |
| - **Purpose**: Standard performance baseline |
| - **Corpus**: Diverse knowledge domains and query patterns |
| - **Expected cache hit rate**: 68-72% |
| - **Runtime**: ~45 minutes on GPU |
|
|
| ### queries-100k (Comprehensive) |
|
|
| - **Size**: 102,400 queries |
| - **Purpose**: Large-scale cache behavior validation |
| - **Corpus**: Realistic production query distribution |
| - **Expected cache hit rate**: 72-76% |
- **Runtime**: ~8 hours on a high-end GPU
| - **Disk space**: 15 GB decompressed |
|
|
| ## Metrics & Evaluation |
|
|
| ### Cache Performance |
|
|
| | Metric | Qwen3-8B | LLaMA3-8B | Meaning | |
| |--------|----------|-----------|---------| |
| | **Hit Rate** | 71.2% | 69.8% | % of queries found in cache | |
| | **False Positives** | 0.3% | 0.4% | Incorrect cache matches | |
| | **False Negatives** | 2.1% | 2.4% | Missed cache opportunities | |
| **Recall@0.90** | 94.2% | 92.8% | Recall of true semantic matches at a 0.90 similarity threshold |
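As a rough illustration of how these rates are derived, the sketch below computes them from per-query outcomes. The `Outcome` record is illustrative only and is not the evaluator's actual output format:

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    expected_hit: bool   # ground-truth annotation: should this query hit the cache?
    observed_hit: bool   # what the cache actually did

def cache_metrics(outcomes: list[Outcome]) -> dict[str, float]:
    """Hit rate plus false-positive / false-negative rates over all queries."""
    n = len(outcomes)
    hits = sum(o.observed_hit for o in outcomes)
    false_pos = sum(o.observed_hit and not o.expected_hit for o in outcomes)
    false_neg = sum(o.expected_hit and not o.observed_hit for o in outcomes)
    return {
        "hit_rate": hits / n,
        "false_positive_rate": false_pos / n,
        "false_negative_rate": false_neg / n,
    }

# Toy example: 7 of 10 queries hit the cache, one of them incorrectly.
outcomes = [Outcome(True, True)] * 6 + [Outcome(False, True)] \
         + [Outcome(True, False)] + [Outcome(False, False)] * 2
print(cache_metrics(outcomes))  # {'hit_rate': 0.7, 'false_positive_rate': 0.1, ...}
```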
|
|
| ### Latency Improvement |
|
|
| ``` |
| Cache Miss: 125 ms (full inference) |
| Cache Hit: 2 ms (embedding lookup + cache retrieval) |
| Speedup: 62.5x |
| |
Average (71% hit rate): 0.29 * 125 + 0.71 * 2 ≈ 38 ms
Effective speedup: 125 / 38 ≈ 3.3x vs. no cache
| ``` |
|
|
| ### Memory Efficiency |
|
|
| - **Embedding cache size**: ~800 MB for 100K queries |
| - **Memory per cached embedding**: ~8 KB |
| - **Compression ratio**: 1.4x with optional zstd compression |
| - **Peak memory during benchmark**: 4 GB (with batch size 32) |
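A back-of-the-envelope sizing helper tying these figures together; the 8 KB per embedding and the 1.4x zstd ratio come from the list above, and the rest is arithmetic:

```python
def cache_footprint_mb(num_queries: int, kb_per_embedding: float = 8.0,
                       zstd_ratio: float = 1.4, compressed: bool = False) -> float:
    """Rough embedding-cache size in MB for a given number of cached queries."""
    raw_mb = num_queries * kb_per_embedding / 1024
    return raw_mb / zstd_ratio if compressed else raw_mb

print(cache_footprint_mb(100_000))                   # ~781 MB, in line with the ~800 MB figure above
print(cache_footprint_mb(100_000, compressed=True))  # ~558 MB with zstd compression
```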
|
|
| ## Running Benchmarks |
|
|
| ### Standard Evaluation |
|
|
| ```bash |
| # Benchmark specific model |
| akai semantic-cache:bench \ |
| --model qwen3-8b \ |
| --dataset queries-10k.jsonl \ |
| --batch-size 32 \ |
| --cache-size 2GB \ |
| --output results.json |
| |
| # With logging |
| akai semantic-cache:bench \ |
| --model qwen3-8b \ |
| --dataset queries-10k.jsonl \ |
| --cache-size 2GB \ |
| --verbose \ |
| --log-interval 100 \ |
| --output results.json |
| ``` |
|
|
| ### Comparison Between Models |
|
|
| ```bash |
| # Run on multiple models |
| for model in qwen3-8b llama3-8b; do |
| akai semantic-cache:bench \ |
| --model $model \ |
| --dataset queries-10k.jsonl \ |
| --output results-$model.json |
| done |
| |
| # Compare results |
| akai semantic-cache:compare \ |
| --results results-qwen3-8b.json results-llama3-8b.json \ |
| --report comparison-report.md |
| ``` |
|
|
| ### Validation with Custom Threshold |
|
|
| ```bash |
| # Test different similarity thresholds |
| akai semantic-cache:threshold-sweep \ |
| --model qwen3-8b \ |
| --dataset queries-10k.jsonl \ |
| --thresholds "0.80,0.85,0.90,0.95" \ |
| --output threshold-sweep.json |
| ``` |
|
|
| ## Benchmark Results |
|
|
| ### Latest Results (Aurekai v0.8.0-alpha.1) |
|
|
- **Hardware**: NVIDIA H100 80GB, AMD EPYC 9654
- **Date**: 2026-05-02
|
|
| Model | Dataset | Hit Rate | Recall@0.90 | Latency (hit) | Latency (miss) |
|-------|---------|----------|-------------|---------------|----------------|
| Qwen3-8B | 1K | 72.3% | 94.1% | 1.8 ms | 124 ms |
| Qwen3-8B | 10K | 71.2% | 93.8% | 1.9 ms | 126 ms |
| LLaMA3-8B | 1K | 70.1% | 92.4% | 2.1 ms | 127 ms |
| LLaMA3-8B | 10K | 69.8% | 92.1% | 2.2 ms | 129 ms |
|
|
| See [results/](./results/) for detailed breakdowns by domain and query type. |
|
|
| ## Implementation Notes |
|
|
| ### Cache Configuration |
|
|
| ```json |
| { |
| "semantic_cache": { |
| "enabled": true, |
| "similarity_threshold": 0.88, |
| "max_cache_size": "2GB", |
| "eviction_policy": "lru", |
| "embedding_model": "qwen3-8b", |
| "batch_size": 32, |
| "use_mmap": true |
| } |
| } |
| ``` |
|
|
| ### Threshold Selection |
|
|
| - **Aggressive (0.80)**: High hit rate (75%+), more false positives |
| - **Balanced (0.88)**: Recommended default, 71% hit rate, minimal false positives |
- **Conservative (0.95)**: Very few false positives, lower hit rate (see the selection sketch after this list)
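One way to operationalize this trade-off is to take the lowest threshold whose false-positive rate stays within a budget, which maximizes hit rate under that constraint. A sketch, with illustrative numbers rather than measured results:

```python
def pick_threshold(sweep: list[dict], max_false_positive_rate: float = 0.005) -> float:
    """Lowest threshold (highest hit rate) whose false-positive rate stays in budget."""
    ok = [e for e in sweep if e["false_positive_rate"] <= max_false_positive_rate]
    if not ok:
        raise ValueError("no threshold satisfies the false-positive budget")
    return min(ok, key=lambda e: e["threshold"])["threshold"]

# Illustrative numbers only -- not measured results.
sweep = [
    {"threshold": 0.80, "hit_rate": 0.76, "false_positive_rate": 0.012},
    {"threshold": 0.88, "hit_rate": 0.71, "false_positive_rate": 0.003},
    {"threshold": 0.95, "hit_rate": 0.60, "false_positive_rate": 0.001},
]
print(pick_threshold(sweep))  # 0.88
```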
|
|
| ## Methodology |
|
|
| ### Query Similarity Annotation |
|
|
| Each benchmark dataset includes human-validated semantic similarity annotations: |
|
|
| 1. Query pairs sampled from corpus |
| 2. Annotators rate similarity (0-1) |
| 3. Disagreements resolved with third annotator |
| 4. Inter-rater reliability: Krippendorff's α = 0.89 |
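For reference, interval-level Krippendorff's α can be computed with the third-party `krippendorff` package; the ratings below are illustrative and are not the benchmark's annotation data:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are annotators, columns are query pairs; np.nan marks pairs an
# annotator did not rate. These ratings are illustrative only.
ratings = np.array([
    [0.9, 0.2, 0.7, 0.4, np.nan],
    [0.8, 0.1, 0.6, 0.5, 0.3],
    [np.nan, 0.2, 0.7, np.nan, 0.4],  # third annotator resolves disagreements
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha: {alpha:.2f}")
```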
|
|
| ### Cache Consistency Validation |
|
|
All cached results are validated against ground truth. For each cached result (a minimal sketch follows the list):
|
|
1. Verify the cached embedding matches the original query
2. Re-rank all cached results against the current query
3. Confirm the top match was indeed in the cache
4. Validate that latency improved significantly over a cache miss
|
|
| ## Contributing Results |
|
|
| To contribute benchmark results: |
|
|
| 1. Run benchmark suite on your hardware |
| 2. Include system specs (GPU, CPU, memory, disk) |
| 3. Report all metrics from evaluation output |
| 4. Submit results via PR with hardware metadata |
|
|
| **Result file format**: |
| ```json |
| { |
| "metadata": { |
| "hardware": "NVIDIA H100, 512GB RAM", |
| "date": "2026-05-02", |
| "aurekai_version": "0.8.0-alpha.1" |
| }, |
| "results": [ |
| { |
| "model": "qwen3-8b", |
| "dataset": "queries-10k", |
| "hit_rate": 0.712, |
| "recall_at_0_90": 0.938 |
| } |
| ] |
| } |
| ``` |
|
|
| ## Related Repositories |
|
|
| - **Main Aurekai Repo**: https://github.com/aurekai/aurekai |
| - **Model Memory**: https://huggingface.co/aurekai/model-memory |
| - **SAE Dictionaries**: https://huggingface.co/aurekai/sae-dictionaries |
| - **FPQx Alignments**: https://huggingface.co/aurekai/fpqx-alignments |
|
|
| ## Tools & Scripts |
|
|
| - `akai semantic-cache:bench`: Run full benchmark suite |
| - `akai semantic-cache:compare`: Compare benchmark results |
| - `akai semantic-cache:threshold-sweep`: Test different thresholds |
- `benchmark_to_csv.py`: Export results to CSV format (sketched below)
| - `visualize_results.py`: Generate performance plots |
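As an illustration of what the CSV export involves, flattening the result format from the previous section looks roughly like this; it is a sketch of the idea, not the actual `benchmark_to_csv.py`:

```python
import csv
import json

# Flatten the results.json format from "Contributing Results" into one CSV row
# per (model, dataset) pair. Illustrative only; not the actual script.
with open("results.json", encoding="utf-8") as f:
    data = json.load(f)

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["model", "dataset", "hit_rate", "recall_at_0_90"],
        extrasaction="ignore",
    )
    writer.writeheader()
    writer.writerows(data["results"])
```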
|
|
| ## Citation |
|
|
If you reference these benchmarks in your research, please cite:
|
|
| ```bibtex |
| @dataset{aurekai_semantic_cache_bench_2026, |
| title={Aurekai Semantic Cache Benchmarks}, |
| author={Aurekai Community}, |
| year={2026}, |
| url={https://huggingface.co/aurekai/semantic-cache-bench} |
| } |
| ``` |
|
|
| ## License |
|
|
| Licensed under the Aurekai Open Source License. See main repository for details. |
|
|