# aurekai/semantic-cache-bench

Semantic caching benchmarks and performance suite for Aurekai. Validates cache consistency, hit rates, and query latency across different model architectures and corpus sizes.

## Overview

Semantic caching is a core optimization in Aurekai that deduplicates semantically similar queries without requiring exact string matches. This repository hosts:

- **Benchmark Datasets**: Query corpora with semantic similarity annotations
- **Evaluation Scripts**: Performance measurement and validation tools
- **Results**: Baseline metrics across different models and cache configurations
- **Methodology**: Detailed documentation of benchmark setup and evaluation protocols

## Quick Start

```bash
# Download benchmark suite
git clone https://huggingface.co/aurekai/semantic-cache-bench
cd semantic-cache-bench

# Run quick benchmark
akai semantic-cache:bench \
  --dataset queries-10k.jsonl \
  --model qwen3-8b \
  --cache-size 1GB \
  --output results.json

# Compare results
akai semantic-cache:compare \
  --baseline baseline-results.json \
  --current results.json
```

## Benchmark Datasets

### queries-1k (Minimal Validation)

- **Size**: 1,024 queries
- **Purpose**: Quick validation of cache functionality
- **Format**: JSONL with semantic similarity pairs
- **Runtime**: ~5 minutes on a GPU

**Schema**:

```json
{
  "id": "q_001_234",
  "query": "What are the benefits of renewable energy?",
  "semantic_variations": [
    "Advantages of wind and solar power",
    "Why should we invest in renewables?"
  ],
  "dissimilar_queries": [
    "How do fossil fuels work?"
  ],
  "expected_cache_hit": true,
  "similarity_threshold": 0.87
}
```

### queries-10k (Standard Benchmark)

- **Size**: 10,240 queries
- **Purpose**: Standard performance baseline
- **Corpus**: Diverse knowledge domains and query patterns
- **Expected cache hit rate**: 68-72%
- **Runtime**: ~45 minutes on a GPU

### queries-100k (Comprehensive)

- **Size**: 102,400 queries
- **Purpose**: Large-scale cache behavior validation
- **Corpus**: Realistic production query distribution
- **Expected cache hit rate**: 72-76%
- **Runtime**: ~8 hours on a high-end GPU
- **Disk space**: 15 GB decompressed

## Metrics & Evaluation

### Cache Performance

| Metric | Qwen3-8B | LLaMA3-8B | Meaning |
|--------|----------|-----------|---------|
| **Hit Rate** | 71.2% | 69.8% | % of queries found in cache |
| **False Positives** | 0.3% | 0.4% | Incorrect cache matches |
| **False Negatives** | 2.1% | 2.4% | Missed cache opportunities |
| **Recall @ 0.90** | 94.2% | 92.8% | True positives at the 0.90 similarity threshold |

### Latency Improvement

```
Cache Miss:  125 ms  (full inference)
Cache Hit:     2 ms  (embedding lookup + cache retrieval)
Speedup:     62.5x

Average (71% hit rate): 125 * 0.29 + 2 * 0.71 ≈ 38 ms
Effective speedup: 3.3x vs. no cache
```
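The average above is just the expectation over the hit and miss paths. For other hit rates, the same calculation is a one-liner; here is a minimal sketch, assuming the measured hit/miss latencies from the block above (the function name is ours, not part of the `akai` toolchain):

```python
def effective_latency_ms(hit_rate: float, hit_ms: float = 2.0, miss_ms: float = 125.0) -> float:
    """Expected per-query latency: a weighted average of the hit and miss paths."""
    return miss_ms * (1.0 - hit_rate) + hit_ms * hit_rate

# Reproduces the figures above: ~38 ms at a 71% hit rate, a ~3.3x effective speedup.
for rate in (0.0, 0.71, 0.76):
    avg = effective_latency_ms(rate)
    print(f"hit rate {rate:4.0%}: {avg:6.1f} ms ({125.0 / avg:.1f}x vs. no cache)")
```

At the 72-76% hit rates observed on queries-100k, the same formula works out to roughly a 3.4-4x effective speedup.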
### Memory Efficiency

- **Embedding cache size**: ~800 MB for 100K queries
- **Memory per cached embedding**: ~8 KB
- **Compression ratio**: 1.4x with optional zstd compression
- **Peak memory during benchmark**: 4 GB (with batch size 32)

## Running Benchmarks

### Standard Evaluation

```bash
# Benchmark a specific model
akai semantic-cache:bench \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --batch-size 32 \
  --cache-size 2GB \
  --output results.json

# With logging
akai semantic-cache:bench \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --cache-size 2GB \
  --verbose \
  --log-interval 100 \
  --output results.json
```

### Comparison Between Models

```bash
# Run on multiple models
for model in qwen3-8b llama3-8b; do
  akai semantic-cache:bench \
    --model $model \
    --dataset queries-10k.jsonl \
    --output results-$model.json
done

# Compare results
akai semantic-cache:compare \
  --results results-qwen3-8b.json results-llama3-8b.json \
  --report comparison-report.md
```

### Validation with Custom Threshold

```bash
# Test different similarity thresholds
akai semantic-cache:threshold-sweep \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --thresholds "0.80,0.85,0.90,0.95" \
  --output threshold-sweep.json
```

## Benchmark Results

### Latest Results (Aurekai v0.8.0-alpha.1)

**Hardware**: NVIDIA H100 80GB, AMD EPYC 9654

**Date**: 2026-05-02

| Model | Dataset | Hit Rate | Recall@0.90 | Latency (hit) | Latency (miss) |
|-------|---------|----------|-------------|---------------|----------------|
| Qwen3-8B | 1K | 72.3% | 94.1% | 1.8 ms | 124 ms |
| Qwen3-8B | 10K | 71.2% | 93.8% | 1.9 ms | 126 ms |
| LLaMA3-8B | 1K | 70.1% | 92.4% | 2.1 ms | 127 ms |
| LLaMA3-8B | 10K | 69.8% | 92.1% | 2.2 ms | 129 ms |

See [results/](./results/) for detailed breakdowns by domain and query type.

## Implementation Notes

### Cache Configuration

```json
{
  "semantic_cache": {
    "enabled": true,
    "similarity_threshold": 0.88,
    "max_cache_size": "2GB",
    "eviction_policy": "lru",
    "embedding_model": "qwen3-8b",
    "batch_size": 32,
    "use_mmap": true
  }
}
```

### Threshold Selection

- **Aggressive (0.80)**: High hit rate (75%+), more false positives
- **Balanced (0.88)**: Recommended default; 71% hit rate with minimal false positives
- **Conservative (0.95)**: Very few false positives, lower hit rate

## Methodology

### Query Similarity Annotation

Each benchmark dataset includes human-validated semantic similarity annotations:

1. Query pairs are sampled from the corpus
2. Annotators rate similarity on a 0-1 scale
3. Disagreements are resolved by a third annotator
4. Inter-rater reliability: Krippendorff's α = 0.89

### Cache Consistency Validation

All cache results are validated against ground truth. For each cached result:

1. Verify the stored embedding matches the original query
2. Re-rank all cached results for the current query
3. Confirm the top match was indeed in cache
4. Validate that latency was significantly improved
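The same check is easy to express in code. Below is a minimal, self-contained sketch of steps 1-3, assuming a toy dict-backed cache and a stand-in `embed` function; the names `validate_hit`, `cache`, and `embed` are illustrative, not the Aurekai API, and step 4 is omitted because it needs timing data from the benchmark run:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def validate_hit(query, reported_hit, cache, embed, threshold=0.88):
    """Check one reported cache hit against steps 1-3 above.

    `cache` maps cached query -> (embedding, cached result);
    `embed` maps text -> np.ndarray. Both are hypothetical stand-ins.
    """
    # Step 1: the stored embedding must match a fresh embedding of the cached query.
    stored_emb, _ = cache[reported_hit]
    if not np.allclose(embed(reported_hit), stored_emb, atol=1e-5):
        return False
    # Step 2: re-rank every cached entry against the current query.
    query_emb = embed(query)
    scores = {q: cosine(query_emb, emb) for q, (emb, _) in cache.items()}
    best = max(scores, key=scores.get)
    # Step 3: the reported hit must be the top-ranked entry and clear the threshold.
    return best == reported_hit and scores[best] >= threshold
```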
## Contributing Results

To contribute benchmark results:

1. Run the benchmark suite on your hardware
2. Include system specs (GPU, CPU, memory, disk)
3. Report all metrics from the evaluation output
4. Submit results via PR with hardware metadata

**Result file format**:

```json
{
  "metadata": {
    "hardware": "NVIDIA H100, 512GB RAM",
    "date": "2026-05-02",
    "aurekai_version": "0.8.0-alpha.1"
  },
  "results": [
    {
      "model": "qwen3-8b",
      "dataset": "queries-10k",
      "hit_rate": 0.712,
      "recall_at_0_90": 0.938
    }
  ]
}
```

## Related Repositories

- **Main Aurekai Repo**: https://github.com/aurekai/aurekai
- **Model Memory**: https://huggingface.co/aurekai/model-memory
- **SAE Dictionaries**: https://huggingface.co/aurekai/sae-dictionaries
- **FPQx Alignments**: https://huggingface.co/aurekai/fpqx-alignments

## Tools & Scripts

- `akai semantic-cache:bench`: Run the full benchmark suite
- `akai semantic-cache:compare`: Compare benchmark results
- `akai semantic-cache:threshold-sweep`: Test different similarity thresholds
- `benchmark_to_csv.py`: Export results to CSV format
- `visualize_results.py`: Generate performance plots

## Citation

If you reference these benchmarks in research:

```bibtex
@dataset{aurekai_semantic_cache_bench_2026,
  title={Aurekai Semantic Cache Benchmarks},
  author={Aurekai Community},
  year={2026},
  url={https://huggingface.co/aurekai/semantic-cache-bench}
}
```

## License

Licensed under the Aurekai Open Source License. See the main repository for details.