| # aurekai/semantic-cache-bench |
|
|
Semantic caching benchmark and performance suite for Aurekai. It validates cache consistency, hit rates, and query latency across model architectures and corpus sizes.
|
|
| ## Overview |
|
|
Semantic caching is a core optimization in Aurekai that deduplicates semantically similar queries without requiring exact string matches. This repository hosts:
|
|
| - **Benchmark Datasets**: Query corpora with semantic similarity annotations |
| - **Evaluation Scripts**: Performance measurement and validation tools |
| - **Results**: Baseline metrics across different models and cache configurations |
| - **Methodology**: Detailed documentation of benchmark setup and evaluation protocols |
|
|
| ## Quick Start |
|
|
| ```bash |
| # Download benchmark suite |
| git clone https://huggingface.co/aurekai/semantic-cache-bench |
| cd semantic-cache-bench |
| |
| # Run quick benchmark |
| akai semantic-cache:bench \ |
| --dataset queries-10k.jsonl \ |
| --model qwen3-8b \ |
| --cache-size 1GB \ |
| --output results.json |
| |
| # Compare results |
| akai semantic-cache:compare \ |
| --baseline baseline-results.json \ |
| --current results.json |
| ``` |
|
|
| ## Benchmark Datasets |
|
|
| ### queries-1k (Minimal Validation) |
|
|
| - **Size**: 1,024 queries |
| - **Purpose**: Quick validation of cache functionality |
| - **Format**: JSONL with semantic similarity pairs |
| - **Runtime**: ~5 minutes on GPU |
|
|
| **Schema**: |
| ```json |
| { |
| "id": "q_001_234", |
| "query": "What are the benefits of renewable energy?", |
| "semantic_variations": [ |
| "Advantages of wind and solar power", |
| "Why should we invest in renewables?" |
| ], |
| "dissimilar_queries": [ |
| "How do fossil fuels work?" |
| ], |
| "expected_cache_hit": true, |
| "similarity_threshold": 0.87 |
| } |
| ``` |
|
|
| ### queries-10k (Standard Benchmark) |
|
|
| - **Size**: 10,240 queries |
| - **Purpose**: Standard performance baseline |
| - **Corpus**: Diverse knowledge domains and query patterns |
| - **Expected cache hit rate**: 68-72% |
| - **Runtime**: ~45 minutes on GPU |
|
|
| ### queries-100k (Comprehensive) |
|
|
| - **Size**: 102,400 queries |
| - **Purpose**: Large-scale cache behavior validation |
| - **Corpus**: Realistic production query distribution |
| - **Expected cache hit rate**: 72-76% |
- **Runtime**: ~8 hours on a high-end GPU
| - **Disk space**: 15 GB decompressed |
|
|
| ## Metrics & Evaluation |
|
|
| ### Cache Performance |
|
|
| | Metric | Qwen3-8B | LLaMA3-8B | Meaning | |
| |--------|----------|-----------|---------| |
| | **Hit Rate** | 71.2% | 69.8% | % of queries found in cache | |
| | **False Positives** | 0.3% | 0.4% | Incorrect cache matches | |
| | **False Negatives** | 2.1% | 2.4% | Missed cache opportunities | |
| **Recall@0.90** | 94.2% | 92.8% | Recall of true semantic matches at a 0.90 similarity threshold |
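As a rough illustration of how these rates are derived, the sketch below computes them from per-query outcomes. The `Outcome` record is illustrative only and is not the evaluator's actual output format:

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    expected_hit: bool   # ground-truth annotation: should this query hit the cache?
    observed_hit: bool   # what the cache actually did

def cache_metrics(outcomes: list[Outcome]) -> dict[str, float]:
    """Hit rate plus false-positive / false-negative rates over all queries."""
    n = len(outcomes)
    hits = sum(o.observed_hit for o in outcomes)
    false_pos = sum(o.observed_hit and not o.expected_hit for o in outcomes)
    false_neg = sum(o.expected_hit and not o.observed_hit for o in outcomes)
    return {
        "hit_rate": hits / n,
        "false_positive_rate": false_pos / n,
        "false_negative_rate": false_neg / n,
    }

# Toy example: 7 of 10 queries hit the cache, one of them incorrectly.
outcomes = [Outcome(True, True)] * 6 + [Outcome(False, True)] \
         + [Outcome(True, False)] + [Outcome(False, False)] * 2
print(cache_metrics(outcomes))  # {'hit_rate': 0.7, 'false_positive_rate': 0.1, ...}
```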
|
|
| ### Latency Improvement |
|
|
| ``` |
| Cache Miss: 125 ms (full inference) |
| Cache Hit: 2 ms (embedding lookup + cache retrieval) |
| Speedup: 62.5x |
| |
Average (71% hit rate): 0.29 * 125 + 0.71 * 2 ≈ 38 ms
Effective speedup: 125 / 38 ≈ 3.3x vs. no cache
| ``` |
|
|
| ### Memory Efficiency |
|
|
| - **Embedding cache size**: ~800 MB for 100K queries |
| - **Memory per cached embedding**: ~8 KB |
| - **Compression ratio**: 1.4x with optional zstd compression |
| - **Peak memory during benchmark**: 4 GB (with batch size 32) |
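A back-of-the-envelope sizing helper tying these figures together; the 8 KB per embedding and the 1.4x zstd ratio come from the list above, and the rest is arithmetic:

```python
def cache_footprint_mb(num_queries: int, kb_per_embedding: float = 8.0,
                       zstd_ratio: float = 1.4, compressed: bool = False) -> float:
    """Rough embedding-cache size in MB for a given number of cached queries."""
    raw_mb = num_queries * kb_per_embedding / 1024
    return raw_mb / zstd_ratio if compressed else raw_mb

print(cache_footprint_mb(100_000))                   # ~781 MB, in line with the ~800 MB figure above
print(cache_footprint_mb(100_000, compressed=True))  # ~558 MB with zstd compression
```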
|
|
| ## Running Benchmarks |
|
|
| ### Standard Evaluation |
|
|
| ```bash |
| # Benchmark specific model |
| akai semantic-cache:bench \ |
| --model qwen3-8b \ |
| --dataset queries-10k.jsonl \ |
| --batch-size 32 \ |
| --cache-size 2GB \ |
| --output results.json |
| |
| # With logging |
| akai semantic-cache:bench \ |
| --model qwen3-8b \ |
| --dataset queries-10k.jsonl \ |
| --cache-size 2GB \ |
| --verbose \ |
| --log-interval 100 \ |
| --output results.json |
| ``` |
|
|
| ### Comparison Between Models |
|
|
| ```bash |
| # Run on multiple models |
| for model in qwen3-8b llama3-8b; do |
| akai semantic-cache:bench \ |
| --model $model \ |
| --dataset queries-10k.jsonl \ |
| --output results-$model.json |
| done |
| |
| # Compare results |
| akai semantic-cache:compare \ |
| --results results-qwen3-8b.json results-llama3-8b.json \ |
| --report comparison-report.md |
| ``` |
|
|
| ### Validation with Custom Threshold |
|
|
| ```bash |
| # Test different similarity thresholds |
| akai semantic-cache:threshold-sweep \ |
| --model qwen3-8b \ |
| --dataset queries-10k.jsonl \ |
| --thresholds "0.80,0.85,0.90,0.95" \ |
| --output threshold-sweep.json |
| ``` |
|
|
| ## Benchmark Results |
|
|
| ### Latest Results (Aurekai v0.8.0-alpha.1) |
|
|
- **Hardware**: NVIDIA H100 80GB, AMD EPYC 9654
- **Date**: 2026-05-02
|
|
| Model | Dataset | Hit Rate | Recall@0.90 | Latency (hit) | Latency (miss) |
|-------|---------|----------|-------------|---------------|----------------|
| Qwen3-8B | 1K | 72.3% | 94.1% | 1.8 ms | 124 ms |
| Qwen3-8B | 10K | 71.2% | 93.8% | 1.9 ms | 126 ms |
| LLaMA3-8B | 1K | 70.1% | 92.4% | 2.1 ms | 127 ms |
| LLaMA3-8B | 10K | 69.8% | 92.1% | 2.2 ms | 129 ms |
|
|
| See [results/](./results/) for detailed breakdowns by domain and query type. |
|
|
| ## Implementation Notes |
|
|
| ### Cache Configuration |
|
|
| ```json |
| { |
| "semantic_cache": { |
| "enabled": true, |
| "similarity_threshold": 0.88, |
| "max_cache_size": "2GB", |
| "eviction_policy": "lru", |
| "embedding_model": "qwen3-8b", |
| "batch_size": 32, |
| "use_mmap": true |
| } |
| } |
| ``` |
|
|
| ### Threshold Selection |
|
|
| - **Aggressive (0.80)**: High hit rate (75%+), more false positives |
| - **Balanced (0.88)**: Recommended default, 71% hit rate, minimal false positives |
- **Conservative (0.95)**: Very few false positives, lower hit rate (see the selection sketch after this list)
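One way to operationalize this trade-off is to take the lowest threshold whose false-positive rate stays within a budget, which maximizes hit rate under that constraint. A sketch, with illustrative numbers rather than measured results:

```python
def pick_threshold(sweep: list[dict], max_false_positive_rate: float = 0.005) -> float:
    """Lowest threshold (highest hit rate) whose false-positive rate stays in budget."""
    ok = [e for e in sweep if e["false_positive_rate"] <= max_false_positive_rate]
    if not ok:
        raise ValueError("no threshold satisfies the false-positive budget")
    return min(ok, key=lambda e: e["threshold"])["threshold"]

# Illustrative numbers only -- not measured results.
sweep = [
    {"threshold": 0.80, "hit_rate": 0.76, "false_positive_rate": 0.012},
    {"threshold": 0.88, "hit_rate": 0.71, "false_positive_rate": 0.003},
    {"threshold": 0.95, "hit_rate": 0.60, "false_positive_rate": 0.001},
]
print(pick_threshold(sweep))  # 0.88
```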
|
|
| ## Methodology |
|
|
| ### Query Similarity Annotation |
|
|
| Each benchmark dataset includes human-validated semantic similarity annotations: |
|
|
| 1. Query pairs sampled from corpus |
| 2. Annotators rate similarity (0-1) |
| 3. Disagreements resolved with third annotator |
| 4. Inter-rater reliability: Krippendorff's α = 0.89 |
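For reference, interval-level Krippendorff's α can be computed with the third-party `krippendorff` package; the ratings below are illustrative and are not the benchmark's annotation data:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are annotators, columns are query pairs; np.nan marks pairs an
# annotator did not rate. These ratings are illustrative only.
ratings = np.array([
    [0.9, 0.2, 0.7, 0.4, np.nan],
    [0.8, 0.1, 0.6, 0.5, 0.3],
    [np.nan, 0.2, 0.7, np.nan, 0.4],  # third annotator resolves disagreements
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha: {alpha:.2f}")
```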
|
|
| ### Cache Consistency Validation |
|
|
All cached results are validated against ground truth. For each cached result (a minimal sketch follows the list):
|
|
1. Verify the cached embedding matches the original query
2. Re-rank all cached results against the current query
3. Confirm the top match was indeed in the cache
4. Validate that latency improved significantly over a cache miss
|
|
| ## Contributing Results |
|
|
| To contribute benchmark results: |
|
|
| 1. Run benchmark suite on your hardware |
| 2. Include system specs (GPU, CPU, memory, disk) |
| 3. Report all metrics from evaluation output |
| 4. Submit results via PR with hardware metadata |
|
|
| **Result file format**: |
| ```json |
| { |
| "metadata": { |
| "hardware": "NVIDIA H100, 512GB RAM", |
| "date": "2026-05-02", |
| "aurekai_version": "0.8.0-alpha.1" |
| }, |
| "results": [ |
| { |
| "model": "qwen3-8b", |
| "dataset": "queries-10k", |
| "hit_rate": 0.712, |
| "recall_at_0_90": 0.938 |
| } |
| ] |
| } |
| ``` |
|
|
| ## Related Repositories |
|
|
| - **Main Aurekai Repo**: https://github.com/aurekai/aurekai |
| - **Model Memory**: https://huggingface.co/aurekai/model-memory |
| - **SAE Dictionaries**: https://huggingface.co/aurekai/sae-dictionaries |
| - **FPQx Alignments**: https://huggingface.co/aurekai/fpqx-alignments |
|
|
| ## Tools & Scripts |
|
|
| - `akai semantic-cache:bench`: Run full benchmark suite |
| - `akai semantic-cache:compare`: Compare benchmark results |
| - `akai semantic-cache:threshold-sweep`: Test different thresholds |
- `benchmark_to_csv.py`: Export results to CSV format (sketched below)
| - `visualize_results.py`: Generate performance plots |
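As an illustration of what the CSV export involves, flattening the result format from the previous section looks roughly like this; it is a sketch of the idea, not the actual `benchmark_to_csv.py`:

```python
import csv
import json

# Flatten the results.json format from "Contributing Results" into one CSV row
# per (model, dataset) pair. Illustrative only; not the actual script.
with open("results.json", encoding="utf-8") as f:
    data = json.load(f)

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["model", "dataset", "hit_rate", "recall_at_0_90"],
        extrasaction="ignore",
    )
    writer.writeheader()
    writer.writerows(data["results"])
```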
|
|
| ## Citation |
|
|
If you reference these benchmarks in your research, please cite:
|
|
| ```bibtex |
| @dataset{aurekai_semantic_cache_bench_2026, |
| title={Aurekai Semantic Cache Benchmarks}, |
| author={Aurekai Community}, |
| year={2026}, |
| url={https://huggingface.co/aurekai/semantic-cache-bench} |
| } |
| ``` |
|
|
| ## License |
|
|
| Licensed under the Aurekai Open Source License. See main repository for details. |
|
|