# aurekai/semantic-cache-bench
Semantic caching benchmarks and performance suite for Aurekai. Validates cache consistency, hit rates, and query latency across different model architectures and corpus sizes.
## Overview
Semantic caching is a core optimization in Aurekai that deduplicates semantically similar queries without requiring exact matches. This repository hosts:
- **Benchmark Datasets**: Query corpora with semantic similarity annotations
- **Evaluation Scripts**: Performance measurement and validation tools
- **Results**: Baseline metrics across different models and cache configurations
- **Methodology**: Detailed documentation of benchmark setup and evaluation protocols
## Quick Start
```bash
# Download benchmark suite
git clone https://huggingface.co/aurekai/semantic-cache-bench
cd semantic-cache-bench
# Run quick benchmark
akai semantic-cache:bench \
--dataset queries-10k.jsonl \
--model qwen3-8b \
--cache-size 1GB \
--output results.json
# Compare results
akai semantic-cache:compare \
--baseline baseline-results.json \
--current results.json
```
## Benchmark Datasets
### queries-1k (Minimal Validation)
- **Size**: 1,024 queries
- **Purpose**: Quick validation of cache functionality
- **Format**: JSONL with semantic similarity pairs
- **Runtime**: ~5 minutes on a GPU
**Schema**:
```json
{
"id": "q_001_234",
"query": "What are the benefits of renewable energy?",
"semantic_variations": [
"Advantages of wind and solar power",
"Why should we invest in renewables?"
],
"dissimilar_queries": [
"How do fossil fuels work?"
],
"expected_cache_hit": true,
"similarity_threshold": 0.87
}
```
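To sanity-check a dataset file before benchmarking, each line parses with the standard `json` module. The following is a minimal sketch against the schema above; the `queries-1k.jsonl` filename follows the naming pattern used in Quick Start, and the check itself is purely illustrative.

```python
import json

# Required fields, taken from the example record above.
required = {"id", "query", "semantic_variations", "dissimilar_queries",
            "expected_cache_hit", "similarity_threshold"}

with open("queries-1k.jsonl") as f:
    for line in f:
        record = json.loads(line)
        missing = required - record.keys()
        assert not missing, f"{record.get('id')}: missing fields {missing}"
```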
### queries-10k (Standard Benchmark)
- **Size**: 10,240 queries
- **Purpose**: Standard performance baseline
- **Corpus**: Diverse knowledge domains and query patterns
- **Expected cache hit rate**: 68-72%
- **Runtime**: ~45 minutes on a GPU
### queries-100k (Comprehensive)
- **Size**: 102,400 queries
- **Purpose**: Large-scale cache behavior validation
- **Corpus**: Realistic production query distribution
- **Expected cache hit rate**: 72-76%
- **Runtime**: ~8 hours on a high-end GPU
- **Disk space**: 15 GB decompressed
## Metrics & Evaluation
### Cache Performance
| Metric | Qwen3-8B | LLaMA3-8B | Meaning |
|--------|----------|-----------|---------|
| **Hit Rate** | 71.2% | 69.8% | % of queries found in cache |
| **False Positives** | 0.3% | 0.4% | Incorrect cache matches |
| **False Negatives** | 2.1% | 2.4% | Missed cache opportunities |
| **Recall @ 0.90** | 94.2% | 92.8% | True positives at high threshold |
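These four metrics reduce to simple counts over per-query outcomes. A minimal sketch, assuming each evaluated query yields a predicted cache decision plus a ground-truth label (the field names here are illustrative, not the exact output of `akai semantic-cache:bench`):

```python
def cache_metrics(records):
    """records: iterable of dicts with illustrative boolean fields
    'predicted_hit' (cache decision) and 'expected_cache_hit' (label)."""
    records = list(records)
    tp = sum(r["predicted_hit"] and r["expected_cache_hit"] for r in records)
    fp = sum(r["predicted_hit"] and not r["expected_cache_hit"] for r in records)
    fn = sum(not r["predicted_hit"] and r["expected_cache_hit"] for r in records)
    n = len(records)
    return {
        "hit_rate": (tp + fp) / n,   # % of queries served from cache
        "false_positives": fp / n,   # incorrect cache matches
        "false_negatives": fn / n,   # missed cache opportunities
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,  # true hits recovered
    }
```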
### Latency Improvement
```
Cache Miss: 125 ms (full inference)
Cache Hit: 2 ms (embedding lookup + cache retrieval)
Speedup: 62.5x
Average (71% hit rate): 125*0.29 + 2*0.71 ≈ 38 ms
Effective speedup: 3.3x vs. no cache
```
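The same arithmetic as a reusable helper; the numbers match the block above (expected latency is a hit-rate-weighted average of hit and miss latencies):

```python
def effective_latency_ms(hit_rate, hit_ms=2.0, miss_ms=125.0):
    """Expected per-query latency under a given cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

avg = effective_latency_ms(0.71)                    # 37.67 ms, i.e. ~38 ms
print(f"speedup vs. no cache: {125.0 / avg:.1f}x")  # ~3.3x
```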
### Memory Efficiency
- **Embedding cache size**: ~800 MB for 100K queries
- **Memory per cached embedding**: ~8 KB
- **Compression ratio**: 1.4x with optional zstd compression
- **Peak memory during benchmark**: 4 GB (with batch size 32)
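These figures are mutually consistent; a quick back-of-the-envelope check (reading 8 KB as, e.g., 2048 float32 dimensions is an assumption for illustration only, not a documented embedding size):

```python
# 8 KB per cached embedding, e.g. 2048 float32 dims (assumption).
per_embedding_bytes = 2048 * 4                     # 8192 bytes = 8 KB
cache_mb = 100_000 * per_embedding_bytes / 2**20
print(f"{cache_mb:.0f} MB for 100K queries")       # ~781 MB, i.e. ~800 MB
```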
## Running Benchmarks
### Standard Evaluation
```bash
# Benchmark specific model
akai semantic-cache:bench \
--model qwen3-8b \
--dataset queries-10k.jsonl \
--batch-size 32 \
--cache-size 2GB \
--output results.json
# With logging
akai semantic-cache:bench \
--model qwen3-8b \
--dataset queries-10k.jsonl \
--cache-size 2GB \
--verbose \
--log-interval 100 \
--output results.json
```
### Comparison Between Models
```bash
# Run on multiple models
for model in qwen3-8b llama3-8b; do
akai semantic-cache:bench \
--model $model \
--dataset queries-10k.jsonl \
--output results-$model.json
done
# Compare results
akai semantic-cache:compare \
--results results-qwen3-8b.json results-llama3-8b.json \
--report comparison-report.md
```
### Validation with Custom Threshold
```bash
# Test different similarity thresholds
akai semantic-cache:threshold-sweep \
--model qwen3-8b \
--dataset queries-10k.jsonl \
--thresholds "0.80,0.85,0.90,0.95" \
--output threshold-sweep.json
```
## Benchmark Results
### Latest Results (Aurekai v0.8.0-alpha.1)
**Hardware**: NVIDIA H100 80GB, AMD EPYC 9654
**Date**: 2026-05-02
| Model | Dataset | Hit Rate | Recall@0.90 | Latency (hit) | Latency (miss) |
|-------|---------|----------|--------|---------------|----------------|
| Qwen3-8B | 1K | 72.3% | 94.1% | 1.8ms | 124ms |
| Qwen3-8B | 10K | 71.2% | 93.8% | 1.9ms | 126ms |
| LLaMA3-8B | 1K | 70.1% | 92.4% | 2.1ms | 127ms |
| LLaMA3-8B | 10K | 69.8% | 92.1% | 2.2ms | 129ms |
See [results/](./results/) for detailed breakdowns by domain and query type.
## Implementation Notes
### Cache Configuration
```json
{
"semantic_cache": {
"enabled": true,
"similarity_threshold": 0.88,
"max_cache_size": "2GB",
"eviction_policy": "lru",
"embedding_model": "qwen3-8b",
"batch_size": 32,
"use_mmap": true
}
}
```
### Threshold Selection
- **Aggressive (0.80)**: High hit rate (75%+), more false positives
- **Balanced (0.88)**: Recommended default; 71% hit rate, minimal false positives (see the decision sketch after this list)
- **Conservative (0.95)**: Very few false positives, lower hit rate
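The threshold gates a single comparison: a query is served from cache when its best cosine similarity against the cached embeddings clears the configured value. A minimal sketch, with `numpy` arrays standing in for whatever embedding representation the cache actually uses:

```python
import numpy as np

def is_cache_hit(query_vec, cached_vecs, threshold=0.88):
    """True if the best cosine similarity clears the threshold.

    query_vec: 1-D embedding; cached_vecs: 2-D array, one row per entry.
    """
    if len(cached_vecs) == 0:
        return False
    q = query_vec / np.linalg.norm(query_vec)
    c = cached_vecs / np.linalg.norm(cached_vecs, axis=1, keepdims=True)
    return bool(np.max(c @ q) >= threshold)
```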
## Methodology
### Query Similarity Annotation
Each benchmark dataset includes human-validated semantic similarity annotations:
1. Query pairs are sampled from the corpus
2. Annotators rate each pair's similarity on a 0-1 scale
3. Disagreements are resolved by a third annotator
4. Inter-rater reliability: Krippendorff's α = 0.89
### Cache Consistency Validation
All cache results are validated against ground truth. For each cached result, the suite:

1. Verifies the embedding matches the original query
2. Re-ranks all cached results for the current query
3. Confirms the top match was indeed in cache
4. Validates that latency was significantly improved

A sketch of this check follows.
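A hedged sketch of that loop; `embed`, the cache entry fields, and the latency attributes are illustrative names, not the suite's real API:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def validate_cached_result(query, served, cache, embed):
    # 1. The stored embedding matches a fresh embedding of the original query.
    assert cosine(embed(served.query), served.embedding) > 0.999
    # 2.+3. Re-ranking every cached entry must reproduce the served match.
    q_vec = embed(query)
    best = max(cache.entries, key=lambda e: cosine(q_vec, e.embedding))
    assert best is served
    # 4. The hit must be significantly faster than full inference.
    assert served.hit_latency_ms < 0.1 * served.miss_latency_ms
```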
## Contributing Results
To contribute benchmark results:
1. Run benchmark suite on your hardware
2. Include system specs (GPU, CPU, memory, disk)
3. Report all metrics from evaluation output
4. Submit results via PR with hardware metadata
**Result file format**:
```json
{
"metadata": {
"hardware": "NVIDIA H100, 512GB RAM",
"date": "2026-05-02",
"aurekai_version": "0.8.0-alpha.1"
},
"results": [
{
"model": "qwen3-8b",
"dataset": "queries-10k",
"hit_rate": 0.712,
"recall_at_0_90": 0.938
}
]
}
```
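Contributed files can be checked locally before opening a PR; a minimal stdlib-only sketch, with field names matching the format above:

```python
import json

with open("results.json") as f:
    payload = json.load(f)

# Required metadata fields from the format above.
assert {"hardware", "date", "aurekai_version"} <= payload["metadata"].keys()

for run in payload["results"]:
    print(f"{run['model']} / {run['dataset']}: "
          f"hit_rate={run['hit_rate']:.1%}, "
          f"recall@0.90={run['recall_at_0_90']:.1%}")
```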
## Related Repositories
- **Main Aurekai Repo**: https://github.com/aurekai/aurekai
- **Model Memory**: https://huggingface.co/aurekai/model-memory
- **SAE Dictionaries**: https://huggingface.co/aurekai/sae-dictionaries
- **FPQx Alignments**: https://huggingface.co/aurekai/fpqx-alignments
## Tools & Scripts
- `akai semantic-cache:bench`: Run full benchmark suite
- `akai semantic-cache:compare`: Compare benchmark results
- `akai semantic-cache:threshold-sweep`: Test different thresholds
- `benchmark_to_csv.py`: Export results to CSV format
- `visualize_results.py`: Generate performance plots
## Citation
If you reference these benchmarks in research:
```bibtex
@dataset{aurekai_semantic_cache_bench_2026,
title={Aurekai Semantic Cache Benchmarks},
author={Aurekai Community},
year={2026},
url={https://huggingface.co/aurekai/semantic-cache-bench}
}
```
## License
Licensed under the Aurekai Open Source License. See main repository for details.