File size: 7,334 Bytes

7be0d58

# aurekai/semantic-cache-bench

Semantic caching benchmarks and performance suite for Aurekai. Validates cache consistency, hit rates, and query latency across different model architectures and corpus sizes.

## Overview

Semantic caching is a core optimization in Aurekai that deduplicates semantically similar queries without exact matching. This repository hosts:

- **Benchmark Datasets**: Query corpora with semantic similarity annotations
- **Evaluation Scripts**: Performance measurement and validation tools
- **Results**: Baseline metrics across different models and cache configurations
- **Methodology**: Detailed documentation of benchmark setup and evaluation protocols

## Quick Start

```bash
# Download benchmark suite
git clone https://huggingface.co/aurekai/semantic-cache-bench
cd semantic-cache-bench

# Run quick benchmark
akai semantic-cache:bench \
  --dataset queries-10k.jsonl \
  --model qwen3-8b \
  --cache-size 1GB \
  --output results.json

# Compare results
akai semantic-cache:compare \
  --baseline baseline-results.json \
  --current results.json
```

## Benchmark Datasets

### queries-1k (Minimal Validation)

- **Size**: 1,024 queries
- **Purpose**: Quick validation of cache functionality
- **Format**: JSONL with semantic similarity pairs
- **Runtime**: ~5 minutes on GPU

**Schema**:
```json
{
  "id": "q_001_234",
  "query": "What are the benefits of renewable energy?",
  "semantic_variations": [
    "Advantages of wind and solar power",
    "Why should we invest in renewables?"
  ],
  "dissimilar_queries": [
    "How do fossil fuels work?"
  ],
  "expected_cache_hit": true,
  "similarity_threshold": 0.87
}
```

### queries-10k (Standard Benchmark)

- **Size**: 10,240 queries
- **Purpose**: Standard performance baseline
- **Corpus**: Diverse knowledge domains and query patterns
- **Expected cache hit rate**: 68-72%
- **Runtime**: ~45 minutes on GPU

### queries-100k (Comprehensive)

- **Size**: 102,400 queries
- **Purpose**: Large-scale cache behavior validation
- **Corpus**: Realistic production query distribution
- **Expected cache hit rate**: 72-76%
- **Runtime**: ~8 hours on high-end GPU
- **Disk space**: 15 GB decompressed

## Metrics & Evaluation

### Cache Performance

| Metric | Qwen3-8B | LLaMA3-8B | Meaning |
|--------|----------|-----------|---------|
| **Hit Rate** | 71.2% | 69.8% | % of queries found in cache |
| **False Positives** | 0.3% | 0.4% | Incorrect cache matches |
| **False Negatives** | 2.1% | 2.4% | Missed cache opportunities |
| **Recall @ 0.90** | 94.2% | 92.8% | True positives at high threshold |

### Latency Improvement

```
Cache Miss:    125 ms (full inference)
Cache Hit:     2 ms (embedding lookup + cache retrieval)
Speedup:       62.5x

Average (71% hit rate): 125*0.29 + 2*0.71 = 38 ms
Effective speedup:      3.3x vs. no cache
```

### Memory Efficiency

- **Embedding cache size**: ~800 MB for 100K queries
- **Memory per cached embedding**: ~8 KB
- **Compression ratio**: 1.4x with optional zstd compression
- **Peak memory during benchmark**: 4 GB (with batch size 32)

## Running Benchmarks

### Standard Evaluation

```bash
# Benchmark specific model
akai semantic-cache:bench \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --batch-size 32 \
  --cache-size 2GB \
  --output results.json

# With logging
akai semantic-cache:bench \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --cache-size 2GB \
  --verbose \
  --log-interval 100 \
  --output results.json
```

### Comparison Between Models

```bash
# Run on multiple models
for model in qwen3-8b llama3-8b; do
  akai semantic-cache:bench \
    --model $model \
    --dataset queries-10k.jsonl \
    --output results-$model.json
done

# Compare results
akai semantic-cache:compare \
  --results results-qwen3-8b.json results-llama3-8b.json \
  --report comparison-report.md
```

### Validation with Custom Threshold

```bash
# Test different similarity thresholds
akai semantic-cache:threshold-sweep \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --thresholds "0.80,0.85,0.90,0.95" \
  --output threshold-sweep.json
```

## Benchmark Results

### Latest Results (Aurekai v0.8.0-alpha.1)

**Hardware**: NVIDIA H100 80GB, AMD EPYC 9654
**Date**: 2026-05-02

| Model | Dataset | Hit Rate | P@0.90 | Latency (hit) | Latency (miss) |
|-------|---------|----------|--------|---------------|----------------|
| Qwen3-8B | 1K | 72.3% | 94.1% | 1.8ms | 124ms |
| Qwen3-8B | 10K | 71.2% | 93.8% | 1.9ms | 126ms |
| LLaMA3-8B | 1K | 70.1% | 92.4% | 2.1ms | 127ms |
| LLaMA3-8B | 10K | 69.8% | 92.1% | 2.2ms | 129ms |

See [results/](./results/) for detailed breakdowns by domain and query type.

## Implementation Notes

### Cache Configuration

```json
{
  "semantic_cache": {
    "enabled": true,
    "similarity_threshold": 0.88,
    "max_cache_size": "2GB",
    "eviction_policy": "lru",
    "embedding_model": "qwen3-8b",
    "batch_size": 32,
    "use_mmap": true
  }
}
```

### Threshold Selection

- **Aggressive (0.80)**: High hit rate (75%+), more false positives
- **Balanced (0.88)**: Recommended default, 71% hit rate, minimal false positives
- **Conservative (0.95)**: Very few false positives, lower hit rate

## Methodology

### Query Similarity Annotation

Each benchmark dataset includes human-validated semantic similarity annotations:

1. Query pairs sampled from corpus
2. Annotators rate similarity (0-1)
3. Disagreements resolved with third annotator
4. Inter-rater reliability: Krippendorff's α = 0.89

### Cache Consistency Validation

All cache results validated against ground truth:

```python
# For each cached result:
1. Verify embedding matches original query
2. Re-rank all cached results for current query  
3. Confirm top match was indeed in cache
4. Validate latency was significantly improved
```

## Contributing Results

To contribute benchmark results:

1. Run benchmark suite on your hardware
2. Include system specs (GPU, CPU, memory, disk)
3. Report all metrics from evaluation output
4. Submit results via PR with hardware metadata

**Result file format**:
```json
{
  "metadata": {
    "hardware": "NVIDIA H100, 512GB RAM",
    "date": "2026-05-02",
    "aurekai_version": "0.8.0-alpha.1"
  },
  "results": [
    {
      "model": "qwen3-8b",
      "dataset": "queries-10k",
      "hit_rate": 0.712,
      "recall_at_0_90": 0.938
    }
  ]
}
```

## Related Repositories

- **Main Aurekai Repo**: https://github.com/aurekai/aurekai
- **Model Memory**: https://huggingface.co/aurekai/model-memory
- **SAE Dictionaries**: https://huggingface.co/aurekai/sae-dictionaries
- **FPQx Alignments**: https://huggingface.co/aurekai/fpqx-alignments

## Tools & Scripts

- `akai semantic-cache:bench`: Run full benchmark suite
- `akai semantic-cache:compare`: Compare benchmark results
- `akai semantic-cache:threshold-sweep`: Test different thresholds
- `benchmark_to_csv.py`: Export results to CSV format
- `visualize_results.py`: Generate performance plots

## Citation

If you reference these benchmarks in research:

```bibtex
@dataset{aurekai_semantic_cache_bench_2026,
  title={Aurekai Semantic Cache Benchmarks},
  author={Aurekai Community},
  year={2026},
  url={https://huggingface.co/aurekai/semantic-cache-bench}
}
```

## License

Licensed under the Aurekai Open Source License. See main repository for details.