File size: 7,334 Bytes
7be0d58
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
# aurekai/semantic-cache-bench

Semantic caching benchmarks and performance suite for Aurekai. Validates cache consistency, hit rates, and query latency across different model architectures and corpus sizes.

## Overview

Semantic caching is a core optimization in Aurekai that deduplicates semantically similar queries without exact matching. This repository hosts:

- **Benchmark Datasets**: Query corpora with semantic similarity annotations
- **Evaluation Scripts**: Performance measurement and validation tools
- **Results**: Baseline metrics across different models and cache configurations
- **Methodology**: Detailed documentation of benchmark setup and evaluation protocols

## Quick Start

```bash
# Download benchmark suite
git clone https://huggingface.co/aurekai/semantic-cache-bench
cd semantic-cache-bench

# Run quick benchmark
akai semantic-cache:bench \
  --dataset queries-10k.jsonl \
  --model qwen3-8b \
  --cache-size 1GB \
  --output results.json

# Compare results
akai semantic-cache:compare \
  --baseline baseline-results.json \
  --current results.json
```

## Benchmark Datasets

### queries-1k (Minimal Validation)

- **Size**: 1,024 queries
- **Purpose**: Quick validation of cache functionality
- **Format**: JSONL with semantic similarity pairs
- **Runtime**: ~5 minutes on GPU

**Schema**:
```json
{
  "id": "q_001_234",
  "query": "What are the benefits of renewable energy?",
  "semantic_variations": [
    "Advantages of wind and solar power",
    "Why should we invest in renewables?"
  ],
  "dissimilar_queries": [
    "How do fossil fuels work?"
  ],
  "expected_cache_hit": true,
  "similarity_threshold": 0.87
}
```

### queries-10k (Standard Benchmark)

- **Size**: 10,240 queries
- **Purpose**: Standard performance baseline
- **Corpus**: Diverse knowledge domains and query patterns
- **Expected cache hit rate**: 68-72%
- **Runtime**: ~45 minutes on GPU

### queries-100k (Comprehensive)

- **Size**: 102,400 queries
- **Purpose**: Large-scale cache behavior validation
- **Corpus**: Realistic production query distribution
- **Expected cache hit rate**: 72-76%
- **Runtime**: ~8 hours on high-end GPU
- **Disk space**: 15 GB decompressed

## Metrics & Evaluation

### Cache Performance

| Metric | Qwen3-8B | LLaMA3-8B | Meaning |
|--------|----------|-----------|---------|
| **Hit Rate** | 71.2% | 69.8% | % of queries found in cache |
| **False Positives** | 0.3% | 0.4% | Incorrect cache matches |
| **False Negatives** | 2.1% | 2.4% | Missed cache opportunities |
| **Recall @ 0.90** | 94.2% | 92.8% | True positives at high threshold |

### Latency Improvement

```
Cache Miss:    125 ms (full inference)
Cache Hit:     2 ms (embedding lookup + cache retrieval)
Speedup:       62.5x

Average (71% hit rate): 125*0.29 + 2*0.71 = 38 ms
Effective speedup:      3.3x vs. no cache
```

### Memory Efficiency

- **Embedding cache size**: ~800 MB for 100K queries
- **Memory per cached embedding**: ~8 KB
- **Compression ratio**: 1.4x with optional zstd compression
- **Peak memory during benchmark**: 4 GB (with batch size 32)

## Running Benchmarks

### Standard Evaluation

```bash
# Benchmark specific model
akai semantic-cache:bench \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --batch-size 32 \
  --cache-size 2GB \
  --output results.json

# With logging
akai semantic-cache:bench \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --cache-size 2GB \
  --verbose \
  --log-interval 100 \
  --output results.json
```

### Comparison Between Models

```bash
# Run on multiple models
for model in qwen3-8b llama3-8b; do
  akai semantic-cache:bench \
    --model $model \
    --dataset queries-10k.jsonl \
    --output results-$model.json
done

# Compare results
akai semantic-cache:compare \
  --results results-qwen3-8b.json results-llama3-8b.json \
  --report comparison-report.md
```

### Validation with Custom Threshold

```bash
# Test different similarity thresholds
akai semantic-cache:threshold-sweep \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --thresholds "0.80,0.85,0.90,0.95" \
  --output threshold-sweep.json
```

## Benchmark Results

### Latest Results (Aurekai v0.8.0-alpha.1)

**Hardware**: NVIDIA H100 80GB, AMD EPYC 9654
**Date**: 2026-05-02

| Model | Dataset | Hit Rate | P@0.90 | Latency (hit) | Latency (miss) |
|-------|---------|----------|--------|---------------|----------------|
| Qwen3-8B | 1K | 72.3% | 94.1% | 1.8ms | 124ms |
| Qwen3-8B | 10K | 71.2% | 93.8% | 1.9ms | 126ms |
| LLaMA3-8B | 1K | 70.1% | 92.4% | 2.1ms | 127ms |
| LLaMA3-8B | 10K | 69.8% | 92.1% | 2.2ms | 129ms |

See [results/](./results/) for detailed breakdowns by domain and query type.

## Implementation Notes

### Cache Configuration

```json
{
  "semantic_cache": {
    "enabled": true,
    "similarity_threshold": 0.88,
    "max_cache_size": "2GB",
    "eviction_policy": "lru",
    "embedding_model": "qwen3-8b",
    "batch_size": 32,
    "use_mmap": true
  }
}
```

### Threshold Selection

- **Aggressive (0.80)**: High hit rate (75%+), more false positives
- **Balanced (0.88)**: Recommended default, 71% hit rate, minimal false positives
- **Conservative (0.95)**: Very few false positives, lower hit rate

## Methodology

### Query Similarity Annotation

Each benchmark dataset includes human-validated semantic similarity annotations:

1. Query pairs sampled from corpus
2. Annotators rate similarity (0-1)
3. Disagreements resolved with third annotator
4. Inter-rater reliability: Krippendorff's α = 0.89

### Cache Consistency Validation

All cache results validated against ground truth:

```python
# For each cached result:
1. Verify embedding matches original query
2. Re-rank all cached results for current query  
3. Confirm top match was indeed in cache
4. Validate latency was significantly improved
```

## Contributing Results

To contribute benchmark results:

1. Run benchmark suite on your hardware
2. Include system specs (GPU, CPU, memory, disk)
3. Report all metrics from evaluation output
4. Submit results via PR with hardware metadata

**Result file format**:
```json
{
  "metadata": {
    "hardware": "NVIDIA H100, 512GB RAM",
    "date": "2026-05-02",
    "aurekai_version": "0.8.0-alpha.1"
  },
  "results": [
    {
      "model": "qwen3-8b",
      "dataset": "queries-10k",
      "hit_rate": 0.712,
      "recall_at_0_90": 0.938
    }
  ]
}
```

## Related Repositories

- **Main Aurekai Repo**: https://github.com/aurekai/aurekai
- **Model Memory**: https://huggingface.co/aurekai/model-memory
- **SAE Dictionaries**: https://huggingface.co/aurekai/sae-dictionaries
- **FPQx Alignments**: https://huggingface.co/aurekai/fpqx-alignments

## Tools & Scripts

- `akai semantic-cache:bench`: Run full benchmark suite
- `akai semantic-cache:compare`: Compare benchmark results
- `akai semantic-cache:threshold-sweep`: Test different thresholds
- `benchmark_to_csv.py`: Export results to CSV format
- `visualize_results.py`: Generate performance plots

## Citation

If you reference these benchmarks in research:

```bibtex
@dataset{aurekai_semantic_cache_bench_2026,
  title={Aurekai Semantic Cache Benchmarks},
  author={Aurekai Community},
  year={2026},
  url={https://huggingface.co/aurekai/semantic-cache-bench}
}
```

## License

Licensed under the Aurekai Open Source License. See main repository for details.