# Multi-Manifold Retrieval: Proof of Concept

A proof-of-concept implementation of the Multi-Manifold Retrieval defense against spectral poisoning attacks (GeoPoison-RAG) on Retrieval-Augmented Generation systems.

## Core Idea

Standard RAG systems use a single shared embedding space for queries and documents, so the **document geometry is identical to the retrieval geometry**. GeoPoison-RAG exploits this by computing the spectral structure (Fiedler vector) of the document graph Laplacian to find the optimal placement for adversarial documents.
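As a rough illustration of the spectral structure mentioned above, the sketch below computes the Fiedler vector (the eigenvector for the second-smallest Laplacian eigenvalue) of a document graph. The graph construction is an assumption made for this example: a dense graph weighted by non-negative cosine similarity with an unnormalised Laplacian. The name `fiedler_vector` is hypothetical, and `evaluation/spectral_analysis.py` may build its graph differently.

```python
import numpy as np


def fiedler_vector(embeddings: np.ndarray):
    """Return (lambda_2, Fiedler vector) for a cosine-similarity document graph.

    Assumption: edge weights are cosine similarities clipped at zero.
    """
    # Normalise rows so that dot products are cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    weights = np.clip(x @ x.T, 0.0, None)
    np.fill_diagonal(weights, 0.0)  # no self-loops

    # Unnormalised graph Laplacian L = D - W.
    laplacian = np.diag(weights.sum(axis=1)) - weights

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return eigvals[1], eigvecs[:, 1]


# Toy corpus: two groups of 2-D "documents" pointing in different directions.
angles = np.array([0.0, 0.05, 0.1, 0.15, 1.4, 1.45, 1.5, 1.55])
docs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

lam2, v = fiedler_vector(docs)
clusters = v > 0  # the sign pattern of the Fiedler vector splits the graph in two
```

On this toy corpus the sign of each Fiedler entry separates the two groups, which is the kind of structure an attacker with access to the document embeddings could read off.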

Multi-Manifold Retrieval **decouples** these geometries by using:
- Separate query and document manifolds (M_Q and M_D)
- A non-decomposable cross-manifold relevance operator R(q, d)

This breaks the attack because the Laplacian the attacker computes (document space) no longer predicts the Laplacian governing retrieval (cross-manifold).
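One way to picture the decoupling claim is to compare the Fiedler vectors of two graphs over the same documents: one built from document-space affinities and one from retrieval affinities. The sketch below is a toy under stated assumptions. The function names are hypothetical, the affinity matrices are hand-built rather than derived from R(q, d), and measuring alignment as the absolute cosine between Fiedler vectors is an assumed definition, not necessarily the one used in `evaluation/spectral_analysis.py`.

```python
import numpy as np


def fiedler(affinity: np.ndarray) -> np.ndarray:
    """Unit-norm Fiedler vector of the unnormalised Laplacian of `affinity`."""
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    _, eigvecs = np.linalg.eigh(laplacian)  # ascending eigenvalues
    return eigvecs[:, 1]


def fiedler_alignment(aff_a: np.ndarray, aff_b: np.ndarray) -> float:
    """|cos(theta)| between two Fiedler vectors (eigenvector sign is arbitrary)."""
    return abs(float(fiedler(aff_a) @ fiedler(aff_b)))


def two_block_affinity(labels: np.ndarray, within=1.0, across=0.1) -> np.ndarray:
    """Toy affinity: strong weights inside a group, weak weights across groups."""
    same = labels[:, None] == labels[None, :]
    aff = np.where(same, within, across)
    np.fill_diagonal(aff, 0.0)
    return aff


# Document-space grouping vs. a hypothetical retrieval grouping that cuts across it.
doc_aff = two_block_affinity(np.array([0, 0, 0, 0, 1, 1, 1, 1]))
ret_aff = two_block_affinity(np.array([0, 0, 1, 1, 0, 0, 1, 1]))

same_geometry = fiedler_alignment(doc_aff, doc_aff)  # shared geometry
decoupled = fiedler_alignment(doc_aff, ret_aff)      # decoupled geometry
```

When both graphs share one geometry the alignment is close to 1. When the retrieval grouping cuts across the document grouping, as in this toy, it drops towards 0, so the attacker's Fiedler vector says little about the retrieval graph.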

## Project Structure

```
multi_manifold_retrieval/
├── models/
│   ├── cross_manifold_operator.py   # Construction C: Attention-Geometric Hybrid
│   ├── encoders.py                  # Sentence-transformer wrapper
│   └── baseline.py                  # Standard cosine similarity baseline
├── training/
│   ├── train.py                     # Training loop
│   ├── data.py                      # MS MARCO data loading
│   └── losses.py                    # Contrastive loss
├── evaluation/
│   ├── spectral_analysis.py         # L_D, L_R, spectral discrepancy, Fiedler alignment
│   ├── retrieval_metrics.py         # MRR@10, Recall@100
│   └── attack_simulation.py         # GeoPoison-RAG simulation
proofs/
├── proof_theorem_4_3.tex            # Spectral Decoupling theorem
└── proof_theorem_6_1.tex            # Query Complexity Lower Bound theorem
configs/
└── default.yaml                     # Hyperparameters
run_experiment.py                    # End-to-end pipeline
```

## Setup

```bash
pip install -r requirements.txt
```

## Running

Full experiment (train + evaluate + spectral analysis + attack):
```bash
python run_experiment.py --config configs/default.yaml
```

Skip training and load from checkpoint:
```bash
python run_experiment.py --skip-train --checkpoint checkpoints/best_operator.pt
```

## Key Metrics

| Metric | Baseline (expected) | Multi-Manifold (expected) |
|--------|---------------------|---------------------------|
| Spectral discrepancy δ | ≈ 0 | > 0 (significant) |
| Fiedler alignment cos(θ) | ≈ 1 | < 0.5 |
| ASR@10 | > 0.8 | Significantly lower |
| MRR@10 | Reference | ≥ 80% of baseline |
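For reference, the retrieval-quality metrics can be sketched with their standard definitions. This is a minimal illustration, not the code in `evaluation/retrieval_metrics.py`. The function names and input layout (one ranked list of document IDs and one set of relevant IDs per query) are assumptions, and every query is assumed to have at least one relevant document.

```python
from typing import Sequence, Set


def mrr_at_k(rankings: Sequence[Sequence[str]],
             relevant: Sequence[Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(rankings: Sequence[Sequence[str]],
                relevant: Sequence[Set[str]], k: int = 100) -> float:
    """Mean fraction of each query's relevant documents found in the top k."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        total += len(set(ranked[:k]) & rel) / len(rel)
    return total / len(rankings)


# Two toy queries: the first finds its relevant document at rank 2,
# the second does not retrieve its relevant document at all.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"q"}]

mrr = mrr_at_k(rankings, relevant, k=10)        # (1/2 + 0) / 2 = 0.25
recall = recall_at_k(rankings, relevant, k=100)  # (1 + 0) / 2 = 0.5
```

The "≥ 80% of baseline" target in the table would then correspond to comparing `mrr_at_k` for the multi-manifold retriever against the same quantity for the cosine baseline on the same queries.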

## Formal Proofs

- `proofs/proof_theorem_4_3.tex`: Proves that non-decomposable R with positive cross-manifold curvature guarantees spectral decoupling δ ≥ Ω(κ_R · λ_2(L_D)).
- `proofs/proof_theorem_6_1.tex`: Proves that an adaptive adversary needs Ω(Vol(M_Q) / V_{d_Q}(ε/κ_R)) oracle queries to reconstruct R.