Upload README.md
README.md
CHANGED
@@ -1,26 +1,191 @@
-
- Source code: https://github.com/huggingface/ml-intern
# StruCTA: Structured Causal Transformer with Abstraction

**A privacy-preserving transformer architecture that targets GPT-level reasoning while keeping sensitive entities entirely outside the model.**

---

## Core Innovation

Modern LLMs process raw text, so every training batch and inference query exposes sensitive entities (names, addresses, SSNs, financial data) to the model. Even with entity masking or differentially private (DP) training, the model still sees surrounding context that can leak private data.

**StruCTA** addresses this by replacing raw text with **privacy-preserving structured representations**: abstract semantic graphs in which sensitive entities are replaced with typed placeholders (`<PERSON_1>`, `<MONEY_1>`, etc.). The transformer operates natively on these graph structures.

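The placeholder substitution described above can be sketched in a few lines. This is a minimal illustration, not the project's `abstraction.py`: the regex patterns, the `known_people` argument (a stand-in for a real NER model), and the function name `abstract_entities` are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; a real pipeline would use NER.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
}


def abstract_entities(text, known_people=()):
    """Replace sensitive spans with typed placeholders; return (abstract_text, vault)."""
    vault = {}                    # placeholder -> original value, kept outside the model
    counters = defaultdict(int)   # running index per entity type
    seen = {}                     # (type, value) -> placeholder, so repeats share one id

    def placeholder(kind, value):
        key = (kind, value)
        if key not in seen:
            counters[kind] += 1
            seen[key] = f"<{kind}_{counters[kind]}>"
            vault[seen[key]] = value
        return seen[key]

    for name in known_people:     # assumed stand-in for an NER model's PERSON spans
        if name in text:
            text = text.replace(name, placeholder("PERSON", name))
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: placeholder(k, m.group(0)), text)
    return text, vault


abstract_text, vault = abstract_entities(
    "John Smith (SSN: 123-45-6789) earns $75,000 per year.",
    known_people=["John Smith"],
)
print(abstract_text)  # <PERSON_1> (SSN: <SSN_1>) earns <MONEY_1> per year.
```

Only `abstract_text` would be passed on to graph parsing and the model; `vault` stays in the external store.
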

---

## Architecture Overview

```
Raw Text Input
        ↓
┌──────────────────────────────────────────────────────────────────────┐
│ [ABSTRACTION LAYER] - OUTSIDE MODEL (external pipeline)              │
│   • Named Entity Recognition                                         │
│   • Entity Abstraction: "John Smith" → "<PERSON_1>"                  │
│   • AMR Graph Parsing                                                │
│   • Vault Storage (encrypted external)                               │
└──────────────────────────────────────────────────────────────────────┘
        ↓
Structured Graph (AMR nodes with abstract entity types)
        ↓
┌──────────────────────────────────────────────────────────────────────┐
│ [STRUCTURED ENCODER] - Graph Transformer with Structural Encodings   │
│   • Centrality Encoding: node degree → importance embeddings         │
│   • Spatial Encoding: shortest-path distance as attention bias       │
│   • Edge Encoding: relationship semantics between nodes              │
│   • Position-agnostic - NO raw text positions used                   │
├──────────────────────────────────────────────────────────────────────┤
│ [PRIVACY VERIFICATION MODULE] - Run-time guards                      │
│   • Structural invariant checking (graph schema validation)          │
│   • Forbidden token leakage detection (bloom-filter style)           │
│   • Entropy-based privacy score                                      │
│   • Privacy budget accountant (RDP moments accountant)               │
├──────────────────────────────────────────────────────────────────────┤
│ [REASONING HEAD] - Cross-Modal Causal Decoder                        │
│   • Cross-attends from graph nodes to text generation                │
│   • Graph-based positional encoding (not text positions)             │
│   • Generates abstract answers (e.g., "<PERSON_1> owes <AMOUNT_1>")  │
│   • No sensitive data in weights, activations, or outputs            │
└──────────────────────────────────────────────────────────────────────┘
        ↓
Abstract Answer
        ↓
┌──────────────────────────────────────────────────────────────────────┐
│ [DE-ABSTRACTION LAYER] - OUTSIDE MODEL (privileged operation)        │
│   • Maps abstract tokens back to concrete entities                   │
│   • Uses external vault (NEVER part of model weights)                │
│   • Operates in secure enclave or HSM                                │
└──────────────────────────────────────────────────────────────────────┘
        ↓
Concrete Answer
```

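The last stage of the diagram, mapping abstract tokens back to concrete entities, amounts to a lookup against the vault. The sketch below assumes the vault is a plain dictionary and that placeholders follow the `<TYPE_N>` shape shown in the diagram; the function name `deabstract` is hypothetical and is not taken from `deabstraction.py`.

```python
import re

# Assumed placeholder shape: upper-case type, underscore, running index.
PLACEHOLDER = re.compile(r"<[A-Z]+_\d+>")


def deabstract(abstract_answer, vault):
    """Swap each placeholder for its vault entry; unknown placeholders are left untouched."""
    return PLACEHOLDER.sub(lambda m: vault.get(m.group(0), m.group(0)), abstract_answer)


vault = {"<PERSON_1>": "John Smith", "<AMOUNT_1>": "$75,000"}
concrete = deabstract("<PERSON_1> owes <AMOUNT_1>", vault)
print(concrete)  # John Smith owes $75,000
```

In a deployment matching the diagram, this step would run in the privileged environment that holds the vault, not alongside the model.
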

---

## Key Design Principles

### 1. **Sensitive Data Never Touches the Model**
- Raw text → Abstraction Layer (external pipeline)
- Only abstract tokens and structural graphs enter the transformer
- Vault keys are NEVER in model weights, gradients, or activations

### 2. **Structured Attention = Inductive Biases**
- Graphormer-style structural encodings (centrality, spatial, edge)
- These inject graph topology directly into attention scores
- More expressive than GCN/GIN while maintaining full self-attention

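To make the centrality and spatial encodings concrete, here is a small pure-Python sketch of the quantities involved: node degrees (which would index a learned centrality embedding) and shortest-path distances (which would index a learned scalar bias added to attention scores before the softmax). The function names, the `max_dist` cap for unreachable pairs, and the hand-written `bias_table` are assumptions for illustration; in a trained model the bias values are learned parameters.

```python
from collections import deque


def structural_encodings(num_nodes, edges, max_dist=8):
    """Return (degree, spd) for an undirected graph given as a list of (u, v) edges."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    degree = [len(neighbors) for neighbors in adj]  # index for a centrality embedding

    # spd[i][j]: BFS hop count from i to j, capped at max_dist.
    # Unreachable pairs keep the cap value (an assumption of this sketch).
    spd = [[max_dist] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        spd[src][src] = 0
        queue = deque([src])
        seen = {src}
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    spd[src][v] = min(spd[src][u] + 1, max_dist)
                    queue.append(v)
    return degree, spd


def biased_scores(scores, spd, bias_table):
    """Add a per-distance scalar bias to raw attention scores (applied before softmax)."""
    n = len(scores)
    return [[scores[i][j] + bias_table[spd[i][j]] for j in range(n)] for i in range(n)]


# Path graph 0-1-2 plus an isolated node 3.
degree, spd = structural_encodings(4, [(0, 1), (1, 2)])
bias_table = [-0.5 * d for d in range(9)]  # stand-in for learned biases
biased = biased_scores([[0.0] * 4 for _ in range(4)], spd, bias_table)
print(degree)   # [1, 2, 1, 0]
print(spd[0])   # [0, 1, 2, 8]
```
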
### 3. **Graph-Structured Reasoning**
- Reasoning follows AMR graph topology, not linear text positions
- Intended to reduce logical drift compared to linear Chain-of-Thought
- Each reasoning step is grounded in graph nodes

### 4. **Runtime Privacy Verification**
- Structural invariant checking enforces valid graph schemas
- Forbidden token detector catches accidental generation of raw entities
- Entropy bound checks that outputs are unpredictable enough to hinder reconstruction

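Two of these runtime checks can be illustrated with standard-library code: a Bloom filter for forbidden-token lookup and a Shannon-entropy threshold on a token distribution. This is a sketch under assumptions, not the `PrivacyVerificationModule` itself: the class and function names, the filter size, and the threshold are made up for the example. Note that a Bloom filter can report false positives but never false negatives, so a forbidden token that was added is always flagged.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter: membership tests may yield false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def check_output(tokens, forbidden, token_probs, min_entropy_bits):
    """Flag tokens found in the forbidden filter and test the entropy threshold."""
    leaked = [t for t in tokens if t.lower() in forbidden]
    return {"leaked": leaked, "entropy_ok": shannon_entropy(token_probs) >= min_entropy_bits}


forbidden = BloomFilter()
for raw in ("john", "smith"):
    forbidden.add(raw)

report = check_output(
    tokens=["<PERSON_1>", "owes", "Smith"],
    forbidden=forbidden,
    token_probs=[0.5, 0.5],
    min_entropy_bits=1.0,
)
print(report)
```
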
### 5. **Differentially Private Training**
- Ghost Clipping DP-Adam during fine-tuning
- Large batch sizes (1024-2048) with per-example gradient clipping
- Entity-level DP, not document-level

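The core DP step behind these bullets is: clip each example's gradient to a fixed L2 norm, sum, add Gaussian noise scaled to that norm, and average. The sketch below shows that step in plain Python. It deliberately materializes per-example gradients, which is exactly what ghost clipping avoids for memory reasons, so it illustrates the arithmetic rather than the ghost clipping technique. The function name and parameters are hypothetical, and for an entity-level guarantee each "example" would have to carry one entity's entire contribution, which this sketch simply assumes.

```python
import math
import random


def dp_aggregate(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each gradient to clip_norm (L2), sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # shrink only if norm > clip_norm
        for i, g in enumerate(grad):
            total[i] += g * scale
    sigma = noise_multiplier * clip_norm              # noise std tied to the clipping bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_example_grads) for x in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
update = dp_aggregate(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
print(update)
```

The resulting noisy average would then be fed to the optimizer (Adam, in the DP-Adam setting) in place of the ordinary mini-batch gradient.
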

---

## Component Comparison

| Dimension | Prior Art | StruCTA |
|-----------|-----------|---------|
| Input representation | Raw tokens / linearized graphs | Native graph attention on AMR nodes |
| Entity handling | Entity masking (context still visible) | Full abstraction + external vault |
| Structural encoding | Added to node features | Injected into attention scores (Graphormer) |
| Privacy during training | DP-SGD (memory-expensive) | Ghost Clipping + abstract entities |
| Privacy during inference | None / post-hoc filtering | Real-time verification module |
| Reasoning scaffold | Linear Chain-of-Thought | Graph-structured reasoning nodes |
| Position encoding | Text position | Graph centrality + shortest-path distance |


---

## Files

- `architecture.md` - Full technical specification with pseudocode
- `config.py` - Model configuration dataclass
- `encoder.py` - PrivacyGraphTransformer with structural encodings
- `decoder.py` - Cross-modal reasoning decoder
- `privacy.py` - PrivacyVerificationModule with multi-level checks
- `abstraction.py` - Entity abstraction and AMR graph pipeline
- `deabstraction.py` - De-abstraction to concrete entities
- `model.py` - End-to-end StruCTA composition


---

## Theoretical Privacy Guarantees

The architecture provides **three complementary privacy guarantees**:

1. **By Construction**: Raw sensitive tokens never enter the model. The abstraction layer is deterministic and invertible only via the external vault.

2. **(ε, δ)-Differential Privacy**: During fine-tuning, Ghost Clipping DP-Adam provides entity-level DP guarantees, so each entity's influence on the model is bounded.

3. **Structural Leakage Bound**: The privacy verification module enforces a bound on the output entropy over abstract tokens, which is intended to prevent reconstruction attacks even with white-box model access.

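For reference, the standard definition behind the second guarantee is as follows. A randomized training mechanism $M$ is $(\varepsilon, \delta)$-differentially private if, for every pair of neighboring datasets $D$ and $D'$ and every measurable set of outcomes $S$,

$$
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta .
$$

Reading "entity-level" in the usual way (an interpretation, since the README does not spell it out), $D$ and $D'$ are neighbors when they differ in all records belonging to a single entity, rather than in a single document.
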

---

## Training Pipeline

### Stage 1: Graph Pre-Training (Public Data)
- **Data**: Silver AMR graphs from Wikipedia
- **Task**: Node/edge masking + subgraph recovery
- **No privacy constraints**: the goal is to learn structural reasoning
- **Config**: AdamW, lr=2e-4, batch=512, 500K steps

### Stage 2: Privacy-Aware Fine-Tuning (Domain Data)
- **Data**: Domain-specific text with entity annotations
- **Task**: Answer generation on abstract structured documents
- **Privacy**: Ghost Clipping DP-Adam, ε=3, δ=1e-5
- **Config**: lr=5e-4, batch=2048, 50K steps

### Stage 3: Reasoning Fine-Tuning
- **Data**: LogiQA, MedQA, or domain reasoning benchmarks
- **Task**: Structured-to-structured reasoning
- **Objective**: Cross-entropy + structural alignment loss

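The three stages can be captured as plain configuration objects. The sketch below is one hypothetical way to do so; `StageConfig`, its field names, and the stage identifiers are assumptions and are not taken from `config.py`. Values the README does not state (for example the Stage 3 optimizer and learning rate) are left as `None` or marked unspecified rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage; None means the README does not specify it."""
    name: str
    optimizer: str
    lr: Optional[float] = None
    batch_size: Optional[int] = None
    steps: Optional[int] = None
    epsilon: Optional[float] = None
    delta: Optional[float] = None

    @property
    def is_private(self) -> bool:
        # A stage counts as DP-trained only if both privacy parameters are set.
        return self.epsilon is not None and self.delta is not None


STAGES = [
    StageConfig("graph_pretraining", "AdamW", lr=2e-4, batch_size=512, steps=500_000),
    StageConfig(
        "privacy_finetuning", "DP-Adam (ghost clipping)",
        lr=5e-4, batch_size=2048, steps=50_000, epsilon=3.0, delta=1e-5,
    ),
    StageConfig("reasoning_finetuning", "unspecified"),
]

for stage in STAGES:
    print(stage.name, "private" if stage.is_private else "non-private")
```
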

---

## Usage Example

```python
from structa import StruCTA, StruCTAConfig

# Initialize
config = StruCTAConfig(hidden_dim=768, num_encoder_layers=12)
model = StruCTA(config)

# Privacy-preserving reasoning
result = model.generate_from_text(
    "John Smith (SSN: 123-45-6789) was born on January 15, 1980. "
    "He earns $75,000 per year. What is his annual income?",
    max_length=50
)

print(result["abstract_answer"])
# "<PERSON_1>'s annual income is <AMOUNT_1>"

print(result["concrete_answer"])
# "John Smith's annual income is $75,000"
```


---

## Citation

Based on research from:
- **Graphormer** (Ying et al., NeurIPS 2021): Structural encodings for graph transformers
- **AMRBART** (Bai et al., ACL 2022): Graph pre-training for AMR
- **Self-Graph Reasoning** (Chen et al., 2025): Graph-structured reasoning for LLMs
- **Ghost Clipping** (Li et al., ICLR 2022): DP fine-tuning of large transformers
- **Controlled Generation for Privacy** (Zhao et al., 2025): Entity-aware control codes

---

## License

Apache-2.0