StruCTA: Structured Causal Transformer with Abstraction

A privacy-preserving transformer architecture that enables GPT-level reasoning while keeping sensitive entities completely outside the model.

Core Innovation

Modern LLMs process raw text, so every training batch and inference query exposes sensitive entities (names, addresses, SSNs, financial data). Even with entity masking or differentially private (DP) training, models still "see" contextual information that can leak private data.

StruCTA solves this by replacing raw text with privacy-preserving structured representations: abstract semantic graphs where sensitive entities are replaced with typed placeholders (<PERSON_1>, <MONEY_1>, etc.). The transformer operates natively on these graph structures.
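A minimal sketch of the placeholder substitution described above. It assumes a few regular expressions as a stand-in for a real NER model, and the names `PATTERNS` and `abstract_entities` are hypothetical, not the API of `abstraction.py`. AMR graph parsing is not shown.

```python
import re

# Hypothetical regex patterns standing in for a real NER model (illustration only).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    "PERSON": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
}


def abstract_entities(text):
    """Replace matched entities with typed placeholders.

    Returns (abstract_text, vault), where vault maps placeholder -> original value.
    """
    vault = {}     # kept outside the model in the described design
    seen = {}      # (type, value) -> placeholder, so repeated mentions share one placeholder
    counters = {}  # per-type running index

    for etype, pattern in PATTERNS.items():
        def substitute(match, etype=etype):
            value = match.group(0)
            key = (etype, value)
            if key not in seen:
                counters[etype] = counters.get(etype, 0) + 1
                seen[key] = f"<{etype}_{counters[etype]}>"
                vault[seen[key]] = value
            return seen[key]

        text = pattern.sub(substitute, text)
    return text, vault


abstract_text, vault = abstract_entities(
    "John Smith (SSN: 123-45-6789) earns $75,000 per year."
)
print(abstract_text)
```

Only `abstract_text` would be passed on to the graph parser and the transformer; `vault` stays with the external pipeline.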

Architecture Overview

Raw Text Input
      ↓
┌─────────────────────────────────────────────────────────────────────┐
│   [ABSTRACTION LAYER] — OUTSIDE MODEL (external pipeline)            │
│   • Named Entity Recognition                                          │
│   • Entity Abstraction: "John Smith" → "<PERSON_1>"                  │
│   • AMR Graph Parsing                                               │
│   • Vault Storage (encrypted external)                              │
└─────────────────────────────────────────────────────────────────────┘
      ↓
Structured Graph (AMR nodes with abstract entity types)
      ↓
┌─────────────────────────────────────────────────────────────────────┐
│   [STRUCTURED ENCODER] — Graph Transformer with Structural Encodings│
│   • Centrality Encoding: node degree → importance embeddings          │
│   • Spatial Encoding: shortest-path distance as attention bias        │
│   • Edge Encoding: relationship semantics between nodes               │
│   • Position-agnostic — NO raw text positions used                    │
├─────────────────────────────────────────────────────────────────────┤
│   [PRIVACY VERIFICATION MODULE] — Run-time guards                     │
│   • Structural invariant checking (graph schema validation)           │
│   • Forbidden token leakage detection (bloom-filter style)            │
│   • Entropy-based privacy score                                       │
│   • Privacy budget accountant (RDP moments accountant)                │
├─────────────────────────────────────────────────────────────────────┤
│   [REASONING HEAD] — Cross-Modal Causal Decoder                      │
│   • Cross-attends from graph nodes to text generation                 │
│   • Graph-based positional encoding (not text positions)              │
│   • Generates abstract answers (e.g., "<PERSON_1> owes <AMOUNT_1>") │
│   • No sensitive data in weights, activations, or outputs               │
└─────────────────────────────────────────────────────────────────────┘
      ↓
Abstract Answer
      ↓
┌─────────────────────────────────────────────────────────────────────┐
│   [DE-ABSTRACTION LAYER] — OUTSIDE MODEL (privileged operation)     │
│   • Maps abstract tokens back to concrete entities                   │
│   • Uses external vault (NEVER part of model weights)                  │
│   • Operates in secure enclave or HSM                               │
└─────────────────────────────────────────────────────────────────────┘
      ↓
Concrete Answer

Key Design Principles

1. Sensitive Data Never Touches the Model

Raw text is handled only by the Abstraction Layer (external pipeline)
Only abstract tokens and structural graphs enter the transformer
Vault keys are NEVER in model weights, gradients, or activations
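The separation above can be illustrated from the de-abstraction side. This is a sketch under assumptions: the vault is modeled as a plain dictionary (the design calls for an encrypted external store), and `deabstract` and `PLACEHOLDER` are hypothetical names rather than the interface of `deabstraction.py`.

```python
import re

# Typed placeholders of the form <TYPE_N>, e.g. <PERSON_1>.
PLACEHOLDER = re.compile(r"<[A-Z]+_\d+>")


def deabstract(abstract_answer, vault):
    """Map typed placeholders back to concrete values using the external vault."""
    def lookup(match):
        token = match.group(0)
        # Unknown placeholders are left untouched rather than guessed.
        return vault.get(token, token)

    return PLACEHOLDER.sub(lookup, abstract_answer)


# Hypothetical vault contents; a plain dict stands in for the encrypted store.
vault = {"<PERSON_1>": "John Smith", "<AMOUNT_1>": "$75,000"}
concrete = deabstract("<PERSON_1> owes <AMOUNT_1>", vault)
print(concrete)
```

Because the lookup is the only place where concrete values appear, it can run in a separate, privileged process that the model never calls into.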

2. Structured Attention = Inductive Biases

Graphormer-style structural encodings (centrality, spatial, edge)
These inject graph topology directly into attention scores
More expressive than GCN/GIN while maintaining full self-attention
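To make the spatial and centrality encodings concrete, the sketch below computes shortest-path distances and node degrees for a toy graph and adds a per-distance scalar bias to raw attention scores before the softmax. It uses only the standard library; the bias values are hypothetical constants, whereas in a Graphormer-style model they would be learned. Edge encoding is omitted.

```python
import math
from collections import deque


def shortest_path_distances(num_nodes, edges):
    """All-pairs hop counts via BFS on an undirected graph (-1 if unreachable)."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] == -1:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def biased_attention_weights(scores, dist, spatial_bias):
    """Add a scalar bias indexed by shortest-path distance, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        biased = [s + spatial_bias[dist[i][j]] for j, s in enumerate(row)]
        peak = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(b - peak) for b in biased]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy path graph 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
dist = shortest_path_distances(4, edges)
# Node degree is the index a centrality embedding table would be looked up with.
degree = [sum(1 for e in edges if n in e) for n in range(4)]
# Hypothetical bias values; key -1 covers unreachable pairs.
spatial_bias = {0: 0.0, 1: -0.5, 2: -1.0, 3: -1.5, -1: -10.0}
scores = [[0.0] * 4 for _ in range(4)]  # uniform raw scores isolate the bias effect
weights = biased_attention_weights(scores, dist, spatial_bias)
```

With uniform raw scores, each node attends most strongly to itself and progressively less to nodes that are more hops away, which is the topology signal the bias is meant to inject.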

3. Graph-Structured Reasoning

Reasoning follows AMR graph topology, not linear text positions
Reduces logical drift compared to Chain-of-Thought
Each reasoning step grounded in graph nodes

4. Runtime Privacy Verification

Structural invariant checking enforces valid graph schemas
Forbidden token detector catches accidental raw entity generation
Entropy bound ensures outputs are unpredictable enough to prevent reconstruction
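Two of these checks can be sketched in a few lines. The example makes several assumptions: a plain Python set replaces the bloom-filter style detector, the entropy is computed over a single output distribution on abstract tokens, and the threshold and function names are hypothetical rather than taken from `privacy.py`.

```python
import math


def find_forbidden_tokens(output_tokens, forbidden):
    """Return output tokens that match a known raw entity string (case-insensitive)."""
    forbidden_lower = {f.lower() for f in forbidden}
    return [t for t in output_tokens if t.lower() in forbidden_lower]


def shannon_entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def passes_privacy_checks(output_tokens, token_probs, forbidden, min_entropy_bits):
    """Return (ok, leaked_tokens, entropy_bits) for one generation step."""
    leaked = find_forbidden_tokens(output_tokens, forbidden)
    entropy = shannon_entropy(token_probs)
    ok = (not leaked) and entropy >= min_entropy_bits
    return ok, leaked, entropy


# Hypothetical raw entity strings known to the external pipeline.
forbidden = {"John", "Smith", "123-45-6789"}
uniform = [0.25, 0.25, 0.25, 0.25]

clean = passes_privacy_checks(["<PERSON_1>", "owes", "<AMOUNT_1>"], uniform, forbidden, 1.0)
leaky = passes_privacy_checks(["John", "owes", "<AMOUNT_1>"], uniform, forbidden, 1.0)
```

A real bloom filter would trade exact membership for constant memory and would avoid storing the raw strings themselves, at the cost of occasional false positives.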

5. Differentially Private Training

Ghost Clipping DP-Adam during fine-tuning
Large batch sizes (1024-2048) with per-example gradient clipping
Entity-level DP, not document-level
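The clip-then-noise step that underlies this kind of training can be sketched as follows. Note the substitution: this is the plain DP-SGD aggregation with per-example gradients materialized explicitly, not Ghost Clipping, which obtains the same clipped result without storing per-example gradients. For entity-level DP, each "example" below is assumed to hold the combined gradient of all records belonging to one entity. The resulting noisy mean would then be fed to the Adam update.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_noisy_mean(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(per_example_grads[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_example_grads) for n in noisy]


# Toy gradients: the first exceeds the clipping bound, the second does not.
grads = [[3.0, 4.0], [0.3, 0.4]]
clipped = [clip_gradient(g, 1.0) for g in grads]

# noise_multiplier=0.0 disables the noise so the clipping effect is visible;
# a private run would use a positive value chosen by the privacy accountant.
update = dp_noisy_mean(grads, max_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
```

Clipping bounds how much any single entity can move the update, and the Gaussian noise scaled to that bound is what the privacy accountant converts into an (ε, δ) figure.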

Component Comparison

Dimension	Prior Art	StruCTA
Input representation	Raw tokens / linearized graphs	Native graph attention on AMR nodes
Entity handling	Entity masking (context still visible)	Full abstraction + external vault
Structural encoding	Added to node features	Injected into attention scores (Graphormer)
Privacy during training	DP-SGD (memory-expensive)	Ghost Clipping + abstract entities
Privacy during inference	None / post-hoc filtering	Real-time verification module
Reasoning scaffold	Linear Chain-of-Thought	Graph-structured reasoning nodes
Position encoding	Text position	Graph centrality + shortest-path distance

Files

architecture.md — Full technical specification with pseudocode
config.py — Model configuration dataclass
encoder.py — PrivacyGraphTransformer with structural encodings
decoder.py — Cross-modal reasoning decoder
privacy.py — PrivacyVerificationModule with multi-level checks
abstraction.py — Entity abstraction and AMR graph pipeline
deabstraction.py — De-abstraction to concrete entities
model.py — End-to-end StruCTA composition

Theoretical Privacy Guarantees

The architecture provides three complementary privacy guarantees:

By Construction: Raw sensitive tokens never enter the model. The abstraction layer is deterministic and invertible only via the external vault.
(ε, δ)-Differential Privacy: During fine-tuning, Ghost Clipping DP-Adam provides entity-level DP guarantees. Each entity's influence on the model is bounded.
Structural Leakage Bound: The privacy verification module enforces a lower bound on output entropy over abstract tokens (the entropy bound in principle 4), which is intended to prevent reconstruction attacks even with white-box model access.
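For reference, the (ε, δ)-differential privacy guarantee named in the second item is conventionally stated as below. The symbols are the standard ones and are not defined elsewhere in this README: M is the randomized training procedure, S is any measurable set of outputs, and D and D' are adjacent datasets. For entity-level DP, adjacency is assumed here to mean that D and D' differ in all records belonging to a single entity.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
\qquad \text{for all measurable } S \text{ and all adjacent } D, D'.
```

Smaller ε and δ mean that the trained model's distribution changes less when one entity's data is added or removed.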

Training Pipeline

Stage 1: Graph Pre-Training (Public Data)

Data: Silver AMR graphs from Wikipedia
Task: Node/edge masking + subgraph recovery
No privacy constraints — learn structural reasoning
Config: AdamW, lr=2e-4, batch=512, 500K steps
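The node-masking half of the Stage 1 task can be sketched as follows. This is an illustration under assumptions: the graph is reduced to a list of node labels, the `<MASK>` symbol and the masking probability are hypothetical, and edge masking and subgraph recovery are not shown.

```python
import random

MASK = "<MASK>"  # hypothetical mask symbol


def mask_nodes(node_labels, mask_prob, rng):
    """Randomly replace node labels with MASK.

    Returns (masked_labels, targets), where targets maps node index -> original label.
    """
    masked, targets = [], {}
    for idx, label in enumerate(node_labels):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[idx] = label  # labels the model is trained to recover
        else:
            masked.append(label)
    return masked, targets


# Toy node labels standing in for the concepts of one AMR graph.
nodes = ["owe-01", "person", "money", "bank"]
masked, targets = mask_nodes(nodes, mask_prob=0.5, rng=random.Random(0))
```

The training loss would then be computed only at the masked positions, comparing the model's predictions with the labels stored in `targets`.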

Stage 2: Privacy-Aware Fine-Tuning (Domain Data)

Data: Domain-specific text with entity annotations
Task: Answer generation on abstract structured documents
Privacy: Ghost Clipping DP-Adam, ε=3, δ=1e-5
Config: lr=5e-4, batch=2048, 50K steps

Stage 3: Reasoning Fine-Tuning

Data: LogiQA, MedQA, or domain reasoning benchmarks
Task: Structured-to-structured reasoning
Objective: Cross-entropy + structural alignment loss

Usage Example

from structa import StruCTA, StruCTAConfig

# Initialize
config = StruCTAConfig(hidden_dim=768, num_encoder_layers=12)
model = StruCTA(config)

# Privacy-preserving reasoning
result = model.generate_from_text(
    "John Smith (SSN: 123-45-6789) was born on January 15, 1980. "
    "He earns $75,000 per year. What is his annual income?",
    max_length=50
)

print(result["abstract_answer"])
# "<PERSON_1>'s annual income is <AMOUNT_1>"

print(result["concrete_answer"])
# "John Smith's annual income is $75,000"

Citation

Based on research from:

Graphormer (Ying et al., NeurIPS 2021): Structural encodings for graph transformers
AMRBART (Bai et al., ACL 2022): Graph pre-training for AMR
Self-Graph Reasoning (Chen et al., 2025): Graph-structured reasoning for LLMs
Ghost Clipping (Li et al., ICLR 2022): DP fine-tuning of large transformers
Controlled Generation for Privacy (Zhao et al., 2025): Entity-aware control codes

License

Apache-2.0

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern
