StruCTA: Structured Causal Transformer with Abstraction

A privacy-preserving transformer architecture designed to support GPT-level reasoning while keeping sensitive entities entirely outside the model.


Core Innovation

Modern LLMs process raw text, so every training batch and inference query exposes sensitive entities (names, addresses, SSNs, financial data). Even with entity masking or differentially private (DP) training, models still "see" contextual information that can leak private data.

StruCTA addresses this by replacing raw text with privacy-preserving structured representations: abstract semantic graphs in which sensitive entities are replaced with typed placeholders (<PERSON_1>, <AMOUNT_1>, etc.). The transformer operates natively on these graph structures.
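As a rough illustration of the placeholder substitution described above, the sketch below swaps detected entities for typed placeholders and records the originals in a separate mapping. The regular expressions are a hypothetical stand-in for a real NER model, and `PATTERNS` and `abstract_entities` are illustrative names, not part of the StruCTA codebase.

```python
import re

# Hypothetical patterns standing in for a real NER model (assumption).
PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("AMOUNT", re.compile(r"\$\d[\d,]*(?:\.\d+)?")),
    ("PERSON", re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")),
]


def abstract_entities(text):
    """Replace matched entities with typed placeholders.

    Returns the abstracted text and a vault mapping each placeholder
    to the original surface form. The vault stays outside the model.
    """
    vault = {}     # placeholder -> original surface form
    seen = {}      # (entity type, surface form) -> placeholder
    counters = {}  # entity type -> number of distinct entities so far

    for etype, pattern in PATTERNS:
        def substitute(match, etype=etype):
            surface = match.group(0)
            key = (etype, surface)
            if key not in seen:
                counters[etype] = counters.get(etype, 0) + 1
                placeholder = f"<{etype}_{counters[etype]}>"
                seen[key] = placeholder
                vault[placeholder] = surface
            # Repeated mentions reuse the same placeholder.
            return seen[key]

        text = pattern.sub(substitute, text)
    return text, vault


text = "John Smith (SSN: 123-45-6789) earns $75,000 per year. John Smith pays rent."
abstract_text, vault = abstract_entities(text)
print(abstract_text)
```

Only `abstract_text` would be passed on to graph parsing and the model; `vault` would go to encrypted external storage.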


Architecture Overview

Raw Text Input
      ↓
┌──────────────────────────────────────────────────────────────────────┐
│   [ABSTRACTION LAYER]: OUTSIDE MODEL (external pipeline)             │
│   • Named Entity Recognition                                         │
│   • Entity Abstraction: "John Smith" → "<PERSON_1>"                  │
│   • AMR Graph Parsing                                                │
│   • Vault Storage (encrypted external)                               │
└──────────────────────────────────────────────────────────────────────┘
      ↓
Structured Graph (AMR nodes with abstract entity types)
      ↓
┌──────────────────────────────────────────────────────────────────────┐
│   [STRUCTURED ENCODER]: Graph Transformer with Structural Encodings  │
│   • Centrality Encoding: node degree → importance embeddings         │
│   • Spatial Encoding: shortest-path distance as attention bias       │
│   • Edge Encoding: relationship semantics between nodes              │
│   • Position-agnostic: NO raw text positions used                    │
├──────────────────────────────────────────────────────────────────────┤
│   [PRIVACY VERIFICATION MODULE]: Run-time guards                     │
│   • Structural invariant checking (graph schema validation)          │
│   • Forbidden token leakage detection (Bloom-filter style)           │
│   • Entropy-based privacy score                                      │
│   • Privacy budget accountant (RDP moments accountant)               │
├──────────────────────────────────────────────────────────────────────┤
│   [REASONING HEAD]: Cross-Modal Causal Decoder                       │
│   • Cross-attends from graph nodes to text generation                │
│   • Graph-based positional encoding (not text positions)             │
│   • Generates abstract answers (e.g., "<PERSON_1> owes <AMOUNT_1>")  │
│   • No sensitive data in weights, activations, or outputs            │
└──────────────────────────────────────────────────────────────────────┘
      ↓
Abstract Answer
      ↓
┌──────────────────────────────────────────────────────────────────────┐
│   [DE-ABSTRACTION LAYER]: OUTSIDE MODEL (privileged operation)       │
│   • Maps abstract tokens back to concrete entities                   │
│   • Uses external vault (NEVER part of model weights)                │
│   • Operates in secure enclave or HSM                                │
└──────────────────────────────────────────────────────────────────────┘
      ↓
Concrete Answer
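The final de-abstraction step of the pipeline can be sketched as a placeholder lookup. This is a minimal sketch, assuming placeholders of the form `<TYPE_N>` and a plain dictionary in place of the encrypted vault; `PLACEHOLDER` and `deabstract` are hypothetical names, and in the design described above this lookup would run inside a secure enclave or HSM.

```python
import re

# Assumed placeholder shape: an upper-case type, an underscore, an index.
PLACEHOLDER = re.compile(r"<[A-Z]+_\d+>")


def deabstract(abstract_answer, vault):
    """Replace each known placeholder with its vault entry.

    Placeholders missing from the vault are left untouched so that
    a failed lookup is visible rather than silently dropped.
    """
    return PLACEHOLDER.sub(
        lambda match: vault.get(match.group(0), match.group(0)),
        abstract_answer,
    )


vault = {"<PERSON_1>": "John Smith", "<AMOUNT_1>": "$75,000"}
concrete = deabstract("<PERSON_1>'s annual income is <AMOUNT_1>", vault)
print(concrete)
```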

Key Design Principles

1. Sensitive Data Never Touches the Model

  • Raw text ↔ Abstraction Layer (external pipeline)
  • Only abstract tokens and structural graphs enter the transformer
  • Vault keys are NEVER in model weights, gradients, or activations

2. Structured Attention = Inductive Biases

  • Graphormer-style structural encodings (centrality, spatial, edge)
  • These inject graph topology directly into attention scores
  • More expressive than GCN/GIN while maintaining full self-attention
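To make the structural encodings concrete, the sketch below computes node degrees (the input to a centrality embedding) and all-pairs shortest-path distances, then adds a per-distance scalar bias to raw attention scores before the softmax. It is written in plain Python for readability; the bias values are fixed numbers standing in for learned parameters, edge encoding is omitted, and all function names are hypothetical.

```python
import math
from collections import deque


def node_degrees(adjacency):
    """Degree of each node; in Graphormer this indexes a learned embedding."""
    return [len(neighbors) for neighbors in adjacency]


def shortest_path_lengths(adjacency):
    """All-pairs hop distances via BFS; unreachable pairs get -1."""
    n = len(adjacency)
    dist = [[-1] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[source][v] == -1:
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist


def attention_with_spatial_bias(scores, dist, spatial_bias):
    """Add a scalar bias indexed by hop distance, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        biased = [s + spatial_bias[dist[i][j]] for j, s in enumerate(row)]
        peak = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(b - peak) for b in biased]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Path graph 0 - 1 - 2, given as adjacency lists.
adjacency = [[1], [0, 2], [1]]
degrees = node_degrees(adjacency)
dist = shortest_path_lengths(adjacency)

# Stand-in for learned biases; key -1 covers unreachable pairs.
spatial_bias = {0: 0.0, 1: -1.0, 2: -2.0, -1: -4.0}
uniform_scores = [[0.0] * 3 for _ in range(3)]
weights = attention_with_spatial_bias(uniform_scores, dist, spatial_bias)
```

With uniform raw scores, the bias alone makes each node attend most strongly to itself and progressively less to more distant nodes, which is the inductive bias the spatial encoding is meant to supply.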

3. Graph-Structured Reasoning

  • Reasoning follows AMR graph topology, not linear text positions
  • Intended to reduce logical drift compared to linear Chain-of-Thought
  • Each reasoning step grounded in graph nodes

4. Runtime Privacy Verification

  • Structural invariant checking enforces valid graph schemas
  • Forbidden token detector catches accidental raw entity generation
  • Entropy bound ensures outputs are unpredictable enough to prevent reconstruction
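Two of the runtime checks listed above can be sketched in a few lines. This is an illustrative simplification: a plain set of SHA-256 fingerprints stands in for the Bloom-filter-style detector, the entropy score is ordinary Shannon entropy over an output distribution, and every function name here is hypothetical.

```python
import hashlib
import math


def fingerprint(token):
    """Case-insensitive hash, so the detector never stores raw entities."""
    return hashlib.sha256(token.lower().encode("utf-8")).hexdigest()


def build_forbidden_set(raw_entities):
    """Fingerprint every whitespace-separated token of every raw entity."""
    forbidden = set()
    for entity in raw_entities:
        for token in entity.split():
            forbidden.add(fingerprint(token))
    return forbidden


def leaked_tokens(output_tokens, forbidden):
    """Return generated tokens whose fingerprint matches a raw entity token."""
    return [t for t in output_tokens if fingerprint(t) in forbidden]


def shannon_entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


forbidden = build_forbidden_set(["John Smith", "123-45-6789"])
safe = leaked_tokens(["<PERSON_1>", "owes", "<AMOUNT_1>"], forbidden)
unsafe = leaked_tokens(["John", "owes", "<AMOUNT_1>"], forbidden)
```

A real Bloom filter would trade the exact set for a compact bit array with a small false-positive rate; the lookup logic stays the same.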

5. Differentially Private Training

  • Ghost Clipping DP-Adam during fine-tuning
  • Large batch sizes (1024-2048) with gradient clipping
  • Entity-level DP, not document-level
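The core privatization step behind DP training can be sketched as follows: clip each per-example gradient to a maximum L2 norm, sum, add Gaussian noise scaled to that norm, and average. This is the plain DP-SGD aggregation step, not Ghost Clipping itself; Ghost Clipping obtains the same per-example norms without materializing per-example gradients, which is not shown here. Treating each "example" as one entity's contribution is an assumption made to match the entity-level framing, and the function names are hypothetical.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, max_norm)
        for i in range(dim):
            total[i] += clipped[i]
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

The resulting vector would be handed to the optimizer (Adam, in the setup described above) in place of the ordinary batch gradient.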

Component Comparison

| Dimension | Prior Art | StruCTA |
|---|---|---|
| Input representation | Raw tokens / linearized graphs | Native graph attention on AMR nodes |
| Entity handling | Entity masking (context still visible) | Full abstraction + external vault |
| Structural encoding | Added to node features | Injected into attention scores (Graphormer) |
| Privacy during training | DP-SGD (memory-expensive) | Ghost Clipping + abstract entities |
| Privacy during inference | None / post-hoc filtering | Real-time verification module |
| Reasoning scaffold | Linear Chain-of-Thought | Graph-structured reasoning nodes |
| Position encoding | Text position | Graph centrality + shortest-path distance |
Position encoding Text position Graph centrality + shortest-path distance

Files

  • architecture.md: Full technical specification with pseudocode
  • config.py: Model configuration dataclass
  • encoder.py: PrivacyGraphTransformer with structural encodings
  • decoder.py: Cross-modal reasoning decoder
  • privacy.py: PrivacyVerificationModule with multi-level checks
  • abstraction.py: Entity abstraction and AMR graph pipeline
  • deabstraction.py: De-abstraction to concrete entities
  • model.py: End-to-end StruCTA composition

Theoretical Privacy Guarantees

The architecture provides three complementary privacy guarantees:

  1. By Construction: Raw sensitive tokens never enter the model. The abstraction layer is deterministic and invertible only via the external vault.

  2. (ε, δ)-Differential Privacy: During fine-tuning, Ghost Clipping DP-Adam provides entity-level DP guarantees. Each entity's influence on the model is bounded.

  3. Structural Leakage Bound: The privacy verification module enforces that output entropy on abstract tokens is bounded, preventing reconstruction attacks even with white-box model access.
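For reference, the second guarantee rests on the standard definition of differential privacy: a randomized training mechanism M satisfies (ε, δ)-DP if, for all neighboring datasets D and D' and every set of outputs S, the inequality below holds. For entity-level DP, "neighboring" is assumed here to mean that D and D' differ in all records belonging to a single entity; the text above does not spell out the adjacency relation.

```latex
\Pr\bigl[\mathcal{M}(D) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\mathcal{M}(D') \in S\bigr] + \delta
```

Smaller ε and δ mean that the trained model's distribution changes less when any one entity's data is added or removed.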


Training Pipeline

Stage 1: Graph Pre-Training (Public Data)

  • Data: Silver AMR graphs from Wikipedia
  • Task: Node/edge masking + subgraph recovery
  • No privacy constraints: the goal is to learn structural reasoning
  • Config: AdamW, lr=2e-4, batch=512, 500K steps
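The node-masking part of the Stage 1 objective can be sketched as follows: hide a fraction of node labels and keep the originals as recovery targets. This is a minimal sketch; the mask ratio, the `<MASK>` symbol, and the function name `mask_nodes` are assumptions, since the masking details are not given here, and edge masking and subgraph recovery are not shown.

```python
import random

MASK = "<MASK>"  # assumed mask symbol


def mask_nodes(node_labels, mask_ratio, rng):
    """Replace a random subset of node labels with MASK.

    Returns the corrupted label list and a dict mapping each masked
    position to the label the model should recover.
    """
    n_mask = max(1, round(len(node_labels) * mask_ratio))
    positions = sorted(rng.sample(range(len(node_labels)), n_mask))
    corrupted = list(node_labels)
    targets = {}
    for idx in positions:
        targets[idx] = corrupted[idx]
        corrupted[idx] = MASK
    return corrupted, targets


# Node labels of a small AMR-like graph (illustrative only).
labels = ["earn-01", "<PERSON_1>", "<AMOUNT_1>", "year", "rate-entity-91"]
corrupted, targets = mask_nodes(labels, mask_ratio=0.4, rng=random.Random(0))
```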

Stage 2: Privacy-Aware Fine-Tuning (Domain Data)

  • Data: Domain-specific text with entity annotations
  • Task: Answer generation on abstract structured documents
  • Privacy: Ghost Clipping DP-Adam, ε=3, δ=1e-5
  • Config: lr=5e-4, batch=2048, 50K steps
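The hyperparameters listed for Stages 1 and 2 can be collected into a small configuration object. The `StageConfig` class and its field names are hypothetical and are not taken from `config.py`; only the numeric values come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one training stage's settings."""
    name: str
    optimizer: str
    learning_rate: float
    batch_size: int
    steps: int
    epsilon: Optional[float] = None  # DP budget; None means no DP
    delta: Optional[float] = None

    @property
    def is_private(self):
        return self.epsilon is not None and self.delta is not None

    @property
    def examples_seen(self):
        return self.batch_size * self.steps


STAGE_1 = StageConfig("graph_pretraining", "AdamW", 2e-4, 512, 500_000)
STAGE_2 = StageConfig(
    "privacy_finetuning", "DP-Adam (Ghost Clipping)", 5e-4, 2048, 50_000,
    epsilon=3.0, delta=1e-5,
)
```

Keeping the privacy budget in the same object as the optimizer settings makes it harder to launch the fine-tuning stage without DP by accident.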

Stage 3: Reasoning Fine-Tuning

  • Data: LogiQA, MedQA, or domain reasoning benchmarks
  • Task: Structured-to-structured reasoning
  • Objective: Cross-entropy + structural alignment loss

Usage Example

from structa import StruCTA, StruCTAConfig

# Initialize
config = StruCTAConfig(hidden_dim=768, num_encoder_layers=12)
model = StruCTA(config)

# Privacy-preserving reasoning
result = model.generate_from_text(
    "John Smith (SSN: 123-45-6789) was born on January 15, 1980. "
    "He earns $75,000 per year. What is his annual income?",
    max_length=50
)

print(result["abstract_answer"])
# "<PERSON_1>'s annual income is <AMOUNT_1>"

print(result["concrete_answer"])
# "John Smith's annual income is $75,000"

Citation

Based on research from:

  • Graphormer (Ying et al., NeurIPS 2021): Structural encodings for graph transformers
  • AMRBART (Bai et al., ACL 2022): Graph pre-training for AMR
  • Self-Graph Reasoning (Chen et al., 2025): Graph-structured reasoning for LLMs
  • Ghost Clipping (Li et al., ICLR 2022): DP fine-tuning of large transformers
  • Controlled Generation for Privacy (Zhao et al., 2025): Entity-aware control codes

License

Apache-2.0

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
