ReformulatEE — System Architecture

πŸ—οΈ High-Level Architecture

┌──────────────────────────────────────────────────────────────┐
│              Gradio Web Interface                            │
│   Rate limit: 10 req/min/session • Privacy notice shown      │
└───────────────────────┬──────────────────────────────────────┘
                        │ Portuguese ↔ English
                        ↓
┌──────────────────────────────────────────────────────────────┐
│              Translation Layer (MarianMT)                    │
│         Helsinki-NLP/opus-mt-{ROMANCE-en, en-ROMANCE}        │
│         Local CPU inference — zero cost                      │
└───────────────────────┬──────────────────────────────────────┘
                        │ English Research Question
                        ↓
┌──────────────────────────────────────────────────────────────┐
│           Reformulation Pipeline (Best-of-N)                 │
│                                                              │
│  1. GENERATION (8 parallel candidates)                       │
│     ├─ Backend: ollama        GGUF fine-tuned [local, FREE]  │
│     ├─ Backend: claude        Claude Haiku [HF Space]        │
│     └─ Backend: hf_inference  HF Inference API [free]        │
│                                                              │
│  2. SCORING (Epistemic Effectiveness)                        │
│     ├─ Respondibilidade (BM25 + semantic search, 919 papers) │
│     ├─ Tratabilidade (Ridge classifier, local)               │
│     └─ Não-trivialidade (semantic dissimilarity probe)       │
│                                                              │
│  3. FILTERING (Stage 1)                                      │
│     └─ Keep only: EE(q_cand) > EE(q_orig) + ε                │
│                                                              │
│  4. SELECTION                                                │
│     └─ Return highest-scoring candidate                      │
│                                                              │
└───────────────────────┬──────────────────────────────────────┘
                        │ English Reformulation
                        ↓
┌──────────────────────────────────────────────────────────────┐
│              Translation Layer (MarianMT)                    │
│                      (en → pt)                               │
└───────────────────────┬──────────────────────────────────────┘
                        │ Portuguese Reformulation
                        ↓
┌──────────────────────────────────────────────────────────────┐
│                    Persistence Layer                         │
│   ├─ SQLite (local): historico + cache_tratabilidade         │
│   └─ HF Dataset (cross-session): fmr34/reformulatee-logs     │
│       └─ All queries logged; feedback merged for DPO         │
└──────────────────────────────────────────────────────────────┘

📦 Core Components

1. Generation (src/rl/generate_free.py, src/rl/inference.py)

Produces N candidate reformulations. Backend is selected per environment:

Backends:

| Backend      | Model                          | Speed  | Cost        | Used when           |
|--------------|--------------------------------|--------|-------------|---------------------|
| ollama       | Fine-tuned GGUF (reformulatee) | Fast   | FREE        | Local (recommended) |
| claude       | Claude Haiku                   | Fast   | ~$0.001/req | HF Space (auto)     |
| hf_inference | Qwen/Qwen2.5-1.5B              | Fast   | FREE        | Explicit config     |
| gguf         | GGUF via llama-cpp-python      | Medium | FREE        | Explicit config     |
| local        | DPO fine-tuned PEFT            | Slow   | FREE        | Explicit config     |

Backend selection logic (app.py):

import os

if os.getenv("SPACE_ID"):          # set automatically on Hugging Face Spaces
    INFERENCE_BACKEND = "claude"   # HF Space: always Claude
else:
    INFERENCE_BACKEND = "auto"     # Local: tries Ollama → Claude

All user inputs are wrapped in <question> XML tags before being sent to any model to delimit data from instructions (prompt injection mitigation).
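As a minimal illustration of that mitigation (the helper name `wrap_question` and the escaping policy are assumptions for this sketch, not the repo's actual code):

```python
def wrap_question(user_input: str, max_len: int = 1000) -> str:
    """Delimit untrusted user text from instructions before prompting a model.

    Escaping angle brackets prevents a user from closing the <question>
    block early and smuggling instructions in after it.
    """
    text = user_input[:max_len].replace("<", "&lt;").replace(">", "&gt;")
    return f"<question>{text}</question>"
```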

2. Translation (src/ee/translate_local.py)

Converts pt-BR ↔ en using MarianMT.

Models:

  • Helsinki-NLP/opus-mt-ROMANCE-en (~300 MB, ptβ†’en)
  • Helsinki-NLP/opus-mt-en-ROMANCE + >>pt<< prefix (enβ†’pt)

Cost: FREE (local CPU inference)

Fallback: Claude API if transformers not installed

3. Epistemic Effectiveness Scoring (src/ee/reward.py)

Computes EE(Q) = 0.05·R + 0.05·T + 0.90·NT
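As a sanity check on the weights (they sum to 1.0, so EE stays in [0, 1] when each component does), the formula in code; `ee_score` is an illustrative name, not the repo's API:

```python
# Weights from the EE formula; component scores are assumed to lie in [0, 1].
W_R, W_T, W_NT = 0.05, 0.05, 0.90

def ee_score(resp: float, tract: float, nt: float) -> float:
    """Epistemic Effectiveness as a convex combination of the three probes."""
    return W_R * resp + W_T * tract + W_NT * nt
```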

3a. Respondibilidade (R)

How well-established is the research area?

  • Source: 919 papers (arXiv, Semantic Scholar, PubMed, Nobel Prize corpus)
  • Method: BM25 + cosine similarity re-ranking
  • Fallback: If corpus missing, R = 0 (app warns but continues)
  • Speed: ~200ms
  • Cache: In-memory + SQLite

3b. Tratabilidade (T)

Can we answer this with existing tools?

  • Primary: Ridge(alpha=50.0) on all-MiniLM-L6-v2 embeddings (local, ~22ms, free)
  • Fallback: Claude API with prompt caching if local classifier not trained
  • Cache: In-memory + SQLite cross-session

3c. Não-trivialidade (NT)

Is the reformulation significantly different from the original?

  • Method: Cosine distance between sentence embeddings + semantic classification
  • Speed: ~500ms (with prompt caching)
  • Cache: In-memory + SQLite

4. Stage 1 Filter (src/ee/reward.py)

Rejects candidates that don't improve over baseline.

ε = 0.05  # threshold
passes = EE(candidate) > EE(original) + ε

  • Rejection rate: ~30% at runtime
  • Fallback: If 0 candidates pass, return highest-EE anyway

5. Persistence (src/db/historico.py, src/db/hf_logger.py)

Two-layer persistence strategy:

SQLite (local, ephemeral on HF Space):

CREATE TABLE historico (
  id INTEGER PRIMARY KEY,
  ts TIMESTAMP,
  idioma TEXT,
  pergunta_orig TEXT,           -- original question
  pergunta_en TEXT,             -- English translation
  candidatos JSON,              -- [{"text": "...", "ee": 0.5}, ...]
  melhor TEXT,                  -- best selected (English)
  melhor_pt TEXT,               -- best in Portuguese
  ee_antes FLOAT,               -- EE(original)
  ee_depois FLOAT,              -- EE(best)
  stage1_pass BOOLEAN,          -- passed filtering?
  feedback INTEGER              -- 1=👍, -1=👎, NULL=none
);
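The schema above can be exercised directly with Python's stdlib `sqlite3`; a self-contained sketch against an in-memory database (the app persists to a file):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # illustration only; the app uses a file
conn.execute("""
CREATE TABLE historico (
  id INTEGER PRIMARY KEY, ts TIMESTAMP, idioma TEXT,
  pergunta_orig TEXT, pergunta_en TEXT, candidatos JSON,
  melhor TEXT, melhor_pt TEXT, ee_antes FLOAT, ee_depois FLOAT,
  stage1_pass BOOLEAN, feedback INTEGER
)""")

# Candidates are serialized to JSON text, matching the schema comment above.
conn.execute(
    "INSERT INTO historico (idioma, pergunta_orig, candidatos, ee_antes,"
    " ee_depois, stage1_pass) VALUES (?, ?, ?, ?, ?, ?)",
    ("pt", "O que é a consciência?",
     json.dumps([{"text": "...", "ee": 0.82}]), 0.15, 0.82, True),
)
row = conn.execute("SELECT pergunta_orig, ee_depois FROM historico").fetchone()
```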

HF Dataset (fmr34/reformulatee-logs, cross-session):

  • Every query logged as {"type": "record", ...} via background thread (non-blocking)
  • Feedback logged as {"type": "feedback", "id": ..., "feedback": 1} (urgent flush)
  • ultimas() falls back to HF Dataset when SQLite is empty (e.g. after Space restart)
  • Records validated (type/length/idioma) before display to prevent cache poisoning

Usage:

from src.db.historico import salvar, registrar_feedback, ultimas

record_id = salvar(pergunta_orig, candidatos, melhor, ...)
registrar_feedback(record_id, valor=1)  # 👍
history = ultimas(n=10)  # SQLite → HF Dataset fallback

🔄 Data Flow Example

User Input: "O que é a consciência?"
    ↓
[Translate pt→en via MarianMT]
"What is consciousness?"
    ↓
[Generate 8 candidates via Claude Haiku (HF Space) / Ollama (local)]
Input wrapped: <question>What is consciousness?</question>
{
  "candidates": [
    "What neural signatures predict conscious reports?",
    "How do synchronized neural patterns relate to awareness?",
    ...
  ]
}
    ↓
[Score each candidate via EE scoring]
{
  "candidates": [
    {"text": "...", "ee": 0.82, "resp": 0.7, "tract": 0.6, "nt": 0.85},
    ...
  ]
}
    ↓
[Stage 1 Filter: keep EE > baseline + 0.05]
    ↓
[Select best: max(score)]
    ↓
[Translate en→pt via MarianMT]
"Quais sinais neurais predizem relatórios conscientes?"
    ↓
[Save to SQLite + async log to HF Dataset]
    ↓
[Audit log: {"action": "reformulate", "session": "hash...", "ee_antes": 0.15, "ee_depois": 0.89}]
    ↓
User sees result + 👍/👎 buttons

🧠 Machine Learning Components

Tractability Classifier

Training:

python -m src.classifier.train_tractability --api

  • Trains Ridge regression on curated questions
  • Features: all-MiniLM-L6-v2 sentence embeddings (384-dim)
  • Target: binary labels (0/1) or real scores from Claude API
  • Output: data/models/tractability/classifier.pkl

DPO Fine-tuning

Data preparation:

python -m src.dataset.prepare_dpo

Consolidates DPO pairs from multiple sources (in priority order):

  • dpo_tier3.jsonl β€” adversarial cross-domain pairs (highest quality)
  • dpo_tier2.jsonl β€” adversarial validated pairs
  • dpo_tier1.jsonl β€” curated base pairs
  • batch_pairs.jsonl, batch_domains.jsonl, batch_large.jsonl β€” API-expanded
  • historico.db β€” local user feedback (πŸ‘)
  • HF Dataset (fmr34/reformulatee-logs) β€” online user feedback (πŸ‘)

Training on Colab:

# See notebooks/dpo_finetune_colab.ipynb
# Model: Qwen2.5-1.5B-Instruct
# Method: DPO + LoRA (4-bit QLoRA)
# Cost: FREE (Colab T4)
# Output: uploaded to HF Hub as GGUF

🗄️ Caching Strategy

Three-level cache hierarchy for efficiency:

Level 1: In-Memory Dict
├─ TTL: session lifetime
└─ Speed: O(1)
          ↓
Level 2: SQLite (cross-session, local)
├─ Tables: cache_tratabilidade
├─ TTL: infinite (until manual clear)
└─ Speed: ~5ms
          ↓
Level 3: Claude API (with prompt caching)
├─ Type: ephemeral cache (TTL ~5 min)
├─ Savings: ~70% cost reduction
└─ Speed: ~500ms (first call), cached after

🔒 Security

  • Input sanitization: User input wrapped in <question> tags in all backends (prompt injection mitigation)
  • Rate limiting: 10 requests/min per session (sliding window, in-memory)
  • Audit logging: Structured JSON to stderr β€” action, timestamp, session hash (SHA-256 truncated), EE scores
  • SQLite permissions: chmod 600 applied on every connection
  • HF Dataset records: Validated (type, length ≀ 1000 chars, idioma whitelist) before display
  • Startup validation: ANTHROPIC_API_KEY checked at startup on HF Space (fails fast with clear error)

⚡ Performance Characteristics

| Operation               | Speed  | Cost (HF Space) | Cost (Local)  |
|-------------------------|--------|-----------------|---------------|
| Generate 8 candidates   | ~8s    | Claude API      | FREE (Ollama) |
| Translate pt→en         | ~100ms | FREE            | FREE          |
| Score 8 candidates      | ~2s    | FREE            | FREE          |
| Stage 1 Filter + select | ~50ms  | FREE            | FREE          |
| Translate en→pt         | ~100ms | FREE            | FREE          |
| Total pipeline          | ~10s   | ~$0.001         | $0            |

🚀 Deployment Modes

Local (Zero Cost)

  • Ollama + fine-tuned GGUF model
  • MarianMT for translation
  • Ridge classifier for tractability
  • CPU-only (works on standard laptop)
  • Latency: ~10s per query

HF Space (Public Demo)

  • Claude Haiku for generation (forced when SPACE_ID present)
  • MarianMT loaded on first request (~300 MB download)
  • Questions persisted to HF Dataset (cross-session, cross-user)
  • SQLite ephemeral (resets on restart; HF Dataset used as fallback)

Production Scale

  • Docker container + load balancer
  • PostgreSQL for history (replace SQLite)
  • Redis for caching (replace in-memory dict)
  • Async workers for parallelization