# ReformulatEE – System Architecture

## 🏗️ High-Level Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                     Gradio Web Interface                     │
│    Rate limit: 10 req/min/session • Privacy notice shown     │
└──────────────────────┬───────────────────────────────────────┘
                       │ Portuguese → English
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                 Translation Layer (MarianMT)                 │
│        Helsinki-NLP/opus-mt-{ROMANCE-en, en-ROMANCE}         │
│               Local CPU inference → zero cost                │
└──────────────────────┬───────────────────────────────────────┘
                       │ English Research Question
                       ▼
┌──────────────────────────────────────────────────────────────┐
│              Reformulation Pipeline (Best-of-N)              │
│                                                              │
│  1. GENERATION (8 parallel candidates)                       │
│     ├─ Backend: ollama        GGUF fine-tuned  [local, FREE] │
│     ├─ Backend: claude        Claude Haiku     [HF Space]    │
│     └─ Backend: hf_inference  HF Inference API [free]        │
│                                                              │
│  2. SCORING (Epistemic Effectiveness)                        │
│     ├─ Respondibilidade (BM25 + semantic search, 919 papers) │
│     ├─ Tratabilidade (Ridge classifier, local)               │
│     └─ Não-trivialidade (semantic dissimilarity probe)       │
│                                                              │
│  3. FILTERING (Stage 1)                                      │
│     └─ Keep only: EE(q_cand) > EE(q_orig) + ε                │
│                                                              │
│  4. SELECTION                                                │
│     └─ Return highest-scoring candidate                      │
│                                                              │
└──────────────────────┬───────────────────────────────────────┘
                       │ English Reformulation
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                 Translation Layer (MarianMT)                 │
│                          (en → pt)                           │
└──────────────────────┬───────────────────────────────────────┘
                       │ Portuguese Reformulation
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                      Persistence Layer                       │
│  ├─ SQLite (local): historico + cache_tratabilidade          │
│  ├─ HF Dataset (cross-session): fmr34/reformulatee-logs      │
│  └─ All queries logged; feedback merged for DPO              │
└──────────────────────────────────────────────────────────────┘
```
## 📦 Core Components

### 1. Generation (`src/rl/generate_free.py`, `src/rl/inference.py`)

Produces N candidate reformulations. The backend is selected per environment.
Backends:

| Backend | Model | Speed | Cost | Used when |
|---|---|---|---|---|
| `ollama` | Fine-tuned GGUF (`reformulatee`) | Fast | FREE | Local (recommended) |
| `claude` | Claude Haiku | Fast | ~$0.001/req | HF Space (auto) |
| `hf_inference` | Qwen/Qwen2.5-1.5B | Fast | FREE | Explicit config |
| `gguf` | GGUF via llama-cpp-python | Medium | FREE | Explicit config |
| `local` | DPO fine-tuned PEFT | Slow | FREE | Explicit config |
Backend selection logic (`app.py`):

```python
import os

# HF Spaces set SPACE_ID automatically; its presence forces Claude.
if os.environ.get("SPACE_ID"):
    INFERENCE_BACKEND = "claude"  # HF Space: always Claude
else:
    INFERENCE_BACKEND = "auto"    # Local: tries Ollama → Claude
```
All user inputs are wrapped in `<question>` XML tags before being sent to any model, delimiting data from instructions (prompt injection mitigation).
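A minimal sketch of that wrapping; `wrap_question` is an illustrative helper, not the project's actual function:

```python
def wrap_question(user_input: str) -> str:
    """Delimit user data from model instructions (illustrative helper)."""
    # Escape angle brackets so user text cannot break out of the tag.
    sanitized = user_input.replace("<", "&lt;").replace(">", "&gt;")
    return f"<question>{sanitized}</question>"
```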
### 2. Translation (`src/ee/translate_local.py`)

Converts pt-BR ↔ en using MarianMT.

Models:
- `Helsinki-NLP/opus-mt-ROMANCE-en` (~300 MB, pt→en)
- `Helsinki-NLP/opus-mt-en-ROMANCE` + `>>pt<<` prefix (en→pt)
Cost: FREE (local CPU inference)
Fallback: Claude API if `transformers` is not installed
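A minimal usage sketch of the en→pt direction, assuming `transformers` is installed; the `>>pt<<` token selects Portuguese among the ROMANCE targets:

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-ROMANCE"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

# The target-language prefix picks pt among the ROMANCE languages.
batch = tokenizer([">>pt<< What is consciousness?"], return_tensors="pt")
out = model.generate(**batch)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```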
### 3. Epistemic Effectiveness Scoring (`src/ee/reward.py`)

Computes EE(Q) = 0.05·R + 0.05·T + 0.90·NT
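As code, the weighted sum is a one-liner (the function name `ee` is illustrative):

```python
def ee(r: float, t: float, nt: float) -> float:
    """Epistemic Effectiveness with the weights documented above."""
    return 0.05 * r + 0.05 * t + 0.90 * nt
```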
#### 3a. Respondibilidade (R)
How well-established is the research area?
- Source: 919 papers (arXiv, Semantic Scholar, PubMed, Nobel Prize corpus)
- Method: BM25 + cosine similarity re-ranking
- Fallback: If corpus missing, R = 0 (app warns but continues)
- Speed: ~200ms
- Cache: In-memory + SQLite
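A hedged sketch of the retrieve-then-rerank idea, assuming the `rank_bm25` package, the same MiniLM encoder used elsewhere in the pipeline, and a two-document stand-in for the 919-paper corpus:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Neural correlates of consciousness in the human brain.",
    "BM25 ranking functions for lexical information retrieval.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "what is consciousness"
top = bm25.get_top_n(query.split(), corpus, n=2)   # lexical recall

encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb = encoder.encode(query, convert_to_tensor=True)
d_emb = encoder.encode(top, convert_to_tensor=True)
R = float(util.cos_sim(q_emb, d_emb).max())        # semantic re-rank score
```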
#### 3b. Tratabilidade (T)
Can we answer this with existing tools?
- Primary: Ridge(alpha=50.0) on all-MiniLM-L6-v2 embeddings (local, ~22ms, free)
- Fallback: Claude API with prompt caching if local classifier not trained
- Cache: In-memory + SQLite cross-session
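A minimal sketch of the local scoring path, assuming the pickled Ridge model produced by the training command described under Machine Learning Components below:

```python
import pickle

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
with open("data/models/tractability/classifier.pkl", "rb") as f:
    clf = pickle.load(f)

emb = encoder.encode(["Can fMRI decode visual imagery with current tools?"])
T = float(clf.predict(emb)[0])  # tractability score
```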
#### 3c. Não-trivialidade (NT)
Is the reformulation significantly different from the original?
- Method: Cosine distance between sentence embeddings + semantic classification
- Speed: ~500ms (with prompt caching)
- Cache: In-memory + SQLite
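The distance half of the probe might look like this (the semantic-classification step is omitted; the encoder choice mirrors the rest of the pipeline and is an assumption here):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(
    ["What is consciousness?",
     "What neural signatures predict conscious reports?"],
    convert_to_tensor=True,
)
NT = 1.0 - float(util.cos_sim(emb[0], emb[1]))  # higher = less trivial
```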
### 4. Stage 1 Filter (`src/ee/reward.py`)
Rejects candidates that don't improve over baseline.
```python
EPSILON = 0.05  # improvement threshold ε
passes = ee_candidate > ee_original + EPSILON
```
- Rejection rate: ~30% at runtime
- Fallback: If 0 candidates pass, return highest-EE anyway
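Putting filtering and selection together, a sketch with illustrative names, operating on the candidate dicts shown in the data flow example below:

```python
def select_best(candidates: list[dict], ee_orig: float,
                eps: float = 0.05) -> dict:
    """Stage 1 filter plus selection, with the documented fallback."""
    passing = [c for c in candidates if c["ee"] > ee_orig + eps]
    pool = passing or candidates       # fallback: nothing passed Stage 1
    return max(pool, key=lambda c: c["ee"])
```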
### 5. Persistence (`src/db/historico.py`, `src/db/hf_logger.py`)
Two-layer persistence strategy:
SQLite (local, ephemeral on HF Space):
```sql
CREATE TABLE historico (
    id INTEGER PRIMARY KEY,
    ts TIMESTAMP,
    idioma TEXT,
    pergunta_orig TEXT,     -- original question
    pergunta_en TEXT,       -- English translation
    candidatos JSON,        -- [{"text": "...", "ee": 0.5}, ...]
    melhor TEXT,            -- best selected (English)
    melhor_pt TEXT,         -- best in Portuguese
    ee_antes FLOAT,         -- EE(original)
    ee_depois FLOAT,        -- EE(best)
    stage1_pass BOOLEAN,    -- passed filtering?
    feedback INTEGER        -- 1=👍, -1=👎, NULL=none
);
```
HF Dataset (`fmr34/reformulatee-logs`, cross-session):

- Every query logged as `{"type": "record", ...}` via a background thread (non-blocking)
- Feedback logged as `{"type": "feedback", "id": ..., "feedback": 1}` (urgent flush)
- `ultimas()` falls back to the HF Dataset when SQLite is empty (e.g. after a Space restart)
- Records validated (type/length/idioma) before display to prevent cache poisoning
Usage:

```python
from src.db.historico import salvar, registrar_feedback, ultimas

record_id = salvar(pergunta_orig, candidatos, melhor, ...)
registrar_feedback(record_id, valor=1)  # 👍
history = ultimas(n=10)  # SQLite → HF Dataset fallback
```
## 🔄 Data Flow Example
```
User Input: "O que é a consciência?"
        ↓
[Translate pt→en via MarianMT]
        "What is consciousness?"
        ↓
[Generate 8 candidates via Claude Haiku (HF Space) / Ollama (local)]
        Input wrapped: <question>What is consciousness?</question>
        {
          "candidates": [
            "What neural signatures predict conscious reports?",
            "How do synchronized neural patterns relate to awareness?",
            ...
          ]
        }
        ↓
[Score each candidate via EE scoring]
        {
          "candidates": [
            {"text": "...", "ee": 0.82, "resp": 0.7, "tract": 0.6, "nt": 0.85},
            ...
          ]
        }
        ↓
[Stage 1 Filter: keep EE > baseline + 0.05]
        ↓
[Select best: max(score)]
        ↓
[Translate en→pt via MarianMT]
        "Quais sinais neurais predizem relatórios conscientes?"
        ↓
[Save to SQLite + async log to HF Dataset]
        ↓
[Audit log: {"action": "reformulate", "session": "hash...", "ee_antes": 0.15, "ee_depois": 0.89}]
        ↓
User sees result + 👍/👎 buttons
```
## 🧠 Machine Learning Components
### Tractability Classifier

Training:

```bash
python -m src.classifier.train_tractability --api
```

- Trains Ridge regression on curated questions
- Features: all-MiniLM-L6-v2 sentence embeddings (384-dim)
- Target: binary labels (0/1) or real-valued scores from the Claude API
- Output: `data/models/tractability/classifier.pkl`
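The core of that training step, sketched with a toy two-question dataset (`questions`/`labels` stand in for the curated set, which `--api` scores via Claude):

```python
import pickle
from pathlib import Path

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

questions = ["Can we measure X with existing instruments?",
             "What is the ultimate purpose of the universe?"]
labels = [1.0, 0.0]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(questions)            # (n, 384) embeddings
clf = Ridge(alpha=50.0).fit(X, labels)   # alpha as documented above

out = Path("data/models/tractability/classifier.pkl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("wb") as f:
    pickle.dump(clf, f)
```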
### DPO Fine-tuning

Data preparation:

```bash
python -m src.dataset.prepare_dpo
```

Consolidates DPO pairs from multiple sources (in priority order):

- `dpo_tier3.jsonl` → adversarial cross-domain pairs (highest quality)
- `dpo_tier2.jsonl` → adversarial validated pairs
- `dpo_tier1.jsonl` → curated base pairs
- `batch_pairs.jsonl`, `batch_domains.jsonl`, `batch_large.jsonl` → API-expanded pairs
- `historico.db` → local user feedback (👍)
- HF Dataset (`fmr34/reformulatee-logs`) → online user feedback (👍)
Training on Colab:

```python
# See notebooks/dpo_finetune_colab.ipynb
# Model: Qwen2.5-1.5B-Instruct
# Method: DPO + LoRA (4-bit QLoRA)
# Cost: FREE (Colab T4)
# Output: uploaded to HF Hub as GGUF
```
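A hedged sketch of the DPO step itself, assuming a recent `trl` (argument names changed across versions) and a consolidated pairs file with `prompt`/`chosen`/`rejected` columns; the LoRA/QLoRA configuration from the notebook is omitted for brevity, and the file path is illustrative:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pairs = load_dataset("json", data_files="data/dpo/pairs.jsonl")["train"]

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```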
## 🗄️ Caching Strategy
Three-level cache hierarchy for efficiency:
```
Level 1: In-Memory Dict
├─ TTL: session lifetime
└─ Speed: O(1)
        ▼
Level 2: SQLite (cross-session, local)
├─ Tables: cache_tratabilidade
├─ TTL: infinite (until manual clear)
└─ Speed: ~5ms
        ▼
Level 3: Claude API (with prompt caching)
├─ Type: ephemeral cache (TTL ~5 min)
├─ Savings: ~70% cost reduction
└─ Speed: ~500ms (first call), cached after
```
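A sketch of the look-through order for the tractability score; the `cache_tratabilidade` table name comes from this document, while the column names and the Level 3 stub are assumptions:

```python
import sqlite3

_mem: dict[str, float] = {}  # Level 1: in-memory dict

def _score_via_api(question: str) -> float:
    """Placeholder for the Level 3 Claude call."""
    raise NotImplementedError

def tractability(question: str, conn: sqlite3.Connection) -> float:
    if question in _mem:                                   # Level 1
        return _mem[question]
    row = conn.execute(
        "SELECT score FROM cache_tratabilidade WHERE pergunta = ?",
        (question,),
    ).fetchone()
    if row is not None:                                    # Level 2
        _mem[question] = row[0]
        return row[0]
    score = _score_via_api(question)                       # Level 3
    conn.execute(
        "INSERT INTO cache_tratabilidade (pergunta, score) VALUES (?, ?)",
        (question, score),
    )
    conn.commit()
    _mem[question] = score
    return score
```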
## 🔒 Security
- Input sanitization: user input wrapped in `<question>` tags in all backends (prompt injection mitigation)
- Rate limiting: 10 requests/min per session (sliding window, in-memory; see the sketch after this list)
- Audit logging: structured JSON to stderr → action, timestamp, session hash (SHA-256, truncated), EE scores
- SQLite permissions: `chmod 600` applied on every connection
- HF Dataset records: validated (type, length ≤ 1000 chars, idioma whitelist) before display
- Startup validation: `ANTHROPIC_API_KEY` checked at startup on HF Space (fails fast with a clear error)
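A minimal sliding-window limiter matching the stated policy (10 requests/min per session, in-memory); the real implementation may differ:

```python
import time
from collections import defaultdict, deque

WINDOW_S, MAX_REQ = 60.0, 10
_hits: dict[str, deque] = defaultdict(deque)

def allow(session_id: str) -> bool:
    now = time.monotonic()
    q = _hits[session_id]
    while q and now - q[0] > WINDOW_S:   # drop hits outside the window
        q.popleft()
    if len(q) >= MAX_REQ:
        return False
    q.append(now)
    return True
```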
## ⚡ Performance Characteristics
| Operation | Speed | Cost (HF Space) | Cost (Local) |
|---|---|---|---|
| Generate 8 candidates | ~8s | Claude API | FREE (Ollama) |
| Translate pt→en | ~100ms | FREE | FREE |
| Score 8 candidates | ~2s | FREE | FREE |
| Stage 1 Filter + select | ~50ms | FREE | FREE |
| Translate en→pt | ~100ms | FREE | FREE |
| Total pipeline | ~10s | ~$0.001 | $0 |
## 🚀 Deployment Modes
### Local (Zero Cost)
- Ollama + fine-tuned GGUF model
- MarianMT for translation
- Ridge classifier for tractability
- CPU-only (works on standard laptop)
- Latency: ~10s per query
### HF Space (Public Demo)
- Claude Haiku for generation (forced when SPACE_ID present)
- MarianMT loaded on first request (~300 MB download)
- Questions persisted to HF Dataset (cross-session, cross-user)
- SQLite ephemeral (resets on restart; HF Dataset used as fallback)
### Production Scale
- Docker container + load balancer
- PostgreSQL for history (replace SQLite)
- Redis for caching (replace in-memory dict)
- Async workers for parallelization