# ReformulatEE – System Architecture

## 🏗️ High-Level Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                     Gradio Web Interface                     │
│    Rate limit: 10 req/min/session • Privacy notice shown     │
└──────────────────────┬───────────────────────────────────────┘
                       │ Portuguese → English
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                 Translation Layer (MarianMT)                 │
│        Helsinki-NLP/opus-mt-{ROMANCE-en, en-ROMANCE}         │
│               Local CPU inference → zero cost                │
└──────────────────────┬───────────────────────────────────────┘
                       │ English Research Question
                       ▼
┌──────────────────────────────────────────────────────────────┐
│              Reformulation Pipeline (Best-of-N)              │
│                                                              │
│  1. GENERATION (8 parallel candidates)                       │
│     ├─ Backend: ollama        GGUF fine-tuned  [local, FREE] │
│     ├─ Backend: claude        Claude Haiku     [HF Space]    │
│     └─ Backend: hf_inference  HF Inference API [free]        │
│                                                              │
│  2. SCORING (Epistemic Effectiveness)                        │
│     ├─ Respondibilidade (BM25 + semantic search, 919 papers) │
│     ├─ Tratabilidade (Ridge classifier, local)               │
│     └─ Não-trivialidade (semantic dissimilarity probe)       │
│                                                              │
│  3. FILTERING (Stage 1)                                      │
│     └─ Keep only: EE(q_cand) > EE(q_orig) + ε                │
│                                                              │
│  4. SELECTION                                                │
│     └─ Return highest-scoring candidate                      │
│                                                              │
└──────────────────────┬───────────────────────────────────────┘
                       │ English Reformulation
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                 Translation Layer (MarianMT)                 │
│                          (en → pt)                           │
└──────────────────────┬───────────────────────────────────────┘
                       │ Portuguese Reformulation
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                      Persistence Layer                       │
│  ├─ SQLite (local): historico + cache_tratabilidade          │
│  ├─ HF Dataset (cross-session): fmr34/reformulatee-logs      │
│  └─ All queries logged; feedback merged for DPO              │
└──────────────────────────────────────────────────────────────┘
```
## 📦 Core Components

### 1. Generation (`src/rl/generate_free.py`, `src/rl/inference.py`)

Produces N candidate reformulations. The backend is selected per environment.
Backends:

| Backend | Model | Speed | Cost | Used when |
|---|---|---|---|---|
| `ollama` | Fine-tuned GGUF (`reformulatee`) | Fast | FREE | Local (recommended) |
| `claude` | Claude Haiku | Fast | ~$0.001/req | HF Space (auto) |
| `hf_inference` | Qwen/Qwen2.5-1.5B | Fast | FREE | Explicit config |
| `gguf` | GGUF via llama-cpp-python | Medium | FREE | Explicit config |
| `local` | DPO fine-tuned PEFT | Slow | FREE | Explicit config |
Backend selection logic (`app.py`):

```python
import os

# HF Spaces set SPACE_ID automatically; its presence forces Claude.
if os.environ.get("SPACE_ID"):
    INFERENCE_BACKEND = "claude"  # HF Space: always Claude
else:
    INFERENCE_BACKEND = "auto"    # Local: tries Ollama → Claude
```
All user inputs are wrapped in `<question>` XML tags before being sent to any model, delimiting data from instructions (prompt injection mitigation).
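A minimal sketch of that wrapping; `wrap_question` is an illustrative helper, not the project's actual function:

```python
def wrap_question(user_input: str) -> str:
    """Delimit user data from model instructions (illustrative helper)."""
    # Escape angle brackets so user text cannot break out of the tag.
    sanitized = user_input.replace("<", "&lt;").replace(">", "&gt;")
    return f"<question>{sanitized}</question>"
```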
### 2. Translation (`src/ee/translate_local.py`)

Converts pt-BR ↔ en using MarianMT.

Models:
- `Helsinki-NLP/opus-mt-ROMANCE-en` (~300 MB, pt→en)
- `Helsinki-NLP/opus-mt-en-ROMANCE` + `>>pt<<` prefix (en→pt)
Cost: FREE (local CPU inference)
Fallback: Claude API if `transformers` is not installed
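A minimal usage sketch of the en→pt direction, assuming `transformers` is installed; the `>>pt<<` token selects Portuguese among the ROMANCE targets:

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-ROMANCE"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

# The target-language prefix picks pt among the ROMANCE languages.
batch = tokenizer([">>pt<< What is consciousness?"], return_tensors="pt")
out = model.generate(**batch)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```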
### 3. Epistemic Effectiveness Scoring (`src/ee/reward.py`)

Computes EE(Q) = 0.05·R + 0.05·T + 0.90·NT
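As code, the weighted sum is a one-liner (the function name `ee` is illustrative):

```python
def ee(r: float, t: float, nt: float) -> float:
    """Epistemic Effectiveness with the weights documented above."""
    return 0.05 * r + 0.05 * t + 0.90 * nt
```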
#### 3a. Respondibilidade (R)
How well-established is the research area?
- Source: 919 papers (arXiv, Semantic Scholar, PubMed, Nobel Prize corpus)
- Method: BM25 + cosine similarity re-ranking
- Fallback: If corpus missing, R = 0 (app warns but continues)
- Speed: ~200ms
- Cache: In-memory + SQLite
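A hedged sketch of the retrieve-then-rerank idea, assuming the `rank_bm25` package, the same MiniLM encoder used elsewhere in the pipeline, and a two-document stand-in for the 919-paper corpus:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Neural correlates of consciousness in the human brain.",
    "BM25 ranking functions for lexical information retrieval.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "what is consciousness"
top = bm25.get_top_n(query.split(), corpus, n=2)   # lexical recall

encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb = encoder.encode(query, convert_to_tensor=True)
d_emb = encoder.encode(top, convert_to_tensor=True)
R = float(util.cos_sim(q_emb, d_emb).max())        # semantic re-rank score
```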
#### 3b. Tratabilidade (T)
Can we answer this with existing tools?
- Primary: Ridge(alpha=50.0) on all-MiniLM-L6-v2 embeddings (local, ~22ms, free)
- Fallback: Claude API with prompt caching if local classifier not trained
- Cache: In-memory + SQLite cross-session
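A minimal sketch of the local scoring path, assuming the pickled Ridge model produced by the training command described under Machine Learning Components below:

```python
import pickle

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
with open("data/models/tractability/classifier.pkl", "rb") as f:
    clf = pickle.load(f)

emb = encoder.encode(["Can fMRI decode visual imagery with current tools?"])
T = float(clf.predict(emb)[0])  # tractability score
```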
#### 3c. Não-trivialidade (NT)
Is the reformulation significantly different from the original?
- Method: Cosine distance between sentence embeddings + semantic classification
- Speed: ~500ms (with prompt caching)
- Cache: In-memory + SQLite
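The distance half of the probe might look like this (the semantic-classification step is omitted; the encoder choice mirrors the rest of the pipeline and is an assumption here):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(
    ["What is consciousness?",
     "What neural signatures predict conscious reports?"],
    convert_to_tensor=True,
)
NT = 1.0 - float(util.cos_sim(emb[0], emb[1]))  # higher = less trivial
```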
### 4. Stage 1 Filter (`src/ee/reward.py`)
Rejects candidates that don't improve over baseline.
```python
EPSILON = 0.05  # improvement threshold ε
passes = ee_candidate > ee_original + EPSILON
```
- Rejection rate: ~30% at runtime
- Fallback: If 0 candidates pass, return highest-EE anyway
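Putting filtering and selection together, a sketch with illustrative names, operating on the candidate dicts shown in the data flow example below:

```python
def select_best(candidates: list[dict], ee_orig: float,
                eps: float = 0.05) -> dict:
    """Stage 1 filter plus selection, with the documented fallback."""
    passing = [c for c in candidates if c["ee"] > ee_orig + eps]
    pool = passing or candidates       # fallback: nothing passed Stage 1
    return max(pool, key=lambda c: c["ee"])
```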
### 5. Persistence (`src/db/historico.py`, `src/db/hf_logger.py`)
Two-layer persistence strategy:
SQLite (local, ephemeral on HF Space):
```sql
CREATE TABLE historico (
    id INTEGER PRIMARY KEY,
    ts TIMESTAMP,
    idioma TEXT,
    pergunta_orig TEXT,     -- original question
    pergunta_en TEXT,       -- English translation
    candidatos JSON,        -- [{"text": "...", "ee": 0.5}, ...]
    melhor TEXT,            -- best selected (English)
    melhor_pt TEXT,         -- best in Portuguese
    ee_antes FLOAT,         -- EE(original)
    ee_depois FLOAT,        -- EE(best)
    stage1_pass BOOLEAN,    -- passed filtering?
    feedback INTEGER        -- 1=👍, -1=👎, NULL=none
);
```
HF Dataset (`fmr34/reformulatee-logs`, cross-session):

- Every query logged as `{"type": "record", ...}` via a background thread (non-blocking)
- Feedback logged as `{"type": "feedback", "id": ..., "feedback": 1}` (urgent flush)
- `ultimas()` falls back to the HF Dataset when SQLite is empty (e.g. after a Space restart)
- Records validated (type/length/idioma) before display to prevent cache poisoning
Usage:

```python
from src.db.historico import salvar, registrar_feedback, ultimas

record_id = salvar(pergunta_orig, candidatos, melhor, ...)
registrar_feedback(record_id, valor=1)  # 👍
history = ultimas(n=10)  # SQLite → HF Dataset fallback
```
## 🔄 Data Flow Example
```
User Input: "O que é a consciência?"
        ↓
[Translate pt→en via MarianMT]
        "What is consciousness?"
        ↓
[Generate 8 candidates via Claude Haiku (HF Space) / Ollama (local)]
        Input wrapped: <question>What is consciousness?</question>
        {
          "candidates": [
            "What neural signatures predict conscious reports?",
            "How do synchronized neural patterns relate to awareness?",
            ...
          ]
        }
        ↓
[Score each candidate via EE scoring]
        {
          "candidates": [
            {"text": "...", "ee": 0.82, "resp": 0.7, "tract": 0.6, "nt": 0.85},
            ...
          ]
        }
        ↓
[Stage 1 Filter: keep EE > baseline + 0.05]
        ↓
[Select best: max(score)]
        ↓
[Translate en→pt via MarianMT]
        "Quais sinais neurais predizem relatórios conscientes?"
        ↓
[Save to SQLite + async log to HF Dataset]
        ↓
[Audit log: {"action": "reformulate", "session": "hash...", "ee_antes": 0.15, "ee_depois": 0.89}]
        ↓
User sees result + 👍/👎 buttons
```
## 🧠 Machine Learning Components
### Tractability Classifier

Training:

```bash
python -m src.classifier.train_tractability --api
```

- Trains Ridge regression on curated questions
- Features: all-MiniLM-L6-v2 sentence embeddings (384-dim)
- Target: binary labels (0/1) or real-valued scores from the Claude API
- Output: `data/models/tractability/classifier.pkl`
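The core of that training step, sketched with a toy two-question dataset (`questions`/`labels` stand in for the curated set, which `--api` scores via Claude):

```python
import pickle
from pathlib import Path

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

questions = ["Can we measure X with existing instruments?",
             "What is the ultimate purpose of the universe?"]
labels = [1.0, 0.0]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(questions)            # (n, 384) embeddings
clf = Ridge(alpha=50.0).fit(X, labels)   # alpha as documented above

out = Path("data/models/tractability/classifier.pkl")
out.parent.mkdir(parents=True, exist_ok=True)
with out.open("wb") as f:
    pickle.dump(clf, f)
```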
### DPO Fine-tuning

Data preparation:

```bash
python -m src.dataset.prepare_dpo
```

Consolidates DPO pairs from multiple sources (in priority order):

- `dpo_tier3.jsonl` → adversarial cross-domain pairs (highest quality)
- `dpo_tier2.jsonl` → adversarial validated pairs
- `dpo_tier1.jsonl` → curated base pairs
- `batch_pairs.jsonl`, `batch_domains.jsonl`, `batch_large.jsonl` → API-expanded pairs
- `historico.db` → local user feedback (👍)
- HF Dataset (`fmr34/reformulatee-logs`) → online user feedback (👍)
Training on Colab:

```python
# See notebooks/dpo_finetune_colab.ipynb
# Model: Qwen2.5-1.5B-Instruct
# Method: DPO + LoRA (4-bit QLoRA)
# Cost: FREE (Colab T4)
# Output: uploaded to HF Hub as GGUF
```
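A hedged sketch of the DPO step itself, assuming a recent `trl` (argument names changed across versions) and a consolidated pairs file with `prompt`/`chosen`/`rejected` columns; the LoRA/QLoRA configuration from the notebook is omitted for brevity, and the file path is illustrative:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pairs = load_dataset("json", data_files="data/dpo/pairs.jsonl")["train"]

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```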
## 🗄️ Caching Strategy
Three-level cache hierarchy for efficiency:
```
Level 1: In-Memory Dict
├─ TTL: session lifetime
└─ Speed: O(1)
        ▼
Level 2: SQLite (cross-session, local)
├─ Tables: cache_tratabilidade
├─ TTL: infinite (until manual clear)
└─ Speed: ~5ms
        ▼
Level 3: Claude API (with prompt caching)
├─ Type: ephemeral cache (TTL ~5 min)
├─ Savings: ~70% cost reduction
└─ Speed: ~500ms (first call), cached after
```
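A sketch of the look-through order for the tractability score; the `cache_tratabilidade` table name comes from this document, while the column names and the Level 3 stub are assumptions:

```python
import sqlite3

_mem: dict[str, float] = {}  # Level 1: in-memory dict

def _score_via_api(question: str) -> float:
    """Placeholder for the Level 3 Claude call."""
    raise NotImplementedError

def tractability(question: str, conn: sqlite3.Connection) -> float:
    if question in _mem:                                   # Level 1
        return _mem[question]
    row = conn.execute(
        "SELECT score FROM cache_tratabilidade WHERE pergunta = ?",
        (question,),
    ).fetchone()
    if row is not None:                                    # Level 2
        _mem[question] = row[0]
        return row[0]
    score = _score_via_api(question)                       # Level 3
    conn.execute(
        "INSERT INTO cache_tratabilidade (pergunta, score) VALUES (?, ?)",
        (question, score),
    )
    conn.commit()
    _mem[question] = score
    return score
```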
## 🔒 Security
- Input sanitization: user input wrapped in `<question>` tags in all backends (prompt injection mitigation)
- Rate limiting: 10 requests/min per session (sliding window, in-memory; see the sketch after this list)
- Audit logging: structured JSON to stderr → action, timestamp, session hash (SHA-256, truncated), EE scores
- SQLite permissions: `chmod 600` applied on every connection
- HF Dataset records: validated (type, length ≤ 1000 chars, idioma whitelist) before display
- Startup validation: `ANTHROPIC_API_KEY` checked at startup on HF Space (fails fast with a clear error)
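A minimal sliding-window limiter matching the stated policy (10 requests/min per session, in-memory); the real implementation may differ:

```python
import time
from collections import defaultdict, deque

WINDOW_S, MAX_REQ = 60.0, 10
_hits: dict[str, deque] = defaultdict(deque)

def allow(session_id: str) -> bool:
    now = time.monotonic()
    q = _hits[session_id]
    while q and now - q[0] > WINDOW_S:   # drop hits outside the window
        q.popleft()
    if len(q) >= MAX_REQ:
        return False
    q.append(now)
    return True
```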
## ⚡ Performance Characteristics
| Operation | Speed | Cost (HF Space) | Cost (Local) |
|---|---|---|---|
| Generate 8 candidates | ~8s | Claude API | FREE (Ollama) |
| Translate pt→en | ~100ms | FREE | FREE |
| Score 8 candidates | ~2s | FREE | FREE |
| Stage 1 Filter + select | ~50ms | FREE | FREE |
| Translate en→pt | ~100ms | FREE | FREE |
| Total pipeline | ~10s | ~$0.001 | $0 |
## 🚀 Deployment Modes
### Local (Zero Cost)
- Ollama + fine-tuned GGUF model
- MarianMT for translation
- Ridge classifier for tractability
- CPU-only (works on standard laptop)
- Latency: ~10s per query
### HF Space (Public Demo)
- Claude Haiku for generation (forced when SPACE_ID present)
- MarianMT loaded on first request (~300 MB download)
- Questions persisted to HF Dataset (cross-session, cross-user)
- SQLite ephemeral (resets on restart; HF Dataset used as fallback)
### Production Scale
- Docker container + load balancer
- PostgreSQL for history (replace SQLite)
- Redis for caching (replace in-memory dict)
- Async workers for parallelization