# ReformulatEE — System Architecture
## 🏗️ High-Level Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                     Gradio Web Interface                     │
│    Rate limit: 10 req/min/session • Privacy notice shown     │
└──────────────────────────────┬───────────────────────────────┘
                               │ Portuguese ↔ English
                               ↓
┌──────────────────────────────────────────────────────────────┐
│                 Translation Layer (MarianMT)                 │
│        Helsinki-NLP/opus-mt-{ROMANCE-en, en-ROMANCE}         │
│               Local CPU inference — zero cost                │
└──────────────────────────────┬───────────────────────────────┘
                               │ English Research Question
                               ↓
┌──────────────────────────────────────────────────────────────┐
│              Reformulation Pipeline (Best-of-N)              │
│                                                              │
│  1. GENERATION (8 parallel candidates)                       │
│     ├─ Backend: ollama        GGUF fine-tuned   [local, FREE]│
│     ├─ Backend: claude        Claude Haiku      [HF Space]   │
│     └─ Backend: hf_inference  HF Inference API  [free]       │
│                                                              │
│  2. SCORING (Epistemic Effectiveness)                        │
│     ├─ Respondibilidade (BM25 + semantic search, 919 papers) │
│     ├─ Tratabilidade (Ridge classifier, local)               │
│     └─ Não-trivialidade (semantic dissimilarity probe)       │
│                                                              │
│  3. FILTERING (Stage 1)                                      │
│     └─ Keep only: EE(q_cand) > EE(q_bad) + ε                 │
│                                                              │
│  4. SELECTION                                                │
│     └─ Return highest-scoring candidate                      │
│                                                              │
└──────────────────────────────┬───────────────────────────────┘
                               │ English Reformulation
                               ↓
┌──────────────────────────────────────────────────────────────┐
│                 Translation Layer (MarianMT)                 │
│                          (en → pt)                           │
└──────────────────────────────┬───────────────────────────────┘
                               │ Portuguese Reformulation
                               ↓
┌──────────────────────────────────────────────────────────────┐
│                      Persistence Layer                       │
│  ├─ SQLite (local): historico + cache_tratabilidade          │
│  └─ HF Dataset (cross-session): fmr34/reformulatee-logs      │
│       └─ All queries logged; feedback merged for DPO         │
└──────────────────────────────────────────────────────────────┘
```
## 📦 Core Components
### 1. Generation (`src/rl/generate_free.py`, `src/rl/inference.py`)
Produces N candidate reformulations. Backend is selected per environment:
**Backends:**
| Backend | Model | Speed | Cost | Used when |
|---------|-------|-------|------|-----------|
| `ollama` | Fine-tuned GGUF (reformulatee) | Fast | FREE | Local (recommended) |
| `claude` | Claude Haiku | Fast | ~$0.001/req | HF Space (auto) |
| `hf_inference` | Qwen/Qwen2.5-1.5B | Fast | FREE | Explicit config |
| `gguf` | GGUF via llama-cpp-python | Medium | FREE | Explicit config |
| `local` | DPO fine-tuned PEFT | Slow | FREE | Explicit config |
**Backend selection logic (`app.py`):**
```python
import os

# HF Spaces set the SPACE_ID environment variable
if os.environ.get("SPACE_ID"):
    INFERENCE_BACKEND = "claude"  # HF Space: always Claude
else:
    INFERENCE_BACKEND = "auto"    # Local: tries Ollama, then Claude
```
All user inputs are wrapped in `<question>` XML tags before being sent to any model to delimit data from instructions (prompt injection mitigation).
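A minimal sketch of this wrapping (the helper name and the tag-stripping step are illustrative, not the project's actual code):

```python
def wrap_question(user_input: str) -> str:
    """Wrap user data in <question> tags so the model can distinguish
    data from instructions (prompt injection mitigation)."""
    # strip any tags the user typed so they cannot close the wrapper early
    cleaned = user_input.replace("<question>", "").replace("</question>", "")
    return f"<question>{cleaned}</question>"
```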
### 2. Translation (`src/ee/translate_local.py`)
Converts Brazilian Portuguese (pt-BR) ↔ English using MarianMT.
**Models:**
- `Helsinki-NLP/opus-mt-ROMANCE-en` (~300 MB, pt→en)
- `Helsinki-NLP/opus-mt-en-ROMANCE` + `>>pt<<` prefix (en→pt)
**Cost:** FREE (local CPU inference)
**Fallback:** Claude API if `transformers` not installed
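The `>>pt<<` prefix is simply prepended to the source text before tokenization; a small illustrative helper (name hypothetical):

```python
def marian_source(text: str, target_lang: str = "pt") -> str:
    """Build the source string for the multilingual en->ROMANCE model:
    the >>xx<< token tells the model which target language to emit."""
    return f">>{target_lang}<< {text}"

# The returned string is then tokenized and fed to
# Helsinki-NLP/opus-mt-en-ROMANCE (e.g. via transformers' MarianMTModel);
# the pt->en direction needs no prefix.
```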
### 3. Epistemic Effectiveness Scoring (`src/ee/reward.py`)
Computes `EE(Q) = 0.05·R + 0.05·T + 0.90·NT`
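As a sketch, the weighted sum reads (function name hypothetical; the three component scores are computed as described in 3a-3c below):

```python
def ee(resp: float, tract: float, nt: float) -> float:
    """Epistemic Effectiveness: weighted sum of the three components.
    Non-triviality dominates with a 0.90 weight."""
    return 0.05 * resp + 0.05 * tract + 0.90 * nt
```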
#### 3a. Respondibilidade (R)
How well-established is the research area?
- **Source:** 919 papers (arXiv, Semantic Scholar, PubMed, Nobel Prize corpus)
- **Method:** BM25 + cosine similarity re-ranking
- **Fallback:** If corpus missing, R = 0 (app warns but continues)
- **Speed:** ~200ms
- **Cache:** In-memory + SQLite
#### 3b. Tratabilidade (T)
Can we answer this with existing tools?
- **Primary:** Ridge(alpha=50.0) on all-MiniLM-L6-v2 embeddings (local, ~22ms, free)
- **Fallback:** Claude API with prompt caching if local classifier not trained
- **Cache:** In-memory + SQLite cross-session
#### 3c. Não-trivialidade (NT)
Is the reformulation significantly different from the original?
- **Method:** Cosine distance between sentence embeddings + semantic classification
- **Speed:** ~500ms (with prompt caching)
- **Cache:** In-memory + SQLite
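A minimal sketch of the distance used here (pure Python, illustrative; the real probe runs on sentence-transformer embeddings):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity between two sentence embeddings;
    a larger distance means the reformulation diverges more
    from the original question (less trivial)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```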
### 4. Stage 1 Filter (`src/ee/reward.py`)
Rejects candidates that don't improve over baseline.
```python
ε = 0.05  # threshold
passes = EE(candidate) > EE(original) + ε
```
- **Rejection rate:** ~30% at runtime
- **Fallback:** If 0 candidates pass, return highest-EE anyway
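Combining the filter with the fallback, the selection step can be sketched as (function and field names illustrative):

```python
def stage1_select(candidates: list[dict], ee_original: float,
                  epsilon: float = 0.05) -> dict:
    """Keep candidates that beat the baseline by at least epsilon;
    if none pass, still return the highest-EE candidate."""
    passing = [c for c in candidates if c["ee"] > ee_original + epsilon]
    pool = passing or candidates  # fallback when 0 candidates pass
    return max(pool, key=lambda c: c["ee"])
```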
### 5. Persistence (`src/db/historico.py`, `src/db/hf_logger.py`)
Two-layer persistence strategy:
**SQLite (local, ephemeral on HF Space):**
```sql
CREATE TABLE historico (
    id            INTEGER PRIMARY KEY,
    ts            TIMESTAMP,
    idioma        TEXT,
    pergunta_orig TEXT,     -- original question
    pergunta_en   TEXT,     -- English translation
    candidatos    JSON,     -- [{"text": "...", "ee": 0.5}, ...]
    melhor        TEXT,     -- best selected (English)
    melhor_pt     TEXT,     -- best in Portuguese
    ee_antes      FLOAT,    -- EE(original)
    ee_depois     FLOAT,    -- EE(best)
    stage1_pass   BOOLEAN,  -- passed filtering?
    feedback      INTEGER   -- 1=👍, -1=👎, NULL=none
);
```
**HF Dataset (`fmr34/reformulatee-logs`, cross-session):**
- Every query logged as `{"type": "record", ...}` via background thread (non-blocking)
- Feedback logged as `{"type": "feedback", "id": ..., "feedback": 1}` (urgent flush)
- `ultimas()` falls back to HF Dataset when SQLite is empty (e.g. after Space restart)
- Records validated (type/length/idioma) before display to prevent cache poisoning
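The non-blocking background-thread pattern can be sketched as follows (names illustrative; the real logger appends to the HF Dataset where this sketch merely serializes):

```python
import json
import queue
import threading

log_queue: queue.Queue = queue.Queue()

def _drain() -> None:
    while True:
        record = log_queue.get()
        # the real worker appends to fmr34/reformulatee-logs here;
        # serializing stands in for the network call
        json.dumps(record)
        log_queue.task_done()

threading.Thread(target=_drain, daemon=True).start()

def log_record(payload: dict) -> None:
    """Enqueue and return immediately; the request path never blocks."""
    log_queue.put({"type": "record", **payload})
```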
**Usage:**
```python
from src.db.historico import salvar, registrar_feedback, ultimas
record_id = salvar(pergunta_orig, candidatos, melhor, ...)
registrar_feedback(record_id, valor=1)  # 👍
history = ultimas(n=10) # SQLite β†’ HF Dataset fallback
```
## 🔄 Data Flow Example
```
User Input: "O que é a consciência?"
↓
[Translate pt→en via MarianMT]
"What is consciousness?"
↓
[Generate 8 candidates via Claude Haiku (HF Space) / Ollama (local)]
Input wrapped: <question>What is consciousness?</question>
{
"candidates": [
"What neural signatures predict conscious reports?",
"How do synchronized neural patterns relate to awareness?",
...
]
}
↓
[Score each candidate via EE scoring]
{
"candidates": [
{"text": "...", "ee": 0.83, "resp": 0.7, "tract": 0.6, "nt": 0.85},
...
]
}
↓
[Stage 1 Filter: keep EE > baseline + 0.05]
↓
[Select best: max(score)]
↓
[Translate en→pt via MarianMT]
"Quais sinais neurais predizem relatórios conscientes?"
↓
[Save to SQLite + async log to HF Dataset]
↓
[Audit log: {"action": "reformulate", "session": "hash...", "ee_antes": 0.15, "ee_depois": 0.89}]
↓
User sees result + 👍/👎 buttons
```
## 🧠 Machine Learning Components
### Tractability Classifier
**Training:**
```bash
python -m src.classifier.train_tractability --api
```
- Trains Ridge regression on curated questions
- Features: all-MiniLM-L6-v2 sentence embeddings (384-dim)
- Target: binary labels (0/1) or real scores from Claude API
- Output: `data/models/tractability/classifier.pkl`
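A hedged sketch of the training step, with random vectors standing in for the MiniLM embeddings and labels (requires scikit-learn; this is not the project's actual training script):

```python
import numpy as np
from sklearn.linear_model import Ridge

# stand-in data: in the real pipeline X holds all-MiniLM-L6-v2
# sentence embeddings (384-dim) and y holds tractability labels/scores
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 384))
y = (X[:, 0] > 0).astype(float)

clf = Ridge(alpha=50.0).fit(X, y)
scores = clf.predict(X)  # real-valued tractability estimates
```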
### DPO Fine-tuning
**Data preparation:**
```bash
python -m src.dataset.prepare_dpo
```
Consolidates DPO pairs from multiple sources (in priority order):
- `dpo_tier3.jsonl` — adversarial cross-domain pairs (highest quality)
- `dpo_tier2.jsonl` — adversarial validated pairs
- `dpo_tier1.jsonl` — curated base pairs
- `batch_pairs.jsonl`, `batch_domains.jsonl`, `batch_large.jsonl` — API-expanded
- `historico.db` — local user feedback (👍)
- HF Dataset (`fmr34/reformulatee-logs`) — online user feedback (👍)
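The priority-ordered consolidation can be sketched as follows (file list truncated to the tiered sources; the dedup key `prompt` is an assumed schema detail):

```python
import json
from pathlib import Path

# tiered files, highest priority first (mirrors the list above)
TIERED = ["dpo_tier3.jsonl", "dpo_tier2.jsonl", "dpo_tier1.jsonl"]

def consolidate(data_dir: Path) -> list[dict]:
    """Merge DPO pairs across tiers; earlier (higher-quality) tiers
    win when the same prompt appears more than once."""
    seen: set[str] = set()
    pairs: list[dict] = []
    for name in TIERED:
        path = data_dir / name
        if not path.exists():
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            pair = json.loads(line)
            if pair["prompt"] not in seen:
                seen.add(pair["prompt"])
                pairs.append(pair)
    return pairs
```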
**Training on Colab:**
```bash
# See notebooks/dpo_finetune_colab.ipynb
# Model: Qwen2.5-1.5B-Instruct
# Method: DPO + LoRA (4-bit QLoRA)
# Cost: FREE (Colab T4)
# Output: uploaded to HF Hub as GGUF
```
## 🗄️ Caching Strategy
Three-level cache hierarchy for efficiency:
```
Level 1: In-Memory Dict
  ├─ TTL: session lifetime
  └─ Speed: O(1)
      ↓
Level 2: SQLite (cross-session, local)
  ├─ Tables: cache_tratabilidade
  ├─ TTL: infinite (until manual clear)
  └─ Speed: ~5ms
      ↓
Level 3: Claude API (with prompt caching)
  ├─ Type: ephemeral cache (TTL ~5 min)
  ├─ Savings: ~70% cost reduction
  └─ Speed: ~500ms (first call), cached after
```
## 🔒 Security
- **Input sanitization:** User input wrapped in `<question>` tags in all backends (prompt injection mitigation)
- **Rate limiting:** 10 requests/min per session (sliding window, in-memory)
- **Audit logging:** Structured JSON to stderr — action, timestamp, session hash (SHA-256 truncated), EE scores
- **SQLite permissions:** chmod 600 applied on every connection
- **HF Dataset records:** Validated (type, length ≀ 1000 chars, idioma whitelist) before display
- **Startup validation:** ANTHROPIC_API_KEY checked at startup on HF Space (fails fast with clear error)
## ⚡ Performance Characteristics
| Operation | Speed | Cost (HF Space) | Cost (Local) |
|-----------|-------|-----------------|--------------|
| Generate 8 candidates | ~8s | Claude API | FREE (Ollama) |
| Translate pt→en | ~100ms | FREE | FREE |
| Score 8 candidates | ~2s | FREE | FREE |
| Stage 1 Filter + select | ~50ms | FREE | FREE |
| Translate en→pt | ~100ms | FREE | FREE |
| **Total pipeline** | **~10s** | **~$0.001** | **$0** |
## 🚀 Deployment Modes
### Local (Zero Cost)
- Ollama + fine-tuned GGUF model
- MarianMT for translation
- Ridge classifier for tractability
- CPU-only (works on standard laptop)
- Latency: ~10s per query
### HF Space (Public Demo)
- Claude Haiku for generation (forced when SPACE_ID present)
- MarianMT loaded on first request (~300 MB download)
- Questions persisted to HF Dataset (cross-session, cross-user)
- SQLite ephemeral (resets on restart; HF Dataset used as fallback)
### Production Scale
- Docker container + load balancer
- PostgreSQL for history (replace SQLite)
- Redis for caching (replace in-memory dict)
- Async workers for parallelization