Spaces:
Running
Running
| # ReformulatEE β System Architecture | |
| ## ποΈ High-Level Architecture | |
| ``` | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β Gradio Web Interface β | |
| β Rate limit: 10 req/min/session β’ Privacy notice shown β | |
| ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββ | |
| β Portuguese β English | |
| β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β Translation Layer (MarianMT) β | |
| β Helsinki-NLP/opus-mt-{ROMANCE-en, en-ROMANCE} β | |
| β Local CPU inference β zero cost β | |
| ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββ | |
| β English Research Question | |
| β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β Reformulation Pipeline (Best-of-N) β | |
| β β | |
| β 1. GENERATION (8 parallel candidates) β | |
| β ββ Backend: ollama GGUF fine-tuned [local, FREE] β | |
| β ββ Backend: claude Claude Haiku [HF Space] β | |
| β ββ Backend: hf_inference HF Inference API [free] β | |
| β β | |
| β 2. SCORING (Epistemic Effectiveness) β | |
| β ββ Respondibilidade (BM25 + semantic search, 919 papers)β | |
| β ββ Tratabilidade (Ridge classifier, local) β | |
| β ββ NΓ£o-trivialidade (semantic dissimilarity probe) β | |
| β β | |
| β 3. FILTERING (Stage 1) β | |
| β ββ Keep only: EE(q_cand) > EE(q_bad) + Ξ΅ β | |
| β β | |
| β 4. SELECTION β | |
| β ββ Return highest-scoring candidate β | |
| β β | |
| ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββ | |
| β English Reformulation | |
| β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β Translation Layer (MarianMT) β | |
| β (en β pt) β | |
| ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββ | |
| β Portuguese Reformulation | |
| β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β Persistence Layer β | |
| β ββ SQLite (local): historico + cache_tratabilidade β | |
| β ββ HF Dataset (cross-session): fmr34/reformulatee-logs β | |
| β ββ All queries logged; feedback merged for DPO β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| ## π¦ Core Components | |
| ### 1. Generation (`src/rl/generate_free.py`, `src/rl/inference.py`) | |
| Produces N candidate reformulations. Backend is selected per environment: | |
| **Backends:** | |
| | Backend | Model | Speed | Cost | Used when | | |
| |---------|-------|-------|------|-----------| | |
| | `ollama` | Fine-tuned GGUF (reformulatee) | Fast | FREE | Local (recommended) | | |
| | `claude` | Claude Haiku | Fast | ~$0.001/req | HF Space (auto) | | |
| | `hf_inference` | Qwen/Qwen2.5-1.5B | Fast | FREE | Explicit config | | |
| | `gguf` | GGUF via llama-cpp-python | Medium | FREE | Explicit config | | |
| | `local` | DPO fine-tuned PEFT | Slow | FREE | Explicit config | | |
| **Backend selection logic (`app.py`):** | |
| ```python | |
| if SPACE_ID env var: | |
| INFERENCE_BACKEND = "claude" # HF Space: always Claude | |
| else: | |
| INFERENCE_BACKEND = "auto" # Local: tries Ollama β Claude | |
| ``` | |
| All user inputs are wrapped in `<question>` XML tags before being sent to any model to delimit data from instructions (prompt injection mitigation). | |
| ### 2. Translation (`src/ee/translate_local.py`) | |
| Converts pt-br β en using MarianMT. | |
| **Models:** | |
| - `Helsinki-NLP/opus-mt-ROMANCE-en` (~300 MB, ptβen) | |
| - `Helsinki-NLP/opus-mt-en-ROMANCE` + `>>pt<<` prefix (enβpt) | |
| **Cost:** FREE (local CPU inference) | |
| **Fallback:** Claude API if `transformers` not installed | |
| ### 3. Epistemic Effectiveness Scoring (`src/ee/reward.py`) | |
| Computes `EE(Q) = 0.05Β·R + 0.05Β·T + 0.90Β·NT` | |
| #### 3a. Respondibilidade (R) | |
| How well-established is the research area? | |
| - **Source:** 919 papers (arXiv, Semantic Scholar, PubMed, Nobel Prize corpus) | |
| - **Method:** BM25 + cosine similarity re-ranking | |
| - **Fallback:** If corpus missing, R = 0 (app warns but continues) | |
| - **Speed:** ~200ms | |
| - **Cache:** In-memory + SQLite | |
| #### 3b. Tratabilidade (T) | |
| Can we answer this with existing tools? | |
| - **Primary:** Ridge(alpha=50.0) on all-MiniLM-L6-v2 embeddings (local, ~22ms, free) | |
| - **Fallback:** Claude API with prompt caching if local classifier not trained | |
| - **Cache:** In-memory + SQLite cross-session | |
| #### 3c. NΓ£o-trivialidade (NT) | |
| Is the reformulation significantly different from the original? | |
| - **Method:** Cosine distance between sentence embeddings + semantic classification | |
| - **Speed:** ~500ms (with prompt caching) | |
| - **Cache:** In-memory + SQLite | |
| ### 4. Stage 1 Filter (`src/ee/reward.py`) | |
| Rejects candidates that don't improve over baseline. | |
| ```python | |
| Ξ΅ = 0.05 # threshold | |
| passes = EE(candidate) > EE(original) + Ξ΅ | |
| ``` | |
| - **Rejection rate:** ~30% at runtime | |
| - **Fallback:** If 0 candidates pass, return highest-EE anyway | |
| ### 5. Persistence (`src/db/historico.py`, `src/db/hf_logger.py`) | |
| Two-layer persistence strategy: | |
| **SQLite (local, ephemeral on HF Space):** | |
| ```sql | |
| CREATE TABLE historico ( | |
| id INTEGER PRIMARY KEY, | |
| ts TIMESTAMP, | |
| idioma TEXT, | |
| pergunta_orig TEXT, -- original question | |
| pergunta_en TEXT, -- English translation | |
| candidatos JSON, -- [{"text": "...", "ee": 0.5}, ...] | |
| melhor TEXT, -- best selected (English) | |
| melhor_pt TEXT, -- best in Portuguese | |
| ee_antes FLOAT, -- EE(original) | |
| ee_depois FLOAT, -- EE(best) | |
| stage1_pass BOOLEAN, -- passed filtering? | |
| feedback INTEGER -- 1=π, -1=π, NULL=none | |
| ); | |
| ``` | |
| **HF Dataset (`fmr34/reformulatee-logs`, cross-session):** | |
| - Every query logged as `{"type": "record", ...}` via background thread (non-blocking) | |
| - Feedback logged as `{"type": "feedback", "id": ..., "feedback": 1}` (urgent flush) | |
| - `ultimas()` falls back to HF Dataset when SQLite is empty (e.g. after Space restart) | |
| - Records validated (type/length/idioma) before display to prevent cache poisoning | |
| **Usage:** | |
| ```python | |
| from src.db.historico import salvar, registrar_feedback, ultimas | |
| record_id = salvar(pergunta_orig, candidatos, melhor, ...) | |
| registrar_feedback(record_id, valor=1) # π | |
| history = ultimas(n=10) # SQLite β HF Dataset fallback | |
| ``` | |
| ## π Data Flow Example | |
| ``` | |
| User Input: "O que Γ© a consciΓͺncia?" | |
| β | |
| [Translate ptβen via MarianMT] | |
| "What is consciousness?" | |
| β | |
| [Generate 8 candidates via Claude Haiku (HF Space) / Ollama (local)] | |
| Input wrapped: <question>What is consciousness?</question> | |
| { | |
| "candidates": [ | |
| "What neural signatures predict conscious reports?", | |
| "How do synchronized neural patterns relate to awareness?", | |
| ... | |
| ] | |
| } | |
| β | |
| [Score each candidate via EE scoring] | |
| { | |
| "candidates": [ | |
| {"text": "...", "ee": 0.82, "resp": 0.7, "tract": 0.6, "nt": 0.85}, | |
| ... | |
| ] | |
| } | |
| β | |
| [Stage 1 Filter: keep EE > baseline + 0.05] | |
| β | |
| [Select best: max(score)] | |
| β | |
| [Translate enβpt via MarianMT] | |
| "Quais sinais neurais predizem relatΓ³rios conscientes?" | |
| β | |
| [Save to SQLite + async log to HF Dataset] | |
| β | |
| [Audit log: {"action": "reformulate", "session": "hash...", "ee_antes": 0.15, "ee_depois": 0.89}] | |
| β | |
| User sees result + π/π buttons | |
| ``` | |
| ## π§ Machine Learning Components | |
| ### Tractability Classifier | |
| **Training:** | |
| ```bash | |
| python -m src.classifier.train_tractability --api | |
| ``` | |
| - Trains Ridge regression on curated questions | |
| - Features: all-MiniLM-L6-v2 sentence embeddings (384-dim) | |
| - Target: binary labels (0/1) or real scores from Claude API | |
| - Output: `data/models/tractability/classifier.pkl` | |
| ### DPO Fine-tuning | |
| **Data preparation:** | |
| ```bash | |
| python -m src.dataset.prepare_dpo | |
| ``` | |
| Consolidates DPO pairs from multiple sources (in priority order): | |
| - `dpo_tier3.jsonl` β adversarial cross-domain pairs (highest quality) | |
| - `dpo_tier2.jsonl` β adversarial validated pairs | |
| - `dpo_tier1.jsonl` β curated base pairs | |
| - `batch_pairs.jsonl`, `batch_domains.jsonl`, `batch_large.jsonl` β API-expanded | |
| - `historico.db` β local user feedback (π) | |
| - HF Dataset (`fmr34/reformulatee-logs`) β online user feedback (π) | |
| **Training on Colab:** | |
| ```bash | |
| # See notebooks/dpo_finetune_colab.ipynb | |
| # Model: Qwen2.5-1.5B-Instruct | |
| # Method: DPO + LoRA (4-bit QLoRA) | |
| # Cost: FREE (Colab T4) | |
| # Output: uploaded to HF Hub as GGUF | |
| ``` | |
| ## ποΈ Caching Strategy | |
| Three-level cache hierarchy for efficiency: | |
| ``` | |
| Level 1: In-Memory Dict | |
| ββ TTL: session lifetime | |
| ββ Speed: O(1) | |
| β | |
| Level 2: SQLite (cross-session, local) | |
| ββ Tables: cache_tratabilidade | |
| ββ TTL: infinite (until manual clear) | |
| ββ Speed: ~5ms | |
| β | |
| Level 3: Claude API (with prompt caching) | |
| ββ Type: ephemeral cache (TTL ~5 min) | |
| ββ Savings: ~70% cost reduction | |
| ββ Speed: ~500ms (first call), cached after | |
| ``` | |
| ## π Security | |
| - **Input sanitization:** User input wrapped in `<question>` tags in all backends (prompt injection mitigation) | |
| - **Rate limiting:** 10 requests/min per session (sliding window, in-memory) | |
| - **Audit logging:** Structured JSON to stderr β action, timestamp, session hash (SHA-256 truncated), EE scores | |
| - **SQLite permissions:** chmod 600 applied on every connection | |
| - **HF Dataset records:** Validated (type, length β€ 1000 chars, idioma whitelist) before display | |
| - **Startup validation:** ANTHROPIC_API_KEY checked at startup on HF Space (fails fast with clear error) | |
| ## β‘ Performance Characteristics | |
| | Operation | Speed | Cost (HF Space) | Cost (Local) | | |
| |-----------|-------|-----------------|--------------| | |
| | Generate 8 candidates | ~8s | Claude API | FREE (Ollama) | | |
| | Translate ptβen | ~100ms | FREE | FREE | | |
| | Score 8 candidates | ~2s | FREE | FREE | | |
| | Stage 1 Filter + select | ~50ms | FREE | FREE | | |
| | Translate enβpt | ~100ms | FREE | FREE | | |
| | **Total pipeline** | **~10s** | **~$0.001** | **$0** | | |
| ## π Deployment Modes | |
| ### Local (Zero Cost) | |
| - Ollama + fine-tuned GGUF model | |
| - MarianMT for translation | |
| - Ridge classifier for tractability | |
| - CPU-only (works on standard laptop) | |
| - Latency: ~10s per query | |
| ### HF Space (Public Demo) | |
| - Claude Haiku for generation (forced when SPACE_ID present) | |
| - MarianMT loaded on first request (~300 MB download) | |
| - Questions persisted to HF Dataset (cross-session, cross-user) | |
| - SQLite ephemeral (resets on restart; HF Dataset used as fallback) | |
| ### Production Scale | |
| - Docker container + load balancer | |
| - PostgreSQL for history (replace SQLite) | |
| - Redis for caching (replace in-memory dict) | |
| - Async workers for parallelization | |