# ReformulatEE: System Architecture

## 🏗️ High-Level Architecture

```
┌──────────────────────────────────────────────────────────────┐
│              Gradio Web Interface                            │
│   Rate limit: 10 req/min/session • Privacy notice shown      │
└──────────────────────┬───────────────────────────────────────┘
                       │ Portuguese ↔ English
                       ↓
┌──────────────────────────────────────────────────────────────┐
│              Translation Layer (MarianMT)                    │
│         Helsinki-NLP/opus-mt-{ROMANCE-en, en-ROMANCE}        │
│         Local CPU inference, zero cost                       │
└──────────────────────┬───────────────────────────────────────┘
                       │ English Research Question
                       ↓
┌──────────────────────────────────────────────────────────────┐
│           Reformulation Pipeline (Best-of-N)                 │
│                                                              │
│  1. GENERATION (8 parallel candidates)                       │
│     ├─ Backend: ollama        GGUF fine-tuned [local, FREE]  │
│     ├─ Backend: claude        Claude Haiku [HF Space]        │
│     └─ Backend: hf_inference  HF Inference API [free]        │
│                                                              │
│  2. SCORING (Epistemic Effectiveness)                        │
│     ├─ Respondibilidade (BM25 + semantic search, 919 papers) │
│     ├─ Tratabilidade (Ridge classifier, local)               │
│     └─ Não-trivialidade (semantic dissimilarity probe)       │
│                                                              │
│  3. FILTERING (Stage 1)                                      │
│     └─ Keep only: EE(q_cand) > EE(q_bad) + ε                 │
│                                                              │
│  4. SELECTION                                                │
│     └─ Return highest-scoring candidate                      │
│                                                              │
└──────────────────────┬───────────────────────────────────────┘
                       │ English Reformulation
                       ↓
┌──────────────────────────────────────────────────────────────┐
│              Translation Layer (MarianMT)                    │
│                      (en → pt)                               │
└──────────────────────┬───────────────────────────────────────┘
                       │ Portuguese Reformulation
                       ↓
┌──────────────────────────────────────────────────────────────┐
│                    Persistence Layer                         │
│   ├─ SQLite (local): historico + cache_tratabilidade         │
│   └─ HF Dataset (cross-session): fmr34/reformulatee-logs     │
│       └─ All queries logged; feedback merged for DPO         │
└──────────────────────────────────────────────────────────────┘
```

## 📦 Core Components

### 1. Generation (`src/rl/generate_free.py`, `src/rl/inference.py`)

Produces N candidate reformulations. The backend is selected per environment:

**Backends:**
| Backend | Model | Speed | Cost | Used when |
|---------|-------|-------|------|-----------|
| `ollama` | Fine-tuned GGUF (reformulatee) | Fast | FREE | Local (recommended) |
| `claude` | Claude Haiku | Fast | ~$0.001/req | HF Space (auto) |
| `hf_inference` | Qwen/Qwen2.5-1.5B | Fast | FREE | Explicit config |
| `gguf` | GGUF via llama-cpp-python | Medium | FREE | Explicit config |
| `local` | DPO fine-tuned PEFT | Slow | FREE | Explicit config |

**Backend selection logic (`app.py`):**
```python
import os

# Hugging Face Spaces sets SPACE_ID in the environment.
if os.environ.get("SPACE_ID"):
    INFERENCE_BACKEND = "claude"   # HF Space: always Claude
else:
    INFERENCE_BACKEND = "auto"     # Local: tries Ollama → Claude
```
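
The "tries Ollama → Claude" behaviour of `auto` can be pictured as an ordered fallback chain. The sketch below is illustrative only: `generate_with_fallback`, `fake_ollama`, and `fake_claude` are hypothetical names, and the two fake functions stand in for real backend clients.

```python
from typing import Callable, List, Sequence, Tuple

# (name, generate_fn) pairs; the callables used below are stand-ins, not real clients.
Backend = Tuple[str, Callable[[str], List[str]]]


def generate_with_fallback(question: str,
                           backends: Sequence[Backend]) -> Tuple[str, List[str]]:
    """Try each backend in order and return (backend_name, candidates)."""
    errors = []
    for name, generate in backends:
        try:
            return name, generate(question)
        except Exception as exc:  # sketch only: real code should catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def fake_ollama(question: str) -> List[str]:
    # Simulates a local Ollama server that is not running.
    raise ConnectionError("Ollama server not reachable")


def fake_claude(question: str) -> List[str]:
    return [f"Candidate reformulation of: {question}"]


used, candidates = generate_with_fallback(
    "What is consciousness?",
    [("ollama", fake_ollama), ("claude", fake_claude)],
)
print(used)  # claude
```

Collecting the per-backend errors keeps the final exception informative when every backend is unavailable.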

All user inputs are wrapped in `<question>` XML tags before being sent to any model to delimit data from instructions (prompt injection mitigation).
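
A minimal sketch of this wrapping step, assuming a hypothetical helper named `wrap_question`. The escaping of angle brackets is an added assumption (the document only states that inputs are wrapped); it keeps user text from closing the tag early.

```python
from html import escape


def wrap_question(user_text: str) -> str:
    """Wrap untrusted user text in <question> tags (hypothetical helper).

    Escaping &, < and > is an assumption of this sketch: it stops the input
    from closing the tag and placing text outside the delimited region.
    """
    return f"<question>{escape(user_text.strip(), quote=False)}</question>"


print(wrap_question("What is consciousness?"))
# <question>What is consciousness?</question>
```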

### 2. Translation (`src/ee/translate_local.py`)

Converts pt-br ↔ en using MarianMT.

**Models:**
- `Helsinki-NLP/opus-mt-ROMANCE-en` (~300 MB, pt→en)
- `Helsinki-NLP/opus-mt-en-ROMANCE` + `>>pt<<` prefix (en→pt)

**Cost:** FREE (local CPU inference)

**Fallback:** Claude API if `transformers` is not installed

### 3. Epistemic Effectiveness Scoring (`src/ee/reward.py`)

Computes `EE(Q) = 0.05·R + 0.05·T + 0.90·NT`
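
As a worked instance of the formula, the sketch below combines three component scores with the stated weights. The function name `epistemic_effectiveness` is illustrative, and the inputs are assumed to lie in [0, 1].

```python
# Weights as stated in the EE formula.
W_R, W_T, W_NT = 0.05, 0.05, 0.90


def epistemic_effectiveness(r: float, t: float, nt: float) -> float:
    """Weighted sum of the three component scores (each assumed to be in [0, 1])."""
    return W_R * r + W_T * t + W_NT * nt


score = epistemic_effectiveness(r=0.7, t=0.6, nt=0.85)
print(round(score, 2))  # 0.83
```

With these weights, NT dominates: R and T together can move the score by at most 0.10.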

#### 3a. Respondibilidade (R)

How well-established is the research area?

- **Source:** 919 papers (arXiv, Semantic Scholar, PubMed, Nobel Prize corpus)
- **Method:** BM25 + cosine similarity re-ranking
- **Fallback:** If the corpus is missing, R = 0 (the app warns but continues)
- **Speed:** ~200ms
- **Cache:** In-memory + SQLite
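
To illustrate the first retrieval stage, here is a small pure-Python Okapi BM25 scorer. It is a sketch, not the project's code: the function name, the whitespace-tokenized toy corpus, and the `k1`/`b` defaults are assumptions, and the cosine re-ranking stage (which would reorder the top hits by embedding similarity) is left out.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each tokenized document against a tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (log of 1 + odds-like ratio).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "neural correlates of consciousness".split(),
    "protein folding dynamics".split(),
]
scores = bm25_scores("consciousness".split(), corpus)
```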

#### 3b. Tratabilidade (T)

Can we answer this with existing tools?

- **Primary:** Ridge(alpha=50.0) on all-MiniLM-L6-v2 embeddings (local, ~22ms, free)
- **Fallback:** Claude API with prompt caching if the local classifier has not been trained
- **Cache:** In-memory + SQLite cross-session

#### 3c. Não-trivialidade (NT)

Is the reformulation significantly different from the original?

- **Method:** Cosine distance between sentence embeddings + semantic classification
- **Speed:** ~500ms (with prompt caching)
- **Cache:** In-memory + SQLite
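
The cosine-distance part of NT can be sketched in a few lines. The vectors here are toy stand-ins for sentence embeddings, and treating a zero vector as maximally dissimilar is an assumption of this sketch; the semantic classification step is not shown.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: a zero vector counts as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


original = [1.0, 0.0]      # toy embedding of the original question
paraphrase = [2.0, 0.0]    # same direction: distance 0
unrelated = [0.0, 3.0]     # orthogonal: distance 1
```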

### 4. Stage 1 Filter (`src/ee/reward.py`)

Rejects candidates whose EE score does not exceed the original question's EE score by more than a fixed margin.

```python
ε = 0.05  # threshold
passes = EE(candidate) > EE(original) + ε
```

- **Rejection rate:** ~30% at runtime
- **Fallback:** If no candidate passes, the highest-EE candidate is returned anyway
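
Filtering, fallback, and selection together can be sketched as one function. The name `select_best` and the candidate shape (a dict with `text` and `ee` keys, mirroring the `candidatos` JSON column) are assumptions for illustration.

```python
from typing import Dict, List

EPSILON = 0.05  # Stage 1 threshold


def select_best(candidates: List[Dict], ee_original: float,
                epsilon: float = EPSILON) -> Dict:
    """Apply the Stage 1 filter, then return the highest-EE candidate.

    If nothing clears the threshold, fall back to the best of all candidates
    and record that Stage 1 did not pass.
    """
    if not candidates:
        raise ValueError("no candidates to select from")
    passing = [c for c in candidates if c["ee"] > ee_original + epsilon]
    pool = passing or candidates  # fallback when the filter rejects everything
    best = max(pool, key=lambda c: c["ee"])
    return {**best, "stage1_pass": bool(passing)}


cands = [{"text": "a", "ee": 0.30}, {"text": "b", "ee": 0.82}]
print(select_best(cands, ee_original=0.15)["text"])  # b
```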

### 5. Persistence (`src/db/historico.py`, `src/db/hf_logger.py`)

Two-layer persistence strategy:

**SQLite (local, ephemeral on HF Space):**
```sql
CREATE TABLE historico (
  id INTEGER PRIMARY KEY,
  ts TIMESTAMP,
  idioma TEXT,
  pergunta_orig TEXT,           -- original question
  pergunta_en TEXT,             -- English translation
  candidatos JSON,              -- [{"text": "...", "ee": 0.5}, ...]
  melhor TEXT,                  -- best selected (English)
  melhor_pt TEXT,               -- best in Portuguese
  ee_antes FLOAT,               -- EE(original)
  ee_depois FLOAT,              -- EE(best)
  stage1_pass BOOLEAN,          -- passed filtering?
  feedback INTEGER              -- 1=👍, -1=👎, NULL=none
);
```

**HF Dataset (`fmr34/reformulatee-logs`, cross-session):**
- Every query logged as `{"type": "record", ...}` via background thread (non-blocking)
- Feedback logged as `{"type": "feedback", "id": ..., "feedback": 1}` (urgent flush)
- `ultimas()` falls back to HF Dataset when SQLite is empty (e.g. after Space restart)
- Records validated (type/length/idioma) before display to prevent cache poisoning
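
The validation step might look like the sketch below. The function name, the field names (borrowed from the SQLite columns), and the idioma whitelist are assumptions; the 1000-character cap is the limit given in the Security section.

```python
from typing import Any

ALLOWED_IDIOMAS = {"pt", "en"}  # assumed whitelist
MAX_TEXT_LEN = 1000             # character limit from the Security section


def is_valid_record(rec: Any) -> bool:
    """Return True only for well-formed 'record' entries safe to display."""
    if not isinstance(rec, dict):
        return False
    if rec.get("type") != "record":
        return False
    if rec.get("idioma") not in ALLOWED_IDIOMAS:
        return False
    # Assumed text fields; each must be a non-empty string within the limit.
    for field in ("pergunta_orig", "melhor"):
        value = rec.get(field)
        if not isinstance(value, str) or not (0 < len(value) <= MAX_TEXT_LEN):
            return False
    return True


good = {
    "type": "record",
    "idioma": "pt",
    "pergunta_orig": "O que é a consciência?",
    "melhor": "What neural signatures predict conscious reports?",
}
```

Rejecting anything that is not explicitly well-formed means a malformed or hostile row in the shared dataset is skipped rather than rendered.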

**Usage:**
```python
from src.db.historico import salvar, registrar_feedback, ultimas

record_id = salvar(pergunta_orig, candidatos, melhor, ...)
registrar_feedback(record_id, valor=1)  # 👍
history = ultimas(n=10)  # SQLite → HF Dataset fallback
```

## 🔄 Data Flow Example

```
User Input: "O que é a consciência?"
    ↓
[Translate pt→en via MarianMT]
"What is consciousness?"
    ↓
[Generate 8 candidates via Claude Haiku (HF Space) / Ollama (local)]
Input wrapped: <question>What is consciousness?</question>
{
  "candidates": [
    "What neural signatures predict conscious reports?",
    "How do synchronized neural patterns relate to awareness?",
    ...
  ]
}
    ↓
[Score each candidate via EE scoring]
{
  "candidates": [
    {"text": "...", "ee": 0.83, "resp": 0.7, "tract": 0.6, "nt": 0.85},
    ...
  ]
}
    ↓
[Stage 1 Filter: keep EE > baseline + 0.05]
    ↓
[Select best: max(score)]
    ↓
[Translate en→pt via MarianMT]
"Quais sinais neurais predizem relatórios conscientes?"
    ↓
[Save to SQLite + async log to HF Dataset]
    ↓
[Audit log: {"action": "reformulate", "session": "hash...", "ee_antes": 0.15, "ee_depois": 0.89}]
    ↓
User sees result + 👍/👎 buttons
```

## 🧠 Machine Learning Components

### Tractability Classifier

**Training:**
```bash
python -m src.classifier.train_tractability --api
```

- Trains Ridge regression on curated questions
- Features: all-MiniLM-L6-v2 sentence embeddings (384-dim)
- Target: binary labels (0/1) or real scores from Claude API
- Output: `data/models/tractability/classifier.pkl`

### DPO Fine-tuning

**Data preparation:**
```bash
python -m src.dataset.prepare_dpo
```

Consolidates DPO pairs from multiple sources (in priority order):
- `dpo_tier3.jsonl`: adversarial cross-domain pairs (highest quality)
- `dpo_tier2.jsonl`: adversarial validated pairs
- `dpo_tier1.jsonl`: curated base pairs
- `batch_pairs.jsonl`, `batch_domains.jsonl`, `batch_large.jsonl`: API-expanded
- `historico.db`: local user feedback (👍)
- HF Dataset (`fmr34/reformulatee-logs`): online user feedback (👍)

**Training on Colab:**
```bash
# See notebooks/dpo_finetune_colab.ipynb
# Model: Qwen2.5-1.5B-Instruct
# Method: DPO + LoRA (4-bit QLoRA)
# Cost: FREE (Colab T4)
# Output: uploaded to HF Hub as GGUF
```

## 🗄️ Caching Strategy

Three-level cache hierarchy for efficiency:

```
Level 1: In-Memory Dict
├─ TTL: session lifetime
└─ Speed: O(1)
          ↓
Level 2: SQLite (cross-session, local)
├─ Tables: cache_tratabilidade
├─ TTL: infinite (until manual clear)
└─ Speed: ~5ms
          ↓
Level 3: Claude API (with prompt caching)
├─ Type: ephemeral cache (TTL ~5 min)
├─ Savings: ~70% cost reduction
└─ Speed: ~500ms (first call), cached after
```
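
The lookup order can be sketched as a read-through cache: check the dict, then SQLite, and only then call the expensive scorer, writing the result back to both lower levels. The class name `TieredCache`, the column names, and the `compute` callable (standing in for the Claude API call) are assumptions of this sketch.

```python
import sqlite3
from typing import Callable, Dict


class TieredCache:
    """In-memory dict -> SQLite table -> expensive scorer (e.g. a remote API)."""

    def __init__(self, conn: sqlite3.Connection, compute: Callable[[str], float]):
        self._mem: Dict[str, float] = {}
        self._conn = conn
        self._compute = compute
        # Column names are illustrative; only the table name comes from the document.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS cache_tratabilidade "
            "(pergunta TEXT PRIMARY KEY, score REAL)"
        )

    def get(self, question: str) -> float:
        if question in self._mem:                       # Level 1: memory
            return self._mem[question]
        row = self._conn.execute(
            "SELECT score FROM cache_tratabilidade WHERE pergunta = ?",
            (question,),
        ).fetchone()
        if row is not None:                             # Level 2: SQLite
            score = row[0]
        else:                                           # Level 3: compute
            score = self._compute(question)
            self._conn.execute(
                "INSERT OR REPLACE INTO cache_tratabilidade VALUES (?, ?)",
                (question, score),
            )
            self._conn.commit()
        self._mem[question] = score
        return score


calls = []


def fake_scorer(question: str) -> float:
    calls.append(question)  # track how often the expensive path runs
    return 0.5


conn = sqlite3.connect(":memory:")
cache = TieredCache(conn, fake_scorer)
first = cache.get("What is consciousness?")
second = cache.get("What is consciousness?")  # served from memory
```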

## 🔒 Security

- **Input sanitization:** User input wrapped in `<question>` tags in all backends (prompt injection mitigation)
- **Rate limiting:** 10 requests/min per session (sliding window, in-memory)
- **Audit logging:** Structured JSON to stderr: action, timestamp, session hash (SHA-256 truncated), EE scores
- **SQLite permissions:** chmod 600 applied on every connection
- **HF Dataset records:** Validated (type, length ≤ 1000 chars, idioma whitelist) before display
- **Startup validation:** ANTHROPIC_API_KEY checked at startup on HF Space (fails fast with clear error)
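
Two of these measures, the sliding-window rate limit and the truncated session hash, are sketched below. The class and function names, the 12-character truncation length, and the optional `now` parameter (added so the example is deterministic) are assumptions.

```python
import hashlib
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per session within the last window_s seconds."""

    def __init__(self, max_requests: int = 10, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, session_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[session_id]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def session_hash(session_id: str, length: int = 12) -> str:
    """Truncated SHA-256 hex digest for audit logs (length is an assumption)."""
    return hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:length]


limiter = SlidingWindowLimiter()
results = [limiter.allow("session-a", now=float(i)) for i in range(10)]
blocked = limiter.allow("session-a", now=10.0)   # 11th request inside 60 s
allowed_later = limiter.allow("session-a", now=60.0)  # oldest hit has expired
```

Because the state lives in process memory, the counters reset on restart and are not shared across workers.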

## ⚡ Performance Characteristics

| Operation | Speed | Cost (HF Space) | Cost (Local) |
|-----------|-------|-----------------|--------------|
| Translate pt→en | ~100ms | FREE | FREE |
| Generate 8 candidates | ~8s | Claude API | FREE (Ollama) |
| Score 8 candidates | ~2s | FREE | FREE |
| Stage 1 Filter + select | ~50ms | FREE | FREE |
| Translate en→pt | ~100ms | FREE | FREE |
| **Total pipeline** | **~10s** | **~$0.001** | **$0** |

## 🚀 Deployment Modes

### Local (Zero Cost)
- Ollama + fine-tuned GGUF model
- MarianMT for translation
- Ridge classifier for tractability
- CPU-only (works on a standard laptop)
- Latency: ~10s per query

### HF Space (Public Demo)
- Claude Haiku for generation (forced when SPACE_ID present)
- MarianMT loaded on first request (~300 MB download)
- Questions persisted to HF Dataset (cross-session, cross-user)
- SQLite ephemeral (resets on restart; HF Dataset used as fallback)

### Production Scale
- Docker container + load balancer
- PostgreSQL for history (replace SQLite)
- Redis for caching (replace in-memory dict)
- Async workers for parallelization