Spaces: caarleexx/paraAI_rag
caarleexx committed on 10 days ago

Commit

785785b

verified ·

1 Parent(s): d9b9bbd

Upload 7 files

Files changed (5)

INSTRUCTIONS.md +94 -515
QUICKSTART.txt +1 -4
filter_fields.py +44 -0
query_engine.py +184 -0
rag_builder.py +105 -0

INSTRUCTIONS.md CHANGED

@@ -1,591 +1,170 @@
-# 📘 INSTRUCTIONS - Para.AI RAG Cluster
-## 🎯 Purpose of This Document
-This guide explains how to **deploy** a RAG micro-cluster on **Hugging Face Spaces (free tier)** to index and search TJPR case law.
----
-## 📋 Prerequisites
-Before you start, you need:
-1. **A Hugging Face account** (free)
-   - Create one at: https://huggingface.co/join
-2. **Data on GitHub** (chunks)
-   - A repository with `.tar.gz` chunks
-   - Example: `github.com/caarleexx/para-ai-data`
-3. **Git installed locally**
-   - Download: https://git-scm.com/downloads
-4. **Python 3.11+** (for local testing only)
-   - Download: https://python.org/downloads
----
-## 🚀 PART 1: Quick Deploy (5 minutes)
-### Step 1: Create the Space on Hugging Face
 ```bash
-# Install the Hugging Face CLI (quoted so the shell does not expand the brackets)
-pip install "huggingface_hub[cli]"
 # Login
 huggingface-cli login
-# Create a new Space
 huggingface-cli repo create para-ai-rag-0301 --type space --space_sdk docker
 ```
-**Expected result:**
-```
-✓ Space criado: https://huggingface.co/spaces/seu-usuario/para-ai-rag-0301
-```
----
-### Step 2: Clone and Configure
 ```bash
-# Clone the empty Space
-git clone https://huggingface.co/spaces/seu-usuario/para-ai-rag-0301
-cd para-ai-rag-0301
-# Copy the files from this example
-cp -r /caminho/para/hf_space_rag_example/* .
-# Edit config.yaml
-nano config.yaml
-```
-**Edit these lines in `config.yaml`:**
-```yaml
-cluster_id: "RAG-0301"        # Unique identifier
-chunk_start: 301              # First chunk
-chunk_end: 600                # Last chunk
-github_repo: "https://github.com/SEU-USUARIO/para-ai-data.git"  # Your repo
-```
----
-### Step 3: Deploy
-```bash
-# Add files
-git add .
 # Commit
-git commit -m "Initial deployment - Chunks 301-600"
-# Push to Hugging Face
-git push
 ```
-**HF Spaces will then automatically:**
-1. Detect the `Dockerfile`
-2. Build the container
-3. Run `entrypoint.sh`
-4. Clone the chunks from GitHub
-5. Build the ChromaDB index
-6. Start FastAPI
-**⏱️ Total time:** ~10-15 minutes
----
-### Step 4: Check Status
-Go to: `https://huggingface.co/spaces/seu-usuario/para-ai-rag-0301`
-**You will see:**
-- 🟢 **Building** (5-10 min) → Container being built
-- 🟡 **Running** (5-10 min) → Cloning data and building ChromaDB
-- 🟢 **Ready** → API available!
-**Test the API:**
 ```bash
-curl https://seu-usuario-para-ai-rag-0301.hf.space/cluster/info
-```
-**Expected response:**
-```json
-{
-  "cluster_id": "RAG-0301",
-  "chunk_range": [301, 600],
-  "total_records": 295432,
-  "status": "ready"
-}
-```
----
-## 🔧 PART 2: Advanced Customization
-### Option 1: Change the Embedding Model
-**In `config.yaml`:**
-```yaml
-# Current model (light, fast)
-embedding_model: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
-embedding_dim: 384
-# Alternatives (uncomment one pair and remove the pair above):
-# Larger model (better quality, slower)
-# embedding_model: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
-# embedding_dim: 768
-# Portuguese-specific model
-# embedding_model: "neuralmind/bert-base-portuguese-cased"
-# embedding_dim: 768
-```
-**⚠️ Caution:**
-- Larger models use more RAM
-- Check that the model fits in the free tier (16GB)
----
-### Option 2: Add More Fields
-**Edit `config.yaml`:**
-```yaml
-campos_filter:
-  - id
-  - ementa
-  - data_decisao      # Add
-  - relator           # Add
-  - orgao_julgador    # Add
-```
-**Edit `filter_fields.py`:**
-Nothing to change: it reads the field list from `config.yaml` automatically.
-**Edit `query_engine.py`:**
-To search by the additional fields:
-```python
-def search_by_metadata(self, field: str, value: str):
-    """Search by metadata field"""
-    # get() filters on metadata only, so no query embedding is needed
-    results = self.collection.get(
-        where={field: value},
-        limit=100
-    )
-    return results
-```
----
-### Option 3: Tune Performance
-**In `config.yaml`:**
-```yaml
-# Batch size for embeddings
-embedding_batch_size: 64  # Default
-# Raise to 128 if RAM is available
-# Lower to 32 if you run out of RAM
-# Workers
-max_workers: 2  # Default (free tier has 2 vCPUs)
-```
-**In `app.py` (FastAPI):**
-```python
-# Increase Uvicorn workers (default: 1); multiple workers require an import string
-uvicorn.run("app:app", host="0.0.0.0", port=7860, workers=2)
 ```
-**⚠️ Careful:** More workers = more RAM
----
-## 📊 PART 3: Multiple Clusters
-To cover **all 4500 chunks**, create **15 Spaces**:
-| Space ID | Chunks | Records | HF Space URL |
-|----------|--------|-----------|--------------|
-| RAG-0001 | 1-300 | ~300k | /para-ai-rag-0001 |
-| RAG-0301 | 301-600 | ~300k | /para-ai-rag-0301 |
-| RAG-0601 | 601-900 | ~300k | /para-ai-rag-0601 |
-| RAG-0901 | 901-1200 | ~300k | /para-ai-rag-0901 |
-| ... | ... | ... | ... |
-| RAG-4201 | 4201-4500 | ~300k | /para-ai-rag-4201 |
-### Bulk Deploy Script
 ```bash
 #!/bin/bash
-# deploy_all_clusters.sh
 for START in 1 301 601 901 1201 1501 1801 2101 2401 2701 3001 3301 3601 3901 4201; do
   END=$((START + 299))
-  CLUSTER_ID=$(printf "RAG-%04d" $START)
   SPACE_NAME="para-ai-rag-$(printf "%04d" $START)"
-  echo "Criando $SPACE_NAME (chunks $START-$END)..."
-  # Create the Space
   huggingface-cli repo create $SPACE_NAME --type space --space_sdk docker
-  # Clone
-  git clone https://huggingface.co/spaces/seu-usuario/$SPACE_NAME
   cd $SPACE_NAME
-  # Copy the template
   cp -r ../hf_space_rag_example/* .
-  # Update the config (escaped quotes keep cluster_id quoted in the YAML)
-  sed -i "s/cluster_id: .*/cluster_id: \"$CLUSTER_ID\"/" config.yaml
   sed -i "s/chunk_start: .*/chunk_start: $START/" config.yaml
   sed -i "s/chunk_end: .*/chunk_end: $END/" config.yaml
-  # Deploy
   git add .
-  git commit -m "Deploy $CLUSTER_ID"
   git push
   cd ..
-  echo "✅ $SPACE_NAME deployed!"
-  sleep 5  # Wait 5s to avoid overloading HF
 done
-echo "🎉 Todos os 15 clusters deployados!"
-```
----
-## 🌐 PART 4: Aggregating Gateway (Optional)
-To search **all clusters** at once, create a **Gateway Space**:
-### Gateway Architecture
-```python
-# gateway_app.py
-from fastapi import FastAPI
-import asyncio
-import httpx
-app = FastAPI(title="Para.AI Gateway")
-# List of all clusters
-CLUSTERS = [
-    "https://seu-usuario-para-ai-rag-0001.hf.space",
-    "https://seu-usuario-para-ai-rag-0301.hf.space",
-    "https://seu-usuario-para-ai-rag-0601.hf.space",
-    # ... all 15
-]
-@app.post("/search/embedding")
-async def search_all_clusters(query: str, top_k: int = 10):
-    """Search every cluster and aggregate the results"""
-    async with httpx.AsyncClient(timeout=30.0) as client:
-        tasks = [
-            client.post(
-                f"{cluster}/search/embedding",
-                json={"query": query, "top_k": top_k}
-            )
-            for cluster in CLUSTERS
-        ]
-        responses = await asyncio.gather(*tasks, return_exceptions=True)
-    # Aggregate results, skipping clusters that failed or returned an error status
-    all_results = []
-    for resp in responses:
-        if isinstance(resp, Exception) or resp.status_code != 200:
-            continue
-        data = resp.json()
-        all_results.extend(data['results'])
-    # Sort by score
-    all_results.sort(key=lambda x: x['score'], reverse=True)
-    return {
-        "clusters_consulted": len(CLUSTERS),
-        "total_found": len(all_results),
-        "results": all_results[:top_k]  # Global top K
-    }
-```
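The aggregation step of the gateway can be isolated into a pure function that is easy to test without any running cluster. The sketch below is an illustration only: `merge_top_k` is a hypothetical helper name, and it assumes each cluster payload has the `results` list with a `score` field shown in the gateway code above.

```python
import heapq
from typing import Dict, List


def merge_top_k(cluster_payloads: List[Dict], top_k: int = 10) -> List[Dict]:
    """Merge per-cluster result lists and keep the global top_k by score."""
    all_results = []
    for payload in cluster_payloads:
        # Tolerate payloads that came back without a 'results' key
        all_results.extend(payload.get("results", []))
    # nlargest avoids sorting the full list when top_k is small
    return heapq.nlargest(top_k, all_results, key=lambda r: r["score"])
```

Because every cluster already returns its own top K, merging 15 payloads touches at most 15 × K items, so the merge itself is cheap compared with the network round trips.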
-**Deploying the Gateway:**
-```bash
-huggingface-cli repo create para-ai-gateway --type space --space_sdk docker
-# ... copy gateway_app.py, Dockerfile, etc.
 ```
----
-## 🧪 PART 5: Local Tests (Development)
-### Testing Without Deploying
-```bash
-# Enter the folder
-cd hf_space_rag_example
-# Install dependencies
-pip install -r requirements.txt
-# Edit the config (use few chunks for testing)
-# config.yaml: chunk_start: 301, chunk_end: 305 (only 5 chunks!)
-# Run the steps manually
-# 1. Filter fields
-python filter_fields.py --input /tmp/test.jsonl --output /tmp/filtered.jsonl
-# 2. Build ChromaDB
-python rag_builder.py --input /tmp/filtered.jsonl
-# 3. Start the API
-python app.py
-# or
-uvicorn app:app --reload
-```
-**Test the API locally:**
-```bash
-# Test 1: Cluster info
-curl http://localhost:7860/cluster/info
-# Test 2: Semantic search
-curl -X POST http://localhost:7860/search/embedding \
-  -H "Content-Type: application/json" \
-  -d '{"query": "despejo", "top_k": 3}'
-# Test 3: Keyword search
-curl -X POST http://localhost:7860/search/keywords \
-  -H "Content-Type: application/json" \
-  -d '{"keywords": ["despejo", "aluguel"], "operator": "AND"}'
-```
----
-## 🐛 PART 6: Troubleshooting
-### Problem 1: Build Timeout on HF
-**Symptom:** Space stays in "Building" for >1h and then fails
-**Cause:** The chunk download is too large
-**Solution:**
-1. Check that the sparse checkout is working:
-```bash
-# In entrypoint.sh, add debug output:
-echo "Chunks encontrados:"
-find chunks_dados -name "*.tar.gz" | wc -l
-```
-2. Temporarily reduce the chunk range:
 ```yaml
-chunk_start: 301
-chunk_end: 320  # Only 20 chunks for testing
 ```
----
-### Problem 2: Out of Memory (OOM)
-**Symptom:** Space restarts continuously
-**Cause:** Embedding model too large, or too many chunks
-**Solutions:**
-1. **Use a smaller model:**
 ```yaml
-embedding_model: "sentence-transformers/all-MiniLM-L6-v2"  # Only 80MB, but trained mainly on English text
 embedding_dim: 384
-```
-2. **Reduce the batch size:**
-```yaml
-embedding_batch_size: 32  # Instead of 64
-```
-3. **Reduce the number of chunks:**
-```yaml
-chunk_end: 400  # Instead of 600
-```
----
-### Problem 3: Git Clone Is Very Slow
-**Symptom:** Cloning takes >30min
-**Cause:** The sparse checkout did not take effect
-**Solution:**
-Check that the pattern is correct:
-```bash
-# In entrypoint.sh
-echo "Pattern de sparse checkout:"
-echo "$PATTERN"
-# It should print something like:
-# chunks_dados/chunk_dados_0301.tar.gz chunks_dados/chunk_dados_0302.tar.gz ...
-```
-If it is wrong, fix how the pattern is generated:
-```bash
-# Correct zero-padded sequence
-for i in $(seq -f "%04g" $CHUNK_START $CHUNK_END); do
-  echo "chunks_dados/chunk_dados_$i.tar.gz"
-done
-```
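The same path list can be generated in Python, which makes the zero-padding easy to check before touching `entrypoint.sh`. This is a sketch under the assumption that the chunk files follow the `chunks_dados/chunk_dados_NNNN.tar.gz` naming shown above; `sparse_checkout_paths` is a hypothetical helper, not part of the repository.

```python
from typing import List


def sparse_checkout_paths(chunk_start: int, chunk_end: int) -> List[str]:
    """Return the chunk archive paths for an inclusive chunk range."""
    return [
        # :04d zero-pads to four digits, matching seq -f "%04g"
        f"chunks_dados/chunk_dados_{i:04d}.tar.gz"
        for i in range(chunk_start, chunk_end + 1)
    ]
```

For example, `sparse_checkout_paths(301, 600)` yields 300 paths, one per chunk in the RAG-0301 range.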
----
-### Problem 4: API Returns 503
-**Symptom:** The API is reachable but always returns a 503 error
-**Cause:** ChromaDB was not built or is corrupted
-**Solution:**
-1. Check the Space logs on HF
-2. Check that `/app/chromadb` exists and is not empty
-3. Force a rebuild:
-```bash
-# In entrypoint.sh, remove the persistence check:
-# if [ -d "/app/chromadb" ] && [ "$(ls -A /app/chromadb)" ]; then
-#     ...
-# fi
-# Comment out these lines to rebuild on every start
-```
----
-## 📈 PART 7: Monitoring
-### Space Logs
-Go to: `https://huggingface.co/spaces/seu-usuario/para-ai-rag-0301/logs`
-**What to look for:**
-```
-✅ GOOD:
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-✓ 300 chunks clonados
-✓ 295432 registros concatenados
-✓ Campos filtrados
-✓ ChromaDB pronto!
-✓ Para.AI RAG Cluster RAG-0301 ONLINE
-❌ BAD:
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-✗ Erro ao clonar repositório
-✗ Out of memory
-✗ ChromaDB corrupted
-```
----
-### Performance Metrics
-Add to `app.py`:
-```python
-from fastapi import Response
-from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
-# Metrics
-search_requests = Counter('search_requests_total', 'Total search requests')
-search_latency = Histogram('search_latency_seconds', 'Search latency')
-@app.get("/metrics")
-async def metrics():
-    # generate_latest() returns bytes in the Prometheus text format
-    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
 ```
----
-## 🎓 PART 8: Next Steps
-Once the clusters are running:
-1. **Integrate with a front-end application**
-   - Build a Gradio UI that queries the clusters
-   - Deploy it in another HF Space
-2. **Add a cache**
-   - Redis to cache frequent queries
-   - Reduces latency and cost
-3. **Fine-tune the model**
-   - Train an embedding model specific to case law
-   - Improves result quality
-4. **Add re-ranking**
-   - A cross-encoder to re-rank the top results
-   - Increases precision
----
-## 📚 Additional Resources
-- **HF Spaces documentation:** https://huggingface.co/docs/hub/spaces
-- **ChromaDB Docs:** https://docs.trychroma.com/
-- **Sentence Transformers:** https://www.sbert.net/
-- **FastAPI Docs:** https://fastapi.tiangolo.com/
----
-## ✅ Final Checklist
-Before considering the deploy complete:
-- [ ] Space is **Ready** (not Building)
-- [ ] `/cluster/info` returns correct data
-- [ ] Semantic search works
-- [ ] Keyword search works
-- [ ] Search by ID works
-- [ ] Latency < 200ms for queries
-- [ ] Logs show no errors
-- [ ] README.md updated with the Space URL
-- [ ] Tested with real queries
----
-## 🎉 Conclusion
-Congratulations! You now have a working **RAG micro-cluster** on HF Spaces that can:
-✅ Index ~300k case-law records
-✅ Search in <100ms
-✅ Run at zero cost (free tier)
-✅ Scale to millions of records
-**⚖️ InJustiça não para o Paraná!**

+# 📘 Para.AI RAG Cluster - Instructions
+## 🎯 Purpose
+Deploy a RAG micro-cluster on Hugging Face Spaces (free tier) to index ~300k TJPR case-law records.
+## 📋 Project Files
+| File | Purpose |
+|---------|---------|
+| `config.yaml` | Configuration (EDIT HERE!) |
+| `Dockerfile` | Docker container |
+| `entrypoint.sh` | Startup script |
+| `requirements.txt` | Python dependencies |
+| `filter_fields.py` | Filters JSONL fields |
+| `rag_builder.py` | Builds the ChromaDB index |
+| `query_engine.py` | Search engine |
+| `app.py` | FastAPI REST API |
+| `README.md` | Basic documentation |
+| `.gitignore` | Ignored files |
+## 🚀 Deploy Step-by-Step
+### 1. Prepare the Configuration
+Edit `config.yaml`:
+```yaml
+cluster_id: "RAG-0301"      # Your unique ID
+chunk_start: 301            # First chunk
+chunk_end: 600              # Last chunk (300 chunks = ~300k records)
+github_repo: "https://github.com/SEU-USUARIO/para-ai-data.git"
+```
+### 2. Create the Space on HF
 ```bash
 # Login
 huggingface-cli login
+# Create the Space
 huggingface-cli repo create para-ai-rag-0301 --type space --space_sdk docker
 ```
+### 3. Upload the Files
 ```bash
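+cd hf_space_rag_example
+# Initialize Git
+git init
+git remote add origin https://huggingface.co/spaces/SEU-USUARIO/para-ai-rag-0301
 # Commit
+git add .
+git commit -m "Initial deployment"
+# Push (rename the local branch to main; --force overwrites the initial commit created with the Space)
+git branch -M main
+git push --force origin main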
 ```
+### 4. Wait for the Build
+- Go to: https://huggingface.co/spaces/SEU-USUARIO/para-ai-rag-0301
+- Status: Building (5-10 min) → Running (5-10 min) → Ready ✅
+### 5. Test
 ```bash
+# Cluster info
+curl https://SEU-USUARIO-para-ai-rag-0301.hf.space/cluster/info
+# Semantic search
+curl -X POST https://SEU-USUARIO-para-ai-rag-0301.hf.space/search/embedding \
+  -H "Content-Type: application/json" \
+  -d '{"query": "despejo falta pagamento", "top_k": 5}'
 ```
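The curl calls above translate directly to Python's standard library. The sketch below only builds the request, so it can be inspected without network access. `space_base_url` and `build_search_request` are hypothetical helper names, and the URL pattern is assumed from the curl examples rather than taken from Hugging Face documentation.

```python
import json
import urllib.request


def space_base_url(user: str, space: str) -> str:
    """Base URL of a Space, following the pattern of the curl examples."""
    return f"https://{user}-{space}.hf.space"


def build_search_request(base_url: str, query: str, top_k: int = 5) -> urllib.request.Request:
    """Build (but do not send) the POST request for /search/embedding."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search/embedding",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually send it, pass the request to `urllib.request.urlopen(req, timeout=30)` and decode the JSON body of the response.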
+## 🌐 Distributed Architecture
+To cover all 4.5M records, create 15 Spaces:
+| Space | Chunks | Config |
+|-------|--------|--------|
+| para-ai-rag-0001 | 1-300 | chunk_start: 1, chunk_end: 300 |
+| para-ai-rag-0301 | 301-600 | chunk_start: 301, chunk_end: 600 |
+| para-ai-rag-0601 | 601-900 | chunk_start: 601, chunk_end: 900 |
+| ... | ... | ... |
+| para-ai-rag-4201 | 4201-4500 | chunk_start: 4201, chunk_end: 4500 |
+**Automated deploy script** (bash):
 ```bash
 #!/bin/bash
 for START in 1 301 601 901 1201 1501 1801 2101 2401 2701 3001 3301 3601 3901 4201; do
   END=$((START + 299))
   SPACE_NAME="para-ai-rag-$(printf "%04d" $START)"
   huggingface-cli repo create $SPACE_NAME --type space --space_sdk docker
+  git clone https://huggingface.co/spaces/SEU-USUARIO/$SPACE_NAME
   cd $SPACE_NAME
+  # Copy the template and update the config
   cp -r ../hf_space_rag_example/* .
   sed -i "s/chunk_start: .*/chunk_start: $START/" config.yaml
   sed -i "s/chunk_end: .*/chunk_end: $END/" config.yaml
   git add .
+  git commit -m "Deploy cluster $START-$END"
   git push
   cd ..
 done
 ```
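Note that the script above rewrites only `chunk_start` and `chunk_end`, so every Space would keep the `cluster_id` from the template. One way to catch this is to generate the full plan first and check it before deploying. The sketch below is an assumption-laden illustration: `cluster_plan` is a hypothetical helper, and the `RAG-NNNN` / `para-ai-rag-NNNN` naming is taken from the table above.

```python
from typing import Dict, List


def cluster_plan(total_chunks: int = 4500, chunks_per_space: int = 300) -> List[Dict]:
    """List the Space name, cluster_id and chunk range for every cluster."""
    plan = []
    for start in range(1, total_chunks + 1, chunks_per_space):
        # The last cluster may be shorter if the total is not a multiple
        end = min(start + chunks_per_space - 1, total_chunks)
        plan.append({
            "space_name": f"para-ai-rag-{start:04d}",
            "cluster_id": f"RAG-{start:04d}",
            "chunk_start": start,
            "chunk_end": end,
        })
    return plan
```

Each entry carries the three values that differ between Spaces, so a deploy loop can write all of them into `config.yaml` instead of only the chunk range.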
+## 🔧 Customization
+### Add More Fields
+In `config.yaml`:
 ```yaml
+campos_filter:
+  - id
+  - ementa
+  - data_decisao     # Add
+  - relator          # Add
 ```
+### Change the Embedding Model
 ```yaml
+# Option 1: Lighter (default)
+embedding_model: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
 embedding_dim: 384
+# Option 2: Better quality (uncomment and remove Option 1 to use)
+# embedding_model: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
+# embedding_dim: 768
 ```
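Doubling the embedding dimension doubles the raw vector storage, which matters on the free tier. The following back-of-the-envelope sketch assumes vectors are stored as 32-bit floats and ignores the index structure, documents and metadata, so treat the numbers as a lower bound. `embedding_storage_mib` is a hypothetical helper.

```python
def embedding_storage_mib(num_records: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of the embedding vectors in MiB (float32 by default)."""
    return num_records * dim * bytes_per_value / (1024 * 1024)


# ~300k records per cluster, as in the configuration above
small = embedding_storage_mib(300_000, 384)  # about 439 MiB
large = embedding_storage_mib(300_000, 768)  # about 879 MiB
```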
+## 🐛 Troubleshooting
+### Very slow build
+- Check that the sparse checkout is working
+- Temporarily reduce the number of chunks
+### Out of Memory
+- Use a smaller model: `all-MiniLM-L6-v2`
+- Reduce `embedding_batch_size` to 32
+- Narrow the chunk range
+### API returns 503
+- Check the Space logs on HF
+- Check that the ChromaDB index was built
+## 📊 Resources Used
+| Resource | Used | Available (Free) |
+|---------|-------|-------------------|
+| RAM | ~2GB | 16GB ✅ |
+| Disk | ~2.6GB | 50GB ✅ |
+| CPU | 1 core | 2 cores ✅ |
+## ⚖️ About Para.AI
+Open-source project to democratize access to justice in Paraná using AI.
+🐝 **InJustiça não para o Paraná!**
+📧 github.com/caarleexx/para-ai

QUICKSTART.txt CHANGED

@@ -18,7 +18,6 @@
 3. Deploy:
-   $ cd hf_space_rag_example
    $ git init
    $ git remote add origin https://huggingface.co/spaces/SEU-USUARIO/para-ai-rag-0301
    $ git add .
@@ -33,8 +32,6 @@
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-📖 READ:
-   • README.md - API docs
-   • INSTRUCTIONS.md - Full guide
 ✅ DONE! Your RAG is online!

 3. Deploy:
    $ git init
    $ git remote add origin https://huggingface.co/spaces/SEU-USUARIO/para-ai-rag-0301
    $ git add .
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+📖 READ: INSTRUCTIONS.md for the full guide
 ✅ DONE! Your RAG is online!

filter_fields.py ADDED

	@@ -0,0 +1,44 @@

+#!/usr/bin/env python3
+"""
+Filter a JSONL file, keeping only the specified fields
+"""
+import json
+import yaml
+from pathlib import Path
+import argparse
+from tqdm import tqdm
+def filter_jsonl(input_path: str, output_path: str, keep_fields: list = None):
+    """Filter the fields of a JSONL file"""
+    # Load the field list from the config when not given
+    if keep_fields is None:
+        with open('config.yaml', encoding='utf-8') as f:
+            config = yaml.safe_load(f)
+            keep_fields = config['campos_filter']
+    print(f"📥 Input: {input_path}")
+    print(f"📤 Output: {output_path}")
+    print(f"🔧 Mantendo campos: {keep_fields}")
+    # Count lines (for the progress bar)
+    with open(input_path, encoding='utf-8') as f:
+        total = sum(1 for _ in f)
+    # Filter; explicit UTF-8 so accented text survives any system locale
+    with open(input_path, encoding='utf-8') as fin, open(output_path, 'w', encoding='utf-8') as fout:
+        for line in tqdm(fin, total=total, desc="Filtrando"):
+            record = json.loads(line)
+            filtered = {k: record[k] for k in keep_fields if k in record}
+            fout.write(json.dumps(filtered, ensure_ascii=False) + '\n')
+    print(f"✅ {total} registros filtrados!")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--input', required=True)
+    parser.add_argument('--output', required=True)
+    parser.add_argument('--keep', nargs='+', default=None)
+    args = parser.parse_args()
+    filter_jsonl(args.input, args.output, args.keep)
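The core of `filter_jsonl` is a single dict comprehension that silently skips fields missing from a record. A minimal in-memory sketch of that behaviour follows; `filter_record` is a hypothetical helper and the sample record, including the `inteiro_teor` field, is made up for illustration.

```python
import json


def filter_record(record: dict, keep_fields: list) -> dict:
    """Keep only the requested fields; absent fields are skipped, not set to None."""
    return {k: record[k] for k in keep_fields if k in record}


sample = {"id": 1, "ementa": "Ação de despejo", "relator": "Fulano", "inteiro_teor": "..."}
# 'data_decisao' is requested but absent, so it does not appear in the output
line = json.dumps(filter_record(sample, ["id", "ementa", "data_decisao"]), ensure_ascii=False)
```

With `ensure_ascii=False` the accented characters are written as-is, which is why the output file should be opened as UTF-8.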

query_engine.py ADDED

	@@ -0,0 +1,184 @@

+#!/usr/bin/env python3
+"""
+Search engine for ChromaDB
+"""
+import yaml
+import chromadb
+from sentence_transformers import SentenceTransformer
+from typing import List, Dict, Optional
+import logging
+logger = logging.getLogger(__name__)
+class QueryEngine:
+    """Engine de busca com ChromaDB"""
+    def __init__(self, config_path: str = 'config.yaml'):
+        # Carregar config
+        with open(config_path) as f:
+            self.config = yaml.safe_load(f)
+        # Carregar modelo de embedding
+        logger.info(f"Carregando modelo {self.config['embedding_model']}...")
+        self.model = SentenceTransformer(self.config['embedding_model'])
+        # Conectar ao ChromaDB
+        logger.info(f"Conectando ao ChromaDB...")
+        self.client = chromadb.PersistentClient(path=self.config['chromadb_path'])
+        self.collection = self.client.get_collection(self.config['collection_name'])
+        logger.info(f"✅ QueryEngine pronto ({self.collection.count():,} registros)")
+    def search_by_embedding(
+        self,
+        query: str,
+        top_k: int = 10,
+        return_embeddings: bool = False
+    ) -> Dict:
+        """Busca por similaridade semântica"""
+        # Gerar embedding da query
+        query_embedding = self.model.encode(query).tolist()
+        # Buscar no ChromaDB
+        results = self.collection.query(
+            query_embeddings=[query_embedding],
+            n_results=top_k,
+            include=['documents', 'metadatas', 'distances', 'embeddings'] if return_embeddings
+                     else ['documents', 'metadatas', 'distances']
+        )
+        # Formatar resposta
+        formatted_results = []
+        for i in range(len(results['ids'][0])):
+            result = {
+                'id': results['ids'][0][i],
+                'ementa': results['documents'][0][i],
+                'distance': results['distances'][0][i],
+                # 1 - distance is a cosine similarity only when the collection uses cosine distance (ChromaDB defaults to squared L2)
+                'score': 1.0 - results['distances'][0][i]
+            }
+            if return_embeddings and 'embeddings' in results:
+                result['embedding'] = results['embeddings'][0][i]
+            formatted_results.append(result)
+        return {
+            'cluster_id': self.config['cluster_id'],
+            'chunk_range': [self.config['chunk_start'], self.config['chunk_end']],
+            'results': formatted_results,
+            'total_found': len(formatted_results)
+        }
+    def search_by_keywords(
+        self,
+        keywords: List[str],
+        operator: str = 'AND',
+        top_k: int = 20
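+    ) -> Dict:
+        """Keyword search (full-text filter via ChromaDB where_document)"""
+        # Build the full-text filter; $contains is case-sensitive
+        clauses = [{'$contains': kw} for kw in keywords]
+        if len(clauses) == 1:
+            where_document = clauses[0]  # $and/$or need at least two clauses
+        else:
+            logical_op = '$and' if operator.upper() == 'AND' else '$or'
+            where_document = {logical_op: clauses}
+        # Rank the matches with the same model that built the index
+        query_embedding = self.model.encode(' '.join(keywords)).tolist()
+        results = self.collection.query(
+            query_embeddings=[query_embedding],
+            n_results=top_k,
+            where_document=where_document,
+            include=['documents', 'metadatas']
+        )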
+        # Formatar resposta
+        formatted_results = []
+        for i in range(len(results['ids'][0])):
+            # Verificar quais keywords foram matchadas
+            doc = results['documents'][0][i].lower()
+            matched = [kw for kw in keywords if kw.lower() in doc]
+            formatted_results.append({
+                'id': results['ids'][0][i],
+                'ementa': results['documents'][0][i],
+                'matched_keywords': matched
+            })
+        return {
+            'cluster_id': self.config['cluster_id'],
+            'results': formatted_results,
+            'total_found': len(formatted_results)
+        }
+    def search_by_ids(
+        self,
+        ids: List[str],
+        return_embeddings: bool = False
+    ) -> Dict:
+        """Busca direta por ID(s)"""
+        # Buscar por IDs
+        try:
+            results = self.collection.get(
+                ids=ids,
+                include=['documents', 'metadatas', 'embeddings'] if return_embeddings
+                         else ['documents', 'metadatas']
+            )
+        except Exception as e:
+            logger.error(f"Erro ao buscar IDs: {e}")
+            return {
+                'cluster_id': self.config['cluster_id'],
+                'results': [],
+                'not_found': ids,
+                'total_found': 0
+            }
+        # Formatar resposta
+        formatted_results = []
+        found_ids = set(results['ids'])
+        for i in range(len(results['ids'])):
+            result = {
+                'id': results['ids'][i],
+                'ementa': results['documents'][i]
+            }
+            if return_embeddings and 'embeddings' in results:
+                result['embedding'] = results['embeddings'][i]
+            formatted_results.append(result)
+        # IDs não encontrados
+        not_found = [id for id in ids if id not in found_ids]
+        return {
+            'cluster_id': self.config['cluster_id'],
+            'results': formatted_results,
+            'not_found': not_found,
+            'total_found': len(formatted_results)
+        }
+    def get_cluster_info(self) -> Dict:
+        """Retorna informações do cluster"""
+        import os
+        # Calcular tamanho do ChromaDB
+        db_path = self.config['chromadb_path']
+        total_size = 0
+        for dirpath, dirnames, filenames in os.walk(db_path):
+            for f in filenames:
+                fp = os.path.join(dirpath, f)
+                total_size += os.path.getsize(fp)
+        db_size_mb = total_size / (1024 * 1024)
+        return {
+            'cluster_id': self.config['cluster_id'],
+            'chunk_range': [self.config['chunk_start'], self.config['chunk_end']],
+            'total_records': self.collection.count(),
+            'embedding_model': self.config['embedding_model'],
+            'embedding_dim': self.config['embedding_dim'],
+            'campos_disponiveis': self.config['campos_filter'],
+            'db_size_mb': round(db_size_mb, 2),
+            'status': 'ready'
+        }
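The `score` returned by `search_by_embedding` is `1 - distance`, and what that means depends on the distance metric of the collection. As a worked illustration in plain Python (no ChromaDB involved), the sketch below assumes unit-length embeddings and the usual definitions: cosine distance is `1 - cos`, and squared L2 distance between unit vectors is `2 - 2·cos`. The helper names are hypothetical.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def score_from_distance(distance: float, metric: str) -> float:
    """Convert a distance into a cosine-style score."""
    if metric == "cosine":
        return 1.0 - distance
    if metric == "l2":
        # Valid for unit-length vectors only: ||a - b||^2 = 2 - 2*cos
        return 1.0 - distance / 2.0
    raise ValueError(f"unsupported metric: {metric}")
```

For two orthogonal unit vectors the squared L2 distance is 2, so the naive `1 - distance` gives -1 while the cosine similarity is 0. That is why the metric needs to be known before scores from different clusters are compared.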

rag_builder.py ADDED

	@@ -0,0 +1,105 @@

+#!/usr/bin/env python3
+"""
+Builds a ChromaDB index with embeddings from a filtered JSONL file
+"""
+import json
+import yaml
+from pathlib import Path
+import argparse
+import chromadb
+from sentence_transformers import SentenceTransformer
+from tqdm import tqdm
+import logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+def build_chromadb(input_jsonl: str, config_path: str = 'config.yaml'):
+    """Constrói ChromaDB a partir de JSONL"""
+    # Carregar config
+    with open(config_path) as f:
+        config = yaml.safe_load(f)
+    logger.info("="*80)
+    logger.info("🔧 CONSTRUINDO CHROMADB")
+    logger.info("="*80)
+    logger.info(f"Cluster ID: {config['cluster_id']}")
+    logger.info(f"Chunks: {config['chunk_start']} - {config['chunk_end']}")
+    logger.info(f"Embedding Model: {config['embedding_model']}")
+    # Carregar modelo de embedding
+    logger.info("\n📥 Carregando modelo de embedding...")
+    model = SentenceTransformer(config['embedding_model'])
+    logger.info(f"✅ Modelo carregado (dim={config['embedding_dim']})")
+    # Inicializar ChromaDB
+    logger.info(f"\n💾 Inicializando ChromaDB em {config['chromadb_path']}...")
+    client = chromadb.PersistentClient(path=config['chromadb_path'])
+    # Recreate the collection from scratch if it already exists
+    try:
+        client.get_collection(config['collection_name'])
+        logger.info(f"⚠️  Collection '{config['collection_name']}' já existe! Apagando...")
+        client.delete_collection(config['collection_name'])
+    except Exception:  # the collection does not exist yet
+        pass
+    collection = client.create_collection(
+        name=config['collection_name'],
+        metadata={
+            "cluster_id": config['cluster_id'],
+            "chunk_start": config['chunk_start'],
+            "chunk_end": config['chunk_end'],
+            "hnsw:space": "cosine"  # cosine distance, so 1 - distance is a cosine similarity
+        }
+    )
+    logger.info(f"✅ Collection criada")
+    # Load records (explicit UTF-8 so accented text survives any system locale)
+    logger.info(f"\n📖 Carregando registros de {input_jsonl}...")
+    records = []
+    with open(input_jsonl, encoding='utf-8') as f:
+        for line in f:
+            records.append(json.loads(line))
+    total = len(records)
+    logger.info(f"✅ {total:,} registros carregados")
+    # Processar em batches
+    batch_size = config['embedding_batch_size']
+    logger.info(f"\n🚀 Gerando embeddings em batches de {batch_size}...")
+    for i in tqdm(range(0, total, batch_size), desc="Embedding"):
+        batch = records[i:i+batch_size]
+        # IDs
+        ids = [str(r['id']) for r in batch]
+        # Documents (the ementa text is what gets embedded); a null ementa becomes ''
+        documents = [r.get('ementa') or '' for r in batch]
+        # Metadata
+        metadatas = [{'id': r['id']} for r in batch]
+        # Generate embeddings
+        embeddings = model.encode(documents, show_progress_bar=False).tolist()
+        # Add to ChromaDB
+        collection.add(
+            ids=ids,
+            embeddings=embeddings,
+            documents=documents,
+            metadatas=metadatas
+        )
+    logger.info(f"\n✅ ChromaDB construído com sucesso!")
+    logger.info(f"📊 Total de registros: {collection.count():,}")
+    logger.info("="*80)
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--input', required=True, help='JSONL filtrado')
+    parser.add_argument('--config', default='config.yaml')
+    args = parser.parse_args()
+    build_chromadb(args.input, args.config)
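The batching loop in `build_chromadb` slices the record list in steps of `embedding_batch_size`. A standalone sketch of that pattern, with hypothetical helper names, shows how the last batch ends up shorter and how many `collection.add` calls a given cluster size implies.

```python
import math
from typing import Iterator, List


def iter_batches(items: List, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def num_batches(total: int, batch_size: int) -> int:
    """Number of batches needed to cover total items."""
    return math.ceil(total / batch_size)
```

Halving the batch size doubles the number of batches but lowers peak memory during `model.encode`, which is the trade-off behind the `embedding_batch_size` advice in the instructions.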