{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Quickstart: Community Notes Reranker (PT-BR)\n", "\n", "Minimal inference notebook. It downloads the model (Qwen3-Reranker-0.6B base + LoRA adapter), builds the prompt template, and returns the helpfulness probability for a `(tweet, nota)` pair.\n", "\n", "- **Suggested runtime:** GPU (a T4 is enough). CPU also works, but expect roughly 5-10 s per inference.\n", "- **Available modes:** *single fold* (fast) and *5-fold ensemble* (reproduces exactly the number reported in the model card).\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Install dependencies if needed\n", "import sys, subprocess, importlib\n", "for mod, pkg in [(\"torch\", \"torch\"), (\"transformers\", \"transformers\"),\n", "                 (\"peft\", \"peft\"), (\"huggingface_hub\", \"huggingface_hub\")]:\n", "    try:\n", "        importlib.import_module(mod)\n", "    except ImportError:\n", "        subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", pkg], check=True)\n" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the base model + one fold (fast mode)" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import json, torch\n", "from transformers import AutoTokenizer, AutoModelForCausalLM\n", "from peft import PeftModel\n", "from huggingface_hub import snapshot_download\n", "\n", "REPO = \"histlearn/community-notes-reranker-ptbr\"\n", "# Download only the manifest and the fold-1 adapter\n", "path = snapshot_download(REPO, allow_patterns=[\"manifesto.json\", \"adapter_fold_1/*\"])\n", "with open(f\"{path}/manifesto.json\", encoding=\"utf-8\") as f:\n", "    m = json.load(f)  # base model name, prompt pieces, yes/no token ids, max_length\n", "\n", "tok = AutoTokenizer.from_pretrained(m[\"base_model\"], padding_side=\"left\")\n", "# float16 on GPU, float32 on CPU\n", "dtype = torch.float16 if torch.cuda.is_available() else torch.float32\n", "base = AutoModelForCausalLM.from_pretrained(m[\"base_model\"], torch_dtype=dtype)\n", "model = PeftModel.from_pretrained(base, f\"{path}/adapter_fold_1\")\n", "if torch.cuda.is_available():\n", "    model.cuda()\n", "model.eval()\n", "print(f\"Model ready on: {model.device}\")\n" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference function" ] }, { "cell_type": "code", "metadata": {}, "source": [ "def util_prob(tweet: str, nota: str) -> float:\n", "    \"\"\"Probability that the community would rate the note as helpful.\n", "    Optimal threshold measured under CV (Platt scaling) = 0.38.\"\"\"\n", "    # Prompt follows the Qwen3-Reranker input format: <Instruct> / <Query> / <Document>\n", "    text = (m[\"prompt_prefixo\"] + \"<Instruct>: \" + m[\"instrucao\"] +\n", "            \"\\n<Query>: \" + tweet + \"\\n<Document>: \" + nota + m[\"prompt_sufixo\"])\n", "    enc = tok(text, return_tensors=\"pt\", truncation=True, max_length=m[\"max_length\"]).to(model.device)\n", "    with torch.no_grad():\n", "        logits = model(**enc).logits[:, -1, :]  # next-token logits at the last position\n", "    # sigmoid(yes - no) is the softmax probability of \"yes\" over the {yes, no} pair\n", "    return float(torch.sigmoid(logits[:, m[\"id_yes\"]] - logits[:, m[\"id_no\"]]).item())\n" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Example inputs are kept in Portuguese, the language the model targets\n", "tweet = (\"Lula anunciou que o salario minimo subira para R$ 5 mil em 2026.\")\n", "nota = (\"E falso. Em 12/12/2024, o presidente Lula anunciou que o salario minimo \"\n", "        \"subiria de R$ 1.412 para R$ 1.518 a partir de janeiro de 2025, segundo a \"\n", "        \"Agencia Brasil (https://agenciabrasil.ebc.com.br). Nao ha qualquer \"\n", "        \"anuncio oficial de valor proximo a R$ 5 mil.\")\n", "\n", "p = util_prob(tweet, nota)\n", "print(f\"P(helpful) = {p:.4f}\")\n", "print(f\"Classification (threshold 0.38): {'HELPFUL' if p >= 0.38 else 'NOT HELPFUL'}\")\n" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5-fold ensemble (reproduces the model card number)\n", "\n", "For results statistically comparable to the reported ones (macro-F1 0.7920), average the probabilities of the 5 adapters."
] }, { "cell_type": "code", "metadata": {}, "source": [ "# Download all five adapters (the manifest is already loaded as `m`)\n", "path_full = snapshot_download(REPO, allow_patterns=[\"manifesto.json\", \"adapter_fold_*/*\"])\n", "\n", "# Fold 1 is already loaded on `model` under PEFT's default adapter name (\"default\").\n", "# Folds 2-5 are loaded onto the same PeftModel instead of re-wrapping `base`,\n", "# because `base` already has the fold-1 LoRA layers injected in place.\n", "fold_adapters = {1: \"default\"}\n", "for k in range(2, 6):\n", "    fold_adapters[k] = f\"fold_{k}\"\n", "    model.load_adapter(f\"{path_full}/adapter_fold_{k}\", adapter_name=fold_adapters[k])\n", "model.eval()\n", "\n", "def util_prob_ensemble(tweet: str, nota: str) -> float:\n", "    \"\"\"Mean P(helpful) over the five fold adapters.\"\"\"\n", "    # Same prompt as the single-fold function (Qwen3-Reranker input format)\n", "    text = (m[\"prompt_prefixo\"] + \"<Instruct>: \" + m[\"instrucao\"] +\n", "            \"\\n<Query>: \" + tweet + \"\\n<Document>: \" + nota + m[\"prompt_sufixo\"])\n", "    enc = tok(text, return_tensors=\"pt\", truncation=True, max_length=m[\"max_length\"]).to(model.device)\n", "    probs = []\n", "    for k in range(1, 6):\n", "        model.set_adapter(fold_adapters[k])  # activate only the adapter of fold k\n", "        with torch.no_grad():\n", "            logits = model(**enc).logits[:, -1, :]\n", "        probs.append(float(torch.sigmoid(logits[:, m[\"id_yes\"]] - logits[:, m[\"id_no\"]]).item()))\n", "    model.set_adapter(\"default\")  # restore fold 1 so the single-fold function is unaffected\n", "    return sum(probs) / len(probs)\n", "\n", "p_ens = util_prob_ensemble(tweet, nota)\n", "print(f\"P(helpful) ensemble = {p_ens:.4f}\")\n" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next steps\n", "\n", "- For full documentation, metrics, and project context, see the [model card](https://huggingface.co/histlearn/community-notes-reranker-ptbr).\n", "- To reproduce the fold-by-fold training or regenerate the artifacts, see the notebook `02_pipeline_experimento.ipynb` in the [project Space](https://huggingface.co/spaces/histlearn/communitynotesbr).\n", "- For the raw dataset, see [`histlearn/notas-comunidade-ptbr`](https://huggingface.co/datasets/histlearn/notas-comunidade-ptbr).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.11" } }, "nbformat": 4, "nbformat_minor": 5 }