{ "cells": [ { "cell_type": "markdown", "id": "48470cbd", "metadata": {}, "source": [ "\n", "# Final Project – Machine Learning and Deep Learning (NLP: Sentiment Analysis)\n", "\n", "**Professor Rodrigo here!**  \n", "This notebook is the teaching guide for the **Final Project**. We will build a complete **Sentiment Classification** solution using Amazon reviews (the **`amazon_polarity` dataset from Hugging Face**), covering the whole pipeline:\n", "\n", "1. Problem definition and dataset choice\n", "2. Data collection/cleaning, preparation, and splitting\n", "3. **Baseline** with *traditional Machine Learning* (TF-IDF + Logistic Regression)\n", "4. *Deep Learning* model with an **LSTM (PyTorch)**\n", "5. Evaluation with appropriate metrics (accuracy, F1, confusion matrix)\n", "6. Export of the artifacts and **deployment** with **Gradio** (plus the steps for publishing on **Hugging Face Spaces**)\n", "\n", "> **Important**: Run the notebook cell by cell and read the explanations. Wherever you find a \"Try it\" block, fill in your own observations. 
This notebook can be submitted as one of the project **deliverables**.\n", "\n", "---\n", "\n", "## Overall Goal\n", "Develop a practical **ML + DL** solution for an **NLP** problem (binary sentiment classification), covering every step from data preparation to deployment on a free public platform.\n", "\n", "## Deliverables\n", "- **.ipynb** notebook with comments and results\n", "- Project **README.md** (template provided)\n", "- Working deployment with **Gradio** (`app.py` and `requirements.txt` files ready)\n", "- Report (5–8 pages), using the README template as a starting point\n", "\n", "---\n", "\n", "> **Tip for running on Google Colab**: enable the GPU (menu: Runtime → Change runtime type → **GPU**).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f8e7be1b", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Dependency installation (Colab)\n", "# On Colab, uncomment the lines below to install the dependencies.\n", "# In a local environment with a venv, run `pip install -r requirements.txt`.\n", "\n", "# !pip install -q datasets==3.0.1 scikit-learn==1.5.2 matplotlib==3.9.2 torch==2.4.1 \\\n", "#     pandas==2.2.2 numpy==2.1.3 gradio==5.7.1 tqdm==4.66.5\n", "\n", "print(\"✅ Environment ready (adjust the installs if needed).\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "99d5bff0", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Core imports\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from tqdm import tqdm\n", "from datasets import load_dataset\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report\n", "import 
joblib\n", "import os\n", "import torch\n", "import torch.nn as nn\n", "from torch.utils.data import Dataset, DataLoader\n", "\n", "SEED = 42\n", "np.random.seed(SEED)\n", "torch.manual_seed(SEED)\n", "print(\"✅ Imports OK\")\n" ] }, { "cell_type": "markdown", "id": "dde7d907", "metadata": {}, "source": [ "\n", "## 1) Problem Definition\n", "\n", "**Task**: Classify product reviews as **positive (1)** or **negative (0)**.  \n", "**Dataset**: `amazon_polarity` (Hugging Face Datasets).  \n", "**Rationale**: sentiment analysis is widely used in e-commerce and decision support.\n", "\n", "> **Evaluation criteria**: accuracy, F1, confusion matrix; comparison between the baseline (ML) and the LSTM (DL).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4b875e79", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Data collection and preparation (sampled for fast execution)\n", "# Load the 'train' and 'test' splits of the amazon_polarity dataset\n", "ds_train = load_dataset(\"amazon_polarity\", split=\"train\")\n", "ds_test = load_dataset(\"amazon_polarity\", split=\"test\")\n", "\n", "# Convert to DataFrames\n", "df_train = pd.DataFrame({\"text\": ds_train[\"content\"], \"label\": ds_train[\"label\"]})\n", "df_test = pd.DataFrame({\"text\": ds_test[\"content\"], \"label\": ds_test[\"label\"]})\n", "\n", "# The dataset uses labels {0, 1} (0 = negative, 1 = positive); the identity map below leaves them unchanged\n", "label_map = {0:0, 1:1}  # keep 0/1, which the sklearn metrics handle directly\n", "df_train[\"label\"] = df_train[\"label\"].map(label_map)\n", "df_test[\"label\"] = df_test[\"label\"].map(label_map)\n", "\n", "# Subsample to speed things up (adjust to your GPU/time budget):\n", "N_TRAIN = 12000  # try 50k+ with a good GPU\n", "N_TEST = 6000\n", "df_train = df_train.sample(n=N_TRAIN, random_state=SEED).reset_index(drop=True)\n", "df_test = df_test.sample(n=N_TEST, random_state=SEED).reset_index(drop=True)\n", "\n", "# Split 
train/val\n", "train_text, val_text, train_y, val_y = train_test_split(\n", "    df_train[\"text\"].values, df_train[\"label\"].values, test_size=0.2, random_state=SEED, stratify=df_train[\"label\"].values\n", ")\n", "\n", "print(\"Sizes: \", len(train_text), len(val_text), len(df_test))\n", "df_train.head()\n" ] }, { "cell_type": "markdown", "id": "ed2e0c79", "metadata": {}, "source": [ "\n", "## 2) Baseline with Traditional Machine Learning\n", "\n", "We start with a simple pipeline: **TF-IDF** for vectorization + **Logistic Regression**.  \n", "We then compare it with a **Random Forest** to see how the results differ.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d5d7ba98", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Training and evaluation: TF-IDF + Logistic Regression\n", "baseline_pipe = Pipeline([\n", "    (\"tfidf\", TfidfVectorizer(max_features=40000, ngram_range=(1,2))),\n", "    (\"clf\", LogisticRegression(max_iter=1000, n_jobs=None))\n", "])\n", "\n", "baseline_pipe.fit(train_text, train_y)\n", "\n", "val_pred = baseline_pipe.predict(val_text)\n", "test_pred = baseline_pipe.predict(df_test[\"text\"].values)\n", "\n", "print(\"Val Accuracy:\", accuracy_score(val_y, val_pred))\n", "print(\"Val F1:\", f1_score(val_y, val_pred, average=\"weighted\"))\n", "print(\"\\nTest Accuracy:\", accuracy_score(df_test[\"label\"].values, test_pred))\n", "print(\"Test F1:\", f1_score(df_test[\"label\"].values, test_pred, average=\"weighted\"))\n", "\n", "# Confusion matrix (test)\n", "cm = confusion_matrix(df_test[\"label\"].values, test_pred)\n", "plt.figure()\n", "plt.imshow(cm, cmap='Blues')\n", "plt.title(\"Confusion Matrix - Baseline (Test)\")\n", "plt.xlabel(\"Predicted\")\n", "plt.ylabel(\"True\")\n", "# Write each count in the middle of its cell\n", "for i in range(cm.shape[0]):\n", "    for j in range(cm.shape[1]):\n", "        plt.text(j, i, cm[i, j], ha=\"center\", va=\"center\")\n", "plt.show()\n", "\n", "print(\"\\nClassification Report (Test):\\n\")\n", 
"print(classification_report(df_test[\"label\"].values, test_pred))\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fbdd4c7a", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Comparativo rápido: TF-IDF + RandomForest\n", "rf_pipe = Pipeline([\n", " (\"tfidf\", TfidfVectorizer(max_features=30000, ngram_range=(1,1))),\n", " (\"rf\", RandomForestClassifier(n_estimators=200, random_state=SEED, n_jobs=-1))\n", "])\n", "\n", "rf_pipe.fit(train_text, train_y)\n", "rf_val = rf_pipe.predict(val_text)\n", "rf_test = rf_pipe.predict(df_test[\"text\"].values)\n", "\n", "print(\"RF Val Acc:\", accuracy_score(val_y, rf_val), \" | Val F1:\", f1_score(val_y, rf_val, average=\"weighted\"))\n", "print(\"RF Test Acc:\", accuracy_score(df_test[\"label\"].values, rf_test), \" | Test F1:\", f1_score(df_test[\"label\"].values, rf_test, average=\"weighted\"))\n" ] }, { "cell_type": "markdown", "id": "02952330", "metadata": {}, "source": [ "\n", "> **Experimente:** \n", "> - Aumente/diminua `max_features` do TF-IDF. \n", "> - Troque Regressão Logística por SVM (`LinearSVC`). \n", "> - Compare overfitting entre ML tradicional e DL. \n", ">\n", "> **Suas observações:** *(escreva abaixo)*\n" ] }, { "cell_type": "markdown", "id": "22ba8a44", "metadata": {}, "source": [ "\n", "## 3) Deep Learning com LSTM (PyTorch)\n", "\n", "Vamos construir um pipeline enxuto com **tokenização simples**, **vocab** baseado no treino e uma **LSTM** para classificação binária. \n", "> Para resultados de SOTA, considere **transformers** (BERT, DistilBERT). 
Here we focus on the fundamentals.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3b9994fb", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Simple tokenization + Dataset/DataLoader\n", "import re\n", "from collections import Counter\n", "\n", "def basic_tokenize(text):\n", "    # lowercase, replace anything other than letters, digits, apostrophes, and spaces with a space, then split on whitespace\n", "    text = text.lower()\n", "    text = re.sub(r\"[^a-z0-9' ]+\", \" \", text)\n", "    return text.split()\n", "\n", "# build the vocabulary from the training set\n", "MAX_VOCAB = 30000\n", "counter = Counter()\n", "for t in train_text:\n", "    counter.update(basic_tokenize(t))\n", "most_common = counter.most_common(MAX_VOCAB - 2)  # reserve two slots for PAD/UNK\n", "itos = [\"<pad>\", \"<unk>\"] + [w for w,_ in most_common]\n", "stoi = {w:i for i,w in enumerate(itos)}\n", "\n", "def encode(tokens, max_len=80):\n", "    ids = [stoi.get(tok, 1) for tok in tokens]  # 1 = <unk>\n", "    if len(ids) < max_len:\n", "        ids = ids + [0] * (max_len - len(ids))  # 0 = <pad>\n", "    else:\n", "        ids = ids[:max_len]\n", "    return np.array(ids, dtype=np.int64)\n", "\n", "MAX_LEN = 80\n", "\n", "class SentimentDataset(Dataset):\n", "    def __init__(self, texts, labels):\n", "        self.texts = texts\n", "        self.labels = labels\n", "    def __len__(self):\n", "        return len(self.texts)\n", "    def __getitem__(self, idx):\n", "        x = encode(basic_tokenize(self.texts[idx]), MAX_LEN)\n", "        y = int(self.labels[idx])\n", "        return torch.tensor(x), torch.tensor(y)\n", "\n", "train_ds = SentimentDataset(train_text, train_y)\n", "val_ds = SentimentDataset(val_text, val_y)\n", "test_ds = SentimentDataset(df_test[\"text\"].values, df_test[\"label\"].values)\n", "\n", "BATCH_SIZE = 128\n", "train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)\n", "val_dl = DataLoader(val_ds, batch_size=BATCH_SIZE)\n", "test_dl = DataLoader(test_ds, batch_size=BATCH_SIZE)\n", "\n", "len(itos), MAX_LEN, BATCH_SIZE\n" ] }, { "cell_type": "code", "execution_count": null, "id": 
"71d27538", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Modelo LSTM\n", "class LSTMClassifier(nn.Module):\n", " def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=2, num_layers=1, dropout=0.2):\n", " super().__init__()\n", " self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)\n", " self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True, dropout=dropout if num_layers>1 else 0.0)\n", " self.dropout = nn.Dropout(dropout)\n", " self.fc = nn.Linear(hidden_dim, num_classes)\n", " def forward(self, x):\n", " emb = self.embedding(x)\n", " out, _ = self.lstm(emb)\n", " h = out[:, -1, :]\n", " h = self.dropout(h)\n", " return self.fc(h)\n", "\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "model = LSTMClassifier(vocab_size=len(itos)).to(device)\n", "\n", "criterion = nn.CrossEntropyLoss()\n", "optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)\n", "\n", "EPOCHS = 4 # aumente se tiver tempo/GPU\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c639c797", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Treino simples + validação\n", "def evaluate(model, loader):\n", " model.eval()\n", " ys, ps = [], []\n", " with torch.no_grad():\n", " for xb, yb in loader:\n", " xb, yb = xb.to(device), yb.to(device)\n", " logits = model(xb)\n", " pred = torch.argmax(logits, dim=1)\n", " ys.append(yb.cpu().numpy())\n", " ps.append(pred.cpu().numpy())\n", " ys = np.concatenate(ys)\n", " ps = np.concatenate(ps)\n", " return accuracy_score(ys, ps), f1_score(ys, ps, average=\"weighted\")\n", "\n", "best_val = 0.0\n", "for epoch in range(1, EPOCHS+1):\n", " model.train()\n", " total_loss = 0.0\n", " for xb, yb in tqdm(train_dl, desc=f\"Epoch {epoch}/{EPOCHS}\"):\n", " xb, yb = xb.to(device), yb.to(device)\n", " optimizer.zero_grad()\n", " logits = model(xb)\n", " loss = criterion(logits, yb)\n", " loss.backward()\n", " optimizer.step()\n", " total_loss 
+= loss.item()\n", "    val_acc, val_f1 = evaluate(model, val_dl)\n", "    print(f\"Epoch {epoch} | Loss: {total_loss/len(train_dl):.4f} | Val Acc: {val_acc:.4f} | Val F1: {val_f1:.4f}\")\n", "    # Keep only the checkpoint with the best validation accuracy so far\n", "    if val_acc > best_val:\n", "        best_val = val_acc\n", "        torch.save({\n", "            \"model_state\": model.state_dict(),\n", "            \"vocab\": itos,\n", "            \"max_len\": MAX_LEN\n", "        }, \"lstm_sentiment_best.pt\")\n", "        print(\"✅ LSTM model saved: lstm_sentiment_best.pt\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "b44eb2e8", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Evaluation on the test set\n", "# Load the best checkpoint (if there is one)\n", "if os.path.exists(\"lstm_sentiment_best.pt\"):\n", "    ckpt = torch.load(\"lstm_sentiment_best.pt\", map_location=device)\n", "    model.load_state_dict(ckpt[\"model_state\"])\n", "\n", "test_acc, test_f1 = evaluate(model, test_dl)\n", "print(\"LSTM Test Accuracy:\", test_acc)\n", "print(\"LSTM Test F1:\", test_f1)\n" ] }, { "cell_type": "markdown", "id": "7b866b6f", "metadata": {}, "source": [ "\n", "## 4) Exporting Artifacts\n", "\n", "We will save:\n", "- TF-IDF + Logistic Regression pipeline (`baseline_pipe.pkl`)\n", "- LSTM model (`lstm_sentiment_best.pt`) + vocabulary embedded in the checkpoint\n", "\n", "These files will be used for the **deployment** (Gradio + Hugging Face Spaces).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ccf5e781", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Save the baseline pipeline\n", "joblib.dump(baseline_pipe, \"baseline_pipe.pkl\")\n", "print(\"✅ Baseline pipeline saved as baseline_pipe.pkl\")\n", "\n", "# The LSTM was already saved as lstm_sentiment_best.pt during training (best epoch).\n", "print(\"✅ Check that lstm_sentiment_best.pt was generated in the previous step.\")\n" ] }, { "cell_type": "markdown", "id": "f5d63f93", "metadata": {}, "source": [ "\n", "## 5) Demo with Gradio (local)\n", "\n", "Below is a minimal interface built with **Gradio**. 
To publish on **Hugging Face Spaces**, we will use the `app.py` file (already prepared and saved alongside this notebook).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7efbc3cc", "metadata": {}, "outputs": [], "source": [ "\n", "# @title Local demo (optional)\n", "# To run it in the notebook, uncomment:\n", "# import gradio as gr\n", "# import torch\n", "# import joblib\n", "\n", "# # Load the baseline (lighter for a demo)\n", "# baseline = joblib.load(\"baseline_pipe.pkl\")\n", "\n", "# def predict_sentiment(text):\n", "#     proba = baseline.predict_proba([text])[0]\n", "#     pred = int(np.argmax(proba))\n", "#     label = \"positive\" if pred == 1 else \"negative\"\n", "#     conf = float(np.max(proba))\n", "#     return {\"prediction\": label, \"confidence\": conf}\n", "\n", "# demo = gr.Interface(fn=predict_sentiment,\n", "#                     inputs=gr.Textbox(label=\"Enter a review\"),\n", "#                     outputs=gr.JSON(label=\"Result\"),\n", "#                     title=\"Sentiment Analysis (Baseline)\")\n", "# demo.launch()\n", "print(\"ℹ️ Use app.py to deploy on Hugging Face Spaces.\")\n" ] }, { "cell_type": "markdown", "id": "c8454fcd", "metadata": {}, "source": [ "\n", "## 6) Conclusions & Next Steps\n", "\n", "- We compared **traditional ML** (TF-IDF + Logistic Regression/Random Forest) with a simple **LSTM**.\n", "- For better results, consider **transformers** (e.g., `distilbert-base-uncased` with the `transformers` library).\n", "- Tune the hyperparameters (learning rate, batch size, epochs, `max_features`, `max_len`). 
\n", "- Document in the **Report**: choices, results, limitations, and next steps.\n", "\n", "> **Deployment checklist**  \n", "> - `baseline_pipe.pkl` and/or `lstm_sentiment_best.pt` generated\n", "> - `app.py` ready (provided)\n", "> - `requirements.txt` ready (provided)\n", "> - Create the **Space** on Hugging Face (Gradio/Python template) and upload the files\n", "> - Fill in the `README.md` with screenshots and explanations\n" ] }, { "cell_type": "markdown", "id": "17df5370", "metadata": {}, "source": [ "\n", "---\n", "\n", "### 🧪 Try it (fill in your notes below)\n", "\n", "1. **TF-IDF**: Change `ngram_range` and `max_features`, then compare *accuracy* and *F1* on the **val** and **test** sets.\n", "2. **Classifier**: Switch to `LinearSVC` and compare it with Logistic Regression.\n", "3. **LSTM**: Increase `EPOCHS` and `embed_dim` (128→256) and note what changes.\n", "4. **Cleaning**: Remove *stopwords* in the TF-IDF step and compare.\n", "5. **Sample size**: Compare run times and metrics using `N_TRAIN`=12k vs. 50k+.\n", "\n", "**Group observations:**\n", "\n", "- \n", "- \n", "- \n" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }