{"cells":[{"cell_type":"markdown","id":"48470cbd","metadata":{"id":"48470cbd"},"source":["\n","# Final Project – Machine Learning and Deep Learning (NLP: Sentiment Analysis)\n","\n","**Professor Rodrigo here!** \n","This notebook is the teaching guide for the **Final Project**. We will build a complete **Sentiment Classification** solution using Amazon reviews (**the `amazon_polarity` dataset from Hugging Face**), covering the whole pipeline:\n","\n","1. Problem definition and dataset choice \n","2. Data collection/cleaning, preparation, and splitting \n","3. **Baseline** with *traditional Machine Learning* (TF-IDF + Logistic Regression) \n","4. *Deep Learning* model with an **LSTM (PyTorch)** \n","5. Evaluation with appropriate metrics (Accuracy, F1, Confusion Matrix) \n","6. Exporting the artifacts and **deploying** with **Gradio** (+ step-by-step instructions for publishing on **Hugging Face Spaces**) \n","\n","> **Important**: Run the notebook cell by cell and read the explanations. Wherever there are \"Experimente\" (\"Try it\") blocks, fill in your observations. 
This notebook can be submitted as part of the project **deliverables**.\n","\n","---\n","\n","## Overall Objective\n","Develop a practical **ML + DL** solution applied to an **NLP** problem (binary sentiment classification), covering everything from data preparation to deployment on a free public platform.\n","\n","## Deliverables\n","- **.ipynb** notebook with comments and results \n","- Project **README.md** (template provided) \n","- Working deployment with **Gradio** (`app.py` and `requirements.txt` files ready) \n","- Report (5–8 pages), using the README template as a basis\n","\n","---\n","\n","> **Tip for running on Google Colab**: enable the GPU (menu: Runtime → Change runtime type → **GPU**).\n"]},{"cell_type":"code","execution_count":1,"id":"f8e7be1b","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"f8e7be1b","executionInfo":{"status":"ok","timestamp":1762969993412,"user_tz":180,"elapsed":9,"user":{"displayName":"Rodrigo Moreira dos Santos","userId":"02453320943127197527"}},"outputId":"4eb46da3-fe2b-42ea-e64b-8f8baf9a54b8"},"outputs":[{"output_type":"stream","name":"stdout","text":["✅ Environment ready (adjust the installs if needed).\n"]}],"source":["\n","# @title Dependency installation (Colab)\n","# If you are on Colab, uncomment the lines below to install.\n","# In a local environment with a venv, run `pip install -r requirements.txt`.\n","\n","# !pip install -q datasets==3.0.1 scikit-learn==1.5.2 matplotlib==3.9.2 torch==2.4.1 \\\n","# pandas==2.2.2 numpy==2.1.3 gradio==5.7.1 tqdm==4.66.5\n","\n","print(\"✅ Environment ready (adjust the installs if needed).\")\n"]},{"cell_type":"code","execution_count":2,"id":"99d5bff0","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"99d5bff0","executionInfo":{"status":"ok","timestamp":1762970009339,"user_tz":180,"elapsed":14119,"user":{"displayName":"Rodrigo Moreira dos 
Santos","userId":"02453320943127197527"}},"outputId":"d92a2d96-5ff8-4353-9bc8-5fb77793976b"},"outputs":[{"output_type":"stream","name":"stdout","text":["✅ Imports OK\n"]}],"source":["\n","# @title Core imports\n","import pandas as pd\n","import numpy as np\n","import matplotlib.pyplot as plt\n","from tqdm import tqdm\n","from datasets import load_dataset\n","\n","from sklearn.model_selection import train_test_split\n","from sklearn.feature_extraction.text import TfidfVectorizer\n","from sklearn.linear_model import LogisticRegression\n","from sklearn.ensemble import RandomForestClassifier\n","from sklearn.pipeline import Pipeline\n","from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report\n","import joblib\n","import os\n","import torch\n","import torch.nn as nn\n","from torch.utils.data import Dataset, DataLoader\n","\n","SEED = 42\n","np.random.seed(SEED)\n","torch.manual_seed(SEED)\n","print(\"✅ Imports OK\")\n"]},{"cell_type":"markdown","id":"dde7d907","metadata":{"id":"dde7d907"},"source":["\n","## 1) Problem Definition\n","\n","**Task**: Classify product reviews as **positive (1)** or **negative (0)**. \n","**Dataset**: `amazon_polarity` (Hugging Face Datasets). 
\n","**Rationale**: sentiment analysis is widely used in e-commerce and decision support.\n","\n","> **Evaluation criteria**: accuracy, F1, confusion matrix; comparison between the baseline (ML) and the LSTM (DL).\n"]},{"cell_type":"code","execution_count":3,"id":"4b875e79","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":603,"referenced_widgets":["28a5100c868e4bc3b6d431aa7688ba7a","00476f0ede5f42518937ded9861f890d","ddeeff0fcbb4456c979f2201f320214c","543ad9465d6648ab97be1dbe41f156e6","55461d8ba84348c3901790a2dadcdea9","3856d00179e94af4a2d7dd424b3d113d","f30de77d68c74cf3b184bb3aaaea17c9","f2b6656db10c4338b4f818bb55e5659e","e012b659e26641aa85c951cc4876cbf2","f4e0943491f44bd485b79b0bcd72efc9","d8e6f0892f534701aac8562a8facce65","9bd673d2112243dcb341fed4aae6c3f8","20bf68fc635d432491410f013d62f8c4","765197e5c33e4115b10971a4bdd7dff6","7ba46ef5b73b4935bf0414bbefbb1ea8","d867c891e22742b190385a5f5b6cab06","491902b8c0c2489e8c6fa98ce9d0022b","9ab2aef3590e42f68fdc42fd7026e52e","6583aa1322e64d18aa7524edc924d6fa","a8d1abe3674e400d8564b1de42638167","dca974250abb407598132820b77585c9","2ba1e1dc1e4a4a9ab198212aee0378a3","6caccdc5d75e4a81871295aecfb64cf9","ab18a8aceb0f48438e5d8385dca3d764","b39831b3d74b4eb383c463f62b9ce7e2","a35cdc3d467a4a5c8d0d2c354bee0c6f","9fb9ed28d4f544b3aee454a16d9b5495","668751256fc74ac180247a5b6102acc6","b92ae0dbfa27484eae49fadbc98f0623","1f2ee0e09bf54798ab342989fa6b7f92","9730559d33f242398d26049721b12e38","12af04bdf93b4521b33ed1d6a0fba277","5b1b65f91fe047ff94fc3aa3ab36bd5e","97ffdaf1a2884498b299f4bae2525871","67c2989964814cba9c8f6244a3628238","c2523028bfa2418981949e9c98a03e8a","e8aa0273403642348b5f4683f2127b66","cd9edefbeab243c99b59aa73f2f2ffa3","d8f7ae158dfc4930b51b36e01d79b8f7","444987e578c74b01bf676339aa5a4d26","e9f158d656b84a298989af09ace6dab6","3aa401d9a36341a9abafcaf65a8b2c13","d343029e050049559a00af725f490210","9af10f88deed4579af2117027674b119","e9a01b058e2749188c99bbfd23fb736e","7c3dc9d863c0443f8bf57412843bbf8a","c316f1edf28547b3
9b63ff2b14824595","785fce8ae5bc4d0083e07b7e33f961be","846c851e844b43958115ce50e6742b27","4776d0f51cf940e7a7bd8dedd7d6827c","542be3d908d74e13a80b882489d3368c","29fc8dbc59a24987b938b1e743f1468d","528fa4fa13df48528065106df45a1aa9","3134b0148fa14515944adf614c48967e","910d8f2558fb4131945b7c9c9dc54ddd","d631b91838304564b15e8c55f4635118","906f3cf697eb4c90933abf098a3a9d6f","00f23a6ac7ef4142bb02aa2debcf418c","86069c48f2b142bfa7eaf6536db55fa0","b40739dd55a24772965c513dff8edcbb","98ce5ff4c9414b0eb7f84e4c3ee2c07e","44426a8f63224f2da6e648daafc3b212","ba79ed97383045c4bacec2949f746d97","d2da244ca94741afb2e5a667f72bf7d2","381698061555465c810119026d30096e","3845387a395a450493bad84d1da44887","c2f417686f7449baa20384f1fc23d48d","fd4d17f20ef6411f9bf039fef31e1b74","2734cb9e8f3147dbb34a71ec81dd2003","57fbb053fbae4befa715a9fa25a43969","b8b2ce55c87247fcb07a0916b08f9f7e","6377cb99d22644ac98f872739d06b06e","586bbe331f654b3a9db794f8619d0913","239f61adfdfc4cac9f724e0941da864f","ce94451266a345a080cbfc0f6d596c0d","25984323a64d42c1a47ff6d0fa11e843","aa006c2fd45d40a7a0ee465d2763bfaa","e36090c489754453b720f76d267dfa7d","db24acaf110a4d6ab4e1f44d6ea6a2f4","2000cb587d754f5e98e48425e5e4f672","96a1e57064a14eaeabc5907eeaaac6ad","dc8a27479cc94b64af0dece399644c1f","ae54fa359bcf423a96be36acecb5014b","0d54c1f30a4c4159a791c9b2ed56bec9","2ae7955968494ea2ba72c3a2d6090cf9","d8ec7771d4314de194e2217c5a93fe9c","a6c3a66c4b304673a5744bca05fcd100","20e9697419c04643886a325cea3821a9"]},"id":"4b875e79","executionInfo":{"status":"ok","timestamp":1762970172126,"user_tz":180,"elapsed":159871,"user":{"displayName":"Rodrigo Moreira dos Santos","userId":"02453320943127197527"}},"outputId":"a4aa81a7-efc5-4aa8-f4d9-2d821e079f99"},"outputs":[{"output_type":"stream","name":"stderr","text":["/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n","The secret `HF_TOKEN` does not exist in your Colab secrets.\n","To authenticate with the Hugging Face Hub, create a token in your settings tab 
(https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n","You will be able to reuse this secret in all of your notebooks.\n","Please note that authentication is recommended but still optional to access public models or datasets.\n"," warnings.warn(\n"]},{"output_type":"display_data","data":{"text/plain":["README.md: 0.00B [00:00, ?B/s]"],"application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"28a5100c868e4bc3b6d431aa7688ba7a"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":["amazon_polarity/train-00000-of-00004.par(…): 0%| | 0.00/260M [00:00, ?B/s]"],"application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"9bd673d2112243dcb341fed4aae6c3f8"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":["amazon_polarity/train-00001-of-00004.par(…): 0%| | 0.00/258M [00:00, ?B/s]"],"application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"6caccdc5d75e4a81871295aecfb64cf9"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":["amazon_polarity/train-00002-of-00004.par(…): 0%| | 0.00/255M [00:00, ?B/s]"],"application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"97ffdaf1a2884498b299f4bae2525871"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":["amazon_polarity/train-00003-of-00004.par(…): 0%| | 0.00/254M [00:00, ?B/s]"],"application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"e9a01b058e2749188c99bbfd23fb736e"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":["amazon_polarity/test-00000-of-00001.parq(…): 0%| | 0.00/117M [00:00, ?B/s]"],"application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"d631b91838304564b15e8c55f4635118"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":["Generating train split: 0%| | 0/3600000 
[00:00, ? examples/s]"],"application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"c2f417686f7449baa20384f1fc23d48d"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":["Generating test split: 0%| | 0/400000 [00:00, ? examples/s]"],"application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"e36090c489754453b720f76d267dfa7d"}},"metadata":{}},{"output_type":"stream","name":"stdout","text":["Tamanhos: 9600 2400 6000\n"]},{"output_type":"execute_result","data":{"text/plain":[" text label\n","0 This product consists of a piece of thin flexi... 0\n","1 Even on the lowest setting, the toast is too d... 0\n","2 I enjoyed this disc. The video is stunning. I ... 1\n","3 The authors pretend that parents neither die n... 0\n","4 Might as well just use a knife, this product h... 0"],"text/html":["\n","
| | text | label |\n","
|---|---|---|\n","
| 0 | This product consists of a piece of thin flexi... | 0 |\n","
| 1 | Even on the lowest setting, the toast is too d... | 0 |\n","
| 2 | I enjoyed this disc. The video is stunning. I ... | 1 |\n","
| 3 | The authors pretend that parents neither die n... | 0 |\n","
| 4 | Might as well just use a knife, this product h... | 0 |\n","