{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "GTsdkk_6TGX5" }, "source": [ "# **STEP 1: Environment Setup & Ecosystem Basics**" ] }, { "cell_type": "markdown", "metadata": { "id": "v4IaW9ezfXVz" }, "source": [ "# Research/Learn:\n", "Hugging Face Hub: A central repository for hosting models and datasets, functioning similarly to GitHub for Machine Learning.\n", "\n", "Core Libraries:\n", "\n", "* transformers: The primary library for downloading and using pre-trained models.\n", "\n", "* tokenizers: A library that converts raw text into tokens and numerical IDs that models can process.\n", "\n", "* datasets: A library for accessing and managing training data for various AI models.\n", "\n", "* accelerate: A tool to simplify and speed up training across different hardware configurations.\n", "\n", "* peft: Implements parameter-efficient fine-tuning methods such as LoRA (Low-Rank Adaptation) for fast, memory-efficient model fine-tuning.\n", "\n", "* trl: A library for Transformer Reinforcement Learning.\n", "\n", "* bitsandbytes: A library for model quantization (compressing models to 8-bit or 4-bit) so they run smoothly on limited GPU memory.\n", "\n", "Google Colab: A cloud-based platform for running Jupyter Notebooks with GPU support.\n", "\n", "Authentication: Using an Access Token to securely authenticate with Hugging Face without manual password entry." ] }, { "cell_type": "markdown", "metadata": { "id": "dWMd1cYLiQdC" }, "source": [ "Because of a billing issue with my Google Cloud Platform account, I used Google Colab to create the notebook for Step 1.\n", "\n", "# Actions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 351, "referenced_widgets": [ "1825461cf5984f88968adc2ddb4158cd", "8f6eb311d3d14944835e672f855b63e0", "05cdbafac0ad4ac2b8fd97f66141103c", "c8310616cadf47d19aea172de8de4714", "178b79b5f7ed4e699add6bc1ad782756", "9239fa3064d04fad8cd66ffdc49d6e5a", "508915d745d14275a7571b763f6bb835", 
"f5a50e5aa5c1461f8c137243440967c2", "281b5d87ff1d41249181e1370e587604", "ca752b24fb6f4794b3fd3cd374b33957", "e87badb0c9584aeabd3c6e9a656ec461", "5f9fecdc86a84e8c8a083464999962b4", "18d59ae1deea4089bbeaf2cd5c57c782", "2a24470e4f4647a282f47e440006c117", "4595a14ecd1342a8b3c9f4a9b53dd16f", "3d370900ccd2420689bfa30221cfa42f", "d1cb609ab5c34bc5a6cbf470026e4d76", "6ead5248a7ca48c8b4e9c6ec83f757ad", "4c17de0d52e24301aec299fd4fc56957", "610e9393f7d7428e918eaaa50e5d7dc9", "e043c6f755ad48e181d117bd2dcffc34", "3aa6db65edde4e2991ba8695399c9555", "3c2dc7576a7e4582b1cbc6560f35b36d", "c0a9d780dc6f4d9180d25e61063a9dab", "fb22137495954765910dc9f2e6c4fb36", "b26b2b995e19478583037a38a8c09f87", "12113eb2390f4c1ebe1e11fa2e73a34d", "a5a3425d9c8a43329a24301b30225c48", "65fad1ce52ea41e0b9b94f0133bac0db", "57c0623345794296b3147c57f52f46f5", "a437c1c3bf744bc1abf1087c250a7b04", "59d762a9bc174186a16f6893ba9a751e", "6fa37b8285f1493d84da9ea330e84c9b", "daf6420690884054ae8b7c45103b5e38", "bfa26765686a40aaade30fa12285b4c0", "872c3fffc82f419d9c83fc799d71b5ed", "225ae3d75cfb4b569ad668a76a3a1662", "aebb3029a6754d188789d681acb6d522", "233f7f2349e34e2e815d0c7bdf4f9ad8", "d5515b944d8f4b3a8f9060c2a5b4d59c", "2ebd5179540243368553f75f802d587d", "f89f5bdb93214a068b622048725563e1", "47114f04a83d4fdea9d55a6429643088", "b862af99b981492686498441260c387c", "dbebc5f1583741d397b2d324098b48eb", "c6cf9a62e89e4097b8bb39ad28ceaee2", "27acfcf13ed44764bf021d8828ff726e", "39c51f5853794ab09384a471ccb10d6b", "697c593484f94a438cb71716f6337c4e", "07d590579cdc4624a9bd1a4041e19b10", "2bbd685cb10644f58850312fb3df4a99", "e35cb4490be84c08a05be84134b8f989", "4f478f2df8e94871b1d14d41ce736907", "3e2e32ef5c4440cb8ed22ff0d87f8b32", "eb7515902fdd46b0ae0a40af09848cb9", "0f5be051488242a3abc37dcc0f1989e3", "9b4afc0092cd4185b1859c203434edb6", "9ffd8d335bf649eea8b41c3d4a89c171", "6c30c834970d4aa084386a9a8917836f", "3aa9e2db7bfc45608412f396912732a9", "000ed95538c44a78b95abcf2f0991adc", "a9211c553a8c447d883bfc8269f22474", 
"9f86818efc934b1e89d443edac4d5d69", "30a33389ad674e58a8f6af95e36d381f", "29d7b5b9ab8b465bb4b0f76f08f297b9", "3f53b4612f3443dd8fa6f6f68f42802c" ] }, "id": "Y5GaVMmXPfiz", "outputId": "0ff6aed8-a3df-47a5-f415-26af1450985e" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1825461cf5984f88968adc2ddb4158cd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing Files (0 / 0) : | | 0.00B / 0.00B " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5f9fecdc86a84e8c8a083464999962b4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "New Data Upload : | | 0.00B / 0.00B " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3c2dc7576a7e4582b1cbc6560f35b36d", "version_major": 2, "version_minor": 0 }, "text/plain": [ " /content/vectorizer.joblib : 100%|##########| 182kB / 182kB " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "daf6420690884054ae8b7c45103b5e38", "version_major": 2, "version_minor": 0 }, "text/plain": [ " /content/model.joblib : 100%|##########| 161kB / 161kB " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dbebc5f1583741d397b2d324098b48eb", "version_major": 2, "version_minor": 0 }, "text/plain": [ " ...ata/mnist_train_small.csv: 100%|##########| 36.5MB / 36.5MB " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0f5be051488242a3abc37dcc0f1989e3", "version_major": 2, "version_minor": 0 }, "text/plain": [ " ...ample_data/mnist_test.csv: 100%|##########| 18.3MB / 18.3MB " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "No files have been modified since last commit. 
Skipping to prevent empty commit.\n", "WARNING:huggingface_hub.hf_api:No files have been modified since last commit. Skipping to prevent empty commit.\n" ] }, { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "CommitInfo(commit_url='https://huggingface.co/duyb2207513/step1/commit/cdc7bd40797e5c858ad4a17a2dd39e830a6ca920', commit_message='Upload folder using huggingface_hub', commit_description='', oid='cdc7bd40797e5c858ad4a17a2dd39e830a6ca920', pr_url=None, repo_url=RepoUrl('https://huggingface.co/duyb2207513/step1', endpoint='https://huggingface.co', repo_type='model', repo_id='duyb2207513/step1'), pr_revision=None, pr_num=None)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from huggingface_hub import login, upload_folder\n", "\n", "# Prompt for a Hugging Face access token\n", "login()\n", "\n", "# Upload the current working directory to the model repository on the Hub\n", "upload_folder(folder_path=\".\", repo_id=\"duyb2207513/step1\", repo_type=\"model\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Jg1ZWKZ7idj3" }, "source": [ "Install the main libraries:\n", "Colab already provides all of these libraries, so I did not install them again." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_bm8gRnbTVxS" }, "outputs": [], "source": [ "\n", "# !pip install transformers datasets tokenizers accelerate peft trl bitsandbytes huggingface_hub\n" ] }, { "cell_type": "markdown", "metadata": { "id": "qpQ2K3KgilDM" }, "source": [ " Successful login to HF and running basic inference from a Hub model\n", " # Milestone 1:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 264, "referenced_widgets": [ "7a0866e7251246d480d4e6e733767c3a", "f6b6e540c6fc4c3ab6d3afa695a51a25", "3fd0e8da643b4e039dfb5cc3a5f1af45", "ae20324e380a4080963098f7b25d30f3", "bbf28a0085e64878b0fa82ba823c7a4e", "2361788c09944f1482a433f5ef5c36f2", "3e9575f498b648348c2915e71231d709", "61853f28403a4c7284d6d28066396a83", "87a810994e714e7382c256be3720c050", "3f02eedba21f406baa053fd50dd485d7", "26da9a791e1d4925b04f1a3f3fc63ea5", "889eec3219ef4e9daf030ff020b079c6", "170becbc4e744598b46113f3d2fc677f", "a80e2770550e4b6e9353e3c957db081b", "4f1b4b64eaf74b57b5eb9a147283b7b7", "55e48afcb7834cfcb9325be3f8375899", "306afee92d4845b8ae9bc28d2a4b2554", "5f837db172734c4e8b83cfa01659fbf0", "1f07a773087e411f82a50fca124277d5", "a0dc371a46f246fc9ea31f6eed5ee4ea", "f333cd6df4084d939046de8617a94f07", "3eb0608336304f89b14dfd177d98169b", "e5165f34b15d45039695b9b70bdc79e9", "fbc10ec192ed4450b9aa1bc4b561f05c", "75e27c38c7c7482c951c8d8a21d90221", "34d7a85b87ba4e14846171798dd1098e", "ca5bb4d34d9d428db8d94ef809e537d9", "b65f7765a02a4c53ba0704897c9a4f73", "9293e0b20fb843d089c839f80c7fba4e", "11f7ed6ef7ee4f3f9f8876996892898a", "d5700d2321334c4f8a715555149d6ca5", "6dfaf251655548778878eeb92f7fef31", "2354d4c077274216bc9faa3b719b5792", "7bb42a9cd0874fcca46cb1e0e50f2cfe", "096d251fb7ee4e64bbab91823f3afaaa", "3c3829e16fcc4c02afd4ffe473aad012", "2a0a608f67c04f0283f0030ac7e0c0df", "f53f43a692f14d1db3ca51f6457bef78", "51bcee9185e348508594b48dfa4f5dd5", "4f6b7d8d5e44402e8bbfa8d7b48dd854", 
"9fe1de79037047f5af71261438dec7c6", "cb33ba811ad34e4b889b679656c46d4d", "93cf9d6237d64bec86038174d29d9ed7", "458b597b9b2b47a8bba2cba9212bd8ce", "a613ee0f6f904b3cbe1e126722af500f", "211e7e7b6bec4b4eb1a027a50820aa54", "66c5d3fca0104eb1b365cb4bcd32bb26", "fbe4b98399e04f4d90113e637fbf8f42", "2140afac3376401caa434c0a657d6f2c", "219c543a5abd4eada46819d7d2df3065", "f0015655c32743f88e4543145a3d524f", "0c04961a010b4e72a29a162b530f3e76", "07675b37d1974e8ea603aa6a84794940", "1d2d6ce0066a4124b1971327193a8668", "3d3a000f621041bb914f0c0daddbcf0a" ] }, "id": "6IZQSvAfYM4H", "outputId": "a7169e6a-bf0f-4582-c83a-6dc437fd98cf" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7a0866e7251246d480d4e6e733767c3a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "config.json: 0%| | 0.00/629 [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "889eec3219ef4e9daf030ff020b079c6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0%| | 0.00/48.0 [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e5165f34b15d45039695b9b70bdc79e9", "version_major": 2, "version_minor": 0 }, "text/plain": [ "vocab.txt: 0.00B [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7bb42a9cd0874fcca46cb1e0e50f2cfe", "version_major": 2, "version_minor": 0 }, "text/plain": [ "model.safetensors: 0%| | 0.00/268M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a613ee0f6f904b3cbe1e126722af500f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading weights: 0%| | 0/104 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", 
"text": [ "1. Raw scores (Logits): [[-4.351162910461426, 4.709738731384277]]\n", "2. Probabilities (%): [[0.0001161047475761734, 0.9998838901519775]]\n", "\n", "Input sentence: 'This movie is absolutely amazing! I loved it.'\n", "AI conclusion: POSITIVE\n" ] } ], "source": [ "import torch\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n", "\n", "# 1. Define the model name (A model already fine-tuned for sentiment analysis)\n", "model_name = \"distilbert-base-uncased-finetuned-sst-2-english\"\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "model = AutoModelForSequenceClassification.from_pretrained(model_name)\n", "\n", "# 2. Provide a sample sentence (e.g., a compliment)\n", "text = \"This movie is absolutely amazing! I loved it.\"\n", "\n", "# Tokenize the sentence and return PyTorch tensors\n", "inputs = tokenizer(text, return_tensors=\"pt\")\n", "\n", "# 3. Feed the input into the model\n", "with torch.no_grad():\n", " outputs = model(**inputs)\n", "\n", "# 4. DECODE THE OUTPUT\n", "# Extract the raw mathematical scores calculated by the model\n", "logits = outputs.logits\n", "print(\"1. Raw scores (Logits):\", logits.tolist())\n", "\n", "# Convert raw scores into probabilities (percentages) using the Softmax function\n", "probabilities = torch.nn.functional.softmax(logits, dim=-1)\n", "print(\"2. 
Probabilities (%):\", probabilities.tolist())\n", "# Expected output format: [[0.0001, 0.9999]] (meaning 0.01% Negative, 99.99% Positive)\n", "\n", "# Select the index (ID) with the highest probability score using argmax\n", "predicted_class_id = torch.argmax(probabilities).item()\n", "\n", "# Map the predicted ID back to a human-readable string label using the model's configuration\n", "predicted_label = model.config.id2label[predicted_class_id]\n", "\n", "# Print the final result\n", "print(f\"\\nInput sentence: '{text}'\")\n", "print(f\"AI conclusion: {predicted_label}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "ALHRr2Tni0uA" }, "source": [ " # Milestone 2:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 585, "referenced_widgets": [ "f1a4214f73cf42bdbc1f24118330ad0f", "733f5e9134c2474190b968900fc34087", "3db7ce3e034d46b888971bf597037706", "b22955f5ea17481ca093d5f12ce1c030", "47917d82449d4d558f98cf43b61bccab", "813cc8a177c24063a3b858f5bed36587", "dfbc245d24f44c9da1c41d05190e49e5", "90ddce97ea794eca9443639cb59a85b9", "d9b739d5586b45f983900de7a5aacb95", "ef79b0244fb74166a42b041178b10f06", "67718a9ab43f42b29b657b2de9085dc3", "426e7dd108b2448eb608d9660187148a", "7653881911b2480e87dabf3c441d0414", "16fc5b64e9344e8cb8219cdb0a28ac8b", "f6749505aaf24aa7945277f503ad2117", "9f43f29f67704536ac514cb79aea46de", "cc6183bb3cdf4c37ba6d2252b21882b8", "5acafee9fcb846a4bb617a1657efbb94", "381077ef6a004804ac066e3dc4bed39c", "2821a6ee2ffa43f287306acec7721c2c", "3204c7e3cb4046af9d9d5f927692bb0b", "31a96a8a5c4d42f192b5941a7f7ae578", "9d0b1916a1564835ae61f8b76ca8a835", "6f04809d075d4c94aa3e6a2dd6f6bf90", "596547ac55814517b740ad69fbc75214", "3ae00d38b527493e9fabd73755d3184e", "7c56bc46b6e54442a5915f636f77b9e6", "e17d8c19e01a47aebd83675c4e27fee7", "5383656a1ed54ec9a5d258c408949977", "19a2c4b90b66450b83ae35a5ff0eecfe", "0c9e6710159a45d2993d27c5bb3ba31f", "8c986b5d61f64b5399dc2017cc4b0ec8", 
"a6ee16d3e8414191b692af16fc6e41a3", "23134367becb4d2bb74c8e362e0c0c70", "d90a8b65bf6940e0896c90e9a768b274", "e9bdb46f9309452699e6af17c6d08b91", "664fcb5f9fcf403087e1231d0301e78c", "29f94c2507ba4e008822d03fd043fd42", "00f9376481f740b09c3e6c098e3ff6b1", "ead04f4f52194019baa5e505a9d51fc1", "6624e175abac440880b5100f8f600e98", "08a56a2be6c14bc981ff10492bcf588d", "974c3d28c4ec4d4d9cfaf996b9a959fc", "ff40e2e391eb498eaf494dc1f8abcd30", "9e6fa67f03754bf899a79c84e95392c4", "86d2c01d2809450797fff7604644526e", "84647967e28f4c65a8335e1cd0c28314", "474b069609ed49cb85a700a4e1feb086", "09cc5ffd3a434dc1b4c5867eeb456d91", "b1028237502446e988033c641910d291", "f2271aa46eb946d3a4dfa14d64ee1664", "783dd73fb58348d5aef0320977fd729d", "b4a98d2f5b334d8ba9cf8d2ab9331911", "0606ecf781ac48e0b9deae6b088d5a53", "da1e1feca7a341a0bec22d32699f8d13", "363e5597ea0349aca98587c0ae15212b", "ef5e9c4c58d34815b2511767934bc3a9", "c9a8d71e7bc64350a264c61814dbe322", "0a6f9bbbb779486abe511e2cc95a8f8a", "5776ea70ce49475aa315b1624c8e8912", "2088b078e2c14c59bf09b41910cbef3d", "ae5ea1d2c94046fbb2f464670aca936e", "3c2df5d4a6f2414e9fc7880d063efcf2", "a27139fd9e5d484dbdbeb90307df7f55", "66468571acd84f89b19f75d9c34d670d", "ac62575a35c64775afd46584a39ffd21", "dbe8ea0c7db449659c596ef0f69ae70c", "5a309346a9ad4221ac32f21f202f27a5", "9775c4bafc474925bb636e9616564440", "d06beca9ea154b24b44157bb91c897e0", "f42e5a9b72e341678324b9ea94cad583", "377a32d6b8064c5e9348ef9f04b1322f", "052ca0bd49724dcd805ef44866f188f3", "5396aefd27284c9e9147ed6a23611268", "80373607997842fa8780f30173f5179d", "2657667003c3496a8c9a3a20f894a06e", "1de3454108c24e6a9136c78cae470849", "a0b7338b724f4cbb92527737861069f8", "043be5aac4fc4bfa881f32907fc9e69f", "d592e6b0e5aa492db14235a438ea81af", "3c59a0a62cc64b9a9b009fc2022be9b7", "d7136f9766e1458fb44f9f190cb845de", "dd9d44834753481590c7ff9b25c0d66a", "65a917deb9ff4c8f8a2c0ce913eb3e50", "b9386aef740a4d90b559c4127c80a7d4", "f18f5b0450f9490ab8b81d1f5d301f49", "e541eb45bcab49b7a480b39ed4e21698", 
"1b913035d2e94a84b67ca3a2c61bd3b7", "404aeb1674ab437d8bc63ff42a74bb6e", "7dd511f925a14c0d931fecf493f0efaa", "9e10eea4f03e4653a8e3d208f4e3a0e0", "625fc85015f940539e6b77a15ad2e398", "991fa99ce0c246558cb4ab6d5513f68a", "2890e902adfa4285bcb0cb3d1ed6aaf1", "66b7496f6ec6479e9e72c619f09aca97", "fbe6c519e8db41fab545a3a4ea5f3673", "b9bbe55514b947dab7564396ad01d055", "b6ea9c4fe9d94d6188fe7bfdd6b6eaab", "14a7bfd63a984fabb650e3621e89ffe8" ] }, "id": "OmP1j7wYRBLI", "outputId": "0c20ad45-24e5-4e24-d52e-b6d224a770bd" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f1a4214f73cf42bdbc1f24118330ad0f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "config.json: 0%| | 0.00/838 [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "`torch_dtype` is deprecated! Use `dtype` instead!\n", "`torch_dtype` is deprecated! Use `dtype` instead!\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "426e7dd108b2448eb608d9660187148a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "model.safetensors.index.json: 0.00B [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9d0b1916a1564835ae61f8b76ca8a835", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading (incomplete total...): 0.00B [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "23134367becb4d2bb74c8e362e0c0c70", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Fetching 2 files: 0%| | 0/2 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9e6fa67f03754bf899a79c84e95392c4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading weights: 0%| | 0/288 [00:00, ?it/s]" ] }, "metadata": {}, 
"output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "363e5597ea0349aca98587c0ae15212b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "generation_config.json: 0%| | 0.00/187 [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dbe8ea0c7db449659c596ef0f69ae70c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0.00B [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a0b7338b724f4cbb92527737861069f8", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer.json: 0%| | 0.00/17.5M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "404aeb1674ab437d8bc63ff42a74bb6e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "special_tokens_map.json: 0%| | 0.00/636 [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Passing `generation_config` together with generation-related arguments=({'max_new_tokens'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.\n", "Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "## Hugging Face: Nền tảng cho trí tuệ nhân tạo\n", "\n", "Hugging Face là một nền tảng trực tuyến cho việc phát triển và chia sẻ các mô hình trí tuệ nhân tạo (AI). 
Nó cung cấp một kho repository phong phú với:\n", "\n", "* **Mô hình AI:** Từ các mô hình ngôn ngữ, phân loại hình ảnh, đến các thuật toán xử lý âm thanh, bạn có thể tìm kiếm và sử dụng các mô hình AI đã được tạo ra.\n", "* **Tạo dựng mô hình:** Sử dụng công cụ mạnh mẽ trong Hugging Face để xây dựng và fine-tuning các mô hình AI riêng của bạn.\n", "* **Chia sẻ và kết nối:** Chia sẻ mã nguồn và mô hình với cộng đồng, trao đổi kinh nghiệm và thu thập ý tưởng.\n", "\n", "\n", "Hugging Face giúp cho việc sử dụng AI trở nên dễ dàng hơn, mở ra cánh cửa cho các nhà phát triển và người dùng, từ dân số bình dân đến các doanh nghiệp lớn. \n", "\n" ] } ], "source": [ "# Pipeline: run basic inference\n", "from transformers import pipeline\n", "import torch\n", "\n", "# 1. Example model\n", "model_id = \"google/gemma-2-2b-it\"\n", "\n", "# 2. Create a text-generation pipeline for conversation.\n", "# The pipeline downloads the model automatically.\n", "# Note: recent transformers releases warn that `torch_dtype` is deprecated in favor of `dtype`.\n", "pipe = pipeline(\n", "    \"text-generation\",\n", "    model=model_id,\n", "    model_kwargs={\"torch_dtype\": torch.bfloat16},\n", "    device=\"cuda\",  # Run on the GPU for speed\n", ")\n", "\n", "# 3. Ask a question.\n", "# Prompt (Vietnamese): 'Hello, please briefly introduce Hugging Face in Vietnamese.'\n", "messages = [\n", "    {\"role\": \"user\", \"content\": \"Chào bạn, hãy giới thiệu ngắn gọn về Hugging Face bằng tiếng Việt.\"},\n", "]\n", "\n", "outputs = pipe(messages, max_new_tokens=256)\n", "\n", "# 4. Print the assistant's reply (the last message in the generated conversation)\n", "print(outputs[0][\"generated_text\"][-1][\"content\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "DV-4AFqRTMuy" }, "source": [ "# **STEP 2: Standard Training Pipelines for ML Models**" ] }, { "cell_type": "markdown", "metadata": { "id": "rRkKpotKksfd" }, "source": [ "# Research/Learn:\n", "A standard Machine Learning (ML) training workflow using Hugging Face (HF) involves loading data with `datasets`, preprocessing with `tokenizers`, fine-tuning models via `transformers.Trainer`, and evaluating performance. 
This iterative process adapts pre-trained weights to a specific task and uses GPU acceleration for efficiency.\n", "\n", "Steps in the HF Workflow:\n", "\n", "* Data Preparation: Load datasets using the HF Datasets library and preprocess data (e.g., tokenization) using HF Tokenizers to convert text into numerical input suitable for models.\n", "* Model Selection: Select a pre-trained model architecture from HF Transformers appropriate for the task (e.g., classification, generation).\n", "* Training Setup: Use the Trainer API to define training arguments such as learning rate, batch size, and evaluation strategy.\n", "* Training: Fine-tune the model on your dataset.\n", "* Evaluation: Assess the model's performance on a validation set, commonly tracking metrics like accuracy, F1-score, or perplexity.\n", "* Sharing: Push the final model to the Hugging Face Hub to share or deploy it.\n", "\n", "How ML differs from DL and LLMs: Deep Learning is a subfield of Machine Learning that uses neural networks with many layers, loosely inspired by how the brain processes data. A Large Language Model (LLM) is an application of Deep Learning that uses a neural network architecture called the Transformer to process text.\n", "\n", "Data preprocessing: the process of cleaning raw data and transforming it into a form that a model can use.\n", "\n", "Evaluation metrics are used to measure how well a machine learning model performs on a given task. Different tasks require different metrics. For example, classification tasks use accuracy or F1-score; regression tasks use MSE or MAE; seq2seq tasks use ROUGE or perplexity.\n", "\n", "The Trainer API in Hugging Face simplifies the training and evaluation of transformer models by providing a high-level interface." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Te9kCiKktGuQ" }, "outputs": [], "source": [ "# Example of the Trainer API.\n", "# Note: `model`, `train_dataset`, and `val_dataset` are placeholders here;\n", "# they must be defined (and the datasets tokenized) before this cell can run.\n", "from transformers import Trainer, TrainingArguments\n", "\n", "training_args = TrainingArguments(\n", "    output_dir=\"./results\",\n", "    num_train_epochs=3,\n", "    per_device_train_batch_size=8,\n", ")\n", "\n", "trainer = Trainer(\n", "    model=model,\n", "    args=training_args,\n", "    train_dataset=train_dataset,\n", "    eval_dataset=val_dataset,\n", ")\n", "\n", "trainer.train()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MsQqUIm4TC8E" }, "outputs": [], "source": [ "# Action 1: Load dataset\n", "from datasets import load_dataset\n", "\n", "# Load the Iris classification dataset from the Hub\n", "dataset = load_dataset(\"hitorilabs/iris\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RRtc1kblV_I3", "outputId": "4ac12275-cd80-43ff-f67a-9a8d4a0ecaae" }, "outputs": [ { "data": { "text/plain": [ "Dataset({\n", " features: ['petal_length', 'petal_width', 'sepal_length', 'sepal_width', 'species'],\n", " num_rows: 150\n", "})" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset['train']\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "m7L-R4iBWzdA" }, "outputs": [], "source": [ "train_ds = dataset['train']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wyiDNHcrVAGc", "outputId": "99f3bdb9-5b73-4d27-b76f-b012681e0bf2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " petal_length petal_width sepal_length sepal_width species\n", "0 0.067797 0.041667 0.222222 0.625000 0\n", "1 0.067797 0.041667 0.166667 0.416667 0\n", "2 0.050847 0.041667 0.111111 0.500000 0\n", "3 0.084746 0.041667 0.083333 0.458333 0\n", "4 0.067797 0.041667 0.194444 0.666667 0\n" ] } ], "source": [ 
"# Action 2: normalize features to [0, 1]\n", "from sklearn.preprocessing import MinMaxScaler\n", "import pandas as pd\n", "\n", "# Feature columns to scale (the `species` label is left untouched)\n", "feature_cols = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']\n", "\n", "def normalize_dataset(dataset):\n", "    # Convert HF Dataset -> pandas\n", "    df = dataset.to_pandas()\n", "\n", "    # Initialize the scaler\n", "    scaler = MinMaxScaler()\n", "\n", "    # Fit + transform\n", "    df[feature_cols] = scaler.fit_transform(df[feature_cols])\n", "\n", "    return df, scaler\n", "\n", "# Normalize the training split\n", "df, scaler = normalize_dataset(train_ds)\n", "\n", "print(df.head())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "z04-nWMBX4ES", "outputId": "97283e7b-0d71-4e47-b5cc-22f59235a737" }, "outputs": [ { "data": { "text/html": [ "
GaussianNB()
LogisticRegression(max_iter=200)