Upload 10 files
- README.md +58 -0
- document.txt +203 -0
- download_model.py +22 -0
- fine_tune/app.py +52 -0
- fine_tune/fine_tune_model.py +101 -0
- fine_tune/loss_plot.png +0 -0
- fine_tune/sample_data.txt +4 -0
- fine_tune/templates/index.html +68 -0
- requirements.txt +17 -0
- test_model.py +55 -0
README.md
ADDED
@@ -0,0 +1,58 @@

Tiny-GPT2 Text Generation Project

This repository provides resources to run and fine-tune the sshleifer/tiny-gpt2 model locally on a CPU, making it suitable for laptops with 8GB or 16GB of RAM. The goal is to help students learn how AI models work, experiment with them, and conduct research.

Prerequisites

- Python: Version 3.10.9 recommended (3.9.10 also works).
- Hardware: Minimum 8GB RAM, CPU-only (a GPU is optional but not required).
- Hugging Face Account: Required for downloading model weights (create one at huggingface.co).

Setup Instructions

1. Create a Virtual Environment:
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install Libraries:
   pip install torch==2.3.0 transformers==4.38.2 huggingface_hub==0.22.2 datasets==2.21.0 numpy==1.26.4 matplotlib==3.8.3 flask==3.0.3

3. Download Model Weights:
   - Copy download_model.py from the repository to your project folder.
   - Replace YOUR_HUGGINGFACE_API_TOKEN with your Hugging Face token (from huggingface.co/settings/tokens).
   - Run: python download_model.py

4. Test the Model:
   - Copy test_model.py to your project folder.
   - Run: python test_model.py
   - Expected output: generated text starting with "Once upon a time".

5. Fine-Tune the Model:
   - Navigate to the fine_tune folder.
   - Add your dataset as sample_data.txt (or use the provided example).
   - Run: python fine_tune_model.py
   - The fine-tuned model is saved to fine_tuned_model.

Notes for GPU Users

- The scripts are configured to run on CPU (CUDA_VISIBLE_DEVICES="" in fine_tune_model.py).
- To use a GPU (if available), remove os.environ["CUDA_VISIBLE_DEVICES"] = "" and use_cpu=True from fine_tune_model.py. Ensure your PyTorch installation supports CUDA (install the CUDA build with pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121).

Troubleshooting

- Memory Issues: On an 8GB RAM machine, make sure no other heavy applications are running.
- Library Conflicts: Use the exact versions listed above to avoid compatibility issues.
- File Not Found: Verify the model files are in tiny-gpt2-model/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be.
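The "File Not Found" check in the troubleshooting list can be automated. Below is a minimal stdlib-only sketch (the helper name `missing_model_files` is hypothetical, not part of the repository) that verifies the snapshot directory contains the files the scripts expect:

```python
import os

# Files the repository's scripts check for in the snapshot directory.
REQUIRED_FILES = ["config.json", "pytorch_model.bin", "vocab.json", "merges.txt"]

def missing_model_files(snapshot_dir):
    """Return the list of required files absent from snapshot_dir."""
    return [f for f in REQUIRED_FILES
            if not os.path.exists(os.path.join(snapshot_dir, f))]

if __name__ == "__main__":
    path = ("tiny-gpt2-model/models--sshleifer--tiny-gpt2/snapshots/"
            "5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be")
    missing = missing_model_files(path)
    print("OK" if not missing else f"Missing: {missing}")
```

Running this from the project root prints `OK` once the download step has completed successfully.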
document.txt
ADDED
@@ -0,0 +1,203 @@

Tiny-GPT2 Text Generation Project Documentation
=============================================

This project enables students to run, fine-tune, and experiment with the `sshleifer/tiny-gpt2` model locally on a laptop with 8GB or 16GB RAM, using CPU (GPU optional). The goal is to provide hands-on experience with AI model workflows, including downloading, fine-tuning, and deploying a text generation model via a web interface. This document covers all steps to set up and run the project, with credits to the original model and organization.

---

1. Project Overview
The project uses the `sshleifer/tiny-gpt2` model, a lightweight version of GPT-2, for text generation. It includes scripts to:
- Download model weights from Hugging Face.
- Test the model with a sample prompt.
- Fine-tune the model on a custom dataset.
- Deploy a web app to generate text interactively.
The setup is optimized for low-memory systems (8GB RAM) and defaults to CPU execution, but includes instructions for GPU users.

---

2. Prerequisites
- Hardware: Laptop with at least 8GB RAM (16GB recommended). A GPU (e.g., NVIDIA GTX) is optional; scripts default to CPU.
- Operating System: Windows, macOS, or Linux.
- Software:
  - Python 3.10.9 (recommended) or 3.9.10. Download from https://www.python.org/downloads/.
  - Visual Studio Code (VS Code) for development (optional but recommended). Download from https://code.visualstudio.com/.
- Hugging Face Account: Required to download model weights.

---

3. Step-by-Step Setup Instructions

3.1. Obtain a Hugging Face Token
1. Go to https://huggingface.co/ and sign up or log in.
2. Navigate to https://huggingface.co/settings/tokens.
3. Click "New token", select "Read" or "Write" access, and copy the token (e.g., hf_XXXXXXXXXXXXXXXXXXXXXXXXXX).
4. Store the token securely; you'll use it in the download script.

3.2. Install Python
1. Download Python 3.10.9 from https://www.python.org/downloads/release/python-3109/.
2. Install Python, ensuring "Add Python to PATH" is checked.
3. Verify the installation in a terminal:
   ```
   python --version
   ```
   Expected output: Python 3.10.9

3.3. Set Up a Virtual Environment
1. Open a terminal in your project folder (e.g., C:\Users\YourName\Documents\project).
2. Create a virtual environment:
   ```
   python -m venv venv
   ```
3. Activate the virtual environment:
   - Windows: `venv\Scripts\activate`
   - macOS/Linux: `source venv/bin/activate`
4. Confirm activation (you'll see `(venv)` in the terminal prompt).

3.4. Install Dependencies
1. In the activated virtual environment, create a file named `requirements.txt` with the following content:
   ```
   torch==2.3.0
   transformers==4.38.2
   huggingface_hub==0.22.2
   datasets==2.21.0
   numpy==1.26.4
   matplotlib==3.8.3
   flask==3.0.3
   ```
2. Install the libraries:
   ```
   pip install -r requirements.txt
   ```
3. For GPU users (optional):
   - Uninstall CPU PyTorch: `pip uninstall torch -y`
   - Install GPU PyTorch: `pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121`
   - Verify CUDA: `python -c "import torch; print(torch.cuda.is_available())"` (should print `True`).
   Note: Scripts default to CPU, so GPU users don't need to change this unless desired.

3.5. Download Model Weights
1. Create a folder named `dalle` (or any name) for the project.
2. Copy the `download_model.py` script from the repository (or create it):
   ```
   from transformers import AutoModelForCausalLM, AutoTokenizer
   from huggingface_hub import login
   import os

   hf_token = "YOUR_HUGGINGFACE_TOKEN"  # Replace with your token
   login(token=hf_token)

   model_name = "sshleifer/tiny-gpt2"
   save_directory = "./tiny-gpt2-model"
   os.makedirs(save_directory, exist_ok=True)

   model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=save_directory)
   tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=save_directory)
   print(f"Model and tokenizer downloaded to {save_directory}")
   ```
3. Replace `YOUR_HUGGINGFACE_TOKEN` with your Hugging Face token.
4. Run the script:
   ```
   python download_model.py
   ```
5. Verify the model files in `tiny-gpt2-model/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be` (contains `config.json`, `pytorch_model.bin`, `vocab.json`, `merges.txt`).

3.6. Test the Model
1. Copy the `test_model.py` script from the repository to the `dalle` folder.
2. Run the script:
   ```
   python test_model.py
   ```
3. Expected output: generated text starting with "Once upon a time" (it may be only semi-coherent due to the model's small size).

3.7. Fine-Tune the Model
1. Create a `fine_tune` folder inside `dalle`:
   ```
   mkdir fine_tune
   cd fine_tune
   ```
2. Create a dataset file `sample_data.txt` (or use your own text). Example content:
   ```
   Once upon a time, there was a brave knight who explored a magical forest.
   The forest was filled with mystical creatures and ancient ruins.
   The knight discovered a hidden treasure guarded by a wise dragon.
   With courage and wisdom, the knight befriended the dragon and shared the treasure with the village.
   ```
3. Copy the `fine_tune_model.py` script from the repository to `fine_tune`.
4. Run the script:
   ```
   python fine_tune_model.py
   ```
5. The script fine-tunes the model, saves it to `fine_tuned_model`, and generates a `loss_plot.png` showing training loss.
6. Verify that `fine_tuned_model` contains model files, and check `loss_plot.png`.

3.8. Run the Web App
1. In the `fine_tune` folder, copy `app.py` and create a `templates` folder with `index.html` from the repository.
2. Run the web app:
   ```
   python app.py
   ```
3. Open a browser and go to `http://127.0.0.1:5000`.
4. Enter a prompt (e.g., "Once upon a time") and click "Generate Text" to see the output from the fine-tuned model.

---

4. Notes for Students
- Model Limitations: `tiny-gpt2` is a very small model, so generated text may not be highly coherent. For better results, consider larger models like `gpt2` (requires more memory or a GPU).
- Memory Management: On 8GB RAM systems, close other applications to free memory. The scripts use a small batch size to minimize memory usage.
- GPU Support: Scripts default to CPU for compatibility. To use an NVIDIA GPU:
  - Install the CUDA build of PyTorch (see step 3.4).
  - Remove `os.environ["CUDA_VISIBLE_DEVICES"] = ""` from `fine_tune_model.py` and `app.py`.
  - Change `use_cpu=True` to `use_cpu=False` in `fine_tune_model.py`.
- Experimentation: Try different prompts, datasets, or fine-tuning parameters (e.g., `num_train_epochs`, `learning_rate`) to explore AI model behavior.

---

5. Troubleshooting
- Library Conflicts: Use the exact versions in `requirements.txt` to avoid issues.
- File Not Found: Ensure the model files are in the correct path (`tiny-gpt2-model/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be`).
- Memory Errors: Reduce `max_length` in `fine_tune_model.py` (e.g., from 128 to 64) on 8GB RAM systems.
- Hugging Face Token Issues: Verify your token has "Read" or "Write" access at https://huggingface.co/settings/tokens.

---

6. Credits and Attribution
- Original Model: `sshleifer/tiny-gpt2`, a tiny GPT-2-style model created by Sam Shleifer. Available at https://huggingface.co/sshleifer/tiny-gpt2.
- Organization: Hugging Face, Inc. (https://huggingface.co/) provides the model weights, the `transformers` library, and `huggingface_hub` for model access.
- Project Creator: Remiai3 (GitHub/Hugging Face username). This project was developed to facilitate AI learning and experimentation for students.
- AI Assistance: Grok 3, created by xAI (https://x.ai/), assisted in generating and debugging the code, ensuring compatibility with low-resource systems.

---

7. Next Steps for Students
- Experiment with different datasets in `sample_data.txt` to fine-tune the model for specific tasks (e.g., storytelling, dialogue).
- Modify `fine_tune_model.py` parameters (e.g., `learning_rate`, `num_train_epochs`) to study their impact.
- Enhance `index.html` or `app.py` to add features like multiple prompt inputs or generation options.
- Explore larger models on Hugging Face (e.g., `gpt2-medium`) if you have a GPU or more RAM.

For questions or issues, contact Remiai3 via Hugging Face or check the repository for updates.
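The generation settings used in steps 3.6 and 3.8 include `temperature=0.7`. The effect of temperature can be illustrated without the model itself: a minimal stdlib sketch (the logit values below are made up for illustration) showing how dividing logits by a temperature below 1 sharpens the sampling distribution.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))   # baseline distribution
print(softmax_with_temperature(logits, 0.7))   # sharper: more weight on the top token
```

With `temperature=0.7` the most likely token gets a larger share of the probability mass, which is why the scripts' output tends to be less random than with the default temperature of 1.0.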
download_model.py
ADDED
@@ -0,0 +1,22 @@

from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import login
import os

# Set your Hugging Face API token
hf_token = "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

# Log in to Hugging Face
login(token=hf_token)

# Define the model name and local directory to save the model
model_name = "sshleifer/tiny-gpt2"
save_directory = "./tiny-gpt2-model"

# Create the directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)

# Download the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=save_directory)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=save_directory)

print(f"Model and tokenizer downloaded successfully to {save_directory}")
fine_tune/app.py
ADDED
@@ -0,0 +1,52 @@

from flask import Flask, request, render_template
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = Flask(__name__)

# Ensure CPU execution
os.environ["CUDA_VISIBLE_DEVICES"] = ""
device = torch.device("cpu")

# Load fine-tuned model and tokenizer
model_path = "./fine_tuned_model"
tokenizer_path = "./fine_tuned_model"

try:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32, local_files_only=True)
    model.to(device)
    model.eval()
except Exception as e:
    print(f"Error loading model or tokenizer: {e}")
    exit(1)

# Set pad_token_id
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

@app.route("/", methods=["GET", "POST"])
def index():
    generated_text = ""
    if request.method == "POST":
        prompt = request.form.get("prompt", "")
        if prompt:
            inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
            outputs = model.generate(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_length=50,
                num_return_sequences=1,
                no_repeat_ngram_size=2,
                do_sample=True,
                top_k=50,
                top_p=0.95,
                temperature=0.7,
                pad_token_id=tokenizer.eos_token_id
            )
            generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return render_template("index.html", generated_text=generated_text)

if __name__ == "__main__":
    app.run(debug=True)
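The route above accepts a standard HTML form POST with a single `prompt` field. A minimal client-side sketch of what the browser sends (the `build_request` helper is hypothetical, and the URL assumes the Flask app is running locally on its default port):

```python
from urllib import request, parse

def build_request(prompt, url="http://127.0.0.1:5000/"):
    """Build the same form POST the index.html form submits."""
    data = parse.urlencode({"prompt": prompt}).encode("utf-8")
    return request.Request(url, data=data, method="POST")

req = build_request("Once upon a time")
print(req.data)  # b'prompt=Once+upon+a+time'
```

Sending the request with `urllib.request.urlopen(req)` while `app.py` is running returns the rendered HTML page containing the generated text.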
fine_tune/fine_tune_model.py
ADDED
@@ -0,0 +1,101 @@

import os
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
import torch
from datasets import load_dataset

# Ensure CPU execution (force CPU even if a GPU is available)
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # Disable GPU
device = torch.device("cpu")

# Define paths
model_path = "../tiny-gpt2-model/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be"
tokenizer_path = "../tiny-gpt2-model/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be"
dataset_path = "./sample_data.txt"
output_dir = "./fine_tuned_model"

# Verify paths
if not os.path.exists(model_path) or not os.path.exists(tokenizer_path):
    print("Error: Model or tokenizer directory not found")
    exit(1)
if not os.path.exists(dataset_path):
    print(f"Error: Dataset file not found at {dataset_path}")
    exit(1)

# Load tokenizer and model
try:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32, local_files_only=True)
    model.to(device)
except Exception as e:
    print(f"Error loading model or tokenizer: {e}")
    exit(1)

# Set pad_token_id
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Load and preprocess dataset
def preprocess_data(examples):
    encodings = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
    encodings["labels"] = encodings["input_ids"].copy()  # Labels mirror inputs for causal language modeling
    return encodings

dataset = load_dataset("text", data_files=dataset_path)
tokenized_dataset = dataset.map(preprocess_data, batched=True, remove_columns=["text"])

# Trainer subclass that records the loss logged at each step
class LossCallback(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.losses = []

    def log(self, logs):
        super().log(logs)
        if "loss" in logs:
            self.losses.append(logs["loss"])

# Define training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=1,  # Small batch size for low memory
    save_steps=500,
    save_total_limit=2,
    logging_steps=1,  # Log every step for a small dataset
    learning_rate=5e-5,
    use_cpu=True,
)

# Initialize the Trainer subclass
trainer = LossCallback(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
)

# Fine-tune the model
try:
    trainer.train()
    print("Fine-tuning completed successfully")
except Exception as e:
    print(f"Error during fine-tuning: {e}")
    exit(1)

# Save the fine-tuned model and tokenizer
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"Fine-tuned model and tokenizer saved to {output_dir}")

# Plot and save training loss
if trainer.losses:
    plt.plot(trainer.losses, label="Training Loss")
    plt.xlabel("Training Steps")
    plt.ylabel("Loss")
    plt.title("Training Loss Over Time")
    plt.legend()
    plt.savefig("loss_plot.png")
    plt.close()
    print("Loss plot saved as loss_plot.png")
else:
    print("No loss data available to plot")
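In `preprocess_data` above, the labels are simply a copy of `input_ids`. That works because for causal language modeling the model shifts the labels internally: the prediction at position i is scored against the token at position i+1. A stdlib-only sketch of that shift (the token ids here are made up for illustration):

```python
# Why labels can copy input_ids for causal LM fine-tuning: the model shifts
# them internally, so each position predicts the *next* token.
input_ids = [101, 7, 42, 9, 102]   # illustrative token ids
labels = input_ids.copy()          # what preprocess_data does

# The (prediction position, target token) pairs the loss actually sees
# after the internal shift:
pairs = list(zip(input_ids[:-1], labels[1:]))
print(pairs)  # [(101, 7), (7, 42), (42, 9), (9, 102)]
```

No manual shifting is needed in the preprocessing step; handing the model an identical `labels` tensor is the standard pattern.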
fine_tune/loss_plot.png
ADDED
fine_tune/sample_data.txt
ADDED
@@ -0,0 +1,4 @@

Once upon a time, there was a brave knight who explored a magical forest.
The forest was filled with mystical creatures and ancient ruins.
The knight discovered a hidden treasure guarded by a wise dragon.
With courage and wisdom, the knight befriended the dragon and shared the treasure with the village.
fine_tune/templates/index.html
ADDED
@@ -0,0 +1,68 @@

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Tiny-GPT2 Text Generation</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            max-width: 800px;
            margin: 0 auto;
            padding: 20px;
            background-color: #f4f4f9;
        }
        h1 {
            text-align: center;
            color: #333;
        }
        .container {
            background-color: #fff;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
        }
        textarea {
            width: 100%;
            height: 100px;
            margin-bottom: 10px;
            padding: 10px;
            border: 1px solid #ccc;
            border-radius: 4px;
        }
        button {
            padding: 10px 20px;
            background-color: #007bff;
            color: #fff;
            border: none;
            border-radius: 4px;
            cursor: pointer;
        }
        button:hover {
            background-color: #0056b3;
        }
        .output {
            margin-top: 20px;
            padding: 10px;
            border: 1px solid #ccc;
            border-radius: 4px;
            background-color: #f9f9f9;
        }
    </style>
</head>
<body>
    <div class="container">
        <h1>Tiny-GPT2 Text Generation</h1>
        <form method="POST">
            <textarea name="prompt" placeholder="Enter your prompt (e.g., Once upon a time)" required></textarea>
            <button type="submit">Generate Text</button>
        </form>
        {% if generated_text %}
        <div class="output">
            <h3>Generated Text:</h3>
            <p>{{ generated_text }}</p>
        </div>
        {% endif %}
    </div>
</body>
</html>
requirements.txt
ADDED
@@ -0,0 +1,17 @@

# Required libraries for running the tiny-gpt2 model (CPU and GPU compatible)
torch==2.3.0  # CPU version; for GPU, install the cu121 build (see notes below)
transformers==4.38.2
huggingface_hub==0.22.2
datasets==2.21.0
numpy==1.26.4
matplotlib==3.8.3
flask==3.0.3

# Notes:
# - For CPU-only systems (e.g., 16GB or 8GB RAM, no GPU), the above versions work directly.
# - For GPU-supported systems (e.g., NVIDIA GTX), install GPU-compatible PyTorch:
#   1. Uninstall torch: pip uninstall torch -y
#   2. Install the GPU version: pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121
#   3. Verify CUDA: python -c "import torch; print(torch.cuda.is_available())"
# - To force CPU execution on GPU systems, the scripts set os.environ["CUDA_VISIBLE_DEVICES"] = ""
# - Compatible with Python 3.10.9 or 3.9.10
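Since the troubleshooting advice is to install these exact versions, it can help to check them programmatically. A stdlib-only sketch (the `parse_pins` helper is hypothetical) that parses `name==version` pins, which could then be compared against the installed environment via `importlib.metadata.version`:

```python
def parse_pins(requirements_text):
    """Map package name -> pinned version, skipping comments and blank lines."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop inline comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins

pins = parse_pins(
    "torch==2.3.0  # CPU version\n"
    "transformers==4.38.2\n"
    "\n"
    "# Notes:\n"
    "flask==3.0.3\n"
)
print(pins)  # {'torch': '2.3.0', 'transformers': '4.38.2', 'flask': '3.0.3'}
```

This only handles the simple `==` pins used in this file, not the full requirements syntax (extras, markers, ranges).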
test_model.py
ADDED
@@ -0,0 +1,55 @@

import os
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Define the model and tokenizer paths
model_path = "./tiny-gpt2-model/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be"
tokenizer_path = "./tiny-gpt2-model/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be"

# Verify the directory contents
if not os.path.exists(model_path) or not os.path.exists(tokenizer_path):
    print(f"Error: Directory not found at {model_path}")
    exit(1)

required_files = ["config.json", "pytorch_model.bin", "vocab.json", "merges.txt"]
for file in required_files:
    if not os.path.exists(os.path.join(model_path, file)):
        print(f"Error: {file} not found in {model_path}")
        exit(1)

# Load the tokenizer and model
try:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32, local_files_only=True)
except Exception as e:
    print(f"Error loading model or tokenizer: {e}")
    exit(1)

# Set pad_token_id to eos_token_id if not already set
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Set model to evaluation mode
model.eval()

# Prepare input text
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to("cpu")

# Generate text
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=50,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id
)

# Decode and print the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Text:", generated_text)
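The generation call above combines `top_k=50` and `top_p=0.95`. A pure-Python sketch of what these filters do (a simplified illustration with toy probabilities, not the `transformers` implementation): keep at most `top_k` candidates, then keep the smallest high-probability prefix reaching `top_p`, and renormalize before sampling.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of those
    whose cumulative probability reaches top_p; renormalize the survivors."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {idx: p / total for idx, p in kept}

# Toy distribution over 4 tokens: the least likely token is pruned by top_p.
probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(probs, top_k=4, top_p=0.9))
```

Sampling then happens only over the surviving tokens, which is why the generated text avoids very unlikely continuations while still being non-deterministic.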