---
pipeline_tag: text-generation
tags:
- phi3
- LLM
- onnx
language:
- ja
library_name: transformers
---
# Phi 3 Model with Extended Vocabulary and Fine-Tuning for Japanese

## Overview

This project is a proof of concept that extends the base vocabulary of the Phi 3 model and then applies supervised fine-tuning to teach it a new language (Japanese). Despite using a very small custom dataset, the improvement in Japanese language understanding is substantial.

## Model Details

- **Base Model**: Phi 3
- **Objective**: Extend the base vocabulary and fine-tune for Japanese language understanding.
- **Dataset**: Custom dataset of 1,000 entries generated using ChatGPT-4.
- **Language**: Japanese

## Dataset

The dataset used for this project was generated with the assistance of ChatGPT-4. It comprises 1,000 entries, carefully curated to cover a diverse range of topics and linguistic structures.
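A supervised fine-tuning corpus of this size is typically stored as JSON Lines, one instruction/response pair per line. The sketch below shows what one entry and a simple validation pass might look like; the field names (`instruction`, `output`) and the example text are illustrative assumptions, not the actual schema of this dataset.

```python
import json

# Hypothetical entry for the corpus; field names and text are assumptions
# for illustration, not the project's actual schema.
entry = {
    "instruction": "日本の首都はどこですか。",   # "What is the capital of Japan?"
    "output": "日本の首都は東京です。",           # "The capital of Japan is Tokyo."
}

def validate_entry(raw_line: str) -> dict:
    """Parse one JSONL line and require non-empty instruction/output fields."""
    record = json.loads(raw_line)
    for field in ("instruction", "output"):
        if not str(record.get(field, "")).strip():
            raise ValueError(f"missing or empty field: {field}")
    return record

line = json.dumps(entry, ensure_ascii=False)
record = validate_entry(line)
```

A check like this is cheap insurance on a small hand-curated corpus, where a single empty field can silently skew training.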

## Training

### Vocabulary Extension

The base vocabulary of the Phi 3 model was extended to include new Japanese tokens. This was a crucial step to enable the model to comprehend and generate Japanese text more effectively.
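With Hugging Face `transformers`, vocabulary extension usually comes down to selecting candidate tokens not already in the tokenizer, then calling `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))`. A minimal sketch of the selection step, with illustrative token lists (the real calls are shown only as comments, since they require the model weights):

```python
def select_new_tokens(existing_vocab, candidates):
    """Return candidates absent from the vocabulary, deduplicated, order preserved."""
    seen = set(existing_vocab)
    new_tokens = []
    for tok in candidates:
        if tok not in seen:
            seen.add(tok)
            new_tokens.append(tok)
    return new_tokens

# Illustrative data: a tiny slice of an existing vocabulary plus Japanese candidates.
existing_vocab = {"hello", "world", "日"}
candidates = ["日本", "東京", "日", "日本"]  # "日" is already known, "日本" repeats

new_tokens = select_new_tokens(existing_vocab, candidates)

# In the real pipeline (assuming transformers), the follow-up would be roughly:
#   tokenizer.add_tokens(new_tokens)
#   model.resize_token_embeddings(len(tokenizer))
# The new embedding rows start untrained, which is why fine-tuning afterwards matters.
```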

### Fine-Tuning

Supervised fine-tuning was performed on the extended model using the custom dataset. Despite the small dataset size, the model showed significant improvement in understanding and generating Japanese text.
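For supervised fine-tuning, each instruction/response pair is serialized into a single training string. The Phi 3 instruct releases use special markers such as `<|user|>`, `<|assistant|>`, and `<|end|>`; whether this project used exactly that template is an assumption here, sketched below:

```python
def build_training_text(instruction: str, response: str) -> str:
    """Serialize one supervised pair into a Phi-3-style chat string.

    The <|user|>/<|assistant|>/<|end|> markers follow the Phi 3 instruct chat
    template; using it for this project's data is an assumption for illustration.
    """
    return (
        f"<|user|>\n{instruction}<|end|>\n"
        f"<|assistant|>\n{response}<|end|>"
    )

text = build_training_text("自己紹介をしてください。", "こんにちは、私はAIアシスタントです。")
```

In practice the loss is usually computed only on the assistant span, so the model learns to produce responses rather than to echo prompts.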

## Results

Even with the limited dataset and vocabulary size, the fine-tuned model demonstrated substantial improvements over the base model in terms of Japanese language understanding and generation.

## Future Work

1. **Dataset Expansion**: Increase the size and diversity of the dataset to further enhance model performance.
2. **Evaluation**: Conduct comprehensive evaluation and benchmarking against standard Japanese language tasks.
3. **Optimization**: Optimize the model for better performance and efficiency.