---
pipeline_tag: text-generation
tags:
- phi3
- LLM
library_name: transformers
---
# Phi 3 Model with Extended Vocabulary and Fine-Tuning for Japanese

## Overview

This project is a proof of concept that extends the base vocabulary of the Phi 3 model and then applies supervised fine-tuning to teach it a new language (Japanese). Despite using a very small custom dataset, the improvement in Japanese language understanding is substantial.

## Model Details

- **Base Model**: Phi 3
- **Objective**: Extend the base vocabulary and fine-tune for Japanese language understanding.
- **Dataset**: Custom dataset of 1,000 entries generated using ChatGPT-4.
- **Language**: Japanese

## Dataset

The dataset used for this project was generated with the assistance of ChatGPT-4. It comprises 1,000 entries, carefully curated to cover a diverse range of topics and linguistic structures.
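The card does not specify the dataset's on-disk layout. A common choice for instruction-style fine-tuning data is JSON Lines, one object per entry; the sketch below assumes hypothetical `instruction`/`output` field names, which are illustrative rather than taken from this project:

```python
import json

# Hypothetical entries in the style described: Japanese instruction/response pairs.
entries = [
    {"instruction": "日本の首都はどこですか？", "output": "日本の首都は東京です。"},
    {"instruction": "「ありがとう」を英語に訳してください。", "output": "英語では \"Thank you\" と言います。"},
]

# Write and re-read as JSONL, a typical format for fine-tuning corpora.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

with open("dataset.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```

`ensure_ascii=False` keeps the Japanese text human-readable in the file instead of escaping it to `\uXXXX` sequences.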

## Training

### Vocabulary Extension

The base vocabulary of the Phi 3 model was extended to include new Japanese tokens. This was a crucial step to enable the model to comprehend and generate Japanese text more effectively.
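In the Transformers library this step typically amounts to `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. The toy sketch below mimics the embedding-side effect in plain Python; initializing new rows to the mean of the existing embeddings is one common heuristic, not necessarily what this project used:

```python
# Toy stand-in for a token embedding table: one row (vector) per vocab entry.
vocab = {"hello": 0, "world": 1}
embeddings = [[0.1, 0.2], [0.3, 0.4]]

def extend_vocab(vocab, embeddings, new_tokens):
    """Add unseen tokens and grow the embedding table to match.

    New rows are initialized to the mean of the existing rows, a common
    heuristic for newly added tokens.
    """
    mean_row = [sum(col) / len(embeddings) for col in zip(*embeddings)]
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
            embeddings.append(list(mean_row))
    return vocab, embeddings

vocab, embeddings = extend_vocab(vocab, embeddings, ["こんにちは", "世界"])
print(len(vocab), len(embeddings))  # 4 4
```

After resizing, the new rows are trained during fine-tuning like any other embeddings; until then the model treats every new token roughly like an "average" token.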

### Fine-Tuning

Supervised fine-tuning was then performed on the extended model using the custom dataset described above.

## Results

Even with the limited dataset and vocabulary size, the fine-tuned model demonstrated substantial improvements over the base model in terms of Japanese language understanding and generation.

## Future Work

1. **Dataset Expansion**: Increase the size and diversity of the dataset to further enhance model performance.
2. **Evaluation**: Conduct comprehensive evaluation and benchmarking against standard Japanese language tasks.
3. **Optimization**: Optimize the model for better performance and efficiency.